# Lab: Data-Centric vs Model-Centric approaches

This lab gives an introduction to data-centric vs model-centric approaches to machine learning problems, showing how data-centric approaches can outperform purely model-centric approaches.

In this lab, we'll build a classifier for product reviews (restricted to the magazine category), like:

> Excellent! I look forward to every issue. I had no idea just how much I didn't know.  The letters from the subscribers are educational, too.

Label: ⭐️⭐️⭐️⭐️⭐️ (good)

> My son waited and waited, it took the 6 weeks to get delivered that they said it would but when it got here he was so dissapointed, it only took him a few minutes to read it.

Label: ⭐️ (bad)

We'll work with a dataset that has some issues, and we'll see how we can squeeze only so much performance out of the model by being clever about model choice, searching for better hyperparameters, etc. Then, we'll take a look at the data (as any good data scientist should), develop an understanding of the issues, and use simple approaches to improve the data. Finally, we'll see how improving the data can improve results.

## Installing software

For this lab, you'll need to install [scikit-learn](https://scikit-learn.org/) and [pandas](https://pandas.pydata.org/). If you don't have them installed already, you can install them by running the following cell:

In [1]:
import os
!pip install scikit-learn pandas



# Loading the data

First, let's load the train/test sets and take a look at the data.

In [2]:
import pandas as pd

In [3]:
train = pd.read_csv('reviews_train.csv')
test = pd.read_csv('reviews_test.csv')

test.sample(5)

Unnamed: 0,review,label
401,"Harper's is a must for anyone into fashion, or...",good
428,Love it. Always good.,good
479,this also has been a favorite magazine of mine...,good
474,new yorker magazine is the best!,good
993,Just not the Bazaar that it was years ago. Wil...,bad


# Training a baseline model

There are many approaches for training a sequence classification model for text data. In this lab, we're giving you code that mirrors what you find if you look up [how to train a text classifier](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), where we'll train a linear SVM (scikit-learn's `SGDClassifier`, whose default hinge loss corresponds to a linear SVM fit with stochastic gradient descent) on [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) features (numeric representations of each review based on word occurrences).

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

In [5]:
sgd_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

In [6]:
_ = sgd_clf.fit(train['review'], train['label'])

## Evaluating model accuracy

In [7]:
from sklearn import metrics

In [21]:
def evaluate_func(clf):
    pred = clf.predict(test['review'])
    acc = metrics.accuracy_score(test['label'], pred)
    print(f'Accuracy: {100*acc:.1f}%')

In [22]:
evaluate_func(sgd_clf)

Accuracy: 76.7%
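
Accuracy is a single number, so it can hide which class the model gets wrong. As a rough sketch, a confusion matrix breaks the errors down by class. The labels and predictions below are hypothetical stand-ins for `test['label']` and `clf.predict(test['review'])`, not results from the lab.

```python
from sklearn import metrics

# Hypothetical true labels and predictions, made up for illustration.
y_true = ["good", "good", "bad", "bad", "good", "bad"]
y_pred = ["good", "bad", "bad", "good", "good", "bad"]

acc = metrics.accuracy_score(y_true, y_pred)

# Rows are true labels and columns are predicted labels, in the order given by `labels`.
cm = metrics.confusion_matrix(y_true, y_pred, labels=["bad", "good"])

print(f"Accuracy: {100*acc:.1f}%")
print(cm)
```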


## Trying another model

76.7% accuracy is not great for this binary classification problem. Can you do better with a different model, or by tuning the hyperparameters of the SVM trained with SGD?

# Exercise 1

Can you train a more accurate model on the dataset (without changing the dataset)? You might find this [scikit-learn classifier comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) handy, as well as the [documentation for supervised learning in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html).

One idea for a model you could try is a [naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
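
As a minimal sketch of that idea, the pipeline below keeps the same feature steps as the baseline and swaps the final step for `MultinomialNB`. The toy reviews and the name `nb_clf` are made up so the snippet runs on its own; in the lab you would fit on `train['review']` and `train['label']` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same feature extraction as the baseline, with a naive Bayes classifier at the end.
nb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Toy stand-in data, invented for illustration.
toy_reviews = [
    "love this magazine",
    "great articles every issue",
    "excellent read, love it",
    "never arrived, terrible service",
    "waste of money",
    "terrible, mostly ads",
]
toy_labels = ["good", "good", "good", "bad", "bad", "bad"]

nb_clf.fit(toy_reviews, toy_labels)
print(nb_clf.predict(["love it, great articles"]))
```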

You could also try experimenting with different values of the model hyperparameters, perhaps tuning them via a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 
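
A grid search over the baseline pipeline might look like the sketch below. The parameter values are arbitrary illustrations rather than tuned recommendations, and the toy data is made up so the snippet is self-contained. Note that `GridSearchCV` picks hyperparameters by cross-validation on the training data, so the test set stays untouched until the final evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
# These candidate values are illustrative assumptions, not recommendations.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-4, 1e-3],
}

# Toy stand-in data; in the lab, use train['review'] and train['label'].
toy_reviews = [
    "love this magazine",
    "great articles every issue",
    "excellent read, love it",
    "best magazine ever",
    "never arrived, terrible service",
    "waste of money",
    "terrible, mostly ads",
    "worst magazine, cancel it",
]
toy_labels = ["good"] * 4 + ["bad"] * 4

# cv=2 only because the toy set is tiny; a larger value is typical on real data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(toy_reviews, toy_labels)

print(search.best_params_)
```

After fitting, `search` behaves like the best pipeline refit on all the training data, so it can be passed to `evaluate_func` like any other classifier.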

Or you can even try training multiple different models and [ensembling their predictions](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier), a strategy often used to win prediction competitions like Kaggle.
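
One way to sketch an ensemble is scikit-learn's `VotingClassifier` with hard (majority) voting over a few linear models that share the same tf-idf features. The choice of the three models and the toy data are assumptions made for illustration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
ensemble_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('vote', VotingClassifier(
        estimators=[
            ('sgd', SGDClassifier(random_state=0)),
            ('nb', MultinomialNB()),
            ('lr', LogisticRegression()),
        ],
        voting='hard',  # each model casts one vote; the majority label wins
    )),
])

# Toy stand-in data; in the lab, use train['review'] and train['label'].
toy_reviews = [
    "love this magazine",
    "great articles every issue",
    "excellent read, love it",
    "never arrived, terrible service",
    "waste of money",
    "terrible, mostly ads",
]
toy_labels = ["good", "good", "good", "bad", "bad", "bad"]

ensemble_clf.fit(toy_reviews, toy_labels)
pred = ensemble_clf.predict(["love it, great articles"])
print(pred)
```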

**Advanced:** If you want to be more ambitious, you could try an even fancier model, like training a Transformer neural network. If you go with that, you'll want to fine-tune a pre-trained model. This [guide from HuggingFace](https://huggingface.co/docs/transformers/training) may be helpful.

In [10]:
!pip install transformers datasets evaluate



In [None]:
import os

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_scheduler
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm.auto import tqdm
# YOUR CODE HERE
# Load the same CSV files that were read with pandas earlier in the lab.
dataset = load_dataset("csv", data_files={"train": "reviews_train.csv", "test": "reviews_test.csv"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize_function, batched=True, remove_columns="review")

label_mapping = {'bad': 0, 'good': 1}

# Map the labels to numerical values
dataset = dataset.map(lambda example: {'label': label_mapping[example['label']]})

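# rename_column returns a new object rather than modifying in place, so reassign it.
# The model's forward pass expects the target under the key "labels".
dataset = dataset.rename_column("label", "labels")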
dataset.set_format("torch")

train_loader = DataLoader(dataset["train"], shuffle=True, batch_size=16, num_workers=os.cpu_count())
test_loader = DataLoader(dataset["test"], batch_size=16, num_workers=os.cpu_count())

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

optimizer = AdamW(model.parameters(), lr=3e-5)

num_epochs = 3
total_steps = num_epochs * len(train_loader)
scheduler = get_scheduler("linear", optimizer, num_warmup_steps=0, num_training_steps=total_steps)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

progress_bar = tqdm(range(total_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

# evaluate your model and see if it does better
# than the ones we provided


In [None]:
import evaluate  # already installed by the pip cell above

metric = evaluate.load("accuracy")
model.eval()
with torch.inference_mode():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        logits = outputs.logits

        preds = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=preds, references=batch["labels"])

results = metric.compute()

results

## Taking a closer look at the training data

Let's actually take a look at some of the training data:

In [13]:
train.head()

Unnamed: 0,review,label
0,Based on all the negative comments about Taste...,good
1,I still have not received this. Obviously I c...,bad
2,</tr>The magazine is not worth the cost of sub...,good
3,This magazine is basically ads. Kindve worthle...,bad
4,"The only thing I've recieved, so far, is the b...",bad


Zooming in on one particular data point:

In [14]:
print(train.iloc[0].to_dict())

{'review': "Based on all the negative comments about Taste of Home, I will not subscribeto the magazine. In the past it was a great read.\nSorry it, too, has gone the 'way of the wind'.<br>o-p28pass4 </br>", 'label': 'good'}


This data point is labeled "good", but it's clearly a negative review. Also, it looks like there's some funny HTML stuff at the end.

# Exercise 2

Take a look at some more examples in the dataset. Do you notice any patterns with bad data points?

In [15]:
# YOUR CODE HERE
train.sample(20)

Unnamed: 0,review,label
3559,this truly is the best car magazine i have eve...,good
5736,No longer meets my satisfaction.\nWish to canc...,good
1265,"Guess when you rate an article low, you try to...",bad
1452,</dt>I canceled the subscription because I no ...,good
2522,A must for the runner! Whether your an amateur...,good
1295,does not work on windows or windows tablet. W...,good
5802,"<dt class=""hdlist1"">Set</dt>I give this as a g...",bad
3414,"Excellent Service, Quality & Price!</div>",bad
153,"</tr>It is for younger people, I am 70.",good
5177,</p><dl>THIS IS AN EXCELLENT MAGAZINE!! IT GIV...,bad


## Issues in the data

It looks like there are some stray HTML tags in our dataset, and the data points that contain them have nonsense labels. Maybe this dataset was collected by scraping the internet, and the HTML wasn't quite parsed correctly in all cases.

# Exercise 3

To address this, a simple approach we might try is to throw out the bad data points, and train our model on only the "clean" data.

Come up with a simple heuristic to identify data points containing HTML, and filter out the bad data points to create a cleaned training set.

In [16]:
def is_bad_data(review: str) -> bool:
    # YOUR CODE HERE
    return '<' in review
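
Checking for a bare `<` is a perfectly reasonable heuristic, though it would also flag a review that uses the character for other reasons (for example "<3"). As a slightly stricter sketch, a regular expression can look for things shaped like HTML tags. The names `HTML_TAG_RE` and `looks_like_html` are made up for this example.

```python
import re

# Matches tag-like patterns such as <br>, </tr>, or <dt class="hdlist1">:
# "<", an optional "/", a letter, then anything up to the closing ">".
HTML_TAG_RE = re.compile(r'</?[a-zA-Z][^<>]*>')

def looks_like_html(review: str) -> bool:
    """Return True if the review contains something shaped like an HTML tag."""
    return HTML_TAG_RE.search(review) is not None

print(looks_like_html("</tr>It is for younger people, I am 70."))
print(looks_like_html("I <3 this magazine"))
```

Whether the stricter check changes anything depends on the data, so it is worth comparing how many rows each heuristic flags before choosing one.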

## Creating the cleaned training set

In [17]:
train_clean = train[~train['review'].map(is_bad_data)]
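
Before retraining, it can be worth checking how much data the filter removes and whether both classes are still represented. The sketch below does this on a tiny made-up DataFrame (`toy_train` is a stand-in for `train`) using pandas' vectorized `str.contains`, which here is equivalent to mapping a "contains `<`" check over each review.

```python
import pandas as pd

# Toy stand-in for the training set, invented for illustration.
toy_train = pd.DataFrame({
    'review': [
        'Love it. Always good.',
        '</tr>It is for younger people, I am 70.',
        'This magazine is basically ads.',
        'Excellent Service, Quality & Price!</div>',
    ],
    'label': ['good', 'good', 'bad', 'bad'],
})

# regex=False treats '<' as a plain substring rather than a pattern.
has_html = toy_train['review'].str.contains('<', regex=False)
toy_clean = toy_train[~has_html]

print(f'Dropped {has_html.sum()} of {len(toy_train)} rows')
print(toy_clean['label'].value_counts())
```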

## Evaluating a model trained on the clean training set

In [18]:
from sklearn import clone

In [19]:
sgd_clf_clean = clone(sgd_clf)

In [20]:
_ = sgd_clf_clean.fit(train_clean['review'], train_clean['label'])

This model should do significantly better:

In [23]:
evaluate_func(sgd_clf_clean)

Accuracy: 97.1%
