# Lab: Data-Centric vs Model-Centric approaches

This lab introduces data-centric and model-centric approaches to machine learning problems, and shows how a data-centric approach can outperform a purely model-centric one.

In this lab, we'll build a classifier for product reviews (restricted to the magazine category), like:

> Excellent! I look forward to every issue. I had no idea just how much I didn't know.  The letters from the subscribers are educational, too.

Label: ⭐️⭐️⭐️⭐️⭐️ (good)

> My son waited and waited, it took the 6 weeks to get delivered that they said it would but when it got here he was so dissapointed, it only took him a few minutes to read it.

Label: ⭐️ (bad)

We'll work with a dataset that has some issues, and we'll see how we can squeeze only so much performance out of the model by being clever about model choice, searching for better hyperparameters, etc. Then, we'll take a look at the data (as any good data scientist should), develop an understanding of the issues, and use simple approaches to improve the data. Finally, we'll see how improving the data can improve results.

## Installing software

For this lab, you'll need to install [scikit-learn](https://scikit-learn.org/) and [pandas](https://pandas.pydata.org/). If you don't have them installed already, you can install them by running the following cell:

In [4]:
!pip install -q scikit-learn pandas

# Loading the data

First, let's load the train/test sets and take a look at the data.

In [2]:
import pandas as pd

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

In [3]:
train = pd.read_csv('reviews_train.csv')
test = pd.read_csv('reviews_test.csv')

test.sample(5)

Unnamed: 0,review,label
880,Way too many ads and most articles read like t...,bad
54,"Boyfriend loves these, will renew",good
558,So sad that a historical magazine has declined,bad
583,"Purchased this item, but it wouldn't download.",bad
817,"Although the magazine is great, it takes Amazo...",bad


# Training a baseline model

There are many ways to train a classifier on text data. In this lab, we give you code that mirrors what you'd find by looking up [how to train a text classifier](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html): a linear SVM, trained with stochastic gradient descent, on [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) features (numeric representations of each text field based on word occurrences).
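To make the tf-idf idea concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It uses the smoothed idf formula and L2 normalisation that scikit-learn documents as the `TfidfTransformer` defaults, but it splits on whitespace, which is a simplification of what `CountVectorizer` does. The function name `tfidf_vectors` and the toy documents are illustrative and not part of the lab.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised tf-idf vectors) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    df = Counter(tok for toks in tokenised for tok in set(toks))

    # Smoothed inverse document frequency: rare words get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors


# Made-up documents, for illustration only.
toy_docs = ["great magazine", "great ads", "too many ads ads"]
vocab, vectors = tfidf_vectors(toy_docs)

# In the first document, 'magazine' outweighs 'great' because it
# appears in fewer documents overall.
for word, weight in zip(vocab, vectors[0]):
    print(f"{word:10s} {weight:.3f}")
```

Each review becomes one such vector, with one entry per vocabulary word, and the classifier is trained on those vectors.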

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

In [10]:
sgd_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

In [14]:
_ = sgd_clf.fit(train['review'], train['label'])

## Evaluating model accuracy

In [15]:
from sklearn import metrics

In [16]:
def evaluate(clf):
    """Print the accuracy of a fitted classifier on the held-out test set."""
    pred = clf.predict(test['review'])
    acc = metrics.accuracy_score(test['label'], pred)
    print(f'Accuracy: {100*acc:.1f}%')

In [17]:
evaluate(sgd_clf)

Accuracy: 76.3%
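Accuracy alone does not show which class the errors fall in. Here is a minimal sketch of a per-class breakdown, using hypothetical `y_true` and `y_pred` lists in place of `test['label']` and `clf.predict(test['review'])`:

```python
from sklearn import metrics

# Hypothetical labels and predictions, for illustration only.
y_true = ['good', 'good', 'bad', 'bad', 'bad', 'good']
y_pred = ['good', 'bad', 'bad', 'bad', 'good', 'good']

labels = ['bad', 'good']

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall and F1 for each class.
print(metrics.classification_report(y_true, y_pred, labels=labels))
```

In the lab itself, you could pass the real test labels and predictions to the same two functions.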


## Trying another model

76% accuracy is not great for this binary classification problem. Can you do better with a different model, or by tuning hyperparameters for the SVM trained with SGD?

# Exercise 1

Can you train a more accurate model on the dataset (without changing the dataset)? You might find this [scikit-learn classifier comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) handy, as well as the [documentation for supervised learning in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html).

One idea for a model you could try is a [naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).

You could also try experimenting with different values of the model hyperparameters, perhaps tuning them via a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 

Or you can even try training multiple different models and [ensembling their predictions](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier), a strategy often used to win prediction competitions like Kaggle.

**Advanced:** If you want to be more ambitious, you could try a more powerful model, such as a Transformer neural network. In that case, you'll want to fine-tune a pre-trained model rather than train one from scratch. This [guide from Hugging Face](https://huggingface.co/docs/transformers/training) may be helpful.

In [19]:
vectorizer1 = CountVectorizer()
vectorizer1.fit(train['review'])
voc = vectorizer1.vocabulary_
len(voc)

6666

In [20]:
from itertools import islice

vectorizer = CountVectorizer()
vectorizer.fit(train['review'])
vocabs = vectorizer.vocabulary_
voc_head = dict(islice(vocabs.items(), 5))
voc_head

{'based': 666, 'on': 4131, 'all': 346, 'the': 5921, 'negative': 3944}

In [21]:
train_counts = vectorizer.transform(train['review'])
train_counts.shape

(6666, 6666)

In [None]:
train_counts.toarray()[0].shape

(6666,)

In [None]:
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(train_counts)
X_tfidf = tfidf_transformer.transform(train_counts)
X_tfidf.toarray()[:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
# Create a pipeline for text preprocessing
text_preprocessing = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

In [None]:
# YOUR CODE HERE
from sklearn.naive_bayes import MultinomialNB 
nb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
# evaluate your model and see if it does better
# than the ones we provided

In [None]:
train.shape

(6666, 2)

In [None]:
nb_clf.fit(train['review'], train['label'])
evaluate(nb_clf)

In [None]:
len(nb_clf['vect'].get_feature_names_out())

6666

In [None]:
nb_clf['vect'].get_feature_names_out()[500:550]

array(['arrived', 'arrives', 'arriving', 'arrogant', 'art', 'arthritis',
       'artical', 'articales', 'articals', 'articiles', 'article',
       'articled', 'articles', 'artisan', 'artist', 'artists', 'artlcles',
       'arts', 'artwork', 'as', 'asap', 'ashamed', 'asia', 'asimonv',
       'ask', 'asked', 'asking', 'asks', 'asleep', 'asomea', 'aspect',
       'aspects', 'aspires', 'aspirin', 'ass', 'asserttext', 'assessable',
       'asset', 'assign', 'assist', 'assistance', 'assistant',
       'associated', 'assortment', 'assume', 'assumed', 'assumptive',
       'assurance', 'assured', 'astronomy'], dtype=object)

In [None]:
len(nb_clf['tfidf'].get_feature_names_out())

6666

In [None]:
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams only, or unigrams and bigrams
    'vect__stop_words': [None, 'english'],  # keep all words, or drop English stop words
    'vect__max_df': [0.5, 0.75, 1.0],  # ignore words in more than this fraction of documents
    'vect__min_df': [1, 5],  # ignore words in fewer than this many documents
    'tfidf__use_idf': [True, False],  # enable or disable inverse document frequency
    'tfidf__norm': ['l1', 'l2'],  # l1 or l2 normalization
    'clf__alpha': [0.1, 1],  # smoothing parameter
}

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(nb_clf, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(train['review'], train['label'])

Fitting 5 folds for each of 192 candidates, totalling 960 fits


In [None]:
print('Best hyperparameters: ', grid_search.best_params_)
print('Best score: ', grid_search.best_score_)

Best hyperparameters:  {'clf__alpha': 0.1, 'tfidf__norm': 'l2', 'tfidf__use_idf': False, 'vect__max_df': 0.5, 'vect__min_df': 1, 'vect__ngram_range': (1, 2), 'vect__stop_words': 'english'}
Best score:  0.655715653051194


In [None]:
best_params = grid_search.best_params_
nb_clf.set_params(**best_params)

In [None]:
nb_clf.fit(train['review'], train['label'])
evaluate(nb_clf)

## Ensembles test

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = MultinomialNB()

eclf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('mnb', clf3)],
    voting='hard')

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X_tfidf, train['label'], scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))


Accuracy: 0.63 (+/- 0.02) [Logistic Regression]
Accuracy: 0.66 (+/- 0.01) [Random Forest]
Accuracy: 0.63 (+/- 0.01) [naive Bayes]
Accuracy: 0.64 (+/- 0.02) [Ensemble]
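One caveat about the comparison above: `X_tfidf` came from a vectorizer and tf-idf transformer fitted on the whole training set, so every validation fold has already influenced the vocabulary and idf weights. The effect is probably small here, but a leak-free variant cross-validates a full pipeline so that the text features are refitted inside each fold. The sketch below assumes a tiny made-up corpus in place of `train['review']` and `train['label']`. It uses `TfidfVectorizer`, which scikit-learn documents as equivalent to `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up reviews and labels, for illustration only.
reviews = [
    "love this magazine",
    "great articles every month",
    "best magazine ever",
    "wonderful read, will renew",
    "too many ads",
    "never received my order",
    "not worth the cost",
    "very disappointed, will not renew",
]
labels = ['good'] * 4 + ['bad'] * 4

# The vectorizer is part of the estimator, so it is fitted on the
# training portion of each fold only.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=1))
scores = cross_val_score(pipe, reviews, labels, scoring='accuracy', cv=2)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
```

The same pattern applies to the ensemble: wrap the `VotingClassifier` in a pipeline and pass the raw review text to `cross_val_score`.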


In [None]:

# Create the ensemble classifier with individual classifiers
eclf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('mnb', clf3)],
    voting='hard'
)

# Create a pipeline with nested text preprocessing and the VotingClassifier
voting_clf = Pipeline([
    ('text_preprocessing', text_preprocessing),
    ('eclf', eclf)
])

In [None]:
voting_clf.fit(train['review'], train['label'])
evaluate(voting_clf)

Accuracy: 84.1%


## Taking a closer look at the training data

Let's actually take a look at some of the training data:

In [None]:
train.head(10)

Unnamed: 0,review,label
0,"Based on all the negative comments about Taste of Home, I will not subscribeto the magazine. In the past it was a great read.\nSorry it, too, has gone the 'way of the wind'.<br>o-p28pass4 </br>",good
1,I still have not received this. Obviously I can't review something I haven't seen. Where is my order???????,bad
2,</tr>The magazine is not worth the cost of subscription.</tr>,good
3,This magazine is basically ads. Kindve worthless to me really.,bad
4,"The only thing I've recieved, so far, is the bill.\nI know this is [ or was when I last subscribed ] a ten month magazine - but so far not even the usual promotional papers.......",bad
5,"The magazines are great, but I never received the golf shoe bag that was supposed to accompany my subscription...<body>",good
6,This is one magazine I really love. It has primitive country decor and the articales are really great. I get so many ideas from this magazine.,good
7,Did not. Open.,bad
8,Forever the best magazine! Love it!!,good
9,Very disappointed. It's nothing more than an advertisement for international resorts.....I will not renew.,bad


Zooming in on one particular data point:

In [None]:
print(train.iloc[0].to_dict())

{'review': "Based on all the negative comments about Taste of Home, I will not subscribeto the magazine. In the past it was a great read.\nSorry it, too, has gone the 'way of the wind'.<br>o-p28pass4 </br>", 'label': 'good'}


This data point is labeled "good", but it's clearly a negative review. Also, it looks like there's some funny HTML stuff at the end.

# Exercise 2

Take a look at some more examples in the dataset. Do you notice any patterns with bad data points?

In [None]:
# YOUR CODE HERE
train.head()

Unnamed: 0,review,label
0,"Based on all the negative comments about Taste of Home, I will not subscribeto the magazine. In the past it was a great read.\nSorry it, too, has gone the 'way of the wind'.<br>o-p28pass4 </br>",good
1,I still have not received this. Obviously I can't review something I haven't seen. Where is my order???????,bad
2,</tr>The magazine is not worth the cost of subscription.</tr>,good
3,This magazine is basically ads. Kindve worthless to me really.,bad
4,"The only thing I've recieved, so far, is the bill.\nI know this is [ or was when I last subscribed ] a ten month magazine - but so far not even the usual promotional papers.......",bad


## Issues in the data

It looks like there are some funny HTML tags in our dataset, and the data points that contain them have nonsense labels. Maybe this dataset was collected by scraping the internet, and the HTML wasn't parsed correctly in all cases.
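One way to move from eyeballing a few rows to a rough measurement is to count how many reviews contain tag-like text and which tag names show up. The sketch below runs on four reviews copied from the `train.head(10)` output above. The regular expressions are an assumption about what the stray markup looks like, not a full HTML parser.

```python
import pandas as pd

# Four reviews copied from the training-set preview above.
toy = pd.DataFrame({'review': [
    "</tr>The magazine is not worth the cost of subscription.</tr>",
    "This magazine is basically ads. Kindve worthless to me really.",
    "The magazines are great, but I never received the golf shoe bag "
    "that was supposed to accompany my subscription...<body>",
    "Forever the best magazine! Love it!!",
]})

# Fraction of reviews containing something that looks like the start of a tag.
has_markup = toy['review'].str.contains(r'</?[A-Za-z]', regex=True)
share = has_markup.mean()
print(f"Share of reviews with markup: {share:.0%}")

# Count the tag names that appear, ignoring case.
tag_counts = (
    toy['review']
    .str.findall(r'</?([A-Za-z]+)')
    .explode()
    .dropna()
    .str.lower()
    .value_counts()
)
print(tag_counts)
```

Running the same two steps on the full `train` DataFrame would show how widespread the problem is.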

# Exercise 3

To address this, a simple approach we might try is to throw out the bad data points, and train our model on only the "clean" data.

Come up with a simple heuristic to identify data points containing HTML, and filter out the bad data points to create a cleaned training set.

In [22]:
import re

def is_bad_data(review: str) -> bool:
    # YOUR CODE HERE
    # Flag any review containing an angle bracket, a crude sign of leftover HTML.
    return re.search(r'[<>]', review) is not None
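Matching any angle bracket is deliberately crude: it would also discard a legitimate review that happens to contain `<` or `>`. A slightly stricter alternative, sketched below, only flags text shaped like an opening or closing tag. The name `looks_like_html` and the pattern are illustrative assumptions, and whether the extra precision matters on this dataset is untested.

```python
import re

# '<' or '</', then a tag name starting with a letter, then '>' or '/>'.
TAG_PATTERN = re.compile(r'</?[A-Za-z][A-Za-z0-9]*\s*/?>')


def looks_like_html(review: str) -> bool:
    """Return True if the review contains something shaped like an HTML tag."""
    return TAG_PATTERN.search(review) is not None


# A review from the training set with stray tags is flagged.
print(looks_like_html("</tr>The magazine is not worth the cost of subscription.</tr>"))

# A made-up review with bare angle brackets but no tags is not flagged.
print(looks_like_html("I rate it <3 stars, price > value"))
```

It can be swapped in for `is_bad_data` when building the cleaned training set, to compare how many rows each heuristic removes.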

## Creating the cleaned training set

In [23]:
train_clean = train[~train['review'].map(is_bad_data)]
train_clean.shape

(3998, 2)

## Evaluating a model trained on the clean training set

In [24]:
from sklearn import clone

In [25]:
sgd_clf_clean = clone(sgd_clf)

In [26]:
_ = sgd_clf_clean.fit(train_clean['review'], train_clean['label'])

This model should do significantly better:

In [None]:
evaluate(sgd_clf_clean)

Accuracy: 97.0%


In [7]:
train.sample(10)

Unnamed: 0,review,label
5885,All they do is bash on politics or policies th...,bad
2807,</li>Not worth the huge price tag.</TABLE>,good
3835,</div>I love it! Just how I remember it from ...,bad
3700,<TITLE>Thought this was a technology magazine....,good
2831,"I enjoyed Details Magazine, similar to GQ but ...",good
4339,<tr>I needed Vanity Fair for Kindle with the a...,good
5727,Since only my paper magazine was renewed when ...,good
5607,I love Time Magazine the most informative n tr...,good
3254,Great magazine! Looks amazing on my Kindle!</...,bad
3349,My 8 year old granddaughter takes riding lesso...,good


In [6]:
train.head()

Unnamed: 0,review,label
0,Based on all the negative comments about Taste...,good
1,I still have not received this. Obviously I c...,bad
2,</tr>The magazine is not worth the cost of sub...,good
3,This magazine is basically ads. Kindve worthle...,bad
4,"The only thing I've recieved, so far, is the b...",bad
