# Text classification

## *"Words. I know words. I have the best words!"*
*- Noam Chomsky*

# Overview

In order to train a machine learning model to classify text, we need:
1. a way to preprocess text
2. a label for each text, represented as a number
3. a way to represent each text as a vector input
4. a model to learn a function $f(input) = label$
5. a way to evaluate how well the model works
6. a way to predict new data

As an example, we will use movie review data and try to classify each review as $positive$ or $negative$, based only on its text.

The same method can be used for any other data, including more labels and other dependent variables (e.g., age or gender of the text author, social constructs expressed in the text, etc...). 

# 1. Data

In [None]:
import pandas as pd

data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/sa_train.csv', quoting=0)
print(len(data), data['output'].unique())
data.head(2)

1800 ['neg' 'pos']


Unnamed: 0,input,output
0,shakespeare in love is quite possibly the most...,neg
1,wizards is an animated feature that begins wit...,neg


## Preprocessing

Text is messy. The goal of preprocessing is to reduce the amount of noise (= unnecessary variation), while maintaining the signal. There is no one-size-fits-all solution, but a good approximation is the following:

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']) # spaCy 3 no longer accepts the 'en' shortcut

In [None]:
def clean_text(text):
    '''reduce text to lower-case lexicon entry'''
    lemmas = [token.lemma_ for token in nlp(text) 
              if token.pos_ in {'NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'}]
    return ' '.join(lemmas)

clean_text('This is a test sentence. And here comes another one... Go me!')

'test sentence here come one go'

Let's clean up the input data. This can take a while, so it's good to save it.

In [None]:
data['clean_text'] = data['input'].apply(clean_text)
data['clean_text'].head()

0    shakespeare love quite possibly most enjoyable...
1    wizard animate feature begin narration epic pr...
2    gun wielding arnold schwarzenegger change hear...
3    keep jane austen sense sensibility pride preju...
4    hollywood pimp fat cigar smoking chump wear fu...
Name: clean_text, dtype: object
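
Since the spaCy pass is slow, one option is to write the cleaned frame to disk and reload it in later sessions. The sketch below uses a toy frame and a temporary directory; the file name `sa_train_clean.csv` is a made-up example, not a file from the course material.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned reviews (the real frame has 1800 rows).
toy = pd.DataFrame({
    'input': ['this is a test sentence .', 'go me !'],
    'output': ['neg', 'pos'],
    'clean_text': ['test sentence', 'go'],
})

# Hypothetical location: a temporary directory and an invented file name.
path = os.path.join(tempfile.mkdtemp(), 'sa_train_clean.csv')
toy.to_csv(path, index=False)

# A later session can reload the result instead of re-running clean_text.
reloaded = pd.read_csv(path)
print(reloaded['clean_text'].tolist())
```

Passing `index=False` keeps the row index out of the file, so no extra `Unnamed: 0` column shows up when the file is read back.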

In [None]:
data.head()

Unnamed: 0,input,output,clean_text
0,shakespeare in love is quite possibly the most...,neg,shakespeare love quite possibly most enjoyable...
1,wizards is an animated feature that begins wit...,neg,wizard animate feature begin narration epic pr...
2,gun wielding arnold schwarzenegger has a chang...,neg,gun wielding arnold schwarzenegger change hear...
3,"if this keeps up , jane austen ( sense and sen...",pos,keep jane austen sense sensibility pride preju...
4,"hollywood is a pimp . a fat , cigar - smoking ...",pos,hollywood pimp fat cigar smoking chump wear fu...


# 2. Labels

Here, we assume that we already have the labels. (In your task, you will have to label them yourself! Hint: use `input()` or a spreadsheet).

However, in order for the machine learning model to work with the labels, we need to translate them into a vector of numbers. We can use `sklearn.preprocessing.LabelEncoder`.

In [None]:
from sklearn.preprocessing import LabelEncoder

# transform labels into numbers
labels2numbers = LabelEncoder()

y = labels2numbers.fit_transform(data['output'])
print(data['output'][:10], y[:10], len(y))

# Note: the encoder is now fitted. Reuse this same encoder for new data;
# an encoder fitted separately on other data could assign the numbers
# differently (e.g., 0 -> 1 and 1 -> 0).

0    neg
1    neg
2    neg
3    pos
4    pos
5    neg
6    pos
7    pos
8    neg
9    neg
Name: output, dtype: object [0 0 0 1 1 0 1 1 0 0] 1800


To get the original names back, use `inverse_transform()`:

In [None]:
labels2numbers.inverse_transform([1,1,1,0,0,1])

array(['pos', 'pos', 'pos', 'neg', 'neg', 'pos'], dtype=object)
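
To see which number belongs to which label, and what happens with labels the encoder has never seen, a small sketch on toy labels may help. The label `neutral` is an invented example that is not part of the training labels.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['neg', 'pos', 'pos', 'neg'])

# classes_ holds the labels in sorted order; the position is the number.
print(list(encoder.classes_))

# Reusing the fitted encoder keeps the mapping stable on new data.
new_numbers = encoder.transform(['pos', 'neg'])
print(list(new_numbers))

# A label that was not present during fit is rejected, not guessed.
try:
    encoder.transform(['neutral'])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)
```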

# 3. Representing text

First, we need to transform the texts into a matrix, where each row represents one text instance. The columns are the **features**: here, the words and word pairs that occur in the texts.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2), # unigrams and bigrams; (1,3) would also add trigrams
                             min_df=0.001, 
                             max_df=0.75, 
                             stop_words='english')

X = vectorizer.fit_transform(data['clean_text'])
print(X.shape)
# 66808 features is quite a lot; we could cap the vocabulary with the
# max_features parameter of TfidfVectorizer, e.g. 5000 or 1000

(1800, 66808)
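
The effect of `ngram_range` and `max_features` is easier to see on a corpus small enough to count by hand. The three toy documents below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['good film good plot', 'bad film', 'bad plot bad act']

# Unigrams and bigrams: 5 distinct words plus 7 distinct word pairs.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_X = toy_vectorizer.fit_transform(toy_docs)
print(toy_X.shape)

# max_features keeps only the most frequent terms in the corpus.
capped_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=4)
capped_X = capped_vectorizer.fit_transform(toy_docs)
print(capped_X.shape)
```

Each row is one document and each column one n-gram, so the column count grows quickly once bigrams are included.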


We can now translate back and forth between columns and words:

In [None]:
vectorizer.vocabulary_['bad']

3786

In [None]:
vectorizer.get_feature_names_out()[3786] # get_feature_names() was removed in scikit-learn 1.2

'bad'

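Let's see how many texts contain that word. Note that `str.contains` matches substrings, so texts containing words such as *badly* are counted as well: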

In [None]:
len(data[data.clean_text.str.contains('bad')])

895

In [None]:
data[data.clean_text.str.contains('bad')] # search rows where a given substring appears

Unnamed: 0,input,output,clean_text
0,shakespeare in love is quite possibly the most...,neg,shakespeare love quite possibly most enjoyable...
1,wizards is an animated feature that begins wit...,neg,wizard animate feature begin narration epic pr...
4,"hollywood is a pimp . a fat , cigar - smoking ...",pos,hollywood pimp fat cigar smoking chump wear fu...
6,films adapted from comic books have had plenty...,pos,film adapt comic book have plenty success supe...
8,to watch ` battlefield earth ' is to wallow in...,neg,watch battlefield earth wallow misery most lud...
...,...,...,...
1794,"ladies and gentlemen , 1997 ' s independence d...",pos,lady gentleman s independence day here title s...
1795,terrence malick made an excellent 90 minute fi...,neg,terrence malick make excellent minute film ada...
1796,"as you should know , this summer has been less...",neg,should know summer less memorable total decent...
1798,a movie about divorce and custody in 1995 seem...,neg,movie divorce custody seem about as timely mov...


# 4. Learning a classification model

A classification model is simply a function that takes a text representation as input, and returns an output label.

Inside that function is normally a set of weights. Multiplying the weight vector with the input vector gives a score, which the model then turns into a label.
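
As a rough sketch of that idea, here is how a logistic-regression-style model could turn weights and an input vector into a label. The weights, the bias, and the three-word vocabulary are made up for illustration; they are not the values learned in this notebook.

```python
import numpy as np

def predict_label(weights, bias, x):
    """Return (label, probability of class 1) for one input vector."""
    score = np.dot(weights, x) + bias           # weighted sum of the features
    probability = 1.0 / (1.0 + np.exp(-score))  # squash the score into (0, 1)
    return int(probability >= 0.5), probability

# Invented weights for the toy vocabulary ['bad', 'great', 'film'].
weights = np.array([-3.5, 2.0, 0.0])
bias = 0.0

label_a, prob_a = predict_label(weights, bias, np.array([0.6, 0.0, 0.8]))
label_b, prob_b = predict_label(weights, bias, np.array([0.0, 0.7, 0.7]))
print(label_a, label_b)
```

A text dominated by the negatively weighted word gets a negative score and label 0; a text dominated by the positively weighted word gets label 1.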

## 4.1: Fitting a model

Fitting a model is the process of finding the right weights to map the training inputs to the training outputs. Fitting to data in `sklearn` is easy: we use the `fit()` function, giving it the input matrix and output vector.

In [None]:
from sklearn.linear_model import LogisticRegression

# classifier is the model
classifier = LogisticRegression(n_jobs=-1, class_weight='balanced') # n_jobs: parallelized among CPU cores
%time classifier.fit(X, y)
print(classifier)

CPU times: user 36.1 ms, sys: 34.1 ms, total: 70.2 ms
Wall time: 968 ms
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


The resulting fitted model has one coefficient (beta) for each word/feature in our vocabulary.

In [None]:
coefs = classifier.coef_
coefs

array([[ 0.00986428, -0.06329059, -0.03779013, ...,  0.06591883,
         0.02520406, -0.00039513]])

In [None]:
X.shape[1] == len(coefs[0, :])

True

We can now examine the weights/coefficients/betas for the individual words (note that each word has an ID):

In [None]:
k = vectorizer.vocabulary_['bad'] # column position for the word
print(vectorizer.get_feature_names_out()[k], classifier.coef_[0, k])

bad -3.493747454624923


NB: in a two-class problem, our coefficients are in a single vector: positive values indicate the positive class, negative values the other class.
In a multi-class problem, we have one **row** of coefficients for each class: positive values indicate that this feature contributes to the class, negative values indicate that it contributes to other classes.
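
The difference shows up in the shape of `coef_`. A minimal sketch on random toy data (the data and labels are invented, so the fitted values themselves are meaningless):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.rand(60, 4)              # 60 instances, 4 features

y_two = np.array([0, 1] * 30)        # two classes
y_three = np.array([0, 1, 2] * 20)   # three classes

two_class = LogisticRegression().fit(X_toy, y_two)
three_class = LogisticRegression().fit(X_toy, y_three)

print(two_class.coef_.shape)    # one row for the two-class problem
print(three_class.coef_.shape)  # one row per class
```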

# 5. Evaluating models

Having a model is great, but how well does it do? Can it classify what it has seen? We need a way to estimate how well the model will work on new data.

We need a metric to measure performance and a way to simulate new data.

## 5.1: Metrics

We use three measures:
1. precision
2. recall
3. F1

### Precision

Precision measures how many of our model's positive predictions were correct. We divide the number of true positives by the number of all predicted positives (true positives plus false positives).

$$
p = \frac{tp}{tp+fp}
$$

### Recall

Recall measures how many of the correct answers in the data our model managed to find. We divide the number of true positives by the number of true positives (the instances our model got) and false negatives (the instances our model *should* have gotten)

$$
r = \frac{tp}{tp+fn}
$$

### F1

A model that classified everything as, say, "positive" would get a perfect recall (it does, after all, find all positive examples). However, such a model would obviously be useless, since its precision is bad.

We want to balance the two against each other. F1 does exactly that, by taking the harmonic mean.

$$
F_1 = 2 \cdot \frac{p\cdot r}{p+r}
$$

Luckily, all of these metrics are implemented in `sklearn`. All we have to provide are the predictions of our model, and the actual correct answers (called the *gold standard*). 

In [None]:
from sklearn.metrics import classification_report
# accuracy is also commonly reported, but it is better to show all of these metrics;
# F1 is the most informative single number, because it takes both
# precision and recall into account
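
To connect the formulas to the library, the sketch below computes precision, recall, and F1 by hand on ten invented gold labels and predictions, then compares the result with `sklearn`. F1 is computed as the harmonic mean, $2pr/(p+r)$.

```python
from sklearn.metrics import precision_recall_fscore_support

# Invented example: 4 positive and 6 negative gold labels.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# The same numbers from sklearn, for the positive class only.
sk_p, sk_r, sk_f, _ = precision_recall_fscore_support(gold, pred, average='binary')
print(precision, recall, round(f1, 4))
```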

## 5.2: Cross-validation

How do we measure performance on new data, if we don't know what the correct outputs for those new data points are?

In **$k$-fold cross-validation**, we simulate new data by fitting our model on one part of the data and evaluating it on the rest. We can thereby measure the performance on the held-out part.

However, we have now reduced the amount of data we use to fit the model. In order to address this, we simply repeat the process $k$ times.
We separate the data into $k$ parts, fit the model on $k-1$ parts, and evaluate on the $k$th part. In the end, we have performance scores from $k$ models. The average of them tells us how well the model would work on new data.
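
Written out by hand, the procedure could look like the sketch below. It uses random toy data and assumes stratified folds without shuffling, which is what `cross_val_score` applies to a classifier when `cv` is an integer, so the two sets of scores should agree.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Invented data: the label depends on the first feature plus some noise.
rng = np.random.RandomState(0)
X_toy = rng.randn(100, 5)
y_toy = (X_toy[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)

folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_toy, y_toy):
    model = LogisticRegression()
    model.fit(X_toy[train_idx], y_toy[train_idx])  # fit on k-1 parts
    preds = model.predict(X_toy[test_idx])         # evaluate on the k-th part
    manual_scores.append(f1_score(y_toy[test_idx], preds, average='micro'))

auto_scores = cross_val_score(LogisticRegression(), X_toy, y_toy,
                              cv=5, scoring='f1_micro')
print(np.mean(manual_scores), auto_scores.mean())
```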



In [None]:
from sklearn.model_selection import cross_val_score

for k in [2,3,5,10]:
    cv = cross_val_score(LogisticRegression(), X, y=y, cv=k, n_jobs=-1, scoring="f1_micro")
    fold_size = X.shape[0]/k
    
    print("F1 with {} folds for bag-of-words is {}".format(k, cv.mean()))
    print("Training on {} instances/fold, testing on {}".format(fold_size*(k-1), fold_size))
    print()

F1 with 2 folds for bag-of-words is 0.8083333333333333
Training on 900.0 instances/fold, testing on 900.0

F1 with 3 folds for bag-of-words is 0.8172222222222222
Training on 1200.0 instances/fold, testing on 600.0

F1 with 5 folds for bag-of-words is 0.828888888888889
Training on 1440.0 instances/fold, testing on 360.0

F1 with 10 folds for bag-of-words is 0.8305555555555555
Training on 1620.0 instances/fold, testing on 180.0



In [None]:
cv # the per-fold scores from the last run of the loop (k=10)

array([0.84444444, 0.82777778, 0.78333333, 0.8       , 0.80555556,
       0.85      , 0.81666667, 0.85      , 0.86111111, 0.86666667])

## Baselines
So, is that performance good? Let's compare to a **baseline**, i.e., a null-hypothesis. The simplest one is that all instances belong to the most frequent class in the data.

In [None]:
from sklearn.dummy import DummyClassifier

most_frequent = DummyClassifier(strategy='most_frequent')

print(cross_val_score(most_frequent, X, y=y, cv=5, n_jobs=-1, scoring="f1_micro").mean())
# the cross-validated logistic regression scores clearly higher than this baseline

0.5061111111111111


# Exercise

See whether you can apply the previous steps to a new data set, a collection of wine descriptions. Choose any of the descriptor columns as the target variable. The text is already preprocessed to save time.

In [None]:
from sklearn.model_selection import train_test_split
wine = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/wine_reviews_small.xlsx')
wine.head() 

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,country,description,designation,points,price,province,region_1,region_2,variety,winery,description_cleaned
0,0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,tremendous varietal wine hail be age year oak ...
1,1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,ripe aroma fig blackberry cassis be soften swe...
2,2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,honor memory wine once make his mother tremend...
3,3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,spend month new french oak incorporate fruit v...
4,4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,be top wine name high point vineyard foot have...


In [None]:
prov_freq = wine['province'].value_counts() # pandas Series: frequency of each province name
prov_name = prov_freq[prov_freq >= 10].index # provinces that occur at least 10 times
wine = wine[wine.province.isin(prov_name)] # filter rows
wine.head(2)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,country,description,designation,points,price,province,region_1,region_2,variety,winery,description_cleaned
0,0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,tremendous varietal wine hail be age year oak ...
1,1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,ripe aroma fig blackberry cassis be soften swe...


In [None]:
wine.description_cleaned

0        tremendous varietal wine hail be age year oak ...
1        ripe aroma fig blackberry cassis be soften swe...
2        honor memory wine once make his mother tremend...
3        spend month new french oak incorporate fruit v...
4        be top wine name high point vineyard foot have...
                               ...                        
19995    dark spicy tone clove cure meat teriyaki sauce...
19996    starts clean integration elegant aroma ripe ch...
19997    precise penetrate mix black cherry fruit cut t...
19998    fine expression be light style therefore more ...
19999    wine hide its alcohol well layer plush cassis ...
Name: description_cleaned, Length: 19470, dtype: object

In [None]:
wine_des_train, wine_des_test, wine_country_train, wine_country_test = train_test_split(wine.description_cleaned, wine.country, test_size = 0.3, random_state = 42)

In [None]:
type(wine_des_train)

pandas.core.series.Series

In [None]:
# your code here
vectorizer_wine = TfidfVectorizer(ngram_range=(1,2), 
                             min_df=0.001, 
                             max_df=0.75, 
                             stop_words='english')

labels2numbers_wine = LabelEncoder()

X_wine = vectorizer_wine.fit_transform(wine_des_train.astype(str))
y_wine = labels2numbers_wine.fit_transform(wine_country_train)

In [None]:
vectorizer_wine.vocabulary_['aroma'] # a word from the description_cleaned column

178

In [None]:
vectorizer_wine.get_feature_names_out()[197]

'aroma fresh'

In [None]:
classifier_wine = LogisticRegression(n_jobs=-1, class_weight='balanced')
classifier_wine.fit(X_wine, y_wine)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
for k in [2,3,5,10]:
    cv = cross_val_score(LogisticRegression(), X_wine, y=y_wine, cv=k, n_jobs=-1, scoring="f1_micro")
    fold_size = X_wine.shape[0]/k
    
    print("F1 with {} folds for bag-of-words is {}".format(k, cv.mean()))
    print("Training on {} instances/fold, testing on {}".format(fold_size*(k-1), fold_size))
    print()

    # Note: a warning appears when some classes have very few instances, because
    # such a class can end up only in the training part or only in the test part
    # of a fold, depending on the k used for cross-validation.
    # Ignoring these rare classes avoids the problem; this is what the filtering
    # step applied to the wine data above is for.

F1 with 2 folds for bag-of-words is 0.7514123419027892
Training on 6814.5 instances/fold, testing on 6814.5

F1 with 3 folds for bag-of-words is 0.7629319832709663
Training on 9086.0 instances/fold, testing on 4543.0

F1 with 5 folds for bag-of-words is 0.7687282640155619
Training on 10903.2 instances/fold, testing on 2725.8





F1 with 10 folds for bag-of-words is 0.7744516555107018
Training on 12266.1 instances/fold, testing on 1362.9



In [None]:
most_frequent = DummyClassifier(strategy='most_frequent')
print(cross_val_score(most_frequent, X_wine, y=y_wine, cv=10, n_jobs=-1, scoring="f1_micro").mean())

# The F1 scores are actually good even if they look low: this is a
# multi-class problem with many classes, and the classifier scores
# clearly higher than this most-frequent-class baseline.

0.4240956988934532




# 6. Held-out data

Classifying new (**held-out**) data is called **prediction**. We reuse the weights we have learned before on a new data matrix to predict the new outcomes.
Important: the new data needs to have the same features, in the same column order, as the training data!
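
In practice this means calling `transform()` on the vectorizer that was fitted on the training texts, and never fitting a second one. A small sketch on invented toy texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['good film', 'bad film', 'bad plot']
new_docs = ['good plot twist', 'terrible']

vec = TfidfVectorizer()
train_X = vec.fit_transform(train_docs)  # learns the vocabulary
new_X_toy = vec.transform(new_docs)      # reuses it, learns nothing new

print(train_X.shape, new_X_toy.shape)

# Words that never occurred in training ('twist', 'terrible') are dropped,
# so a text made only of unseen words becomes an all-zero row.
unseen_row_total = new_X_toy.toarray()[1].sum()
print(unseen_row_total)
```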

In [None]:
# read in new data set
new_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/sa_test.csv')
print(len(new_data))
new_data.head()

200


Unnamed: 0,input,output
0,robert redford ' s a river runs through it is ...,pos
1,if the 70 ' s nostalgia didn ' t make you feel...,neg
2,you think that these people only exist in the ...,neg
3,""" knock off "" is exactly that : a cheap knock ...",neg
4,brian depalma needs a hit * really * badly . s...,pos


Don't forget to clean it!

In [None]:
%time new_data['clean_text'] = new_data.input.apply(clean_text)

CPU times: user 18.2 s, sys: 298 ms, total: 18.5 s
Wall time: 18.5 s


Let's see how well we do on this data:

In [None]:
# transform text into TF-IDF feature vectors
# IMPORTANT: use same vectorizer we fit on training data to create vectors!
new_X = vectorizer.transform(new_data['clean_text'])

# translate labels
new_y = labels2numbers.transform(new_data['output'])


# use the old classifier to predict and evaluate
new_predictions = classifier.predict(new_X)
print(new_predictions)

[1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1
 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 1
 1 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0
 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 0 0
 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1]


In [None]:
new_X.shape

(200, 66808)

In [None]:
print(classification_report(new_y, new_predictions))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83       111
           1       0.80      0.79      0.79        89

    accuracy                           0.81       200
   macro avg       0.81      0.81      0.81       200
weighted avg       0.81      0.81      0.81       200



Alternatively, we can predict the probability of each instance belonging to each class:

In [None]:
new_probabilities = classifier.predict_proba(new_X)
print(new_probabilities)

# predict_proba is useful for spotting instances where the classes are
# almost evenly balanced; it is also a way to check how confident
# the model is in the class it predicts

[[0.30745711 0.69254289]
 [0.55629129 0.44370871]
 [0.5346982  0.4653018 ]
 [0.74681023 0.25318977]
 [0.36336666 0.63663334]
 [0.63447098 0.36552902]
 [0.52763496 0.47236504]
 [0.53041422 0.46958578]
 [0.62989898 0.37010102]
 [0.52398668 0.47601332]
 [0.39571583 0.60428417]
 [0.27701891 0.72298109]
 [0.36529535 0.63470465]
 [0.6180369  0.3819631 ]
 [0.37767803 0.62232197]
 [0.34803811 0.65196189]
 [0.30380553 0.69619447]
 [0.3705448  0.6294552 ]
 [0.50506119 0.49493881]
 [0.46744111 0.53255889]
 [0.71648016 0.28351984]
 [0.48074602 0.51925398]
 [0.34768242 0.65231758]
 [0.40778423 0.59221577]
 [0.51954813 0.48045187]
 [0.58126258 0.41873742]
 [0.69281727 0.30718273]
 [0.67484025 0.32515975]
 [0.46197409 0.53802591]
 [0.5516414  0.4483586 ]
 [0.29534742 0.70465258]
 [0.64829551 0.35170449]
 [0.71811139 0.28188861]
 [0.81374881 0.18625119]
 [0.66188463 0.33811537]
 [0.73474834 0.26525166]
 [0.30462163 0.69537837]
 [0.58988807 0.41011193]
 [0.59001609 0.40998391]
 [0.64077973 0.35922027]


For each instance (=row), we get a probability distribution over the classes (=columns)
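
The sketch below, on random toy data, illustrates how these probabilities relate to the predicted labels and how they could be used to find the instances the model is least sure about. The data and the cut-off of five instances are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X_toy = rng.randn(80, 3)
y_toy = (X_toy[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)
probs = model.predict_proba(X_toy)

# The predicted label is the class (column) with the highest probability.
from_probs = model.classes_[probs.argmax(axis=1)]

# The highest probability per row can serve as a confidence score.
confidence = probs.max(axis=1)
least_confident = confidence.argsort()[:5]  # indices of the 5 shakiest rows
print(least_confident, confidence[least_confident])
```

Instances with a confidence close to 0.5 are good candidates for manual inspection.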

## 6.1: Regularization

Typically, performance is lower on unseen data, because our model **overfit** the training data: it expects the new data to look *exactly* the same as the training data. That is almost never true.

In order to prevent the model from overfitting, we need to **regularize** it. Essentially, we make it harder to learn the training data.

A simple example of regularization is to "corrupt" the training data by adding a little bit of noise to each training instance. Since the noise is irregular, it becomes harder for the model to learn any patterns.

In [None]:
from scipy.sparse import random

num_instances, num_features = X.shape

for i in range(5):
    X_regularized = X + random(num_instances, num_features, density=0.01)
    # the noise is random, so these scores vary from run to run; the usual
    # approach is a regularization parameter instead (`C` in the model)

    print(cross_val_score(LogisticRegression(), X_regularized, y=y, cv=10, n_jobs=-1, scoring="f1_micro").mean())

0.5233333333333333
0.5172222222222222
0.505
0.5333333333333332
0.5166666666666666


If you run the previous cell several times, you see different results (it gets even more varied if you change `density`). This variation arises because we add **random** noise. Not good...

Instead, it makes sense to force the model to spread the weights more evenly over all features, rather than bet on a few features, which might not be present in future data.

We can do this by setting the `C` parameter when training the model. `C` is the inverse of the regularization strength: the default is `1`, and lower values mean stricter regularization.
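
One way to see what `C` does is to look at the overall size of the learned weights. The sketch below fits the same model on random toy data with several values of `C` and records the length of the coefficient vector; the data is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: two informative features plus noise, not perfectly separable.
rng = np.random.RandomState(0)
X_toy = rng.randn(200, 10)
y_toy = (X_toy[:, 0] - X_toy[:, 1] + 0.5 * rng.randn(200) > 0).astype(int)

norms = {}
for c in [10, 1.0, 0.1, 0.01]:
    model = LogisticRegression(C=c).fit(X_toy, y_toy)
    norms[c] = float(np.linalg.norm(model.coef_))  # overall size of the weights
    print(c, round(norms[c], 3))
```

The smaller `C` is, the more the weights are pulled towards zero, which limits how strongly the model can rely on any single feature.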

In [None]:
from sklearn.metrics import f1_score

best_c = None
best_f1_score = 0.0

for c in [50, 20, 10, 1.0, 0.5, 0.1, 0.05, 0.01]:
    clf = LogisticRegression(C=c, n_jobs=-1) # here, c is the regularization param
    cv_reg = cross_val_score(clf, X, y=y, cv=5, n_jobs=-1, scoring="f1_micro").mean() # mean over the 5 folds

    print("5-CV on train at C={}: {}".format(c, cv_reg))
    print()

    if cv_reg > best_f1_score:
        best_f1_score = cv_reg
        best_c = c
        
print("best C parameter: {}".format(best_c))

5-CV on train at C=50: 0.8477777777777777

5-CV on train at C=20: 0.8488888888888889

5-CV on train at C=10: 0.8488888888888889

5-CV on train at C=1.0: 0.828888888888889

5-CV on train at C=0.5: 0.8183333333333334

5-CV on train at C=0.1: 0.788888888888889

5-CV on train at C=0.05: 0.7311111111111112

5-CV on train at C=0.01: 0.5077777777777778

best C parameter: 20


In [None]:
reg_clf = LogisticRegression(C=best_c, n_jobs=-1)
reg_clf.fit(X, y)
reg_preds = reg_clf.predict(new_X)

print(classification_report(new_y, reg_preds))

              precision    recall  f1-score   support

           0       0.86      0.85      0.85       111
           1       0.81      0.83      0.82        89

    accuracy                           0.84       200
   macro avg       0.84      0.84      0.84       200
weighted avg       0.84      0.84      0.84       200



# Better features = better performance


We now have **a lot** of features! More than we have actual examples...

Not all of them will be helpful, though. Let's select the top 1500 based on how well they predict the outcome of the training data.

We use two tools from `sklearn.feature_selection`: `SelectKBest` (the selection algorithm) and `chi2` (the selection criterion).

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# we take the features more related with the output class
selector = SelectKBest(chi2, k=1500).fit(X, y) 
X_sel = selector.transform(X)
print(X_sel.shape)

(1800, 1500)


In [None]:
X.shape

(1800, 66808)

Let's see how well this new representation performs, by looking at the 5-fold cross-validation. We keep the best regularization value from before.

In [None]:
clf = LogisticRegression(C=best_c, n_jobs=-1)

cv_reg = cross_val_score(clf, X_sel, y=y, cv=5, n_jobs=-1, scoring="f1_micro")
print("5-CV on train: {}".format(cv_reg.mean()))

5-CV on train: 0.8955555555555555


Not too bad! We have handily beaten our previous best! Let's fit a classifier on the whole data now.

In [None]:
clf.fit(X_sel, y)

LogisticRegression(C=20, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now, let's apply it to the held-out data set. 
We need to 
* vectorize the data with our vectorizer from before (otherwise, we get different features)
* select the top features (using our previously fitted selector)

In [None]:
# select features for new data
new_X_sel = selector.transform(new_X)
print(new_X_sel.shape)

(200, 1500)


Finally, we can use our new classifier to predict the new data labels, and compare them to the truth.

In [None]:
new_predictions_regularized = clf.predict(new_X_sel)
prediction_df = pd.DataFrame(data={'input': new_data['input'], 'prediction': labels2numbers.inverse_transform(new_predictions_regularized), 'truth':new_data['output']})
prediction_df

Unnamed: 0,input,prediction,truth
0,robert redford ' s a river runs through it is ...,pos,pos
1,if the 70 ' s nostalgia didn ' t make you feel...,neg,neg
2,you think that these people only exist in the ...,neg,neg
3,""" knock off "" is exactly that : a cheap knock ...",neg,neg
4,brian depalma needs a hit * really * badly . s...,pos,pos
...,...,...,...
195,i won  t even pretend that i have seen the ot...,pos,neg
196,the cartoon is way better . that ' s the botto...,neg,neg
197,"dr . alan grant ( sam neill , "" jurassic park ...",neg,neg
198,of course i knew this going in . why is it tha...,neg,neg


In [None]:
print(classification_report(new_y, new_predictions_regularized))

              precision    recall  f1-score   support

           0       0.83      0.80      0.82       111
           1       0.76      0.80      0.78        89

    accuracy                           0.80       200
   macro avg       0.80      0.80      0.80       200
weighted avg       0.80      0.80      0.80       200
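
One caveat when reading cross-validation scores after feature selection: if the selector is fitted on all training rows before cross-validation, the test part of each fold has already influenced which features were kept, which can make the score optimistic. A possible remedy, sketched here on pure-noise toy data, is to put selector and classifier into a `Pipeline` so that selection is refitted inside each fold. The data sizes and `k=20` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
# Pure noise: non-negative features (as chi2 requires) and random labels,
# so there is nothing real to learn.
X_noise = rng.rand(100, 2000)
y_noise = rng.randint(0, 2, size=100)

# Selecting on all rows first lets the held-out folds influence the selection.
X_leaky = SelectKBest(chi2, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(LogisticRegression(), X_leaky, y_noise, cv=5).mean()

# Inside a pipeline, selection is refitted on the training part of each fold.
pipe = make_pipeline(SelectKBest(chi2, k=20), LogisticRegression())
honest = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print("selected before CV:", leaky)
print("selected inside CV:", honest)
```

On data like this, the first score tends to look better than chance although the labels are random, while the pipeline score tends to stay near chance.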



## Getting insights

In order to explore which features are most indicative, we need some code

In [None]:
features = vectorizer.get_feature_names_out() # get the names of the features
top_scores = selector.scores_.argsort()[-1500:] # get the indices of the selected features
best_indicator_terms = [features[i] for i in sorted(top_scores)] # names in column order, matching clf.coef_

top_indicator_scores = pd.DataFrame(data={'feature': best_indicator_terms, 'coefficient': clf.coef_[0]})
top_indicator_scores.sort_values('coefficient')

Unnamed: 0,feature,coefficient
84,bad,-12.780473
1449,waste,-8.680148
73,attempt,-8.436183
1307,suppose,-8.139754
154,boring,-7.687207
...,...,...
1004,perfectly,6.340445
311,definitely,6.399306
1005,performance,6.453414
579,hilarious,6.495110
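
As an alternative to sorting the scores by hand, the fitted selector can report the kept columns directly through `get_support()`. The sketch below checks on random toy data that both routes give the same indices when no two scores are tied; the feature names `f0` to `f7` are invented.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
X_toy = rng.rand(50, 8)
y_toy = rng.randint(0, 2, size=50)
names = np.array(['f{}'.format(i) for i in range(8)])

sel = SelectKBest(chi2, k=3).fit(X_toy, y_toy)

# Column indices of the kept features, in ascending order.
kept = sel.get_support(indices=True)

# The manual route: take the 3 highest scores, then sort the indices.
by_scores = np.sort(sel.scores_.argsort()[-3:])

print(names[kept])
```

The ascending order matters: it is the column order of the transformed matrix, and therefore the order of the classifier's coefficients.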


# Exercise

Try to test the model trained on the sentiment analysis dataset on the wine reviews.

In [None]:
X_rev = vectorizer_wine.transform(wine_des_test.astype(str))

In [None]:
y_rev = labels2numbers_wine.transform(wine_country_test)

In [None]:
y_pred = classifier_wine.predict(X_rev)

In [None]:
print(classification_report(y_rev, y_pred))

              precision    recall  f1-score   support

           0       0.39      0.45      0.42       193
           1       0.38      0.69      0.49       115
           2       0.40      0.72      0.51       158
           3       1.00      0.14      0.25         7
           4       0.05      0.10      0.07        10
           5       0.34      0.42      0.38       127
           6       0.81      0.54      0.65       963
           7       0.49      0.84      0.62       174
           8       0.58      0.61      0.60        23
           9       1.00      0.33      0.50         3
          10       0.29      0.57      0.38         7
          11       0.91      0.91      0.91       937
          12       0.62      0.56      0.59         9
          13       0.24      0.37      0.29        46
          14       0.33      0.65      0.44       181
          15       0.25      0.33      0.29         3
          16       0.49      0.71      0.58        48
          17       0.61    

# Italian classifier

In our lab, we developed an Italian emotion and sentiment classifier, available at https://github.com/MilaNLProc/feel-it

In [None]:
! pip install -U feel-it

In [None]:
from feel_it import EmotionClassifier, SentimentClassifier

emotion_classifier = EmotionClassifier()

emotion_classifier.predict(["sono molto felice", "ma che cazzo vuoi", "sono molto triste"])



['joy', 'anger', 'sadness']

In [None]:
sentiment_classifier = SentimentClassifier()

sentiment_classifier.predict(["sono molto felice", "ma che cazzo vuoi", "sono molto triste"])

['positive', 'negative', 'negative']

# Exercise

Download a set of tweets with a specific hashtag in Italian and try to run the Emotion and Sentiment Classifier.

# Checklist: how to classify my data

1. label at ***least 2000*** tweets in your data set as `positive`, `negative`, or `neutral`
2. preprocess the text of *all* tweets in your data (labeled and unlabeled)
3. read in the labeled tweets and their labels
4. transform the labels into numbers
5. use `TfidfVectorizer` to extract the features and transform them into feature vectors
6. select the top $N$ features (where $N$ is smaller than the number of labeled tweets)
7. create a classifier
8. use 5-fold CV to find the best regularization parameter, top $N$ feature selection, and maybe feature generation and preprocessing steps

Once you are satisfied with the results:
9. read in the rest of the (unlabeled) tweets
10. use the `TfidfVectorizer` from 5. to transform the new data into vectors
11. use the `SelectKBest` selector from 6. to get the top $N$ features
12. use the classifier from 7. to predict the labels for the new data
13. save the predicted labels or probabilities to your database or an Excel file
