<h1>Fast.ai's Machine Learning Course - Lesson 5<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IMDB-dataset-and-the-sentiment-classification-task" data-toc-modified-id="IMDB-dataset-and-the-sentiment-classification-task-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>IMDB dataset and the sentiment classification task</a></span><ul class="toc-item"><li><span><a href="#Tokenizing-and-term-document-matrix-creation" data-toc-modified-id="Tokenizing-and-term-document-matrix-creation-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Tokenizing and term document matrix creation</a></span></li></ul></li><li><span><a href="#Naive-Bayes" data-toc-modified-id="Naive-Bayes-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Naive Bayes</a></span><ul class="toc-item"><li><span><a href="#Logistic-regression" data-toc-modified-id="Logistic-regression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#N-grams-with-NB-features" data-toc-modified-id="N-grams-with-NB-features-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>N-grams with NB features</a></span></li></ul></li><li><span><a href="#fastai-NBSVM++" data-toc-modified-id="fastai-NBSVM++-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>fastai NBSVM++</a></span></li><li><span><a href="#References" data-toc-modified-id="References-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>References</a></span></li></ul></div>

**Important: This notebook will only work with fastai-0.7.x. Do not try to run any fastai-1.x code from this path in the repository because it will load fastai-0.7.x**

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression
warnings.simplefilter('ignore')

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 reviews from IMDB, split evenly between positive and negative. The authors kept only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. The dataset is divided into a training set and a test set of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [2]:
PATH='data/imdb/'
names = ['neg','pos']

In [3]:
!ls {PATH}

imdb.vocab
imdbEr.txt
README
test
train


In [4]:
!ls {PATH}train

labeledBow.feat
neg
pos
unsup
unsupBow.feat
urls_neg.txt
urls_pos.txt
urls_unsup.txt


In [5]:
!ls {PATH}train/pos | head

0_9.txt
1_7.txt
10_9.txt
100_7.txt
1000_8.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt




In [6]:
def texts_labels_from_folders(path, folders):
    texts,labels = [],[]
    for idx,label in enumerate(folders):
        for fname in glob(os.path.join(path, label, '*.*')):
            texts.append(open(fname, 'r', encoding='utf-8').read())
            labels.append(idx) # 0: neg, 1: pos
    return texts, np.array(labels).astype(np.int64)

In [7]:
trn, trn_y = texts_labels_from_folders(f'{PATH}train', names)
val, val_y = texts_labels_from_folders(f'{PATH}test', names)

Here is the text of the first review:

In [8]:
trn[0]

"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [9]:
trn_y[0]

0

Yes, it was a negative review :P

[`sklearn.feature_extraction.text.CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents into a *term-document matrix*: a matrix of token counts representing a **bag of words**. This throws away the word order and keeps only how often each word appears (a rectangular matrix is also convenient for linear modelling). In many NLP problems discarding word order is a really bad idea, but in this case it works well! (RNNs, which do use word order, will be used in the DL course.)

Note: the tokenizer takes care of punctuation and similar details.

In [10]:
veczr = CountVectorizer(tokenizer=tokenize)

We can think of `CountVectorizer` as a model-like class: `fit_transform(trn)` finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to the validation set, the second line calls only `transform(val)`, which reuses the same vocabulary and column order. Both `trn_term_doc` and `val_term_doc` are **sparse matrices**. `trn_term_doc[i]` represents training document `i` and contains, for each word in the vocabulary, the number of times that word occurs in the document.

In [11]:
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)     # reuse the vocabulary learned from trn
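
To make the difference between `fit_transform` and `transform` concrete, here is a minimal sketch on made-up toy reviews. It assumes sklearn's default tokenizer rather than fastai's `tokenize`, and all the `toy_*` names are just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny "training" reviews and one "validation" review (made-up data).
toy_trn = ["good movie good plot", "bad movie"]
toy_val = ["good unknown movie"]

toy_veczr = CountVectorizer()                   # default tokenizer, not fastai's
toy_trn_td = toy_veczr.fit_transform(toy_trn)   # learns the vocabulary, then counts
toy_val_td = toy_veczr.transform(toy_val)       # reuses the same columns

# Recover the column order from the learned vocabulary (token -> column index).
toy_vocab = sorted(toy_veczr.vocabulary_, key=toy_veczr.vocabulary_.get)
print(toy_vocab)
print(toy_trn_td.toarray())
print(toy_val_td.toarray())
```

The word `unknown` never appeared in the toy training set, so it has no column and is simply ignored when transforming the validation review.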

In [12]:
trn_term_doc # Docs x unique tokens. "Stored elements" counts the non-zero entries: about 150 distinct tokens per doc x 25,000 docs

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

In [13]:
trn_term_doc[0] # 1 x unique tokens. "Stored elements" is the number of distinct tokens in this review

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 93 stored elements in Compressed Sparse Row format>
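
The "stored elements" figure in these outputs is the number of non-zero entries of the sparse matrix. A small sketch with a made-up matrix (the names `dense` and `sparse` are only for illustration) shows how to read it:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up term-document counts: 2 documents x 4 vocabulary entries.
dense = np.array([[0, 2, 1, 1],
                  [1, 0, 1, 0]])
sparse = csr_matrix(dense)

print(sparse.shape)    # documents x vocabulary size
print(sparse.nnz)      # stored (non-zero) entries in the whole matrix
print(sparse[0].nnz)   # distinct tokens present in document 0
```

Only the non-zero counts are kept in memory, which is why a 25000x75132 matrix with roughly 3.7 million stored elements is cheap to hold.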

In [14]:
vocab = veczr.get_feature_names()
vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

In [15]:
w0 = set([o.lower() for o in trn[0].split(' ')])
w0

{'a',
 'absurd',
 'an',
 'and',
 'audience',
 'be',
 'better',
 'briefly.',
 'by',
 'can',
 'chantings',
 'cinematography',
 'comedy.',
 'crazy',
 'cryptic',
 'dialogue',
 'easy',
 'era',
 'even',
 'eventually',
 'example',
 'feelings',
 'for',
 'formal',
 'forrest',
 'frederic',
 'from',
 'future',
 'general',
 'good',
 'grader.',
 'great',
 'has',
 'insane,',
 'into',
 'is',
 'it',
 "it's",
 'just',
 'kirkland',
 'level',
 'make',
 'making',
 'man',
 'might',
 'mob',
 'narrative',
 'no',
 'of',
 'off',
 'off.',
 'on',
 'opening',
 'orchestra',
 'out',
 'pig.',
 'putting.',
 'sally',
 'scene',
 'seem',
 'seen',
 'shakespeare',
 'should',
 'singers.',
 'some',
 'stars',
 'starts',
 'stays',
 'story',
 'technical',
 'terrific',
 'than',
 'that',
 'the',
 'think',
 'third',
 'those',
 'time',
 'to',
 'too',
 'turned',
 'unfortunately',
 'unnatural',
 'vilmos',
 'violent',
 'who',
 'whole',
 'with',
 'would',
 'you',
 'zsigmond.'}

In [16]:
len(w0)

91

That is slightly fewer than the 93 stored elements of `trn_term_doc[0]`, since here we split on spaces instead of using the tokenizer.
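
The following sketch shows why the two counts differ. It uses a hypothetical regex tokenizer, `simple_tokenize`, which is NOT fastai's `tokenize` but behaves similarly in that punctuation becomes its own token.

```python
import re

text = "It's absurd. Absurd comedy, absurd plot."

# Naive approach: split on spaces, so punctuation stays glued to the words.
by_space = set(w.lower() for w in text.split(' '))

# Hypothetical simple tokenizer (not fastai's): words and punctuation marks
# become separate tokens.
def simple_tokenize(s):
    return re.findall(r"\w+|[^\w\s]", s.lower())

by_tokens = set(simple_tokenize(text))

print(sorted(by_space))
print(sorted(by_tokens))
```

With the space split, `absurd` and `absurd.` count as two different words; with a tokenizer they collapse into one word plus a separate `.` token.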

We can get the ID for a word:

In [17]:
veczr.vocabulary_['absurd']

1297

And get the number of appearances in a document:

In [18]:
trn_term_doc[0,1297]

2

In [19]:
trn_term_doc[0,5000]

0

## Naive Bayes

Naive Bayes is a widely used technique, for example in spam filtering.

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where the ratio of feature $f$ in positive documents is the number of times feature $f$ appears in positive documents, divided by the number of positive documents (and likewise for negative documents). We take the log so we can sum instead of multiplying probabilities, which avoids ending up with a number so tiny that it underflows floating point.
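
A quick sketch of the floating point problem, using made-up probabilities: multiplying many small numbers collapses to zero, while summing their logs stays perfectly representable.

```python
import math

probs = [1e-5] * 100     # 100 small per-word probabilities (made-up values)

product = 1.0
for p in probs:
    product *= p         # the true value, 1e-500, is too small for a float64

log_sum = sum(math.log(p) for p in probs)

print(product)           # underflows to zero
print(log_sum)           # a perfectly ordinary negative number
```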

In [20]:
def pr(y_i):
    p = x[y==y_i].sum(0)              # y: (25000,) | x[y==1]: (12500,75132) | x[y==1].sum(0): (1,75132) -> count of each feature in class y_i
    return (p+1) / ((y==y_i).sum()+1)  # add 1 to both so that no ratio is 0 or undefined

In [21]:
x=trn_term_doc
y=trn_y

r = np.log(pr(1)/pr(0))
b = np.log((y==1).mean() / (y==0).mean())

In [22]:
# p = x[y==1].sum(0)+1
# q = x[y==0].sum(0)+1
# r = np.log((p/p.sum())/(q/q.sum()))
# b = np.log(len(p)/len(q))

In [23]:
r.shape

(1, 75132)

In [24]:
b

0.0
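
To see what `r` and `b` look like on something small enough to check by hand, here is a sketch on a made-up binary matrix. `toy_pr` mirrors `pr()` above but takes its data as arguments; all the `toy_*` names are assumptions for illustration.

```python
import numpy as np

# Binary term-document matrix: 4 documents x 3 words (made-up data).
toy_x = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
toy_y = np.array([1, 1, 0, 0])   # 1: pos, 0: neg

def toy_pr(x, y, y_i):
    p = x[y == y_i].sum(0)                    # count of each word in class y_i
    return (p + 1) / ((y == y_i).sum() + 1)   # same add-one smoothing as pr()

toy_r = np.log(toy_pr(toy_x, toy_y, 1) / toy_pr(toy_x, toy_y, 0))
toy_b = np.log((toy_y == 1).mean() / (toy_y == 0).mean())

print(toy_r)   # positive for word 0, zero for word 1, negative for word 2
print(toy_b)
```

Word 0 appears only in positive documents, so its ratio is positive; word 2 appears only in negative ones, so its ratio is negative; word 1 is equally common in both, so it carries no information. As in the real dataset, `toy_b` is zero because the classes are balanced.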

Here is the formula for Naive Bayes, very similar to a logistic regression:

In [25]:
pre_preds = val_term_doc @ r.T + b  # r.T: r transposed so we have (25000,75132) @ (75132,1) = (25000,1)
preds = pre_preds.T > 0             # positive log-odds -> predict pos; transposed to (1,25000) to line up with val_y
(preds==val_y).mean()               # fraction of correct predictions (accuracy)

0.81656

... and we try binarized Naive Bayes, since the number of times the same word appears in a document may not matter:

In [26]:
x = trn_term_doc.sign() # binarize: every positive count becomes 1, zeros stay 0
r = np.log(pr(1)/pr(0))

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T > 0
(preds==val_y).mean()

0.83016
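
The binarization step is just `sign()` on the sparse count matrix. A tiny sketch with made-up counts:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up counts: 2 documents x 3 words.
counts = csr_matrix(np.array([[0, 3, 1],
                              [2, 0, 0]]))

binary = counts.sign()   # every positive count becomes 1, zeros stay 0
print(binary.toarray())
```

Counts are never negative, so the result only contains zeros and ones: "does this word appear at all?" instead of "how many times?".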

So why don't we learn the `r` coefficients and `b` from the data, instead of deriving them from assumptions like we did? Let's apply a logistic regression with sklearn :)

### Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

About `dual=True`: whenever the term-document matrix is wider than it is tall (more features than documents), set `dual=True` and the fit will run much faster.

In [27]:
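m = LogisticRegression(C=1e8, dual=True, solver='liblinear')  # liblinear is the solver that supports dual=True
m.fit(x, y)                       # note: x is still the binarized matrix from the previous cell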
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.83256

The binary version...

In [28]:
m = LogisticRegression(C=1e8, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.85504

...and now the regularized version (L2 by default). We have far more columns than rows, so we are almost certainly overfitting. Regularization is controlled by the `C` parameter of sklearn's `LogisticRegression`: the smaller `C`, the stronger the regularization.

In [29]:
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(x, y)                       # note: x is still the binarized matrix from cell 26
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.84872

In [30]:
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.88404
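
To see the effect of `C` in isolation, here is a sketch on synthetic data (nothing to do with IMDB; `X_toy`, `y_toy` and `norms` are made-up names). It fits the same model with three values of `C` and compares the overall size of the learned weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.randn(200, 20)                    # synthetic features
true_w = rng.randn(20)
y_toy = (X_toy @ true_w + 0.5 * rng.randn(200) > 0).astype(int)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, solver='liblinear')
    clf.fit(X_toy, y_toy)
    norms[C] = np.linalg.norm(clf.coef_)      # overall size of the weights
    print(C, norms[C])
```

The smaller `C` is, the more the weights are shrunk towards zero, which is exactly the extra penalty that keeps a model with 75,132 columns from memorizing 25,000 rows.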

We could try writing a PyTorch version of this logistic regression :)

### N-grams with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use **bigrams** (groups of 2 consecutive words) and **trigrams** (groups of 3) in addition to **unigrams**. This works great with bag-of-words techniques, because n-grams recover some of the local word order. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

In [31]:
veczr = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [32]:
trn_term_doc.shape

(25000, 800000)

In [33]:
vocab = veczr.get_feature_names()

In [34]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']
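
The vocabulary entries above mix unigrams, bigrams and trigrams. As a rough sketch of what `ngram_range=(1,3)` produces for one tokenized text, here is a small pure-Python helper; `ngrams` is a hypothetical function written for this illustration, not something from sklearn or fastai.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

grams = ngrams(['by', 'vera', 'miles'])
print(grams)
```

Three tokens already give six features, which is why the vocabulary grows so quickly and why we cap it with `max_features`.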

In [35]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()

In [36]:
r = np.log(pr(1) / pr(0))
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the binarized unigrams, bigrams and trigrams.

In [37]:
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.905

Now we do the same using the Naive Bayes features. Here is the log-count ratio `r`:

In [38]:
r.shape, r

((1, 800000),
 matrix([[-0.05468, -0.161  , -0.24784, ...,  1.09861, -0.69315, -0.69315]]))

It has one entry per selected feature. We undo the log to see the raw ratios:

In [39]:
np.exp(r)

matrix([[0.94678, 0.85129, 0.78049, ..., 3.     , 0.5    , 0.5    ]])

Here we fit regularized logistic regression where the features are the n-grams' log-count ratios.

In [40]:
x_nb = x.multiply(r)                      # element-wise with broadcast multiplication

m = LogisticRegression(dual=True, C=0.1, solver='liblinear')
m.fit(x_nb, y);                           # use x * r instead of x

val_x_nb = val_x.multiply(r)              # same multiplication for validation set
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91768

The result is surprisingly better! Why are they different? Larger weights incur a larger penalty in the loss function, and we want to apply as little regularization as possible. If we multiply our ones and zeros by the theoretical ratio `r` (our prior expectation), we are helping the model, since the inputs already encode how strongly each feature points to one class. The model can then fit well with smaller weights, so the penalty has less to push against.

![imdb_nb.png](images/imdb_nb.png)
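
The `x.multiply(r)` step used above is an element-wise multiplication where the single row `r` is broadcast over every document. A minimal sketch with made-up values (the `toy_*` names are assumptions for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_x = csr_matrix(np.array([[1, 1, 0],
                             [0, 1, 1]]))            # binarized documents
toy_r = np.array([[np.log(3), 0.0, -np.log(3)]])     # one log-count ratio per feature

# Row i of the result is toy_x[i] * toy_r, element by element.
# Wrapping in csr_matrix keeps the result type the same whatever scipy returns.
toy_x_nb = csr_matrix(toy_x.multiply(toy_r))
print(toy_x_nb.toarray())
```

Each 1 in a document is replaced by the ratio of the corresponding feature, and each 0 stays 0, so the matrix remains sparse.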

## fastai NBSVM++

There is a paper (see References) that combines SVM with NB features (using bigrams) and reports good results compared with other linear algorithms. SVM is similar to simple logistic regression, so its results are almost the same as the ones we got here.

Jeremy developed another version of NBSVM (using logistic regression, by the way), now included in fastai, that obtains better results :) The fundamental idea is to adjust the weights by adding a constant, which shifts the near-zero weights away from zero. The penalty then has less effect on the effective weights, because the constant keeps them from being pushed to 0.

The linear transformation is implemented as an **embedding**. An embedding makes a multiplication by a *one-hot encoded* matrix faster by simply replacing it with an array lookup. We don't even have to build the matrix.
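
Here is a small NumPy sketch of that equivalence, with made-up weights and word indices. It is only meant to illustrate the idea, not how fastai implements it internally.

```python
import numpy as np

rng = np.random.RandomState(0)
n_vocab, n_out = 6, 1
W = rng.randn(n_vocab, n_out)        # one weight per vocabulary entry (made-up)

word_ids = np.array([1, 4, 5])       # indices of the words present in a document

# Slow way: build the row of ones and zeros, then do a matrix product.
one_hot = np.zeros((1, n_vocab))
one_hot[0, word_ids] = 1
via_matmul = one_hot @ W

# Embedding way: look up the rows we need and add them up.
via_lookup = W[word_ids].sum(axis=0, keepdims=True)

print(via_matmul, via_lookup)
```

Both paths give the same number, but the lookup only touches the handful of rows for words that actually occur, instead of multiplying by a row that is almost entirely zeros.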

In [41]:
sl=2000 # max words per review

In [42]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [43]:
# Build the model and train it
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                                                                              
    0      0.025065   0.119235   0.91732   



[0.11923515022158623, 0.917320000038147]

In [44]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                                                                              
    0      0.021603   0.113393   0.92052   
    1      0.011467   0.111916   0.92104                                                                               



[0.11191634260177612, 0.921040000038147]

In [45]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                                                                              
    0      0.018589   0.110478   0.92312   
    1      0.008871   0.109894   0.92228                                                                               



[0.10989382385492324, 0.9222800000381469]

What a great result! Most important of all, we can apply these embeddings to any kind of data, not only NLP: for example, predicting store sales.

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)