In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastaiClone.old.fastai.nlp import *
from sklearn.linear_model import LogisticRegression

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB. The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided evenly into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [2]:
PATH='data/aclImdb/'
names = ['neg','pos']

In [3]:
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)
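
As a rough picture of the loading step above, the sketch below reads every text file under one folder per class and uses the folder's position in `names` as the label. This is an assumption about what `texts_labels_from_folders` does, not its actual source; the helper `load_texts_labels` and the demo file names are hypothetical.

```python
from pathlib import Path
import tempfile

def load_texts_labels(root, names):
    """Read every *.txt file under root/<name>/ and label it with the index of <name>."""
    texts, labels = [], []
    for idx, name in enumerate(names):
        for path in sorted((Path(root) / name).glob('*.txt')):
            texts.append(path.read_text(encoding='utf-8'))
            labels.append(idx)
    return texts, labels

# Tiny demo on a throwaway directory that mimics the neg/ and pos/ layout
with tempfile.TemporaryDirectory() as tmp:
    for name, review in [('neg', 'dull and slow'), ('pos', 'a real delight')]:
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / '0_1.txt').write_text(review, encoding='utf-8')
    demo_texts, demo_labels = load_texts_labels(tmp, ['neg', 'pos'])
```

With `names = ['neg','pos']`, negative reviews get label 0 and positive reviews get label 1, which matches the `trn_y[1]` output shown later.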

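The term-document matrix built later in this section rests on a simple idea: count how many times each word occurs in a document. A minimal version of that loop:

In [None]:
# Sketch of per-document word counting
words = trn[0].lower().split()  # naive whitespace tokenization, for illustration only
word_count = {}
for word in words:
    if word in word_count:      # word seen before: increment its count
        word_count[word] += 1
    else:                       # first occurrence: start its count at 1
        word_count[word] = 1

Here is the text of the second training review, `trn[1]`: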

In [29]:
trn[1]

'Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question "why in Gods name would they create another one of these dumpster dives of a movie?" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we\'re from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to 

In [28]:
# negative label
trn_y[1]

0

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`).

In [30]:
veczr = CountVectorizer(tokenizer=tokenize)

`fit_transform(trn)` finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to the validation set, the second line uses just the method `transform(val)`. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document `i` and contains the count of each vocabulary word in that document.

In [31]:
# fit the vectorizer (learn the vocabulary) on the training set, then reuse it on the validation set
# both sets must be vectorized with the same vocabulary so their columns line up
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
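
The difference between fitting and transforming can be sketched in plain Python. This is a simplified stand-in, not `CountVectorizer` itself: it assumes whitespace tokenization, and the helper names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token in docs to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Return one row of counts per document; tokens outside vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

train_docs = ['good good film', 'bad film']
valid_docs = ['good plot']          # 'plot' never appears in training

vocab = fit_vocabulary(train_docs)  # columns: bad, film, good
train_matrix = transform(train_docs, vocab)
valid_matrix = transform(valid_docs, vocab)
```

Because the vocabulary comes only from the training documents, the validation word `plot` has no column and is dropped, which keeps both matrices the same width.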

In [32]:
trn_term_doc

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>
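
A quick calculation on the numbers reported above shows why a sparse format is used. The sketch assumes every stored element is a nonzero count.

```python
n_docs, n_vocab = 25000, 75132   # matrix shape reported above
stored = 3749745                 # stored elements reported above

density = stored / (n_docs * n_vocab)   # fraction of cells that are nonzero
avg_terms_per_doc = stored / n_docs     # distinct vocabulary terms per review, on average

print(f'density: {density:.4%}')
print(f'average distinct terms per review: {avg_terms_per_doc:.1f}')
```

Only about 0.2% of the cells are nonzero, and a review uses roughly 150 distinct vocabulary terms on average, so a dense array would consist almost entirely of zeros.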

In [33]:
# first document: 43 of the ~75k vocabulary words have a nonzero count
trn_term_doc[0]

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 43 stored elements in Compressed Sparse Row format>

In [36]:
# all the learned words
vocab = veczr.get_feature_names(); vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

In [49]:
# a naive tokenizer that just splits by spaces
w0 = set([o.lower() for o in trn[0].split(' ')]); w0

{'/><br',
 '/>branagh',
 'a',
 'and',
 'appealing',
 'audience.<br',
 'be',
 'best',
 'cast',
 'creditable',
 'film',
 "fishburne's",
 'form.',
 'from',
 'good',
 "it's",
 'manages',
 'nose,',
 'of',
 'on',
 'one',
 'shakespeare',
 'source,',
 'sources,',
 'steals',
 'still',
 'talented',
 'the',
 "there's",
 'this',
 'to',
 'under',
 'whilst',
 'wider',
 'with',
 'working'}

In [50]:
# close to the 43 stored elements above; the naive split leaves punctuation and HTML tags attached to words
len(w0)

36

In [77]:
# gets the number that corresponds to the word
veczr.vocabulary_['absurd']

1297

In [78]:
# gets the count of the word in the 0th doc (1297 specifies the word index which we saw above)
trn_term_doc[0,1297]

2

In [79]:
trn_term_doc[0,5000]

0

## Naive Bayes

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

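where the ratio of feature $f$ in positive documents is the number of times a positive document has feature $f$ divided by the number of positive documents. Note that the code below normalizes by the total (smoothed) feature count of each class rather than by the number of documents.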

In [80]:
x=trn_term_doc
y=trn_y

# sums over documents (axis 0), so each feature gets its total occurrences within each class
p = x[y==1].sum(0) + 1 # positive class; +1 so each feature has at least one count
q = x[y==0].sum(0) + 1 # negative class, same smoothing

# p and q are now 1x75k

# normalizes so each feature's count becomes a fraction of its class total,
# so p before might be [0.1, 0.2] but now it's [0.33, 0.66]
# working with logs turns products of many small probabilities into sums, which avoids underflow
r = np.log((p/p.sum())/(q/(q.sum()))) # r is a 1x75k matrix that stores the log of p(f|1)/p(f|0)
# log ratio of the class priors (0 here, since the classes are balanced)
b = np.log((y==1).mean() / (y==0).mean())
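
To make the log-count ratio concrete, here is a toy version on a four-document corpus. The vocabulary and counts are invented for illustration, and the bias is taken as the log ratio of class frequencies, which is how `b` is computed in the trigram section later on.

```python
import numpy as np

# Toy term-document counts: rows are documents, columns are the words
# ['bad', 'film', 'good'] (hypothetical vocabulary)
x = np.array([[0, 1, 2],    # positive review
              [0, 1, 1],    # positive review
              [2, 1, 0],    # negative review
              [1, 1, 0]])   # negative review
y = np.array([1, 1, 0, 0])

p = x[y == 1].sum(axis=0) + 1   # smoothed counts in positive docs: [1, 3, 4]
q = x[y == 0].sum(axis=0) + 1   # smoothed counts in negative docs: [4, 3, 1]

r = np.log((p / p.sum()) / (q / q.sum()))       # log-count ratio per word
b = np.log((y == 1).mean() / (y == 0).mean())   # log prior ratio, 0 when classes are balanced
```

Here `r` is negative for `bad`, zero for `film` (equally common in both classes), and positive for `good`, so the sign of each entry says which class a word points toward.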

Here is the formula for Naive Bayes.
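
As a sketch of where the decision rule comes from, assume the usual Naive Bayes model in which features are independent given the class. Writing $x_f$ for the count of feature $f$ in a document and $r_f$ for its log-count ratio (notation introduced here for illustration), the log odds of the positive class are

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \sum_f x_f \log \frac{P(f \mid y=1)}{P(f \mid y=0)} + \log \frac{P(y=1)}{P(y=0)} = \sum_f x_f \, r_f + b$$

A document is predicted positive when this quantity is greater than 0, that is, when the positive class is more probable than the negative one.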

In [81]:
# each row of val_term_doc is a document and each column a feature, so the matrix product
# multiplies every feature count by its log-count ratio in r (the log of p(f|1)/p(f|0))
# and sums the results per document
pre_preds = val_term_doc @ r.T + b # r is 1x75k, so it is transposed to 75k x 1
# a log ratio above 0 means the positive class is more likely; broadcasting applies >0 elementwise
preds = pre_preds.T>0
(preds==val_y).mean()

0.8074

...and binarized Naive Bayes.

In [82]:
# this just says if I see a word it is 1, otherwise 0
x=trn_term_doc.sign()
p = x[y==1].sum(0) + 1 
q = x[y==0].sum(0) + 1

r = np.log((p/p.sum())/(q/(q.sum())))

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.82844
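
Binarization can be illustrated on a small dense array. The notebook calls `.sign()` on a sparse matrix; `np.sign` is used here as the dense analogue, and the counts are made up.

```python
import numpy as np

counts = np.array([[0, 3, 1],
                   [2, 0, 0]])

# sign() maps every positive count to 1 and leaves zeros at 0,
# so only the presence or absence of a word is kept
binary = np.sign(counts)

# summing the binarized rows gives the number of documents containing each word
doc_freq = binary.sum(axis=0)
```

After binarization, a word repeated many times in one review counts no more than a word that appears once.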

### Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

In [83]:
x=trn_term_doc
y=trn_y

# learns the coefficients; C is the inverse of the regularization strength, so a very large C
# effectively turns regularization off
# essentially trains a single-layer NN mapping the 75k inputs to the 2 classes
m = LogisticRegression(C=1e8, dual=True) # use dual when the matrix is wider than it is tall, runs faster
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()



0.85656

In [84]:
m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()



0.85516

...and the regularized version

In [85]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()



0.88248

In [86]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.88404
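
The role of `C` can be seen from the L2-penalized objective given in the scikit-learn documentation: half the squared norm of the weights plus `C` times the summed logistic loss. The sketch below only evaluates that objective for a fixed weight vector; the function name `l2_logreg_objective` and the toy data are hypothetical, and labels are assumed to be coded as -1 and +1.

```python
import numpy as np

def l2_logreg_objective(w, c, X, y, C):
    """0.5 * ||w||^2 + C * sum of logistic losses, with labels y in {-1, +1}."""
    margins = y * (X @ w + c)
    data_loss = np.logaddexp(0.0, -margins).sum()   # log(1 + exp(-margin)), computed stably
    penalty = 0.5 * w @ w
    return penalty + C * data_loss

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([2.0, -2.0])

weak_reg = l2_logreg_objective(w, 0.0, X, y, C=1e8)    # data term dominates
strong_reg = l2_logreg_objective(w, 0.0, X, y, C=0.1)  # penalty term dominates
```

With `C=1e8` the penalty is negligible next to the data term, so the fit is driven almost entirely by the training loss. With `C=0.1` large weights are costly, which is consistent with the better validation accuracy of the regularized runs above.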

### Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

In [87]:
# counts not just words but pairs of words that come together (bigrams) and trigrams
veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [88]:
trn_term_doc.shape

(25000, 800000)

In [89]:
vocab = veczr.get_feature_names()

In [90]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']
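
The vocabulary entries above are unigrams, bigrams, and trigrams joined by spaces. A minimal sketch of generating them from a token list follows; the helper `ngrams` is hypothetical and only mirrors the effect of `ngram_range=(1,3)`.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

features = ngrams(['by', 'vera', 'miles', '.'])
```

Four tokens yield four unigrams, three bigrams, and two trigrams, which is why the feature count grows so quickly and `max_features` is used to cap it.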

In [91]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()

In [92]:
p = x[y==1].sum(0) + 1 
q = x[y==0].sum(0) + 1

r = np.log((p/p.sum())/(q/(q.sum())))
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the binarized unigrams, bigrams, and trigrams.

In [93]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

# does better due to the bigrams and trigrams

0.905

Here is the $\text{log-count ratio}$ `r`.  

In [94]:
r.shape, r

((1, 800000),
 matrix([[-0.04911, -0.15543, -0.24226, ...,  1.10419, -0.68757, -0.68757]]))

In [95]:
np.exp(r)

matrix([[0.95208, 0.85605, 0.78485, ..., 3.01678, 0.5028 , 0.5028 ]])

Here we fit regularized logistic regression where the features are the trigrams' log-count ratios.

In [96]:
# multiplies each column of the binarized term-document matrix by its log-count ratio
x_nb = x.multiply(r)
# fits logistic regression to that
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91768
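
The scaling step can be shown with dense arrays. This assumes NumPy broadcasting as a stand-in for the sparse `multiply` call, and the ratios in `r` are invented for the example.

```python
import numpy as np

x_bin = np.array([[1, 0, 1],
                  [0, 1, 1]])
r = np.array([[-1.0, 0.5, 2.0]])   # hypothetical log-count ratios, shape 1 x n_features

# broadcasting scales column f by r[f]; words absent from a document stay 0
x_nb = x_bin * r
```

Each present feature now carries its Naive Bayes evidence as its value, so the logistic regression starts from inputs that already encode which class a feature favors.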

## fastai NBSVM++

In [97]:
# up to 2000 words per review
sl=2000

In [98]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [99]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                                                                 
    0      0.026013   0.11901    0.91704   



[0.11901039268612862, 0.91704]

In [100]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                                                                 
    0      0.019801   0.113044   0.9212    
    1      0.01112    0.111959   0.92084                                                                  



[0.11195904880046845, 0.9208400000381469]

In [101]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                                                                 
    0      0.01753    0.11079    0.92284   
    1      0.010035   0.10998    0.92188                                                                  



[0.10997999302983284, 0.921880000038147]

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)