In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 reviews from IMDB, with equal numbers of positive and negative reviews. The authors kept only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. The dataset is split evenly into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [2]:
PATH='data/aclImdb/'
names = ['neg','pos']

In [3]:
%ls {PATH}

imdbEr.txt  imdb.vocab  README  test/  train/


In [4]:
%ls {PATH}train

labeledBow.feat  pos/    unsupBow.feat  urls_pos.txt
neg/             unsup/  urls_neg.txt   urls_unsup.txt


In [5]:
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt


In [6]:
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names) # len(trn): 25000
val,val_y = texts_labels_from_folders(f'{PATH}test',names) # len(val): 25000
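
As a rough picture of what `texts_labels_from_folders` returns, the sketch below reads every file under each class folder and uses the folder's position in `names` as the label. The helper `texts_labels_sketch` and the temporary directory are hypothetical and written only for illustration; the fastai implementation may differ in details such as file ordering.

```python
import os
import tempfile
import numpy as np

def texts_labels_sketch(path, folders):
    """Hypothetical stand-in: read all files in path/<folder>, label them by folder index."""
    texts, labels = [], []
    for idx, folder in enumerate(folders):
        folder_path = os.path.join(path, folder)
        for fname in sorted(os.listdir(folder_path)):
            with open(os.path.join(folder_path, fname), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(idx)
    return texts, np.array(labels)

# Build a tiny fake dataset with the same neg/pos layout as aclImdb/train.
with tempfile.TemporaryDirectory() as tmp:
    for folder, text in [("neg", "dull and slow"), ("pos", "a lovely film")]:
        os.makedirs(os.path.join(tmp, folder))
        with open(os.path.join(tmp, folder, "0_1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    demo_texts, demo_labels = texts_labels_sketch(tmp, ["neg", "pos"])

print(demo_texts, demo_labels)
```

With `names = ['neg','pos']`, label 0 therefore means negative and label 1 means positive.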

Here is the text of the first review:

In [7]:
trn[0]

'This is the first of these "8 Films To Die For" collection that I\'ve seen and it\'s certainly not made me want to see any of the rest...although I\'ve heard at least a couple of them are decent. I don\'t know, this wasn\'t terrible but it didn\'t really do much for me. Your basic dysfunctional cannibal family in suburbia kind of thing, mom & dad died, the family sold the farm & moved to San Francisco (?) where they continued to bring home stray food sources whenever possible. The best part of this was the creepy Goth sister, who of course invites a friend over from school that never leaves. Anyway, of course we have a butcher shop in the basement and so on and so on. This family is sort of like the white-bread version of the Sawyer Clan, they\'re nasty & they do bad things but they ain\'t go no soul. I see a lot of reviews from people that liked this, and I guess I don\'t know what I missed, but I found it to be very mediocre & I wouldn\'t recommend it to anyone, really. 4 out of 10.

In [8]:
trn_y[0]

0

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`).

In [12]:
??tokenize

# def tokenize(s): return re_tok.sub(r' \1 ', s).split()
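
To see what this kind of tokenizer does, here is a minimal sketch. It assumes `re_tok` is a regular expression that captures single punctuation characters; the pattern `re_tok_sketch` below covers only ASCII punctuation and is an illustration, not the library's exact definition.

```python
import re
import string

# Assumption: every ASCII punctuation character becomes its own token.
re_tok_sketch = re.compile("([" + re.escape(string.punctuation) + "])")

def tokenize_sketch(s):
    # Surround each punctuation mark with spaces, then split on whitespace.
    return re_tok_sketch.sub(r" \1 ", s).split()

tokens = tokenize_sketch("I don't know, really.")
print(tokens)
```

Under this sketch, contractions such as `don't` are split into three tokens, and commas and periods are kept as separate tokens rather than discarded.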

In [9]:
veczr = CountVectorizer(tokenizer=tokenize)

`fit_transform(trn)` finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to the validation set, the second line uses just the method `transform(val)`, which reuses the vocabulary learned from the training set. `trn_term_doc` and `val_term_doc` are sparse matrices. Row `trn_term_doc[i]` represents training document i: for each word in the vocabulary, it holds the number of times that word occurs in the document.

In [13]:
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
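
The same fit/transform pattern can be tried on a toy corpus. This is a small sketch that uses scikit-learn's default tokenizer rather than `tokenize`, and the toy sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_trn = ["good movie", "bad movie", "good good plot"]
toy_val = ["good unknown plot"]

toy_veczr = CountVectorizer()
toy_trn_td = toy_veczr.fit_transform(toy_trn)  # learns the vocabulary and counts words
toy_val_td = toy_veczr.transform(toy_val)      # reuses the training vocabulary

# Vocabulary words ordered by their column index.
toy_vocab = sorted(toy_veczr.vocabulary_, key=toy_veczr.vocabulary_.get)
print(toy_vocab)
print(toy_trn_td.toarray())
print(toy_val_td.toarray())
```

The word `unknown` never appears in the toy training set, so it has no column and is silently dropped from the validation row.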

In [14]:
trn_term_doc

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

In [30]:
trn_term_doc[0]

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 93 stored elements in Compressed Sparse Row format>

In [23]:
len(veczr.vocabulary_) # vocabulary size

75132

In [24]:
vocab = veczr.get_feature_names(); vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

In [25]:
w0 = set([o.lower() for o in trn[0].split(' ')]); w0

{'"8',
 '&',
 '(?)',
 '10.',
 '4',
 'a',
 "ain't",
 'and',
 'any',
 'anyone,',
 'anyway,',
 'are',
 'at',
 'bad',
 'basement',
 'basic',
 'be',
 'best',
 'bring',
 'but',
 'butcher',
 'cannibal',
 'certainly',
 'clan,',
 'collection',
 'continued',
 'couple',
 'course',
 'creepy',
 'dad',
 'decent.',
 "didn't",
 'die',
 'died,',
 'do',
 "don't",
 'dysfunctional',
 'family',
 'farm',
 'films',
 'first',
 'food',
 'for',
 'for"',
 'found',
 'francisco',
 'friend',
 'from',
 'go',
 'goth',
 'guess',
 'have',
 'heard',
 'home',
 'i',
 "i've",
 'in',
 'invites',
 'is',
 'it',
 "it's",
 'kind',
 'know',
 'know,',
 'least',
 'leaves.',
 'like',
 'liked',
 'lot',
 'made',
 'me',
 'me.',
 'mediocre',
 'missed,',
 'mom',
 'moved',
 'much',
 'nasty',
 'never',
 'no',
 'not',
 'of',
 'on',
 'on.',
 'out',
 'over',
 'part',
 'people',
 'possible.',
 'really',
 'really.',
 'recommend',
 'rest...although',
 'reviews',
 'san',
 'sawyer',
 'school',
 'see',
 'seen',
 'shop',
 'sister,',
 'so',
 'sold',

In [26]:
len(w0)

133

In [27]:
veczr.vocabulary_['absurd']

1297

In [28]:
trn_term_doc[0,1297]

0

In [29]:
trn_term_doc[0,5000]

0

## Naive Bayes

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

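where the ratio of feature $f$ in positive documents is the number of times $f$ occurs in positive documents divided by the number of positive documents, and the ratio in negative documents is defined the same way. The code adds 1 to both the numerator and the denominator so that no ratio is zero.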

In [30]:
def pr(y_i):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [31]:
x=trn_term_doc # scipy.sparse.csr.csr_matrix
y=trn_y

r = np.log(pr(1)/pr(0)) # log-count ratio
b = np.log((y==1).mean() / (y==0).mean())
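
The calculation is easier to follow on a tiny dense matrix. The sketch below repeats the same arithmetic with made-up counts for three words and four documents; `pr_toy` mirrors `pr` but takes its inputs as arguments.

```python
import numpy as np

# Rows are documents, columns are words; the first two documents are positive.
x_toy = np.array([[1, 0, 2],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])
y_toy = np.array([1, 1, 0, 0])

def pr_toy(x, y, y_i):
    # Smoothed ratio of each feature within class y_i.
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

r_toy = np.log(pr_toy(x_toy, y_toy, 1) / pr_toy(x_toy, y_toy, 0))
b_toy = np.log((y_toy == 1).mean() / (y_toy == 0).mean())
print(r_toy, b_toy)
```

The first two words occur equally often in both classes, so their log-count ratio is 0. The third word occurs twice as often in positive documents, giving a smoothed ratio of (2+1)/(1+1) = 1.5 and a positive log-count ratio. The classes are balanced, so the bias is 0.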

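Naive Bayes scores each document by summing the log-count ratios of its words, weighted by their counts, and adding the bias `b`, the log ratio of the class priors. A score above zero predicts a positive review.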

In [32]:
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.81656
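
The decision rule is a single matrix-vector product followed by a threshold. Here is a minimal sketch with hypothetical log-count ratios for three words; the numbers are invented for illustration and are not taken from the IMDB data.

```python
import numpy as np

r_demo = np.array([1.2, -0.8, 0.1])  # hypothetical log-count ratios, one per word
b_demo = 0.0                         # balanced classes give a zero bias

docs = np.array([[2, 0, 1],   # mostly the "positive" word
                 [0, 3, 0],   # only the "negative" word
                 [1, 1, 0]])  # one of each
labels = np.array([1, 0, 0])

scores = docs @ r_demo + b_demo   # one score per document
preds_demo = scores > 0           # True means predicted positive
acc_demo = (preds_demo == labels).mean()
print(scores, preds_demo, acc_demo)
```

The third document is misclassified because the positive word's ratio outweighs the negative word's, which shows how the prediction depends only on the weighted sum and not on word order.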

...and binarized Naive Bayes, where each count is replaced by 1 if the word appears in the document and 0 otherwise.

In [34]:
x=trn_term_doc.sign() # -1 if x < 0, 0 if x==0, 1 if x > 0.
r = np.log(pr(1)/pr(0))

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.83016
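
For a matrix of non-negative counts, `.sign()` keeps the sparsity pattern and sets every stored value to 1. A small sketch with a made-up count matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

counts = csr_matrix(np.array([[0, 3, 1],
                              [2, 0, 0]]))
binary = counts.sign()  # 1 where the word occurs, 0 elsewhere

print(binary.toarray())
print(counts.nnz, binary.nnz)  # the number of stored elements is unchanged
```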

### Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

In [35]:
m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.83248

In [40]:
print(trn_term_doc[0])

  (0, 58)	1
  (0, 47523)	1
  (0, 716)	1
  (0, 3680)	1
  (0, 54176)	1
  (0, 73868)	1
  (0, 41987)	1
  (0, 71296)	1
  (0, 6304)	1
  (0, 25722)	1
  (0, 43234)	1
  (0, 72847)	1
  (0, 29012)	1
  (0, 38801)	1
  (0, 49225)	1
  (0, 55633)	1
  (0, 39532)	1
  (0, 61928)	1
  (0, 45872)	1
  (0, 27876)	1
  (0, 2287)	1
  (0, 66645)	1
  (0, 5472)	1
  (0, 45042)	1
  (0, 53845)	1
  :	:
  (0, 41848)	2
  (0, 40218)	1
  (0, 46117)	1
  (0, 11355)	1
  (0, 57049)	1
  (0, 34716)	4
  (0, 3219)	4
  (0, 58714)	1
  (0, 71057)	2
  (0, 8)	10
  (0, 32391)	9
  (0, 66441)	3
  (0, 13310)	1
  (0, 25457)	2
  (0, 18299)	1
  (0, 67252)	6
  (0, 24585)	1
  (0, 937)	1
  (0, 3)	2
  (0, 66580)	1
  (0, 46749)	11
  (0, 24758)	1
  (0, 66458)	9
  (0, 34616)	2
  (0, 66684)	5


In [39]:
print(trn_term_doc.sign()[0])

  (0, 58)	1
  (0, 47523)	1
  (0, 716)	1
  (0, 3680)	1
  (0, 54176)	1
  (0, 73868)	1
  (0, 41987)	1
  (0, 71296)	1
  (0, 6304)	1
  (0, 25722)	1
  (0, 43234)	1
  (0, 72847)	1
  (0, 29012)	1
  (0, 38801)	1
  (0, 49225)	1
  (0, 55633)	1
  (0, 39532)	1
  (0, 61928)	1
  (0, 45872)	1
  (0, 27876)	1
  (0, 2287)	1
  (0, 66645)	1
  (0, 5472)	1
  (0, 45042)	1
  (0, 53845)	1
  :	:
  (0, 41848)	1
  (0, 40218)	1
  (0, 46117)	1
  (0, 11355)	1
  (0, 57049)	1
  (0, 34716)	1
  (0, 3219)	1
  (0, 58714)	1
  (0, 71057)	1
  (0, 8)	1
  (0, 32391)	1
  (0, 66441)	1
  (0, 13310)	1
  (0, 25457)	1
  (0, 18299)	1
  (0, 67252)	1
  (0, 24585)	1
  (0, 937)	1
  (0, 3)	1
  (0, 66580)	1
  (0, 46749)	1
  (0, 24758)	1
  (0, 66458)	1
  (0, 34616)	1
  (0, 66684)	1


In [36]:
m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.85504

...and the regularized version

In [41]:
m = LogisticRegression(C=0.1, dual=True) # C: Inverse of regularization strength; must be a positive float
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.84872

In [42]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.88404
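
Since `C` is the inverse of the regularization strength, a smaller `C` shrinks the learned weights more. The sketch below illustrates this on synthetic data; the data, the random seed, and the explicit `solver='liblinear'` are choices made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_syn = rng.randn(200, 20)                 # synthetic features
w_true = rng.randn(20)                     # hidden weights used to generate labels
y_syn = (X_syn @ w_true + rng.randn(200) > 0).astype(int)

coef_norms = {}
for C in (1e8, 0.1):
    clf = LogisticRegression(C=C, solver="liblinear")
    clf.fit(X_syn, y_syn)
    coef_norms[C] = np.linalg.norm(clf.coef_)  # size of the weight vector

print(coef_norms)
```

With `C=1e8` the penalty is negligible and the model is close to unregularized; with `C=0.1` the weight vector has a much smaller norm.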

### Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features, described in [Wang and Manning's paper](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is then scaled by its log-count ratio, and a logistic regression model is trained on the result to predict sentiment.

In [43]:
veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [44]:
trn_term_doc.shape

(25000, 800000)

In [47]:
vocab = veczr.get_feature_names()

In [48]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']
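
To see what `ngram_range=(1,3)` and `max_features` do, here is a small sketch on a single made-up sentence. It uses scikit-learn's default tokenizer rather than `tokenize`.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["this movie was not good"]

veczr_ngram = CountVectorizer(ngram_range=(1, 3))
veczr_ngram.fit(doc)
ngrams = sorted(veczr_ngram.vocabulary_)
print(len(ngrams))  # 5 unigrams + 4 bigrams + 3 trigrams
print(ngrams)

# max_features caps the vocabulary at the most frequent n-grams.
veczr_capped = CountVectorizer(ngram_range=(1, 3), max_features=5)
veczr_capped.fit(doc)
print(len(veczr_capped.vocabulary_))
```

Features such as `not good` let a linear model pick up negation that unigram counts alone cannot express.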

In [49]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()

In [50]:
r = np.log(pr(1) / pr(0))
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the binarized unigrams, bigrams, and trigrams.

In [51]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.905

Here is the log-count ratio `r`.

In [52]:
r.shape, r

((1, 800000),
 matrix([[-0.05468, -0.161  , -0.24784, ...,  1.09861, -0.69315, -0.69315]]))

In [53]:
np.exp(r)

matrix([[0.94678, 0.85129, 0.78049, ..., 3.     , 0.5    , 0.5    ]])

Here we fit regularized logistic regression where the features are the binarized n-grams (unigrams through trigrams) scaled by their log-count ratios.

In [54]:
x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91768
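
The call `x.multiply(r)` is an element-wise product that broadcasts the row vector `r` across every document, so each nonzero entry of a binarized row becomes that feature's log-count ratio. A minimal sketch with made-up values:

```python
import numpy as np
from scipy.sparse import csr_matrix

x_bin = csr_matrix(np.array([[1, 0, 1],
                             [0, 1, 0]]))
r_row = np.array([[0.5, -1.0, 2.0]])  # hypothetical log-count ratios, shape (1, 3)

x_scaled = x_bin.multiply(r_row)      # element-wise, broadcast over rows
# Convert to a dense array for display, whatever container SciPy returns.
dense = x_scaled.toarray() if hasattr(x_scaled, "toarray") else np.asarray(x_scaled)
print(dense)
```

Features absent from a document stay at zero, so the scaled matrix has the same sparsity pattern as the binarized one.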

## fastai NBSVM++

In [55]:
sl=2000

In [56]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [57]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                     
    0      0.024684   0.120301   0.91596   



[array([0.1203]), 0.9159599999618531]

In [60]:
learner

DotProdNB(
  (w): Embedding(800001, 1, padding_idx=0)
  (r): Embedding(800001, 2)
)

In [61]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                     
    0      0.019628   0.114144   0.9204    
    1      0.010331   0.11221    0.92128                      



[array([0.11221]), 0.9212799999618531]

In [62]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                     
    0      0.017184   0.111032   0.92128   
    1      0.01029    0.110031   0.92232                       



[array([0.11003]), 0.922319999961853]

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)