**Important: This notebook will only work with fastai-0.7.x. Do not try to run any fastai-1.x code from this path in the repository because it will load fastai-0.7.x**

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression

## Video 10, 1:02:00 IMDB

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 reviews from IMDB, with an equal number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is split evenly into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [135]:
PATH='data/aclImdb/'
names = ['neg','pos']

In [136]:
%ls {PATH}

README      imdb.vocab  imdbEr.txt  test/       train/


In [137]:
%ls {PATH}train

labeledBow.feat  pos/             unsupBow.feat    urls_pos.txt
neg/             unsup/           urls_neg.txt     urls_unsup.txt


In [140]:
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt
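
The file names above appear to follow the `[id]_[rating].txt` pattern described in the dataset's README, so the star rating can be recovered from the name. A minimal sketch, assuming that pattern holds; `parse_review_filename` and `polarity` are hypothetical helpers, not part of fastai:

```python
from pathlib import Path

def parse_review_filename(name):
    """Split a name such as '10001_10.txt' into (review_id, rating)."""
    review_id, rating = Path(name).stem.split('_')
    return int(review_id), int(rating)

def polarity(rating):
    """Map a 1-10 rating to the dataset's labels: 0 = negative, 1 = positive."""
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None  # neutral ratings (5-6) are not part of the labeled data

print(parse_review_filename('10001_10.txt'))  # (10001, 10)
print(polarity(10), polarity(3))              # 1 0
```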


In [141]:
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)
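
`texts_labels_from_folders` comes from fastai 0.7. Conceptually, it reads every file under each class folder and labels it with that folder's position in `names`. The following is a simplified sketch of that idea, not fastai's actual implementation; `read_texts_and_labels` is a hypothetical name, and the temporary directory only stands in for the real dataset:

```python
import tempfile
from pathlib import Path
import numpy as np

def read_texts_and_labels(path, folders):
    """Read every *.txt file under path/<folder>; the label is the folder's index."""
    texts, labels = [], []
    for idx, folder in enumerate(folders):
        for fname in sorted(Path(path, folder).glob('*.txt')):
            texts.append(fname.read_text(encoding='utf-8'))
            labels.append(idx)
    return texts, np.array(labels)

# Tiny throwaway directory tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for folder, text in [('neg', 'awful'), ('pos', 'great')]:
        Path(tmp, folder).mkdir()
        Path(tmp, folder, '0_1.txt').write_text(text, encoding='utf-8')
    demo_texts, demo_labels = read_texts_and_labels(tmp, ['neg', 'pos'])

print(demo_texts, demo_labels)
```

With `names = ['neg','pos']`, label `0` therefore means negative and label `1` means positive.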

Here is the text of the first review:

In [142]:
trn[0]

"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form."

In [143]:
trn_y[0]

0

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`).

In [144]:
veczr = CountVectorizer(tokenizer=tokenize)

`fit_transform(trn)` finds the vocabulary in the training set and transforms the training set into a term-document matrix. Because the *same transformation* must be applied to the validation set, the second line calls only `transform(val)`, which reuses the vocabulary learned from the training set. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document `i` and holds, for each word in the vocabulary, the number of times that word occurs in the document.

In [145]:
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
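
To make the fit/transform split concrete, here is a toy example with three training documents and one validation document. It uses scikit-learn's default tokenizer rather than fastai's `tokenize`, and the documents and variable names are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['good movie', 'bad movie', 'good good film']
valid_docs = ['good unknown film']

toy_veczr = CountVectorizer()                    # default tokenizer, not fastai's `tokenize`
train_td = toy_veczr.fit_transform(train_docs)   # learns the vocabulary, then counts
valid_td = toy_veczr.transform(valid_docs)       # reuses the vocabulary learned above

print(sorted(toy_veczr.vocabulary_))  # ['bad', 'film', 'good', 'movie']
print(train_td.toarray())
print(valid_td.toarray())             # 'unknown' is ignored: it was never seen in training
```

Words that appear only in the validation set have no column, which is why both matrices in the notebook have the same number of columns.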

In [146]:
trn_term_doc

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

In [147]:
trn_term_doc[0]

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 43 stored elements in Compressed Sparse Row format>

In [148]:
len(trn[0].split(' '))

41

In [149]:
trn[0].split(' ')

['Working',
 'with',
 'one',
 'of',
 'the',
 'best',
 'Shakespeare',
 'sources,',
 'this',
 'film',
 'manages',
 'to',
 'be',
 'creditable',
 'to',
 "it's",
 'source,',
 'whilst',
 'still',
 'appealing',
 'to',
 'a',
 'wider',
 'audience.<br',
 '/><br',
 '/>Branagh',
 'steals',
 'the',
 'film',
 'from',
 'under',
 "Fishburne's",
 'nose,',
 'and',
 "there's",
 'a',
 'talented',
 'cast',
 'on',
 'good',
 'form.']

In [150]:
vocab = veczr.get_feature_names(); vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

In [151]:
len(vocab)

75132

In [152]:
veczr.vocabulary_['film']

24540

In [153]:
trn_term_doc[0,24540]

2

In [154]:
trn_term_doc[0,5000]

0

In [155]:
len(vocab), vocab[-1], vocab[0]

(75132, '\uf0b7', '\x08\x08\x08\x08a')

In [156]:
index = 0
for i in range(len(vocab)):
    f = trn_term_doc[0, i]
    if f > 0:
        index += 1
        print(index, vocab[i], f)

1 ' 3
2 , 3
3 . 2
4 / 2
5 < 2
6 > 2
7 a 2
8 and 1
9 appealing 1
10 audience 1
11 be 1
12 best 1
13 br 2
14 branagh 1
15 cast 1
16 creditable 1
17 film 2
18 fishburne 1
19 form 1
20 from 1
21 good 1
22 it 1
23 manages 1
24 nose 1
25 of 1
26 on 1
27 one 1
28 s 3
29 shakespeare 1
30 source 1
31 sources 1
32 steals 1
33 still 1
34 talented 1
35 the 2
36 there 1
37 this 1
38 to 3
39 under 1
40 whilst 1
41 wider 1
42 with 1
43 working 1
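
The listing above shows punctuation marks as separate tokens and `it's` split into `it`, `'` and `s`, which is why the tokens differ from a plain `split(' ')`. The lowercase forms come from `CountVectorizer`, which lowercases text by default. A minimal sketch of a tokenizer with this behavior follows; it is an illustration only, and `simple_tokenize` is a hypothetical stand-in, not fastai's `tokenize`:

```python
import re
import string

# Put spaces around every punctuation character, then split on whitespace.
_punct = re.compile(f'([{re.escape(string.punctuation)}])')

def simple_tokenize(text):
    return _punct.sub(r' \1 ', text).split()

print(simple_tokenize("it's source, whilst"))
# ['it', "'", 's', 'source', ',', 'whilst']
```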


In [157]:
trn_y==1

array([False, False, False, ...,  True,  True,  True])

In [158]:
len(trn_y), (trn_y==1).sum(), (trn_y==0).sum()

(25000, 12500, 12500)

In [159]:
trn_term_doc[trn_y==1]

<12500x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 1877598 stored elements in Compressed Sparse Row format>

In [160]:
trn_term_doc[trn_y==0]

<12500x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 1872147 stored elements in Compressed Sparse Row format>

## Naive Bayes

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

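where the ratio of feature $f$ in positive documents is the number of times the feature occurs in positive documents divided by the total number of feature occurrences in positive documents. The code below adds 1 to each count first, so that a feature that never occurs in one class does not produce a zero or infinite ratio.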

In [161]:
def pr(y_i):
    p = x[y==y_i].sum(0)
    return p+1

In [162]:
x=trn_term_doc
y=trn_y

p = pr(1)/pr(1).sum()
q = pr(0)/pr(0).sum()
r = np.log(p/q)
b = np.log((y==1).mean() / (y==0).mean())
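
The same computation can be followed by hand on a toy dense matrix. This is a sketch with made-up counts; `smoothed_counts` mirrors `pr` above but takes its inputs as arguments instead of reading the globals `x` and `y`:

```python
import numpy as np

# Toy term-document matrix: 4 documents x 3 vocabulary words.
toy_x = np.array([[1, 0, 2],
                  [0, 1, 0],
                  [3, 0, 0],
                  [0, 2, 1]])
toy_y = np.array([1, 1, 0, 0])   # the first two documents are positive

def smoothed_counts(x, y, label):
    """Per-word counts over the documents of one class, plus 1 for smoothing."""
    return x[y == label].sum(axis=0) + 1

pos = smoothed_counts(toy_x, toy_y, 1)   # [2, 2, 3]
neg = smoothed_counts(toy_x, toy_y, 0)   # [4, 3, 2]
toy_p = pos / pos.sum()
toy_q = neg / neg.sum()
toy_r = np.log(toy_p / toy_q)
toy_b = np.log((toy_y == 1).mean() / (toy_y == 0).mean())

print(toy_r)  # negative for words 0 and 1, positive for word 2
print(toy_b)  # 0.0, because the classes are balanced
```

A positive entry in `toy_r` marks a word that is relatively more frequent in positive documents, and a negative entry marks one that is relatively more frequent in negative documents.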

In [179]:
x.shape

(25000, 75132)

In [183]:
freq = x.sum(0)
freq

matrix([[    1,     1, 24560, ...,     1,     2,     7]], dtype=int64)

In [182]:
freq.shape

(1, 75132)

In [163]:
pr(1)

matrix([[    2,     1, 11820, ...,     2,     2,     1]], dtype=int64)

In [178]:
pr(1).shape

(1, 75132)

In [164]:
pr(1).sum()

3789022

In [184]:
pr(0)

matrix([[    1,     2, 12742, ...,     1,     2,     8]], dtype=int64)

In [185]:
pr(0).shape

(1, 75132)

In [186]:
pr(0).sum()

3747495

In [192]:
pr(0).sum() + pr(1).sum()

7536517

In [188]:
(pr(0).sum() + pr(1).sum() - freq.sum()) / 2

75132.0

In [189]:
freq = pr(0) + pr(1)

In [191]:
freq.sum()

7536517

In [165]:
p.sum()

1.0000000000000002

In [166]:
q.sum()

0.9999999999999999

In [167]:
b

0.0

In [168]:
x.shape, type(vocab)

((25000, 75132), list)

In [169]:
type(r), r.shape,

(numpy.matrix, (1, 75132))

In [170]:
r[:,:5]

matrix([[ 0.68213, -0.70417, -0.08613, -0.09142, -0.19624]])

In [171]:
x

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

Here is the formula for Naive Bayes.
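
Using the $p$, $q$, $r$ and $b$ defined above, and writing $x_f$ for the count of feature $f$ in a document $d$, the rule that the next cells compute can be written as the standard Naive Bayes log-odds (stated here as a reading aid, under the usual assumption that features are conditionally independent given the class):

$\log \frac{P(y=1 \mid d)}{P(y=0 \mid d)} = \sum_f x_f \log \frac{p_f}{q_f} + \log \frac{P(y=1)}{P(y=0)} = x \cdot r + b$

The document is predicted positive when this quantity is greater than 0.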

In [172]:
val_term_doc.shape

(25000, 75132)

In [173]:
r.shape

(1, 75132)

In [174]:
(val_term_doc @ r.T).shape

(25000, 1)

In [175]:
val_term_doc @ r.T

matrix([[ -8.61109],
        [ -4.82614],
        [  2.91111],
        ...,
        [117.64775],
        [  2.4274 ],
        [  5.32726]])

In [176]:
val_y

array([0, 0, 0, ..., 1, 1, 1])

In [177]:
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.8074

In [193]:
r.shape

(1, 75132)

In [194]:
p.shape

(1, 75132)

In [195]:
q.shape

(1, 75132)

In [197]:
np.stack([p, q]).shape

(2, 75132)

In [206]:
pre_preds = val_term_doc @ np.stack([np.log(p), np.log(q)]).T + b

In [207]:
pre_preds

matrix([[ -841.49517,  -832.88408],
        [ -893.40053,  -888.57439],
        [-1479.0384 , -1481.9495 ],
        ...,
        [-7603.59445, -7721.2422 ],
        [-1036.11346, -1038.54086],
        [ -725.37551,  -730.70278]])

In [208]:
preds = pre_preds.T[0] > pre_preds.T[1]
(preds==val_y).mean()

0.8074
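
Both cells give the same accuracy because comparing the two per-class log-likelihoods is the same as checking the sign of their difference, which is the log-count ratio score. A small sketch with made-up probabilities and counts (all names here are illustrative):

```python
import numpy as np

# Toy validation counts: 6 documents x 4 words.
counts = np.array([[2, 0, 0, 0],
                   [0, 0, 0, 2],
                   [1, 1, 1, 1],
                   [0, 0, 2, 1],
                   [1, 0, 2, 0],
                   [0, 0, 0, 0]])
log_p = np.log(np.array([0.1, 0.2, 0.3, 0.4]))      # made-up class-1 word probabilities
log_q = np.log(np.array([0.25, 0.25, 0.25, 0.25]))  # made-up class-0 word probabilities

# Form 1: a single score per document from the log-count ratio.
ratio_preds = counts @ (log_p - log_q) > 0

# Form 2: one log-likelihood per class, then pick the larger one.
scores = counts @ np.stack([log_p, log_q]).T        # shape (6, 2)
two_col_preds = scores[:, 0] > scores[:, 1]

print(np.array_equal(ratio_preds, two_col_preds))   # True
```

The bias `b` is left out of the sketch because it is 0 when the classes are balanced, as they are in this dataset.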

## Video 10, 1:30:20 IMDB

### Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

In [210]:
LogisticRegression

sklearn.linear_model._logistic.LogisticRegression

In [212]:
m = LogisticRegression(C=1e8, dual=False, max_iter=1000)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.85184

...and the regularized version

In [209]:
m = LogisticRegression(C=0.01, dual=False, max_iter=1000)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.8826
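
In scikit-learn, `C` is the inverse of the regularization strength, so `C=1e8` leaves the model almost unregularized while `C=0.01` penalizes large weights heavily. A small sketch on synthetic data (the data and variable names are made up for illustration) showing the effect on the size of the learned coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Labels depend on the first feature plus noise, so the classes overlap.
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

weak_reg = LogisticRegression(C=1e8, max_iter=1000).fit(X, y)
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

# Stronger regularization (smaller C) shrinks the coefficient vector.
print(np.linalg.norm(weak_reg.coef_), np.linalg.norm(strong_reg.coef_))
```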

## My Version

## Video 11, 0:0:0 - 0:21:00

In [133]:
# Review SGD

## Video 11, 0:21:00 IMDB

### Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.
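
With `ngram_range=(1,3)`, each document contributes its single tokens as well as every run of two and three consecutive tokens. A minimal pure-Python sketch of that expansion; `ngrams` is a hypothetical helper for illustration, not the code `CountVectorizer` runs internally:

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

print(ngrams(['by', 'vera', 'miles']))
# ['by', 'vera', 'miles', 'by vera', 'vera miles', 'by vera miles']
```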

In [88]:
veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [89]:
trn_term_doc.shape

(25000, 800000)

In [90]:
vocab = veczr.get_feature_names()

In [91]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']

In [92]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()
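
Calling `.sign()` on a matrix of counts binarizes it: every positive count becomes 1, so a feature records only whether an n-gram is present. A toy illustration with made-up counts:

```python
import numpy as np
from scipy.sparse import csr_matrix

counts = csr_matrix(np.array([[3, 0, 1],
                              [0, 2, 0]]))
binary = counts.sign()   # every stored positive count becomes 1
print(binary.toarray())
# [[1 0 1]
#  [0 1 0]]
```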

In [93]:
p = pr(1)/pr(1).sum()
q = pr(0)/pr(0).sum()
r = np.log(p/q)
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the binarized unigrams, bigrams and trigrams.

In [95]:
m = LogisticRegression(C=0.1, dual=False, max_iter=1000)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.9048

Here is the $\text{log-count ratio}$ `r`.  

In [96]:
r.shape, r

((1, 800000),
 matrix([[-0.04911, -0.15543, -0.24226, ...,  1.10419, -0.68757, -0.68757]]))

In [97]:
np.exp(r)

matrix([[0.95208, 0.85605, 0.78485, ..., 3.01678, 0.5028 , 0.5028 ]])

Here we fit regularized logistic regression where each binarized n-gram feature is scaled by its log-count ratio.

## Video 11, 0:25:13 IMDB

## Let's help the LR to learn

In [104]:
x_nb = x.multiply(r)
m = LogisticRegression(dual=False, C=0.1, max_iter=1000)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91756
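
`x.multiply(r)` is an element-wise product that broadcasts the row vector `r` over every document, so each present n-gram is replaced by its log-count ratio and absent n-grams stay at 0. A toy sketch with made-up ratios; the conversion line is only there to cope with SciPy versions that return either a sparse or a dense result:

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

presence = csr_matrix(np.array([[1, 0, 1],
                                [0, 1, 0]]))
ratios = np.array([[0.5, -1.0, 2.0]])    # made-up log-count ratios, shape (1, 3)

weighted = presence.multiply(ratios)     # broadcasts `ratios` over every row
dense = weighted.toarray() if issparse(weighted) else np.asarray(weighted)
print(dense)
```

The logistic regression then starts from features that already carry the Naive Bayes evidence for each n-gram, rather than from plain 0/1 indicators.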

In [107]:
# The NB-scaled features do not depend on C, so compute them once.
x_nb = x.multiply(r)
val_x_nb = val_x.multiply(r)

for c in [-5, -4, -3, -2, -1, 0, 1, 2, 3]:
    C = 10**c  # inverse regularization strength
    m = LogisticRegression(dual=False, C=C, max_iter=1000)
    m.fit(x_nb, y)
    preds = m.predict(val_x_nb)
    print(c, (preds.T==val_y).mean())

-5 0.813
-4 0.83448
-3 0.8838
-2 0.9122
-1 0.91756
0 0.91748
1 0.91724
2 0.91648
3 0.91624


## Video 11, 0:44:00 - 1:15:30 IMDB

## fastai NBSVM++

In [99]:
sl=2000

In [100]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(
    trn_term_doc, trn_y, val_term_doc, val_y, sl
)

In [101]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                     
    0      0.025529   0.118863   0.91684   



[0.11886349875450135, 0.9168399999809265]

In [102]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                     
    0      0.021495   0.113222   0.92112   
    1      0.012013   0.111317   0.92244                      



[0.11131713506221771, 0.922440000038147]

In [103]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                     
    0      0.017727   0.110428   0.92256   
    1      0.009777   0.109769   0.92228                       



[0.10976871147632598, 0.9222799999809265]

## Video 11, 1:15:30 - 1:36:06 (end)

## Video 12, 0:0:0 - 0:44:20

In [127]:
# Rossmann store

## Video 12, 0:44:20 - 1:03:14

In [131]:
# review the whole course

## Video 12, 1:03:14 - 1:41:41 (end)

In [130]:
# Ethical issues

## My Version

In [110]:
sl=2000

In [116]:
# Here is how we get a model from a bag of words
md2 = TextClassifierData.from_bow(
    trn_term_doc, trn_y, val_term_doc, val_y, sl
)

In [117]:
learner2 = md2.shaojun_learner()
learner2.fit(0.02, 1, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                     
    0      0.025987   0.119626   0.91564   



[0.11962647980690003, 0.9156399999809265]

In [118]:
learner2.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                     
    0      0.020512   0.113522   0.92084   
    1      0.011565   0.111534   0.92204                      



[0.11153364145755768, 0.9220399999809266]

In [119]:
learner2.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>                     
    0      0.018348   0.110497   0.92112   
    1      0.009587   0.10919    0.92136                       



[0.10919006671905518, 0.9213599999809265]

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)