In [107]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression

In [101]:
from fastai.nlp import texts_from_folders

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB. The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets. The training set is the same 25,000 labeled reviews.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [41]:
PATH='data/aclImdb/'
names = ['neg','pos']

In [42]:
%ls {PATH}

README      imdb.vocab  imdbEr.txt  [34mtest[m[m/       [34mtrain[m[m/


* pos: positive reviews
* neg: negative reviews

In [102]:
%ls {PATH}train

labeledBow.feat  [34mpos[m[m/             unsupBow.feat    urls_pos.txt
[34mneg[m[m/             [34munsup[m[m/           urls_neg.txt     urls_unsup.txt


In [44]:
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt


In [46]:
def texts_from_folders(src, names):
    texts,labels = [],[]
    for idx,name in enumerate(names):
        path = os.path.join(src, name)
        for fname in sorted(os.listdir(path)):
            fpath = os.path.join(path, fname)
            texts.append(open(fpath).read())
            labels.append(idx)
    return texts,np.array(labels)

**Loading data from train and test data**

In [47]:
trn,trn_y = texts_from_folders(f'{PATH}train',names)
val,val_y = texts_from_folders(f'{PATH}test',names)

Here is the text of the first review

In [48]:
trn[0]

"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [50]:
trn_y[0] #negative reviews

0

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`).

In [52]:
val[0]

"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in."

In [53]:
veczr = CountVectorizer(tokenizer=tokenize) #this is used to tokenize any text into bag of words

`fit_transform(trn)` finds the vocabulary in the training set. It also transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to your validation set, the second line uses just the method `transform(val)`. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document i and it contains a count of words for each document for each word in the vocabulary.

In [54]:
trn_term_doc = veczr.fit_transform(trn) #transform into term document matrix
val_term_doc = veczr.transform(val) #applying the same transformation to validation

<img src="data/term_doc_matrix.png">

Above you can see a sample term-document matrix

In [60]:
trn_term_doc #25000 documents and 75312 unique terms and 3749745 non-empty elements in the matrix

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

In [62]:
trn_term_doc[0] #first document

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 93 stored elements in Compressed Sparse Row format>

In [63]:
vocab = veczr.get_feature_names(); vocab[5000:5005] #to get the list of all unique words

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

In [64]:
len(vocab)

75132

In [65]:
w0 = set([o.lower() for o in trn[0].split(' ')]); w0 #tokenize without use of tokenizer....not as good as tokenizer

{'a',
 'absurd',
 'an',
 'and',
 'audience',
 'be',
 'better',
 'briefly.',
 'by',
 'can',
 'chantings',
 'cinematography',
 'comedy.',
 'crazy',
 'cryptic',
 'dialogue',
 'easy',
 'era',
 'even',
 'eventually',
 'example',
 'feelings',
 'for',
 'formal',
 'forrest',
 'frederic',
 'from',
 'future',
 'general',
 'good',
 'grader.',
 'great',
 'has',
 'insane,',
 'into',
 'is',
 'it',
 "it's",
 'just',
 'kirkland',
 'level',
 'make',
 'making',
 'man',
 'might',
 'mob',
 'narrative',
 'no',
 'of',
 'off',
 'off.',
 'on',
 'opening',
 'orchestra',
 'out',
 'pig.',
 'putting.',
 'sally',
 'scene',
 'seem',
 'seen',
 'shakespeare',
 'should',
 'singers.',
 'some',
 'stars',
 'starts',
 'stays',
 'story',
 'technical',
 'terrific',
 'than',
 'that',
 'the',
 'think',
 'third',
 'those',
 'time',
 'to',
 'too',
 'turned',
 'unfortunately',
 'unnatural',
 'vilmos',
 'violent',
 'who',
 'whole',
 'with',
 'would',
 'you',
 'zsigmond.'}

In [66]:
len(w0)

91

In [24]:
veczr.vocabulary_['lacked'] #id of a particular term in the vocabulary

37467

In [26]:
trn_term_doc[2,37467] #count of the above term in the second document

1

## Naive Bayes

P(class = positive|given a particular doc) = P(given doc| class =positive)* P(class=positive)/P(given doc)

P(class = negative|given a particular doc) = P(given doc| class =negative)* P(class=negative)/P(given doc)

##### We are interested in the ratio of the above 2 equation. If the LHS ratio is greater than 1 we'll assign it positive class and vice-versa

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where ratio of feature $f$ in positive documents is the number of times a positive document has a feature divided by the number of positive documents.

In [70]:
x=trn_term_doc
y=trn_y

In [114]:
p = x[y==1].sum(0)+1/(sum(y == 1) + 1) #for each feature it'll tell probability it has come in a positive review
q = x[y==0].sum(0)+1/(sum(y == 0) + 1) #for each feature it'll tell probability it has come in a negative review

r = np.log(p/q) #calculating r
b = np.log((y==1).sum()/(y==0).sum()) # calculating the ratio of positive reviews/negative reviews

Here is the formula for Naive Bayes.

In [80]:
pre_preds = val_term_doc @ r.T + b #we are now calculating the overall score for each doc in the validation matrix by adding the log ratio for the words present in each doc
preds = pre_preds.T>0 # as we are in log space if the ratio is greater than 1 it is positive and vice-versa
(preds==val_y).mean() #calculating the accuracy

0.80740000000000001

...and binarized Naive Bayes.

In [81]:
pre_preds = val_term_doc.sign() @ r.T + b #here we have replaced the count of word with 1 if present and 0 if not (word presence is important and not the number of occurences)
preds = pre_preds.T>0
(preds==val_y).mean()

0.82623999999999997

##### Naive Bayes assumes that the probability of each word is independent and hence it is Naive

### Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

Unigram mean tokens are single word

##### So in Naive Bayes we are applying a theoretical model and predicting. But we can actually learn the ratio by applying logistic regression

In [83]:
m = LogisticRegression(C=1e8, dual=True)  #when input is wider than it is tall use dual = True for faster computation 
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.85224

In [89]:
(m.coef_)

array([[ 0.     , -0.     ,  0.43774, ...,  0.01346, -0.     , -0.00623]])

#### Binarize version

In [90]:
m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.85511999999999999

...and the regularized version

##### Higher value of C implies higher regularization

In [91]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.88307999999999998

In [92]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.88404000000000005

### Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

In [93]:
veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [94]:
trn_term_doc.shape

(25000, 800000)

In [95]:
vocab = veczr.get_feature_names()

In [96]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']

In [97]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()
p = x[y==1].sum(0)+1
q = x[y==0].sum(0)+1
r = np.log((p/p.sum())/(q/q.sum()))
b = np.log((y==1).sum()/(y==0).sum())

Here we fit regularized logistic regression where the features are the trigrams.

In [98]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.90500000000000003

In [99]:
m = LogisticRegression(C=1e-5, dual=True) #with too much regularization
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.76160000000000005

Here is the $\text{log-count ratio}$ `r`.  

In [138]:
r.shape, r

((1, 800000),
 matrix([[-0.04911, -0.15543, -0.24226, ...,  1.10419, -0.68757, -0.68757]]))

In [165]:
np.exp(r)

matrix([[ 0.95208,  0.85605,  0.78485, ...,  3.01678,  0.5028 ,  0.5028 ]])

Here we fit regularized logistic regression where the features are the trigrams' log-count ratios.

##### So we multiply term doc with the Naive bayes ratio as prior input and then we develop logistic regression. This reduces regularization and improves fit.

In [103]:
x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1) #Now regularization is giving penalty being different from Naive Bayes prior assumption
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91768000000000005

## fastai NBSVM++

In [104]:
sl=2000

In [108]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [112]:
# class DotProdNB(nn.Module):
#     def __init__(self, nf, ny, w_adj=0.4, r_adj=10):
#         super().__init__()
#         self.w_adj,self.r_adj = w_adj,r_adj
#         self.w = nn.Embedding(nf+1, 1, padding_idx=0) #for every feature one weight, consider this as linear layer
#         self.w.weight.data.uniform_(-0.1,0.1)
#         self.r = nn.Embedding(nf+1, ny) #one value per class

#     def forward(self, feat_idx, feat_cnt, sz):
#         w = self.w(feat_idx) #feature index of words in a doc and then it is looked up in the embedding matrix
#         r = self.r(feat_idx) #w is basically x*w
#         x = ((w+self.w_adj)*r/self.r_adj).sum(1) #w_adj is changing the prior, so regularization is pushing it to .4
#         return F.softmax(x)

##### Idea is to penalize thing which are deviating from the prior

In [109]:
learner = md.dotprod_nb_learner() #generalization of naive bayes
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)


  0%|          | 0/391 [00:00<?, ?it/s][A


AttributeError: 'BOW_Dataset' object has no attribute 'X'

In [113]:
??md.dotprod_nb_learner

In [None]:
??

In [110]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)



  0%|          | 0/391 [00:00<?, ?it/s][A[A


AttributeError: 'BOW_Dataset' object has no attribute 'X'

In [None]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)