In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from torchtext import vocab, data, datasets

## IMDB dataset and the sentiment classification task

The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 reviews from IMDB, split evenly between positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets of 25,000 labeled reviews each.
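
The labeling rule can be sketched in a few lines. The function name `polarity` is hypothetical and only illustrates the thresholds quoted above; it is not part of the dataset's tooling.

```python
from typing import Optional

def polarity(score: int) -> Optional[int]:
    """Map a 1-10 rating to a label: 0 = negative, 1 = positive, None = excluded."""
    if score <= 4:
        return 0
    if score >= 7:
        return 1
    return None  # ratings 5 and 6 count as neutral and are left out

labels = [polarity(s) for s in (1, 4, 5, 6, 7, 10)]
print(labels)
```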

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing

In [2]:
sl=1000
vocab_size=200000

In [3]:
PATH='data/aclImdb/'

names = ['neg','pos']
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)

Here is the text of the first review:

In [4]:
trn[0]

'This is the first of these "8 Films To Die For" collection that I\'ve seen and it\'s certainly not made me want to see any of the rest...although I\'ve heard at least a couple of them are decent. I don\'t know, this wasn\'t terrible but it didn\'t really do much for me. Your basic dysfunctional cannibal family in suburbia kind of thing, mom & dad died, the family sold the farm & moved to San Francisco (?) where they continued to bring home stray food sources whenever possible. The best part of this was the creepy Goth sister, who of course invites a friend over from school that never leaves. Anyway, of course we have a butcher shop in the basement and so on and so on. This family is sort of like the white-bread version of the Sawyer Clan, they\'re nasty & they do bad things but they ain\'t go no soul. I see a lot of reviews from people that liked this, and I guess I don\'t know what I missed, but I found it to be very mediocre & I wouldn\'t recommend it to anyone, really. 4 out of 10.

In [5]:
trn_y[0]

0
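
`texts_labels_from_folders` comes from the fastai library and its source is not shown here. The following is only a sketch of what such a loader might do, assuming one `.txt` file per review inside one sub-folder per class. The name `load_texts_labels` and the demo files are hypothetical.

```python
import os
import tempfile
import numpy as np

def load_texts_labels(path, folders):
    """Read every .txt file under path/<folder>; the label is the folder's index."""
    texts, labels = [], []
    for idx, folder in enumerate(folders):
        folder_path = os.path.join(path, folder)
        for fname in sorted(os.listdir(folder_path)):
            if not fname.endswith('.txt'):
                continue
            with open(os.path.join(folder_path, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(idx)
    return texts, np.array(labels)

# Build a tiny throwaway directory tree to exercise the function
with tempfile.TemporaryDirectory() as root:
    for folder, text in [('neg', 'dull plot'), ('pos', 'great cast')]:
        os.makedirs(os.path.join(root, folder))
        with open(os.path.join(root, folder, 'review.txt'), 'w', encoding='utf-8') as f:
            f.write(text)
    demo_texts, demo_labels = load_texts_labels(root, ['neg', 'pos'])

print(demo_texts, demo_labels)
```

With `names = ['neg','pos']`, this convention gives label 0 to negative reviews and 1 to positive ones, which matches the `0` shown for the negative review above.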

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), part of `sklearn.feature_extraction.text`, converts a collection of text documents to a matrix of token counts. The next cell shows how to specify its parameters. We will be working with the 200,000 most frequent unigrams, bigrams and trigrams.

In [6]:
veczr = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=vocab_size)

In the next cell, `fit_transform(trn)` learns the vocabulary from the training set and transforms the training set in one step. Since the *same transformation* has to be applied to the validation set, the second line calls only `transform(val)`. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document $i$: it holds the count of each vocabulary n-gram in document $i$ (and $0$ for n-grams that do not occur). The matrix is not binary here because `binary` is left at its default `False`; the binarized version is obtained later with `.sign()`.

In [7]:
trn_term_doc = veczr.fit_transform(trn) # scipy.sparse.csr.csr_matrix
val_term_doc = veczr.transform(val)

In [8]:
trn_term_doc.shape # (dataset size, vocabulary size)

(25000, 200000)
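
A toy corpus makes the difference between `fit_transform` and `transform` concrete. The documents below are made up for illustration, and the default scikit-learn tokenizer is used instead of fastai's `tokenize`, so this is a sketch rather than a reproduction of the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie", "good good film"]
val_docs = ["good unknown film"]

vec = CountVectorizer()                 # defaults: unigrams, raw counts
X_trn = vec.fit_transform(train_docs)   # learns the vocabulary, then transforms
X_val = vec.transform(val_docs)         # reuses the learned vocabulary

# Vocabulary terms ordered by their column index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(terms)
print(X_trn.toarray())
print(X_val.toarray())
```

The word `unknown` never appeared in the training documents, so it has no column and is silently dropped from the validation matrix. The third training document contains `good` twice, and its row holds the count 2 rather than 1.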

In [9]:
veczr.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': 200000,
 'min_df': 1,
 'ngram_range': (1, 3),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function fastai.text.tokenize>,
 'vocabulary': None}

In [10]:
# here is the vocabulary
vocab = veczr.get_feature_names()

In [11]:
vocab[50:55]

['! " and', '! " as', '! " at', '! " but', '! " for']
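
Vocabulary entries such as `'! " and'` are trigrams: three consecutive tokens joined by spaces. A minimal sketch of how n-grams for `ngram_range=(1,3)` can be enumerated from a token list (the helper `word_ngrams` is hypothetical and not the scikit-learn implementation):

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all n-grams with min_n <= n <= max_n, each joined by a single space."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

grams = word_ngrams(['!', '"', 'and'])
print(grams)
```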

## Weighted Naive Bayes

Our first model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features (1 if an n-gram is present in the document, 0 otherwise). Each feature is multiplied by a log-count ratio (see below for an explanation). A logistic regression model is then trained to predict sentiment.

Here is how to define **log-count ratio** for a feature $f$:

$\text{log-count ratio} = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where the ratio of feature $f$ in positive documents is the number of positive documents that contain the feature divided by the number of positive documents (and likewise for negative documents).
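
In symbols, following the smoothed form that the code later in this notebook computes. The notation $x^{(i)}$, $y^{(i)}$, $p$, $q$ is an assumption introduced here for illustration, with $x^{(i)}$ the feature vector of document $i$ and $y^{(i)}$ its label:

$$
p = 1 + \sum_{i:\,y^{(i)}=1} x^{(i)}, \qquad
q = 1 + \sum_{i:\,y^{(i)}=0} x^{(i)}, \qquad
r = \log\!\left(\frac{p / \lVert p \rVert_1}{q / \lVert q \rVert_1}\right)
$$

Note that the code normalizes by the total smoothed count $\lVert p \rVert_1$ (that is, `p.sum()`) rather than by the number of documents.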

In [12]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [15]:
# http://forums.fast.ai/t/howto-installation-on-windows/10439/87

# You have to change code in fastai/metrics.py

# Before
#def accuracy_multi(preds, targs, thresh):
#    return ((preds>thresh)==targs).float().mean()

# After
#def accuracy_multi(preds, targs, thresh):
#    return ((preds>thresh).float()==targs).float().mean()

In [13]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-5, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                     
    0      0.068083   0.122717   0.916408  



[0.12271736, 0.9164082481123298]

In [14]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6)

epoch      trn_loss   val_loss   <lambda>                     
    0      0.059264   0.10752    0.923489  



[0.10751964, 0.9234894501888539]

### unigram

Here we use `CountVectorizer` with a different set of parameters. In particular, `ngram_range` defaults to (1, 1), so we get only unigram features. Note that we are specifying our own `tokenize` function.

In [15]:
veczr =  CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

Next we check the vocabulary size and then compute the $\text{log-count ratio}$ `r`.

In [16]:
# Check vocabulary size
len(veczr.get_feature_names())

75132

In [17]:
x=trn_term_doc # scipy.sparse.csr.csr_matrix (25000, 75132)
y=trn_y # (25000,)

p = x[y==1].sum(0)+1 # numpy.matrixlib.defmatrix.matrix (1, 75132), add 1 to avoid division by zero
q = x[y==0].sum(0)+1 # numpy.matrixlib.defmatrix.matrix (1, 75132), add 1 to avoid division by zero
r = np.log((p/p.sum())/(q/q.sum())) # (1, 75132)
b = np.log((y==1).mean()/(y==0).mean()) # log prior ratio; len(p)/len(q) would always be 1/1 since p and q have shape (1, 75132)

#### minimum example

```
x[y == 1]: [[3, 0, 1]]
x[y == 0]: [[1, 1, 0]]

p = [4, 1, 2]
q = [2, 2, 1]

p / p.sum() = [ 0.57142857,  0.14285714,  0.28571429]
q / q.sum() = [ 0.4,  0.4,  0.2]

r = np.log((p/p.sum()) / (q/q.sum())) = [ 0.35667494, -1.02961942,  0.35667494]
b = np.log(1 / 1) = 0.0 # one positive and one negative document

# prediction
val = [[1, 1, 0]]
val @ r.T + b = [[-0.67294448]] # => 0
```    
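
The minimal example above can be reproduced in NumPy. This is a sketch using dense arrays instead of the sparse matrices in the notebook, and it computes the bias `b` from the class frequencies in `y`, which is an assumption about the intended prior term (it is 0 here because the classes are balanced).

```python
import numpy as np

x = np.array([[3, 0, 1],    # one positive document
              [1, 1, 0]])   # one negative document
y = np.array([1, 0])

p = x[y == 1].sum(0) + 1    # smoothed feature counts in positive documents
q = x[y == 0].sum(0) + 1    # smoothed feature counts in negative documents
r = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y == 1).mean() / (y == 0).mean())   # log ratio of class frequencies

val = np.array([[1, 1, 0]])
score = val @ r + b                  # Naive Bayes decision value
pred = (score > 0).astype(int)       # 1 = positive, 0 = negative
print(r, b, score, pred)
```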

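Here is the Naive Bayes prediction: a document is classified as positive when `val_term_doc @ r.T + b` is greater than 0. The first cell below uses raw counts and the second uses binarized features.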

In [20]:
pre_preds = val_term_doc @ r.T + b # (25000, 75132) * (75132, 1) + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.8074

In [21]:
pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.82624

In [22]:
print(val_term_doc[0])

  (0, 3)	8
  (0, 8)	1
  (0, 13)	2
  (0, 15)	8
  (0, 16)	6
  (0, 59)	1
  (0, 1039)	6
  (0, 1041)	6
  (0, 1050)	2
  (0, 1362)	1
  (0, 2619)	1
  (0, 2726)	2
  (0, 3219)	1
  (0, 3817)	2
  (0, 3862)	1
  (0, 5221)	1
  (0, 5234)	1
  (0, 5254)	1
  (0, 5472)	2
  (0, 6304)	1
  (0, 7063)	1
  (0, 7229)	1
  (0, 8696)	6
  (0, 9854)	1
  (0, 9936)	1
  :	:
  (0, 65309)	1
  (0, 66400)	1
  (0, 66441)	1
  (0, 66458)	6
  (0, 66554)	1
  (0, 66580)	1
  (0, 66596)	1
  (0, 66743)	1
  (0, 67182)	1
  (0, 67252)	5
  (0, 67451)	1
  (0, 68935)	1
  (0, 72232)	1
  (0, 72337)	5
  (0, 72400)	1
  (0, 72506)	1
  (0, 72745)	2
  (0, 72747)	1
  (0, 72896)	1
  (0, 73451)	1
  (0, 73453)	1
  (0, 73488)	1
  (0, 74251)	1
  (0, 74478)	2
  (0, 74503)	1


In [23]:
# The sign function returns -1 if x < 0, 0 if x==0, 1 if x > 0.
print(val_term_doc.sign()[0])

  (0, 3)	1
  (0, 8)	1
  (0, 13)	1
  (0, 15)	1
  (0, 16)	1
  (0, 59)	1
  (0, 1039)	1
  (0, 1041)	1
  (0, 1050)	1
  (0, 1362)	1
  (0, 2619)	1
  (0, 2726)	1
  (0, 3219)	1
  (0, 3817)	1
  (0, 3862)	1
  (0, 5221)	1
  (0, 5234)	1
  (0, 5254)	1
  (0, 5472)	1
  (0, 6304)	1
  (0, 7063)	1
  (0, 7229)	1
  (0, 8696)	1
  (0, 9854)	1
  (0, 9936)	1
  :	:
  (0, 65309)	1
  (0, 66400)	1
  (0, 66441)	1
  (0, 66458)	1
  (0, 66554)	1
  (0, 66580)	1
  (0, 66596)	1
  (0, 66743)	1
  (0, 67182)	1
  (0, 67252)	1
  (0, 67451)	1
  (0, 68935)	1
  (0, 72232)	1
  (0, 72337)	1
  (0, 72400)	1
  (0, 72506)	1
  (0, 72745)	1
  (0, 72747)	1
  (0, 72896)	1
  (0, 73451)	1
  (0, 73453)	1
  (0, 73488)	1
  (0, 74251)	1
  (0, 74478)	1
  (0, 74503)	1
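
The effect of `.sign()` is easier to see on a small matrix. A sketch with made-up counts:

```python
import numpy as np
from scipy.sparse import csr_matrix

counts = csr_matrix(np.array([[8, 0, 2],
                              [0, 1, 0]]))
binary = counts.sign()      # 1 wherever the count is positive; zeros stay implicit

print(binary.toarray())
```

Since counts are never negative, `sign` simply maps every nonzero entry to 1, turning term counts into presence indicators.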


Here is how we can fit regularized logistic regression where the features are the unigrams.

In [24]:
m = LogisticRegression(C=0.1, fit_intercept=False, dual=True)
m.fit(x, y) # x: (25000, 75132), y: (25000,)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.88284

### bigram with NB features

Similar to the previous model, but with unigram and bigram features (`ngram_range=(1,2)`), binarized with `.sign()`.

In [25]:
veczr =  CountVectorizer(ngram_range=(1,2), tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [26]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()
p = x[y==1].sum(0)+1
q = x[y==0].sum(0)+1
r = np.log((p/p.sum())/(q/q.sum()))
b = np.log((y==1).mean()/(y==0).mean()) # log prior ratio of the classes (0 for balanced data)

Here we fit regularized logistic regression on the binarized unigram and bigram features. Adding bigrams raises accuracy by about 2 percentage points (0.88284 to 0.90272).

In [27]:
m = LogisticRegression(C=0.1, fit_intercept=False)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.90272

Here is the $\text{log-count ratio}$ `r`.  

In [28]:
r

matrix([[ 0.68627,  0.68627, -0.70002, ...,  0.68627, -0.70002, -0.70002]])

Here we fit regularized logistic regression where the features are the binarized unigrams and bigrams multiplied by the $\text{log-count ratio}$. Scaling the features by `r` gives a further boost (0.90272 to 0.9148).

In [29]:
x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=1, fit_intercept=False)
m.fit(x_nb, y);

In [30]:
# val_x_nb must be defined
val_x_nb = val_x.multiply(r)

In [31]:
w = m.coef_.T
preds = (val_x_nb @ w + m.intercept_)>0 # The @ (at) operator is intended to be used for matrix multiplication. See https://stackoverflow.com/questions/27385633/what-is-the-symbol-for-in-python
(preds.T==val_y).mean()

0.9148

This is an interpolation between Naive Bayes and the regularized logistic regression approach.

In [32]:
beta=0.25

val_x_nb = val_x.multiply(r)
w = (1-beta)*m.coef_.mean() + beta*m.coef_.T
preds = (val_x_nb @ w + m.intercept_)>0
(preds.T==val_y).mean()

0.9164
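
The interpolation line can be read as shrinking every coefficient toward the mean coefficient. A sketch on made-up weights (the array `coef` and the helper `interpolate` are illustrative and not taken from the fitted model):

```python
import numpy as np

coef = np.array([[0.8, -0.4, 0.2]])   # stands in for m.coef_, shape (1, n_features)

def interpolate(coef, beta):
    """Blend the fitted weights with their mean: beta=1 keeps them, beta=0 flattens them."""
    return (1 - beta) * coef.mean() + beta * coef.T

w_full = interpolate(coef, 1.0)       # identical to coef.T
w_flat = interpolate(coef, 0.0)       # every entry equals the mean weight
w_quarter = interpolate(coef, 0.25)   # the setting used in the cell above
print(w_quarter.ravel())
```

With `beta=0` every feature gets the same weight, so the score on the `r`-scaled features is just a multiple of the Naive Bayes score `val_x @ r.T`. With `beta=1` the plain logistic regression weights are used.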

In [33]:
w2 = w.T[0]*r.A1

In [34]:
preds = (val_x @ w2 + m.intercept_)>0
(preds.T==val_y).mean()

0.9164
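
The last two cells give the same accuracy because scaling the features by `r` and then applying `w` is the same as applying the elementwise product `w * r` to the unscaled features. A small numerical check on random data (all arrays below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 4)).astype(float)   # binarized features
r = rng.normal(size=4)                               # stands in for log-count ratios
w = rng.normal(size=4)                               # stands in for classifier weights

scores_scaled = (X * r) @ w      # scale features by r, then apply w
scores_folded = X @ (w * r)      # fold r into the weights instead

print(np.allclose(scores_scaled, scores_folded))
```

Folding `r` into the weights means the validation matrix does not need to be rescaled at prediction time.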

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)

### Unused helpers

In [None]:
class EzLSTM(nn.LSTM):
    def __init__(self, input_size, hidden_size, *args, **kwargs):
        super().__init__(input_size, hidden_size, *args, **kwargs)
        self.num_dirs = 2 if self.bidirectional else 1
        self.input_size = input_size
        self.hidden_size = hidden_size
        
    def forward(self, x):
        h0 = c0 = Variable(torch.zeros(self.num_dirs,x.size(1),self.hidden_size)).cuda()
        outp,_ = super().forward(x, (h0,c0))
        return outp[-1]

In [None]:
def init_wgts(m, last_l=-2):
    c = list(m.children())
    for l in c:
        if isinstance(l, nn.Embedding): 
            l.weight.data.uniform_(-0.05,0.05)
        elif isinstance(l, (nn.Linear, nn.Conv1d)):
            xavier_uniform(l.weight.data, gain=calculate_gain('relu'))
            l.bias.data.zero_()
    xavier_uniform(c[last_l].weight.data, gain=calculate_gain('linear'));

class SeqSize(nn.Sequential):
    def forward(self, x):
        for l in self.children():
            x = l(x)
            print(x.size())
        return x

### End