### Imports and data

In [1]:
%reload_ext autoreload
%autoreload 2

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression
from IPython.display import display

from sklearn import metrics
import feather
import pdpbox.pdp as pdp
from plotnine import *

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import gc
import os
from contextlib import contextmanager

plt.rcParams["figure.figsize"] = (10, 6)
np.set_printoptions(precision=5)
pd.set_option("display.precision", 5)
%precision 5


import spacy
#!python -m spacy download en_core_web_sm

# constants #################################################################
PATH = 'data/aclImdb/'
import pymorton as pm


def morton_indexer(w, h):
    assert w & (w - 1) == 0
    assert h & (h - 1) == 0
    # np.int was a deprecated alias of the builtin int and is gone in NumPy >= 1.24
    enc = np.zeros(w * h, dtype=int)
    dec = np.zeros(w * h, dtype=int)
    i = 0
    for r in range(h):
        for c in range(w):
            m = pm.interleave(c, r)
            enc[i] = m
            dec[m] = i
            i += 1
    return enc, dec


# defines: ##################################################################
# @contextmanager
# def rf_samples(n='all'):
#     if isinstance(n, int) and n > 0:
#         set_rf_samples(n)
#     try:
#         yield
#     finally:
#         if isinstance(n, int) and n > 0:
#             reset_rf_samples()


def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()


class AttrDict(dict):
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__


    Linking successful
    /home/fastai/anaconda3/envs/fastai/lib/python3.6/site-packages/en_core_web_sm
    -->
    /home/fastai/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



### Readme

```
Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there are an
even number of reviews > 5 and <= 5.

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
the star rating for that review on a 1-10 scale. For example, the file
[test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.

We also include the IMDb URLs for each review in a separate
[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
have its URL on line 200 of this file. Due the ever-changing IMDb, we
are unable to link directly to the review, but only to the movie's
review page.

In addition to the review text files, we include already-tokenized bag
of words (BoW) features that were used in our experiments. These 
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ascii sparse-vector format for labeled
data.  The feature indices in these files start from 0, and the text
tokens corresponding to a feature index is found in [imdb.vocab]. So a
line with 0:7 in a .feat file means the first word in [imdb.vocab]
(the) appears 7 times in that review.

LIBSVM page for details on .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

We also include [imdbEr.txt] which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

Citing the dataset

When using this dataset please cite our ACL 2011 paper which
introduces it. This paper also contains classification results which
you may want to compare against.


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
amaas@cs.stanford.edu
```
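The file-naming convention and the `.feat` layout described in the README can be read with a few lines of Python. This is a minimal sketch: the helper names `parse_review_filename` and `parse_feat_line` are hypothetical, the sample `.feat` line is made up, and it assumes each line starts with the label followed by `index:count` pairs, as in the usual LIBSVM layout.

```python
import os


def parse_review_filename(path):
    """Split '<id>_<rating>.txt' into an integer (id, rating) pair."""
    stem = os.path.splitext(os.path.basename(path))[0]
    review_id, rating = stem.split('_')
    return int(review_id), int(rating)


def parse_feat_line(line):
    """Parse one LIBSVM-style line: '<label> <index>:<count> ...' (assumed layout)."""
    label, *pairs = line.split()
    counts = {}
    for pair in pairs:
        index, count = pair.split(':')
        counts[int(index)] = int(count)
    return int(label), counts


# The README's own example file name
review_id, rating = parse_review_filename('test/pos/200_8.txt')

# A made-up line: label 8, feature 0 seen 7 times, feature 3 seen once
label, counts = parse_feat_line('8 0:7 3:1')
```

With the README's convention, `counts[0] == 7` would mean the first word in `imdb.vocab` appears 7 times in that review.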

# IMDB dataset and the sentiment classification task

In [2]:
!ls -al {PATH}

total 66560
drwxr-xr-x 4 fastai fastai     4096 Apr  1 09:02 .
drwxrwxr-x 5 fastai fastai     4096 Mar 31 11:11 ..
-rw-r--r-- 1 fastai fastai     4037 Jun 26  2011 README
-rw-r--r-- 1 fastai fastai   845980 Apr 12  2011 imdb.vocab
-rw-r--r-- 1 fastai fastai   903029 Jun 11  2011 imdbEr.txt
-rw-rw-r-- 1 fastai fastai 66383276 Apr  1 09:02 pickled.pkl
drwxr-xr-x 4 fastai fastai     4096 Mar 31 11:11 test
drwxr-xr-x 5 fastai fastai     4096 Mar 31 11:11 train


In [3]:
ls -al {PATH}train

total 66640
drwxr-xr-x 5 fastai fastai     4096 Mar 31 11:11 ./
drwxr-xr-x 4 fastai fastai     4096 Apr  1 09:02 ../
-rw-r--r-- 1 fastai fastai 21021197 Apr 12  2011 labeledBow.feat
drwxr-xr-x 2 fastai fastai   356352 Mar 31 11:11 neg/
drwxr-xr-x 2 fastai fastai   356352 Mar 31 11:11 pos/
drwxr-xr-x 2 fastai fastai  1462272 Mar 31 11:11 unsup/
-rw-r--r-- 1 fastai fastai 41348699 Apr 12  2011 unsupBow.feat
-rw-r--r-- 1 fastai fastai   612500 Apr 12  2011 urls_neg.txt
-rw-r--r-- 1 fastai fastai   612500 Apr 12  2011 urls_pos.txt
-rw-r--r-- 1 fastai fastai  2450000 Apr 12  2011 urls_unsup.txt


In [4]:
%ls {PATH}train/pos/*.txt | head

data/aclImdb/train/pos/0_9.txt
data/aclImdb/train/pos/10000_8.txt
data/aclImdb/train/pos/10001_10.txt
data/aclImdb/train/pos/10002_7.txt
data/aclImdb/train/pos/10003_8.txt
data/aclImdb/train/pos/10004_8.txt
data/aclImdb/train/pos/10005_7.txt
data/aclImdb/train/pos/10006_7.txt
data/aclImdb/train/pos/10007_7.txt
data/aclImdb/train/pos/10008_7.txt
ls: write error: Broken pipe


In [26]:
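import pickle  # made explicit; the original relied on a star import for this

def load_data():
    fname = f'{PATH}pickled.pkl'
    # Reuse the cached pickle when it exists
    if os.path.exists(fname):
        with open(fname, 'rb') as f:
            return AttrDict(pickle.load(f))
    # Otherwise read the raw review files and cache only the four arrays,
    # rather than everything in locals()
    names = ['neg', 'pos']
    trn, trn_y = texts_labels_from_folders(f'{PATH}train', names)
    val, val_y = texts_labels_from_folders(f'{PATH}test', names)
    dta = dict(trn=trn, trn_y=trn_y, val=val, val_y=val_y)
    with open(fname, 'wb') as f:
        pickle.dump(dta, f)
    return AttrDict(dta)
%time data = load_data()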

CPU times: user 95.8 ms, sys: 39.9 ms, total: 136 ms
Wall time: 136 ms


In [27]:
data.trn[0], data.trn_y[0]

("This is definitely a stupid, bad-taste movie. Eddie Murphy stars in what is written like a sitcom. He is surrounded with his perfect family, full of good family values. If you're looking for politically correct entertainment, this movie is for you. But if you hate the idea of being the only one not to laugh at obscene gags in a movie-theater full of pop-corn addicts, just flee.",
 0)

In [182]:
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

# sklearn.feature_extraction.text
veczr = CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(data.trn)
val_term_doc = veczr.transform(data.val)
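To see what the tokenizer does on its own, here is a small standalone sketch that reuses the same regular expression on a made-up sentence. It only needs the standard library. Note that `CountVectorizer` lowercases text by default before calling the tokenizer, which this snippet does not do.

```python
import re
import string

# Same pattern as in the cell above: every punctuation mark (plus a few
# typographic symbols) is captured as its own group.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')


def tokenize(s):
    # Surround each punctuation mark with spaces, then split on whitespace
    return re_tok.sub(r' \1 ', s).split()


tokens = tokenize("It's bad-taste, isn't it?")
print(tokens)
# ['It', "'", 's', 'bad', '-', 'taste', ',', 'isn', "'", 't', 'it', '?']
```

Punctuation becomes separate tokens, which is why entries such as `'!'` and `'"'` show up as features in the vocabulary.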

In [183]:
veczr.get_feature_names()[:10]

['\x08\x08\x08\x08a', '\x10own', '!', '"', '#', '$', '%', '&', "'", '(']

In [184]:
veczr.get_feature_names()[5000:5010], veczr.vocabulary_['absurd']

(['aussie',
  'aussies',
  'austen',
  'austeniana',
  'austens',
  'auster',
  'austere',
  'austerity',
  'austin',
  'austinese'],
 1297)

# Naive Bayes

We define the **log-count ratio** $r$ for each feature (word) $f$:

$r=\log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where the ratio of feature $f$ in positive documents is the number of times $f$ appears in positive documents divided by the number of positive documents, and likewise for negative documents. The code adds 1 to both the count and the number of documents, so the log is never taken of zero.
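A toy example may make the ratio concrete. The sketch below uses plain Python lists and made-up counts so that each step is visible; the helper name `smoothed_ratio` is hypothetical, and it applies the same add-one smoothing that `make_pr` uses in this notebook.

```python
import math

# Toy term-document counts: rows are documents, columns are features (words)
docs = [
    [2, 0],  # positive review
    [1, 1],  # positive review
    [0, 3],  # negative review
]
labels = [1, 1, 0]


def smoothed_ratio(docs, labels, cls):
    """Per feature: (count in class + 1) / (number of documents in class + 1)."""
    rows = [d for d, y in zip(docs, labels) if y == cls]
    totals = [sum(col) for col in zip(*rows)]
    return [(t + 1) / (len(rows) + 1) for t in totals]


p = smoothed_ratio(docs, labels, 1)  # [4/3, 2/3]
q = smoothed_ratio(docs, labels, 0)  # [1/2, 2]
r = [math.log(pi / qi) for pi, qi in zip(p, q)]
```

Here the first feature occurs only in positive documents, so its ratio is positive, while the second occurs mostly in the negative document, so its ratio is negative.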

In [196]:
x = trn_term_doc
y = data.trn_y


def make_pr(x, y):
    def pr(y_i):
        p = x[y == y_i].sum(0)
        return (p + 1) / ((y == y_i).sum() + 1)

    return pr


pr = make_pr(x, y)

r = np.log(pr(1) / pr(0))
b = np.log((y == 1).mean() / (y == 0).mean())

In [197]:
# Naive Bayes
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T > 0
(preds == data.val_y).mean()

0.81656
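Written out, the prediction in the cell above is a linear decision rule. The symbols $N_{\text{pos}}$ and $N_{\text{neg}}$ (the numbers of positive and negative training documents) are notation introduced here, not names from the notebook. For a validation document with count vector $x$, the model predicts positive when

$$
x \cdot r + b > 0, \qquad b = \log \frac{N_{\text{pos}}}{N_{\text{neg}}}
$$

If the training labels are perfectly balanced, $b = \log 1 = 0$ and the prediction depends only on the sign of $x \cdot r$.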

In [198]:
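# Binarized Naive Bayes
# Note: `pr` still closes over the count matrix passed to make_pr above, so
# rebinding `x` here does not change `r`. To compute the ratios from the
# binarized training matrix, rebuild it first with pr = make_pr(x, y).
x = trn_term_doc.sign()
r = np.log(pr(1) / pr(0))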

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T > 0
(preds == data.val_y).mean()

0.83184

## Logistic regression

In [188]:
m = LogisticRegression(C=1e-2, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)

(preds==data.val_y).mean()

0.85128

In [189]:
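# binarized
m = LogisticRegression(C=1e-2)
m.fit(trn_term_doc.sign(), y)
# Note: the validation matrix is not binarized here; use val_term_doc.sign()
# to score on the same representation the model was trained on.
preds = m.predict(val_term_doc)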

(preds==data.val_y).mean()

0.85128

## Trigram with NB features

In [190]:
veczr3 = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800_000)
trn_term_doc3 = veczr3.fit_transform(data.trn)
val_term_doc3 = veczr3.transform(data.val)

In [178]:
trn_term_doc3.shape

(25000, 800000)

In [179]:
veczr3.get_feature_names()[200_000:200_010]

['emotional and',
 'emotional and complex',
 'emotional and physical',
 'emotional and powerful',
 'emotional and psychological',
 'emotional and social',
 'emotional and thought',
 'emotional appeal',
 'emotional as',
 'emotional attachment']

In [201]:
y = data.trn_y
x = trn_term_doc3.sign()
pr = make_pr(x, y)
val_x = val_term_doc3.sign()
r = np.log(pr(1)/pr(0))
b = np.log((y==1).mean() / (y==0).mean())

pre_preds = val_x @ r.T + b
preds = pre_preds.T > 0
(preds == data.val_y).mean()

0.87912

In [203]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)

preds = m.predict(val_x)
(preds.T == data.val_y).mean()

0.905

In [204]:
r.shape

(1, 800000)

In [217]:
x_nb = x.multiply(r)
val_x_nb = val_x.multiply(r)
m = LogisticRegression(C=0.1, dual=True)
m.fit(x_nb, y)
preds = m.predict(val_x_nb)
(preds.T == data.val_y).mean()

0.91768
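The call `x.multiply(r)` scales each column of the binarized matrix by that feature's log-count ratio, so a feature that is present contributes $r_f$ instead of 1. Here is a minimal sketch of that element-wise product, with plain lists and made-up ratios rather than the sparse matrices used above:

```python
# Binarized toy documents (rows) over three features (columns)
x = [
    [1, 0, 1],
    [0, 1, 1],
]

# Made-up log-count ratios, one per feature
r = [0.9, -1.2, 0.1]

# Multiply every row element-wise by r, as broadcasting a (1, n) row would
x_nb = [[xi * ri for xi, ri in zip(row, r)] for row in x]
```

Logistic regression is then fit on `x_nb`, so its weights are learned on top of the Naive Bayes ratios rather than on raw presence indicators.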

In [213]:
x.shape, x_nb.shape

((25000, 800000), (25000, 800000))

# fastai NBSVM++

In [222]:
sl = 2_000
md = TextClassifierData.from_bow(trn_term_doc3, data.trn_y, val_term_doc3, data.val_y, sl)

learner = md.dotprod_nb_learner()
learner.fit(0.02, 4, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>                     
    0      0.023144   0.119381   0.9168    
    1      0.013382   0.116176   0.92008                      
    2      0.008749   0.113512   0.92092                       
    3      0.006416   0.111717   0.92016                       



[array([0.11172]), 0.92016]

[...](https://youtu.be/XJ_waZlJU8g?t=2764)

# References

- [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning](https://www.aclweb.org/anthology/P12-2018)