# Sentiment Classification of Movie Reviews (using Naive Bayes, Logistic Regression, and Ngrams)

## IMDB 

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai import *
from fastai.text import *

In [3]:
import sklearn.feature_extraction.text as sklearn_text

In [4]:
?? URLs

In [6]:
path = untar_data(URLs.IMDB_SAMPLE)
path

Downloading http://files.fast.ai/data/examples/imdb_sample


PosixPath('/home/rprilepskiy/.fastai/data/imdb_sample')

In [7]:
df = pd.read_csv(path/'texts.csv')
df.head()

| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
| 1 | positive | This is a extremely well-made film. The acting... | False |
| 2 | negative | Every once in a long while a movie will come a... | False |
| 3 | positive | Name just says it all. I watched this movie wi... | False |
| 4 | negative | This movie succeeds at being one of the most u... | False |


In [9]:
movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df(col=2)
                         .label_from_df(cols=0))

In [11]:
movie_reviews.valid.x[0], movie_reviews.valid.y[0]

(Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 
  
   xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 
  
   xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk ,

In [12]:
len(movie_reviews.train.x), len(movie_reviews.valid.x)

(800, 200)

In [13]:
len(movie_reviews.vocab.itos), len(movie_reviews.vocab.stoi)

(6008, 19161)

In [14]:
movie_reviews.vocab.stoi['language']

917

In [17]:
movie_reviews.vocab.itos[917]

'language'

In [18]:
movie_reviews.vocab.itos[20:30]

['that', 'this', '"', "'s", '\n \n ', '-', 'was', 'as', 'for', 'movie']

In [20]:
movie_reviews.vocab.itos[-1]

'sollett'

In [21]:
movie_reviews.vocab.itos[:20]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the',
 '.',
 ',',
 'and',
 'a',
 'of',
 'to',
 'is',
 'it',
 'in',
 'i']

In [22]:
movie_reviews.vocab.stoi

defaultdict(int,
            {'xxunk': 0,
             'xxpad': 1,
             'xxbos': 2,
             'xxeos': 3,
             'xxfld': 4,
             'xxmaj': 5,
             'xxup': 6,
             'xxrep': 7,
             'xxwrep': 8,
             'the': 9,
             '.': 10,
             ',': 11,
             'and': 12,
             'a': 13,
             'of': 14,
             'to': 15,
             'is': 16,
             'it': 17,
             'in': 18,
             'i': 19,
             'that': 20,
             'this': 21,
             '"': 22,
             "'s": 23,
             '\n \n ': 24,
             '-': 25,
             'was': 26,
             'as': 27,
             'for': 28,
             'movie': 29,
             'with': 30,
             'but': 31,
             'film': 32,
             'you': 33,
             ')': 34,
             'on': 35,
             '(': 36,
             "n't": 37,
             'are': 38,
             'he': 39,
             'his': 40,
       

In [29]:
movie_reviews.vocab.itos[movie_reviews.vocab.stoi['ranger']]

'xxunk'

In [24]:
movie_reviews.vocab.itos[movie_reviews.vocab.stoi['language']]

'language'
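
The two lookups above suggest that `stoi` falls back to index 0 (`xxunk`) for words outside the vocabulary, which is consistent with the `defaultdict(int, ...)` printed earlier. A minimal sketch of that behavior, assuming a hypothetical five-token vocabulary rather than the real one (`numericalize` is an illustrative helper, not a fastai function):

```python
from collections import defaultdict

# Hypothetical toy vocabulary; index 0 is reserved for the unknown token.
itos = ['xxunk', 'xxpad', 'the', 'movie', 'language']

# defaultdict(int) returns 0 for any missing key, i.e. the index of 'xxunk'.
stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})

def numericalize(tokens):
    """Map tokens to ids; out-of-vocabulary tokens become 0."""
    return [stoi[tok] for tok in tokens]

ids = numericalize(['the', 'ranger', 'movie'])
round_trip = [itos[i] for i in ids]
print(ids)         # [2, 0, 3]
print(round_trip)  # ['the', 'xxunk', 'movie']
```

Indexing a `defaultdict` with a missing key also inserts that key, so `len(stoi)` can grow past `len(itos)`. Using `stoi.get(tok, 0)` would give the same fallback without the insertion.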

In [32]:
t = movie_reviews.train[0][0]
t

Text xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !

In [33]:
t.data[:30]

array([   2,    5, 4619,   25,    0,   25,  867,   52,    5, 3776,    5, 1800,   95,   37,   85,  191,   64,  935,
          0, 2738,  517,   18,   21,   11,   84, 2417,  192,   88, 3777,   64])

## Creating our term-document matrix

In [35]:
c = Counter([4,2,8,8,4,8])

In [36]:
c

Counter({4: 2, 2: 1, 8: 3})

In [37]:
c.values()

dict_values([2, 1, 3])

In [38]:
c.keys()

dict_keys([4, 2, 8])

In [40]:
(movie_reviews.valid.x)[0]

Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 
 
  xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 
 
  xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxma

In [41]:
Counter((movie_reviews.valid.x)[0].data)

Counter({2: 1,
         5: 32,
         21: 3,
         71: 1,
         189: 1,
         748: 1,
         288: 1,
         285: 1,
         63: 2,
         221: 1,
         666: 2,
         59: 1,
         13: 4,
         2705: 1,
         14: 6,
         2875: 1,
         11: 10,
         18: 2,
         358: 1,
         0: 32,
         77: 1,
         15: 6,
         478: 1,
         1833: 1,
         50: 3,
         9: 10,
         319: 1,
         6: 1,
         2743: 1,
         12: 1,
         115: 1,
         4126: 1,
         197: 2,
         1331: 1,
         25: 2,
         324: 1,
         10: 7,
         3963: 1,
         16: 4,
         74: 1,
         24: 3,
         2817: 1,
         5821: 1,
         2595: 1,
         710: 1,
         3429: 1,
         84: 1,
         149: 1,
         20: 1,
         26: 1,
         605: 1,
         378: 1,
         1057: 1,
         251: 1,
         258: 1,
         1346: 1,
         194: 1,
         239: 1,
         49: 1,
         27

In [46]:
movie_reviews.vocab.itos[49]

'an'

In [47]:
(movie_reviews.valid.x)[1]

Text xxbos i saw this movie once as a kid on the late - late show and fell in love with it . 
 
  xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . 
 
  xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk

In [73]:
import scipy.sparse

def get_term_doc_matrix(text_list, vocab_len):
    """Build a sparse document-by-term count matrix in CSR format."""
    j_indices = []   # column (token id) of each stored count
    values = []      # the counts themselves
    indptr = [0]     # indptr[i]:indptr[i+1] delimits the entries of row i

    for doc in text_list:
        feature_counter = Counter(doc.data)
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))

    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)
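
The `(values, j_indices, indptr)` triple passed to `csr_matrix` above follows the CSR layout: row `i` owns the slice `indptr[i]:indptr[i+1]` of the other two arrays. The pure-Python sketch below builds the triple the same way from two hypothetical token-id documents and expands it back into dense rows, without scipy, just to make the encoding visible. The helper names `build_csr_triple` and `csr_to_dense` are illustrative and not part of any library.

```python
from collections import Counter

def build_csr_triple(docs):
    """Build CSR arrays from documents given as lists of token ids."""
    values, j_indices, indptr = [], [], [0]
    for doc in docs:
        counts = Counter(doc)
        j_indices.extend(counts.keys())    # column = token id
        values.extend(counts.values())     # entry = count of that token
        indptr.append(len(j_indices))      # where the next row starts
    return values, j_indices, indptr

def csr_to_dense(values, j_indices, indptr, n_cols):
    """Expand CSR arrays into a list of dense rows."""
    dense = []
    for row in range(len(indptr) - 1):
        out = [0] * n_cols
        for k in range(indptr[row], indptr[row + 1]):
            out[j_indices[k]] += values[k]
        dense.append(out)
    return dense

docs = [[4, 2, 4], [0, 0, 3]]   # hypothetical documents over a 5-token vocab
values, j_indices, indptr = build_csr_triple(docs)
dense = csr_to_dense(values, j_indices, indptr, n_cols=5)
print(indptr)  # [0, 2, 4]
print(dense)   # [[0, 0, 1, 0, 2], [2, 0, 0, 1, 0]]
```

Only the nonzero counts are stored, which is why this layout suits term-document matrices where each review uses a small fraction of the vocabulary.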

In [74]:
%%time
val_term_doc = get_term_doc_matrix(movie_reviews.valid.x, len(movie_reviews.vocab.itos))

CPU times: user 31.7 ms, sys: 220 µs, total: 31.9 ms
Wall time: 30.1 ms


In [76]:
%%time
trn_term_doc = get_term_doc_matrix(movie_reviews.train.x, len(movie_reviews.vocab.itos))

CPU times: user 112 ms, sys: 375 µs, total: 112 ms
Wall time: 107 ms


In [77]:
trn_term_doc.shape

(800, 6008)

In [78]:
trn_term_doc[:,-10:]

<800x10 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [79]:
val_term_doc.shape

(200, 6008)

In [81]:
movie_reviews.vocab.itos[-1:]

['sollett']

In [85]:
val_term_doc.todense()[:10,-10:]

matrix([[0, 0, 0, 0, ..., 0, 0, 0, 0],
        [0, 0, 0, 0, ..., 0, 0, 0, 0],
        [0, 0, 0, 0, ..., 0, 0, 0, 0],
        [0, 0, 0, 0, ..., 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, ..., 0, 0, 0, 0],
        [0, 0, 0, 0, ..., 0, 0, 0, 0],
        [0, 0, 0, 0, ..., 0, 0, 0, 0],
        [0, 0, 0, 0, ..., 0, 0, 0, 0]])

In [83]:
movie_reviews.vocab.itos[3]

'xxeos'

In [84]:
review = movie_reviews.valid.x[1]; review

Text xxbos i saw this movie once as a kid on the late - late show and fell in love with it . 
 
  xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . 
 
  xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk