# Movie Review Sentiment (Naive Bayes)

In [1]:
import io
import requests
import re
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Make this notebook's output stable across runs
random_state = 1
np.random.seed(random_state)

# Options for plots
%matplotlib inline
sns.set()
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.labelpad'] = 12

## Load IMDB Movie Reviews

Source: http://ai.stanford.edu/~amaas/data/sentiment/

In [2]:
train_url = 'https://raw.githubusercontent.com/natecraig/aiml/main/Data/movie_train.txt'
test_url = 'https://raw.githubusercontent.com/natecraig/aiml/main/Data/movie_test.txt'

train_download = requests.get(train_url).content
test_download = requests.get(test_url).content

In [3]:
# In both the train and test sets, the first 12,500 reviews are positive
# and the last 12,500 are negative, so one label vector serves both sets

X_train_raw = [line.strip() for line in io.StringIO(train_download.decode('utf-8'))]
X_test_raw = [line.strip() for line in io.StringIO(test_download.decode('utf-8'))]

# Labels: 1 = Positive, 0 = Negative (categories is ordered by label value)
categories = ['Negative', 'Positive']
y = [1 if i < 12500 else 0 for i in range(25000)]

In [4]:
print(y[1000])

1


In [5]:
print(X_train_raw[1000])

This was a must see documentary for me when I missed the opportunity in 2004, so I was definitely going to watch the repeat. I really sympathised with the main character of the film, because, this is true, I have a milder condition of the skin problem he had, Dystrophic Epidermolysis Bullosa (EB). This is a sad, sometimes amusing and very emotional documentary about a boy with a terrible skin disorder. Jonny Kennedy speaks like a kid (because of wasting vocal muscle) and never went through puberty, but he is 36 years old. Most sympathising moments are seeing his terrible condition, and pealing off his bandages. Jonny had quite a naughty sense of humour, he even narrated from beyond the grave when showing his body in a coffin. He tells his story with the help of his mother, Edna Kennedy, his older brother and celebrity model, and Jonny's supporter, Nell McAndrew. It won the BAFTAs for Best Editing and Best New Director (Factual), and it was nominated for Best Sound (Factual) and the Fla

In [6]:
print(y[20000])

0


In [7]:
print(X_train_raw[20000])

This movie tries hard, but completely lacks the fun of the 1960s TV series, that I am sure people do remember with fondness. Although I am 17, I watched some of the series on YouTube a long time ago and it was enjoyable and fun. Sadly, this movie does little justice to the series.<br /><br />The special effects are rather substandard, and this wasn't helped by the flat camera-work. The script also was dull and lacked any sense of wonder and humour. Other films with under-par scripting are Home Alone 4, Cat in the Hat, Thomas and the Magic Railroad and Addams Family Reunion.<br /><br />Now I will say I liked the idea of the story, but unfortunately it was badly executed and ran out of steam far too early, and I am honestly not sure for this reason this is something for the family to enjoy. And I was annoyed by the talking suit, despite spirited voice work from Wayne Knight.<br /><br />But the thing that angered me most about this movie was that it wasted the talents of Christopher Lloyd

In [8]:
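# Replace doubled HTML line breaks, hyphens, and slashes with spaces
# (raw string avoids invalid-escape warnings for \s, \- and \/)
regex = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")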
X_train_raw = [regex.sub(' ', x) for x in X_train_raw]
X_test_raw = [regex.sub(' ', x) for x in X_test_raw]
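
As a quick check of what this substitution does, here is a minimal sketch on a made-up string. The names `pattern`, `sample`, and `single_break` are hypothetical and the text is not taken from the dataset. The pattern is the same one used above, written as a raw string.

```python
import re

# Same pattern as in the cell above, written as a raw string
pattern = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

# Hypothetical sample text (not from the dataset)
sample = "camera-work was flat.<br /><br />The script was under-par and/or dull."
cleaned = pattern.sub(' ', sample)
print(cleaned)

# A single (not doubled) break is not matched as a whole;
# only its slash is replaced, so a tag fragment is left behind
single_break = pattern.sub(' ', "a<br />b")
print(single_break)
```

If single `<br />` tags occur in the reviews, a broader pattern may be worth considering; the sketch only shows the behavior of the pattern as written.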

In [9]:
print(X_train_raw[20000])

This movie tries hard, but completely lacks the fun of the 1960s TV series, that I am sure people do remember with fondness. Although I am 17, I watched some of the series on YouTube a long time ago and it was enjoyable and fun. Sadly, this movie does little justice to the series. The special effects are rather substandard, and this wasn't helped by the flat camera work. The script also was dull and lacked any sense of wonder and humour. Other films with under par scripting are Home Alone 4, Cat in the Hat, Thomas and the Magic Railroad and Addams Family Reunion. Now I will say I liked the idea of the story, but unfortunately it was badly executed and ran out of steam far too early, and I am honestly not sure for this reason this is something for the family to enjoy. And I was annoyed by the talking suit, despite spirited voice work from Wayne Knight. But the thing that angered me most about this movie was that it wasted the talents of Christopher Lloyd, Jeff Daniels and Daryl Hannah, 

## Create Features Using Word Counts

In [10]:
countvect = CountVectorizer()
X_train_count = countvect.fit_transform(X_train_raw)
X_test_count = countvect.transform(X_test_raw)

In [11]:
X_train_count.shape

(25000, 74849)
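
To make the shape above concrete, the following sketch runs `CountVectorizer` on two made-up sentences (`toy_docs` and the other `toy_*` names are hypothetical). Each row of the result is a document and each column is a vocabulary word, so the full matrix has one column per distinct token seen in training.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
toy_docs = ["The movie was fun", "The movie was not fun, not at all"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Vocabulary words ordered by their column index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)

# Dense view of the counts: "not" appears twice in the second document
print(toy_counts.toarray())
```

With the default settings, text is lowercased and tokens shorter than two characters are dropped.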

In [12]:
# The vocabulary dictionary maps each word to the index of its feature column
# Print 10 example entries
for idx, (k, v) in enumerate(countvect.vocabulary_.items()):
    if idx == 10:
        break
    print((k, v))

('bromwell', 9251)
('high', 30902)
('is', 34585)
('cartoon', 10850)
('comedy', 13498)
('it', 34683)
('ran', 53427)
('at', 4753)
('the', 66339)
('same', 57283)


In [13]:
# Look up the feature index of each example word
example_words = ['me', 'the', 'and',
                 'popcorn', 'scary', 'fun',
                 'tom', 'cruise', 'will', 'ferrell', 'ferrel',
                 'disney', 'elsa', 'timon']

for word in example_words:
    print(f'{word + ":":14}{countvect.vocabulary_[word]}')

me:           41798
the:          66339
and:          3258
popcorn:      50824
scary:        57836
fun:          26333
tom:          67237
cruise:       15792
will:         73091
ferrell:      24285
ferrel:       24284
disney:       18936
elsa:         21403
timon:        66964


## Fit Naive Bayes Model to Word Counts

In [14]:
nbclf = MultinomialNB()
nbclf.fit(X_train_count, y)
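
For intuition about what the fit estimates, here is a small sketch on a made-up count matrix (all `toy_*` names and the `manual` variable are hypothetical). It checks that the per-class word probabilities stored in `feature_log_prob_` match smoothed relative frequencies, (word count in class + alpha) / (total count in class + alpha * vocabulary size), with the default `alpha=1.0`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts: 3 documents, 3 vocabulary words
toy_X = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1]])
toy_y = np.array([1, 1, 0])

alpha = 1.0
toy_nb = MultinomialNB(alpha=alpha).fit(toy_X, toy_y)

# Smoothed word probabilities for class 1, computed by hand
class_counts = toy_X[toy_y == 1].sum(axis=0)  # [3, 0, 1]
manual = (class_counts + alpha) / (class_counts.sum() + alpha * toy_X.shape[1])

# Row of feature_log_prob_ that corresponds to class 1
class_idx = list(toy_nb.classes_).index(1)
fitted = np.exp(toy_nb.feature_log_prob_[class_idx])
print(manual)
print(fitted)
```

The smoothing is what keeps a word that never appears in a class (the second column for class 1 here) from receiving zero probability.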

In [15]:
# Get performance on training data
print(classification_report(y, nbclf.predict(X_train_count),
                            target_names=categories))

              precision    recall  f1-score   support

    Negative       0.87      0.93      0.90     12500
    Positive       0.93      0.86      0.89     12500

    accuracy                           0.90     25000
   macro avg       0.90      0.90      0.90     25000
weighted avg       0.90      0.90      0.90     25000



In [16]:
# Get performance on test data (the test labels follow the same layout as y)
print(classification_report(y, nbclf.predict(X_test_count),
                            target_names=categories))

              precision    recall  f1-score   support

    Negative       0.78      0.88      0.83     12500
    Positive       0.86      0.75      0.80     12500

    accuracy                           0.81     25000
   macro avg       0.82      0.81      0.81     25000
weighted avg       0.82      0.81      0.81     25000



## Create Features Using TFIDF

In [17]:
tfidf = TfidfTransformer(sublinear_tf=True)
X_train_tfidf = tfidf.fit_transform(X_train_count)
X_test_tfidf = tfidf.transform(X_test_count)
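
The sketch below illustrates the two pieces of this transform on a made-up count matrix (the `toy_*`, `tf_only`, `scaled`, and `manual_idf` names are hypothetical). With `sublinear_tf=True`, each nonzero count `tf` is replaced by `1 + ln(tf)`; with scikit-learn's default smoothing, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the word.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents, 3 vocabulary words
toy_counts = np.array([[1, 4, 0],
                       [0, 2, 3]])

# Sublinear scaling alone (no IDF weighting, no row normalization)
tf_only = TfidfTransformer(sublinear_tf=True, use_idf=False, norm=None)
scaled = tf_only.fit_transform(toy_counts).toarray()
print(scaled)

# IDF weights learned with the default smooth_idf=True, checked by hand
toy_tfidf = TfidfTransformer(sublinear_tf=True).fit(toy_counts)
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(toy_tfidf.idf_)
```

In the full transform, the scaled counts are multiplied by the IDF weights and each row is then L2-normalized (the default `norm='l2'`).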

In [18]:
print(X_train_tfidf[0])

  (0, 74349)	0.05206888010010324
  (0, 74158)	0.055050768098681556
  (0, 72911)	0.061112244873247706
  (0, 72904)	0.033681880397817464
  (0, 72773)	0.04128496380247337
  (0, 72753)	0.03797512811718631
  (0, 72703)	0.03617656837997499
  (0, 72544)	0.10903604567212596
  (0, 68145)	0.0859738287067699
  (0, 67125)	0.05584454818700603
  (0, 66925)	0.0386085227347137
  (0, 66699)	0.053325764053003
  (0, 66526)	0.04751002409485954
  (0, 66367)	0.07247542780430569
  (0, 66339)	0.06066862820581213
  (0, 66322)	0.054204063816507164
  (0, 66299)	0.042450397436525476
  (0, 65750)	0.12716378450734914
  (0, 65748)	0.3115242868369409
  (0, 64683)	0.10625234549709642
  (0, 64115)	0.05319903669985538
  (0, 63768)	0.1749891196225475
  (0, 63767)	0.16908507449768323
  (0, 61617)	0.03673963173839623
  (0, 60495)	0.08891486418508797
  :	:
  (0, 32729)	0.09506058762703333
  (0, 30902)	0.17694132818362251
  (0, 30670)	0.05165066420396213
  (0, 24623)	0.1460232241180382
  (0, 24328)	0.12291687653693922
  (0, 

## Fit Naive Bayes Model to TFIDF

In [19]:
nbclf = MultinomialNB()
nbclf.fit(X_train_tfidf, y)

In [20]:
# Get performance on training data
print(classification_report(y, nbclf.predict(X_train_tfidf),
                            target_names=categories))

              precision    recall  f1-score   support

    Negative       0.90      0.93      0.91     12500
    Positive       0.93      0.89      0.91     12500

    accuracy                           0.91     25000
   macro avg       0.91      0.91      0.91     25000
weighted avg       0.91      0.91      0.91     25000



In [21]:
# Get performance on test data
print(classification_report(y, nbclf.predict(X_test_tfidf),
                            target_names=categories))

              precision    recall  f1-score   support

    Negative       0.80      0.89      0.84     12500
    Positive       0.88      0.78      0.82     12500

    accuracy                           0.83     25000
   macro avg       0.84      0.83      0.83     25000
weighted avg       0.84      0.83      0.83     25000



## Use N-grams

In [22]:
countvect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X_train_count = countvect.fit_transform(X_train_raw)
X_train_count.shape

(25000, 1771985)

In [23]:
# Look up the feature index of each example n-gram
example_words = ['good movie', 'bad movie',
                 'silly plot', 'crazy action',
                 'buttered popcorn']

for word in example_words:
    print(f'{word + ":":18}{countvect.vocabulary_[word]}')

good movie:       669449
bad movie:        120741
silly plot:       1421010
crazy action:     345009
buttered popcorn: 207145
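
One detail worth seeing on a small case: stop words are removed before n-grams are built, so a bigram can join two words that were not adjacent in the original text. A minimal sketch on a made-up phrase (`toy_vect` and `toy_vocab` are hypothetical names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as the vectorizer above, applied to a hypothetical phrase
toy_vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
toy_vect.fit(["popcorn with butter"])

# "with" is a stop word, so the only bigram is "popcorn butter"
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)
```

This also helps explain why the feature count grows so sharply: every distinct pair of neighboring non-stop words becomes its own column.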


In [24]:
clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('nbclf', MultinomialNB()),
])

In [25]:
clf.fit(X_train_raw, y)
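
Because the pipeline bundles vectorization and classification, it accepts raw strings directly. The sketch below shows that usage on a tiny made-up training set; `toy_clf`, `toy_reviews`, `toy_labels`, and `toy_preds` are hypothetical and only illustrate the call pattern, not the model fitted above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature training set: 1 = Positive, 0 = Negative
toy_reviews = ["great fun movie", "loved it great acting",
               "terrible boring movie", "awful plot boring"]
toy_labels = [1, 1, 0, 0]

# Same pipeline structure as above
toy_clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('nbclf', MultinomialNB()),
])
toy_clf.fit(toy_reviews, toy_labels)

# Score unseen raw text; no manual vectorization step is needed
toy_preds = toy_clf.predict(["great acting", "boring plot"])
print(toy_preds)
```

The same call, `clf.predict([...])`, should work on the fitted pipeline from the cell above for any list of review strings.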

In [26]:
# Get performance on training data
print(classification_report(y, clf.predict(X_train_raw),
                            target_names=categories))

              precision    recall  f1-score   support

    Negative       0.98      0.99      0.98     12500
    Positive       0.99      0.98      0.98     12500

    accuracy                           0.98     25000
   macro avg       0.98      0.98      0.98     25000
weighted avg       0.98      0.98      0.98     25000



In [27]:
# Get performance on test data
print(classification_report(y, clf.predict(X_test_raw),
                            target_names=categories))

              precision    recall  f1-score   support

    Negative       0.83      0.89      0.86     12500
    Positive       0.88      0.82      0.85     12500

    accuracy                           0.85     25000
   macro avg       0.86      0.85      0.85     25000
weighted avg       0.86      0.85      0.85     25000

