## Predicting Whether a Movie Review Is Positive or Negative

In [1]:
# Importing the Libraries and the Dataset

import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews.tsv', sep='\t')
df.head()

| | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [4]:
# Looking at a Review

from IPython.display import Markdown, display
print(df['label'][5])
display(Markdown('> '+df['review'][5]))

neg


> to put it bluntly , ed wood would have been proud of this . 
a totally ridiculous plot is encompassed with bad humor , hokey drama , zero logic and a crap screenplay . 
also , a beautifully anti-climactic ending . 
not to say it didn't look intriguing when i saw the previews . 
so much for truth in advertising . 
roland emmerich , who's later " independence day " would look like " the 400 blows " compared to this , co-writed and directed this inane sci-fi film which uses the cliche of there being some connection between eqypt and aliens . 
in a useless opening sequence , men find a stone in 1914 with hieroglyphics on it . 
it wouldn't be till present day ( '94 ) till they would actually figure it out . 
they're decipherer ? 
a slightly-neurotic scientist ( nice twist ) , dr . dan jackson ( james spader , doing his best outside of erotic thrillers and some indy fare ) who's life sucks so much that people walk out of his lectures after the third word . 
why do they use him to decipher what no one else could ? 
so there is a hokey ending ! 
duh ! 
he figures it out in about a minute . 
yea . 
and then they get a suicidal colonel or something , " jack " o'neill ( kurt russel , with his wyat earp locks in the beginning then a flat-top that would make howie long snap into a fetal position ) . 
why a suicidal colonel ? 
for the ending ! 
you'll get the hang of this . 
they open the stargate , a bunch of them go through it with a bomb to blow it up if they find anything bad . 
after an overdone special effects thing , they're . . . inside a goddam pyramid . 
so they went to egypt , right ? 
wrong . 
they're on another planet that was filmed in egypt . 
they discover a cilvilization ruled by ra , the sun god ( the androginous jaye davidson , with a voice modifier to make him sound like barry white with asthma ) , and there are fights , explosions and a kiss between two people . 
yea . 
also melodrama , stupidity , hokey scenes and a bizarre language . 
an anti-climactic ending ends with stupid lines ( " say hello to king tut , asswhole ! " 
- the quintessential line , lemme tell ya ) and some convenient pesudo-pseudo-pseudo-character development . 
by the end , you just wanna go home and watch , i don't know , the " outer limits " or something . 
the script's terrible . 
the special effects are okay , but nothing great . 
the story's so weak that it's almost opaque . 
the whole experience just isn't worth it unless you're so bored that you'd consider watching a " full house " marathon . . . or 
this . 
i'd pick this , obviously , but still , it's just not fun at all . 
and i can't wait for it to premier on mst3k . 


In [5]:
# Test for Missing Values and Blank Spaces '   '

# Check for the existence of NaN values in a cell:
df.isnull().sum()

label      0
review    35
dtype: int64

In [6]:
# Drop NA

df.dropna(inplace=True)
print(len(df))

# Detect whitespace-only strings in the dataset

blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():  # iterate over the DataFrame rows
    if isinstance(rv, str):        # skip any non-string values
        if rv.isspace():           # test 'review' for whitespace only
            blanks.append(i)       # add matching index numbers to the list

print(len(blanks), 'blanks: ', blanks)

1965
27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [7]:
# Validate: this review should print as nothing but whitespace
print(df['review'][147])

# Drop the '  ' values
df.drop(blanks, inplace=True)
len(df)

  


1938
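
The explicit loop above can also be written as a vectorized pandas expression. The following is a minimal sketch on a made-up four-row frame (the `toy` name and its rows are illustrative and not taken from `moviereviews.tsv`); it assumes the `review` column holds only strings once the NaN rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos'],
    'review': ['dull plot', '   ', np.nan, 'great cast'],
})

# Drop rows with a missing review first, as in the cells above
toy = toy.dropna(subset=['review'])

# str.isspace() is True only for non-empty strings made up entirely of whitespace
blank_mask = toy['review'].str.isspace()
blanks = toy.index[blank_mask].tolist()
print(blanks)

# Keep only the rows that contain real text
toy = toy[~blank_mask]
print(len(toy))
```

Whether this is preferable to the loop is a matter of taste; on a few thousand rows the speed difference is unlikely to matter.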

In [8]:
# Machine Learning Process

from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [9]:
# Create pipelines that vectorize the text and then fit a classifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])
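
To make clear what a pipeline bundles together, the sketch below runs the vectorizer and the classifier by hand and then through a `Pipeline`, and compares the predictions. The four training documents and two test documents are invented for illustration; only the scikit-learn classes come from the cell above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus, not drawn from the movie review dataset
train_docs = ['great fun film', 'boring dull film', 'fun and great', 'dull and boring']
train_labels = ['pos', 'neg', 'pos', 'neg']
test_docs = ['great film', 'dull film']

# Manual version: vectorize the text, then fit the classifier on the matrix
vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train_tfidf, train_labels)
manual = clf.predict(vec.transform(test_docs))

# Pipeline version: the same two steps behind a single fit/predict call
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_docs, train_labels)
piped = pipe.predict(test_docs)

print(list(manual), list(piped))
```

The practical benefit is that the pipeline accepts raw strings at prediction time, so the test set never has to be transformed by hand.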

In [10]:
# Fit and evaluate the Naive Bayes pipeline

# Feed the Data Through Pipeline
text_clf_nb.fit(X_train, y_train)

# Form a prediction set
predictions = text_clf_nb.predict(X_test)

# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

# Print a classification report
print(metrics.classification_report(y_test,predictions))

[[287  21]
 [130 202]]
0.7640625
              precision    recall  f1-score   support

         neg       0.69      0.93      0.79       308
         pos       0.91      0.61      0.73       332

    accuracy                           0.76       640
   macro avg       0.80      0.77      0.76       640
weighted avg       0.80      0.76      0.76       640
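
The figures in the report can be recomputed by hand from the confusion matrix printed above, which helps when reading it. This sketch assumes scikit-learn's usual layout: rows are true labels, columns are predicted labels, both in the order `neg`, `pos`.

```python
# Confusion matrix from the Naive Bayes run above
cm = [[287, 21],    # true neg: predicted neg, predicted pos
      [130, 202]]   # true pos: predicted neg, predicted pos

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: of everything predicted as a class, how much was right
# Recall: of everything truly in a class, how much was found
neg_precision = cm[0][0] / (cm[0][0] + cm[1][0])
neg_recall = cm[0][0] / (cm[0][0] + cm[0][1])
pos_precision = cm[1][1] / (cm[0][1] + cm[1][1])
pos_recall = cm[1][1] / (cm[1][0] + cm[1][1])

print(accuracy)
print(round(neg_precision, 2), round(neg_recall, 2))
print(round(pos_precision, 2), round(pos_recall, 2))
```

Read this way, the weak spot is the 130 positive reviews predicted as negative, which is what pulls recall for `pos` down to 0.61.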



In [11]:
# Fit through Linear SVC Pipeline

text_clf_lsvc.fit(X_train, y_train)

# Form a prediction set
predictions = text_clf_lsvc.predict(X_test)

# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

# Print a classification report
print(metrics.classification_report(y_test,predictions))

# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

[[259  49]
 [ 49 283]]
              precision    recall  f1-score   support

         neg       0.84      0.84      0.84       308
         pos       0.85      0.85      0.85       332

    accuracy                           0.85       640
   macro avg       0.85      0.85      0.85       640
weighted avg       0.85      0.85      0.85       640

0.846875


### Adding Stopwords to CountVectorizer


By default, CountVectorizer and TfidfVectorizer do not filter stopwords. However, they offer some optional settings, including passing in your own stopword list.

The [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class accepts the following arguments:
> *CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, **stop_words=None**, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)*

[TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) supports the same arguments and more. Under *stop_words* we have the following options:
> stop_words : *string {'english'}, list, or None (default)*

That is, we can run `TfidfVectorizer(stop_words='english')` to accept scikit-learn's built-in list,<br>
or `TfidfVectorizer(stop_words=['a', 'and', 'the'])` to filter these three words. In practice we would assign our list to a variable and pass that in instead.
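
As a quick check of the second form, the sketch below fits a vectorizer with a three-word stop list on two invented sentences and prints the resulting vocabulary. The `docs` and `my_stopwords` names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not drawn from the movie review dataset
docs = ['the plot is a mess and the acting is flat',
        'a warm and funny film']

# Assign the list to a variable and pass it in
my_stopwords = ['a', 'and', 'the']

vec = TfidfVectorizer(stop_words=my_stopwords)
vec.fit(docs)

# vocabulary_ maps each kept token to its column index
vocab = sorted(vec.vocabulary_)
print(vocab)
```

Note that the default `token_pattern` shown in the signature above only matches tokens of two or more characters, so a one-letter word such as `a` would be dropped even without a stop list.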

Scikit-learn's built-in list contains 318 stopwords:
> <pre>from sklearn.feature_extraction import text
> print(sorted(text.ENGLISH_STOP_WORDS))</pre>
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 
'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']

However, some words in this list, such as 'not', 'never' and 'nothing', may carry sentiment and so influence the classification of movie reviews. With this in mind, let's use a much shorter list of just 60 words:

#### Creating and Applying Stopwords

In [12]:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']

In [13]:
# ADD STOPWORDS TO THE LINEAR SVC PIPELINE

text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                     ('clf', LinearSVC()),
])

# Fitting the Pipeline
text_clf_lsvc2.fit(X_train, y_train)

# Predict via Pipeline

predictions = text_clf_lsvc2.predict(X_test)

# Print the Metrics

print(metrics.confusion_matrix(y_test,predictions))

print(metrics.classification_report(y_test,predictions))

print(metrics.accuracy_score(y_test,predictions))

[[256  52]
 [ 48 284]]
              precision    recall  f1-score   support

         neg       0.84      0.83      0.84       308
         pos       0.85      0.86      0.85       332

    accuracy                           0.84       640
   macro avg       0.84      0.84      0.84       640
weighted avg       0.84      0.84      0.84       640

0.84375


Accuracy went from 84.7% without stopword filtering to 84.4% with the filter in the pipeline, a difference of just two test reviews out of 640, so filtering did not improve the predictions here. Keep in mind that roughly 2,000 movie reviews is a relatively small dataset. The real gain from stripping stopwords is usually improved processing speed; depending on the size of the corpus, it might save hours.
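
One rough way to see where a speed gain could come from is to count how many tokens a stop list removes before any further processing. The sketch below is an approximation: it tokenizes with a plain regular expression modeled on the default `token_pattern` rather than with scikit-learn itself, and it uses a six-word subset of the 60-word list, both choices made only for illustration.

```python
import re

# A few sentences from the review displayed earlier
text = ("the script's terrible . the special effects are okay , but nothing great . "
        "the story's so weak that it's almost opaque .")

# Subset of the 60-word stop list defined above
stop = {'the', 'are', 'but', 'so', 'that', 'it'}

# Tokens of two or more word characters, similar to the default token_pattern
tokens = re.findall(r'\b\w\w+\b', text.lower())
kept = [t for t in tokens if t not in stop]

print(len(tokens), len(kept))
print(kept)
```

Here 8 of the 19 tokens are filtered out. On a large corpus, that kind of reduction means fewer distinct features and fewer non-zero entries for the classifier to handle, although the actual time saved depends on the corpus and the model.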

In [17]:
# Test our own sentence for sentiment

myreview = 'This is a sweet film about an unusual but notable part of "Hollywood," for a group of people who have, \
for the most part, barely been to Los Angeles.'

# Print the Prediction of the Review - via Naive Bayes
print(text_clf_nb.predict([myreview]))

# Print the Prediction of the Review - via Linear SVC
print(text_clf_lsvc.predict([myreview]))

['pos']
['neg']
