# Movie Reviews


## Introduction

For this case study, we use a data-set of 2,000 full-length movie reviews, each of which has been classified as either positive or negative.

A sample review is included below:

> how do films like mouse hunt get into theatres ? 
isn't there a law or something ? 

>this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . 
mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . 
writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . 
the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . 
deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . 
but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . 
the story alternates between unfunny scenes of the brothers bickering over what to do with their inheritance and endless action sequences as the two take on their increasingly determined furry foe . 
whatever promise the film starts with soon deteriorates into boring dialogue , terrible overacting , and increasingly uninspired slapstick that becomes all sound and fury , signifying nothing . 
the script becomes so unspeakably bad that the best line poor lee evens can utter after another run in with the rodent is : " i hate that mouse " . 
oh cringe ! 

>this is home alone all over again , and ten times worse . 
one touching scene early on is worth mentioning . 
we follow the mouse through a maze of walls and pipes until he arrives at his makeshift abode somewhere in a wall . 
he jumps into a tiny bed , pulls up a makeshift sheet and snuggles up to sleep , seemingly happy and just wanting to be left alone . 
it's a magical little moment in an otherwise soulless film . 
a message to speilberg : if you want dreamworks to be associated with some kind of artistic credibility , then either give all concerned in mouse hunt a swift kick up the arse or hire yourself some decent writers and directors . 

>this kind of rubbish will just not do at all . 


### Goal

To classify each review as positive or negative based on its text content.


### Result Summary
The best model attains an accuracy of about 85% on the test data-set.

### Table of Contents
1. Getting and Setting-up the data
2. Data Exploration and Preparation
3. Vectorizing the Data and Training Models
4. Making the Model Better - Adding StopWords
5. Prediction on New Data

***

## 1. Getting and Setting-up Data

In [1]:
# Initial Housekeeping

# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import pandas as pd
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "Moview_Reviews"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

# Create the required folders if they don't already exist
os.makedirs(IMAGES_PATH, exist_ok=True)

# A simple function that helps save images
def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)



In [2]:
# Housekeeping, Part 2

# to make the plots larger in size
from pylab import rcParams
rcParams['figure.figsize'] = 16,10



In [4]:
# Getting the data

df = pd.read_csv('/Users/ramavishwanathan/Desktop/Rama Files/purse/ml_all/NLP_Notes/TextFiles/moviereviews.tsv', sep = '\t')

Having a quick look at the data

In [5]:
df.head()

  label                                             review
0   neg  how do films like mouse hunt get into theatres...
1   neg  some talented actresses are blessed with a dem...
2   pos  this has been an extraordinary year for austra...
3   pos  according to hollywood movies made in last few...
4   neg  my first press screening of 1998 and already i...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
label     2000 non-null object
review    1965 non-null object
dtypes: object(2)
memory usage: 31.3+ KB


In [7]:
# looking at the review column
print(df['review'][10])

upon first viewing of this movie , the phrases " been there " and " done that " come quickly to mind . 
spy hard manages to steal almost every joke from the zucker brothers films , the most popular of which are airplane and the naked gun series . 
stealing stuff can be profitable in this industry , but only when you steal the right stuff . 
what little plot there is involves dick steele , aka . 
agent wd-40 ( leslie nielsen ) trying to save the world from an almost deranged madman played by andy griffith . 
along the way to it goal ( goal ? ) , the film manages to spoof mainly the james bond type films , but also manages to hit on films such as home alone and sister act . 
the trick about spoofing is that you have to actually be funny , or at the least , satirical . 
spy hard achieves neither , as it borrows all of the wrong elements from the superior zucker brothers films . 
the " dick , the world is in danger . 
what is it ? 
well , it's a big roundish ball floating in spac

***

## 2. Data Exploration and Preparation

In this section, we will explore the data and, more importantly, prepare the analytical data-set.

The sub-goals include:
1. Identifying and treating missing values in the data
2. Dividing the data into training and test sets


In [8]:
# Checking for missing values

df.isnull().sum()

label      0
review    35
dtype: int64

There are **35 missing records (NAs)** for the review field. These **missing records will be dropped**.

In [9]:
df = df.dropna()

In [10]:
# Re-checking for missing values

df.isnull().sum()

label     0
review    0
dtype: int64

The next step is to find out whether any of the reviews are blank (empty or whitespace-only strings). These **blank strings do not show up as NAs and have to be detected separately**.

In [11]:
# Collect the index of every blank (whitespace-only) review; these are treated the same as missing values
# Author: Jose Portilla, PIERIAN DATA

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)


27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


There are **27 blank reviews**, and these too have to be dropped.

In [13]:
df = df.drop(blanks)
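
As an alternative to the explicit loop, the same clean-up can be sketched with vectorized pandas string methods. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the reviews file). Note one small difference: `isspace()` is `False` for a zero-length string, whereas the check below also catches `""`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the reviews data
toy = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["dull and slow", "   ", None, "a lovely film"],
})

# Treat missing values as empty strings, strip whitespace,
# and flag every review that has no remaining text
unusable = toy["review"].fillna("").str.strip().eq("")

# Keep only the rows with usable review text
clean = toy[~unusable]
print(clean)
```

This handles NaN values and blank strings in one pass, so a separate `dropna()` call would not be needed in this variant.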

Having a quick look at the counts of positive and negative reviews:

In [14]:
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

It is an **exactly even split between positive and negative reviews** (969 each).

Now, the next step is to divide the data into **training and testing sets**.

In [15]:
# dividing the data into training and test sets

from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
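
The split above is random but not stratified, so the class proportions in the test set can drift slightly from the 50/50 balance of the full data. If an exactly proportional split is wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses a hypothetical toy frame (`toy`) rather than the reviews data, and is an optional variation, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 20 negative and 20 positive "reviews"
toy = pd.DataFrame({
    "review": ["text %d" % i for i in range(40)],
    "label": ["neg"] * 20 + ["pos"] * 20,
})

# stratify keeps the label proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"], toy["label"],
    test_size=0.25, random_state=42,
    stratify=toy["label"],
)

print(y_te.value_counts())
```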

## 3. Vectorizing the Data and Training Models


### Vectorizing Data

The purpose of vectorizing the data is to turn the reviews into a matrix with one row per review and one column for each distinct word found across all the reviews. This *sparse* matrix is called a **Document Term Matrix (DTM)**.

The **Document Term Matrix (DTM)** will be populated with the **term frequency-inverse document frequency (TF-IDF)** value of *each* word.

The **TF-IDF factor** diminishes the weight of terms that occur in many reviews and increases the weight of terms that occur in only a few. This ensures that common words such as articles, names of movie stars etc. aren't overly influential in the model.
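
To make the weighting concrete, here is a minimal sketch on three made-up sentences (the `docs` list is hypothetical, not from the reviews). It builds the DTM with `TfidfVectorizer` and compares the inverse document frequency of a word that appears in every document with one that appears in a single document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical mini "reviews"
docs = [
    "the plot was dull",
    "the acting was superb",
    "the film was long",
]

vec = TfidfVectorizer()
dtm = vec.fit_transform(docs)  # sparse Document Term Matrix

# Rows are documents, columns are distinct terms
print(dtm.shape)

# Map each term to its learned inverse document frequency
idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}

# "the" occurs in every document, "dull" in only one,
# so "the" receives the lower IDF weight
print("the :", idf["the"])
print("dull:", idf["dull"])
```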

### Linear SVM Model

Linear SVM is known to work well for text classification, so we will try it first.


### Pipeline

We'll use a pipeline to integrate the Vectorization and the model building steps.

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf_lsvm = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('clf',LinearSVC())
])
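
To show what the pipeline does behind the scenes, the sketch below runs the two steps by hand and compares the result with a pipeline. The texts and labels are hypothetical toy data, and `random_state=0` is set only so that both classifiers are fitted identically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data
train_texts = [
    "great film loved it", "wonderful acting superb story",
    "a superb and moving film", "loved the wonderful cast",
    "terrible plot awful acting", "boring and dull film",
    "awful script and boring story", "a dull and terrible mess",
]
train_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
test_texts = ["superb film", "dull and awful"]

# Option 1: the pipeline vectorizes and classifies in one object
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

# Option 2: the same steps done manually
vec = TfidfVectorizer()
train_dtm = vec.fit_transform(train_texts)  # learn vocabulary on training data only
test_dtm = vec.transform(test_texts)        # reuse that vocabulary for the test data
clf = LinearSVC(random_state=0).fit(train_dtm, train_labels)
manual_pred = clf.predict(test_dtm)

print(list(pipe_pred), list(manual_pred))
```

The main convenience of the pipeline is that raw text can be passed straight to `fit` and `predict`, and the vocabulary is always learned from the training data alone.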

In [18]:
# Implementing Linear SVM
text_clf_lsvm.fit(X_train, y_train)

# Form a prediction set
Y_test = text_clf_lsvm.predict(X_test)

In [19]:
# A Great Utility, to quickly calculate classification metrics 
# MULTI CLASS SCENARIO

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print("Confusion Matrix: ")
print(confusion_matrix(y_test,Y_test))

print("\n")
print("Classification Report: ")
print(classification_report(y_test,Y_test))


print("\n")
print("Accuracy Score: ")
print(accuracy_score(y_test,Y_test))

Confusion Matrix: 
[[197  36]
 [ 37 215]]


Classification Report: 
              precision    recall  f1-score   support

         neg       0.84      0.85      0.84       233
         pos       0.86      0.85      0.85       252

   micro avg       0.85      0.85      0.85       485
   macro avg       0.85      0.85      0.85       485
weighted avg       0.85      0.85      0.85       485



Accuracy Score: 
0.8494845360824742


The model attains an **accuracy of about 85%** on the 485 test reviews, which is a strong result for a first model.

The other encouraging sign is that precision and recall are consistent across both classes (0.84 to 0.86), which indicates that the model is well balanced.

A quick recap of the precision and recall concepts:
* Precision: proportion of true positives among all the cases the model predicted as positive
* Recall: proportion of true positives among all the cases that are actually positive
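
As a worked check, the figures in the classification report can be recomputed by hand from the confusion matrix reported above. The sketch assumes scikit-learn's convention (rows are actual classes, columns are predicted classes, in the order `neg`, `pos`) and treats `pos` as the positive class.

```python
# Confusion matrix from the Linear SVM output above
# rows = actual (neg, pos), columns = predicted (neg, pos)
cm = [[197, 36],
      [37, 215]]

tn, fp = cm[0]  # actual negatives: predicted neg, predicted pos
fn, tp = cm[1]  # actual positives: predicted neg, predicted pos

precision_pos = tp / (tp + fp)  # of everything predicted pos, how much is truly pos
recall_pos = tp / (tp + fn)     # of everything truly pos, how much was found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_pos, 2), round(recall_pos, 2), round(accuracy, 4))
```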

***

## 4. Making the Model Better - Adding StopWords

We use a short list of 60 stop words.

More details:
By default, **CountVectorizer** and **TfidfVectorizer** do not filter stopwords. However, they **offer some optional settings, including passing in your own stopword list**.
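
The `stop_words` options can be illustrated on two made-up sentences. The sketch below compares the default behaviour, a custom list, and scikit-learn's built-in `'english'` list; the `docs` list and the short custom list are hypothetical and much smaller than the list used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical mini "reviews"
docs = ["this is the best film of the year", "the plot is a mess"]

# Default: no stop-word filtering
default_vec = TfidfVectorizer().fit(docs)

# Custom list passed in by the user
custom_vec = TfidfVectorizer(stop_words=["the", "is", "of", "this"]).fit(docs)

# Built-in English list shipped with scikit-learn
english_vec = TfidfVectorizer(stop_words="english").fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(custom_vec.vocabulary_))
print(sorted(english_vec.vocabulary_))
```

Note that the single-letter word "a" never appears in the vocabulary, even without stop words, because the default tokenizer only keeps tokens of two or more characters.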

In [20]:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can', \
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his', \
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or', \
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this', \
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']

In [21]:
# Add the stop-word list to the Linear SVC pipeline by passing it
# as the stop_words hyperparameter of TfidfVectorizer

# Building the pipeline
text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                     ('clf', LinearSVC()),
])

# Fitting Model
text_clf_lsvc2.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [22]:
# Form a prediction set
Y_test = text_clf_lsvc2.predict(X_test)

In [23]:
# A Great Utility, to quickly calculate classification metrics 
# MULTI CLASS SCENARIO

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print("Confusion Matrix: ")
print(confusion_matrix(y_test,Y_test))

print("\n")
print("Classification Report: ")
print(classification_report(y_test,Y_test))


print("\n")
print("Accuracy Score: ")
print(accuracy_score(y_test,Y_test))

Confusion Matrix: 
[[195  38]
 [ 38 214]]


Classification Report: 
              precision    recall  f1-score   support

         neg       0.84      0.84      0.84       233
         pos       0.85      0.85      0.85       252

   micro avg       0.84      0.84      0.84       485
   macro avg       0.84      0.84      0.84       485
weighted avg       0.84      0.84      0.84       485



Accuracy Score: 
0.843298969072165


Despite adding stop words, the best model is still the initial version of the SVM model: accuracy dropped slightly, from 84.9% to 84.3%.
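
The two test accuracies differ by only a handful of reviews, so a single train/test split may not be enough to rank the two pipelines reliably. One possible way to compare them more robustly is k-fold cross-validation. The sketch below uses hypothetical toy texts and a hypothetical short stop-word list, so its scores say nothing about the real data; it only illustrates the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 8 positive and 8 negative mini "reviews"
pos_texts = [
    "a wonderful and moving film", "superb acting and a great story",
    "loved the brilliant script", "a delightful and funny comedy",
    "great direction and superb cast", "moving story with wonderful music",
    "brilliant and funny from start to end", "a delightful film with great heart",
]
neg_texts = [
    "a dull and boring film", "terrible acting and an awful story",
    "hated the lazy script", "a tedious and unfunny comedy",
    "awful direction and terrible cast", "boring story with annoying music",
    "lazy and dull from start to end", "a tedious film with no heart",
]
texts = pos_texts + neg_texts
labels = ["pos"] * len(pos_texts) + ["neg"] * len(neg_texts)

# Hypothetical short stop-word list for the toy data
toy_stopwords = ["a", "an", "and", "the", "with", "from", "to"]

plain = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", LinearSVC(random_state=0))])
with_stop = Pipeline([("tfidf", TfidfVectorizer(stop_words=toy_stopwords)),
                      ("clf", LinearSVC(random_state=0))])

# 4-fold cross-validation: each fold refits the whole pipeline
scores_plain = cross_val_score(plain, texts, labels, cv=4)
scores_stop = cross_val_score(with_stop, texts, labels, cv=4)

print("no stop words  :", scores_plain.mean())
print("with stop words:", scores_stop.mean())
```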

***

## 5. Prediction on New Data

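The following shows how the model can be applied to new data (reviews). The cell below calls the stop-word pipeline `text_clf_lsvc2`; the initial pipeline `text_clf_lsvm` can be called in the same way.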

In [24]:
myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."

In [25]:
print(text_clf_lsvc2.predict([myreview]))

['neg']
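
Several reviews can be scored in one call, and `decision_function` gives a signed distance from the separating boundary that serves as a rough indication of how strongly the model leans either way. The sketch below trains a small pipeline on hypothetical toy data so that it runs on its own; in this notebook the already fitted pipeline would be used instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data (stand-in for the fitted notebook pipeline)
train_texts = [
    "great film loved it", "wonderful acting superb story",
    "a superb and moving film", "loved the wonderful cast",
    "terrible plot awful acting", "boring and dull film",
    "awful script and boring story", "a dull and terrible mess",
]
train_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
clf.fit(train_texts, train_labels)

# predict expects a list (or other iterable) of raw review strings
new_reviews = [
    "a great and wonderful story",
    "boring script and awful acting",
]
preds = clf.predict(new_reviews)

# Positive margins point to the second class in classes_, negative to the first
margins = clf.decision_function(new_reviews)

for text, label, margin in zip(new_reviews, preds, margins):
    print(label, round(float(margin), 3), text)
```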
