# Introduction

This project uses the Rotten Tomatoes movie review data set from Kaggle to build a sentiment analysis model. The creators of the data set used Amazon's Mechanical Turk to produce fine-grained labels, and the data set is often used as a benchmark for sentiment analysis.

The data set contains movie reviews that are split into phrases. Each observation is a phrase that is tied to a Sentence Id and carries a Sentiment label. The sentiment labels are as follows:

    0 - negative
    1 - somewhat negative
    2 - neutral
    3 - somewhat positive
    4 - positive

# Importing Libraries

The project starts by importing the libraries needed to preprocess, clean, vectorize, and plot the data.

In [1]:
import pandas as pd # Gives us the data structure and operations for easy manipulation of data
import numpy as np  # Allows us to randomly assign training and test observations
from sklearn.feature_extraction.text import TfidfTransformer #To conduct Term Frequency Inverse Document Frequency Transformation
from sklearn.feature_extraction.text import CountVectorizer #To create text document matrix
from sklearn.metrics import accuracy_score, confusion_matrix # Provides libraries to calculate accuracy and confusion matrix
import plotly.tools as pytools # Allows user to connect to their plotly account
import cred # provides credentials for the plotly account
import plotly.plotly as py # Allows us to plot graphs using plotly
import plotly.graph_objs as go # creates graphs using the plotly library
from nltk.stem import PorterStemmer # Provides root for words
from nltk.corpus import stopwords # Provides stop words for the english language
from nltk.tokenize import word_tokenize # Helps tokenize phrases to words
import re # helps program parse strings

# Connecting to the Plotly Server 

We connect to the Plotly server using the set_credentials_file function from the plotly.tools module. The username and api_key credentials come from the cred module.

In [2]:
pytools.set_credentials_file(username=cred.username, api_key=cred.api_key)

# Importing Data into Pandas

We use the read_csv function to load the train.tsv data file into pandas. We then run a few cursory checks on the data using the shape attribute and the head() method.

In [3]:
df=pd.read_csv('train.tsv', sep="\t")

In [4]:
df.shape

(156060, 4)

In [5]:
df.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


As per the Kaggle description, the target variable in the Sentiment column has 5 non-overlapping categories. We check this with the unique() method, which returns 5 distinct values for the Sentiment column.

In [6]:
df['Sentiment'].unique()

array([1, 2, 3, 4, 0], dtype=int64)

# Cleaning the Data

## Stop Words and Stemming 

First, let us talk about stop words. Stop words are very common words, such as "is", "the", and "a", that add little meaning to a sentence and serve primarily grammatical purposes. We want to remove these words to reduce noise in our data set and build better models.

Second, we often use different forms of a word to say the same thing. When someone describes riding a horse, they can use the words "ride", "riding", "rode", "ridden", etc., which all derive from the same word "ride". Stemming maps such variants toward a common root, which reduces the dimensionality of the data set and, in NLP terms, helps create a more normalized data set. Note that a suffix-stripping stemmer such as the Porter stemmer used here handles regular endings like "riding", while irregular forms like "rode" are generally left as they are.

The function stop_and_stem below removes stop words and finds the stems of the remaining words. It does this by:
1. converting the phrase to lower case
2. tokenizing a given phrase into words
3. removing words that fall in the stop_words set
4. using the ps.stem command to convert words to their respective roots

The function also strips punctuation from each remaining token and then joins the tokens back into a single string using the join method.

In [7]:
stop_words=set(stopwords.words("english"))
ps=PorterStemmer()

# Removes stop words, stems the remaining words, and strips punctuation
def stop_and_stem(phrase):
    # Lowercase, tokenize, drop stop words, and stem what is left
    phrase=[ps.stem(word) for word in word_tokenize(phrase.lower()) if word not in stop_words]
    # Strip punctuation from each token (digits are kept, since \w matches them)
    phrase=[re.sub(r'[^\w\s]','',word) for word in phrase]
    return " ".join(phrase)

We then use the apply method on the Phrase column of our data set and pass it the stop_and_stem function to process each phrase.

In [8]:
df['Phrase'] = df['Phrase'].apply(stop_and_stem)

Using the command below, you will see that the words in the phrases have been reduced to their stems.

In [9]:
df.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,seri escapad demonstr adag good goos also good...,1
1,2,1,seri escapad demonstr adag good goos,2
2,3,1,seri,2
3,4,1,,2
4,5,1,seri,2
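
To make the cleaning steps concrete, here is a minimal sketch of the stop-word and punctuation handling in plain Python. It is not the stop_and_stem function above: the four-word `TOY_STOP_WORDS` set and the whitespace tokenizer are assumptions standing in for NLTK's stop-word list and word_tokenize, and stemming is left out. It shows why a phrase made up only of stop words, such as "A", comes back as an empty string.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list.
TOY_STOP_WORDS = {"a", "the", "of", "is"}

def toy_clean(phrase):
    """Lowercase, split on whitespace, drop stop words, strip punctuation."""
    tokens = phrase.lower().split()
    kept = [tok for tok in tokens if tok not in TOY_STOP_WORDS]
    # Remove every character that is neither a word character nor whitespace.
    stripped = [re.sub(r"[^\w\s]", "", tok) for tok in kept]
    return " ".join(stripped)

print(repr(toy_clean("A series of escapades")))
print(repr(toy_clean("A")))
```

Empty phrases like this remain in the data set as rows with no tokens, so they carry no information for the models that follow.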


## Converting Phrase Column to Matrix 

After stemming and stop-word removal, the first step is to convert the phrases into a matrix of token counts. We can do this with the CountVectorizer class from the sklearn.feature_extraction.text module.

Once we create the CountVectorizer object count_vector, we pass the Phrase column to its fit_transform method and assign the resulting matrix to x_df_counts. x_df_counts has one row per observation in the original data set and one column per distinct token.

In [10]:
#Tokenize text with scikit-learn
#Convert a collection of text documents to a matrix of token counts
count_vector=CountVectorizer()
#Learn the vocabulary dictionary and return term-document matrix.
x_df_counts=count_vector.fit_transform(df['Phrase'])
x_df_counts.shape

(156060, 11856)
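
As a rough illustration of what the count matrix holds, the sketch below builds one by hand for three toy phrases. The helper name `toy_count_matrix` and the sample phrases are made up for this example. The token rule (lowercased runs of two or more word characters) and the alphabetical column order follow CountVectorizer's documented defaults, but this is a simplified dense version, not the library's sparse implementation.

```python
import re

def toy_count_matrix(phrases):
    """Return (vocabulary, rows) where rows[i][j] counts term j in phrase i."""
    # Tokens are runs of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", p.lower()) for p in phrases]
    # Columns are the distinct terms in sorted order.
    terms = sorted({tok for toks in tokenized for tok in toks})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    rows = []
    for toks in tokenized:
        row = [0] * len(terms)
        for tok in toks:
            row[vocabulary[tok]] += 1
        rows.append(row)
    return vocabulary, rows

phrases = ["seri escapad demonstr", "seri", "good goos good"]
vocabulary, rows = toy_count_matrix(phrases)
print(vocabulary)
for row in rows:
    print(row)
```

The result has one row per phrase and one column per distinct term, which is how the real matrix ends up with 156060 rows and 11856 columns.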

If you want to look up the column index of a given word stem, you can call get() on the vocabulary_ dictionary and pass the stem of interest as the argument.

In [11]:
#returns index of words
count_vector.vocabulary_.get("seri")

9133

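To get the full list of features, you can use the get_feature_names() method, which returns a list of all the features; the cell below shows entries 1 through 9. In scikit-learn 1.0 and later, this method is replaced by get_feature_names_out().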

In [12]:
count_vector.get_feature_names()[1:10]

['100',
 '10000',
 '100minut',
 '100year',
 '101',
 '102minut',
 '103minut',
 '104',
 '105']

## Term Frequency Inverse Document Frequency

Term Frequency-Inverse Document Frequency, or TF-IDF, is a weighting scheme widely used in information retrieval. It assigns each word a numerical statistic that reflects its importance to a document within a corpus. The TF-IDF value increases with the frequency of the word in the phrase and decreases with the number of phrases in the corpus that contain the word.

We initialize the TfidfTransformer object tfidf_transformer and use its fit_transform method to generate the TF-IDF statistic for each word in each phrase of the Phrase column.

In [13]:
tfidf_transformer=TfidfTransformer()
x_df_tfidf=tfidf_transformer.fit_transform(x_df_counts)
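
The weighting can be sketched by hand on a small count matrix. The function below follows the formula scikit-learn documents for TfidfTransformer with its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper name `toy_tfidf` and the toy counts are assumptions for illustration only.

```python
import math

def toy_tfidf(counts):
    """Apply smoothed idf weighting and L2 row normalization to a dense count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in counts:
        scaled = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        # Rows with no terms (empty phrases) stay all zero.
        weighted.append([v / norm if norm else 0.0 for v in scaled])
    return weighted

counts = [[1, 1, 0], [0, 1, 0], [0, 1, 2]]
for row in toy_tfidf(counts):
    print([round(v, 3) for v in row])
```

In this toy matrix the second term appears in every row, so it receives the smallest idf weight, while the terms that appear in only one row are weighted up.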

# Separating Training and Test set

Before we start building our models, we need to divide the data set into a training set and a test set. We use a uniform random number generator to split the data in an approximately 80:20 training-to-test ratio.

We first set the random seed to 10 with the random.seed() function so that the results can be reproduced. Then we generate 156060 draws from a uniform distribution between 0 and 1 and create a boolean array msk in which draws less than 0.8 are True and all others are False. Finally, we use msk to select rows of x_df_tfidf for the training and test sets.

In [14]:
np.random.seed(10)
msk = np.random.rand(len(df)) < 0.8
train = x_df_tfidf[msk]
test = x_df_tfidf[~msk]

As you can see from the command below, there is an approximate 80-20 split between training and test set.

In [15]:
train.shape, test.shape

((124964, 11856), (31096, 11856))
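
The masking idea can be sketched without NumPy. The snippet below uses Python's `random` module in place of np.random.rand, so the individual draws differ from the ones in the cell above, and the helper name `toy_mask_split` is an assumption for this example. The point is that each observation lands in the training set independently with probability 0.8, which is why the split is approximately, not exactly, 80:20.

```python
import random

def toy_mask_split(items, train_fraction=0.8, seed=10):
    """Split items into (train, test) using one uniform draw per item."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    mask = [rng.random() < train_fraction for _ in items]
    train = [x for x, keep in zip(items, mask) if keep]
    test = [x for x, keep in zip(items, mask) if not keep]
    return train, test

train_items, test_items = toy_mask_split(list(range(10000)))
print(len(train_items), len(test_items))
print(round(len(train_items) / 10000, 3))
```

For the split reported above, 124964 / 156060 is about 0.801, which is consistent with this behaviour.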

# Building Models

Each of the models below is presented in 5 sections:
1. Classifier: Once the relevant module has been imported, we initialize a classifier object. We then call the fit method with the training predictors and the corresponding target values to train the model. Once the model has been trained, we use the predict method to get the training and test predictions.
2. Classifier Parameters: We print the classifier to see the parameters of the model.
3. Accuracy and Confusion Matrix: We compute the accuracy and confusion matrix for the training and test sets using the accuracy_score and confusion_matrix functions from the sklearn.metrics module.
4. Graph Confusion Matrix for Training Data
5. Graph Confusion Matrix for Testing Data
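
For reference, the two metrics in step 3 can be computed by hand. The sketch below is a plain-Python illustration with made-up labels, not the scikit-learn implementation, and the helper names `toy_accuracy` and `toy_confusion_matrix` are assumptions. It uses the same orientation as confusion_matrix: rows are true labels and columns are predicted labels.

```python
def toy_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def toy_confusion_matrix(y_true, y_pred, n_classes=5):
    """matrix[i][j] counts observations with true label i predicted as j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 1, 2, 2, 3, 4, 2, 1]
y_pred = [1, 1, 2, 2, 2, 4, 2, 2]
print(toy_accuracy(y_true, y_pred))
for row in toy_confusion_matrix(y_true, y_pred):
    print(row)
```

The diagonal of the matrix holds the correct predictions, so the sum of the diagonal divided by the number of observations equals the accuracy. In the heat maps, a strong diagonal therefore indicates a better model.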

## Multinomial Naive Bayes

Classifier:

In [16]:
#Training  Multinomial Naive Bayes Model
from sklearn.naive_bayes import MultinomialNB

clf_MultinomialNB=MultinomialNB().fit(train, df[msk]['Sentiment'])
train_pred_MultinomialNB=clf_MultinomialNB.predict(train)
test_pred_MultinomialNB=clf_MultinomialNB.predict(test)


Classifier Parameters:

In [17]:
print(clf_MultinomialNB)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


Accuracy and Confusion Matrix: 

In [18]:
#Model Performance Metrics
accuracy_MultinomialNB_train = accuracy_score(df[msk]['Sentiment'],
                                              train_pred_MultinomialNB)
accuracy_MultinomialNB_test = accuracy_score(df[~msk]['Sentiment'],
                                             test_pred_MultinomialNB)
print("Train Accuracy: ",accuracy_MultinomialNB_train,"\nTest Accuracy:",
      accuracy_MultinomialNB_test)

confusion_MultinomialNB_train=confusion_matrix(df[msk]['Sentiment'],
                                               train_pred_MultinomialNB)
confusion_MultinomialNB_test=confusion_matrix(df[~msk]['Sentiment'],
                                              test_pred_MultinomialNB)


Train Accuracy:  0.62017861144 
Test Accuracy: 0.585477231798


Graph Confusion Matrix for Training Data

In [19]:
#make training confusion matrix heat map
trace1 = go.Heatmap(z=confusion_MultinomialNB_train,
                    x=[0,1,2,3,4], 
                    y=[0,1,2,3,4])
data1=[trace1]
layout1 = go.Layout(title="Multinomial Naive Bayes Confusion Matrix Training",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig1 = go.Figure(data=data1, layout=layout1)
py.iplot(fig1, filename='mnb-train-con-heatmap')

Graph Confusion Matrix for Testing Data

In [20]:
#Making test confusion matrix heat map
trace2 = go.Heatmap(z=confusion_MultinomialNB_test,
                    x=[0,1,2,3,4], 
                    y=[0,1,2,3,4])
data2=[trace2]
layout2 = go.Layout(title="Multinomial Naive Bayes Confusion Matrix Testing",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig2 = go.Figure(data=data2, layout=layout2)
py.iplot(fig2, filename='mnb-test-con-heatmap')

## Support Vector Machine (Linear Kernel)

Classifier:

In [21]:
from sklearn import svm 

clf_svm_Linear = svm.LinearSVC().fit(train, df[msk]['Sentiment'])
train_pred_svm_linear=clf_svm_Linear.predict(train)
test_pred_svm_linear=clf_svm_Linear.predict(test)

Classifier Parameters:

In [22]:
print(clf_svm_Linear)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)


Accuracy and Confusion Matrix:

In [23]:
#Model Performance Metrics
accuracy_svm_linear_train= accuracy_score( df[msk]['Sentiment'],
                                          train_pred_svm_linear)
accuracy_svm_linear_test= accuracy_score( df[~msk]['Sentiment'],
                                         test_pred_svm_linear)
print ("Train Accuracy: ",accuracy_svm_linear_train,
       "\nTest Accuracy:",accuracy_svm_linear_test)

confusion_svm_linear_train=confusion_matrix(df[msk]['Sentiment'],
                                            train_pred_svm_linear)
confusion_svm_linear_test=confusion_matrix(df[~msk]['Sentiment'],
                                           test_pred_svm_linear)


Train Accuracy:  0.696736660158 
Test Accuracy: 0.637992024698


Graph Confusion Matrix for Training Data:

In [24]:
#make training confusion matrix heat map
trace1 = go.Heatmap(z=confusion_svm_linear_train, 
                    x=[0,1,2,3,4], 
                    y=[0,1,2,3,4])
data1=[trace1]
layout1 = go.Layout(title="SVM Confusion Matrix Training",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig1 = go.Figure(data=data1, layout=layout1)
py.iplot(fig1, filename='svm-train-con-heatmap')

Graph Confusion Matrix for Testing Data

In [25]:
#Making test confusion matrix heat map
trace2 = go.Heatmap(z=confusion_svm_linear_test, 
                    x=[0,1,2,3,4],
                    y=[0,1,2,3,4])
data2=[trace2]
layout2 = go.Layout(title="SVM Confusion Matrix Testing",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig2 = go.Figure(data=data2, layout=layout2)
py.iplot(fig2, filename='svm-test-con-heatmap')

## Neural Network

Classifier:

In [26]:
from sklearn.neural_network import MLPClassifier


clf_mlp = MLPClassifier(solver='adam',activation='logistic', 
                        hidden_layer_sizes=(10), 
                        max_iter=400,random_state=1)
clf_mlp=clf_mlp.fit(train, df[msk]['Sentiment'])
train_pred_mlp=clf_mlp.predict(train)
test_pred_mlp=clf_mlp.predict(test)


Stochastic Optimizer: Maximum iterations (400) reached and the optimization hasn't converged yet.



Classifier Parameters:

In [27]:
print(clf_mlp)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=10, learning_rate='constant',
       learning_rate_init=0.001, max_iter=400, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)


Accuracy and Confusion Matrix:

In [28]:
#Model Performance Metrics
accuracy_mlp_train= accuracy_score( df[msk]['Sentiment'],
                                   train_pred_mlp)
accuracy_mlp_test= accuracy_score( df[~msk]['Sentiment'],
                                  test_pred_mlp)
print ("Train Accuracy: ",accuracy_mlp_train,
       "\nTest Accuracy:",accuracy_mlp_test)

confusion_mlp_train=confusion_matrix(df[msk]['Sentiment'],
                                     train_pred_mlp)
confusion_mlp_test=confusion_matrix(df[~msk]['Sentiment'],
                                    test_pred_mlp)

Train Accuracy:  0.796829486892 
Test Accuracy: 0.641658091073


Graph Confusion Matrix for Training Data:

In [29]:
#make training confusion matrix heat map
trace1 = go.Heatmap(z=confusion_mlp_train,
                    x=[0,1,2,3,4], 
                    y=[0,1,2,3,4])
data1=[trace1]
layout1 = go.Layout(title="Neural Network Confusion Matrix Training",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig1 = go.Figure(data=data1, layout=layout1)
py.iplot(fig1, filename='nn-train-con-heatmap')

Graph Confusion Matrix for Testing Data:

In [30]:
#Making test confusion matrix heat map
trace2 = go.Heatmap(z=confusion_mlp_test,
                    x=[0,1,2,3,4],
                    y=[0,1,2,3,4])
data2=[trace2]
layout2 = go.Layout(title="Neural Network Confusion Matrix Testing",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig2 = go.Figure(data=data2, layout=layout2)
py.iplot(fig2, filename='nn-test-con-heatmap')

## Decision Trees

Classifier:

In [31]:
from sklearn import tree


clf_dtrees=tree.DecisionTreeClassifier().fit(train,
                            df[msk]['Sentiment'])
train_pred_dtrees=clf_dtrees.predict(train)
test_pred_dtrees=clf_dtrees.predict(test)

Classifier Parameters:

In [32]:
print(clf_dtrees)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')


Accuracy and Confusion Matrix:

In [33]:
#Model Performance Metrics
accuracy_dtrees_train= accuracy_score( df[msk]['Sentiment'],
                                      train_pred_dtrees)
accuracy_dtrees_test= accuracy_score( df[~msk]['Sentiment'],
                                     test_pred_dtrees)
print ("Train Accuracy: ",accuracy_dtrees_train,
       "\nTest Accuracy:",accuracy_dtrees_test)

confusion_dtrees_train=confusion_matrix(df[msk]['Sentiment'],
                                        train_pred_dtrees)
confusion_dtrees_test=confusion_matrix(df[~msk]['Sentiment'],
                                       test_pred_dtrees)

Train Accuracy:  0.865321212509 
Test Accuracy: 0.62255595575


Graph Confusion Matrix of Training Data:

In [34]:
#make training confusion matrix heat map
trace1 = go.Heatmap(z=confusion_dtrees_train,
                    x=[0,1,2,3,4],
                    y=[0,1,2,3,4])
data1=[trace1]
layout1 = go.Layout(title="Decision Trees Confusion Matrix Training",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig1 = go.Figure(data=data1, layout=layout1)
py.iplot(fig1, filename='dt-train-con-heatmap')

Graph Confusion Matrix for Testing Data:

In [35]:
#Making test confusion matrix heat map
trace2 = go.Heatmap(z=confusion_dtrees_test,
                    x=[0,1,2,3,4], 
                    y=[0,1,2,3,4])
data2=[trace2]
layout2 = go.Layout(title="Decision Trees Confusion Matrix Testing",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig2 = go.Figure(data=data2, layout=layout2)
py.iplot(fig2, filename='dt-test-con-heatmap')

## Random Forest

Classifier:

In [36]:
from sklearn.ensemble import RandomForestClassifier

clf_forest=RandomForestClassifier(max_depth=None,
                                  random_state=0).fit(train,
                                df[msk]['Sentiment'])
train_pred_forest=clf_forest.predict(train)
test_pred_forest=clf_forest.predict(test)

Classifier Parameters:

In [37]:
print(clf_forest)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)


Accuracy and Confusion Matrix:

In [38]:
#Model Performance Metrics
accuracy_forest_train= accuracy_score( df[msk]['Sentiment'],
                                      train_pred_forest)
accuracy_forest_test= accuracy_score( df[~msk]['Sentiment'],
                                     test_pred_forest)
print ("Train Accuracy: ",accuracy_forest_train,
       "\nTest Accuracy:",accuracy_forest_test)

confusion_forest_train=confusion_matrix(df[msk]['Sentiment'],
                                        train_pred_forest)
confusion_forest_test=confusion_matrix(df[~msk]['Sentiment'],
                                       test_pred_forest)

Train Accuracy:  0.856942799526 
Test Accuracy: 0.634551067661


Graph Confusion Matrix for Training Data:

In [39]:
#make training confusion matrix heat map
trace1 = go.Heatmap(z=confusion_forest_train, 
                    x=[0,1,2,3,4], 
                    y=[0,1,2,3,4])
data1=[trace1]
layout1 = go.Layout(title="Random Forest Confusion Matrix Training",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig1 = go.Figure(data=data1, layout=layout1)
py.iplot(fig1, filename='rf-train-con-heatmap')

Graph Confusion Matrix for Testing Data:

In [40]:
#Making test confusion matrix heat map
trace2 = go.Heatmap(z=confusion_forest_test, 
                    x=[0,1,2,3,4], 
                    y=[0,1,2,3,4])
data2=[trace2]
layout2 = go.Layout(title="Random Forest Confusion Matrix Testing",
                xaxis=dict(title='Predicted Sentiment'),
                yaxis=dict(title='True Sentiment'))
fig2 = go.Figure(data=data2, layout=layout2)
py.iplot(fig2, filename='rf-test-con-heatmap')

# Conclusion

Below are the accuracies (training, test) of the above models, as printed in the accuracy cells:
    1. Multinomial Naive Bayes: (62.02%, 58.55%)
    2. SVM                    : (69.67%, 63.80%)
    3. Neural Network         : (79.68%, 64.17%)
    4. Decision Trees         : (86.53%, 62.26%)
    5. Random Forest          : (85.69%, 63.46%)

The Neural Network model had the best test performance overall and a decent training accuracy score.    
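
One way to read these results is to compare the gap between training and test accuracy for each model, using the values printed in the accuracy cells above (rounded to four decimals). The sketch below is simple arithmetic, and the dictionary layout is just one convenient way to hold the numbers.

```python
# (train accuracy, test accuracy) as printed by each model's accuracy cell.
accuracies = {
    "Multinomial Naive Bayes": (0.6202, 0.5855),
    "SVM (linear)": (0.6967, 0.6380),
    "Neural Network": (0.7968, 0.6417),
    "Decision Trees": (0.8653, 0.6226),
    "Random Forest": (0.8569, 0.6346),
}

# Gap between train and test accuracy; a large gap suggests overfitting.
gaps = {name: round(train - test, 4) for name, (train, test) in accuracies.items()}

best_test = max(accuracies, key=lambda name: accuracies[name][1])
largest_gap = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} gap = {gap:.4f}")
print("Best test accuracy:", best_test)
print("Largest train-test gap:", largest_gap)
```

The tree-based models fit the training set far more closely than the held-out data, which suggests overfitting, while Multinomial Naive Bayes and the linear SVM show the smallest gaps.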

## Limitation and Improvement 

While most of the initial limitations of this project have been mitigated, there has not been a significant increase in model performance. Building recurrent neural networks and adding part-of-speech information to the initial data set might help.