# Introduction

Sentiment analysis is a popular Natural Language Processing (NLP) task that extracts the overall opinion expressed in a text. In this project, we perform sentiment analysis on IMDB movie reviews, classifying each review as positive or negative. When dealing with text data, a prevalent issue is how to encode words as numeric features that a classification algorithm can use. Because words don’t naturally lend themselves to a numeric ordering, many approaches to featurizing text have been proposed. In this project, we use the bag of words model, which uses the count of each word in a text as a feature. We begin with logistic regression, follow with a decision tree approach, and finish with a random forest model.
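
As a rough illustration of the bag of words idea, the sketch below counts the words in a text. The helper name `bag_of_words` and the sample sentences are made up for illustration and are not part of the project code. Two texts that contain the same words in a different order get the same representation, because word order is discarded.

```python
from collections import Counter

def bag_of_words(text):
    """Return a mapping from each whitespace-separated token to its count."""
    return Counter(text.split())

# Hypothetical sample sentences, not taken from the dataset
a = bag_of_words("good movie not bad")
b = bag_of_words("bad movie not good")

print(a["good"], a["terrible"])  # a word that never occurs has count 0
print(a == b)                    # word order is lost, so the two bags are equal
```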

# Importing the tools

In [11]:
import numpy as np
import pandas as pd
import os
import sklearn
import sklearn.linear_model
from sklearn.model_selection import train_test_split, GridSearchCV
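import sklearn.model_selection as grid_search  # sklearn.grid_search was removed in scikit-learn 0.20; model_selection provides GridSearchCV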
from sklearn import tree
from sklearn import model_selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

# Reading in Data
Using the function readFile(filename), we will read the contents of a text file and output the contents of the file as a list containing single words. Using this function, we will read all the files into a Pandas Dataframe, in which each row represents a text file and the columns contain the counts of each word in that specific text file. Note that there should be a column for every possible word that occurs throughout all text files, so all the columns together form the unique vocabulary for the dataset. If a word does not appear in a particular file, let its count be 0. In addition, add a column in this Dataframe with the document label, containing either the value ’positive’ or ’negative’. You may also want to add a column with the file name, so you can later check which reviews were incorrectly classified.
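
The table described above can be sketched without pandas as follows. This is only an illustration under assumed names (`count_table` and the toy documents are hypothetical): every document becomes one row over a shared vocabulary, and a word that does not occur in a document gets a count of 0.

```python
from collections import Counter

def count_table(documents):
    """Build one row of word counts per document over a shared vocabulary.

    documents: list of token lists (hypothetical input, e.g. from readFile).
    Returns (vocabulary, rows), where rows[i][j] is the count of
    vocabulary[j] in documents[i], or 0 if the word does not occur there.
    """
    counts = [Counter(tokens) for tokens in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

docs = [["great", "great", "film"], ["dull", "film"]]
vocabulary, rows = count_table(docs)
print(vocabulary)  # ['dull', 'film', 'great']
print(rows)        # [[0, 1, 2], [1, 1, 0]]
```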

In [2]:
def segmentWords(s):
    # split a string on whitespace into a list of words
    return s.split()

def readFile(fileName):
    # Function for reading file
    # input: filename as string
    # output: contents of file as list containing single words
    with open(fileName) as f:
        # the context manager closes the file even if reading fails
        contents = f.read()
    return segmentWords(contents)

#### Create a Dataframe containing the counts of each word in a file

In [3]:
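from collections import Counter

d = []

for c in os.listdir("data_training/train"):
    directory = os.path.join("data_training/train", c)
    for f in os.listdir(directory):
        words = readFile(os.path.join(directory, f))
        # count every word in one pass instead of rescanning the list for each word
        e = dict(Counter(words))
        e['__FileID__'] = f
        # folders whose names start with 'pos' hold the positive reviews
        e['__CLASS__'] = 1 if c[:3] == 'pos' else 0
        d.append(e)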

**Create a DataFrame from d, making sure to fill all the NaN values with zeros.**


In [4]:
df = pd.DataFrame(d).fillna(0)

In [71]:
print(df.shape)
df.head()

(1400, 42776)


Unnamed: 0,,earth,goodies,if,ripley,suspend,they,white,,,...,zukovsky,zundel,zurg's,zweibel,zwick,zwick's,zwigoff's,zycie,zycie',|
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
df.describe()

Unnamed: 0,,earth,goodies,if,ripley,suspend,they,white,,,...,zukovsky,zundel,zurg's,zweibel,zwick,zwick's,zwigoff's,zycie,zycie',|
count,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,...,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0
mean,0.000714,0.000714,0.000714,0.000714,0.000714,0.000714,0.000714,0.000714,0.003571,0.008571,...,0.000714,0.001429,0.000714,0.000714,0.006429,0.002857,0.001429,0.000714,0.000714,0.001429
std,0.026726,0.026726,0.026726,0.026726,0.026726,0.026726,0.026726,0.026726,0.080127,0.272517,...,0.026726,0.053452,0.026726,0.026726,0.128058,0.065426,0.037783,0.026726,0.026726,0.037783
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,10.0,...,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,1.0,1.0


In [7]:
print(df.__FileID__.head())
df.__CLASS__.tail()

0    cv676_22202.txt
1     cv155_7845.txt
2    cv465_23401.txt
3    cv398_17047.txt
4    cv206_15893.txt
Name: __FileID__, dtype: object


1395    1
1396    1
1397    1
1398    1
1399    1
Name: __CLASS__, dtype: int64

# Training/Validation Split
Because we don’t have access to the labels of the test set, we randomly shuffle the dataset and split the data into a training set and validation set, so we can test our trained model on the validation set. In general, even if you have access to the labels of the test set, it is a good idea to use a validation set to prevent overfitting to the test set. (Hint: Use train_test_split from sklearn.model_selection)

#### Split data into training and validation set 

* Sample 80% of your dataframe to be the training data
* Let the remaining 20% be the validation data (you can filter out the indices of the original dataframe that weren't selected for the training data)
* Split the dataframe for both training and validation data into x and y dataframes, where y contains the labels and x contains the words

In [8]:
features = df.drop(['__FileID__', '__CLASS__'], axis=1)
labels = df.__CLASS__
X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(features, labels, test_size=0.2, 
                                                                         random_state=42)

In [9]:
# check the shapes of the feature and label sets produced by the split above
print(X_train.shape, X_val.shape, Y_train.shape, Y_val.shape)

(1120, 42774) (280, 42774) (1120,) (280,)
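
For intuition, a shuffled 80/20 split can be sketched in plain Python as below. This is not how `train_test_split` is implemented, and it will not reproduce the same rows; the function name `shuffle_split` and its rounding rule are assumptions made for illustration.

```python
import random

def shuffle_split(n_rows, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) for a random split of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # a fixed seed makes the split reproducible
    n_val = int(round(n_rows * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = shuffle_split(1400)
print(len(train_idx), len(val_idx))  # 1120 280
```

No row index appears in both sets, which is what allows the validation accuracy to serve as an estimate of performance on unseen reviews.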


# Logistic Regression
Now we train a basic logistic regression model to classify the sentiment of the reviews. Make sure you do not use the filename as a feature if you previously included it in the Dataframe. Compare the accuracy of this basic model on the training set and the validation set. Are you overfitting? Try changing the parameters of the logistic regression, such as adding a regularization term, to reduce the overfitting.

#### Basic Logistic Regression
* Use sklearn's linear_model.LogisticRegression() to create model.
* Fit the data and labels with model.
* Score model with the same data and labels.

In [12]:
logreg = sklearn.linear_model.LogisticRegression()
logreg.fit(X_train, Y_train)
print("Train acc:", logreg.score(X_train, Y_train), "\nValidation acc:", 
      logreg.score(X_val, Y_val))

Train acc: 1.0 
Validation acc: 0.839285714286


A training accuracy of 1.0 alongside a validation accuracy of about 0.84 shows that our base logistic regression model is overfitting the training data. We will need to change the parameters of the model to reduce overfitting.
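
For reference, the scikit-learn documentation states the L2-penalized logistic regression objective roughly as follows, for labels $y_i \in \{-1, +1\}$, weights $w$, and intercept $c$ (the L1 variant replaces $\tfrac{1}{2} w^\top w$ with $\lVert w \rVert_1$):

$$\min_{w,\,c} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)$$

Because $C$ multiplies the data-fit term rather than the penalty, a small $C$ means strong regularization and a large $C$ means weak regularization. The default is $C = 1.0$.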

## Changing Parameters

In [13]:
# C is the inverse of the regularization strength: smaller values regularize more strongly
Cs = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
# type of regularization penalty applied to the weights
penalty = ['l1', 'l2']
# initialize the dictionary of parameters
param_grid = {'C': Cs, 'penalty' : penalty}
# initialize the cross-validated grid search over all (C, penalty) combinations;
# the liblinear solver supports both the l1 and l2 penalties
lr = sklearn.linear_model.LogisticRegression(solver='liblinear')
search = GridSearchCV(lr, param_grid)
# fit the search object to our input training data
search.fit(X_train, Y_train)
# output the best parameters
search.best_params_

{'C': 1000, 'penalty': 'l1'}
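
To make the search concrete, the sketch below enumerates the candidate settings that a grid search over this parameter grid has to evaluate, and shows one simple way to cut the training rows into folds. The helper names are hypothetical, the fold count of 3 is an assumption (the default differs between scikit-learn versions), and the real `GridSearchCV` uses its own splitting strategy.

```python
from itertools import product

def grid_combinations(param_grid):
    """Yield one dict per combination of the values in param_grid."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def kfold_indices(n_rows, n_folds):
    """Split range(n_rows) into n_folds contiguous validation folds."""
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

param_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'penalty': ['l1', 'l2']}
combos = list(grid_combinations(param_grid))
print(len(combos))  # 16 candidate settings
print(combos[0])    # {'C': 0.0001, 'penalty': 'l1'}
print([len(f) for f in kfold_indices(1120, 3)])  # [374, 373, 373]
```

Each candidate is fit once per fold, so the cost of the search grows with the product of the grid size and the number of folds.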

In [16]:
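# liblinear supports the l1 penalty (newer scikit-learn releases default to lbfgs, which does not)
logreg2 = sklearn.linear_model.LogisticRegression(penalty='l1', C=1000, solver='liblinear')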
logreg2.fit(X_train, Y_train)
print("Training acc:", logreg2.score(X_train, Y_train), "\nValidation acc:", 
      logreg2.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.885714285714


Our tuned logistic regression model still fits the training data perfectly, but its validation accuracy rose from 83.9% to 88.6%, a significant improvement on the measure that matters most.

# Backward Stepwise Selection to Reduce Features
Overfitting mainly occurs if the model is too expressive for the given task. As a result, it is able to not only fit the pattern in the training data, but also the randomness of the dataset, which thereby causes a high accuracy on the training data and a low accuracy on the test data. One way to reduce the expressiveness of the model is by reducing the number of features. Currently, we used the count of every word present as a feature, but perhaps some words are more indicative of the sentiment of the review than others. Those other words, which don’t contribute much to determining the sentiment, may be overfitting to the noise in the training set. One way to identify these features is to look at the features whose weights are close to 0 (remember to normalize the weights if using this method). This process is called Backward Stepwise Selection, which aims to remove the features whose removal would reduce the test error.
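
The idea of dropping features whose normalized weights are close to 0 can be sketched as below. This is a single pruning pass, which is simpler than a full stepwise procedure that refits the model after each removal; the function name, the threshold, and the example weights are all hypothetical.

```python
def select_features(names, weights, threshold=0.1):
    """Keep features whose normalized absolute weight is at least threshold.

    Weights are normalized by the largest absolute weight, so threshold is
    a fraction of the strongest feature's weight (an illustrative choice).
    """
    largest = max(abs(w) for w in weights)
    if largest == 0:
        return []
    return [name for name, w in zip(names, weights)
            if abs(w) / largest >= threshold]

# Hypothetical words and coefficients, not taken from the fitted model
names = ["great", "the", "awful", "film"]
weights = [1.8, 0.02, -2.0, -0.05]
print(select_features(names, weights))  # ['great', 'awful']
```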

## Feature Selection/Reduction
* In the backward stepwise selection method, you can remove coefficients and the corresponding x columns where the coefficient is more than a particular amount away from the mean; you can choose how far from the mean is reasonable.


In [18]:
# A recursive feature elimination approach
from sklearn.feature_selection import RFE

# A new logistic regression model with parameters from above and a feature selector
# (liblinear is specified because it supports the l1 penalty)
lr2 = sklearn.linear_model.LogisticRegression(C=1000, penalty='l1', solver='liblinear')
selector = RFE(lr2, step=10000, n_features_to_select=41000)

In [19]:
# fit RFE selector to training set
selector.fit(X_train, Y_train)
lr2 = selector.estimator_

In [20]:
# figure out which columns to drop
columns = features.columns
feature_mask = selector.support_
columns_to_drop = [columns[i] for i in range(columns.size) if not feature_mask[i]]

In [22]:
# Create print function to print scores of estimators
def print_results(estimator, X, y, leadingString=''):
    print(leadingString, estimator.score(X, y))

In [23]:
# show training and testing accuracies after feature reduction
print_results(lr2, X_train.drop(columns_to_drop, axis=1), Y_train, "Training results: ")
print_results(lr2, X_val.drop(columns_to_drop, axis=1), Y_val, "Testing results: ")

Training results:  1.0
Testing results:  0.892857142857


* We selected features here using recursive feature elimination (RFE), which reduces overfitting by shrinking the hypothesis space of our logistic regression model. The gain over the L1-regularized model above was small: validation accuracy rose from 88.6% to 89.3%. When dealing with data at scale, even an increase of this size may justify the computation time and cost of running a feature reduction method such as Backward Stepwise Selection.

# Single Decision Tree

#### Basic Decision Tree

* Initialize model as a decision tree with sklearn.
* Fit the data and labels to the model.


In [25]:
dt_clf = tree.DecisionTreeClassifier(criterion='entropy')
dt_clf.fit(X_train, Y_train)
print("Training acc:", dt_clf.score(X_train, Y_train), "\nValidation acc:", dt_clf.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.635714285714


Why is a single decision tree so prone to overfitting?

A single decision tree is prone to overfitting because, unless its growth is restricted, it keeps splitting until its leaf nodes are as pure as possible. This is undesirable because the tree then fits the noise in the training set and misestimates the underlying data-generating distribution.
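
Node purity under the `entropy` criterion can be illustrated with a small sketch. The helper names and toy labels are hypothetical, and scikit-learn's own implementation is considerably more involved. A split is scored by how much it reduces entropy, and an unrestricted tree keeps splitting until every leaf reaches an entropy of 0.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left and right."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

parent = [1, 1, 0, 0]
print(entropy(parent))                           # 1.0 bit: maximally mixed
print(information_gain(parent, [1, 1], [0, 0]))  # 1.0: a perfectly pure split
```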

#### Changing Parameters
* To test out which value is optimal for a particular parameter, you can either loop through various values or look into `sklearn.model_selection.GridSearchCV`
* Side note: next time we should graph the differences in testing and training accuracies between each combination of parameters

In [26]:
parameters = {"max_depth": [None, 10, 100, 1000, 10000],
              "min_samples_split": [5, 10, 50, 100, 500, 1000],
              "min_samples_leaf": [10, 100, 1000, 10000],
              "max_leaf_nodes": [None, 10, 100, 1000, 10000],
              }
gridsearch = GridSearchCV(dt_clf, parameters)
gridsearch.fit(X_train, Y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [5, 10, 50, 100, 500, 1000], 'min_samples_leaf': [10, 100, 1000, 10000], 'max_depth': [None, 10, 100, 1000, 10000], 'max_leaf_nodes': [None, 10, 100, 1000, 10000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [27]:
# show the best parameters of the gridsearchCV regularized decision tree
gridsearch.best_params_

{'max_depth': None,
 'max_leaf_nodes': 10000,
 'min_samples_leaf': 10,
 'min_samples_split': 100}

In [29]:
# use parameters from gridsearchCV above in new decision tree model
reg_tree = gridsearch.best_estimator_

print("Training acc:", reg_tree.score(X_train, Y_train), "\nValidation acc:",
      reg_tree.score(X_val, Y_val))

Training acc: 0.757142857143 
Validation acc: 0.621428571429


We chose these parameters with a grid search over orders of magnitude for each parameter, which revealed the approximate scale of the best values. From there we performed a manual search for fine-tuning. The manually tuned parameters and the corresponding training and validation accuracies are below.

In [32]:
# create decision tree model with manually searched parameters (best in class)
reg_tree2 = tree.DecisionTreeClassifier(criterion = "entropy", max_depth = None, max_leaf_nodes = 125, min_samples_leaf = 2, min_samples_split = 60)
reg_tree2.fit(X_train, Y_train)

# print model training and test accuracies
print_results(reg_tree2, X_train, Y_train, "Training score: ")
print_results(reg_tree2, X_val, Y_val, "Testing score: ")

Training score:  0.858035714286
Testing score:  0.692857142857


As seen above, our decision tree with manually searched parameters produced a better validation accuracy (69.3%) than both our base decision tree (63.6%), which severely overfit the training data, and our decision tree with parameters chosen by grid search cross-validation (62.1%). Still, our best decision tree above is significantly less accurate than our grid-search-tuned logistic regression model above (about 89% validation accuracy).

**Add a Boost to your Decision Tree Classifier using AdaBoost( )**

Train an AdaBoost classifier using the tuned decision tree model above as the base estimator. Boosting trains weak learners sequentially so they focus on the points that are hard to classify, so make sure to limit each individual tree so it is individually weak.
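
How boosting shifts attention to hard points can be sketched with one round of the classic discrete AdaBoost update. This is an illustration only: scikit-learn's `AdaBoostClassifier` may use a different variant, and the function name and toy labels are hypothetical. Labels and predictions are assumed to be in {-1, +1}.

```python
from math import exp, log

def adaboost_round(weights, y_true, y_pred):
    """One discrete-AdaBoost update; labels and predictions are in {-1, +1}.

    Returns (alpha, new_weights): the learner's vote weight and the
    re-normalized sample weights for the next learner.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < err < 1:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * log((1 - err) / err)
    raw = [w * exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]      # the last point is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights])  # [0.167, 0.167, 0.167, 0.5]
```

After the update, the misclassified point carries half of the total weight, so the next learner is pushed to get it right. The update is undefined when a learner makes no weighted errors, which is one reason boosting is usually applied to weak learners.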

In [33]:
# train an AdaBoost classifier using the tuned decision tree model above as the base estimator
# (newer scikit-learn releases name this argument `estimator` instead of `base_estimator`)
boost_clf = AdaBoostClassifier(base_estimator=reg_tree2, n_estimators=100)
boost_clf.fit(X_train, Y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=125,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=60,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=100, random_state=None)

In [34]:
print("Training acc:", boost_clf.score(X_train, Y_train), "\nValidation acc:",
      boost_clf.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.739285714286


The AdaBoost-boosted decision tree classifier produces the best validation accuracy among the decision tree models, at about 74%. With a training accuracy of 1.0 the model is still overfitting the training data, but it generalizes well enough to predict better than the base decision tree and the grid-search-tuned decision tree. Still, logistic regression, a simpler model, seems to be the better, more accurate option for sentiment analysis with this particular IMDB movie review text data.

# Random Forest Classifier

#### Basic Random Forest

* Use sklearn's ensemble.RandomForestClassifier() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.



In [40]:
rfc = RandomForestClassifier(criterion = 'entropy', n_estimators=100)
rfc.fit(X_train, Y_train)
print("Training acc:", rfc.score(X_train, Y_train), "\nValidation acc:",
      rfc.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.821428571429


## Changing Parameters

parameters = {"min_samples_split": [2, 5, 10],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20],
              }
gridsearch2 = GridSearchCV(rfc, parameters)
gridsearch2.fit(X_train, Y_train)

In [42]:
gridsearch2.best_params_

{'max_leaf_nodes': None, 'min_samples_leaf': 10, 'min_samples_split': 100}

In [63]:
reg_forest = gridsearch2.best_estimator_
reg_forest.fit(X_train, Y_train)
print("Training acc:", reg_forest.score(X_train, Y_train), "\nValidation acc:",
      reg_forest.score(X_val, Y_val))

Training acc: 0.935714285714 
Validation acc: 0.817857142857


After tuning the parameters of our random forest classifier using sklearn's GridSearchCV, we reduced the model's overfitting of the training set: training accuracy fell from 100% to 93.6%. However, the validation accuracy of the tuned model was slightly lower, at 81.8% compared with 82.1%.

**Add a Boost to Random Forest model with sklearn's AdaBoostClassifier( )**

In [65]:
# newer scikit-learn releases name this argument `estimator` instead of `base_estimator`
boost_reg2 = AdaBoostClassifier(base_estimator=reg_forest)
boost_reg2.fit(X_train, Y_train)
print("Training acc:", boost_reg2.score(X_train, Y_train), "\nValidation acc:",
      boost_reg2.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.882142857143


What parameters did you choose to change and why?

We regularized the model by running a standard grid search over min_samples_split, max_depth, min_samples_leaf, and max_leaf_nodes, because these hyperparameters limit how far each tree in the forest can grow. The resulting model brought the training accuracy down from 1.0 to about .94, while the validation accuracy stayed at about .82.

Finally, we trained an AdaBoostClassifier model using our regularized random forest as our base_estimator and obtained a training accuracy of 1.0 and a validation accuracy of about .88. Although the model seems to overfit the training data, it also produces the highest validation accuracy among our tree-based models. We then ran a standard grid search on the n_estimators hyperparameter of this AdaBoostClassifier, which selected a value of 50 for n_estimators; the resulting model again gave a training accuracy of 1.0 and a validation accuracy of .88. The AdaBoost classifiers most likely overfit the training data because our base estimator (the regularized random forest) is a relatively strong estimator rather than the weak estimator that boosting calls for. Thus, the boosted random forest does not perform better than the simpler logistic regression model, which reached a validation accuracy of 88.6% after parameter tuning and 89.3% after feature selection.

How does a random forest classifier prevent overfitting better than a single decision tree?

A random forest reduces overfitting relative to a single decision tree by training many trees, each on a bootstrap sample of the training data and with a random subset of features considered at each split, and then combining their predictions by majority vote. Individual trees may still overfit, but their errors are partly uncorrelated, so combining them lowers the variance of the ensemble. AdaBoost, by contrast, trains weak learners sequentially so that each one focuses on the points that are hard to classify, which is why each individual tree should be limited so that it is individually weak.
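
The two ingredients named above, resampling the training rows and voting, can be sketched in plain Python. The helper names and toy data are hypothetical, and scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is only an approximation of the idea.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree in a forest does."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common label among the individual trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)
print(len(sample), len(set(sample)))   # same size, usually fewer distinct rows

# Hypothetical votes from five trees on one review (1 = positive)
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```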