# Importing the tools

In [None]:
import numpy as np
import pandas as pd
import os
import sklearn
import sklearn.linear_model
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import model_selection as grid_search  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn import tree
from sklearn import model_selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

# Reading in Data
Using the function readFile(fileName), we will read the contents of a text file and return them as a list of single words. With this function, we will read all the files into a pandas DataFrame in which each row represents a text file and each column holds the count of one word in that file. There should be a column for every word that occurs anywhere in the text files, so the columns together form the vocabulary of the dataset. If a word does not appear in a particular file, let its count be 0. In addition, add a column with the document label, containing either the value ’positive’ or ’negative’. You may also want to add a column with the file name, so you can later check which reviews were incorrectly classified.

In [None]:
def segmentWords(s):
    return s.split()

def readFile(fileName):
    # Function for reading file
    # input: filename as string
    # output: contents of file as list containing single words
    with open(fileName) as f:  # the context manager closes the file even if reading fails
        contents = f.read()
    return segmentWords(contents)

#### Create a Dataframe containing the counts of each word in a file

In [None]:
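from collections import Counter

d = []

for c in os.listdir("data_training/train"):
    directory = "data_training/train/" + c
    for f in os.listdir(directory):
        words = readFile(directory + "/" + f)
        # Counter tallies every word in one pass; words.count(x) rescans the list for each word
        e = dict(Counter(words))
        e['__FileID__'] = f
        # class folders starting with 'pos' are labelled 1, all others 0
        e['__CLASS__'] = 1 if c[:3] == 'pos' else 0
        d.append(e)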

**Create a dataframe from d - make sure to fill all the nan values with zeros.**


In [None]:
df = pd.DataFrame(d).fillna(0)
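
To make the shape of the resulting table concrete, here is a minimal standard-library sketch of the same idea on two made-up documents. The file names and texts are hypothetical stand-ins for the review files; the notebook itself does this step with pandas.

```python
from collections import Counter

# Two tiny stand-in "reviews" (hypothetical); the notebook reads these from files.
docs = {
    "pos_example.txt": "a great great film",
    "neg_example.txt": "a dull film",
}

# One Counter per document: each distinct word maps to its count.
rows = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is the union of all words seen in any document.
vocabulary = sorted(set().union(*rows.values()))

# A Counter returns 0 for a missing key, which plays the role of fillna(0).
matrix = {name: [counts[word] for word in vocabulary] for name, counts in rows.items()}

print(vocabulary)
for name, vector in matrix.items():
    print(name, vector)
```

Each row has one entry per vocabulary word, so most entries are 0 once the vocabulary grows to tens of thousands of words.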

In [None]:
print(df.shape)
df.head()

(1400, 42776)


Unnamed: 0,,earth,goodies,if,ripley,suspend,they,white,,,...,zukovsky,zundel,zurg's,zweibel,zwick,zwick's,zwigoff's,zycie,zycie',|
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df.describe()

Unnamed: 0,,earth,goodies,if,ripley,suspend,they,white,,,...,zukovsky,zundel,zurg's,zweibel,zwick,zwick's,zwigoff's,zycie,zycie',|
count,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,...,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0
mean,0.000714,0.000714,0.000714,0.000714,0.000714,0.000714,0.000714,0.000714,0.003571,0.008571,...,0.000714,0.001429,0.000714,0.000714,0.006429,0.002857,0.001429,0.000714,0.000714,0.001429
std,0.026726,0.026726,0.026726,0.026726,0.026726,0.026726,0.026726,0.026726,0.080127,0.272517,...,0.026726,0.053452,0.026726,0.026726,0.128058,0.065426,0.037783,0.026726,0.026726,0.037783
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,10.0,...,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,1.0,1.0


In [None]:
print(df.__FileID__.head())
df.__CLASS__.tail()

0    cv676_22202.txt
1     cv155_7845.txt
2    cv465_23401.txt
3    cv398_17047.txt
4    cv206_15893.txt
Name: __FileID__, dtype: object


1395    1
1396    1
1397    1
1398    1
1399    1
Name: __CLASS__, dtype: int64

# Training/Validation Split
Because we don’t have access to the labels of the test set, we randomly shuffle the dataset and split the data into a training set and validation set, so we can test our trained model on the validation set. In general, even if you have access to the labels of the test set, it is a good idea to use a validation set to prevent overfitting to the test set. (Hint: Use train_test_split from sklearn.model_selection)

#### Split data into training and validation set

* Sample 80% of your dataframe to be the training data
* Let the remaining 20% be the validation data (you can filter out the indices of the original dataframe that weren't selected for the training data)
* Split the dataframe for both training and validation data into x and y dataframes - where y contains the labels and x contains the words

In [None]:
features = df.drop(['__FileID__', '__CLASS__'], axis=1)
labels = df.__CLASS__
X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(features, labels, test_size=0.2,
                                                                         random_state=42)

In [None]:
# check the shapes of the training and validation splits created above
print(X_train.shape, X_val.shape, Y_train.shape, Y_val.shape)

(1120, 42774) (280, 42774) (1120,) (280,)
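
For intuition about what the split does, the following is a minimal standard-library sketch of a shuffled hold-out split. The helper name `shuffle_split` is hypothetical, and the sketch will not reproduce the exact rows that `train_test_split` selects, because the two use different random number generators.

```python
import random

def shuffle_split(n_rows, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)
    n_val = round(n_rows * val_fraction)
    # The first n_val shuffled indices form the validation set, the rest train.
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = shuffle_split(1400)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters here for the same reason as `random_state=42` above: every later accuracy comparison is made on the same 280 held-out reviews.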


# Logistic Regression
Now we train a basic logistic regression model to classify the sentiment of the reviews. Make sure you do not use the filename as a feature if you previously included it in the Dataframe. Compare the accuracy of this basic model on the training set and the validation set. Are you overfitting? Try changing the parameters of the logistic regression, such as adding a regularization term, to reduce the overfitting.

In [None]:
logreg = sklearn.linear_model.LogisticRegression()
logreg.fit(X_train, Y_train)
print("Train acc:", logreg.score(X_train, Y_train), "\nValidation acc:",
      logreg.score(X_val, Y_val))

Train acc: 1.0 
Validation acc: 0.839285714286


In [None]:
# C is the inverse of the regularization strength: smaller C means a stronger penalty
Cs = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
# penalty types to compare: L1 (encourages sparse weights) and L2 (encourages small weights)
penalty = ['l1', 'l2']
# initialize the dictionary of parameters
param_grid = {'C': Cs, 'penalty' : penalty}
# initialize the search, which scores every combination with n-fold cross validation
# liblinear supports both penalties; newer scikit-learn defaults to lbfgs, which rejects 'l1'
lr = sklearn.linear_model.LogisticRegression(solver='liblinear')
search = GridSearchCV(lr, param_grid)
# fit the search object to our input training data
search.fit(X_train, Y_train)
# output the best parameters
search.best_params_

{'C': 1000, 'penalty': 'l1'}

In [None]:
logreg2 = sklearn.linear_model.LogisticRegression(penalty='l1', C=1000, solver='liblinear')  # liblinear supports the L1 penalty
logreg2.fit(X_train, Y_train)
print("Training acc:", logreg2.score(X_train, Y_train), "\nValidation acc:",
      logreg2.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.885714285714


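Tuning the penalty raised the validation accuracy from 83.9% to 88.6%, which is the number that matters most. The model still fits the training set perfectly, though, so it is still overfitting. Note that C is the inverse of the regularization strength in scikit-learn, so C=1000 is a weak penalty; the improvement most likely comes from the sparsity that the L1 penalty encourages rather than from stronger shrinkage.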
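
To show how C trades the penalty off against the data fit, here is a minimal sketch of the penalized objective in the form scikit-learn documents for logistic regression (penalty plus C times the summed logistic loss). The intercept is left out for brevity, and the toy data, the weights, and the function names are all made up for illustration.

```python
import math

def log_loss_sum(weights, X, y):
    """Summed logistic loss for labels y in {-1, +1} (no intercept)."""
    total = 0.0
    for features, label in zip(X, y):
        margin = label * sum(w * x for w, x in zip(weights, features))
        total += math.log1p(math.exp(-margin))
    return total

def penalty(weights, kind):
    """L1 penalty: sum of |w|. L2 penalty: half the sum of w squared."""
    if kind == "l1":
        return sum(abs(w) for w in weights)
    if kind == "l2":
        return 0.5 * sum(w * w for w in weights)
    raise ValueError(kind)

def objective(weights, X, y, C, kind):
    """Penalty plus C times the data loss: a larger C weights the data more."""
    return penalty(weights, kind) + C * log_loss_sum(weights, X, y)

# Toy data and weights, purely illustrative.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]
w = [2.0, -1.0]

# Share of the objective that comes from the penalty, for increasing C.
shares = [penalty(w, "l1") / objective(w, X, y, C, "l1") for C in (0.01, 1, 1000)]
print([round(s, 4) for s in shares])
```

The penalty's share of the objective shrinks as C grows, which is why the largest value in the grid corresponds to the weakest regularization.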

In [None]:
# A recursive feature elimination approach
from sklearn.feature_selection import RFE

# A new logistic regression model with parameters from above and a feature selector
# (solver='liblinear' because the lbfgs solver does not support the L1 penalty)
lr2 = sklearn.linear_model.LogisticRegression(C=1000, penalty='l1', solver='liblinear')
selector = RFE(lr2, step=10000, n_features_to_select=41000)

In [None]:
# fit RFE selector to training set
selector.fit(X_train, Y_train)
lr2 = selector.estimator_

In [None]:
# figure out which columns to drop
columns = features.columns
feature_mask = selector.support_
columns_to_drop = [columns[i] for i in range(columns.size) if not feature_mask[i]]

In [None]:
# Create print function to print scores of estimators
def print_results(estimator, X, y, leadingString=''):
    print(leadingString, estimator.score(X, y))

In [None]:
# show training and testing accuracies after feature reduction
print_results(lr2, X_train.drop(columns_to_drop, axis=1), Y_train, "Training results: ")
print_results(lr2, X_val.drop(columns_to_drop, axis=1), Y_val, "Testing results: ")

Training results:  1.0
Testing results:  0.892857142857


* We selected features here using recursive feature elimination (RFE), which reduces overfitting by shrinking the hypothesis space of our logistic regression model. It did not reduce overfitting much more than the L1 regularization above: training accuracy stayed at 100%, and validation accuracy rose only from 88.6% to 89.3%, which is 2 more correctly classified reviews out of 280. When dealing with data at scale, even an increase of that size may be worth the computation time and cost of running a backward stepwise selection (or feature reduction) method.
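
The elimination loop behind RFE can be sketched in a few lines of plain Python. This is an illustration rather than scikit-learn's implementation: `importance_fn` is a hypothetical callback that stands in for refitting the estimator and reading the absolute values of its coefficients, and the toy weights are made up.

```python
def recursive_elimination(feature_names, importance_fn, n_select, step):
    """Drop the least important features in rounds until n_select remain.

    importance_fn (hypothetical) takes the surviving feature names and
    returns one non-negative importance score per feature.
    """
    remaining = list(feature_names)
    sizes = [len(remaining)]
    while len(remaining) > n_select:
        scores = importance_fn(remaining)
        # Never remove more than is needed to land exactly on n_select.
        n_drop = min(step, len(remaining) - n_select)
        ranked = sorted(range(len(remaining)), key=lambda i: scores[i])
        to_drop = set(ranked[:n_drop])
        remaining = [f for i, f in enumerate(remaining) if i not in to_drop]
        sizes.append(len(remaining))
    return remaining, sizes

# Made-up coefficients for five words.
toy_weights = {"great": 2.1, "dull": -1.8, "the": 0.01, "film": 0.05, "and": 0.0}
keep, sizes = recursive_elimination(
    list(toy_weights),
    lambda names: [abs(toy_weights[n]) for n in names],
    n_select=2,
    step=2,
)
print(keep, sizes)
```

Under the same rule, the notebook's settings (42,774 features, step=10000, n_features_to_select=41000) amount to a single round that removes the 1,774 lowest-ranked features.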

# Single Decision Tree

#### Basic Decision Tree

* Initialize model as a decision tree with sklearn.
* Fit the data and labels to the model.


In [None]:
dt_clf = tree.DecisionTreeClassifier(criterion='entropy')
dt_clf.fit(X_train, Y_train)
print("Training acc:", dt_clf.score(X_train, Y_train), "\nValidation acc:", dt_clf.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.635714285714


In [None]:
parameters = {"max_depth": [None, 10, 100, 1000, 10000],
              "min_samples_split": [5, 10, 50, 100, 500, 1000],
              "min_samples_leaf": [10, 100, 1000, 10000],
              "max_leaf_nodes": [None, 10, 100, 1000, 10000],
              }
gridsearch = GridSearchCV(dt_clf, parameters)
gridsearch.fit(X_train, Y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [5, 10, 50, 100, 500, 1000], 'min_samples_leaf': [10, 100, 1000, 10000], 'max_depth': [None, 10, 100, 1000, 10000], 'max_leaf_nodes': [None, 10, 100, 1000, 10000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
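
As a rough cost estimate for this search, the grid can be enumerated with the standard library. The fold counts of 3 and 5 are assumptions (the default `cv` depends on the scikit-learn version), and `n_rows = 1120` is the training set size printed earlier.

```python
from itertools import product

# The decision tree grid from the cell above.
parameters = {
    "max_depth": [None, 10, 100, 1000, 10000],
    "min_samples_split": [5, 10, 50, 100, 500, 1000],
    "min_samples_leaf": [10, 100, 1000, 10000],
    "max_leaf_nodes": [None, 10, 100, 1000, 10000],
}

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]

def total_fits(n_candidates, n_folds):
    """Cross-validation fits one model per candidate per fold."""
    return n_candidates * n_folds

# A split needs min_samples_leaf rows on each side, so with n_rows training
# rows any candidate with 2 * min_samples_leaf > n_rows cannot split at all.
n_rows = 1120
cannot_split = [c for c in candidates if 2 * c["min_samples_leaf"] > n_rows]

print(len(candidates), total_fits(len(candidates), 3), total_fits(len(candidates), 5))
print(len(cannot_split))
```

Half of the candidates use a `min_samples_leaf` of 1000 or 10000 and so reduce to a single-node tree on this dataset, which suggests a grid with smaller values would spend the search budget better.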

In [None]:
# show the best parameters of the gridsearchCV regularized decision tree
gridsearch.best_params_

{'max_depth': None,
 'max_leaf_nodes': 10000,
 'min_samples_leaf': 10,
 'min_samples_split': 100}

In [None]:
# use parameters from gridsearchCV above in new decision tree model
reg_tree = gridsearch.best_estimator_

print("Training acc:", reg_tree.score(X_train, Y_train), "\nValidation acc:",
      reg_tree.score(X_val, Y_val))

Training acc: 0.757142857143 
Validation acc: 0.621428571429


In [None]:
# create decision tree model with manually searched parameters (best in class)
reg_tree2 = tree.DecisionTreeClassifier(criterion = "entropy", max_depth = None, max_leaf_nodes = 125, min_samples_leaf = 2, min_samples_split = 60)
reg_tree2.fit(X_train, Y_train)

# print model training and test accuracies
print_results(reg_tree2, X_train, Y_train, "Training score: ")
print_results(reg_tree2, X_val, Y_val, "Testing score: ")

Training score:  0.858035714286
Testing score:  0.692857142857


In [None]:
# train an AdaBoost classifier using the manually tuned decision tree above as the base estimator
# (passed positionally because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2)
boost_clf = AdaBoostClassifier(reg_tree2, n_estimators=100)
boost_clf.fit(X_train, Y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=125,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=60,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=100, random_state=None)

In [None]:
print("Training acc:", boost_clf.score(X_train, Y_train), "\nValidation acc:",
      boost_clf.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.739285714286


The AdaBoost-boosted decision tree produces the best validation accuracy among the decision tree models at 73.9%. It fits the training data perfectly, which suggests overfitting, but it still generalizes better than both its base estimator, the manually tuned tree (69.3%), and the GridSearchCV-tuned tree (62.1%). Even so, logistic regression, a simpler model, seems to be the more accurate option for sentiment analysis with this particular IMDB movie review text data.
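
To illustrate how boosting shifts attention to hard examples, here is a minimal sketch of one reweighting round of discrete AdaBoost for two classes. Note that this is the discrete variant, not the SAMME.R algorithm shown in the output above, and the labels and predictions are made up.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost reweighting step for a binary problem."""
    miss = [t != p for t, p in zip(y_true, y_pred)]
    # Weighted error of the current learner.
    err = sum(w for w, m in zip(weights, miss) if m) / sum(weights)
    if not 0 < err < 0.5:
        raise ValueError("learner must be imperfect but better than chance")
    # The learner's vote grows as its weighted error shrinks.
    alpha = math.log((1 - err) / err)
    # Misclassified points are up-weighted, then everything is renormalized.
    new_weights = [w * math.exp(alpha) if m else w for w, m in zip(weights, miss)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]      # the second point is misclassified
alpha, weights = adaboost_round([0.2] * 5, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the round, the single misclassified point carries half of the total weight, so the next learner is pushed to get it right. A base learner with zero weighted training error leaves nothing to reweight, which is one reason boosting is usually paired with weak learners.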

# Random Forest Classifier

#### Basic Random Forest

* Use sklearn's ensemble.RandomForestClassifier() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.


In [None]:
rfc = RandomForestClassifier(criterion = 'entropy', n_estimators=100)
rfc.fit(X_train, Y_train)
print("Training acc:", rfc.score(X_train, Y_train), "\nValidation acc:",
      rfc.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.821428571429


## Changing Parameters

In [None]:
parameters = {"min_samples_split": [2, 5, 10],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20],
              }
gridsearch2 = GridSearchCV(rfc, parameters)
gridsearch2.fit(X_train, Y_train)

In [None]:
gridsearch2.best_params_

{'max_leaf_nodes': None, 'min_samples_leaf': 10, 'min_samples_split': 100}

In [None]:
reg_forest = gridsearch2.best_estimator_
reg_forest.fit(X_train, Y_train)
print("Training acc:", reg_forest.score(X_train, Y_train), "\nValidation acc:",
      reg_forest.score(X_val, Y_val))

Training acc: 0.935714285714 
Validation acc: 0.817857142857


After tuning the parameters of our random forest classifier using sklearn's GridSearchCV, the model's training accuracy fell from 100% to 93.6%, so it overfits the training set less. However, the validation accuracy of the parameter-tuned model was slightly lower, at 81.8% compared with 82.1%.

**Add a Boost to Random Forest model with sklearn's AdaBoostClassifier( )**

In [None]:
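# base estimator passed positionally; the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
boost_reg2 = AdaBoostClassifier(reg_forest)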
boost_reg2.fit(X_train, Y_train)
print("Training acc:", boost_reg2.score(X_train, Y_train), "\nValidation acc:",
      boost_reg2.score(X_val, Y_val))

Training acc: 1.0 
Validation acc: 0.882142857143


What parameters did you choose to change and why?

We regularized the model by running a standard grid search over min_samples_split, max_depth, min_samples_leaf, and max_leaf_nodes, all of which limit how far each tree can grow. The resulting model brought the training accuracy down to 0.94, while the validation accuracy stayed roughly level at 0.82.

Finally, we trained an AdaBoostClassifier model using our regularized random forest as our base_estimator and received a training accuracy of 1.0 and a validation accuracy of 0.88. Although the model seems to overfit the training data, it also produces the highest validation accuracy of any tree-based model here.

We decided to optimize this AdaBoostClassifier model by running a standard grid search on the n_estimators hyperparameter, which informed us that the best resulting model used a value of 50 for n_estimators. With the regularized random forest as the base estimator and 50 estimators, we again received a training accuracy of 1.0 and a validation accuracy of 0.88.

The AdaBoost classifiers seem to overfit the training data, most likely because the base estimator (the regularized random forest) is a relatively strong learner, whereas boosting is designed around weak learners. Thus, it seems that even at its best a boosted random forest does not perform better than the simpler, parameter-tuned logistic regression model, which produced a validation accuracy of 89.3%.
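
To compare the models side by side, the accuracies printed in the cells above can be collected into one small table. The numbers are copied from those outputs; the model labels and the conversion to error counts (using the 280 validation reviews) are added here for illustration.

```python
# (training accuracy, validation accuracy) as printed in the cells above.
results = {
    "logistic regression (default)": (1.0, 0.839285714286),
    "logistic regression (L1, C=1000)": (1.0, 0.885714285714),
    "logistic regression (L1 + RFE)": (1.0, 0.892857142857),
    "decision tree (default)": (1.0, 0.635714285714),
    "decision tree (grid search)": (0.757142857143, 0.621428571429),
    "decision tree (manual tuning)": (0.858035714286, 0.692857142857),
    "AdaBoost on tuned tree": (1.0, 0.739285714286),
    "random forest (default)": (1.0, 0.821428571429),
    "random forest (grid search)": (0.935714285714, 0.817857142857),
    "AdaBoost on tuned forest": (1.0, 0.882142857143),
}
N_VAL = 280  # size of the validation set

# Number of misclassified validation reviews per model.
errors_by_model = {
    name: N_VAL - round(val_acc * N_VAL) for name, (_, val_acc) in results.items()
}

# Rank by validation accuracy; the train/validation gap hints at overfitting.
ranked = sorted(results.items(), key=lambda kv: kv[1][1], reverse=True)
for name, (train_acc, val_acc) in ranked:
    gap = train_acc - val_acc
    print(f"{name:34s} val={val_acc:.3f} gap={gap:.3f} errors={errors_by_model[name]}")
```

Seen as error counts, the top three models are separated by only a handful of reviews, so small differences between them should be read with some caution.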

How does a random forest classifier prevent overfitting better than a single decision tree?

A random forest resists overfitting better than a single decision tree because it averages many trees that differ from one another: each tree is trained on a bootstrap sample of the training set and considers only a random subset of the features at each split. Any single tree can still overfit, but the trees make different errors, and taking a majority vote cancels much of that variance. AdaBoost works differently: it trains weak learners sequentially so that each one focuses on the points that earlier learners found hard to classify, which is why we limited each individual tree so that it is individually weak.
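
The benefit of voting can be illustrated with a short calculation. The sketch below assumes the voters are independent and equally accurate, which real trees in a forest are not (they share training data), so it gives an optimistic upper picture rather than a prediction for this dataset. The function name is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a majority of n independent voters is right (n odd)."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1
    # Binomial tail: at least `needed` of the n voters are correct.
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With voters that are each right 60% of the time, the majority becomes steadily more reliable as voters are added, while voters that are worse than chance make the majority worse. Bootstrap sampling and random feature subsets are the forest's way of pushing its trees toward the independence this calculation assumes.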