In [26]:
import os

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reading Data

In [27]:
def segmentWords(s):
    # Split a string on whitespace into a list of words
    return s.split()

def readFile(fileName):
    # Function for reading file
    # input: filename as string
    # output: contents of file as list containing single words
    with open(fileName) as f:
        contents = f.read()
    return segmentWords(contents)

In [28]:
for c in os.listdir("data_training"):
    directory = "data_training/" + c
    for file in os.listdir(directory):
        print(file)
        words = readFile(directory + "/" + file)
        print(len(words))
        break

cv000_29416.txt
825
cv000_29590.txt
802


In [29]:
os.listdir("data_training")

['neg', 'pos']

#### Create a Dataframe containing the counts of each word in a file

In [30]:
from collections import Counter

d = []

for c in os.listdir("data_training"):
    directory = "data_training/" + c
    for file in os.listdir(directory):
        words = readFile(directory + "/" + file)
        # Counter tallies every word in one pass; words.count(x) per word is quadratic
        e = dict(Counter(words))
        #e['__FileID__'] = file
        e['__CLASS__'] = c
        d.append(e)

Create a dataframe from d - make sure to fill all the nan values with zeros.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html


In [31]:
df = pd.DataFrame(d)
dataframe = df.fillna(0)
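
To make the NaN-filling step concrete, here is a minimal sketch on a made-up two-document corpus. The names `toy_docs`, `rows`, and `toy_df` are hypothetical and not part of the assignment data; the point is only to show why `fillna(0)` is needed when documents do not share the same vocabulary.

```python
from collections import Counter

import pandas as pd

# Hypothetical two-document corpus standing in for the review files.
toy_docs = [
    ("pos", "a great great film"),
    ("neg", "a dull film"),
]

rows = []
for label, text in toy_docs:
    row = dict(Counter(text.split()))  # word -> count for this document
    row["__CLASS__"] = label
    rows.append(row)

# A word that is absent from a document shows up as NaN in that row;
# fillna(0) turns those gaps into zero counts.
toy_df = pd.DataFrame(rows).fillna(0)
print(toy_df)
```

Each row is one document and each column one word, which is the same layout as `dataframe` above.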

#### Split data into training and validation set 

* Sample 80% of your dataframe to be the training data

* Let the remaining 20% be the validation data (you can filter out the indices of the original dataframe that weren't selected for the training data)

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [32]:
training_data = dataframe.sample(frac = 0.8)#.drop(labels = ['__FileID__'],axis = 1)
validation_data = dataframe.drop(labels =training_data.index, axis = 0)#.drop(labels = ['__CLASS__'],axis = 1)


#training_data,validation_data = train_test_split(dataframe.values, test_size=0.2)

* Split the dataframe for both training and validation data into x and y dataframes - where y contains the labels and x contains the words

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [33]:
x_training_data = training_data.drop(labels = '__CLASS__',axis = 1)
y_training_data = training_data['__CLASS__']
x_validation_data = validation_data.drop(labels = '__CLASS__',axis = 1)
y_validation_data = validation_data['__CLASS__']
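
The `sample`/`drop` split above is random on every run. As an alternative sketch, `train_test_split` can work directly on the x and y frames, with `random_state` fixing the shuffle and `stratify` keeping the class balance in both parts. The frame `toy` below is made up for illustration; only its layout (word counts plus a `__CLASS__` column) mirrors the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with the same layout as `dataframe`.
toy = pd.DataFrame({
    "good": [3, 0, 2, 0, 1, 0, 4, 0, 2, 1],
    "bad": [0, 2, 0, 3, 0, 1, 0, 2, 0, 3],
    "__CLASS__": ["pos", "neg"] * 5,
})

x = toy.drop(columns="__CLASS__")
y = toy["__CLASS__"]

# 80/20 split; random_state makes it repeatable, stratify keeps pos/neg balanced.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)
print(len(x_train), len(x_val))
```

Note that the scores recorded in this notebook came from the unseeded `sample` call, so rerunning with a fixed seed would give different numbers.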

# Logistic Regression

#### Basic Logistic Regression
* Use sklearn's linear_model.LogisticRegression() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [34]:
#using default regularization term
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
my_logistic_reg = logistic.fit(x_training_data,y_training_data)
print('Score of Logistic Regression using default regularization term: ',my_logistic_reg.score(x_validation_data,y_validation_data))

Score of Logistic Regression using default regularization term:  0.825


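I get a validation score of 0.825, which is good for a first model. Whether the model is overfitting cannot be read from this number alone: that requires comparing it with the score on the training data.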
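
One way to check for overfitting is to score the same fitted model on both the training and the validation data and look at the gap. The sketch below uses synthetic data from `make_classification` as a stand-in for the word counts, so the numbers it prints say nothing about the review data; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the word-count matrix (an assumption, not the review data).
X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_score = model.score(X_train, y_train)
val_score = model.score(X_val, y_val)

# A training score far above the validation score is the usual sign of overfitting.
print("train:", train_score, "validation:", val_score, "gap:", train_score - val_score)
```

In this notebook the equivalent check would be `my_logistic_reg.score(x_training_data, y_training_data)` next to the validation score.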

#### Changing Parameters

In [35]:
# C is the inverse regularization strength, so C = 1e6 is close to no regularization.
# The solver is left at its default.
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(C = 1e6)
my_logistic_reg = logistic.fit(x_training_data,y_training_data)
print('Score of Logistic Regression with C = 1e6: ',my_logistic_reg.score(x_validation_data,y_validation_data))


Score of Logistic Regression with C = 1e6:  0.81875


#### Feature Selection
* In the backward stepwise selection method, you can remove coefficients and the corresponding x columns where the coefficient is more than a particular amount away from the mean; you can choose how far from the mean is reasonable.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.std.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.mean.html

How did you select which features to remove? Why did that reduce overfitting?

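ANS: We kept only the columns whose standard deviation across the training documents is above 0.07 and dropped the rest, i.e. words whose counts barely vary from file to file (typically words that occur in only a few reviews). Such columns carry little signal, but each one still adds a coefficient that the model can use to memorize individual training files. Removing them lowers the number of parameters and therefore the variance of the fitted model. On this split the validation score moved only slightly, from 0.81875 to 0.821875.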
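
The cells below filter on the standard deviation of each column. For comparison, here is a sketch of the coefficient-based rule from the bullet above, read literally: fit a model, then drop the columns whose coefficient lies more than `k` standard deviations from the mean coefficient. The data, the column names `w0`..`w19`, and the cutoff `k = 1.0` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic count data: only w0 and w1 actually influence the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.poisson(1.0, size=(200, 20)), columns=[f"w{i}" for i in range(20)])
y = np.where(X["w0"] - X["w1"] + rng.normal(0, 0.5, 200) > 0, "pos", "neg")

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = model.coef_[0]

k = 1.0  # assumed cutoff, in standard deviations of the coefficients
far = np.where(np.abs(coefs - np.mean(coefs)) > k * np.std(coefs))[0]

# Drop the columns whose coefficients are far from the mean.
X_reduced = X.drop(columns=X.columns[far])
print("dropped:", list(X.columns[far]))
```

Read this way, the rule removes the columns with the most extreme coefficients, which can include the most informative ones, so any such filter should be judged by its effect on the validation score.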

In [36]:
np.std(np.mean(x_training_data))

0.3663323989776312

In [37]:
n_col = {}
for col_name in x_training_data:
    temp_std = np.std(x_training_data[col_name])
    if temp_std > .07:
        n_col[col_name] = x_training_data[col_name]
new_x_training_data = pd.DataFrame.from_dict(n_col)

In [38]:
# Keep the same columns in the validation data as in the reduced training data
keys = list(n_col)
new_x_validation_data = x_validation_data[keys]

In [39]:
# Refit the same C = 1e6 model on the reduced feature set
n_my_logistic_reg = logistic.fit(new_x_training_data,y_training_data)
print('Score of Logistic Regression with C = 1e6 on the reduced feature set: ',n_my_logistic_reg.score(new_x_validation_data,y_validation_data))


Score of Logistic Regression with C = 1e6 on the reduced feature set:  0.821875


# Single Decision Tree

#### Basic Decision Tree

* Initialize your model as a decision tree with sklearn.
* Fit the data and labels to the model.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


In [40]:
clf = DecisionTreeClassifier(random_state=0)
decision_tree = clf.fit(x_training_data,y_training_data)
print('The accuracy of default tree is:', decision_tree.score(x_validation_data,y_validation_data))

The accuracy of default tree is: 0.66875


#### Changing Parameters
* To test out which value is optimal for a particular parameter, you can either loop through various values or look into sklearn.model_selection.GridSearchCV

References:


http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [45]:
# Same tree as above, but splits are chosen by information gain (entropy) instead of Gini impurity
clf = DecisionTreeClassifier(random_state=0,criterion = 'entropy')
decision_tree = clf.fit(x_training_data,y_training_data)
print('The accuracy of the entropy-criterion tree is:', decision_tree.score(x_validation_data,y_validation_data))


The accuracy of the entropy-criterion tree is: 0.678125
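
The bullet above mentions `GridSearchCV` as an alternative to looping over parameter values by hand. A minimal sketch of that approach follows, on synthetic data from `make_classification`; the grid values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the word-count matrix (an assumption, not the review data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Candidate values are illustrative; every combination is tried with 5-fold cross-validation.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Because the search cross-validates inside the data it is given, it would be run on the training data here, keeping the validation set for the final comparison.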


In [53]:
# Note: this cell fits a linear support vector classifier, not a decision tree
from sklearn import svm
svc = svm.SVC(C=1, kernel='linear',tol=1e-4)
svc_fit = svc.fit(x_training_data,y_training_data)
svc_score = svc_fit.score(x_validation_data,y_validation_data)
svc_score


0.834375


In [54]:
print('Linear SVC with a customized error tolerance of 1e-4 scored',svc_score)

Linear SVC with a customized error tolerance of 1e-4 scored 0.834375


How did you choose which parameters to change and what value to give to them? Feel free to show a plot.

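ANS: For the tree itself we changed the split criterion from Gini to entropy, which raised validation accuracy from 0.66875 to 0.678125. The error tolerance of 1e-4 belongs to the linear SVC we also tried, which is a different model (it scored 0.834375) and not a decision tree. To reduce overfitting in the tree, the parameters to tune are the ones that limit its size, such as max_depth or min_samples_leaf.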

Why is a single decision tree so prone to overfitting?

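ANS: Left unconstrained, a decision tree keeps splitting until every leaf is pure, so it can fit the training set perfectly, noise included. Each split is chosen greedily from fewer and fewer samples, so small changes in the training data can produce a very different tree. That high variance shows up as a low training error alongside a much higher validation error.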
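
The overfitting of a single tree can be seen by comparing an unconstrained tree with a depth-limited one. The sketch below uses synthetic data with 10% label noise (`flip_y=0.1`); the data and the choice of `max_depth=3` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise for the tree to memorize.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("unconstrained", deep), ("max_depth=3", shallow)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "train:", tree.score(X_train, y_train),
        "validation:", tree.score(X_val, y_val),
    )
```

The unconstrained tree reaches a perfect training score by growing until every leaf is pure; the gap between its training and validation scores is the overfitting described above.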

# Random Forest Classifier

#### Basic Random Forest

* Use sklearn's ensemble.RandomForestClassifier() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [62]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
my_forest = clf.fit(x_training_data,y_training_data)
my_forest.score(x_validation_data,y_validation_data)

0.65625

#### Changing Parameters

What parameters did you choose to change and why?

In [78]:
clf = RandomForestClassifier(criterion = 'entropy',n_estimators = 1000)
my_forest = clf.fit(x_training_data,y_training_data)
my_forest.score(x_validation_data,y_validation_data)

0.77812499999999996

ANS: We increased n_estimators to 1000 and switched the split criterion to entropy. Each tree in the forest is trained on a different bootstrap sample of the training data and considers a random subset of the features at each split, so averaging the votes of more trees reduces the variance of the prediction. Validation accuracy rose from 0.65625 to 0.778125.

Why is a random forest classifier better than a single decision tree?

ANS: A random forest is usually better than a single decision tree because it averages away much of the tree's overfitting. It trains many trees, each on a bootstrap sample of the training data with a random subset of the features considered at each split, and combines their votes. The individual trees still overfit, but their errors are only partly correlated, so the combined prediction has lower variance and generalizes better to the validation set.
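
As a rough illustration of the comparison, the sketch below fits one tree and one forest on the same synthetic data and prints their validation scores. The data, `n_estimators=200`, and the use of `oob_score=True` are assumptions for this example; the out-of-bag score is an extra estimate that each tree provides from the training rows left out of its bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the word-count matrix (an assumption, not the review data).
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=0
).fit(X_train, y_train)

print("single tree validation:", tree.score(X_val, y_val))
print("forest validation:     ", forest.score(X_val, y_val))
print("forest out-of-bag:     ", forest.oob_score_)
```

The forest does not need a separate validation set to get a first accuracy estimate, since `oob_score_` is computed from the training data alone.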