In [1]:
import numpy as np
import pandas as pd
import os
import sklearn

# Reading Data

In [2]:
def segmentWords(s): 
    return s.split()

def readFile(fileName):
    # Function for reading file
    # input: filename as string
    # output: contents of file as list containing single words
    contents = []
    f = open(fileName)
    for line in f:
        contents.append(line)
    f.close()
    result = segmentWords('\n'.join(contents))
    return result

#### Create a Dataframe containing the counts of each word in a file

In [8]:
d = []

for c in os.listdir("data_training"):
    directory = "data_training/" + c
    for file in os.listdir(directory):
        words = readFile(directory + "/" + file)
        e = {x:words.count(x) for x in words}
        e['__FileID__'] = str(file)
        e['__CLASS__'] = str(c)
        d.append(e)

Create a dataframe from d - make sure to fill all the nan values with zeros.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html


In [9]:
d

[{'!': 3,
  '"': 10,
  '&': 1,
  '(': 9,
  ')': 9,
  ',': 44,
  '-': 7,
  '.': 34,
  '10/10': 2,
  '2': 1,
  '20': 1,
  '3': 1,
  '4/10': 1,
  '7/10': 2,
  '8/10': 1,
  '9/10': 2,
  ':': 3,
  '?': 6,
  '__CLASS__': 'neg',
  '__FileID__': 'cv000_29416.txt',
  'a': 14,
  'about': 2,
  'accident': 1,
  'actors': 1,
  'actually': 2,
  'after': 2,
  'again': 2,
  'ago': 1,
  'all': 6,
  'also': 1,
  'although': 1,
  'always': 1,
  'american': 1,
  'an': 3,
  'and': 20,
  'apparently': 2,
  'apparitions': 1,
  'applaud': 1,
  'are': 13,
  'arrow': 1,
  'as': 1,
  'assuming': 1,
  'attempt': 1,
  'audience': 2,
  'away': 2,
  'back': 1,
  'bad': 2,
  'be': 1,
  'beauty': 1,
  'because': 2,
  'been': 2,
  'before': 2,
  'bentley': 1,
  'big': 1,
  'biggest': 2,
  'bit': 1,
  'blair': 1,
  'both': 1,
  'bottom': 1,
  'break': 1,
  'but': 10,
  'by': 2,
  'came': 1,
  'character': 1,
  "character's": 1,
  'characters': 1,
  'chase': 1,
  'chasing': 1,
  'chopped': 1,
  'church': 1,
  'clue': 1,


In [12]:
dF = pd.DataFrame(data=d)
dF

Unnamed: 0,,earth,goodies,if,ripley,suspend,they,white,,,...,zukovsky,zundel,zurg's,zweibel,zwick,zwick's,zwigoff's,zycie,zycie',|
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


#### Split data into training and validation set 

* Sample 80% of your dataframe to be the training data

* Let the remaining 20% be the validation data (you can filter out the indicies of the original dataframe that weren't selected for the training data)

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [14]:
training = dF.sample(frac=0.80)
validation = dF.
training

Unnamed: 0,,earth,goodies,if,ripley,suspend,they,white,,,...,zukovsky,zundel,zurg's,zweibel,zwick,zwick's,zwigoff's,zycie,zycie',|
139,,,,,,,,,,,...,,,,,,,,,,
662,,,,,,,,,,,...,,,,,,,,,,
447,,,,,,,,,,,...,,,,,,,,,,
36,,,,,,,,,,,...,,,,,,,,,,
1131,,,,,,,,,,,...,,,,,,,,,,
639,,,,,,,,,,,...,,,,,,,,,,
34,,,,,,,,,,,...,,,,,,,,,,
909,,,,,,,,,,,...,,,,,,,,,,
1533,,,,,,,,,,,...,,,,,,,,,,
1564,,,,,,,,,,,...,,,,,,,,,,


* Split the dataframe for both training and validation data into x and y dataframes - where y contains the labels and x contains the words

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

# Logistic Regression

#### Basic Logistic Regression
* Use sklearn's linear_model.LogisticRegression() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

#### Changing Parameters

#### Feature Selection
* In the backward stepsize selection method, you can remove coefficients and the corresponding x columns, where the coefficient is more than a particular amount away from the mean - you can choose how far from the mean is reasonable.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.std.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.mean.html

How did you select which features to remove? Why did that reduce overfitting?

# Single Decision Tree

#### Basic Decision Tree

* Initialize your model as a decision tree with sklearn.
* Fit the data and labels to the model.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


#### Changing Parameters
* To test out which value is optimal for a particular parameter, you can either loop through various values or look into sklearn.model_selection.GridSearchCV

References:


http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

How did you choose which parameters to change and what value to give to them? Feel free to show a plot.

Why is a single decision tree so prone to overfitting?

# Random Forest Classifier

#### Basic Random Forest

* Use sklearn's ensemble.RandomForestClassifier() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


#### Changing Parameters

What parameters did you choose to change and why?

How does a random forest classifier prevent overfitting better than a single decision tree?