# Project 3: Web APIs & NLP
## *Template Notebook*

In this notebook:

* [Basic Decision Tree](#basic)
* [Bagged Decision Tree](#bagged)
* [Random Forest](#random)
* [Best Model](#best-model)

#### Import Libraries & Read in Data

In [12]:
## standard imports 
import pandas as pd 
import numpy as np
import re
## visualizations
import matplotlib.pyplot as plt
import seaborn as sns
## preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.dummy import DummyClassifier
## modeling
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB
## trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, ExtraTreesClassifier, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor
## NLP
from sklearn.feature_extraction.text import CountVectorizer
## analysis
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, make_scorer, f1_score, mean_squared_error

## options
import sklearn
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 100
pd.set_option('display.max_colwidth', 100)

In [13]:
### read in data
data = pd.read_csv('../data/reddit_posts_clean.csv')

In [14]:
### select data
X = data['selftext']
y = data['is_fallout']
### TTS
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

## Basic Decision Tree <a class="anchor" id="basic"></a>
<hr/>

In [None]:
dt = make_pipeline(CountVectorizer(stop_words='english', max_features=100), DecisionTreeClassifier())

dt.fit(X_train, y_train)

print('Training Score: ', dt.score(X_train, y_train))
print('Testing Score: ', dt.score(X_test, y_test))

This gives us a baseline for our decision tree's performance. Let's see whether allowing more features improves it.
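
For context, the tree's scores can also be compared against a majority-class baseline. The following is a minimal sketch, assuming a small made-up corpus in place of `X_train` and `y_train` (the `toy_texts` and `toy_labels` names and their contents are hypothetical). It uses `DummyClassifier`, which is already imported above.

```python
from sklearn.dummy import DummyClassifier

# Hypothetical toy data standing in for the notebook's X_train / y_train
toy_texts = ['vault dweller', 'nuka cola', 'power armor', 'dragon shout', 'skyrim mod']
toy_labels = [1, 1, 1, 0, 0]

# 'most_frequent' always predicts the majority class and ignores the features
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_texts, toy_labels)

# Accuracy equals the share of the majority class in the data being scored
baseline_accuracy = baseline.score(toy_texts, toy_labels)
print('Baseline Accuracy: ', baseline_accuracy)
```

Any model worth keeping should beat this number on the test set.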

In [31]:
dt = make_pipeline(CountVectorizer(stop_words='english'), DecisionTreeClassifier())

params = {
    'countvectorizer__max_features' : [100, 500, 800],

}

grid = GridSearchCV(dt, param_grid = params)
grid.fit(X_train, y_train)

print('Training Score: ', grid.score(X_train, y_train))
print('Testing Score: ', grid.score(X_test, y_test))
print('Best Parameters: ', grid.best_params_)

Training Score:  0.9769253380364491
Testing Score:  0.8871252204585538
Best Parameters:  {'countvectorizer__max_features': 800}
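
`grid.score` reports the refit best estimator only. To see how each candidate did during cross-validation, the fitted search exposes `cv_results_` and `best_score_`. A rough sketch on a made-up corpus follows (the `toy_*` names, the texts, and the candidate values 2, 5, 10 are assumptions chosen to suit the tiny data, not the notebook's settings).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for X_train / y_train
toy_texts = [
    'vault dweller wasteland', 'nuka cola vault', 'power armor wasteland',
    'pip boy vault', 'brotherhood power armor', 'wasteland radiation vault',
    'dragon shout skyrim', 'skyrim mod dragon', 'whiterun guard skyrim',
    'dragon priest mask', 'skyrim thieves guild', 'dragon bone armor',
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

toy_pipe = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=42))
toy_params = {'countvectorizer__max_features': [2, 5, 10]}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, cv=3)
toy_grid.fit(toy_texts, toy_labels)

# One row per candidate, with its mean and spread across the CV folds
cv_summary = pd.DataFrame(toy_grid.cv_results_)[
    ['param_countvectorizer__max_features', 'mean_test_score',
     'std_test_score', 'rank_test_score']
]
print(cv_summary)
print('Best CV Score: ', toy_grid.best_score_)
```

If the mean scores of neighbouring candidates sit within one standard deviation of each other, the choice between them is not well supported by the data.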


The grid search chose the largest `max_features` value offered (800), so let's see what happens when we use the full vocabulary.

In [30]:
dt = make_pipeline(CountVectorizer(stop_words='english'), DecisionTreeClassifier())

dt.fit(X_train, y_train)

print('Training Score: ', dt.score(X_train, y_train))
print('Testing Score: ', dt.score(X_test, y_test))

Training Score:  0.987801293356849
Testing Score:  0.8849206349206349


Test accuracy barely changes (0.885 with the full vocabulary versus 0.887 with 800 features), but we'll keep the full vocabulary for subsequent tree models as long as computing performance is ok.

## Bagged Decision Tree <a class="anchor" id="bagged"></a>
<hr/>

In [18]:
bag = make_pipeline(CountVectorizer(stop_words='english'), BaggingClassifier()) 

bag.fit(X_train, y_train)

print('Training Score: ', bag.score(X_train, y_train))
print('Testing Score: ', bag.score(X_test, y_test))

Training Score:  0.9847148736037625
Testing Score:  0.8990299823633157


In [26]:
print('F1 Training Score: ', f1_score(y_train, bag.predict(X_train)))
print('F1 Testing Score: ', f1_score(y_test, bag.predict(X_test)))

F1 Training Score:  0.9840441853329243
F1 Testing Score:  0.8972633467922836
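
As a reminder of what `f1_score` measures with its default settings: it is the harmonic mean of precision and recall for the positive class (label 1). A small hand-rolled sketch, with a hypothetical helper name and made-up labels, shows the arithmetic.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute binary F1 for the given positive label from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]

toy_f1 = f1_from_labels(toy_true, toy_pred)
print('Toy F1 Score: ', toy_f1)
```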


Bagging improves on the single tree only slightly: test accuracy rises from 0.885 to 0.899, and the gap between training and testing scores remains.
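
`BaggingClassifier()` with no arguments bags 10 decision trees by default, so one option left untried here is a larger ensemble. The sketch below, on a made-up corpus (the `toy_*` names and the choice of 50 estimators are assumptions), shows how that could be configured. It is not a tuned setting.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for X_train / y_train
toy_texts = [
    'vault dweller wasteland', 'nuka cola vault', 'power armor wasteland',
    'pip boy vault', 'dragon shout skyrim', 'skyrim mod dragon',
    'whiterun guard skyrim', 'dragon priest mask',
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# n_estimators=50 is an illustrative value; random_state makes runs repeatable
toy_bag = make_pipeline(
    CountVectorizer(stop_words='english'),
    BaggingClassifier(n_estimators=50, random_state=42),
)
toy_bag.fit(toy_texts, toy_labels)

# Each member of the ensemble is a decision tree fit on a bootstrap sample
fitted_bag = toy_bag.named_steps['baggingclassifier']
n_trees = len(fitted_bag.estimators_)
print('Number of bagged trees: ', n_trees)
```

In the notebook itself, the value could be searched with `GridSearchCV` using the key `'baggingclassifier__n_estimators'`.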

## Random Forest <a class="anchor" id="random"></a>
<hr/>

In [23]:
rfc = make_pipeline(CountVectorizer(stop_words='english'), RandomForestClassifier())

rfc.fit(X_train, y_train)

print('Training Score: ', rfc.score(X_train, y_train))
print('Testing Score: ', rfc.score(X_test, y_test))

Training Score:  0.987801293356849
Testing Score:  0.9223985890652557


Wow. This one is pretty good considering it uses the default settings with no tuning at all.

In [25]:
print('F1 Training Score: ', f1_score(y_train, rfc.predict(X_train)))
print('F1 Testing Score: ', f1_score(y_test, rfc.predict(X_test))) 

F1 Training Score:  0.9872796934865901
F1 Testing Score:  0.921146953405018


In [28]:
## note: X includes the training rows, so this score is optimistic
y_preds = rfc.predict(X)
print('F1 Score: ', f1_score(y, y_preds) )

F1 Score:  0.9704236610711432


This model shows the smallest gap between training and testing accuracy (0.988 versus 0.922), so it has the least evidence of overfitting, and it outperforms the basic and bagged trees explored above.
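
The overfitting comparison can be made concrete by subtracting each model's testing accuracy from its training accuracy. The sketch below just restates the scores printed earlier in this notebook. The dictionary and variable names are illustrative.

```python
# (training accuracy, testing accuracy) as printed in the cells above
reported_scores = {
    'Basic Decision Tree (all features)': (0.987801293356849, 0.8849206349206349),
    'Bagged Decision Tree': (0.9847148736037625, 0.8990299823633157),
    'Random Forest': (0.987801293356849, 0.9223985890652557),
}

# A larger train-minus-test gap suggests more overfitting
gaps = {name: train - test for name, (train, test) in reported_scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'{name}: {gap:.4f}')

smallest_gap_model = min(gaps, key=gaps.get)
print('Smallest gap: ', smallest_gap_model)
```

All three models still score well above their test accuracy on the training data, so "least overfit" here is relative.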

## Best Model <a class="anchor" id="best-model"></a>
<hr/>

In [29]:
rfc = make_pipeline(CountVectorizer(stop_words='english'), RandomForestClassifier())

rfc.fit(X_train, y_train)

## .score() on a classifier returns accuracy, not R2
print('Accuracy Training Score: ', rfc.score(X_train, y_train))
print('Accuracy Testing Score: ', rfc.score(X_test, y_test))
print('-----------------------')
print('F1 Training Score: ', f1_score(y_train, rfc.predict(X_train)))
print('F1 Testing Score: ', f1_score(y_test, rfc.predict(X_test))) 
print('-----------------------')
## X includes the training rows, so the complete score is optimistic
y_preds = rfc.predict(X)
print('Complete F1 Score: ', f1_score(y, y_preds) )

Accuracy Training Score:  0.987801293356849
Accuracy Testing Score:  0.9232804232804233
-----------------------
F1 Training Score:  0.9872796934865901
F1 Testing Score:  0.9222520107238605
-----------------------
Complete F1 Score:  0.9706721442428392
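
Because the complete score mixes in rows the model was trained on, the held-out test set is the cleaner place to look at errors. A confusion matrix breaks those errors down by type. The following is a rough sketch on a made-up corpus (all `toy_*` names and texts are hypothetical), using `confusion_matrix`, which is already imported above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the notebook's X / y
toy_texts = [
    'vault dweller wasteland', 'nuka cola vault', 'power armor wasteland',
    'pip boy vault', 'brotherhood power armor', 'wasteland radiation vault',
    'nuka cola wasteland', 'vault power armor',
    'dragon shout skyrim', 'skyrim mod dragon', 'whiterun guard skyrim',
    'dragon priest mask', 'skyrim thieves guild', 'dragon bone armor',
    'skyrim dragon priest', 'whiterun dragon shout',
]
toy_labels = [1] * 8 + [0] * 8

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

toy_rfc = make_pipeline(
    CountVectorizer(stop_words='english'),
    RandomForestClassifier(random_state=42),
)
toy_rfc.fit(toy_X_train, toy_y_train)

# Score only on rows the model has not seen
toy_preds = toy_rfc.predict(toy_X_test)

# labels=[0, 1] fixes the layout: rows are true classes, columns are predictions
cm = confusion_matrix(toy_y_test, toy_preds, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print('Held-out F1: ', f1_score(toy_y_test, toy_preds))
```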
