# Project 3: Web APIs & NLP
## *Final Models & Conclusions*

The final two models are presented, along with conclusions and further explorations that could be conducted.

In this notebook:

* [Naive Bayes](#nb-model)
* [Random Forest](#rf-model)
* [Conclusions](#conclusions)
* [Further Exploration](#further)

#### Import Libraries & Read in Data

In [6]:
## standard imports 
import pandas as pd 
import numpy as np
import re
## visualizations
import matplotlib.pyplot as plt
import seaborn as sns
## preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.dummy import DummyClassifier
## modeling
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB
## trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, ExtraTreesClassifier, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor
## NLP
from sklearn.feature_extraction.text import CountVectorizer
## analysis
## plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, make_scorer, f1_score, mean_squared_error

## options
import sklearn
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 100
pd.set_option('max_colwidth', 100)

In [7]:
### read in data
data = pd.read_csv('../data/reddit_posts_clean.csv')

In [8]:
### select data
X = data['selftext']
y = data['is_fallout']
### TTS
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
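The split above passes `stratify=y`, which keeps the class balance the same in the training and testing sets. A minimal sketch of that behavior follows, using made-up posts and labels rather than the project data; the 60/40 class ratio is an assumption chosen only to make the counts easy to read.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up stand-ins for the posts and the is_fallout label (60% positive).
toy_texts = [f"post {i}" for i in range(20)]
toy_labels = [1] * 12 + [0] * 8

# Same call shape as the notebook: default 25% test size, stratified on the label.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, stratify=toy_labels, random_state=42
)

# Both subsets keep the 60/40 ratio of the full label list.
print("train:", Counter(toy_y_train))
print("test: ", Counter(toy_y_test))
```

Without `stratify`, a small or imbalanced dataset can end up with noticeably different class ratios in the two subsets, which makes training and testing scores harder to compare.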

## Naive Bayes Model <a class="anchor" id="nb-model"></a>
<hr/>

In [4]:
pipe = make_pipeline(CountVectorizer(stop_words='english'),  MultinomialNB())

params = {
    'countvectorizer__min_df': [2],
    'countvectorizer__max_df': [0.9],
    'countvectorizer__ngram_range': [(1, 1)],
    'multinomialnb__alpha': [0.0001]         
}

grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)

## for a classifier, .score() returns mean accuracy, not R2
print('Accuracy Training Score: ', grid.score(X_train, y_train))
print('Accuracy Testing Score: ', grid.score(X_test, y_test))
print('-----------------------')
print('F1 Score training: ', f1_score(y_train, grid.predict(X_train)))
print('F1 Score testing: ', f1_score(y_test, grid.predict(X_test)))
## X includes the training rows, so this is not a held-out estimate
y_preds = grid.predict(X)
print('-----------------------')
print('Complete F1 Score: ', f1_score(y, y_preds))

Accuracy Training Score:  0.9610523221634333
Accuracy Testing Score:  0.9444444444444444
-----------------------
F1 Score training:  0.9581292463264338
F1 Score testing:  0.9405099150141643
-----------------------
Complete F1 Score:  0.9537113768201729
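The output above reports two different metrics. Calling `.score()` on a scikit-learn classifier (or on a grid search wrapping one, when no `scoring` is given) returns mean accuracy, while `f1_score` is the harmonic mean of precision and recall for the positive class. The sketch below computes both by hand on made-up labels to show how they can diverge; the helper names `accuracy` and `f1` are illustrative and not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 positives out of 10, and a model that finds only one of them.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print("accuracy:", accuracy(toy_true, toy_pred))
print("f1:      ", f1(toy_true, toy_pred))
```

Here accuracy looks respectable because most rows are negatives, while F1 is much lower because two of the three positives were missed. On a roughly balanced dataset the two numbers tend to sit close together, as they do in the results above.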


## Random Forest Model <a class="anchor" id="rf-model"></a>
<hr/>

In [5]:
rfc = make_pipeline(CountVectorizer(stop_words='english'), RandomForestClassifier())

rfc.fit(X_train, y_train)

## for a classifier, .score() returns mean accuracy, not R2
print('Accuracy Training Score: ', rfc.score(X_train, y_train))
print('Accuracy Testing Score: ', rfc.score(X_test, y_test))
print('-----------------------')
print('F1 Training Score: ', f1_score(y_train, rfc.predict(X_train)))
print('F1 Testing Score: ', f1_score(y_test, rfc.predict(X_test)))
print('-----------------------')
## X includes the training rows, so this is not a held-out estimate
y_preds = rfc.predict(X)
print('Complete F1 Score: ', f1_score(y, y_preds))

Accuracy Training Score:  0.987801293356849
Accuracy Testing Score:  0.9206349206349206
-----------------------
F1 Training Score:  0.9872835912364026
F1 Testing Score:  0.9192825112107623
-----------------------
Complete F1 Score:  0.9699668836359484


## Conclusions <a class="anchor" id="conclusions"></a>
<hr/>

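* Naive Bayes generalized better than the untuned Random Forest, with a testing F1 of 0.941 versus 0.919.
* The Random Forest overfit more: its F1 fell from 0.987 on training data to 0.919 on testing data, while Naive Bayes fell only from 0.958 to 0.941.
* Increasing the n-gram range beyond single words did not improve model performance.
* Using TfidfVectorizer in place of CountVectorizer did not improve model performance.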


## Further Exploration <a class="anchor" id="further"></a>
<hr/>

* Further processing of data could be done to clean text.
* Additional stop words could be added.
* Lemmatizing or stemming could be applied to merge inflected forms of the same word.
* Further analysis of the words specific to each subreddit. Is there a better way to choose words than raw counts?
* Further exploration of parameter selection in tree models to attempt to improve performance.
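
On the question of choosing words by something other than count, one option (an assumption here, not something this project tried) is to rank words by a smoothed log-ratio of their relative frequencies in the two groups of posts. Words that are common in both groups score near zero, while words concentrated in one group score high. The function name `distinctive_words` and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def distinctive_words(docs_a, docs_b, top_n=3, smoothing=1.0):
    """Rank words by how much more frequent they are in docs_a than in docs_b."""
    counts_a = Counter(w for doc in docs_a for w in doc.lower().split())
    counts_b = Counter(w for doc in docs_b for w in doc.lower().split())
    vocab = set(counts_a) | set(counts_b)

    # Additive smoothing keeps words missing from one group from dividing by zero.
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + smoothing) / total_a
        p_b = (counts_b[word] + smoothing) / total_b
        scores[word] = math.log(p_a / p_b)

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up posts: "the" and "vault" are equally frequent in group_a by raw count.
group_a = ["the vault door opened", "the vault was sealed", "radiation in the vault"]
group_b = ["the recipe was simple", "the recipe needs salt", "salt in the soup"]

top_word, top_score = distinctive_words(group_a, group_b, top_n=1)[0]
print(top_word, top_score)
```

By raw count, "the" ties with "vault" in the first group, but because "the" is just as frequent in the second group its score is zero and "vault" ranks first.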