# Project 3: Web APIs & NLP
## *Naive Bayes Models*

Parameter selection for Naive Bayes models.

In this notebook:

* [Basic Naive Bayes](#basic-nb)
* [Naive Bayes with TfidfVectorizer](#tfid-nb)
* [Grid search for CountVectorizer](#gs-cv)
* [Determine ideal $\alpha$](#nb-alpha)
* [Best model](#best-model)

#### Import Libraries & Read in Data

In [24]:
## standard imports 
import pandas as pd 
import numpy as np
import re
## visualizations
import matplotlib.pyplot as plt
import seaborn as sns
## preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.dummy import DummyClassifier
## modeling
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB
## trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, ExtraTreesClassifier, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor
## NLP
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
## analysis
from sklearn.metrics import confusion_matrix, accuracy_score, make_scorer, f1_score, mean_squared_error

## options
import sklearn
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 100
pd.set_option('max_colwidth', 100)

In [25]:
### read in data
data = pd.read_csv('../data/reddit_posts_clean.csv')

In [26]:
### select data
X = data['selftext']
y = data['is_fallout']
### TTS
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

## Basic Naive Bayes <a class="anchor" id="basic-nb"></a>
<hr/>

In [27]:
nb = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())

nb.fit(X_train, y_train)

print('Training Score: ', nb.score(X_train, y_train))
print('Testing Score: ', nb.score(X_test, y_test))

Training Score:  0.9616402116402116
Testing Score:  0.9470899470899471


## Naive Bayes with `TfidfVectorizer`<a class="anchor" id="tfid-nb"></a>
<hr/>

In [28]:
nb = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())

nb.fit(X_train, y_train)

print('Training Score: ', nb.score(X_train, y_train))
print('Testing Score: ', nb.score(X_test, y_test))

Training Score:  0.9700176366843033
Testing Score:  0.9435626102292769


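The two vectorizers perform almost identically: test accuracy is 0.947 with `CountVectorizer` and 0.944 with `TfidfVectorizer`. `TfidfVectorizer` also shows a slightly larger gap between training and testing scores, so we'll stick with `CountVectorizer`.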
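A comparison on a single train/test split can shift with the split itself, so a cross-validated comparison is one way to double check a gap this small. The sketch below assumes a tiny made-up corpus (`docs`, `labels`) in place of `X_train` and `y_train`; only the pattern is meant to carry over, not the numbers.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy corpus standing in for X_train / y_train (1 = Fallout-like post)
docs = [
    'vault wasteland radiation', 'vault power armor', 'wasteland raiders radiation',
    'power armor radiation', 'vault dweller wasteland', 'raiders power armor',
    'dragon castle sword', 'dragon village shield', 'castle guards sword',
    'village guards dragon', 'sword shield castle', 'village dragon guards',
]
labels = [1] * 6 + [0] * 6

scores = {}
for vec in (CountVectorizer(stop_words='english'), TfidfVectorizer(stop_words='english')):
    pipe = make_pipeline(vec, MultinomialNB())
    # mean accuracy over 3 stratified folds
    scores[type(vec).__name__] = cross_val_score(pipe, docs, labels, cv=3).mean()

for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

On the real data, `docs` and `labels` would be replaced by `X_train` and `y_train`, and a larger `cv` would be reasonable.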

## Grid search for CountVectorizer <a class="anchor" id="gs-cv"></a>
<hr/>

First we'll narrow down the parameters for `CountVectorizer` on their own, because a single grid search over the vectorizer and model parameters together was taking too long.

In [31]:
pipe = make_pipeline(CountVectorizer(stop_words='english'),  MultinomialNB())

params = {
    'countvectorizer__max_features': [100, 500, 800],
    'countvectorizer__min_df': [2, 3],
    'countvectorizer__max_df': [0.9, 0.95],
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],       
}

grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)

print('Training Score: ', grid.score(X_train, y_train))
print('Testing Score: ', grid.score(X_test, y_test))
print('Best Parameters: ', grid.best_params_)

Training Score:  0.9329805996472663
Testing Score:  0.9241622574955908
Best Parameters:  {'countvectorizer__max_df': 0.9, 'countvectorizer__max_features': 800, 'countvectorizer__min_df': 2, 'countvectorizer__ngram_range': (1, 1)}
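To see why a combined search gets slow, it helps to count the fits a grid implies. The sketch below rebuilds the same pipeline and grid, checks that every grid key matches a real pipeline parameter (`make_pipeline` names each step after its lowercased class), and multiplies the number of candidates by the number of folds. The 5-fold figure assumes the `GridSearchCV` default in scikit-learn 0.22 or later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())

params = {
    'countvectorizer__max_features': [100, 500, 800],
    'countvectorizer__min_df': [2, 3],
    'countvectorizer__max_df': [0.9, 0.95],
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
}

# step names generated by make_pipeline; they form the '<step>__<param>' prefixes
step_names = [name for name, _ in pipe.steps]

# any grid key that is not a real pipeline parameter would make GridSearchCV fail
valid_keys = set(pipe.get_params())
unknown = [key for key in params if key not in valid_keys]

n_candidates = len(ParameterGrid(params))  # every combination of the listed values
cv_folds = 5                               # assumed default when cv is not passed
total_fits = n_candidates * cv_folds       # the final refit on all training data is extra

print(step_names, unknown, n_candidates, total_fits)
```

Each extra parameter multiplies the count, which is why the searches are split here. Passing `n_jobs=-1` to `GridSearchCV` is another option, since it runs the fits in parallel.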


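Whenever `max_features` is in the grid, the search picks the largest value offered (800 here), and the capped model's test accuracy (0.924) is still below that of the uncapped basic model (0.947). We'll use all features for subsequent models.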
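One way to check a pattern like this is to average the cross-validated scores per `max_features` value instead of looking only at `best_params_`. The sketch below is an illustration on a made-up toy corpus (`docs`, `labels`), with `None` added to the grid to stand for "use all features"; the values in the grid are placeholders, not the ones used in this notebook.

```python
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy corpus standing in for X_train / y_train (1 = Fallout-like post)
docs = [
    'vault wasteland radiation', 'vault power armor', 'wasteland raiders radiation',
    'power armor radiation', 'vault dweller wasteland', 'raiders power armor',
    'dragon castle sword', 'dragon village shield', 'castle guards sword',
    'village guards dragon', 'sword shield castle', 'village dragon guards',
]
labels = [1] * 6 + [0] * 6

pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
grid = GridSearchCV(
    pipe,
    param_grid={
        'countvectorizer__max_features': [5, 10, None],  # None keeps the full vocabulary
        'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    },
    cv=3,
)
grid.fit(docs, labels)

# group the mean cross-validated accuracy of every candidate by its feature cap
by_cap = defaultdict(list)
for candidate, score in zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']):
    by_cap[candidate['countvectorizer__max_features']].append(score)

summary = {cap: float(np.mean(scores)) for cap, scores in by_cap.items()}
print(summary)
```

If the average keeps rising with the cap and `None` does at least as well, dropping `max_features` from the grid is a reasonable simplification.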

## Determine ideal $\alpha$ <a class="anchor" id="nb-alpha"></a>
<hr/>

In [20]:
pipe = make_pipeline(CountVectorizer(stop_words='english'),  MultinomialNB())

params = {
    'countvectorizer__min_df': [2, 3],
    'countvectorizer__max_df': [0.9, 0.95],
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'multinomialnb__alpha': [0.01, 0.0001, 0.00001]
}

grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)

print('Training Score: ', grid.score(X_train, y_train))
print('Testing Score: ', grid.score(X_test, y_test))
print('Best Parameters: ', grid.best_params_)

Training Score:  0.9607583774250441
Testing Score:  0.9475308641975309
Best Parameters:  {'countvectorizer__max_df': 0.9, 'countvectorizer__min_df': 2, 'countvectorizer__ngram_range': (1, 1), 'multinomialnb__alpha': 0.01}
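For context on what `multinomialnb__alpha` controls: `MultinomialNB` estimates the probability of term $i$ in class $y$ as $(N_{yi} + \alpha) / (N_y + \alpha n)$, where $N_{yi}$ is the count of term $i$ in class $y$, $N_y$ is the total term count in that class, and $n$ is the vocabulary size. The sketch below checks that formula against a fitted model on a small made-up count matrix; the matrix and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# hypothetical term counts: rows are documents, columns are vocabulary terms
X_counts = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 2],
])
y_toy = np.array([1, 1, 0, 0])

alpha = 0.01
nb = MultinomialNB(alpha=alpha).fit(X_counts, y_toy)

n_terms = X_counts.shape[1]
manual = []
for cls in nb.classes_:
    # total count of each term over the documents of this class
    term_counts = X_counts[y_toy == cls].sum(axis=0)
    # additive (Lidstone) smoothing: alpha is added to every term count
    manual.append((term_counts + alpha) / (term_counts.sum() + alpha * n_terms))
manual = np.array(manual)

learned = np.exp(nb.feature_log_prob_)  # rows follow the order of nb.classes_
print(np.allclose(manual, learned))
```

A smaller $\alpha$ keeps the estimates closer to the raw frequencies, so terms never seen in a class get a probability near zero; a larger $\alpha$ pulls every term toward a uniform distribution.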


In [21]:
print('F1 Score training: ', f1_score(y_train, grid.predict(X_train)))
print('F1 Score testing: ', f1_score(y_test, grid.predict(X_test)))

F1 Score training:  0.9577999051683261
F1 Score testing:  0.9437883797827114
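As a reminder of what these numbers measure, F1 is the harmonic mean of precision and recall for the positive class, which `f1_score` takes to be label `1` by default (assumed here to mean `is_fallout == 1`). The sketch below recomputes it from a confusion matrix on made-up labels and compares the result with `f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# hypothetical true and predicted labels (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# for binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```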


In [22]:
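### score on the full dataset; X includes the training rows, so this is not an out-of-sample estimate
y_preds = grid.predict(X)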
print('F1 Score: ', f1_score(y, y_preds))

F1 Score:  0.9542870677404074


## Best Model <a class="anchor" id="best-model"></a>
<hr/>

In [38]:
pipe = make_pipeline(CountVectorizer(stop_words='english'),  MultinomialNB())

params = {
    'countvectorizer__min_df': [2],
    'countvectorizer__max_df': [0.9],
    'countvectorizer__ngram_range': [(1, 1)],
    'multinomialnb__alpha': [0.0001]  # the alpha search above selected 0.01
}

grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)

### grid.score() reports accuracy for a classifier, not R2
print('Accuracy training: ', grid.score(X_train, y_train))
print('Accuracy testing: ', grid.score(X_test, y_test))
print('-----------------------')
print('F1 Score training: ', f1_score(y_train, grid.predict(X_train)))
print('F1 Score testing: ', f1_score(y_test, grid.predict(X_test)))
y_preds = grid.predict(X)
print('-----------------------')
print('Complete F1 Score: ', f1_score(y, y_preds))

Accuracy training:  0.9610523221634333
Accuracy testing:  0.9444444444444444
-----------------------
F1 Score training:  0.9581292463264338
F1 Score testing:  0.9405099150141643
-----------------------
Complete F1 Score:  0.9537113768201729
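With a single candidate in the grid, `GridSearchCV` cross-validates that one setting and then refits it on the full training set, so a plain pipeline with the same parameters should yield the same fitted model. The sketch below illustrates this on a made-up toy corpus (`docs`, `labels`, `new_docs` are assumptions, not the Reddit data).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy corpus standing in for X_train / y_train (1 = Fallout-like post)
docs = [
    'vault wasteland radiation', 'vault power armor', 'wasteland raiders radiation',
    'power armor radiation', 'vault dweller wasteland', 'raiders power armor',
    'dragon castle sword', 'dragon village shield', 'castle guards sword',
    'village guards dragon', 'sword shield castle', 'village dragon guards',
]
labels = [1] * 6 + [0] * 6
new_docs = ['radiation in the vault', 'dragon near the castle']

best = {
    'countvectorizer__min_df': 2,
    'countvectorizer__max_df': 0.9,
    'countvectorizer__ngram_range': (1, 1),
    'multinomialnb__alpha': 0.0001,
}

# option 1: set the chosen parameters directly on the pipeline
direct = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
direct.set_params(**best).fit(docs, labels)

# option 2: a one-candidate grid search, as in the cell above
grid_one = GridSearchCV(
    make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB()),
    param_grid={key: [value] for key, value in best.items()},
    cv=3,
)
grid_one.fit(docs, labels)

same_predictions = bool((direct.predict(new_docs) == grid_one.predict(new_docs)).all())
print(same_predictions)
```

The grid search version still earns its keep when the cross-validated score in `grid_one.best_score_` is wanted; otherwise the direct pipeline is quicker to fit.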
