# Project 3: Web APIs & NLP
## *Basic Models*

In this notebook:

* [Null Model](#null-model)
* [Logistic Regression](#lgr)
* [KNN](#knn)
* [Basic Decision Tree](#dtree)
* [Naive Bayes](#naive-bayes)

These are all basic models with minimal parameter tuning, intended to give a baseline understanding of how different models perform.

#### Import Libraries & Read in Data

In [1]:
## standard imports 
import pandas as pd 
import numpy as np
import re
## visualizations
import matplotlib.pyplot as plt
import seaborn as sns
## preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.dummy import DummyClassifier
## modeling
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB
## trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, ExtraTreesClassifier, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor
## NLP
from sklearn.feature_extraction.text import CountVectorizer

## analysis
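## note: plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, make_scorer, f1_score, mean_squared_error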


## options
import sklearn
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 100
pd.set_option('max_colwidth', 100)

In [2]:
### read in data
path = '../data/reddit_posts_clean.csv'
data = pd.read_csv(path)

In [3]:
### Select Data
X = data['selftext']
y = data['is_fallout']

In [4]:
### TTS
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

## Null Model <a class="anchor" id="null-model"></a>
<hr/>

In [67]:
y.value_counts()

0    4754
1    4318
Name: is_fallout, dtype: int64

In [68]:
y.value_counts(normalize=True)[0]

0.5240299823633157
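
The same baseline can also be produced with the `DummyClassifier` imported above, which always predicts the majority class. The following is a minimal sketch on a small made-up label vector (`y_toy` and `X_toy` are hypothetical stand-ins, not the Reddit data), so the number it prints is not the notebook's baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels standing in for y (assumption: not the real is_fallout column)
y_toy = np.array([0, 0, 0, 1, 1])
# The dummy model ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_toy, y_toy)

# Accuracy of the null model equals the share of the majority class
baseline_accuracy = null_model.score(X_toy, y_toy)
print('Baseline Accuracy: ', baseline_accuracy)
```

Any model below should beat this score to be worth keeping.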

## Logistic Regression <a class="anchor" id="lgr"></a>
<hr/>

In [17]:
pipe = make_pipeline(CountVectorizer(stop_words='english'), StandardScaler(with_mean=False), LogisticRegression())

pipe.fit(X_train, y_train)

print('Training Score: ', pipe.score(X_train, y_train))
print('Testing Score: ', pipe.score(X_test, y_test))

Training Score:  0.9875073486184597
Testing Score:  0.8893298059964727


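With a testing score of about 0.89 against a null-model baseline of about 0.52, this is not great, but also not terrible for a simple model. The gap between the training score (0.99) and the testing score suggests some overfitting.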

In [20]:
pipe = make_pipeline(CountVectorizer(stop_words='english', max_features=100, ngram_range=(1,2)), StandardScaler(with_mean=False), LogisticRegression())

pipe.fit(X_train, y_train)

print('Training Score: ', pipe.score(X_train, y_train))
print('Testing Score: ', pipe.score(X_test, y_test))

Training Score:  0.8793356848912405
Testing Score:  0.8734567901234568


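Adding bigrams while capping the vocabulary at 100 features doesn't seem to do much here: the testing score drops slightly (0.873 vs. 0.889), although the gap between training and testing scores nearly disappears. Because two settings changed at once, the effect of the n-gram range alone is unclear. We'll try a proper grid search with some other parameters to see if we can improve on this at all.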
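
One possible reason a wider n-gram range changes little when `max_features` is small is that bigrams compete with unigrams for the same limited slots. The sketch below illustrates this on a hypothetical three-document corpus (`toy_docs` is made up for illustration and is not the `selftext` column).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (assumption: stands in for the real posts)
toy_docs = [
    "fallout vault dweller",
    "fallout vault tec",
    "skyrim dragon shout",
]

# Unigrams only, no cap on vocabulary size
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_docs)

# Unigrams and bigrams, no cap on vocabulary size
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

# Unigrams and bigrams, keeping only the 3 most frequent terms
capped = CountVectorizer(ngram_range=(1, 2), max_features=3).fit(toy_docs)

print('Unigram vocabulary size: ', len(unigrams.vocabulary_))
print('Unigram + bigram vocabulary size: ', len(with_bigrams.vocabulary_))
print('Capped vocabulary: ', sorted(capped.vocabulary_))
```

With the cap in place, only the most frequent terms survive, whether they are single words or pairs.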

In [21]:
pipe = make_pipeline(CountVectorizer(stop_words='english'), StandardScaler(with_mean=False),  LogisticRegression())

params = {'countvectorizer__max_features': [100, 500],
              'countvectorizer__min_df': [2, 3],
              'countvectorizer__max_df': [0.9, 0.95],
              'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)]}

grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)

print('Training Score: ', grid.score(X_train, y_train))
print('Testing Score: ', grid.score(X_test, y_test))
print('Best Parameters: ', grid.best_params_)

Training Score:  0.9454732510288066
Testing Score:  0.9029982363315696
Best Parameters:  {'countvectorizer__max_df': 0.9, 'countvectorizer__max_features': 500, 'countvectorizer__min_df': 3, 'countvectorizer__ngram_range': (1, 1)}


Increasing the number of features improves performance: the grid search selected `max_features=500` over 100, and the testing score rose to about 0.90. Adding n-grams didn't help, since the best `ngram_range` was `(1, 1)`. We'll look at some other models to see if this remains true.
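
To compare every parameter combination rather than only the winner, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a DataFrame. This is a minimal sketch on hypothetical toy data (`toy_docs`, `toy_labels`, and the small grid are assumptions for illustration), using `cv=2` only because the toy corpus is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (assumption: stands in for X_train / y_train)
toy_docs = [
    "vault wasteland nuka cola",
    "wasteland vault pip boy",
    "dragon shout skyrim nord",
    "skyrim dragon whiterun guard",
] * 5
toy_labels = [1, 1, 0, 0] * 5

toy_pipe = make_pipeline(CountVectorizer(stop_words='english'),
                         StandardScaler(with_mean=False),
                         LogisticRegression())

toy_params = {'countvectorizer__max_features': [5, 10],
              'countvectorizer__ngram_range': [(1, 1), (1, 2)]}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, cv=2)
toy_grid.fit(toy_docs, toy_labels)

# One row per parameter combination, best cross-validated score first
results = pd.DataFrame(toy_grid.cv_results_).sort_values('rank_test_score')
print(results[['param_countvectorizer__max_features',
               'param_countvectorizer__ngram_range',
               'mean_test_score']])
print('Best CV Score: ', toy_grid.best_score_)
```

Unlike `grid.score`, which evaluates the refit best model, `best_score_` is the mean cross-validated score of the best combination, so it is the fairer number to compare across parameter settings.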

## KNN <a class="anchor" id="knn"></a>
<hr/>

In [71]:
knn = make_pipeline(CountVectorizer(stop_words='english', max_features=100), StandardScaler(with_mean=False), KNeighborsClassifier())

knn.fit(X_train, y_train)

print('Training Score: ', knn.score(X_train, y_train))
print('Testing Score: ', knn.score(X_test, y_test))

Training Score:  0.8375955320399765
Testing Score:  0.7848324514991182


With a testing score of about 0.78, this is the worst-performing model in this notebook.

## Basic Decision Tree <a class="anchor" id="dtree"></a>
<hr/>

In [72]:
dt = make_pipeline(CountVectorizer(stop_words='english', max_features=100), DecisionTreeClassifier())

dt.fit(X_train, y_train)

print('Training Score: ', dt.score(X_train, y_train))
print('Testing Score: ', dt.score(X_test, y_test))

Training Score:  0.9331275720164609
Testing Score:  0.859347442680776


There is some evidence of overfitting, as expected for an unconstrained decision tree: the training score (0.93) is well above the testing score (0.86). We'll try a random forest model as well in another notebook.
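
Besides moving to an ensemble, limiting tree depth is a common way to check how much of the gap comes from an overgrown tree. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the Reddit posts), and `max_depth=3` is an arbitrary choice rather than a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for the vectorized posts)
X_syn, y_syn = make_classification(n_samples=500, n_features=20,
                                   n_informative=5, flip_y=0.1,
                                   random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=42)

# Fully grown tree: keeps splitting until every training leaf is pure
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(Xs_train, ys_train)

# Depth-limited tree: max_depth=3 is arbitrary, chosen only for illustration
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(Xs_train, ys_train)

for name, tree in [('Full', full_tree), ('Shallow', shallow_tree)]:
    print(name, 'tree depth: ', tree.get_depth())
    print('  Training Score: ', tree.score(Xs_train, ys_train))
    print('  Testing Score: ', tree.score(Xs_test, ys_test))
```

On the real data, `max_depth` could be passed to the `DecisionTreeClassifier` inside the pipeline in the same way.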

## Naive Bayes <a class="anchor" id="naive-bayes"></a>
<hr/>

In [16]:
nb = make_pipeline(CountVectorizer(stop_words='english', max_features=100), MultinomialNB(alpha=0.00001))

nb.fit(X_train, y_train)

print('Training Score: ', nb.score(X_train, y_train))
print('Testing Score: ', nb.score(X_test, y_test))

Training Score:  0.876249265138154
Testing Score:  0.8783068783068783


This model performs relatively well, and there is no evidence of overfitting: the training and testing scores are nearly identical (about 0.88). Performance may improve if more features are used.
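
One way to test that idea is to cross-validate the same pipeline at several vocabulary sizes. This is a minimal sketch on hypothetical toy data (`toy_docs`, `toy_labels`, and the `feature_caps` values are assumptions for illustration); on the real data the caps would be much larger, for example 100 and 500 as in the grid search above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (assumption: stands in for X_train / y_train)
toy_docs = [
    "vault wasteland nuka cola",
    "wasteland vault pip boy",
    "dragon shout skyrim nord",
    "skyrim dragon whiterun guard",
] * 5
toy_labels = [1, 1, 0, 0] * 5

# Vocabulary sizes to compare (illustrative values for a tiny corpus)
feature_caps = [2, 5, 10]

cv_scores = {}
for cap in feature_caps:
    model = make_pipeline(
        CountVectorizer(stop_words='english', max_features=cap),
        MultinomialNB(alpha=0.00001))
    # Mean accuracy over 2 stratified folds (cv=2 only because the data is tiny)
    cv_scores[cap] = cross_val_score(model, toy_docs, toy_labels, cv=2).mean()
    print('max_features =', cap, ' CV Score: ', cv_scores[cap])
```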