# Final Model and Conclusion
This notebook combines selected tuned models in a voting classifier to create a final model.

In [2]:
from sklearn.ensemble import VotingClassifier
import pickle
import pandas as pd
import warnings
warnings.filterwarnings('ignore')



- Read in relevant models and data

In [3]:
# Load the tuned models and train/test splits; the context manager closes the file
with open('../data/models.pk', 'rb') as f:
    models = pickle.load(f)
lr_model_cv = models['lr_model_cv']
rf_model_cv = models['rf_model_cv']
ada_model_cv = models['ada_model_cv']
gb_model_cv = models['gb_model_cv']
nb_model_cv = models['nb_model_cv']
X_train_cv = models['X_train_cv']
y_train  = models['y_train']
X_test_cv = models['X_test_cv']
y_test  = models['y_test']

- Voting Classifier to combine models

In [4]:
vote = VotingClassifier([
    ('lr' , lr_model_cv),
    ('rf' , rf_model_cv),
    ('ada', ada_model_cv),
    ('gb' , gb_model_cv),
    ('nb' , nb_model_cv )],
    voting='soft',
    weights=[.15,.15,.15,.15,.4]
)
vote.fit(X_train_cv, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rf', RandomF...             warm_start=False)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         flatten_transform=None, n_jobs=1, voting='soft',
         weights=[0.15, 0.15, 0.15, 0.15, 0.4])

In [5]:
print('Train score:', vote.score(X_train_cv, y_train))
print('Test score:', vote.score(X_test_cv, y_test))

Train score: 0.9171291157972623
Test score: 0.8745837957824639
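
For context, accuracy is usually compared against a majority-class baseline, that is, the accuracy obtained by always predicting the most common class. A minimal sketch of that calculation follows, using a small made-up label series (`toy_labels` is hypothetical and is not the notebook's `y_train` or `y_test`):

```python
import pandas as pd

# Hypothetical labels for illustration only: 0 = r/nba, 1 = r/BravoRealHousewives.
toy_labels = pd.Series([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Share of each class; the largest share is the accuracy of a model
# that always predicts the most common class.
class_shares = toy_labels.value_counts(normalize=True)
baseline_accuracy = class_shares.max()

print('Baseline accuracy:', baseline_accuracy)
```

Applying the same two lines to the real labels should give the baseline for this dataset.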


- Our final classification model uses the CountVectorizer transformation with 1500 features. A Voting Classifier applies 'soft' voting, weighting the predicted probabilities of 5 different models: Logistic Regression, Random Forest, AdaBoost, Gradient Boost, and Multinomial Naive Bayes.
- The Naive Bayes probabilities are weighted more heavily (0.4, versus 0.15 for each of the other models), as it is the model that overfits the least on our training data.
- We are left with a train accuracy score of 92% and a test accuracy score of 87%, which is well above the baseline score of 51%.
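
Soft voting can be sketched directly: each model's class probabilities are averaged using the model weights, and the class with the highest averaged probability wins. In the sketch below, the probabilities are made up for a single post; only the weights come from the notebook.

```python
import numpy as np

# Weights from the notebook, in the order lr, rf, ada, gb, nb.
weights = np.array([0.15, 0.15, 0.15, 0.15, 0.4])

# Hypothetical predict_proba outputs for one post: [P(nba), P(housewives)].
probas = np.array([
    [0.70, 0.30],  # lr
    [0.60, 0.40],  # rf
    [0.55, 0.45],  # ada
    [0.65, 0.35],  # gb
    [0.20, 0.80],  # nb
])

# Weighted average of the per-model probabilities (soft voting).
avg_proba = np.average(probas, axis=0, weights=weights)

# The predicted class is the one with the highest averaged probability.
predicted_class = int(np.argmax(avg_proba))

print(avg_proba, predicted_class)
```

In this made-up case, four of the five models lean toward NBA, but the heavier Naive Bayes weight tips the averaged probability toward Real Housewives.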


Our text data was collected from posts and comments in two subreddits: r/BravoRealHousewives and r/nba. After applying NLP methods such as tokenizing, stemming, and lemmatizing, followed by CountVectorizer and TF-IDF transformations, we fit various classification models (Logistic Regression, Random Forest, AdaBoost, Gradient Boost, and Multinomial Naive Bayes) and combined them through a Voting Classifier.

While the two chosen subreddits may seem drastically different, our final model classifies posts and comments with only 87% accuracy. This indicates there may be enough overlap in topics and speech patterns between Real Housewives and NBA fans to keep our model from a better accuracy score. Below, we list the top 20 most frequently appearing words for both the NBA and Real Housewives. We note that a majority of popular words in Real Housewives posts and comments also appear frequently in NBA posts and comments.
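
One rough way to quantify this overlap is to count how many of the most frequent Real Housewives words also pass a frequency threshold in NBA text. The sketch below uses the ten rows shown in this notebook's Real Housewives word table; the cutoff of 50 occurrences (`min_count`) is an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Counts copied from the Real Housewives word table in this notebook.
counts = pd.DataFrame(
    {
        'housewives': [351, 259, 251, 208, 207, 196, 187, 162, 159, 151],
        'nba': [213, 129, 230, 33, 189, 167, 89, 0, 2, 202],
    },
    index=['like', 'think', 'season', 'show', 'get', 'one', 'know',
           'housew', 'episod', 'would'],
)

# Assumed cutoff: a word counts as "frequent" in NBA text at 50+ occurrences.
min_count = 50

# Top Real Housewives words that are also frequent in NBA text.
shared_words = counts.index[counts['nba'] >= min_count].tolist()
shared_fraction = len(shared_words) / len(counts)

print(shared_words, shared_fraction)
```

Under this assumed cutoff, 7 of the 10 listed words are shared; a different threshold would give a different fraction.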

This model can be used to compare and contrast speech patterns and topics among fans of the Real Housewives and the NBA. In the end, we find that similar words are used frequently in both communities, which makes it more difficult to train models that separate the two classes. With more time and processing power, we might include additional features (higher-order n-grams) and tune additional hyperparameters.
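
As an illustration of the n-gram idea, `CountVectorizer` accepts an `ngram_range` argument; setting it to `(1, 2)` adds two-word phrases alongside single words. The toy corpus below is invented, and the `max_features=1500` cap simply mirrors the feature count mentioned above. This is a sketch under those assumptions, not the preprocessing actually used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example posts, for illustration only.
toy_posts = [
    'lebron had a great game last night',
    'that reunion episode was a great show',
]

# Unigrams and bigrams, capped at 1500 features.
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1500)
X_toy = vectorizer.fit_transform(toy_posts)

# vocabulary_ maps each unigram or bigram to its column index.
bigrams = sorted(term for term in vectorizer.vocabulary_ if ' ' in term)
print(bigrams)
```

Bigrams such as "great game" and "great show" let a model tell apart contexts that share the single word "great", at the cost of a larger feature space.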

In [6]:
#sort words by occurrences in NBA posts and comments
nba_top_20 = X_train_cv.groupby('target').sum().T.sort_values(0, ascending=False).head(20)
nba_top_20.columns = ['nba','housewives']
nba_top_20

word,nba,housewives
game,485,12
player,335,2
team,330,11
nba,320,0
play,263,35
point,250,42
season,230,251
like,213,351
year,204,82
would,202,151


In [29]:
#sort words by occurrences in Real Housewives posts and comments
rhw_top_20 = X_train_cv.groupby('target').sum().T.sort_values(1, ascending=False).head(20)
rhw_top_20.columns = ['nba','housewives']
rhw_top_20[['housewives','nba']]

word,housewives,nba
like,351,213
think,259,129
season,251,230
show,208,33
get,207,189
one,196,167
know,187,89
housew,162,0
episod,159,2
would,151,202


In [57]:
#important features of adaboost model
pd.DataFrame(ada_model_cv.feature_importances_, index = X_train_cv.columns).sort_values(by=0,ascending=False).head(20)

word,importance
look,0.033241
game,0.027114
episod,0.01489
housew,0.014032
alway,0.013484
ball,0.013476
shit,0.013421
bravorealhousew,0.013384
ramona,0.013265
lebron,0.013226


In [58]:
#important features of gradient boost model
pd.DataFrame(gb_model_cv.feature_importances_, index = X_train_cv.columns).sort_values(by=0,ascending=False).head(20)

word,importance
game,0.031583
team,0.015991
player,0.015373
housew,0.014373
nba,0.012683
play,0.012321
show,0.011564
episod,0.01151
fuck,0.011417
look,0.011183


In [60]:
#important features of random forest model
pd.DataFrame(rf_model_cv.feature_importances_, index = X_train_cv.columns).sort_values(by=0,ascending=False).head(20)

word,importance
game,0.036477
player,0.028263
nba,0.023978
team,0.020912
play,0.018022
lebron,0.013774
housew,0.013491
episod,0.011957
show,0.01189
leagu,0.011311
