# Reddit & Quibi: Web API and NLP
## Part 4: Modeling

Now that I've cleaned and explored the data, I want to build the best model to classify it. To do so, I'm going to try several different classification methods:
1. [Logistic Regression](#Logistic-Regression)
2. [Naive Bayes](#Naive-Bayes)
3. [Decision Tree](#Decision-Tree)
4. [Voting Classifier](#Voting-Classifier)

In [1]:
import pandas as pd
import numpy as np
import pickle
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

In [2]:
reddit = pickle.load(open('../datasets/reddit_features.pkl', 'rb'))

In [3]:
reddit.shape

(90000, 17)

The goal of this model is to eventually classify the descriptions of Quibi shows. Since those descriptions are limited in length, I'm only going to use the text of each observation as the feature. I'll use the same features and the same train/test split for all models:

In [4]:
X = reddit['text']
y = reddit['e_subreddit']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 21)
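As a quick check on what that split produces, here is a minimal sketch. `X_demo` and `y_demo` are hypothetical stand-ins (90,000 row ids and three balanced labels), not the real data, and the `stratify` option is something I did not use above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 90,000 row ids and three balanced class labels
X_demo = np.arange(90_000)
y_demo = np.repeat([0, 1, 2], 30_000)

# Default split (test_size=0.25) with the same random_state as above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=21)
print(len(X_tr), len(X_te))  # 67500 22500

# Optional: stratify keeps the class proportions identical in both splits
_, _, _, y_te_s = train_test_split(X_demo, y_demo, random_state=21,
                                   stratify=y_demo)
print(np.bincount(y_te_s))  # [7500 7500 7500]
```

With the defaults, 75% of the rows go to training and 25% to testing.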

In [6]:
def find_scores(model, x_tr, x_tst):
    train = model.score(x_tr, y_train)
    test = model.score(x_tst, y_test)
    cv = cross_val_score(model, x_tr, y_train).mean()
    return f'Train: {train} | Test: {test} | Cross Val {cv}'

## Logistic Regression

First, I'm going to fit a logistic regression model with no stop words. I want to identify key words for each subreddit that are so specific they give away the topic (e.g. HBO or podcasts). It's important to remove these because the Quibi descriptions won't contain them; I want to focus on picking up signal from words that describe the content itself.

Before creating the logistic regression model, I first need to vectorize the words. I'll be creating another Count Vectorizer with the stop words I determine later, so this one is fairly simple.

In [7]:
cvec_simp = CountVectorizer(ngram_range = (1, 2), min_df = 5, max_df = .80)

In [8]:
cvec_simp.fit(X_train)
X_train_cvec_s = cvec_simp.transform(X_train)
X_test_cvec_s = cvec_simp.transform(X_test)

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
lr = LogisticRegression(solver='newton-cg')

In [11]:
lr.fit(X_train_cvec_s, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
find_scores(lr, X_train_cvec_s, X_test_cvec_s)

'Train: 0.9745777777777778 | Test: 0.9053777777777777 | Cross Val 0.8998962962962963'
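One caveat on the cross-validation number: `find_scores` cross-validates on a matrix that was already transformed by a vectorizer fit on all of `X_train`, so each fold's vocabulary has seen the held-out rows. A possible alternative, sketched here on a hypothetical toy corpus (`docs` and `labels` are made up), is to wrap both steps in the `Pipeline` imported earlier so the vectorizer is refit inside every fold:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train
docs = ["podcast episode with a new host", "listen to this podcast episode",
        "podcast host interview audio", "tv show season finale tonight",
        "new tv show this season", "tv season premiere review",
        "video clip goes viral", "funny video clip upload",
        "video channel upload today"]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# The vectorizer and the model are fit together on each training fold
pipe = Pipeline([
    ('cvec', CountVectorizer(ngram_range=(1, 2))),
    ('lr', LogisticRegression(solver='newton-cg')),
])

scores = cross_val_score(pipe, docs, labels, cv=3)
print(scores.mean())
```

This is only a sketch of the idea; I have not rerun the models above this way.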

### Stop Words

Based on the coefficients generated by the above model, we can see which words have the most impact on each class. Many of these will be too specific, such as the name of a TV provider or even the name of the medium itself. While these words help accuracy, they don't tell us anything we didn't already know. To combat this, I'm going to generate a list of stop words for each subreddit, then filter it to be more robust.

In [13]:
lr.coef_

array([[-1.27882415e-01,  8.99146997e-02, -8.04015211e-02, ...,
        -2.35703828e-02, -1.62869953e-01, -5.92691997e-02],
       [-3.09486052e-01, -8.99132580e-02,  3.83380879e-02, ...,
        -1.12091040e-01,  4.42508037e-01, -1.94965854e-01],
       [ 4.37368468e-01, -1.44177121e-06,  4.20634332e-02, ...,
         1.35661423e-01, -2.79638084e-01,  2.54235054e-01]])

This DataFrame matches the terms with their coefficient for each class:

In [14]:
coef_df = pd.DataFrame(lr.coef_).T

In [15]:
features_cvec = cvec_simp.get_feature_names()

In [16]:
coef_df['term'] = features_cvec
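To make the layout of `coef_` concrete, here is a small sketch on a hypothetical six-post corpus (all `*_demo` names are made up for illustration). Each row of `coef_` belongs to one class, in the order of the model's `classes_`, and each column belongs to one vocabulary term:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: two posts per class (0, 1, 2)
docs = ["podcast episode audio", "podcast episode host",
        "tv show season", "tv show finale",
        "video clip upload", "video clip channel"]
labels = [0, 0, 1, 1, 2, 2]

cvec_demo = CountVectorizer()
X_demo = cvec_demo.fit_transform(docs)
lr_demo = LogisticRegression().fit(X_demo, labels)

# One row per class, one column per vocabulary term
print(lr_demo.coef_.shape)  # (3, 12)

# Order the terms by column index (works across scikit-learn versions)
terms = sorted(cvec_demo.vocabulary_, key=cvec_demo.vocabulary_.get)

# Transpose so each row is a term and columns 0, 1, 2 are the classes
coef_demo = pd.DataFrame(lr_demo.coef_).T
coef_demo['term'] = terms

# The two terms that push hardest toward class 0
top_class_0 = list(coef_demo.sort_values(0, ascending=False)['term'][:2])
print(top_class_0)
```

Sorting a class column in descending order puts the terms most associated with that class first, which is how the stop word candidates are picked.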

These lists hold the top-ranked terms for each subreddit (300 each for `pod_stop` and `tv_stop`, 150 for `vid_stop`):

In [17]:
pod_stop = coef_df.sort_values(0, ascending=False)['term'][:300]

In [18]:
tv_stop = coef_df.sort_values(1, ascending=False)['term'][:300]

In [19]:
vid_stop = coef_df.sort_values(2, ascending=False)['term'][:150]

In [20]:
coef_df.sort_values(1, ascending=False)['term'][:600]

688          2018 2018
72312       youlikebet
68509         watchmen
39742          netflix
58299       television
             ...      
48300            promo
721        2019 latest
50824          reveals
45670          phoenix
40535    normal people
Name: term, Length: 600, dtype: object

Lastly, I'm going to filter the terms based on how often they appear in the other subreddits. This is important because the top terms contain some obvious ones to remove (e.g. "podcasts" in the podcast subreddit), but they also contain key words that speak to the different content on each (e.g. "crime"). I want to keep words like that, so a term only goes on the stop word list if it appears in fewer than 50 posts from the other subreddits. The check is a substring match on the post text, so a term also counts when it occurs inside a longer word.

In [21]:
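def filter_words(stop_list, target_class):
    # Text of every post outside the target subreddit (built once, not per term)
    other_text = list(reddit[reddit['e_subreddit'] != target_class]['text'])
    sel_word = []
    for term in stop_list:
        # Number of other-subreddit posts containing the term (substring match)
        count = sum(term in post for post in other_text)
        # Terms that are rare elsewhere are subreddit-specific: treat as stop words
        if count < 50:
            sel_word.append(term)
    return sel_word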
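Because the check inside `filter_words` is a substring match, a short term can be counted inside longer words. The sketch below contrasts that with a whole-word match; `count_posts_containing` and `other_posts` are hypothetical names for illustration, not part of the notebook:

```python
import re

def count_posts_containing(term, posts, whole_word=False):
    """Count posts containing `term`, as a substring or as a whole word/phrase."""
    if whole_word:
        # \b anchors the term at word boundaries; escape any regex characters
        pattern = re.compile(r'\b' + re.escape(term) + r'\b')
        return sum(1 for post in posts if pattern.search(post))
    return sum(1 for post in posts if term in post)

# Hypothetical posts from the "other" subreddits
other_posts = ['a tripod for the camera', 'my pod is out now', 'new tv season']

substring_count = count_posts_containing('pod', other_posts)
word_count = count_posts_containing('pod', other_posts, whole_word=True)
print(substring_count, word_count)  # 2 1
```

The substring version counts "tripod" as a hit for "pod", which would make a term look more common in the other subreddits than it really is. Both versions are case-sensitive.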

In [22]:
pod_filter = filter_words(pod_stop, 0)

In [23]:
tv_filter = filter_words(tv_stop, 1)

In [24]:
vid_filter = filter_words(vid_stop, 2)

All of these filtered lists will be my final stop words set along with a few other words I manually picked:

In [25]:
addtl_words = ['mic', 'mics', 'service', 'hbo', 'isp', 'dish', 'pod', 'ott', 'web series',
               'cutting', 'spectrum', 'networks', 'youtuber', 'television', 'horseman', 'renewed', 'https']

In [26]:
stop_words_set = (pod_filter + tv_filter + vid_filter + addtl_words)

In [27]:
len(stop_words_set)

572

In [28]:
stop_word_set = [x.split(' ') for x in stop_words_set]

In [29]:
#Advice on unpacking these lists within lists 
#https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
stop_word_set = [i for j in stop_word_set for i in j]

In [30]:
len(stop_word_set)

761
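The split is needed because some entries are bigrams, and `CountVectorizer` compares stop words against single tokens. A minimal sketch of the same split-and-flatten step, using hypothetical terms and `itertools.chain` instead of the nested comprehension:

```python
from itertools import chain

# Hypothetical stop terms, some of them bigrams
stop_terms = ['web series', 'hbo', 'series finale']

# Split each term on spaces, then flatten the resulting list of lists
flat = list(chain.from_iterable(term.split(' ') for term in stop_terms))
print(flat)  # ['web', 'series', 'hbo', 'series', 'finale']

# Optional: drop repeats while keeping first-seen order
unique = list(dict.fromkeys(flat))
print(unique)  # ['web', 'series', 'hbo', 'finale']
```

Flattening can introduce repeated words, which explains part of the jump in length. Repeats are harmless as stop words, so the de-duplication step is optional.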

I want to add these to scikit-learn's built-in English stop words:

In [31]:
english = list(CountVectorizer(stop_words = 'english').get_stop_words())

In [32]:
stop_word_set = stop_word_set + english

In [33]:
len(stop_word_set)

1079

Now that I have a set of stop words, I'll fit a few more models with a new Count Vectorizer:

In [34]:
cvec = CountVectorizer(tokenizer = None,
                       stop_words = stop_word_set,
                       ngram_range = (1, 2),
                       max_df = .95,
                       max_features = 60_000)

In [35]:
cvec.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.95, max_features=60000, min_df=1,
                ngram_range=(1, 2), preprocessor=None,
                stop_words=['podcasts', 'podcasting', 'podcasters', 'overcast',
                            'stitcher', 'pods', 'listeners', 'listening',
                            'cohost', 'podcaster', 'recording', 'itunes',
                            'anchor', 'listener', 'podbean', 'traders',
                            'podchaser', 'chatting', 'luminary', 'soundcloud',
                            'spotify', 'podsync', 'microphones', 'velvet',
                            'dnd', 'castbox', 'entrepreneurship', 'launchpaddm',
                            'listened', 'microphone', ...],
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [36]:
X_train_cvec = cvec.transform(X_train)
X_test_cvec = cvec.transform(X_test)
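To illustrate how the stop words interact with `ngram_range = (1, 2)`, here is a small sketch on a hypothetical one-post corpus. Stop words are dropped from the token stream before bigrams are built, and a two-word stop entry never matches a single token (scikit-learn may emit a warning about the second vectorizer's stop words):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-post corpus
doc = ['the podcast host talks crime']

# Single-word stop entry: removed before the bigrams are formed
cv_single = CountVectorizer(stop_words=['podcast'], ngram_range=(1, 2)).fit(doc)
vocab_single = sorted(cv_single.vocabulary_)
print(vocab_single)
# ['crime', 'host', 'host talks', 'talks', 'talks crime', 'the', 'the host']

# Two-word stop entry: it never equals a single token, so nothing is removed
cv_bigram = CountVectorizer(stop_words=['podcast host'], ngram_range=(1, 2)).fit(doc)
bigram_survives = 'podcast host' in cv_bigram.vocabulary_
print(bigram_survives)  # True
```

Note that removing "podcast" creates the new bigram "the host", which joins two words that were not adjacent in the original text.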

In [37]:
lr_2 = LogisticRegression(solver='newton-cg')

In [38]:
lr_2.fit(X_train_cvec, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [39]:
find_scores(lr_2, X_train_cvec, X_test_cvec)

'Train: 0.920074074074074 | Test: 0.8159555555555555 | Cross Val 0.8021777777777779'

Of the models fit with the custom stop words, this one ended up with the best score and is also the most useful for interpretation, so I pickled it (along with its vectorizer) to take to another notebook:

In [40]:
pickle.dump(lr_2, open('../model/log_reg_1.pkl', 'wb'))

In [41]:
pickle.dump(cvec, open('../model/cvec_1.pkl', 'wb'))
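Both objects are saved because new text has to pass through the same fitted vectorizer before the model can score it. A minimal round-trip sketch, using a hypothetical toy model and a temporary directory rather than the real `../model/` files:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus and model standing in for cvec / lr_2
docs = ["podcast episode audio", "podcast episode host",
        "tv show season", "tv show finale"]
labels = [0, 0, 1, 1]
cvec_demo = CountVectorizer().fit(docs)
model_demo = LogisticRegression().fit(cvec_demo.transform(docs), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'model_demo.pkl')
    cvec_path = os.path.join(tmp, 'cvec_demo.pkl')

    # Save both objects; context managers close the files promptly
    with open(model_path, 'wb') as f:
        pickle.dump(model_demo, f)
    with open(cvec_path, 'wb') as f:
        pickle.dump(cvec_demo, f)

    # Load them back, as the next notebook would
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(cvec_path, 'rb') as f:
        loaded_cvec = pickle.load(f)

# New text must go through the loaded vectorizer before prediction
new_text = ['a new tv show about crime']
pred_before = model_demo.predict(cvec_demo.transform(new_text))
pred_after = loaded_model.predict(loaded_cvec.transform(new_text))
print(pred_before, pred_after)
```

Pickled scikit-learn objects should be loaded with the same library version that saved them.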

A few of the other models that I tried (but that didn't make the cut for the Quibi data) are below. The first swaps the Count Vectorizer for a TF-IDF Vectorizer:

In [42]:
tf = TfidfVectorizer(tokenizer = None, stop_words = stop_word_set, max_df = .7, ngram_range = (1, 2), min_df = 5)

In [43]:
tf.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.7, max_features=None,
                min_df=5, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True,
                stop_words=['podcasts', 'podcasting', 'podcasters', 'overcast',
                            'stitcher', 'pods', 'listeners', 'listening',
                            'coh...ter', 'recording', 'itunes',
                            'anchor', 'listener', 'podbean', 'traders',
                            'podchaser', 'chatting', 'luminary', 'soundcloud',
                            'spotify', 'podsync', 'microphones', 'velvet',
                            'dnd', 'castbox', 'entrepreneurship', 'launchpaddm',
                            'listened', 'microphone', ...],
                strip_accents=None, sublinear_tf=False,
                token_pattern='(?u)

In [44]:
X_train_tf = tf.transform(X_train)

In [45]:
X_test_tf = tf.transform(X_test)

In [46]:
lr_3 = LogisticRegression(solver='newton-cg')

In [47]:
lr_3.fit(X_train_tf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [48]:
find_scores(lr_3, X_train_tf, X_test_tf)

'Train: 0.8700296296296296 | Test: 0.8132 | Cross Val 0.7997925925925926'

## Naive Bayes

In [49]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB

In [50]:
mnb = MultinomialNB()

In [51]:
mnb.fit(X_train_cvec, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [52]:
find_scores(mnb, X_train_cvec, X_test_cvec)

'Train: 0.8246370370370371 | Test: 0.7855555555555556 | Cross Val 0.7768592592592592'

This model was very fast to fit but less accurate than the logistic regression models (test accuracy of about 0.786, versus 0.816 for `lr_2`).

## Decision Tree

In [53]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [54]:
from sklearn.tree import DecisionTreeClassifier

In [55]:
dt = DecisionTreeClassifier(random_state = 42)

In [56]:
dt.fit(X_train_cvec, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [57]:
find_scores(dt, X_train_cvec, X_test_cvec)

'Train: 0.9757629629629629 | Test: 0.7135555555555556 | Cross Val 0.7000740740740741'

This model had really high variance (train accuracy of about 0.976 versus 0.714 on the test set) and scored much lower than the Logistic Regression models above.

## Voting Classifier

In [58]:
from sklearn.ensemble import VotingClassifier

In [59]:
vc = VotingClassifier(
            estimators=[
        ('lr', lr_2), ('mnb', mnb), ('dt', dt)], voting='hard')

In [60]:
vc.fit(X_train_cvec, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='newton-cg', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('mnb',
                              MultinomialNB(alpha=1.0, class_prior=None,
                                            fit_prior=True)),
                             ('dt',
                              Dec...eClassifier(ccp_alpha=0.0,
               

In [61]:
find_scores(vc, X_train_cvec, X_test_cvec)

'Train: 0.9372148148148148 | Test: 0.8123111111111111 | Cross Val 0.7992148148148148'

This model didn't beat `lr_2` (test accuracy of about 0.812 versus 0.816), but it was interesting to see how the models interact and whether one might correct the bias of another.
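For intuition on what `voting='hard'` does, here is a plain-Python sketch of a majority vote over three models' predictions. `hard_vote` and the prediction lists are hypothetical; the tie-break toward the smallest class label follows my reading of the scikit-learn documentation:

```python
from collections import Counter

def hard_vote(*model_predictions):
    """Majority vote per sample; ties go to the smallest class label."""
    voted = []
    for votes in zip(*model_predictions):
        counts = Counter(votes)
        # Most votes wins; on a tie, pick the lowest label
        voted.append(min(counts, key=lambda label: (-counts[label], label)))
    return voted

# Hypothetical predictions from three models on four posts
lr_pred = [0, 1, 2, 1]
mnb_pred = [0, 2, 2, 0]
dt_pred = [1, 1, 2, 2]

ensemble_pred = hard_vote(lr_pred, mnb_pred, dt_pred)
print(ensemble_pred)  # [0, 1, 2, 0]
```

In the first two posts, one model is outvoted by the other two. In the last post, all three disagree and the tie-break decides.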

After trying a variety of options, I found that the second Logistic Regression model (`lr_2`) is the best of the models fit with the custom stop words and suits the needs of the project. That's what I'll use in the following notebook to predict on the Quibi shows.
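As a recap, this sketch compares the models fit with the custom stop words. The accuracies are copied from the `find_scores` outputs above and rounded to four places; the dictionary labels are my own shorthand:

```python
# (train accuracy, test accuracy) copied from the outputs above, rounded
scores = {
    'lr_2 (CountVectorizer)': (0.9201, 0.8160),
    'lr_3 (TfidfVectorizer)': (0.8700, 0.8132),
    'mnb': (0.8246, 0.7856),
    'dt': (0.9758, 0.7136),
    'vc': (0.9372, 0.8123),
}

# The train/test gap is a rough indicator of overfitting
for name, (train, test) in scores.items():
    print(f'{name:<24} test={test:.4f} gap={train - test:.4f}')

best = max(scores, key=lambda name: scores[name][1])
largest_gap = max(scores, key=lambda name: scores[name][0] - scores[name][1])
print(best)         # lr_2 (CountVectorizer)
print(largest_gap)  # dt
```

The test scores of `lr_2`, `lr_3`, and the voting classifier are within half a point of each other, so interpretability was a deciding factor as well.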