# Models!

In this notebook I run and test as many different models.

I was interested in seeing the difference between titles and selftext and the feature engineered column I created.

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB



In [2]:
df_reddit = pd.read_csv('Datasets/reddit_cleaned_title_and_selftext.csv')

In [3]:
df_reddit.head()

Unnamed: 0,author,id,num_comments,score,created_utc,selftext,title,subreddit,char_count_title,word_count_title,char_count_selftext,word_count_selftext,title + selftext,clean_title,clean_selftext,clean_title_+_selftext,neg,neu,pos,compound
0,nothanksbud5,g5rffm,1,2,1587516214,Wow i didn’t realize how much music is about b...,Why is almost all music seem to be about love?,0,46,10,219,36,Why is almost all music seem to be about love?...,almost music seem love,wow realize much music love romance seems like...,almost music seem love wow realize much music ...,0.08,0.301,0.619,0.9777
1,dontknowwhattdo,g5r7z2,3,2,1587515419,I thought that during this time it would be ni...,pieces of advice that have stuck with you?,0,42,8,285,55,pieces of advice that have stuck with you? I t...,pieces advice stuck,thought time would nice hear words encourageme...,pieces advice stuck thought time would nice he...,0.091,0.383,0.526,0.9657
2,sharkfinnsouphk,g5r5q2,2,0,1587515173,I just can't shake this worry about kids (and ...,Worried about people stuck at home,0,34,6,269,50,Worried about people stuck at home I just can'...,worried people stuck home,shake worry kids adults stuck home lock sexual...,worried people stuck home shake worry kids adu...,0.467,0.456,0.077,-0.8924
3,dehlen1me,g5r3t3,0,1,1587514972,https://youtu.be/9_AWrNmcMZA\nThis is one of t...,How a 5 Dollar bill can help you to feel bette...,0,62,13,179,24,How a 5 Dollar bill can help you to feel bette...,dollar bill help feel better,https youtu awrnmcmza one amazing uplifting vi...,dollar bill help feel better https youtu awrnm...,0.0,0.63,0.37,0.8555
4,fighterpilot909,g5qtjo,2,2,1587513886,Imagine how insane that book would be. To make...,I want an autobiography from John McAfee so badly,0,49,9,144,28,I want an autobiography from John McAfee so ba...,want autobiography john mcafee badly,imagine insane book would make even better cou...,want autobiography john mcafee badly imagine i...,0.215,0.63,0.156,-0.3818


In [4]:
df_reddit.shape

(21825, 20)

In [19]:
df_reddit.isnull().sum()

author                    0
id                        0
num_comments              0
score                     0
created_utc               0
selftext                  0
title                     0
subreddit                 0
char_count_title          0
word_count_title          0
char_count_selftext       0
word_count_selftext       0
title + selftext          0
clean_title               0
clean_selftext            0
clean_title_+_selftext    0
neg                       0
neu                       0
pos                       0
compound                  0
dtype: int64

In [13]:
df_reddit['subreddit'].value_counts(normalize=True)

0    0.578923
1    0.421077
Name: subreddit, dtype: float64

In [14]:
df_reddit.dropna(inplace=True)

In [15]:
df_reddit['subreddit'].value_counts(normalize=True)

0    0.579003
1    0.420997
Name: subreddit, dtype: float64

For some reason when I pulled in my csv there was missing values. I checked my previous notebook(03_Further_Cleaning.ipynb), but couldn't find out why there were added null values, but there weren't too many of them so I decided to remove them.

## Testing on Titles

Testing on the clean titles I created to see how they run in a model

In [22]:
X = df_reddit['clean_title']
y = df_reddit['subreddit']

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)

In [24]:
y_test.value_counts(normalize=True)

0    0.578996
1    0.421004
Name: subreddit, dtype: float64

In [25]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

In [32]:
pipe_params = {
    'cvec__max_features': [2000, 3000, 4000, 5000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2), (2,2)]
}

In [33]:
gs = GridSearchCV(pipe,
                  pipe_params,
                  cv=5)

In [34]:
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [35]:
gs.best_score_

0.723219923849684

In [36]:
gs.score(X_test, y_test)

0.7342284347986022

I am not too surpised to see that the model didn't score too high since there wasn't as much information on the titles as other columns like selftext. It seems to fit people well to the test model as well. Remembering from the EDA that the titles were about the same in word count. So this makes sense.

## Testing on Clean Selftext

Testing on the clean selftext I created to see how they run in a model

In [37]:
X = df_reddit['clean_selftext']
y = df_reddit['subreddit']

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)

In [43]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))
])

In [44]:
pipe_params = {
    'cvec__max_features': [2000, 3000, 4000, 5000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2), (2,2)]
}

In [45]:
gs = GridSearchCV(pipe,
                  pipe_params,
                  cv=5)

In [46]:
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [47]:
gs.best_score_

0.7771783320369616

In [48]:
gs.score(X_test, y_test)

0.7855434982527129

In [49]:
gs.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 2000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2)}

This column did perform better then the title column, but was hoping for a better score. Now I would like to see how the feature engineered column did.


## Testing on Clean Title + Selftext

Testing on the clean titles and selftext I created to see how they run in a model

In [50]:
X = df_reddit['clean_title_+_selftext']
y = df_reddit['subreddit']

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)

In [52]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))
])

In [53]:
# Going to use the best params from selftext
pipe_params = {
    'cvec__max_features': [2000],
    'cvec__min_df': [2],
    'cvec__max_df': [.9],
    'cvec__ngram_range': [(1,2)]
}

In [54]:
gs = GridSearchCV(pipe,
                  pipe_params,
                  cv=5)

In [55]:
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [56]:
gs.best_score_

0.7850882294158469

In [60]:
gs.score(X_train, y_train)

0.864124103255871

In [61]:
gs.score(X_test, y_test)

0.7980503954386611

In [62]:
lr_model = gs.estimator

In [63]:
lr_model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('cvec',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('lr',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                           

In [64]:
lr_model.score(X_train, y_train)

0.9850389355570544

In [65]:
lr_model.score(X_test, y_test)

0.7984182453558948

It looks as though the feature engineered column faired a little better, but not too much. Also after doing a little more to the model I realized it was quite overfit.

## Trying out Naive Bayes

In [66]:
X = df_reddit['clean_title_+_selftext']
y = df_reddit['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

In [70]:
mnb = MultinomialNB()

mnb.fit(X_train_cv, y_train)

mnb.score(X_train_cv, y_train)

0.8684162119075357

In [71]:
mnb.score(X_test_cv, y_test)

0.8199374655140702

Compared to the last three models this one seems to be performing the best. It is a little overfit, but not too much. I would be interested to see how the score would be if I gathered more subreddit submissions and did a little better cleaning if this model would perform a little better.

### Trying out the Clean Title + Selftext differently

I wanted to see if I ran a model without the pipeline and different hyperparameters how it will look.

In [143]:
X = df_reddit['clean_title_+_selftext']
y = df_reddit['subreddit']

In [144]:
# Create train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    stratify=y,
                                                    random_state = 42)

In [145]:
#Instantiating CountVectorizer 
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 2000,
                             ngram_range= (1,2),
                             min_df = 2) 

In [146]:
X_train_vectorizer = vectorizer.fit_transform(X_train)

X_test_vectorizer = vectorizer.transform(X_test)


In [147]:
# Instantiate logistic regression model.
lr = LogisticRegression(solver = 'liblinear')
# Fit model to training data.
lr.fit(X_train_vectorizer, y_train)

# Evaluate model on training data.
lr.score(X_train_vectorizer, y_train)

0.864124103255871

In [148]:
lr.score(X_test_vectorizer, y_test)

0.7980503954386611

Comparing the outcomes between the 'clean' title + selftext and the normal title + selftext to see if they were performing any differently.

In [120]:
X = df_reddit['title + selftext']
y = df_reddit['subreddit']

In [125]:
#Instantiating CountVectorizer 
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = 'english',
                             max_features = 2000,
                             ngram_range= (1,2)) 

In [126]:
X_train_vectorizer = vectorizer.fit_transform(X_train)

X_test_vectorizer = vectorizer.transform(X_test)


In [127]:
# Instantiate logistic regression model.
lr = LogisticRegression(solver = 'liblinear')
# Fit model to training data.
lr.fit(X_train_vectorizer, y_train)

# Evaluate model on training data.
lr.score(X_train_vectorizer, y_train)

0.9321846833036973

In [128]:
lr.score(X_test_vectorizer, y_test)

0.7914290969284532

Running the two models has showed me that the two columns are scoring at very different percentages. At least in the training sets. The test sets are performing pretty similarly. The major difference is that the untampered 'title + selftext' is way more overfit. By about 14% points where the clean column is only overfit by 7%. Which seems to be better, but both testing sets performed about the same. Which makes me think if I need a lot more data. Maybe 21,000 rows isn't enough data to parse though to see a drastic difference.

## Trying out a model with word count and sentiment analysis included

I was interested to see if the model would perform better if I tried adding in more columns. I was wondering if a sentiment analysis would help the testing set improve.

In [20]:
df_reddit.columns

Index(['author', 'id', 'num_comments', 'score', 'created_utc', 'selftext',
       'title', 'subreddit', 'char_count_title', 'word_count_title',
       'char_count_selftext', 'word_count_selftext', 'title + selftext',
       'clean_title', 'clean_selftext', 'clean_title_+_selftext', 'neg', 'neu',
       'pos', 'compound'],
      dtype='object')

In [21]:
X = df_reddit[['word_count_title', 'word_count_selftext', 'clean_title_+_selftext', 'neg', 'neu',
       'pos', 'compound', 'num_comments']]
y = df_reddit['subreddit']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.25, 
                                                    random_state = 42, 
                                                    stratify = y)

In [23]:
X_train.shape, X_test.shape

((16309, 8), (5437, 8))

In [24]:
X_train.head()

Unnamed: 0,word_count_title,word_count_selftext,clean_title_+_selftext,neg,neu,pos,compound,num_comments
19518,17,233,person excited spend extra time spouse possibl...,0.028,0.592,0.38,0.9941,10
21027,9,103,yeah due corona work hi guys live sound engine...,0.198,0.658,0.144,-0.4567,2
8544,13,542,people underestimate effect culture idea men w...,0.242,0.623,0.135,-0.9854,0
426,19,195,secretly hoping corona virus takes massive tol...,0.227,0.636,0.136,-0.9153,1
17529,16,90,despite media says things revert back pre covi...,0.219,0.593,0.188,-0.3191,5


In [25]:
# Countvectorizing the clean title + selftext
cvec = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 2000,
                             ngram_range= (1,2),
                             min_df = 2) 


X_train_cv = cvec.fit_transform(X_train['clean_title_+_selftext'])
X_test_cv = cvec.transform(X_test['clean_title_+_selftext'])

In [26]:
# Turning the CountVectorized data into dataframes
X_train_df = pd.DataFrame(X_train_cv.todense(), columns= cvec.get_feature_names(), index=X_train.index)

X_test_df = pd.DataFrame(X_test_cv.todense(), columns= cvec.get_feature_names(), index=X_test.index)

In [27]:
X_train_df.head()

Unnamed: 0,ability,able,able get,absolute,absolutely,abuse,abused,abusive,accept,accepted,...,years old,yes,yesterday,yet,yo,young,younger,youtube,youtube com,zero
19518,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21027,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8544,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
426,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17529,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
# Adding 'neg', 'neu', 'pos', 'compound', 'word_count_title', 'word_count_selftext', 
#'num_comments' back to training set
X_train_df = pd.merge(left = X_train_df,
                     right = X_train[['neg', 
                                      'neu', 
                                      'pos', 
                                      'compound',
                                      'word_count_title', 
                                      'word_count_selftext',
                                      'num_comments']],
                     left_index = True,
                     right_index = True)

X_train_df.head()

Unnamed: 0,ability,able,able get,absolute,absolutely,abuse,abused,abusive,accept,accepted,...,youtube,youtube com,zero,neg,neu,pos,compound,word_count_title,word_count_selftext,num_comments
19518,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0.028,0.592,0.38,0.9941,17,233,10
21027,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.198,0.658,0.144,-0.4567,9,103,2
8544,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.242,0.623,0.135,-0.9854,13,542,0
426,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.227,0.636,0.136,-0.9153,19,195,1
17529,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.219,0.593,0.188,-0.3191,16,90,5


In [29]:
# Adding 'neg', 'neu', 'pos', 'compound', 'word_count_title', 'word_count_selftext', 
#'num_comments' back to testing set
X_test_df = pd.merge(left = X_test_df,
                    right = X_test[['neg', 
                                      'neu', 
                                      'pos', 
                                      'compound',
                                      'word_count_title', 
                                      'word_count_selftext',
                                      'num_comments']],
                    left_index = True,
                    right_index = True)

In [30]:
X_train_df.shape, y_train.shape

((16309, 2007), (16309,))

In [31]:
X_test_df.shape, y_test.shape

((5437, 2007), (5437,))

In [38]:
lr = LogisticRegression(solver='liblinear', penalty='l2')

lr.fit(X_train_df, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [39]:
print('train:', lr.score(X_train_df, y_train))
print('test:', lr.score(X_test_df, y_test))

train: 0.8669446317983935
test: 0.8033842192385506


The testing set was a lot higher than some of the other models I ran. I still think that the Naive Bayes model did the best. 

## Trying to find out what words were most helpful in the model

In [41]:
results = lr.fit(X_train_df, y_train)

In [43]:
results.coef_

array([[-0.21592967, -0.11841878, -0.57113855, ...,  0.01556339,
         0.00526231, -0.00167541]])

In [89]:
words_df = pd.DataFrame(results.coef_,
             columns=X_train_df.columns).T.sort_values(0)

In [90]:
words_df

Unnamed: 0,0
quarantine,-2.026719
poll https,-1.961879
view poll,-1.924721
wanted share,-1.709266
corona,-1.599827
...,...
sexually,1.686206
died,1.853095
death,1.962149
sexual,2.197139


In [91]:
# Dropping words I don't think should be there
words_df.drop(index=['seriousconversation', 'neg'], inplace=True)

In [93]:
# Checking out what are the top 20 words
words_df.loc[(words_df[0] > 1.03), :].sort_values(0, ascending = False)

Unnamed: 0,0
religious,2.371477
sexual,2.197139
death,1.962149
died,1.853095
sexually,1.686206
politics,1.667395
trauma,1.660829
cancer,1.633291
drugs,1.562543
dying,1.441792


In [83]:
# Removing more words that I feel aren't too useful
words_df.drop(index=['poll https', 'view poll', 'neu', 'mid', 'casualconversation wiki'], inplace=True)

In [87]:
words_df.loc[(words_df[0] < -.95), :].sort_values(0)

Unnamed: 0,0
quarantine,-2.026719
wanted share,-1.709266
corona,-1.599827
isolation,-1.434727
lockdown,-1.287376
caring,-1.267222
cake,-1.224584
stay home,-1.188175
social distancing,-1.171571
bought,-1.133384


After going through the modeling process I realized that there is still some more good work that can be done. Listing out the coefficients I see that there are some other words I could have removed to maybe help the model. Cleaning and parsing out the self text with website links. Removing the subreddit name from all the values. 

I also think I may need a lot more data. My initial thought of about 22,000 was clearly not enough. I also needed to tweak the hyperparameters more. 

This dataset was definitly large and not doing proper cleaning I can see how it is pretty exhaustive to your computer. One of the models I ran took 30 minutes. 