**Imports**

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

**Load in Data / Finalize Pre-processing**

In [2]:
df = pd.read_csv('Data/modelready_fin.csv')

In [3]:
df.shape

(1699, 2)

In [4]:
df.head()

Unnamed: 0,combine,target
0,imgur reupload since posted photo sorry im kin...,0
1,pokemon go team rocket event detail leaked chr...,0
2,long story short time lived germany made frien...,0
3,tired wasting tm getting right move wasted fiv...,0
4,ive little guy waiting patiently front researc...,0


In [5]:
df.isnull().sum()

combine    1
target     0
dtype: int64

In [6]:
# One row is null: an empty string was most likely written to the CSV and read back as NaN. Inspect it here, then drop it below.
df[df['combine'].isnull()]

Unnamed: 0,combine,target
676,,1


In [7]:
df.dropna(inplace=True)

In [8]:
df.shape

(1698, 2)
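The empty-string-to-null round trip described above can be reproduced on a toy frame. This is a minimal sketch, not the notebook's actual data: the `toy` frame and its contents are made up for illustration. It also shows that `keep_default_na=False` is one way to keep the empty string as a string, if dropping the row were not wanted.

```python
import io
import pandas as pd

# Toy frame standing in for the real data (hypothetical values);
# the middle row holds an empty string.
toy = pd.DataFrame({"combine": ["raid boss", "", "first shiny"],
                    "target": [0, 1, 0]})
csv_text = toy.to_csv(index=False)

# Default parsing: the empty field comes back as NaN.
default_read = pd.read_csv(io.StringIO(csv_text))
n_missing_default = int(default_read["combine"].isnull().sum())

# keep_default_na=False leaves the empty field as an empty string instead.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
n_missing_kept = int(kept["combine"].isnull().sum())
n_empty_kept = int((kept["combine"] == "").sum())

print(n_missing_default, n_missing_kept, n_empty_kept)
```

Either way the row carries no text for the vectorizers, so dropping it, as done here, is a reasonable choice.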

In [9]:
X = df['combine']
y = df['target']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y,stratify=y, random_state=23)
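As a side note on the split above: `stratify=y` keeps the class proportions roughly equal in the training and test sets, and with no `test_size` given scikit-learn holds out 25% of the rows. A small sketch on synthetic labels (the 60/40 proportions and the `*_toy` names are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 ones and 40 zeros (hypothetical proportions).
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=23
)

# Sizes of the two parts, then the share of class 1 in each part.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```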

**Modeling**

Baseline Model

In [11]:
y.mean()

0.6154299175500589

The baseline accuracy is 0.6154. A model that always predicts the majority class (class 1, theSilphRoad subreddit) would be right 61.54% of the time.
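The same baseline can be expressed as an actual model. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (the `*_toy` arrays are assumptions, not the notebook's data) and checks that its accuracy equals the majority-class share. Note that `y.mean()` only equals the baseline when class 1 is the majority, as it is here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 8 of 13 belong to class 1.
y_toy = np.array([1] * 8 + [0] * 5)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

# Share of the larger class, whichever label it has.
majority_share = max(y_toy.mean(), 1 - y_toy.mean())

print(baseline_accuracy, majority_share)
```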

Count Vectorizer and Logistic Regression - gridsearch

In [12]:
pipeCVLR = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver='lbfgs'))
])

In [13]:
# Tuned over several runs; the grid below is from the most recent run
pipeCVLR_params = {
    'cvec__ngram_range': [(1,2)],
    'cvec__max_features': [1000],
    'cvec__min_df': [5],
    'lr__C':[.3,.4,.5,.6,.7]
}
gsCVLR = GridSearchCV(pipeCVLR, param_grid=pipeCVLR_params, cv=5,verbose=1,n_jobs=2)
gsCVLR.fit(X_train, y_train)
print(gsCVLR.best_score_)
gsCVLR.best_params_

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  25 out of  25 | elapsed:    3.4s finished


0.7069913589945012


{'cvec__max_features': 1000,
 'cvec__min_df': 5,
 'cvec__ngram_range': (1, 2),
 'lr__C': 0.5}

In [14]:
gsCVLR.score(X_test,y_test)

0.6964705882352941

Count Vectorizer and Multinomial Naive Bayes - gridsearch

In [15]:
pipeCVMB = Pipeline([
    ('cvec', CountVectorizer()),
    ('Mbay', MultinomialNB())
])

In [16]:
# Tuned over several runs; the grid below is from the most recent run
pipeCVMB_params = {
    'cvec__ngram_range': [(1,2)],
    'cvec__max_features': [1000],
    'cvec__min_df': [6],
    'Mbay__alpha':[.03,.04,.05,.06,.07]
}
gsCVMB = GridSearchCV(pipeCVMB, param_grid=pipeCVMB_params, cv=5,verbose=1,n_jobs=2)
gsCVMB.fit(X_train, y_train)
print(gsCVMB.best_score_)
gsCVMB.best_params_

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  22 out of  25 | elapsed:    1.6s remaining:    0.2s
[Parallel(n_jobs=2)]: Done  25 out of  25 | elapsed:    1.8s finished


0.7164179104477612


{'Mbay__alpha': 0.03,
 'cvec__max_features': 1000,
 'cvec__min_df': 6,
 'cvec__ngram_range': (1, 2)}

In [17]:
gsCVMB.score(X_test,y_test)

0.7223529411764706

TF-IDF and Logistic Regression - gridsearch

In [18]:
pipeTVLR = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(solver='lbfgs'))
])

In [19]:
# Tuned over several runs; the grid below is from the most recent run
pipeTVLR_params = {
    'tvec__ngram_range': [(1,2)],
    'tvec__max_features': [1000],
    'tvec__min_df': [2],
    'lr__C':[.8,.9,1,1.1,1.2]
}
gsTVLR = GridSearchCV(pipeTVLR, param_grid=pipeTVLR_params, cv=5,verbose=1,n_jobs=2)
gsTVLR.fit(X_train, y_train)
print(gsTVLR.best_score_)
gsTVLR.best_params_

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  22 out of  25 | elapsed:    1.7s remaining:    0.2s
[Parallel(n_jobs=2)]: Done  25 out of  25 | elapsed:    1.9s finished


0.7179890023566379


{'lr__C': 1.1,
 'tvec__max_features': 1000,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 2)}

In [20]:
gsTVLR.score(X_test,y_test)

0.7105882352941176

TF-IDF Vectorizer and Multinomial Naive Bayes - gridsearch

In [21]:
pipeTVMB = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('Mbay', MultinomialNB())
])

In [22]:
# Tuned over several runs; the grid below is from the most recent run
pipeTVMB_params = {
    'tvec__ngram_range': [(1,3)],
    'tvec__max_features': [1000],
    'tvec__min_df': [5],
    'Mbay__alpha':[.03,.04,.05,.06,.07]
}
gsTVMB = GridSearchCV(pipeTVMB, param_grid=pipeTVMB_params, cv=5,verbose=1,n_jobs=2)
gsTVMB.fit(X_train, y_train)
print(gsTVMB.best_score_)
gsTVMB.best_params_

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  25 out of  25 | elapsed:    3.2s finished


0.7172034564021995


{'Mbay__alpha': 0.04,
 'tvec__max_features': 1000,
 'tvec__min_df': 5,
 'tvec__ngram_range': (1, 3)}

In [23]:
gsTVMB.score(X_test,y_test)

0.7435294117647059
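To compare the four grid searches side by side, the scores printed above can be collected into one table. This is only a convenience sketch: the numbers are the outputs shown in the cells above, rounded to four decimals, and the `results` frame and its column names are my own labels.

```python
import pandas as pd

# Cross-validated and test accuracies copied (rounded) from the cells above.
results = pd.DataFrame(
    [
        ("CountVectorizer + LogisticRegression", 0.7070, 0.6965),
        ("CountVectorizer + MultinomialNB", 0.7164, 0.7224),
        ("TfidfVectorizer + LogisticRegression", 0.7180, 0.7106),
        ("TfidfVectorizer + MultinomialNB", 0.7172, 0.7435),
    ],
    columns=["model", "cv_accuracy", "test_accuracy"],
)

# Top model under each of the two criteria.
best_by_test = results.sort_values("test_accuracy", ascending=False).iloc[0]["model"]
best_by_cv = results.sort_values("cv_accuracy", ascending=False).iloc[0]["model"]

print(results)
print(best_by_test)
print(best_by_cv)
```

The cross-validated scores of the four pipelines sit within roughly one percentage point of each other, so the ranking depends on which column is used. The test-set column is the one that singles out TF-IDF with Multinomial Naive Bayes.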

**Analyzing Output**

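Analyze the coefficients of my best model by test accuracy, Multinomial Naive Bayes with a TF-IDF vectorizer. For this model, the exponentiated coefficients are the estimated probabilities of each word/phrase given class 1 (theSilphRoad).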

In [29]:
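# coef_ and get_feature_names() exist only in older scikit-learn releases; newer ones
# expose the same information as feature_log_prob_ and get_feature_names_out()
outputdf = pd.DataFrame(np.exp(gsTVMB.best_estimator_.named_steps['Mbay'].coef_),
                        columns = gsTVMB.best_estimator_.named_steps['tvec'].get_feature_names())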

In [26]:
# highest values: words/phrases most likely to appear in theSilphRoad posts;
# lowest values: words/phrases least likely to appear there
outputdf.sort_values(by=0,axis=1,ascending=False).T

Unnamed: 0,0
raid,0.009099
go,0.008924
dortmund,0.008255
pokemon,0.008123
fest,0.007116
shiny,0.007105
go fest,0.007011
mewtwo,0.006829
spawn,0.006346
anyone,0.006138


**Conclusions**

Of the words/phrases most predictive of theSilphRoad subreddit, only a couple seem to reflect the supposedly more research/mechanics-based focus of r/theSilphRoad. I'll analyze a few here.
- raid - Raids are a special mode in Pokemon Go that lets players catch strong Pokemon. While I would expect both subreddits to discuss them, there are discussions about catch rates and where/when raids can form that I could see leading to more discussion on r/theSilphRoad.
- dortmund - This was a location of the most recent Pokemon Go Fest, a yearly event hosted by Niantic where trainers meet up and get a number of bonuses specific to playing at that location. It doesn't really make sense that this would be discussed more on theSilphRoad, since these events involve little in the way of research and mechanics.
- spawn - Generally used to discuss where/when Pokemon appear in the game. It makes a lot of sense that this would be correlated with theSilphRoad subreddit, since a lot of mechanics go into this.

Of the words/phrases most predictive of the pokemongo subreddit, a number jump out for discussion:
- "caught shiny" - Shiny Pokemon are a rare version of a specific Pokemon and often something trainers are trying to collect. Posts along the lines of "I caught a shiny" are generally removed from theSilphRoad since they offer no research value.
- "first shiny" - The reasoning here is very similar to "caught shiny".
- "started playing" - Another phrase that one could imagine is used a lot on a more general subreddit but has little research value and is thus not found on theSilphRoad subreddit.
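One way to rank terms by how strongly they lean towards one subreddit or the other is to compare the per-class log probabilities that Multinomial Naive Bayes stores in `feature_log_prob_`. The sketch below is not the analysis run in this notebook: the four-document corpus, the labels, and the `toy_pipe`, `log_ratio`, and `ranking` names are all made up for illustration, with label 1 playing the role of theSilphRoad and label 0 the role of pokemongo.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: label 1 stands in for theSilphRoad, 0 for pokemongo.
docs = [
    "spawn rate research raid",
    "raid spawn mechanics research",
    "caught first shiny today",
    "started playing caught shiny",
]
labels = [1, 1, 0, 0]

toy_pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("Mbay", MultinomialNB(alpha=0.04)),
])
toy_pipe.fit(docs, labels)

nb = toy_pipe.named_steps["Mbay"]
vec = toy_pipe.named_steps["tvec"]

# vocabulary_ maps term -> column index; sorting by index recovers the column
# order without depending on get_feature_names / get_feature_names_out.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# feature_log_prob_ has one row per class, ordered as in nb.classes_ ([0, 1] here).
# Positive values lean towards class 1, negative values towards class 0.
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
ranking = pd.Series(log_ratio, index=terms).sort_values(ascending=False)

print(ranking.head(3))  # terms leaning towards class 1
print(ranking.tail(3))  # terms leaning towards class 0
```

Applied to `gsTVMB.best_estimator_`, the same difference would give a single ranking whose two ends correspond to the two subreddits.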

In [33]:
#Calculate improvement in accuracy
(.7435 - .6154) / .6154

0.2081572960675985

It does seem like my model can distinguish between the two subreddits: my best model gives a 21% relative increase in accuracy over the baseline (74.35% vs. 61.54%, about 12.8 percentage points). However, most of the words most strongly associated with theSilphRoad subreddit don't reflect its supposed research-heavy focus. This could be because my model isn't great, or because the subreddit doesn't focus on research as heavily as it claims. In any case, based on the words my model finds most predictive, I don't see much reason to dedicate a more technical employee to following theSilphRoad subreddit.