In [101]:
import pandas as pd
import numpy as np
import requests
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
import time
from nltk.tokenize import word_tokenize



In this project, I build a classification model that predicts whether a Reddit post, given only its title and text content, comes from the World of Warcraft (WoW) subreddit or the Final Fantasy XIV (FFXIV) subreddit.  
I am approaching the problem from the perspective of a member of the marketing team at Square Enix, the publisher of FFXIV.  With this model, we can uncover language that our players use frequently, so we can better craft our marketing to appeal to them.

The first thing we need is to use the Pushshift API to grab our posts.  I built a function that grabs up to 100 posts for every month from each subreddit, going back to the subreddit's founding.

In [102]:
def posts_getter(subreddit):
    #using https://www.epochconverter.com/ I converted the current date
    #and the date of the ffxiv subreddit's founding (ffxiv is younger, so this way we'll have a more equal split)
    current_epoch = 1611872155
    founding_epoch = 1454019344
    one_month_in_seconds = 2628288
    url = 'https://api.pushshift.io/reddit/search/submission'
    posts = []
    #sets the founding_epoch variable depending on what subreddit was passed.  If the subreddit isn't within the scope of this project, it returns an error.  
    #With some fancy webscraping I could probably work it out to get any subreddit, but for now I'm leaving it to these two.
    #if subreddit == 'wow':
        #founding_epoch = wow_founding_epoch
    #elif subreddit == 'ffxiv':
        #founding_epoch = ffxiv_founding_epoch
   # else:
        #print('sorry, I have no data for that subreddit')
    #iterates, month by month, from the founding of the subreddit to january 28, 2021, pulling 100 posts every month.
    #this spread of time helps ensure that there are no duplicate posts, and also gives us a very wide view of each subreddit
    time.sleep(15)
    for month in range(founding_epoch, current_epoch, one_month_in_seconds):
        res = requests.get(url, {'subreddit': subreddit, 'size': 100, 'before': month})
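        try:
            data = res.json()
            posts.extend(data['data'])
        except (ValueError, KeyError):
            #skips any month where the response isn't valid JSON or has no 'data' key
            continue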
    return posts
#gets the posts
ffxiv_posts = posts_getter('ffxiv')
wow_posts = posts_getter('wow')
#turns those posts into a dataframe
ffxiv_df = pd.DataFrame(ffxiv_posts)
wow_df = pd.DataFrame(wow_posts)
print('ffxiv shape is',ffxiv_df.shape)
print('wow shape is', wow_df.shape)
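
To make the loop's time windows concrete, here is a minimal sketch (standard library only, no network calls) that lists the `before` cutoffs the function steps through. The three epoch values are copied from the cell above; `cutoffs` is a name introduced only for this illustration.

```python
from datetime import datetime, timezone

# Values copied from posts_getter above.
founding_epoch = 1454019344
current_epoch = 1611872155
one_month_in_seconds = 2628288

# One 'before' cutoff per request, generated the same way as the for loop.
cutoffs = list(range(founding_epoch, current_epoch, one_month_in_seconds))

print(len(cutoffs), "requests per subreddit")
print("at most", len(cutoffs) * 100, "posts per subreddit")

# Show the first and last cutoff as calendar dates (UTC).
for ts in (cutoffs[0], cutoffs[-1]):
    print(ts, datetime.fromtimestamp(ts, tz=timezone.utc).date())
```

With these values the range yields 61 cutoffs, so a single call can return at most 6,100 posts, since each request asks for 100.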

In [108]:
wow_df.shape

(10893, 100)

In [109]:
wow_df['selftext'] = wow_df['selftext'].fillna('notatextpost')

In [110]:
wow_df['selftext'].isna().sum()

0

In [111]:
ffxiv_df['selftext'] = ffxiv_df['selftext'].fillna('notatextpost')

In [112]:
ffxiv_df['selftext'].isna().sum()

0

In [113]:
df = pd.concat([ffxiv_df, wow_df], axis = 0)

Normally, I'd leave the below uncommented in order to save the dataframe.  However, I'm commenting it out to make sure I'm calling the same dataframe that I created on Friday, January 29th.  This keeps my modeling consistent day to day.

In [114]:
#df.to_csv('./concatenated_df.csv')

In [115]:
df = pd.read_csv('./concatenated_df.csv')


Now, I convert the subreddit column to a numeric label and rename it to reflect that.  I could have used get_dummies here as well, but this method makes certain that FFXIV is my positive class.

In [116]:
df['subreddit'] = df['subreddit'].map({'ffxiv': 1, 'wow': 0})
df.rename(columns = {'subreddit':'is_on_ffxiv'}, inplace = True)
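
As a small illustration of the encoding step, the sketch below applies an explicit mapping to a toy column. The `toy` DataFrame is made up for this example; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy stand-in for the real 'subreddit' column (hypothetical rows).
toy = pd.DataFrame({'subreddit': ['ffxiv', 'wow', 'wow', 'ffxiv']})

# An explicit mapping fixes which subreddit is the positive class (1).
toy['is_on_ffxiv'] = toy['subreddit'].map({'ffxiv': 1, 'wow': 0})

print(toy)
```

Any label missing from the mapping would become NaN, which makes unexpected subreddits easy to spot.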

Next, I keep only the three columns I really need: the two features and the target.

In [117]:
df = df[['selftext','title', 'is_on_ffxiv']]

In [118]:
df.isna().sum()

selftext       5478
title             0
is_on_ffxiv       0
dtype: int64

In [119]:
df['selftext'] = df['selftext'].fillna('notatextpost')

In [121]:
df.shape

(14738, 3)

In [122]:
#next step is to get some distributions of post word count and post length.  Refer to previous labs for code.

Next, in order to build my model, I use the Snowball Stemmer to stem every word in my text posts and my titles.  I pass ignore_stopwords=True so the stemmer leaves stop words unstemmed, which saves a little computation since they'll just get dropped later.  Stemming also saves computational expense later on, since I'll have fewer features from my CountVectorizer.  
I then use CountVectorizer to turn all my self-text and titles into word counts.  This results in a very large dataset, but that is to be expected given the wide variety of language that can be used on Reddit.

In [124]:
def identify_selftext_tokens(row):
    text = row['selftext']
    tokens = word_tokenize(text)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words
def identify_title_tokens(row):
    text = row['title']
    tokens = word_tokenize(text)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

In [125]:
df['self_text_words'] = df.apply(identify_selftext_tokens, axis = 1)
df['title_words'] = df.apply(identify_title_tokens, axis = 1)

In [126]:
df['self_text_words']

0                                           [notatextpost]
1                                                [removed]
2                                                [removed]
3                                           [notatextpost]
4                                           [notatextpost]
                               ...                        
14733                                       [notatextpost]
14734    [Hey, all, just, started, an, alt, as, a, sham...
14735                                       [notatextpost]
14736                                       [notatextpost]
14737    [I, have, almost, cleared, raid, bosses, and, ...
Name: self_text_words, Length: 14738, dtype: object

In [127]:
df['title_words']

0                     [Post, your, favourite, screenshots]
1             [Old, Player, Thinking, about, Coming, Back]
2                        [New, player, Recruit, a, friend]
3        [Have, drawn, anything, in, like, a, decade, P...
4        [What, are, the, chances, the, fate, of, this,...
                               ...                        
14733    [This, is, why, TC, is, fun, as, a, Destro, Wa...
14734               [Pro, and, con, of, elemental, shaman]
14735    [Twisting, corridors, is, so, hard, for, warri...
14736    [When, Flayedwing, Toxin, gives, you, a, third...
14737                               [Loot, is, a, problem]
Name: title_words, Length: 14738, dtype: object

In [128]:
#builds the stemmer once and reuses it for every token, instead of constructing one per word
stemmer = SnowballStemmer("english", ignore_stopwords=True)
df['stemmed_self_text'] = df['self_text_words'].apply(lambda words: [stemmer.stem(w) for w in words])
df['stemmed_title'] = df['title_words'].apply(lambda words: [stemmer.stem(w) for w in words])

In [129]:
df['stemmed_self_text'].tail(15)

14723    [is, it, all, class, except, energi, base, spe...
14724                                       [notatextpost]
14725    [okay, i, was, in, war, mode, when, i, was, do...
14726                                              [delet]
14727    [just, spent, two, hour, on, it, just, to, be,...
14728    [so, i, just, start, play, wow, for, real, abo...
14729                                      [i, salut, you]
14730                                       [notatextpost]
14731                                       [notatextpost]
14732    [so, i, know, i, want, to, dps, and, i, m, awa...
14733                                       [notatextpost]
14734    [hey, all, just, start, an, alt, as, a, shaman...
14735                                       [notatextpost]
14736                                       [notatextpost]
14737    [i, have, almost, clear, raid, boss, and, i, h...
Name: stemmed_self_text, dtype: object
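
For a quick look at what stemming does to individual tokens, here is a minimal sketch using the words from the first title in the data. It assumes NLTK is installed and, unlike the cell above, leaves out `ignore_stopwords=True` so that it does not need the NLTK stop-word corpus downloaded.

```python
from nltk.stem import SnowballStemmer

# Plain English Snowball stemmer (no stop-word handling in this sketch).
stemmer = SnowballStemmer("english")

words = ["Post", "your", "favourite", "screenshots"]

# stem() lowercases each token and strips common suffixes.
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Tokens that share a stem collapse into a single feature, which is what shrinks the vocabulary before vectorizing.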

In [130]:
#let's also get columns for the length of each post and title
df['self_text_len'] = df['selftext'].apply(lambda x: len(x.split()))
df['title_len'] = df['title'].apply(lambda x: len(x.split()))

In [131]:
df.drop(df.loc[lambda df: df['selftext'] == '[removed]'].index, inplace = True)

In [132]:
df.head()

Unnamed: 0,selftext,title,is_on_ffxiv,self_text_words,title_words,stemmed_self_text,stemmed_title,self_text_len,title_len
0,notatextpost,Post your favourite screenshots?,1,[notatextpost],"[Post, your, favourite, screenshots]",[notatextpost],"[post, your, favourit, screenshot]",1,4
3,notatextpost,Haven't drawn anything in like a decade. Picke...,1,[notatextpost],"[Have, drawn, anything, in, like, a, decade, P...",[notatextpost],"[have, drawn, anyth, in, like, a, decad, pick,...",1,21
4,notatextpost,What are the chances the fate of this NPC gets...,1,[notatextpost],"[What, are, the, chances, the, fate, of, this,...",[notatextpost],"[what, are, the, chanc, the, fate, of, this, n...",1,13
5,So a few weeks ago there was a new survey done...,"FFXIV survey, realm pop..etc (looking for)",1,"[So, a, few, weeks, ago, there, was, a, new, s...","[FFXIV, survey, realm, pop, etc, looking, for]","[so, a, few, week, ago, there, was, a, new, su...","[ffxiv, survey, realm, pop, etc, look, for]",49,6
6,notatextpost,Drew my Miquo'te PLD and Lalafell SMN,1,[notatextpost],"[Drew, my, PLD, and, Lalafell, SMN]",[notatextpost],"[drew, my, pld, and, lalafel, smn]",1,7


In [133]:
df.drop(columns = ['selftext','title','self_text_words','title_words'], inplace = True)

In [134]:
df.head()

Unnamed: 0,is_on_ffxiv,stemmed_self_text,stemmed_title,self_text_len,title_len
0,1,[notatextpost],"[post, your, favourit, screenshot]",1,4
3,1,[notatextpost],"[have, drawn, anyth, in, like, a, decad, pick,...",1,21
4,1,[notatextpost],"[what, are, the, chanc, the, fate, of, this, n...",1,13
5,1,"[so, a, few, week, ago, there, was, a, new, su...","[ffxiv, survey, realm, pop, etc, look, for]",49,6
6,1,[notatextpost],"[drew, my, pld, and, lalafel, smn]",1,7


In [135]:
df.to_csv('./stemmed_and_cleaned_df')

In [136]:
df = pd.read_csv('./stemmed_and_cleaned_df')

In [88]:
#creates the Count Vectorized self text column
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
self_text_cove = CountVectorizer(stop_words = 'english')
self_text_cove.fit(df['stemmed_self_text'])
self_text_words = pd.DataFrame(self_text_cove.transform(df['stemmed_self_text']).todense(),columns = self_text_cove.get_feature_names_out())
#creates the Count Vectorized title column
title_cove = CountVectorizer(stop_words = 'english')
title_cove.fit(df['stemmed_title'])
title_words = pd.DataFrame(title_cove.transform(df['stemmed_title']).todense(),columns = title_cove.get_feature_names_out())

In [89]:
cove_df = pd.concat([title_words, self_text_words, df.drop(columns = ['stemmed_self_text','stemmed_title'])], axis = 1)

In [90]:
cove_df.shape

(14738, 39534)

In [91]:
cove_df.head()

Unnamed: 0,00,000,000g,000th,00100,00am,00pm,01,02,03,...,紅蓮の反乱,紅蓮の戦乱,織田さんがアラミゴの方とクガネまで,英雄よ,蒼天の旅路,蒼天の神話,難易度は侵攻編零式ぐらい,須藤氏が物騒なことをブツブツ言いながら企画していたのですさまじそう,라그나로스,is_on_ffxiv
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Let's take a look at the most common words across both subreddits.

In [92]:
cove_df.sum(axis = 0).sort_values(ascending = False).head(15)

is_on_ffxiv     6100
notatextpost    5496
just            4250
like            3307
game            2841
time            2359
ve              2252
amp             2205
com             2186
know            2111
new             1957
don             1901
really          1764
https           1720
people          1702
dtype: int64

We can ignore the first two values (those are placeholders for which subreddit a post is on, and whether the post has any self-text or not).

And let's see that same list for each subreddit.

In [93]:
cove_df[cove_df['is_on_ffxiv'] == 0].sum(axis = 0).sort_values(ascending = False).head(15)

notatextpost    3553
just            2409
like            1930
wow             1512
time            1335
game            1301
know            1233
ve              1210
new             1115
don             1099
people          1038
com             1030
really          1024
play            1006
level            993
dtype: int64

In [94]:
cove_df[cove_df['is_on_ffxiv'] == 1].sum(axis = 0).sort_values(ascending = False).head(15)

is_on_ffxiv     6100
notatextpost    1943
just            1841
amp             1548
game            1540
like            1377
com             1156
ve              1042
https           1037
time            1024
ffxiv            904
know             878
new              842
don              802
really           740
dtype: int64

Let's get some distributions!  

In [95]:
X = cove_df.drop(columns = 'is_on_ffxiv')
y = cove_df['is_on_ffxiv']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)

In [96]:
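#baseline!  y.mean() is the share of FFXIV posts; always guessing the majority class (WoW) would score 1 minus this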
y.mean()

0.41389605102456234
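
The mean above is the share of positive (FFXIV) posts, not an accuracy. As a short worked example, the baseline accuracy of always predicting the more common class follows from it directly. The names `positive_share` and `majority_baseline` are introduced only for this sketch.

```python
# Share of FFXIV posts, copied from the y.mean() output above.
positive_share = 0.41389605102456234

# Accuracy of a model that always predicts the more common class.
majority_baseline = max(positive_share, 1 - positive_share)

print(round(majority_baseline, 4))
```

This majority-class figure is the number the test scores below should be compared against.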

In [97]:
ra_fo = RandomForestClassifier()
ra_fo.fit(X_train, y_train)

RandomForestClassifier()

In [98]:
print(ra_fo.score(X_train, y_train))
print(ra_fo.score(X_test, y_test))

0.9988238487288519
0.8534599728629579


Well!  Our Random Forest model looks quite overfit: it scores about 0.999 on the training set but only about 0.853 on the test set.  Nonetheless, that testing score is much, much better than our baseline.  So perhaps we can live with it.

In [99]:
extra_trees = ExtraTreesClassifier()
extra_trees.fit(X_train, y_train)

ExtraTreesClassifier()

In [101]:
print(extra_trees.score(X_train, y_train))
print(extra_trees.score(X_test, y_test))

0.9988238487288519
0.8583446404341927


Again, Extra Trees is overfit, but its testing score (about 0.858) is slightly better than our Random Forest's, so maybe that's the one we'll take.

In [102]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
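
The warning above suggests raising `max_iter` or scaling the data. The sketch below shows the first option on synthetic data only; the data from `make_classification` and the value `max_iter=1000` are assumptions for illustration, not settings tuned for this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Reddit features).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Give the solver a larger iteration budget than the default of 100.
model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print(model.n_iter_)
print(model.score(X_demo, y_demo))
```

If `n_iter_` comes back below the budget, the solver stopped on its own rather than hitting the limit.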

In [103]:
print(log_reg.score(X_train, y_train))
print(log_reg.score(X_test, y_test))

0.9764769745770379
0.8667571234735414


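What a surprise!  Of the three, logistic regression has the smallest gap between training and testing scores (about 0.11), and it has the best testing score as well (about 0.867), even though the solver hit its iteration limit before converging.  Let's run with that one into production.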
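
The comparison can be checked with a few lines of arithmetic. This sketch copies the six scores printed above into a dictionary (the names `scores` and `gaps` are introduced here) and ranks the models by their train/test gap.

```python
# (train score, test score) pairs copied from the outputs above.
scores = {
    'random_forest': (0.9988238487288519, 0.8534599728629579),
    'extra_trees': (0.9988238487288519, 0.8583446404341927),
    'logistic_regression': (0.9764769745770379, 0.8667571234735414),
}

# Gap between training and testing accuracy for each model.
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: gap = {gap:.4f}, test = {scores[name][1]:.4f}")

smallest_gap = min(gaps, key=gaps.get)
best_test = max(scores, key=lambda name: scores[name][1])
print(smallest_gap, best_test)
```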