**Problem Statement**

As an employee of Niantic, the company behind the Pokemon Go mobile game, I'm interested in determining whether post topics and content differ between two Pokemon Go subreddits (r/pokemongo and r/theSilphRoad). The r/pokemongo subreddit is supposedly geared toward general conversation, while r/theSilphRoad is supposedly focused on research into the mechanics behind the game.

To determine whether there is a difference, I will use a number of classification models to predict whether a post belongs to r/theSilphRoad or not. I will evaluate the models on accuracy score, and the coefficients of the most accurate model will be analyzed to determine which words and phrases are predictive of each subreddit.

This information is helpful because both subreddits contain end-user feedback about our game. Specifically, we would like to have employees monitoring these subreddits, and it would be helpful to know whether the same employees can monitor both (if the topics are roughly the same) or whether it would be best to assign a more technical employee to r/theSilphRoad.

**Imports**

In [2]:
import requests
import time
import pandas as pd

**Reddit Scraping 07/08/19**

In [3]:
headers = {'User-agent':'StevePokeGet'}

In [16]:
## code to scrape theSilphRoad subreddit
## run at 6:08 PM on 7/8/19
posts_silph = []
after = ""
for _ in range(40):
    if after == "":
        params = {}
    else:
        params = {'after':after}
    url = 'https://www.reddit.com/r/TheSilphRoad.json'
    res = requests.get(url,params=params,headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        for post in the_json['data']['children']:
            posts_silph.append(post['data'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
        
    print(f"{len(the_json['data']['children'])} posts scraped for a total of {len(posts_silph)}")
    print(f"Current after: {after}")
    time.sleep(1)

27 posts scraped for a total of 27
Current after: t3_cakjf8
25 posts scraped for a total of 52
Current after: t3_cat2ow
25 posts scraped for a total of 77
Current after: t3_ca9q5m
25 posts scraped for a total of 102
Current after: t3_ca1hvd
25 posts scraped for a total of 127
Current after: t3_c9j036
25 posts scraped for a total of 152
Current after: t3_ca4yl1
25 posts scraped for a total of 177
Current after: t3_ca1tcj
25 posts scraped for a total of 202
Current after: t3_c9x724
25 posts scraped for a total of 227
Current after: t3_ca0c56
25 posts scraped for a total of 252
Current after: t3_c9w8lj
25 posts scraped for a total of 277
Current after: t3_c946ke
25 posts scraped for a total of 302
Current after: t3_c9fh1z
25 posts scraped for a total of 327
Current after: t3_c92mzr
25 posts scraped for a total of 352
Current after: t3_c9ick5
25 posts scraped for a total of 377
Current after: t3_c9bk1y
25 posts scraped for a total of 402
Current after: t3_c97wyq
25 posts scraped for a tota
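
The scraping loop above is repeated for each subreddit and each day, so it could be factored into one helper. The following is a minimal sketch, not the notebook's original code: `collect_posts` and `fetch_page` are hypothetical names, and the page fetcher is passed in as a function so the pagination logic can be exercised without network access. It also stops when Reddit returns no `after` token, since continuing past the end of a listing would request the first page again and add duplicates.

```python
import time


def collect_posts(fetch_page, max_pages, pause=1.0):
    """Follow Reddit's `after` token for up to `max_pages` listing pages.

    `fetch_page(after)` is a caller-supplied (hypothetical) function that
    returns the parsed listing JSON as a dict, or None if the request failed.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        if listing is None:
            break  # request failed; keep what was collected so far
        children = listing['data']['children']
        posts.extend(child['data'] for child in children)
        after = listing['data']['after']
        if after is None:
            break  # last page of the listing reached
        time.sleep(pause)  # pause between requests
    return posts


# Example fetcher using requests (same URL and headers as the cell above):
# def fetch_silph(after):
#     res = requests.get('https://www.reddit.com/r/TheSilphRoad.json',
#                        params={'after': after}, headers=headers)
#     return res.json() if res.status_code == 200 else None
#
# posts_silph = collect_posts(fetch_silph, max_pages=40)
```

Passing the fetcher in as an argument keeps one loop for both subreddits; only the URL inside the fetcher changes.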

In [17]:
dfsilph = pd.DataFrame(posts_silph)
dfsilph.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,...,thumbnail_width,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,"[{'is_enabled': True, 'count': 2, 'subreddit_i...",True,,,True,dronpes,,,silph-executive,"[{'e': 'text', 't': 'Executive'}]",...,,Welcome to the Silph Road! Here's what you nee...,2,4091,https://www.reddit.com/r/TheSilphRoad/comments...,[],,False,all_ads,6
1,[],False,,,False,dronpes,,,silph-executive,"[{'e': 'text', 't': 'Executive'}]",...,140.0,Headed to GO Fest? Be sure to come say hello! ...,0,336,https://i.redd.it/dzpsgjv793731.jpg,[],,False,all_ads,6
2,[],True,,,False,ThatGuyTre,,,,[],...,140.0,Apparently my hotel is a lake,0,2568,https://i.redd.it/i0z2b5df72931.jpg,[],,False,all_ads,6
3,[],True,,,False,Huaojozu,,,western-europe-european-union,[],...,,"With Team GO Rocket coming, it would be really...",0,382,https://www.reddit.com/r/TheSilphRoad/comments...,[],,False,all_ads,6
4,[],False,,,False,JaceMasood,,,usa-midwest-wheat,"[{'e': 'text', 't': 'JACEMAKINGS '}, {'a': ':u...",...,140.0,TOP 20 COUNTERS - Armored Mewtwo [Now with ful...,0,144,https://i.redd.it/52jf0mh9v5931.png,[],,False,all_ads,6


In [34]:
# Used this to list the columns and pick one for identifying duplicates;
# commented out for readability
#list(dfsilph.columns)

In [21]:
# Use the permalink column to identify duplicates. The id column probably works too,
# but I'm not sure whether ids can repeat on this subreddit; permalink should always work.
dfsilph['permalink'].head()

0    /r/TheSilphRoad/comments/4tll96/welcome_to_the...
1    /r/TheSilphRoad/comments/c6jiw5/headed_to_go_f...
2    /r/TheSilphRoad/comments/cajhui/apparently_my_...
3    /r/TheSilphRoad/comments/caqi78/with_team_go_r...
4    /r/TheSilphRoad/comments/casg1z/top_20_counter...
Name: permalink, dtype: object
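
The permalink output above also shows where the post id sits: each path has the form `/r/<subreddit>/comments/<id>/<slug>`, so the id is the segment right after `comments`. A small sketch of pulling it out is below; `post_id_from_permalink` is a hypothetical helper, not part of the notebook, and it assumes every permalink follows that layout.

```python
def post_id_from_permalink(permalink):
    """Return the path segment that follows 'comments' in a Reddit permalink."""
    parts = permalink.strip('/').split('/')
    # Expected layout: ['r', '<subreddit>', 'comments', '<id>', '<slug>']
    return parts[parts.index('comments') + 1]


# First permalink from the output above (its slug is truncated in the display)
print(post_id_from_permalink('/r/TheSilphRoad/comments/4tll96/welcome_to_the...'))  # 4tll96
```

Comparing these extracted ids against the `id` column is one way to check whether `id` would serve equally well as the duplicate key.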

In [22]:
#confirm dataframe size before
dfsilph.shape

(989, 104)

In [25]:
#remove duplicate rows
dfsilph.drop_duplicates(subset='permalink',inplace=True)

In [26]:
#return dataframe size after
dfsilph.shape

(837, 104)
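
The same de-duplication can also be done on the raw list of post dictionaries before building the dataframe. This is a plain-Python sketch of the keep-the-first-occurrence behavior, which is what `drop_duplicates` does by default; `unique_by` is a hypothetical helper name and the sample posts are made up for illustration.

```python
def unique_by(posts, key='permalink'):
    """Keep the first post seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique


# Made-up sample: the second entry repeats the first permalink
sample = [
    {'permalink': '/r/example/comments/aaa/first/', 'ups': 10},
    {'permalink': '/r/example/comments/aaa/first/', 'ups': 12},
    {'permalink': '/r/example/comments/bbb/second/', 'ups': 3},
]
print(len(unique_by(sample)))  # 2
```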

In [35]:
#output scrape to csv
dfsilph.to_csv('Silph_scrape_0708.csv',index=False)

In [27]:
## code to scrape the pokemongo subreddit
## run at 6:15 PM on 7/8/19
posts_pogo = []
after = ""
for _ in range(25):
    if after == "":
        params = {}
    else:
        params = {'after':after}
    url = 'https://www.reddit.com/r/pokemongo.json'
    res = requests.get(url,params=params,headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        for post in the_json['data']['children']:
            posts_pogo.append(post['data'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
        
    print(f"{len(the_json['data']['children'])} posts scraped for a total of {len(posts_pogo)}")
    print(f"Current after: {after}")
    time.sleep(1)

27 posts scraped for a total of 27
Current after: t3_carkog
25 posts scraped for a total of 52
Current after: t3_caa2xy
25 posts scraped for a total of 77
Current after: t3_ca9j79
25 posts scraped for a total of 102
Current after: t3_cajig3
25 posts scraped for a total of 127
Current after: t3_cagkk4
25 posts scraped for a total of 152
Current after: t3_cabgf4
25 posts scraped for a total of 177
Current after: t3_c9pcrx
25 posts scraped for a total of 202
Current after: t3_c9wal8
25 posts scraped for a total of 227
Current after: t3_c9wnyz
25 posts scraped for a total of 252
Current after: t3_ca1qig
25 posts scraped for a total of 277
Current after: t3_c9n2v9
25 posts scraped for a total of 302
Current after: t3_c9wpt0
25 posts scraped for a total of 327
Current after: t3_c9a3fo
25 posts scraped for a total of 352
Current after: t3_c9ag2q
25 posts scraped for a total of 377
Current after: t3_c9npbo
25 posts scraped for a total of 402
Current after: t3_c93n2n
17 posts scraped for a tota

In [28]:
dfpogo = pd.DataFrame(posts_pogo)
dfpogo.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,...,thumbnail_width,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,AutoModerator,,automod,[],,...,,"Weekly questions, bugs, and gameplay megathrea...",0,15,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6
1,[],False,,,False,AutoModerator,,automod,[],,...,,DEAR NIANTIC - ideas and suggestions for the devs,0,3,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6
2,[],True,,,False,haveyouseenthisclown,,valor,[],e35d36dc-483a-11e6-9cf1-0e4e72062cff,...,,"SCORE! Thanks, Mom!",0,418,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6
3,[],True,,,False,Rykun3159,,,[],,...,140.0,Pokemon Go Team Rocket event details leaked by...,0,78,https://piunikaweb.com/2019/07/09/pokemon-go-t...,[],,False,all_ads,6
4,[],True,,,False,Razorfunk,,,[],,...,,Make LUCKY TRADES available from afar like Bat...,0,433,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6


As with the previous dataframe, the permalink column is used to identify duplicate rows.

In [29]:
dfpogo.shape

(621, 103)

In [30]:
dfpogo.drop_duplicates(subset='permalink',inplace=True)

In [31]:
dfpogo.shape

(419, 103)

In [36]:
dfpogo.to_csv('Pogo_scrape_0708.csv',index=False)

**Reddit scraping addendum, 07/09/19**

In [3]:
#code run @ 6:44 PM on 7/9/19
posts_silph_add = []
after = ""
for _ in range(10):
    if after == "":
        params = {}
    else:
        params = {'after':after}
    url = 'https://www.reddit.com/r/TheSilphRoad.json'
    res = requests.get(url,params=params,headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        for post in the_json['data']['children']:
            posts_silph_add.append(post['data'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
        
    print(f"{len(the_json['data']['children'])} posts scraped for a total of {len(posts_silph_add)}")
    print(f"Current after: {after}")
    time.sleep(1)

26 posts scraped for a total of 26
Current after: t3_cb84s8
25 posts scraped for a total of 51
Current after: t3_cb84gd
25 posts scraped for a total of 76
Current after: t3_caswef
25 posts scraped for a total of 101
Current after: t3_cakjf8
25 posts scraped for a total of 126
Current after: t3_cahifj
25 posts scraped for a total of 151
Current after: t3_caafco
25 posts scraped for a total of 176
Current after: t3_cako46
25 posts scraped for a total of 201
Current after: t3_c9tija
25 posts scraped for a total of 226
Current after: t3_ca966s
25 posts scraped for a total of 251
Current after: t3_ca8ewl


In [4]:
#code run @ 6:45 PM on 7/9/19
posts_pogo_add = []
after = ""
for _ in range(10):
    if after == "":
        params = {}
    else:
        params = {'after':after}
    url = 'https://www.reddit.com/r/pokemongo.json'
    res = requests.get(url,params=params,headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        for post in the_json['data']['children']:
            posts_pogo_add.append(post['data'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
        
    print(f"{len(the_json['data']['children'])} posts scraped for a total of {len(posts_pogo_add)}")
    print(f"Current after: {after}")
    time.sleep(1)

27 posts scraped for a total of 27
Current after: t3_cb92js
25 posts scraped for a total of 52
Current after: t3_cakw4j
25 posts scraped for a total of 77
Current after: t3_cb2awy
25 posts scraped for a total of 102
Current after: t3_cb11au
25 posts scraped for a total of 127
Current after: t3_caufo1
25 posts scraped for a total of 152
Current after: t3_cavjo5
25 posts scraped for a total of 177
Current after: t3_caitgj
25 posts scraped for a total of 202
Current after: t3_caltuj
25 posts scraped for a total of 227
Current after: t3_ca745c
25 posts scraped for a total of 252
Current after: t3_ca4xgq


In [5]:
#import previous day's scrape
dfpogo = pd.read_csv('Pogo_scrape_0708.csv')

In [6]:
#import previous day's scrape
dfsilph = pd.read_csv('Silph_scrape_0708.csv')

In [10]:
#convert current day's scrape to a dataframe
dfsilphadd = pd.DataFrame(posts_silph_add)

In [11]:
#convert current day's scrape to a dataframe
dfpogoadd = pd.DataFrame(posts_pogo_add)

In [12]:
dfpogo.shape

(419, 103)

In [13]:
dfpogoadd.shape

(252, 103)

In [14]:
#combine dataframes
dfpogo = pd.concat([dfpogo,dfpogoadd],sort=False,ignore_index=True)

In [16]:
dfpogo.shape

(671, 103)
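
A note on the `sort` argument passed to `pd.concat`: it is meant to be a boolean. In Python, any non-empty string is truthy, including the text `'False'`, so a quoted value does not behave like the bare `False` constant. The snippet below illustrates this with plain Python only; the variable names are illustrative.

```python
quoted = 'False'   # a non-empty string
unquoted = False   # the boolean constant

print(bool(quoted))    # True
print(bool(unquoted))  # False
```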

In [17]:
#remove duplicates
dfpogo.drop_duplicates(subset='permalink',inplace=True)

In [18]:
dfpogo.shape

(543, 103)

In [19]:
dfpogo.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,...,thumbnail_width,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,AutoModerator,,automod,[],,...,,"Weekly questions, bugs, and gameplay megathrea...",0,15,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6
1,[],False,,,False,AutoModerator,,automod,[],,...,,DEAR NIANTIC - ideas and suggestions for the devs,0,3,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6
2,[],True,,,False,haveyouseenthisclown,,valor,[],e35d36dc-483a-11e6-9cf1-0e4e72062cff,...,,"SCORE! Thanks, Mom!",0,418,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6
3,[],True,,,False,Rykun3159,,,[],,...,140.0,Pokemon Go Team Rocket event details leaked by...,0,78,https://piunikaweb.com/2019/07/09/pokemon-go-t...,[],,False,all_ads,6
4,[],True,,,False,Razorfunk,,,[],,...,,Make LUCKY TRADES available from afar like Bat...,0,433,https://www.reddit.com/r/pokemongo/comments/ca...,[],,False,all_ads,6


In [20]:
dfsilph.shape

(837, 104)

In [21]:
dfsilphadd.shape

(251, 104)

In [22]:
#combine dataframes
dfsilph = pd.concat([dfsilph,dfsilphadd],sort=False,ignore_index=True)

In [23]:
dfsilph.shape

(1088, 104)

In [24]:
#remove duplicates
dfsilph.drop_duplicates(subset='permalink',inplace=True)

In [25]:
dfsilph.shape

(935, 104)

In [26]:
#save off to CSV
dfpogo.to_csv('Pogo_scrape_0708_0709.csv',index=False)

In [27]:
#save off to CSV
dfsilph.to_csv('Silph_scrape_0708_0709.csv',index=False)

**Reddit scraping addendum, 07/10/19**

In [4]:
#code run @ 4:12 PM on 7/10/19
posts_silph_add = []
after = ""
for _ in range(10):
    if after == "":
        params = {}
    else:
        params = {'after':after}
    url = 'https://www.reddit.com/r/TheSilphRoad.json'
    res = requests.get(url,params=params,headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        for post in the_json['data']['children']:
            posts_silph_add.append(post['data'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
        
    print(f"{len(the_json['data']['children'])} posts scraped for a total of {len(posts_silph_add)}")
    print(f"Current after: {after}")
    time.sleep(1)

26 posts scraped for a total of 26
Current after: t3_cbcpv4
25 posts scraped for a total of 51
Current after: t3_cb19nh
25 posts scraped for a total of 76
Current after: t3_cblhdg
25 posts scraped for a total of 101
Current after: t3_cb9emb
25 posts scraped for a total of 126
Current after: t3_cbbtam
25 posts scraped for a total of 151
Current after: t3_cb76fe
25 posts scraped for a total of 176
Current after: t3_casejj
25 posts scraped for a total of 201
Current after: t3_cb8jf3
25 posts scraped for a total of 226
Current after: t3_cax9os
25 posts scraped for a total of 251
Current after: t3_caxr69


In [5]:
#code run @ 4:14 PM on 7/10/19
posts_pogo_add = []
after = ""
for _ in range(10):
    if after == "":
        params = {}
    else:
        params = {'after':after}
    url = 'https://www.reddit.com/r/pokemongo.json'
    res = requests.get(url,params=params,headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        for post in the_json['data']['children']:
            posts_pogo_add.append(post['data'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
        
    print(f"{len(the_json['data']['children'])} posts scraped for a total of {len(posts_pogo_add)}")
    print(f"Current after: {after}")
    time.sleep(1)

27 posts scraped for a total of 27
Current after: t3_cbdxsq
25 posts scraped for a total of 52
Current after: t3_cb24jc
25 posts scraped for a total of 77
Current after: t3_cb7utm
25 posts scraped for a total of 102
Current after: t3_cbgjf2
25 posts scraped for a total of 127
Current after: t3_cbbq61
25 posts scraped for a total of 152
Current after: t3_cb5uaw
25 posts scraped for a total of 177
Current after: t3_cb5x7v
25 posts scraped for a total of 202
Current after: t3_cb6kbj
25 posts scraped for a total of 227
Current after: t3_cackjw
25 posts scraped for a total of 252
Current after: t3_caugei


In [6]:
dfpogo = pd.read_csv('Pogo_scrape_0708_0709.csv')

In [7]:
dfsilph = pd.read_csv('Silph_scrape_0708_0709.csv')

In [8]:
dfsilphadd = pd.DataFrame(posts_silph_add)

In [9]:
dfpogoadd = pd.DataFrame(posts_pogo_add)

In [10]:
dfpogo.shape

(543, 103)

In [11]:
dfpogoadd.shape

(252, 103)

In [12]:
dfpogo = pd.concat([dfpogo,dfpogoadd],sort=False,ignore_index=True)

In [13]:
dfpogo.shape

(795, 103)

In [14]:
dfpogo.drop_duplicates(subset='permalink',inplace=True)

In [15]:
dfpogo.shape

(655, 103)

In [16]:
dfsilph.shape

(935, 104)

In [17]:
dfsilphadd.shape

(251, 103)

In [18]:
dfsilph = pd.concat([dfsilph,dfsilphadd],sort=False,ignore_index=True)

In [19]:
dfsilph.shape

(1186, 104)

In [20]:
dfsilph.drop_duplicates(subset='permalink',inplace=True)

In [21]:
dfsilph.shape

(1048, 104)

In [22]:
dfpogo.to_csv('Pogo_scrape_fin.csv',index=False)

In [23]:
dfsilph.to_csv('Silph_scrape_fin.csv',index=False)
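
Because the models will be judged on accuracy, the final row counts above give a useful reference point. As a rough sketch, assuming no further rows are dropped during cleaning, a model that always predicts the larger class (r/theSilphRoad) would already reach the accuracy computed below, so any classifier should be compared against that figure rather than against 50%. The variable names are illustrative.

```python
# Row counts from the final shapes printed above
n_silph = 1048
n_pogo = 655

# Accuracy of always guessing the majority class
baseline_accuracy = max(n_silph, n_pogo) / (n_silph + n_pogo)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # Baseline accuracy: 0.615
```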