# Do it yourself Reddit Scraper

## Author: Matthew Leong

## Set up:

In [1]:
import praw
import pandas as pd

## Background:

Reddit is a social media platform that is essentially a forum made up of many smaller forums, called subreddits, each dedicated to a specific topic. Users typically sign up under a pseudonymous username, so unlike Facebook or Twitter there is a degree of anonymity to the posts. Every new account is subscribed to a set of default subreddits; beyond those, users have to seek out subreddits that cater to their own tastes. Certain subreddits also make themselves difficult to find in order to maintain a more hardcore community.

With that in mind, Reddit is well suited to isolating certain demographics. We treat r/gaming as the casual demographic, since it is a default subreddit and consists mostly of humor and images. We treat r/games as the more dedicated, hardcore demographic of gamers, since users have to go out of their way to find it. We also look at the dedicated fanbases for Xbox and PS5 in their respective subreddits. For Xbox, the xboxone subreddit is more active than the xboxseriesx subreddit, so we take r/xboxone as representative of the Xbox fanbase.

Additionally, Reddit distinguishes itself from other forums with its upvote and downvote system. In theory, upvotes are for on-topic content and downvotes are for off-topic content, but in practice redditors upvote sentiments they agree with and downvote sentiments they disagree with. Upvoted posts and comments are seen by more people because they appear first when sorting by hot or best, which are often the default sort orders. We use this behavior to our advantage by analyzing only comments and submissions that pass a certain number of upvotes, in order to capture the most popular sentiments for each demographic.

In [2]:
#Create the PRAW Reddit instance
reddit = praw.Reddit(client_id='hae-jv5qr0VXAQ', #14-character ID of the "Reddit Comment Scraper" app, listed under apps in the Reddit user settings
                     client_secret='q2DL5Jo7NrZZxCYkPBFsKoUV9XY', #the longer secret string for the same app
                     user_agent='http://localhost:8080')
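
Hard-coding the client ID and secret in a notebook exposes them to anyone who can read the file. A minimal sketch of one alternative, assuming the credentials are kept in environment variables: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are hypothetical choices for illustration, not part of PRAW.

```python
import os

def load_reddit_credentials(env=None):
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names are illustrative assumptions, not a PRAW convention.
    """
    env = os.environ if env is None else env
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message if any variable is unset or empty.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in required.items()}

# Usage (requires praw and the three variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

The returned dictionary uses the same three keyword arguments passed to `praw.Reddit` in the cell above, so the rest of the notebook would not need to change.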

### Misc test code

In [5]:
# Proof of concept: get top 10 hottest posts from games subreddit
hot_posts = reddit.subreddit('games').hot(limit=10)
for post in hot_posts:
    print(post.title)

Weekly /r/Games Discussion - What have you been playing, and what are your thoughts? - October 04, 2020
Daily /r/Games Discussion - Free Talk Friday - October 02, 2020
How esports is quietly spawning a whole new generation of problem gamblers
Cyberpunk 2077 Physical Edition Bonus Content leaks online
Why has Nintendo "gimped" the Switch so hard?
13 Sentinels: Aegis Rim is the best game that no one is talking about
Crash Bandicoot 4: It's About Time is #1 in boxed sales for the UK this week, just 1,000 copies ahead of Star Wars Squadrons and well under N-Sane Trilogy's first week
How Much Does Multiplayer Population Matter? -Raycevick
PlayStation 5: First Impressions from Japanese Outlets || Megathread
E3 2000 - Scott The Woz


In [10]:
#Iteration proof of concept
posts = []
ml_subreddit = reddit.subreddit('games')
for post in ml_subreddit.top(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

                                               title  score      id subreddit  \
0  John @Totalbiscuit Bain July 8, 1984 - May 24,...  43721  8lwynt     Games   
1  Gods Unchained: "@Blizzard_Ent just banned @bl...  27934  dexggm     Games   
2  Belgium says loot boxes are gambling, wants th...  24926  7elwyb     Games   
3  Welcoming the Talented Teams and Beloved Game ...  22094  iwzzh5     Games   
4                      Bungie Splits With Activision  21842  aenhst     Games   
5  Halo Master Chief Collection coming to PC (Steam)  21195  b0d356     Games   
6  /r/Games is closed for April Fool’s. Find out ...  21049  b7ubwm     Games   
7  Titanfall 2 will not have a season pass, all p...  20941  59ntxo     Games   
8  Blizzard Taiwan deleted Hearthstone Grandmaste...  20851  dej74n     Games   
9           Plague Inc. removed from China app store  20323  fadwwa     Games   

                                                 url  num_comments  \
0  https://twitter.com/GennaBain/statu

In [None]:
posts

In [12]:
#Example url 
url = "https://www.reddit.com/r/Games/comments/ipfelj/xbox_series_x_launches_nov_10_for_499_pre_orders/"
submission = reddit.submission(url=url)

In [14]:
#Example iterative loop to get all the comments of a post
submission.comments.replace_more(limit=None)
comments_list = []
for comment in submission.comments.list():
    comments_list.append(comment.body)

In [16]:
comments_list

['PS5 has to be $500 too, right?',
 "As a PC guy with no real horse in this race, that's an insane amount of power for only $499. I'll be interested to see how Sony responds.",
 "Well there it is. It's nice to know the release date. Also kind of feels like MS just said f\\*\\*k it and decided to dump all the info due to the leaks yesterday",
 'Did my coffee not kick in yet or is EA Play now included in GPU?  \n\n>To provide even more value, we are teaming up with Electronic Arts to provide Xbox Game Pass Ultimate and PC members with an EA Play membership at no additional cost starting this holiday. This means Ultimate members can enjoy EA Play on Xbox One, Xbox Series X and Series S, and Windows 10 PCs, and Xbox Game Pass for PC members get EA Play on Windows 10. In addition to the 100+ games in the Xbox Game Pass library today, Ultimate and PC members will be able to play more than 60 of EA’s biggest and best console and PC games like FIFA 20, Titanfall 2 and Need for Speed Heat, as w

### Scraping reddit games

We take the top 100 search results for each keyword and drop any posts that do not have at least 1000 upvotes. Due to the way PRAW retrieves the data, a post can be found as long as there is one correct instance of the keyword, whether in the post itself or in the comments. Thus we do not need to control for misspellings or alternate names for the product, e.g. PS5 and PlayStation 5.
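
The selection rule described above (keep posts at or above a score threshold, and count a thread only once even if several keywords find it) can be sketched without touching the Reddit API. This is an illustration under stated assumptions: `Post` and `select_posts` are hypothetical names, and the sample data is made up.

```python
from collections import namedtuple

# Stand-in for the fields of a PRAW submission used here (hypothetical type).
Post = namedtuple("Post", ["permalink", "title", "score"])

def select_posts(results_per_keyword, min_score=1000):
    """Keep posts at or above min_score, dropping repeats by permalink."""
    seen = set()
    selected = []
    for results in results_per_keyword:
        for post in results:
            if post.score >= min_score and post.permalink not in seen:
                seen.add(post.permalink)
                selected.append(post)
    return selected

# Made-up search results for two keywords; one thread matches both.
xbox_hits = [Post("/r/x/a", "Series X price", 5000), Post("/r/x/b", "Low score", 12)]
ps5_hits = [Post("/r/x/a", "Series X price", 5000), Post("/r/x/c", "PS5 preview", 1000)]

kept = select_posts([xbox_hits, ps5_hits])
print([post.permalink for post in kept])  # ['/r/x/a', '/r/x/c']
```

The notebook cells below apply the same two steps separately: the score filter inside the search loop, and the duplicate removal afterwards with `drop_duplicates('permalink')`.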

In [70]:
#Build the list of r/games posts for both keywords ('Xbox Series X' and 'PS5')
rgames_list = []
keyword_list = ['Xbox Series X','PS5']
for keywords in keyword_list:
    for submission in reddit.subreddit("games").search(keywords,sort='top', time_filter='all',limit = 100):
        if submission.score >= 1000:
            #permalink is the path to the reddit thread; url is the direct link the post points to.
            #created_utc stores the date in Unix time; probably not needed for the analysis.
            rgames_list.append([submission.permalink,submission.url,submission.title,submission.num_comments,
                                        submission.score,submission.author,submission.selftext,submission.created_utc])

#Store it in a dataframe.
rgames_df = pd.DataFrame(rgames_list,
                              columns=['permalink', 'url', 'title', 'num_comments', 'score', 'author', 'selftext', 'created'])

In [71]:
#A post that matches both keywords appears twice, so drop duplicates by permalink.
rgames_df = rgames_df.drop_duplicates('permalink')
print("Obtained this many posts:", len(rgames_df))

Obtained this many posts: 144


In [72]:
#Store the result as a CSV file.
rgames_df.to_csv('rgames_submissions.csv',index = False)

For comments, we consider only those with at least 50 upvotes.

In [73]:
rgames_comments_list = []
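#Note: rgames_list is not deduplicated, so a post matching both keywords is processed twice.
for url in rgames_list:
    #permalink is a relative path, so prepend the site root to build the full URL.
    submission = reddit.submission(url = 'https://www.reddit.com'+url[0])
    
    #limit=0 removes the MoreComments placeholders instead of loading them. The comments behind them are typically hidden and have low scores.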
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        if comment.score >= 50:
            rgames_comments_list.append([comment.submission,comment.author,comment.body,comment.score,comment.created_utc,url[0]])

In [74]:
len(rgames_comments_list)

6078
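
To illustrate what the loop above does with a comment tree, here is a small sketch that flattens nested replies and keeps the comments at or above a score threshold. The `Comment` class is a made-up stand-in rather than PRAW's own, and the breadth-first order is an assumption about how `comments.list()` walks the tree.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Comment:
    """Made-up stand-in for a PRAW comment: body, score, nested replies."""
    body: str
    score: int
    replies: list = field(default_factory=list)

def flatten(top_level):
    """Return every comment in the tree, visiting shallower levels first."""
    queue = deque(top_level)
    flat = []
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)
    return flat

def popular_bodies(top_level, min_score=50):
    """Bodies of all comments, at any depth, scoring at least min_score."""
    return [c.body for c in flatten(top_level) if c.score >= min_score]

thread = [
    Comment("top A", 120, [Comment("reply A1", 75), Comment("reply A2", 3)]),
    Comment("top B", 10, [Comment("reply B1", 50)]),
]
print(popular_bodies(thread))  # ['top A', 'reply A1', 'reply B1']
```

Note that a highly upvoted reply is kept even when its parent falls below the threshold, which matches the behavior of filtering the flattened list by `comment.score`.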

In [75]:
#Store it in a dataframe. Create csv.      
rgames_comments_df = pd.DataFrame(rgames_comments_list,
                              columns=['submission', 'author', 'body', 'score', 'created','url_of_post'])
rgames_comments_df.to_csv('rgames_comments.csv',index = False)

In [76]:
rgames_comments_df

| | submission | author | body | score | created | url_of_post |
|---|---|---|---|---|---|---|
| 0 | ipfelj | Smallgenie549 | PS5 has to be $500 too, right? | 1567 | 1.599657e+09 | /r/Games/comments/ipfelj/xbox_series_x_launche... |
| 1 | ipfelj | niallmul97 | As a PC guy with no real horse in this race, t... | 5805 | 1.599657e+09 | /r/Games/comments/ipfelj/xbox_series_x_launche... |
| 2 | ipfelj | narutomaki | Well there it is. It's nice to know the releas... | 822 | 1.599657e+09 | /r/Games/comments/ipfelj/xbox_series_x_launche... |
| 3 | ipfelj | LIGHT_COLLUSION | Did my coffee not kick in yet or is EA Play no... | 1585 | 1.599657e+09 | /r/Games/comments/ipfelj/xbox_series_x_launche... |
| 4 | ipfelj | firesyrup | I'm surprised they didn't wait for Sony to rev... | 384 | 1.599657e+09 | /r/Games/comments/ipfelj/xbox_series_x_launche... |
| ... | ... | ... | ... | ... | ... | ... |
| 6073 | j4wsgi | ketchup92 | No the PS4 was always loud, especially the fir... | 217 | 1.601812e+09 | /r/Games/comments/j4wsgi/playstation_5_first_i... |
| 6074 | j4wsgi | Titan7771 | Apparently these reviewers can’t even take pic... | 60 | 1.601816e+09 | /r/Games/comments/j4wsgi/playstation_5_first_i... |
| 6075 | j4wsgi | tymandude1 | Apparently the only way you're allowed to impl... | 59 | 1.601816e+09 | /r/Games/comments/j4wsgi/playstation_5_first_i... |
| 6076 | j4wsgi | varnums1666 | I, for one, loved having a very big map button. | 51 | 1.601835e+09 | /r/Games/comments/j4wsgi/playstation_5_first_i... |
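
The `created` column holds Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read at a glance. A minimal sketch of converting one to a calendar date with the standard library; `utc_from_unix` is a hypothetical helper name, and the input is the rounded value displayed in the first rows of the output above.

```python
from datetime import datetime, timezone

def utc_from_unix(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# 1.599657e+09 is the rounded value shown for the first comments.
posted = utc_from_unix(1.599657e9)
print(posted.isoformat())  # 2020-09-09T13:10:00+00:00
```

For a whole DataFrame column, `pd.to_datetime(rgames_comments_df['created'], unit='s')` should perform the same conversion in one step.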


### Scraping Reddit Gaming

We use the same criteria here: the top 100 search results for each keyword, keeping only posts with at least 1000 upvotes.

In [66]:
#Build the list of r/gaming posts for both keywords ('Xbox Series X' and 'PS5')
rgaming_list = []
keyword_list = ['Xbox Series X','PS5']
for keywords in keyword_list:
    for submission in reddit.subreddit("gaming").search(keywords,sort='top', time_filter='all',limit = 100):
        if submission.score >= 1000:
            #permalink is the path to the reddit thread; url is the direct link the post points to.
            #created_utc stores the date in Unix time; probably not needed for the analysis.
            rgaming_list.append([submission.permalink,submission.url,submission.title,submission.num_comments,
                                        submission.score,submission.author,submission.selftext,submission.created_utc])

#Store it in a dataframe. Create csv.      
rgaming_df = pd.DataFrame(rgaming_list,
                              columns=['permalink', 'url', 'title', 'num_comments', 'score', 'author', 'selftext', 'created'])

print("Obtained this many posts:", len(rgaming_df))

rgaming_df = rgaming_df.drop_duplicates('permalink')

#Store the result as a CSV file.
rgaming_df.to_csv('rgaming_submissions.csv',index = False)

Obtained this many posts: 42


In [78]:
rgaming_comments_list = []
for url in rgaming_list:
    #permalink is a relative path, so prepend the site root to build the full URL.
    submission = reddit.submission(url = 'https://www.reddit.com'+url[0])
    
    #limit=0 removes the MoreComments placeholders instead of loading them. The comments behind them are typically hidden and have low scores.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        if comment.score >= 50:
            rgaming_comments_list.append([comment.submission,comment.author,comment.body,comment.score,comment.created_utc,url[0]])

print("Obtained this many comments:", len(rgaming_comments_list))
rgaming_comments_df = pd.DataFrame(rgaming_comments_list,
                              columns=['submission', 'author', 'body', 'score', 'created','url_of_post'])
rgaming_comments_df.to_csv('rgaming_comments.csv',index = False)

Obtained this many comments: 1561


### Scraping Reddit xboxone

We choose to scrape r/xboxone rather than r/xboxseriesx because of the difference in subscriber counts: r/xboxseriesx has only about 165,000 subscribers, while r/xboxone has over 2.4 million. This is the dedicated Xbox fanbase demographic.

In [79]:
#Build the list of r/xboxone posts for both keywords ('Xbox Series X' and 'PS5')
rxboxone_list = []
keyword_list = ['Xbox Series X','PS5']
for keywords in keyword_list:
    for submission in reddit.subreddit("xboxone").search(keywords,sort='top', time_filter='all',limit = 100):
        if submission.score >= 1000:
            #permalink is the path to the reddit thread; url is the direct link the post points to.
            #created_utc stores the date in Unix time; probably not needed for the analysis.
            rxboxone_list.append([submission.permalink,submission.url,submission.title,submission.num_comments,
                                        submission.score,submission.author,submission.selftext,submission.created_utc])

#Store it in a dataframe. Create csv.      
rxboxone_df = pd.DataFrame(rxboxone_list,
                              columns=['permalink', 'url', 'title', 'num_comments', 'score', 'author', 'selftext', 'created'])

rxboxone_df = rxboxone_df.drop_duplicates('permalink')
print("Obtained this many posts:", len(rxboxone_df))
#Store the result as a CSV file.
rxboxone_df.to_csv('rxboxone_submissions.csv',index = False)

Obtained this many posts: 93


In [80]:
rxboxone_comments_list = []
for url in rxboxone_list:
    #permalink is a relative path, so prepend the site root to build the full URL.
    submission = reddit.submission(url = 'https://www.reddit.com'+url[0])
    
    #limit=0 removes the MoreComments placeholders instead of loading them. The comments behind them are typically hidden and have low scores.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        if comment.score >= 50:
            rxboxone_comments_list.append([comment.submission,comment.author,comment.body,comment.score,comment.created_utc,url[0]])

print("Obtained this many comments:", len(rxboxone_comments_list))
rxboxone_comments_df = pd.DataFrame(rxboxone_comments_list,
                              columns=['submission', 'author', 'body', 'score', 'created','url_of_post'])
rxboxone_comments_df.to_csv('rxboxone_comments.csv',index = False)

Obtained this many comments: 1776


### Scraping Reddit PS5

The PS5 subreddit, on the other hand, is much bigger than the Xbox Series X subreddit and has managed to gain traction relative to the previous PS4 subreddit. This may be a sign that Sony is going to win the console wars this time around.

In [81]:
#Build the list of r/PS5 posts for both keywords ('Xbox Series X' and 'PS5')
rps5_list = []
keyword_list = ['Xbox Series X','PS5']
for keywords in keyword_list:
    for submission in reddit.subreddit("PS5").search(keywords,sort='top', time_filter='all',limit = 100):
        if submission.score >= 1000:
            #permalink is the path to the reddit thread; url is the direct link the post points to.
            #created_utc stores the date in Unix time; probably not needed for the analysis.
            rps5_list.append([submission.permalink,submission.url,submission.title,submission.num_comments,
                                        submission.score,submission.author,submission.selftext,submission.created_utc])

#Store it in a dataframe. Create csv.      
rps5_df = pd.DataFrame(rps5_list,
                              columns=['permalink', 'url', 'title', 'num_comments', 'score', 'author', 'selftext', 'created'])

rps5_df = rps5_df.drop_duplicates('permalink')
print("Obtained this many posts:", len(rps5_df))
#Store the result as a CSV file.
rps5_df.to_csv('rps5_submissions.csv',index = False)

Obtained this many posts: 110


In [82]:
rps5_comments_list = []
for url in rps5_list:
    #permalink is a relative path, so prepend the site root to build the full URL.
    submission = reddit.submission(url = 'https://www.reddit.com'+url[0])
    
    #limit=0 removes the MoreComments placeholders instead of loading them. The comments behind them are typically hidden and have low scores.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        if comment.score >= 50:
            rps5_comments_list.append([comment.submission,comment.author,comment.body,comment.score,comment.created_utc,url[0]])

print("Obtained this many comments:", len(rps5_comments_list))
rps5_comments_df = pd.DataFrame(rps5_comments_list,
                              columns=['submission', 'author', 'body', 'score', 'created','url_of_post'])
rps5_comments_df.to_csv('rps5_comments.csv',index = False)

Obtained this many comments: 4016
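
The four scraping sections repeat the same search-and-filter cell with only the subreddit name changed. One possible refactoring is sketched below, not the code that produced the results above: `collect_submissions` is a hypothetical helper, and its `reddit` argument is assumed to behave like the `praw.Reddit` instance created in the set-up.

```python
def collect_submissions(reddit, subreddit_name, keywords, min_score=1000, limit=100):
    """Gather rows for posts matching any keyword, one row per permalink.

    `reddit` is expected to behave like the praw.Reddit instance used above.
    """
    rows = {}
    for keyword in keywords:
        results = reddit.subreddit(subreddit_name).search(
            keyword, sort="top", time_filter="all", limit=limit
        )
        for submission in results:
            if submission.score >= min_score:
                # Keying on permalink drops posts matched by several keywords.
                rows.setdefault(submission.permalink, [
                    submission.permalink, submission.url, submission.title,
                    submission.num_comments, submission.score,
                    submission.author, submission.selftext, submission.created_utc,
                ])
    return list(rows.values())

# Usage, assuming the `reddit` instance from the set-up cells:
# for name in ["games", "gaming", "xboxone", "PS5"]:
#     rows = collect_submissions(reddit, name, ["Xbox Series X", "PS5"])
```

Because duplicates are removed while collecting, a comment loop driven by the returned rows would visit each post once, unlike the loops above that iterate over the raw lists.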
