# **Reddit API Web Scraper**
***
## Introduction
Using the Reddit API, you can extract not only subreddit threads but also the comments in each thread. This notebook grabs every comment in each thread for a specified number of pages.

I'll create a dictionary with thread titles as keys and each thread's comments stored in a sub-dictionary. Each comment tree goes into its own list so that I can maintain some structure within each thread. Essentially, each sub-list contains a parent comment followed by its replies, e.g.:

{'thread': {'comments': [[parent, reply, reply,...] , [parent,reply,...] , ...]}}
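
To make the layout concrete, here is a minimal sketch of one thread stored this way. The thread title and comment text are made up for illustration, and the two helper functions are hypothetical names that are not used elsewhere in this notebook.

```python
# Hypothetical sample data: one thread with two comment trees.
master = {
    "example thread title": {
        "comments": [
            ["parent a", "reply a1", "reply a2"],  # first tree: parent + replies
            ["parent b", "reply b1"],              # second tree
        ]
    }
}

def count_comments(thread):
    """Total comments in a thread = sum of the lengths of its comment trees."""
    return sum(len(tree) for tree in thread["comments"])

def parent_comments(thread):
    """The first element of each sub-list is the top-level (parent) comment."""
    return [tree[0] for tree in thread["comments"] if tree]

thread = master["example thread title"]
print(count_comments(thread))   # 5
print(parent_comments(thread))  # ['parent a', 'parent b']
```

Keeping each tree in its own list means the parent is always at index 0, so the top-level comments can be recovered without storing any extra metadata.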


## Steps
The basic steps for this are:

1. Grab the first subreddit page as a .json file (as simple as requesting www.reddit.com/r/subreddit/.json).  
&nbsp;
2. Take this first page, and extract thread metadata (title, number of comments, etc.) and store in the master dictionary. This master dictionary is also where the comment text will be stored. Repeat this for as many pages as desired (up to 100 max I believe).  
&nbsp;
3. After all thread metadata has been collected, start iterating through the threads. For each thread:
    1. Grab all of the comment data that a user would see initially (let's call these main comments).
    2. Take a parent comment and start a sub-list in the thread's comment key in the master dictionary.
    3. Iterate through each child in the comment tree and append it into the sub-list.
    4. If a child is hidden under "load more comments", call the API again to expand the comments.
    5. Repeat the previous three steps until all of the main comments have been grabbed.
    6. Some parent comments will still be hidden under "load more comments". Call the API to pull a portion of these comments, and apply the same steps used for the main comments.
    7. Repeat this until all comments have been retrieved.  
&nbsp;
4. Save the master dictionary of threads and comments to a .json file.
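
The tree walk in step 3 can be sketched offline on a small hand-built payload. This is only an illustration under a few assumptions: the sample comments are made up, `make_comment` and `flatten_tree` are hypothetical helpers, and the sketch simply collects the ids hidden behind "more" stubs instead of calling the API to expand them.

```python
def flatten_tree(comment, tree, more_ids):
    """Depth-first walk: append each comment body to `tree` and collect the
    ids hidden behind 'more' stubs in `more_ids` for a later API call."""
    if comment["kind"] == "more":
        more_ids.extend(comment["data"]["children"])
        return
    tree.append(comment["data"]["body"])
    # Reddit uses an empty string for 'replies' when a comment has none
    replies = comment["data"].get("replies") or {}
    for child in replies.get("data", {}).get("children", []):
        flatten_tree(child, tree, more_ids)

def make_comment(body, replies=()):
    """Build a comment dict shaped like the ones in a thread's .json (sample data only)."""
    children = list(replies)
    nested = {"kind": "Listing", "data": {"children": children}} if children else ""
    return {"kind": "t1", "data": {"body": body, "replies": nested}}

more_stub = {"kind": "more", "data": {"children": ["abc", "def"], "count": 2}}
top_level = [
    make_comment("parent 1", [make_comment("reply 1", [make_comment("reply 1.1")]), more_stub]),
    make_comment("parent 2"),
]

comments, more_ids = [], []
for parent in top_level:
    tree = []                      # one sub-list per parent comment
    flatten_tree(parent, tree, more_ids)
    comments.append(tree)

print(comments)  # [['parent 1', 'reply 1', 'reply 1.1'], ['parent 2']]
print(more_ids)  # ['abc', 'def']
```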

## API Notes
Detailed information on the Reddit API functions can be found [here](https://www.reddit.com/dev/api/). You can find details on the API rules [here](https://github.com/reddit/reddit/wiki/API).
***

# Code
***
Below is a walkthrough of the code I used to scrape the comments.

## Necessary Packages

In [18]:
import time
import requests
import json
from tqdm import tqdm_notebook

# I created a local file with my OAuth2 credentials, so that I can share this without giving up the information.
from reddit_oauth import username, password, client_id, client_secret

## Functions
These are the functions I created to carry out the scraping. Most of them are recursive, since the API doesn't let you retrieve everything in one request. The recursion lets you keep chipping away at the remaining comments until each thread is fully scraped.

In [19]:
def get_parents(parents):
    """Iterate through each comment tree parent, and create a list of all comments within the tree"""
    

    for each in tqdm_notebook(parents,desc='Parents',leave=False):
        
        # If the parent is 'more' type, it will be appended into a separate list to be expanded later
        if check_more(each): 
            continue
            
        # Start a new list for each new parent thread, and place it in the master dictionary
        parent = [each['data']['body'].lower()]
        v['comments'].append(parent)
        get_children(parent,each)

def get_children(parent,comment):
    """Given a parent tree, retrieve all of the child comments (if there are any)"""

    
    if comment['data']['replies'] != '':
        children = comment['data']['replies']['data']['children']
        for child in children:
            
            # If the child comment is hidden under 'load more comments' expand it out here
            if check_more(child):
                
                # This is a special case when a tree gets particularly deep (around 10+ replies) and the comment 
                # is listed under "Continue this thread" to the viewer. This comment data is stored differently
                # so it is easiest to simply skip it.
                if child['data']['count']==0:
                    continue
                get_more_children(parent,child)
                continue
            else:
                # If there aren't any 'more' comments, recurse through it normally.
                parent.append(child['data']['body'].lower())
                get_children(parent,child)
        
def get_more_parents(more_parents):
    """After creating the list of all 'more' parent comments, this function will call the API to access the comments within each of these new trees"""
    
    for parent_id in tqdm_notebook(more_parents,desc='More Parents',leave=False):
        
        # I need to make a separate API request for each individual parent
        time.sleep(1.001)
        parent_comment = requests.get(url,headers=headers,params={'comment':parent_id}) 
        parent_comments = parent_comment.json()[1]['data']['children']
        get_parents(parent_comments)
              

def get_more_children(parent,comment):
    """Given a parent tree that contains 'more' child, expand out the 'more' children and append to the tree"""

    more_comment_ids = comment['data']['children']
    
    # I need to make a separate API call in order to expand the 'more' comments
    time.sleep(1.001)
    more_comments_json = requests.get('https://oauth.reddit.com/api/morechildren',headers=headers,
                        params={'children':more_comment_ids,'link_id':link_id})
    more_comments = more_comments_json.json()['jquery'][14][3][0]
    
    for more in more_comments:
        
        # If the tree has a lot of replies, sometimes the first 'more' API request doesn't grab every comment
        # and we need to call the function recursively until we retrieve them all.
        if check_more(more):
            if more['data']['count']==0:
                continue
            get_more_children(parent,more)
            continue
            
        parent.append(more['data']['body'].lower())
        get_children(parent,more)
            
def check_more(comment):
    """Take an input comment and return True if the comment type is 'more', else False"""

    if comment['kind']=='more':
        
        # A 'parent_id' starting with 't3' points at the thread itself, so this stub hides
        # parent comments, not children. Queue their ids in v['more'] to be expanded later.
        if comment['data']['parent_id'][0:2]=="t3":
            v['more'].extend(comment['data']['children'])
        return True
    return False
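
The functions above pause with `time.sleep(1.001)` before every API call. As an alternative, here is a minimal sketch of a small throttle that only sleeps for whatever is left of the interval. The `Throttle` class is a hypothetical helper, not part of `requests` or the Reddit API, and the short interval in the demo is just to keep it quick.

```python
import time

class Throttle:
    """Block so that consecutive calls to wait() are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(0.05)  # the scraper above would use roughly 1 second
throttle.wait()            # first call returns immediately
throttle.wait()            # second call sleeps for the rest of the interval
```

Calling `throttle.wait()` right before each request would replace the fixed sleeps, so time already spent parsing a response counts toward the delay.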

## Reddit API Credentials
In order to use the Reddit API, you need to follow their [Oauth2 Verification Procedure](https://github.com/reddit/reddit/wiki/OAuth2).

In [20]:
# Credentials to generate my token. 
client_auth = requests.auth.HTTPBasicAuth(client_id, client_secret)
post_data = {"grant_type": "password", "username": username, "password": password,"redirect_uri":'http://localhost'}

# I need to set a custom user agent in order to make the initial .json pull with requests
user_agent = {"User-Agent": "Comment Scraper app by /u/" + username}

# Generate the API access token
response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=user_agent)

# This is the header I'll use to access the API
headers = {"Authorization": response.json()['token_type'] + " " + response.json()['access_token'],
           "User-Agent": "Comment Scraper app by /u/" + username}

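The header-building step can be wrapped in a small function that fails loudly when no token comes back. This is a sketch under assumptions: `build_api_headers` is a hypothetical helper, the sample response body is made up, and I'm assuming a failed token request returns a JSON body without an `access_token` field.

```python
def build_api_headers(token_json, username):
    """Turn the JSON body of a token response into request headers.

    Raises ValueError if the body carries no access token."""
    if "access_token" not in token_json:
        raise ValueError("Token request failed: {}".format(token_json.get("error", token_json)))
    return {
        "Authorization": "{} {}".format(token_json.get("token_type", "bearer"),
                                        token_json["access_token"]),
        "User-Agent": "Comment Scraper app by /u/" + username,
    }

# Hypothetical response body, made up for illustration
sample_token = {"access_token": "abc123", "token_type": "bearer", "expires_in": 3600}
sample_headers = build_api_headers(sample_token, "example_user")
print(sample_headers["Authorization"])  # bearer abc123
```

With the real response, the call would be `build_api_headers(response.json(), username)`.
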
## Scraping
Here is where most of the work actually happens. I create my master dictionary with thread metadata, and then use my scraper functions to populate the master dictionary with the text data.

In [22]:
# Here you can enter whichever subreddit you'd like, and however many pages you wish to scrape.
subreddit = 'summonerschool'
num_pages = 100

master = {}

# The after variable is what we'll use to step through the pages. Reddit doesn't use page numbers in their .json
# files, so the 'after' id will allow the API to know which pages we have already scraped.
after = None

# Create the master dictionary and store the metadata.
for pages in range(num_pages):
    
    # This is the webpage I want to scrape. For this initial scrape, I want to make sure
    # I am sending my custom user-agent because I'm not accessing the Reddit API just yet.
    url = 'https://www.reddit.com/r/' + subreddit + '/.json'
    main_json = requests.get(url,headers=user_agent,params={'after':after})
    
    # Now I update my after value so I can get a new set of threads for the next iteration
    after = main_json.json()['data']['after']

    for each in main_json.json()['data']['children']:
        master.setdefault(each['data']['title'],{'id':each['data']['id'],'json':[],'comments':[],'more':[],'expected_comments':each['data']['num_comments'],'actual_comments':0})   

# Iterate through each thread in the master dictionary, and scrape the comments
for k,v in tqdm_notebook(master.items(),desc='Thread',leave=False):
    
    
    url = 'https://oauth.reddit.com/r/' + subreddit + '/comments/' + v['id']
    time.sleep(1.001)
    v['json'] = requests.get(url,headers=headers)
    
    # This is a quality check. If the .json is scraped successfully its status code is 200, so if any of the
    # responses are not 200, raise an error and stop the loop.
    if v['json'].status_code!=200:
        raise RuntimeError("JSON retrieval error: status {} for thread {}".format(v['json'].status_code, v['id']))
    
    # The link_id is needed when calling GET /api/morechildren.
    link_id = v['json'].json()[0]['data']['children'][0]['data']['name']
    
    # This is where the recursion begins and the comments are scraped.
    parents = v['json'].json()[1]['data']['children']
    get_parents(parents)
    get_more_parents(v['more'])
    
    # Store number of comments retrieved in each thread to compare to expected number.
    for comments in v['comments']:
        v['actual_comments'] += len(comments)
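
One detail of the paging loop above: if `after` comes back as `None` (the listing has run out), `requests` drops the parameter and the next request starts again from the first page. Here is a minimal sketch of the same cursor logic with an early exit. `collect_threads` and `fake_fetch` are hypothetical names, and the pages are made-up sample data so the example runs without touching the network.

```python
def collect_threads(fetch_page, max_pages):
    """Follow the 'after' cursor until max_pages is reached or the listing ends."""
    threads, after = {}, None
    for _ in range(max_pages):
        listing = fetch_page(after)["data"]
        for child in listing["children"]:
            data = child["data"]
            threads.setdefault(data["title"], {"id": data["id"],
                                               "expected_comments": data["num_comments"]})
        after = listing["after"]
        if after is None:   # no further pages, so stop early
            break
    return threads

# Made-up pages keyed by the cursor that requests them
FAKE_PAGES = {
    None: {"data": {"after": "t3_b", "children": [
        {"data": {"title": "Thread A", "id": "a", "num_comments": 3}},
        {"data": {"title": "Thread B", "id": "b", "num_comments": 0}},
    ]}},
    "t3_b": {"data": {"after": None, "children": [
        {"data": {"title": "Thread C", "id": "c", "num_comments": 7}},
    ]}},
}

calls = []
def fake_fetch(after):
    """Stand-in for requests.get(url, headers=user_agent, params={'after': after}).json()"""
    calls.append(after)
    return FAKE_PAGES[after]

threads = collect_threads(fake_fetch, max_pages=100)
print(len(threads), calls)  # 3 [None, 't3_b']
```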



***

# Data Verification
***
I run a few quick checks to gauge the quality of the scrape.

In [23]:
print("There are {} total threads".format(len(master.keys())))

There are 954 total threads



Check how many comments I have total:

In [24]:
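# Sum the per-thread counts; len(v) would only count the six keys in each thread's dictionary.
total_comments = 0
for k,v in master.items():
    total_comments += v['actual_comments']
print("There are {} total comments".format(total_comments))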


Lastly, I can compare the number of actual comments scraped with the expected number of comments.

In [25]:
{k+" | Actual: "+str(v['actual_comments'])+" Expected: "+str(v['expected_comments']) for k,v in sorted(master.items())[0:20]}

{'"Gank Timers" &amp; Champion pool help | Actual: 14 Expected: 14',
 '"I\'m Just a bit Rusty" Losing Streak | Actual: 3 Expected: 3',
 '"Janna is bad unless you are a Pro" | Actual: 141 Expected: 142',
 '"Reliable" champs like Wu, suggestions? | Actual: 10 Expected: 10',
 '(Ezreal) Roadblocks on the Search for an S | Actual: 18 Expected: 18',
 '**How to Shotcall?** | Actual: 12 Expected: 12',
 '10 Rune Pages - Requesting input on pages. | Actual: 5 Expected: 5',
 '100k Subs Celebration: INFOGRAPHIC DESIGN CONTEST!! | Actual: 5 Expected: 17',
 '11 Thresh Questions | Actual: 17 Expected: 17',
 "17% win rate with Jhin. I love the champion but I'm not so sure if I should be playing him anymore. | Actual: 19 Expected: 19",
 '30+ year old players out there? Share your game xp! | Actual: 21 Expected: 21',
 '3v3 tactic? | Actual: 1 Expected: 1',
 '5 Tips To Get More Jungle Ganks | Actual: 50 Expected: 51',
 '5 man bottom | Actual: 10 Expected: 10',
 '50 Shades of Bronze. | Actual: 166 Expecte

The counts mostly match, but some are slightly off. A couple of reasons:

1. Sometimes more comments are posted after I pull the initial count.
2. If a comment tree is especially deep, the deepest replies sit behind a "continue this thread" link, which the scraper skips because I'm not particularly interested in them.
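
To list only the threads whose counts disagree, something like the following could work. It is a sketch, not what I ran: `find_mismatches` is a hypothetical helper, and the sample dictionary copies the counts of three rows from the output above.

```python
def find_mismatches(master, tolerance=0):
    """Return (title, actual, expected) for threads whose counts differ by more than `tolerance`."""
    return sorted(
        (title, v["actual_comments"], v["expected_comments"])
        for title, v in master.items()
        if abs(v["actual_comments"] - v["expected_comments"]) > tolerance
    )

# Counts copied from three rows of the output above
sample_master = {
    "11 Thresh Questions": {"actual_comments": 17, "expected_comments": 17},
    '"Janna is bad unless you are a Pro"': {"actual_comments": 141, "expected_comments": 142},
    "100k Subs Celebration: INFOGRAPHIC DESIGN CONTEST!!": {"actual_comments": 5, "expected_comments": 17},
}

exact = find_mismatches(sample_master)
loose = find_mismatches(sample_master, tolerance=1)
print(exact)  # both mismatched threads
print(loose)  # only the thread that is off by more than one comment
```

A small tolerance filters out threads that gained a comment or two during the scrape and leaves the ones that deserve a closer look.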
***

# Save file
***
Finally, I'll save the text data so I don't have to run this again unless I want more data. I'm going to dump it into a JSON file with just the thread names and comments.

In [None]:
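# Keep only the thread title and its comment trees
save = {k: v['comments'] for k,v in master.items()}

# Use a context manager so the file is flushed and closed after writing
with open('thread_comments_100pg.txt','w') as f:
    json.dump(save,f)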
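
To check that a saved file reads back correctly, a round trip like this should do. It is a minimal sketch: the sample dictionary and the temporary file name are made up so the example does not overwrite the real output file.

```python
import json
import os
import tempfile

# Hypothetical data in the same shape as the saved dictionary
sample_save = {"example thread": [["parent", "reply"], ["another parent"]]}

# Write to a temporary directory, then read the file back
path = os.path.join(tempfile.mkdtemp(), "thread_comments_example.json")
with open(path, "w") as f:
    json.dump(sample_save, f)
with open(path) as f:
    loaded = json.load(f)

# Lists of strings survive the JSON round trip unchanged
total = sum(len(tree) for trees in loaded.values() for tree in trees)
print(loaded == sample_save, total)  # True 3
```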