<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics 2019
    </span>
    <br />
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 3-4: ACCESSING REDDIT API <br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)
    </span>
</p>

<img style="width: 240px; height: 120px; float: right; margin: 0 0 0 0;" src="http://www.merritt.edu/wp/histotech/wp-content/uploads/sites/275/2018/08/berkeley-logo.jpg" />
</div>

# Accessing the Reddit API

The Reddit API lets you automate many actions, such as posting as a user. It also lets you retrieve data from Reddit, such as subreddit posts and comments.

There are restrictions in place: Reddit's API only allows you to retrieve 1000 posts (and their associated comments) per query. We could write a script that keeps track of post timestamps so as to scrape the entirety of a subreddit over multiple queries, but for now we will just download 1000 posts from our subreddit (or fewer, if your subreddit has fewer than 1000 posts).

## Signing up to use the Reddit API

### 1. Sign up

Go to http://www.reddit.com and **sign up** for an account.

### 2. Create an app
Go to https://ssl.reddit.com/prefs/apps/ and click on `create app`:

<img style="width: 1000px; height: 361px; float: right; margin: 2 2 2 2;" src="img/reddit-1.png" />

### 3. Note details 
Note the client ID, client secret, and your username/password for Reddit:

<img style="width: 1000px; height: 395px; float: right; margin: 2 2 2 2;" src="img/reddit-2.png" />
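
Because the client secret and password are sensitive, you may prefer not to type them directly into a notebook that you share. The following is a minimal sketch of one alternative, assuming you have stored the four details in environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_credentials` are hypothetical and not part of PRAW.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}

def load_credentials(environ=os.environ):
    """Collect the Reddit app details from environment variables."""
    # Report every missing variable at once rather than failing one at a time
    missing = [name for name in ENV_KEYS.values() if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_KEYS.items()}
```

The returned dictionary uses the same keyword names that `praw.Reddit` takes, so it could be passed along as `praw.Reddit(user_agent='reddit_posts', **load_credentials())`.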

## Using the Reddit API

With the details we just created, we can access the Reddit API using PRAW (the Python Reddit API Wrapper).
We're downloading 1000 posts and their associated comments.

For the purpose of this exercise, we'll download them into a single data file, but it's common practice to store posts and comments in two separate, related tables (can you think of why this is?).
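
To make the two-table layout concrete, here is a minimal sketch that writes posts and comments to two CSV files linked by a shared `post_id` column. It uses plain dictionaries as stand-ins for Reddit data. The function name `write_tables`, the column names, and the sample data are all hypothetical.

```python
import csv

def write_tables(posts, posts_path, comments_path):
    """Write posts and comments to two CSV files linked by post_id."""
    with open(posts_path, "w", encoding="utf-8", newline="") as pf, \
         open(comments_path, "w", encoding="utf-8", newline="") as cf:
        post_writer = csv.writer(pf)
        comment_writer = csv.writer(cf)
        post_writer.writerow(["post_id", "title", "body"])
        comment_writer.writerow(["post_id", "comment_no", "comment"])
        for post in posts:
            # One row per post in the first table
            post_writer.writerow([post["id"], post["title"], post["body"]])
            # One row per comment in the second table, pointing back to its post
            for number, text in enumerate(post["comments"], start=1):
                comment_writer.writerow([post["id"], number, text])

# Hypothetical sample data standing in for downloaded posts
sample_posts = [
    {"id": "p1", "title": "First post", "body": "Hello", "comments": ["First reply", "Second reply"]},
    {"id": "p2", "title": "Second post", "body": "World", "comments": ["Only reply"]},
]
```

Calling `write_tables(sample_posts, "posts.csv", "comments.csv")` would produce one row per post in the first file and one row per comment in the second.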

First, we enter the user details of the app we just created. Then, we run a function that retrieves the post and its associated metadata, as well as the comments. We save the information in a CSV.

**Note:** you might want to add other metadata elements to your function, or organize it differently (for instance, you could capture the comments separately). For a list of the attributes you can use, check:

* https://praw.readthedocs.io/en/latest/code_overview/models/submission.html for submissions/posts
* https://praw.readthedocs.io/en/latest/code_overview/models/comment.html for comments
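
One way to make the list of captured attributes easy to change is to keep the attribute names in a list and look them up with `getattr`. The sketch below assumes nothing about PRAW beyond the attribute names `id`, `score`, `created_utc` and `body`, which appear in the comment documentation linked above. The helper `to_row` is hypothetical, and a `SimpleNamespace` stands in for a real comment so that the example runs without credentials.

```python
from types import SimpleNamespace

# Attribute names to capture; edit this list to add or drop columns
COMMENT_FIELDS = ["id", "score", "created_utc", "body"]

def to_row(item, fields):
    """Return the named attributes of item as a list, with None for missing ones."""
    return [getattr(item, name, None) for name in fields]

# Stand-in object with made-up values, used here instead of a real PRAW comment
fake_comment = SimpleNamespace(id="abc123", score=5,
                               created_utc=1546300800.0, body="Example comment")
row = to_row(fake_comment, COMMENT_FIELDS)
```

The resulting `row` can be passed straight to `writer.writerow`, with `COMMENT_FIELDS` doubling as the header row.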

In [None]:
import praw
import csv
from datetime import datetime

maxCount = 1000  # change depending on how much data you need (max is 1000)

def main():
    """Use PRAW, a Python package for the Reddit API, to authenticate with Reddit 
    and save posts and their comments to a CSV file."""
    # Change the name of this variable to your preferred filename
    fileName = "SUBREDDIT_NAME_HERE" + "_" + str(maxCount) + "_" + datetime.now().strftime('%Y%m%d') + ".csv"
    # Change the name of these variables to those of your Reddit app
    reddit = praw.Reddit(client_id='CLIENT_ID_HERE',
                         client_secret='CLIENT_SECRET_HERE',
                         password='REDDIT_PSW_HERE',
                         user_agent='reddit_posts',
                         username='REDDIT_USERNAME_HERE')
    print("Retrieving data...", end="", flush=True)
    # newline='' stops the csv module from writing blank rows on Windows;
    # the with-block closes the file even if the download is interrupted
    with open(fileName, 'w', encoding='utf-8', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['no.', 'url', 'date', 'author', 'score', 'flair', 'num_comments', 'title', 'body', 'comments'])
        get_data(reddit, writer)
    print("Done!" + "\n" + "Found " + str(itemCount) + " posts" + "\n" + "Found " + str(commentCount) + " comments")

def get_data(reddit, writer):
    global itemCount 
    itemCount = 0
    global commentCount 
    commentCount = 0
    # params is only needed if you switch to .search() (see below)
    params = {'sort':'new', 'limit':None, 'syntax':'lucene'}
    # Change 'SUBREDDIT_NAME_HERE' to your preferred subreddit name (e.g. "changemymind").
    # limit=None retrieves as many posts as the API allows (max 1000).
    # Instead of .top you can also try .hot, .controversial, or .search('SEARCHTERM', **params)
    # E.g. for submission in reddit.subreddit('amitheasshole').search('flair:"YTA"', **params):
    for submission in reddit.subreddit('SUBREDDIT_NAME_HERE').top(limit=None):
        itemCount += 1
        # created_utc is the post's creation time as a Unix timestamp (UTC)
        timestamp = submission.created_utc
        date = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
        title = submission.title
        url = submission.url
        body = submission.selftext
        author = submission.author
        flair = submission.link_flair_text
        score = submission.score
        num_comments = submission.num_comments
        commentList = []
        submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():
            # Skip comments whose author account has been deleted
            if comment.author is not None:
                commentCount += 1
                commentList.append(comment.body)
        comments = ' '.join(commentList)
        writer.writerow( (itemCount, url, date, author, score, flair, num_comments, title, body, comments) )
        print(".", end="", flush=True)
        if itemCount == maxCount:
            break

main()
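
Once the download has finished, you will probably want to load the CSV back into Python. Because all of a post's comments are joined into a single cell, that cell can exceed the `csv` module's default field size limit of 131072 characters, in which case reading fails with an error. The sketch below raises the limit before reading. The function name `read_posts` and the chosen limit of `10**9` are assumptions for illustration.

```python
import csv

def read_posts(path):
    """Load the scraper's CSV as a list of dicts keyed by the header row."""
    # The 'comments' column may be longer than csv's default field limit
    csv.field_size_limit(10**9)
    with open(path, encoding="utf-8", newline="") as infile:
        return list(csv.DictReader(infile))
```

Each returned item is a dictionary, so a post's title is available as `posts[0]['title']` and its joined comments as `posts[0]['comments']`. All values are read back as strings.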