# Data Collection Part I: Reddit Posts

This project analyzes user comments on individual social media posts. Specifically, I set out to collect data from posts that had gone viral.

There is no formal definition of virality, and every platform measures the popularity of its content differently. Reddit uses two metrics: ups versus downs, and score. Ups and downs are akin to likes and dislikes: users click arrow icons on the post to weigh in. The score is the number of ups minus the number of downs.

Reddit posts can be filtered by post type (Hot, Rising, New, Controversial, and Top) and by timeframe (Past 24 Hours, Week, Month, Year, and All-Time). The search criterion for this API call was "top posts of all time."
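As a rough sketch of how those two search dimensions map onto PRAW, the helper below dispatches to a subreddit's listing methods. `fetch_listing` and its validation sets are hypothetical names introduced here for illustration; the underlying `hot`, `new`, `rising`, `controversial`, and `top` methods and the `time_filter` values are PRAW's. Only `controversial` and `top` accept a time filter.

```python
# Hypothetical helper (not part of PRAW): map a post type and timeframe
# onto the matching listing method of a PRAW Subreddit object.
TIME_FILTERS = {'hour', 'day', 'week', 'month', 'year', 'all'}
TIMED_SORTS = {'top', 'controversial'}
UNTIMED_SORTS = {'hot', 'new', 'rising'}


def fetch_listing(subreddit, sort='top', time_filter='all', limit=300):
    """Return the listing generator for the requested sort and timeframe."""
    if sort in TIMED_SORTS:
        if time_filter not in TIME_FILTERS:
            raise ValueError(f'unknown time_filter: {time_filter!r}')
        return getattr(subreddit, sort)(time_filter=time_filter, limit=limit)
    if sort in UNTIMED_SORTS:
        # Hot, New, and Rising listings take no timeframe.
        return getattr(subreddit, sort)(limit=limit)
    raise ValueError(f'unknown sort: {sort!r}')
```

Given a PRAW `subreddit` object, `fetch_listing(subreddit, 'top', 'all', 300)` would make the same call as the `subreddit.top(time_filter='all', limit=300)` line used in this notebook.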

#### Data Collection Methodology

To gather each post's data and convert it into a usable format, I used the Reddit API wrapper PRAW and serialized the results as a JSON string. Reddit stores comment data as a separate object from the post to which it belongs, so I chained two requests to the API to collect the right comment thread for each post.

The comment thread is a complicated structure in its own right, chiefly because each comment can have one or more nested replies, and each of those replies can have nested replies of its own. To keep the data output manageable, I limited the number of top-level comments to 50 and did not request replies. While I would have preferred to collect a limited number of replies for each comment, PRAW does not allow for specifying a maximum number of replies, and collecting all of them would have been logistically problematic: some of the top-level comments had hundreds of replies.
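One way to picture the nesting, and a possible client-side workaround for the reply limit, is to truncate each reply list after the thread has been fetched. The sketch below assumes only that each comment exposes a `body` and an iterable `replies`, as PRAW comments do once `replace_more` has removed the "load more" placeholders; `trim_thread`, `max_replies`, and `max_depth` are hypothetical names. It does not reduce what is requested from the API, so it is not the approach used in this notebook.

```python
from itertools import islice


def trim_thread(comments, max_comments=50, max_replies=3, max_depth=2):
    """Copy a nested comment thread into plain dicts, capping each level.

    `comments` is any iterable of objects with `.body` and `.replies`.
    The top level keeps at most `max_comments` items; each deeper level
    keeps at most `max_replies` per parent, down to `max_depth` levels.
    """
    trimmed = []
    for comment in islice(comments, max_comments):
        item = {'body': comment.body}
        if max_depth > 0:
            # Recurse into the replies with the per-parent cap.
            item['replies'] = trim_thread(
                comment.replies,
                max_comments=max_replies,
                max_replies=max_replies,
                max_depth=max_depth - 1,
            )
        trimmed.append(item)
    return trimmed
```

Calling it with `max_depth=0` returns top-level comments only, which corresponds to the choice described above.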

The structure of the JSON output mirrors that of the original content: a dictionary of data about the post, with the comments nested under a "comments" key. I wanted to limit the amount of metadata coming in for each object and tackle some of the data cleaning ahead of time. The output was written to a JSON file and then read back in at the bottom of this notebook to confirm that it was readable.
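To make that layout concrete, here is a minimal sketch of one record and a round trip through a file. The field names match the ones selected in the collection code in this notebook; every value is a made-up placeholder, and the temporary file path is used only for illustration.

```python
import json
import os
import tempfile

# One post record; all values are placeholders, not collected data.
example_post = {
    'id': 'abc123',
    'title': 'Example title',
    'ups': 100,
    'score': 100,
    'created_utc': 1549670400.0,
    'num_comments': 2,
    'comments': [{'body': 'First comment'}, {'body': 'Second comment'}],
}

# Write a list of records, then read it back to confirm that it parses.
path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(path, 'w') as f:
    json.dump([example_post], f)

with open(path) as f:
    restored = json.load(f)

print(restored[0]['comments'][0]['body'])
```

Each comment is stored as its own small dictionary so that further comment fields could be added later without changing the shape of the post record.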

In [43]:
## Packages and libraries:

import config as co
import os
import codecs
import praw
import json
import time
import pandas as pd
import numpy as np
import itertools as it

In [128]:
## API Call function:

reddit = praw.Reddit(client_id=co.reddit_client_id,
                     client_secret=co.reddit_client_secret,
                     user_agent='bot_v.02')

subreddit = reddit.subreddit('all')

top_posts = subreddit.top(time_filter='all', limit=300)

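list_of_items = []
fields = ('id', 'title', 'ups', 'score', 'created_utc', 'num_comments')
comment_field = 'body'  # a plain string; ('body') without a trailing comma is not a tuple

for submission in top_posts:
    if not submission.stickied:
        # Keep only the selected post metadata.
        to_dict = vars(submission)
        sub_dict = {field: to_dict[field] for field in fields}

        # Drop the "load more comments" placeholders without fetching them.
        submission.comments.replace_more(limit=0)

        sub_dict['comments'] = []

        # Iterating the comment forest directly yields top-level comments only;
        # .list() would flatten nested replies into the same list.
        for comment in list(submission.comments)[:50]:
            to_comm_dict = vars(comment)
            temp_comm_dict = {comment_field: to_comm_dict[comment_field]}

            sub_dict['comments'].append(temp_comm_dict)

        list_of_items.append(sub_dict)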

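## Write the output to a JSON file in the local directory:

# 'w' overwrites on re-run; appending would concatenate a second JSON
# document onto the same line, which json.loads cannot parse.
# The extra outer list is kept because the read-back cell indexes reddit_data[0][0].
with open('./Data/REDDITdata_020919.json', 'w') as f:
    f.write(json.dumps([list_of_items]))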
            

In [129]:
## Read in and check the JSON output:

reddit_data = []
notParsed = []
with open('./Data/REDDITdata_020919.json', 'r') as reddit_file:
    for line in reddit_file:
        try:
            post = json.loads(line)
            reddit_data.append(post)
        except json.JSONDecodeError:
            # Keep unparseable lines for inspection instead of failing.
            notParsed.append(line)
print(len(reddit_data))
print('Could not parse: ', len(notParsed))

1
Could not parse:  0


In [123]:
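# pandas >= 1.0: json_normalize is exposed at the top level, and None
# (rather than -1) disables column-width truncation.
df = pd.json_normalize(reddit_data[0][0])
pd.set_option('display.max_colwidth', None)
df.comments.head()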

0    [{'body': 'Remember when the highest upvoted post you saw in a week had 5000 points?  

EDIT: For those that are just getting to this post and are confused, when I posted this comment the OP was over 21,000 points. Yes, I know it currently says 11,000 total votes. As many people replied to me, reddit's algorithms fudge the votes in interesting ways to try to keep the front page changing.  

EDIT 2: Yes ladies and gents, I know where it's at now. Insane. ', 0: {'body': 'This post has almost 200K upvotes, wtf?'}}, {'body': 'Can't wait to upvote this 17 different times later this week.'}, {'body': 'Nice watermark PP. Clever.'}, {'body': 'The mirrored text brought this to a different level. '}, {'body': '[deleted]'}, {'body': 'Thank you /u/iH8myPP for wasting your weekend so we may smile on a Monday.'}, {'body': 'https://gfycat.com/RecentIdleAbalone'}, {'body': 'I didn't realize how much I missed Groot until now.'}, {'body': 'Source:  Guardians of the Galaxy 2 (Trailer)'}, {'body': 'G