# Scraping the Reddit API

A Python wrapper library (PRAW) exists, so I used it. Documentation can be found [here](https://praw.readthedocs.io/en/latest/).

In [119]:
import praw
import time 
import calendar
import yaml
from pathlib import Path
from decimal import Decimal
import boto3
home = str(Path.home())

I read the login info from a file on my computer to avoid pushing my keys up to GitHub. This API is free, but you do not want to make that mistake with one that is not, so get into the habit of reading in credentials like this early. There are many ways to do it. An example of why this is a good habit is [here](https://medium.com/@nagguru/exposing-your-aws-access-keys-on-github-can-be-extremely-costly-a-personal-experience-960be7aad039).

In [124]:
# Use a context manager so the key file is closed after it is read
with open(home + '/keys/api_keys.yml') as f:
    conf = yaml.safe_load(f)
clientid = conf['reddit']['client_id']
clientsecret = conf['reddit']['client_secret']
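
As one of the other ways to keep keys out of the repository, here is a minimal sketch that reads them from environment variables instead of a YAML file. The variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are my own assumptions for illustration, not something PRAW requires.

```python
import os


def load_reddit_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    try:
        return env['REDDIT_CLIENT_ID'], env['REDDIT_CLIENT_SECRET']
    except KeyError as missing:
        # Fail loudly with the name of the variable that is not set
        raise RuntimeError(
            'Missing environment variable: {}'.format(missing.args[0])
        ) from None
```

Passing a plain dict as `env` makes the helper easy to try out without touching the real environment.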

In [125]:
reddit = praw.Reddit(client_id=clientid,
                     client_secret=clientsecret,
                     user_agent='DamitDan01')

There are many ways to make use of the API. Here I just want to see what a record looks like and how it is set up.

In [None]:
for submission in reddit.subreddit('AskReddit').new(limit=5):
    print(submission.title)

So the `reddit.subreddit('AskReddit').new(limit=5)` call creates a generator that yields Submission objects.  I just want one submission to work with to make sure it is what I expect.

In [128]:
submission = list(reddit.subreddit('AskReddit').hot(limit=5))[3]
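
Building a list just to index into it works fine for five items. As a sketch of an alternative, `itertools.islice` can pull the n-th item straight from a lazy iterator. The `fake_listing` generator below is a hypothetical stand-in for the Reddit listing so the example runs without credentials.

```python
from itertools import islice


def nth(iterable, n, default=None):
    """Return the item at position n (0-based) of an iterable, or default."""
    return next(islice(iterable, n, None), default)


def fake_listing(limit):
    """Hypothetical stand-in for a lazy listing of submissions."""
    for i in range(limit):
        yield 'post-{}'.format(i)


# Same idea as list(...)[3], without materialising the whole list
fourth = nth(fake_listing(5), 3)
```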

In [127]:
type(submission)

praw.models.reddit.submission.Submission

In [135]:
d = submission.__dict__
for k in sorted(d):
    print('{}: {},'.format(k, d[k]))

_comments_by_id: {},
_fetched: False,
_reddit: <praw.reddit.Reddit object at 0x1a22b29e10>,
all_awardings: [],
allow_live_comments: True,
approved_at_utc: None,
approved_by: None,
archived: False,
author: kitterbug,
author_flair_background_color: None,
author_flair_css_class: None,
author_flair_richtext: [],
author_flair_template_id: None,
author_flair_text: None,
author_flair_text_color: None,
author_flair_type: text,
author_fullname: t2_b2nhy,
author_patreon_flair: False,
awarders: [],
banned_at_utc: None,
banned_by: None,
can_gild: False,
can_mod_post: False,
category: None,
clicked: False,
comment_limit: 2048,
comment_sort: best,
content_categories: None,
contest_mode: False,
created: 1571432989.0,
created_utc: 1571404189.0,
discussion_type: None,
distinguished: None,
domain: self.AskReddit,
downs: 0,
edited: False,
gilded: 0,
gildings: {},
hidden: False,
hide_score: False,
id: djnaa7,
is_crosspostable: False,
is_meta: False,
is_original_content: False,
is_reddit_media_domain: Fals

There are way more fields than I will need.  It is usually best to take as much information as possible when scraping; however, DynamoDB is not as easy to put things into as Mongo, so I am going to take just a subset of the fields that I think will be useful.

Since I am doing NLP, the info I will need is the title, the text, whether it was given gold, and the time of posting. I will also get the unique identifiers for the posts and the comments so I can go back and get more information if needed.
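
A minimal sketch of picking out just those fields, with a placeholder for anything that is missing. The helper `pick_fields`, the `WANTED` tuple, and the `fake_post` object are assumptions for illustration; the attribute names are the ones that show up in the printed record above or in the scraping code later in this notebook.

```python
from types import SimpleNamespace

# Fields of interest; `name` is the unique identifier of the post
WANTED = ('name', 'title', 'selftext', 'gilded', 'created_utc', 'author_fullname')


def pick_fields(obj, fields=WANTED, missing='*unknown'):
    """Copy selected attributes into a dict, using a placeholder when absent."""
    return {field: getattr(obj, field, missing) for field in fields}


# Hypothetical stand-in for a submission whose author field is absent
fake_post = SimpleNamespace(
    name='t3_abc123',
    title='Example title',
    selftext='',
    gilded=0,
    created_utc=1571404189.0,
)
record = pick_fields(fake_post)
```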

After reading the documentation about comments, it looks like the `comments` attribute is not a plain list but a comment tree (a `CommentForest`) that we can set some boundaries on.  So to get the comments of a post I will use the code below.

In [136]:
submission.comments.replace_more(limit=None)
for comment in submission.comments.list():
    print(comment.body)

my lack of a sore throat and ability to swallow without pain.
There isn't a piece of popcorn stuck deep inbetween my teeth.
no cut in my mouth. 

sometimes I get a cut in my mouth and it will cause pain that basically takes over my entire consciousness and last for over a week. Usually starts as accidentally biting my cheek while eating something, then it gets a little swollen, causing me to accidentally bite it more, then it wont heal and it makes it so I cant eat or talk or sleep. Have had this problem since I was a little kid. Miserable.
That I can still get erections. I dread the day that guy won't salute anymore.
Wait, you guys are able to breathe trough your notrils? 

This comment was created by the allergicgang.
*cries in deviated septum*
That I don't currently have the extreme urge to go to the bathroom.
That until I read your fucking post, I was breathing automatically without thinking about it.
I don’t have a toothache.
How clean my new glasses are. . . but *not* the fact th
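
To make the idea of flattening a comment tree concrete, here is a small sketch that walks a nested structure level by level with a queue. The dict-based `tree` is a hypothetical stand-in for real comments, and I have not checked that PRAW orders its flattened list exactly this way, so treat it as an illustration of the concept only.

```python
from collections import deque


def flatten(top_level):
    """Return every comment in a nested tree, top-level comments first."""
    queue = deque(top_level)
    flat = []
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # Replies are visited after everything already waiting in the queue
        queue.extend(comment.get('replies', []))
    return flat


# Hypothetical comment tree: 'a' has one reply, 'b' has none
tree = [
    {'body': 'a', 'replies': [{'body': 'a1', 'replies': []}]},
    {'body': 'b', 'replies': []},
]
bodies = [c['body'] for c in flatten(tree)]
```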

In [137]:
com = list(submission.comments.list())[0]

In [138]:
d = com.__dict__
for k in sorted(d):
    print('{}: {},'.format(k, d[k]))

_fetched: True,
_reddit: <praw.reddit.Reddit object at 0x1a22b29e10>,
_replies: <praw.models.comment_forest.CommentForest object at 0x1a22c34150>,
_submission: djnaa7,
all_awardings: [],
approved_at_utc: None,
approved_by: None,
archived: False,
associated_award: None,
author: bellaChil,
author_flair_background_color: None,
author_flair_css_class: None,
author_flair_richtext: [],
author_flair_template_id: None,
author_flair_text: None,
author_flair_text_color: None,
author_flair_type: text,
author_fullname: t2_1rfgnjc6,
author_patreon_flair: False,
awarders: [],
banned_at_utc: None,
banned_by: None,
body: my lack of a sore throat and ability to swallow without pain.,
body_html: <div class="md"><p>my lack of a sore throat and ability to swallow without pain.</p>
</div>,
can_gild: True,
can_mod_post: False,
collapsed: False,
collapsed_reason: None,
controversiality: 0,
created: 1571433411.0,
created_utc: 1571404611.0,
depth: 0,
distinguished: None,
downs: 0,
edited: False,
gilded: 0,
gil

I only want posts that are at least 4 hours old, so I need to check the age of each post.

In [None]:
hours_ago = 4
if (calendar.timegm(time.gmtime()) - (hours_ago * 60*60)) > submission.created_utc:
    pass
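
The same check can be wrapped in a small function so it is easy to test. This is a sketch under my own naming (`is_old_enough` and the `now` parameter are not part of the original code); it uses `time.time()`, which returns seconds since the epoch like `created_utc`.

```python
import time


def is_old_enough(created_utc, hours_ago=4, now=None):
    """Return True if a post created at `created_utc` is at least `hours_ago` hours old."""
    now = time.time() if now is None else now
    return now - created_utc >= hours_ago * 60 * 60
```

Passing `now` explicitly keeps the function deterministic, for example `is_old_enough(0, 4, now=4 * 3600)` is true while one second earlier it is false.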

Here is the initial setup for scraping the API.

In [None]:
def scrape_sub(sub, reddit_connection):
    """Yield one dict per post from the hot listing of subreddit `sub`."""
    post_limit = 50

    for submission in reddit_connection.subreddit(sub).hot(limit=post_limit):

        # author_fullname does not exist when the author has been deleted
        try:
            post_user = submission.author_fullname
        except AttributeError:
            post_user = '*unknown'

        post = {
            'post_id': submission.name,
            'votes_up': submission.ups,
            'votes_down': submission.downs,
            'votes_ratio': Decimal(str(submission.upvote_ratio)),
            'gilded': submission.gilded,
            'score': Decimal(str(submission.score)),
            'post_date': Decimal(str(submission.created)),
            'post_user': post_user,
            'awards': submission.all_awardings,
            'body': submission.selftext,
            'title': submission.title,
            'url': submission.url,
            'subreddit': sub
        }

        yield post
        

Notes:
 * `post_id` and `post_user` are unique identifiers set up by Reddit.
 * DynamoDB does not handle floats unless they are first cast to Decimal, and casting a float directly can lead to rounding issues, so it is sometimes necessary to cast it to a string first.  Next time I will most likely just treat everything as a string or int.
 * A user or post can be deleted, in which case a field such as the author's fullname will not exist, which breaks the code. In some instances the user name is gone but the post body exists, other times not.  I will filter bad data later.
 * `gilded` should have the data to see if it got gold.
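
To illustrate the Decimal note, here is a small sketch. `Decimal(0.1)` keeps the full binary expansion of the float, while `Decimal(str(0.1))` gives the short value you would expect. The recursive helper `to_dynamo` is my own hypothetical convenience, not something boto3 provides.

```python
from decimal import Decimal


def to_dynamo(value):
    """Recursively replace floats with Decimals built from their string form."""
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_dynamo(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo(v) for v in value]
    return value  # ints, strings, None, etc. pass through unchanged


print(Decimal(0.1))        # long expansion of the binary float
print(Decimal(str(0.1)))   # 0.1

item = to_dynamo({'votes_ratio': 0.97, 'nested': [1.5, 'x', 2]})
```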


This worked but did not get to the comments.  The code below did that for me.

In [None]:
def get_comments(submission, sub):
    """Build one dict per comment of `submission`, which was found in subreddit `sub`."""
    # Resolve every "load more comments" placeholder before flattening
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():

        # author_fullname does not exist when the comment's author has been deleted
        try:
            full_name_c = comment.author_fullname
        except AttributeError:
            full_name_c = '*unknown'

        com = {
            'comment_id': comment.name,
            'post_id': submission.name,
            'comment_parent': comment.parent_id,
            'votes_up': comment.ups,
            'votes_down': comment.downs,
            'score': Decimal(str(comment.score)),
            'post_date': Decimal(str(comment.created)),
            'post_user': full_name_c,
            'awards': comment.all_awardings,
            'body': comment.body,
            'comment_depth': comment.depth,
            'subreddit': sub
        }

        # Store in db


def scrape_sub(sub, reddit_connection):
    """Walk the hot posts of subreddit `sub` and collect their comments."""
    post_limit = 50

    for submission in reddit_connection.subreddit(sub).hot(limit=post_limit):

        # author_fullname does not exist when the author has been deleted
        try:
            post_user = submission.author_fullname
        except AttributeError:
            post_user = '*unknown'

        post = {
            'post_id': submission.name,
            'votes_up': submission.ups,
            'votes_down': submission.downs,
            'votes_ratio': Decimal(str(submission.upvote_ratio)),
            'gilded': submission.gilded,
            'score': Decimal(str(submission.score)),
            'post_date': Decimal(str(submission.created)),
            'post_user': post_user,
            'awards': submission.all_awardings,
            'body': submission.selftext,
            'title': submission.title,
            'url': submission.url,
            'subreddit': sub
        }

        # Store record into DynamoDB

        get_comments(submission, sub)

        time.sleep(1)

        
        

In [111]:
dynamodb = boto3.resource('dynamodb')
post_table = dynamodb.Table('posts')
comment_table = dynamodb.Table('comments')

def get_comments(submission, sub, comment_table):
    """Store every comment of `submission` (from subreddit `sub`) in `comment_table`."""
    # Resolve every "load more comments" placeholder before flattening
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():

        # author_fullname does not exist when the comment's author has been deleted
        try:
            full_name_c = comment.author_fullname
        except AttributeError:
            full_name_c = '*unknown'

        com = {
            'comment_id': comment.name,
            'post_id': submission.name,
            'comment_parent': comment.parent_id,
            'votes_up': comment.ups,
            'votes_down': comment.downs,
            'score': Decimal(str(comment.score)),
            'post_date': Decimal(str(comment.created)),
            'post_user': full_name_c,
            'awards': comment.all_awardings,
            'body': comment.body,
            'comment_depth': comment.depth,
            'subreddit': sub
        }

        comment_table.put_item(Item=com)


def scrape_sub(sub, reddit_connection, post_table, comment_table):
    """Store the hot posts of subreddit `sub` and their comments in DynamoDB."""
    post_limit = 50

    for submission in reddit_connection.subreddit(sub).hot(limit=post_limit):

        # author_fullname does not exist when the author has been deleted
        try:
            post_user = submission.author_fullname
        except AttributeError:
            post_user = '*unknown'

        post = {
            'post_id': submission.name,
            'votes_up': submission.ups,
            'votes_down': submission.downs,
            'votes_ratio': Decimal(str(submission.upvote_ratio)),
            'gilded': submission.gilded,
            'score': Decimal(str(submission.score)),
            'post_date': Decimal(str(submission.created)),
            'post_user': post_user,
            'awards': submission.all_awardings,
            'body': submission.selftext,
            'title': submission.title,
            'url': submission.url,
            'subreddit': sub
        }

        post_table.put_item(Item=post)

        get_comments(submission, sub, comment_table)

        # Reddit has a limit of 60 requests a minute; I want to stay under that
        time.sleep(1)
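
A fixed `time.sleep(1)` always waits a full second, even when the previous work already took longer than that. As a sketch of one alternative, the small class below only sleeps for whatever is left of a minimum interval. `MinIntervalThrottle` and its parameters are hypothetical names of my own, and the clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


class MinIntervalThrottle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In the scraping loop this would look like creating `throttle = MinIntervalThrottle(1.0)` once and calling `throttle.wait()` in place of `time.sleep(1)`.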

        
        

It still needs some try/except blocks to keep running if an issue is found. I also do not want to rescrape data, so I need to check whether the record exists before scraping the comments and trying to add it to the DB. This may not be the best method, but it works.

In [None]:
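        # Goes inside the `for submission in ...` loop of scrape_sub:
        # skip posts that are already stored in the posts table
        if 'Item' in post_table.get_item(Key={'post_id': submission.name}):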
            continue
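
The existence check can also be pulled into a helper, which makes it possible to try it against a stand-in table. This is a sketch: `post_exists` and `FakeTable` are hypothetical names of my own, and the fake only mimics the one behaviour relied on here, namely that the response of `get_item` contains an `'Item'` key only when the record exists.

```python
def post_exists(table, post_id):
    """Return True if a record with this post_id is already in the table."""
    response = table.get_item(Key={'post_id': post_id})
    return 'Item' in response


class FakeTable:
    """Hypothetical in-memory stand-in for a DynamoDB table keyed on post_id."""

    def __init__(self):
        self._items = {}

    def put_item(self, Item):
        self._items[Item['post_id']] = Item

    def get_item(self, Key):
        item = self._items.get(Key['post_id'])
        # Like the real response, omit 'Item' when nothing was found
        return {'Item': item} if item is not None else {}


table = FakeTable()
table.put_item(Item={'post_id': 't3_abc123', 'title': 'Example title'})
```

With a real table the call would be `post_exists(post_table, submission.name)` at the top of the loop, followed by `continue` when it returns true.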
