# Project 3: Web APIs & Classification

## Problem Statement

Using Reddit's API, we have collected posts from two subreddits:
* Technology
* Today I Learned (TIL)

We will then use NLP to train a classifier that predicts which subreddit a given post came from (a binary classification problem).

In [29]:
import requests
import pandas as pd
import time
import random

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk

In [30]:
#The subreddit URLs
url = [
    {'Title': 'todayilearned', 'url': 'https://www.reddit.com/r/todayilearned.json'},
    {'Title': 'technology', 'url': 'https://www.reddit.com/r/technology.json'},
]

# url1 is the first subreddit's URL (todayilearned); the requests below use it
url1 = url[0]['url']

# Changing the default user agent to ensure access to Reddit's API
res = requests.get(url1, headers={'User-agent': 'Pony Inc 1.0'})


When you open https://www.reddit.com/r/boardgames.json in a browser, Reddit can see which browser is making the request (for example, Chrome on a Mac). Python, however, sends its own default user agent. Since so many scripts are already 'hitting' Reddit's API with that default, Reddit basically blocks Python scripts that use it.

That is why the request above passes a custom `User-agent` header instead of relying on the default.
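As a minimal sketch of what "default user agent" means, the snippet below uses only the standard library's `urllib` (an assumption made so that nothing is sent over the network; the notebook itself uses `requests`, which has its own default string). It prints the identity Python would send by default and then builds, without sending, a request that carries the custom value used in this notebook.

```python
import urllib.request

# The default identity that the standard library's opener would send.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers['User-agent'])  # e.g. Python-urllib/3.x

# Build (but do not send) a request that carries a custom identity instead.
req = urllib.request.Request(
    'https://www.reddit.com/r/todayilearned.json',
    headers={'User-agent': 'Pony Inc 1.0'},
)
print(req.get_header('User-agent'))
```

The same idea applies to `requests.get(..., headers={'User-agent': ...})`: the header passed explicitly replaces the library default.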

In [31]:
reddit_dict = res.json()

# The dictionary structure of reddit_dict (outline only, values omitted):
#
# reddit_dict = {
#     kind:,
#     data: {
#         modhash:,
#         dist:,
#         children: [                 # has 26 elements, one per post
#             {kind:, data: {...}},   # each post's fields sit under its own 'data' key
#             ...
#         ],
#         after:,
#         before:
#     }
# }
#
# Keys of each post's 'data' dictionary:
#     approved_at_utc,
#     subreddit,    # The class label, aka your target.
#     selftext,     # The body text of the post (empty for link posts).
#     author_fullname, saved, mod_reason_title, gilded, clicked,
#     title,        # The title of the post.
#     link_flair_richtext, subreddit_name_prefixed, hidden, pwls,
#     link_flair_css_class, downs, thumbnail_height, hide_score, name,
#     quarantine, link_flair_text_color, author_flair_background_color,
#     subreddit_type, ups, total_awards_received, media_embed,
#     thumbnail_width, author_flair_template_id, is_original_content,
#     user_reports, secure_media, is_reddit_media_domain, is_meta, category,
#     secure_media_embed, link_flair_text, can_mod_post, score, approved_by,
#     thumbnail, edited, author_flair_css_class, author_flair_richtext,
#     gildings, post_hint, content_categories, is_self, mod_note, created,
#     link_flair_type, wls, banned_by, author_flair_type, domain,
#     selftext_html, likes, suggested_sort, banned_at_utc, view_count,
#     archived, no_follow, is_crosspostable, pinned, over_18, preview,
#     all_awardings, media_only, can_gild, spoiler, locked,
#     author_flair_text, visited, num_reports, distinguished, subreddit_id,
#     mod_reason_by, removal_reason, link_flair_background_color, id,
#     is_robot_indexable, report_reasons, author, num_crossposts,
#     num_comments, send_replies, whitelist_status, contest_mode,
#     mod_reports, author_patreon_flair, author_flair_text_color, permalink,
#     parent_whitelist_status, stickied, url, subreddit_subscribers,
#     created_utc, media, is_video
        

In [32]:
reddit_dict['data'].keys() #The most important keys are children and after.
reddit_dict['data']['children'][0]['data']['selftext'] #That's mapping to the first post.

''
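To make the nesting concrete, here is a small sketch that walks `children` and keeps only a few fields per post. The `sample_listing` dictionary is a hypothetical, heavily trimmed stand-in for `res.json()` (the post names and titles are made up), and `extract_posts` is an illustrative helper, not part of any library.

```python
# Hypothetical, trimmed stand-in for res.json(); real responses carry many more fields.
sample_listing = {
    'kind': 'Listing',
    'data': {
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaa111', 'subreddit': 'todayilearned',
                                    'title': 'Example title one', 'selftext': ''}},
            {'kind': 't3', 'data': {'name': 't3_bbb222', 'subreddit': 'todayilearned',
                                    'title': 'Example title two', 'selftext': 'Example body'}},
        ],
        'after': 't3_bbb222',
        'before': None,
    },
}

def extract_posts(listing, fields=('name', 'subreddit', 'title', 'selftext')):
    """Return one small dict per post, keeping only the requested fields."""
    rows = []
    for child in listing['data']['children']:
        post = child['data']  # the post's own fields live one level down
        rows.append({field: post.get(field) for field in fields})
    return rows

rows = extract_posts(sample_listing)
print(rows[0]['title'])
print(len(rows))
```

Keeping only `subreddit`, `title` and `selftext` (plus `name` as an identifier) is one possible choice for the classification task; the full set of keys remains available if other features are needed.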

In [33]:
posts = [p['data'] for p in reddit_dict['data']['children']] #We want to get all these posts into a Pandas DataFrame and thereafter we can save it to a CSV.

pd.DataFrame(posts)

pd.DataFrame(posts).to_csv('posts.csv')

reddit_dict['data']['after'] #This is the name of the last post.

't3_by2qyg'

In [34]:
reddit_dict['data']['before']

`after` is the name of the last post on the page, as the `name` column and the value of `after` below show.

In [19]:
pd.DataFrame(posts)['name']

0     t3_by138z
1     t3_by2u5q
2     t3_by3iw1
3     t3_bxzaff
4     t3_by25jg
5     t3_bxxyhl
6     t3_by2vxh
7     t3_bxxc4t
8     t3_bxvy3m
9     t3_by0o2l
10    t3_by36j9
11    t3_by2znc
12    t3_by2wbz
13    t3_by2s0f
14    t3_by3msj
15    t3_bxxflh
16    t3_bxzog4
17    t3_by4pqu
18    t3_bxyic5
19    t3_by36n3
20    t3_by4sog
21    t3_bxyd6v
22    t3_by5529
23    t3_by2qyg
24    t3_by26eg
Name: name, dtype: object

In [20]:
reddit_dict['data']['after']

't3_by26eg'

This is the new URL that gives you the next 25 posts.

In [42]:
url + '?after=' + reddit_dict['data']['after'] #This is the new URL that gives you the next 25 posts.

'https://www.reddit.com/r/boardgames.json?after=t3_bxkt6m'
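The URL construction above can be wrapped in a small function. This is a sketch only: `next_page_url` is a hypothetical helper name, and it uses `urllib.parse.urlencode` so the token is escaped if it ever contains special characters.

```python
from urllib.parse import urlencode

def next_page_url(base_url, after=None):
    """Return base_url unchanged for the first page, else append ?after=<token>."""
    if after is None:
        return base_url
    return base_url + '?' + urlencode({'after': after})

first_page = next_page_url('https://www.reddit.com/r/todayilearned.json')
second_page = next_page_url('https://www.reddit.com/r/todayilearned.json', 't3_by2qyg')
print(first_page)
print(second_page)
```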

## Looping through the posts, 25 posts at a time

In [43]:
posts = []
after = None

for a in range(20):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(str(a))
    time.sleep(sleep_duration)

https://www.reddit.com/r/boardgames.json
4
https://www.reddit.com/r/boardgames.json?after=t3_bxkt6m
5
https://www.reddit.com/r/boardgames.json?after=t3_bxbxr2
3
https://www.reddit.com/r/boardgames.json?after=t3_bx4i0s
4
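The loop above can also be written as a function that takes the page-fetching step as an argument, which makes the pagination logic testable without touching the network. This is a sketch under assumptions: `collect_posts` and `fake_pages` are hypothetical names, the example.com URLs are placeholders, and the early exit assumes that `after` comes back as `None` once there is no further page. Without that check, a `None` token would send the loop back to the first page.

```python
def collect_posts(fetch_page, base_url, max_pages):
    """Follow 'after' tokens for up to max_pages pages.

    fetch_page is any callable that takes a URL and returns the parsed
    listing dictionary (for example a wrapper around requests.get(...).json()).
    """
    posts = []
    after = None
    for _ in range(max_pages):
        page_url = base_url if after is None else base_url + '?after=' + after
        listing = fetch_page(page_url)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:  # assumed to mean: no further page to request
            break
    return posts

# Hypothetical two-page "API" standing in for Reddit.
fake_pages = {
    'https://example.com/r/demo.json': {
        'data': {'children': [{'data': {'name': 't3_a'}}, {'data': {'name': 't3_b'}}],
                 'after': 't3_b'}},
    'https://example.com/r/demo.json?after=t3_b': {
        'data': {'children': [{'data': {'name': 't3_c'}}],
                 'after': None}},
}

posts = collect_posts(fake_pages.__getitem__, 'https://example.com/r/demo.json', max_pages=5)
print([p['name'] for p in posts])
```

In real use, `fetch_page` would also be the place to check the status code and to sleep between requests, as the loop above does.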


In [40]:
posts = []
after = None

for a in range(35):
    if after is None:
        current_url = url1
    else:
        current_url = url1 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
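    if a > 0:
        # append this page to the posts already saved on disk
        prev_posts = pd.read_csv('boardgames.csv')
        current_df = pd.DataFrame(current_posts)
        pd.concat([prev_posts, current_df], sort=False).to_csv('boardgames.csv', index = False)
        
    else:
        # first page: create the file
        pd.DataFrame(posts).to_csv('boardgames.csv', index = False)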

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(str(a))
    time.sleep(sleep_duration)

https://www.reddit.com/r/todayilearned.json
0
https://www.reddit.com/r/todayilearned.json?after=t3_by2qyg
1
https://www.reddit.com/r/todayilearned.json?after=t3_bxye8y
2
https://www.reddit.com/r/todayilearned.json?after=t3_by5zzt
3
https://www.reddit.com/r/todayilearned.json?after=t3_by22xl
4
https://www.reddit.com/r/todayilearned.json?after=t3_bxx7m4
5
https://www.reddit.com/r/todayilearned.json?after=t3_by10l4
6
https://www.reddit.com/r/todayilearned.json?after=t3_bxgrjm
7
https://www.reddit.com/r/todayilearned.json?after=t3_bxhw1v
8
https://www.reddit.com/r/todayilearned.json?after=t3_bxhh0h
9
https://www.reddit.com/r/todayilearned.json?after=t3_bxgkeb
10
https://www.reddit.com/r/todayilearned.json?after=t3_bxft79
11
https://www.reddit.com/r/todayilearned.json?after=t3_bxata2
12
https://www.reddit.com/r/todayilearned.json?after=t3_bxb6jr
13
https://www.reddit.com/r/todayilearned.json?after=t3_bx3in4
14
https://www.reddit.com/r/todayilearned.json?after=t3_bx8tki
15
https://www.reddit
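Saving after every page means a failed request does not lose the posts already collected. One possible alternative to re-reading the whole CSV on each iteration is to open the file in append mode and write the header only once. The sketch below uses the standard library's `csv` module rather than pandas; `append_page`, `FIELDS` and the demo rows are hypothetical, and the file is written to a temporary directory.

```python
import csv
import os
import tempfile

FIELDS = ['name', 'subreddit', 'title']  # hypothetical subset of columns to keep

def append_page(path, page_posts, fields=FIELDS):
    """Append one page of posts to a CSV, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        # extrasaction='ignore' drops any keys that are not listed in fields
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
        if write_header:
            writer.writeheader()
        writer.writerows(page_posts)

csv_path = os.path.join(tempfile.mkdtemp(), 'posts_demo.csv')
append_page(csv_path, [{'name': 't3_a', 'subreddit': 'demo', 'title': 'first', 'ups': 1}])
append_page(csv_path, [{'name': 't3_b', 'subreddit': 'demo', 'title': 'second', 'ups': 2}])

with open(csv_path, newline='', encoding='utf-8') as f:
    saved = list(csv.DictReader(f))
print(len(saved))
```

A fixed column list is needed here because an appended file cannot grow new columns later; the pandas approach in the loop above does not have that restriction.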

In [41]:
len(posts)

865

In [37]:
pd.DataFrame(posts).to_csv('boardgames.csv', index = False)