# Data Collection from Reddit (/r/Jokes)

_This notebook is dedicated to collecting data from the Reddit API. For this specific project, I have decided to compare the two subreddits, /r/Jokes and /r/AntiJokes. The motivation for choosing these two subreddits is twofold: one, to see if I can use machine learning classification models to distinguish if a post if from one of the two subreddits and two, to begin thinking about whether or not humor is something that can be reduced to an algorithm of word counting. I realize that the second goal is extremely lofty and is way beyond the scope of this project, but I hope to gain even a little bit of insight._ 

In [1]:
import requests
import time
import pandas as pd

In [20]:
headers = {'User-agent' : 'Roy Bot 0.6'}

In [None]:
# ['selftext'] returns content of the post
# ['subreddit'] returns the subreddit, which is the target
# ['title'] returns the title
# ['score'] returns upvotes - downvotes
# ['permalink'] returns the permanent link to get comments
# ['author'] returns the name of the author of the post
# ['num_comments'] returns the number of comments

In [4]:
posts = []
after = None

In [21]:
for i in range(40):
    print(i)
    if after == None:
        params = {}
    else: 
        params = {'after' : after}
    url = "https://www.reddit.com/r/jokes.json"
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        data = res.json()
        for j in range(25):
            entry = {}
            entry['selftext'] = data['data']['children'][j]['data']['selftext']
            entry['subreddit'] = data['data']['children'][j]['data']['subreddit']
            entry['title'] = data['data']['children'][j]['data']['title']
            entry['score'] = data['data']['children'][j]['data']['score']
            entry['permalink'] = data['data']['children'][j]['data']['permalink']
            entry['author'] = data['data']['children'][j]['data']['author']
            entry['num_comments'] = data['data']['children'][j]['data']['num_comments']
            posts.append(entry)
            after = data['data']['after']
    else: 
        print(res.status_code)
        break
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35


IndexError: list index out of range

In [22]:
len(posts)

5270

In [23]:
df = pd.DataFrame(posts)

_I decided to collect a little over 5000 posts for this project, for each subreddit. I'm collecting as much as possible because I think there will be overlap jokes, which means I have an even less amount of unique posts._

In [24]:
df.to_csv('./datasets/jokes.csv', index=False)

In [25]:
df.head()

Unnamed: 0,author,num_comments,permalink,score,selftext,subreddit,title
0,love_the_heat,270,/r/Jokes/comments/7c3dev/by_popular_demand_we_...,3412,**Guaranteed reposts.** \n\nhttps://discord.gg...,Jokes,"By popular demand, we now have a discord serve..."
1,Carljohnson09,219,/r/Jokes/comments/a629g5/husband_was_screwing_...,17201,Wife: (sobbing) You can't do this to me!\n\nHu...,Jokes,Husband was screwing his secretary up the ass ...
2,dandan_56,48,/r/Jokes/comments/a63k1m/why_does_batman_wear_...,1419,Batman doesn't want to get shot.\n\nWhy does R...,Jokes,Why does Batman wear Dark clothing?
3,JustKeepScrollingDad,203,/r/Jokes/comments/a605lp/a_man_is_in_court_the...,15281,"""Guilty"", said the man in the dock. At this po...",Jokes,"A man is in court. The Judges says,""on the 3rd..."
4,boced,350,/r/Jokes/comments/a5xqnf/a_poor_old_lady_was_f...,17905,"As she rummaged through her dusty belongings, ...",Jokes,A poor old lady was forced to sell her valuabl...


In [27]:
df.tail()

Unnamed: 0,author,num_comments,permalink,score,selftext,subreddit,title
5265,naqibam,1,/r/Jokes/comments/a5ivg0/there_were_four_peopl...,5,"A few hours into the flight, the pilot comes o...",Jokes,There were four people on an airplane. The pil...
5266,BoisonBerries,7,/r/Jokes/comments/a5fqr2/you_know_what_the_rea...,20,"One sees you later, the other sees you in a wh...",Jokes,You know what the REAL difference between an a...
5267,ThePotatoTheory,1,/r/Jokes/comments/a5gpbw/what_was_the_best_par...,12,The red flags never came as a surprise.,Jokes,What was the best part of dating in Soviet Rus...
5268,daveguitaruno,0,/r/Jokes/comments/a5karl/i_used_to_work_in_a_m...,2,"I asked for a pay rise, but management stuck t...",Jokes,I used to work in a messy glue and munitions f...
5269,Jesse0016,54,/r/Jokes/comments/a4edkb/why_does_killing_peop...,2632,It’s the only time I’m ever wanted,Jokes,Why does killing people in GTA make me happy?


_Using the Reddit API, I was able to secure about 5000+ posts from the /r/Jokes subreddit. This data, exported to a .csv file, will need to be cleaned and preprocessed in a later notebook._