# Data Gathering 

To efficiently gather data from both subreddits we can use the Pushshift Reddit API. 

We are interested in getting a reasonably large and similar number of posts from each subreddit. 
In addition, it would be ideal if the posts from both subreddits were from the same time period. This will reduce the impact of confounding variables such as change in writing conventions over time or differences in topics covered over time. 

Satisfying both conditions increases our confidence that any model will generalize beyond our sample to all posts on the subreddits. 

In [1]:
#imports
import pandas as pd 
import numpy as np 
import requests 
import datetime as dt 
import time

In [2]:
#set options so entire dataframe is viewable
pd.options.display.max_columns = 10000
pd.options.display.max_rows = 10000

## Initial Pull

Let's begin with a single request to see the general output format. 

In [3]:
# base url 
base_url = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
# params to add to url 
params = {
    'subreddit': 'TheOnion',
    'size': 10,
}

In [5]:
# GET request 
req = requests.get(base_url,params)
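
As a rough illustration of what `requests.get(base_url, params)` sends, the parameters are URL-encoded into a query string appended to the base URL. The sketch below reproduces that step with the standard library only, so it runs without network access; `build_url` is a hypothetical helper name, not part of `requests` or Pushshift.

```python
from urllib.parse import urlencode

def build_url(base_url, params):
    """Return base_url with params appended as a URL-encoded query string."""
    return f"{base_url}?{urlencode(params)}"

base_url = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit': 'TheOnion', 'size': 10}

url = build_url(base_url, params)
print(url)
```

The resulting string has the same shape as the URLs printed by the query function later in this notebook.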

In [6]:
# confirm request successful 
req.status_code

200

In [7]:
# save json data
data = req.json()

In [8]:
# extract the list of posts (one dictionary of data fields per post)
posts = data['data']

In [9]:
#confirm size of pull 
len(posts)

10

In [11]:
# convert to data frame 
df = pd.DataFrame(posts)

In [12]:
# take a look at first few entries 
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls
0,[],False,ymcameron,,[],,text,t2_efoqm,False,False,False,[],False,False,1628011451,theonion.com,https://www.reddit.com/r/TheOnion/comments/ox8...,{},ox89dr,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/TheOnion/comments/ox89dr/cult_leader_warns_...,False,link,"{'enabled': False, 'images': [{'id': 'FWxbvn04...",6,1628011462,1,,True,False,False,TheOnion,t5_2qhmj,162868,public,https://b.thumbs.redditmedia.com/7MlLSd1fLOYN9...,78,140,Cult Leader Warns Followers Things Need To Get...,0,[],1.0,https://www.theonion.com/cult-leader-warns-fol...,https://www.theonion.com/cult-leader-warns-fol...,all_ads,6
1,[],False,stabbyGamer,,[],,text,t2_2w0qd062,False,False,False,[],False,False,1628004523,theonion.com,https://www.reddit.com/r/TheOnion/comments/ox5...,{},ox5teh,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/TheOnion/comments/ox5teh/congress_advises_n...,False,link,"{'enabled': False, 'images': [{'id': 'SsPI2jNH...",6,1628004534,1,,True,False,False,TheOnion,t5_2qhmj,162869,public,https://b.thumbs.redditmedia.com/21lSX3gx4m9FH...,78,140,Congress Advises Newly Evicted Americans To Ju...,0,[],1.0,https://www.theonion.com/congress-advises-newl...,https://www.theonion.com/congress-advises-newl...,all_ads,6
2,[],False,mothershipq,,[],,text,t2_4negm,False,False,True,[],False,False,1627959893,theonion.com,https://www.reddit.com/r/TheOnion/comments/owu...,{},owuic3,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/TheOnion/comments/owuic3/woman_can_always_t...,False,link,"{'enabled': False, 'images': [{'id': 'dOw6F_C3...",6,1627959904,1,,True,False,False,TheOnion,t5_2qhmj,162871,public,https://b.thumbs.redditmedia.com/lrPghZtwtJH3M...,78,140,Woman Can Always Tell Period Coming By Way Doo...,0,[],1.0,https://www.theonion.com/woman-can-always-tell...,https://www.theonion.com/woman-can-always-tell...,all_ads,6
3,[],False,prisoner36,,[],,text,t2_bo221ur1,False,False,False,[],False,False,1627932938,theonion.com,https://www.reddit.com/r/TheOnion/comments/owm...,{},owm15x,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/TheOnion/comments/owm15x/kevin_durant_forms...,False,link,"{'enabled': False, 'images': [{'id': 'DxUAoHW7...",6,1627932949,1,,True,False,False,TheOnion,t5_2qhmj,162859,public,https://b.thumbs.redditmedia.com/SIT-Z0gmpj12B...,78,140,Kevin Durant forms burner country to compete i...,0,[],1.0,https://www.theonion.com/kevin-durant-forms-bu...,https://www.theonion.com/kevin-durant-forms-bu...,all_ads,6
4,[],False,LukeVenable,,[],,text,t2_kpqkr,False,False,False,[],False,False,1627928486,theonion.com,https://www.reddit.com/r/TheOnion/comments/owk...,{},owkh9i,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/TheOnion/comments/owkh9i/bp_launches_enviro...,False,link,"{'enabled': False, 'images': [{'id': 'DYKmh1EM...",6,1627928498,1,,True,False,False,TheOnion,t5_2qhmj,162860,public,https://b.thumbs.redditmedia.com/TeHJfS63eor0n...,78,140,BP Launches Environmental Campaign Pledging To...,0,[],1.0,https://www.theonion.com/bp-launches-environme...,https://www.theonion.com/bp-launches-environme...,all_ads,6


In [13]:
# look at all datafields provided from API 
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned',
       'post_hint', 'preview', 'pwls', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail',
   

For the purpose of our analysis we do not need most of the fields provided. The main piece of information we are after is the title of the post. We will also collect some metadata as it might be useful.
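
A minimal sketch of trimming each post down to the fields of interest, assuming each post is a plain dictionary as returned in `data['data']`. The `WANTED` list and the `keep_fields` helper are hypothetical names used only for illustration, and the toy post reuses a few values from the first row shown above.

```python
# Hypothetical subset of fields; the real response has many more keys
WANTED = ['title', 'created_utc', 'author', 'score']

def keep_fields(post, wanted=WANTED):
    """Return a copy of one post dict restricted to the wanted keys.
    Keys missing from the post are filled with None."""
    return {key: post.get(key) for key in wanted}

# Toy post with one field we do not need ('pinned')
sample_posts = [
    {'title': 'Cult Leader Warns Followers Things Need To Get...',
     'created_utc': 1628011451, 'author': 'ymcameron',
     'score': 1, 'pinned': False},
]

trimmed = [keep_fields(post) for post in sample_posts]
print(trimmed[0])
```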

The following function was taken from the pushshift demo lesson and modified to better match the needs of this analysis. 

## Function to automate multiple queries  

In [49]:
# function taken from pushshift_demo lesson and modified 
def query_pushshift(subreddit, kind = 'submission', day_window = 30, 
                    n = 5,size=100,start=0):
    '''returns dataframe of submissions matching given criteria
    subreddit: subreddit to query
    kind: 'submission' or 'comment', the type of content to query
    day_window: how many days to step back between successive queries
    n: the number of iterations, i.e. the total number of queries
    size: the number of posts to fetch per query. API capped at 100 posts per request
    start: how many days before today to start querying from'''
    
    # list of fields to save for each query
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 
                 'num_comments', 'score', 'is_self']
    
    # establish base url and stem
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint" 
    stem = f"{BASE_URL}?subreddit={subreddit}&size={size}" # pulls `size` posts per request (API cap is 100)
    
    # empty list to store posts from each iteration
    posts = []
    
    # for loop to iteratively query the API based on given parameters
    for i in range(start,start+(day_window*n),day_window): # sets timeframe of each query
        URL = "{}&after={}d".format(stem,i)       # url to pull from 
        print("Querying from: " + URL)
        response = requests.get(URL)           # get request 
        # the API sometimes fails when queried many times in a row,
        # so wait 5 seconds and retry once before giving up
        if response.status_code != 200:
            time.sleep(5)
            response = requests.get(URL)
        assert response.status_code == 200 # make sure pull successful
        mine = response.json()['data']  # get data from pull response 
        df = pd.DataFrame.from_dict(mine)  # convert to dataframe
        posts.append(df)                   # add to list of posts
        time.sleep(2)                       # time between each query 
    
    # combine all queries from for loop into one dataframe 
    full = pd.concat(posts, sort=False)
    
    # if submission
    if kind == "submission":
        # select desired columns and drop duplicates
        full = full[SUBFIELDS].drop_duplicates()

    # create column for the date the submission was created 
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    
    print("Query Complete!")    # confirm successfully completed 
    return full 
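
The `timestamp` column above is derived from `created_utc`, which holds Unix epoch seconds. `dt.date.fromtimestamp` interprets that value in the local time zone of the machine running the notebook, so dates near midnight can differ between machines. The sketch below shows one way to make the conversion explicit in UTC; `utc_date` is a hypothetical helper, not part of the original function.

```python
import datetime as dt

def utc_date(created_utc):
    """Convert Unix epoch seconds to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc).date()

# created_utc of the first post in the sample output shown earlier
print(utc_date(1628011451))  # 2021-08-03
```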

# r/theonion Data 
For r/theonion we will query posts every 30 days going back in time, so that our observations span a wider time frame. This helps avoid modeling a localized pattern that might have occurred only during a short period. 

In addition, according to a study of Google Trends, a typical news story stayed relevant for a median of 7 days in 2018 ([source](https://www.niemanlab.org/2019/01/a-typical-big-news-story-in-2018-lasted-about-7-days-until-we-moved-on-to-the-next-crisis/)). We have no reason to believe that this span has increased much since then, so we can assume that queries spaced 30 days apart will return submissions about different topics. Therefore, querying every 30 days increases the number of unique topics in our sample, which should help our model generalize rather than only predict the subreddit well for a few topics. 
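
To make the 30-day spacing concrete, the sketch below lists the day offsets that the loop in `query_pushshift` passes to the `after` parameter and maps an offset to a calendar date. The helper names and the run date are assumptions chosen for illustration only.

```python
import datetime as dt

def query_offsets(start=0, day_window=30, n=5):
    """Day offsets used for `after`, mirroring the loop in query_pushshift."""
    return list(range(start, start + day_window * n, day_window))

def offset_to_date(offset_days, today):
    """Calendar date that lies `offset_days` before `today`."""
    return today - dt.timedelta(days=offset_days)

# Assumed run date, for illustration only
today = dt.date(2021, 8, 4)

offsets = query_offsets(n=100)
print(offsets[:4], offsets[-1])           # [0, 30, 60, 90] 2970
print(offset_to_date(offsets[1], today))  # 2021-07-05
```

With 100 queries spaced 30 days apart, the offsets span just under 3000 days, or a little over 8 years.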

In [21]:
# starting from today, query r/theonion 100 times, getting up to 100 posts per query
# and stepping back 30 days for each successive query 
onion_df = query_pushshift('TheOnion',n=100)

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=0d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=90d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=150d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=180d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=210d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheOnion&size=100&after=240d
Querying from: https://api.pushshift.io/reddit/search/submission?subr

In [22]:
# check how many entries we got from onion subreddit 
onion_df.shape

(8486, 9)

In [24]:
# 5 oldest submissions in dataset 
onion_df['timestamp'].sort_values()[:5]

1    2013-10-03
0    2013-10-03
2    2013-10-05
3    2013-10-07
0    2013-11-06
Name: timestamp, dtype: object

In [26]:
# 5 most recent submissions in dataset 
onion_df['timestamp'].sort_values(ascending=False)[:5]

99    2021-08-02
98    2021-08-01
97    2021-07-30
96    2021-07-29
91    2021-07-28
Name: timestamp, dtype: object

- We have 8486 posts from r/theonion from 2013-2021. 

- The subreddit was started in 2008, so our sample contains observations from more than half of that time period. This gives us reason to believe that our sample should be fairly representative of the general content posted on r/theonion. 

In [98]:
# save submissions from r/theonion to csv file for further analysis later
onion_df.to_csv('./data/the_onion3.csv',index=False)

# r/nottheonion Data

To get a better idea of what to expect from r/nottheonion, let's begin with a single query before trying to automate the querying process. 

In [31]:
# one query from 30 days ago with 100 posts 
not_onion = query_pushshift('nottheonion',n=1,start=30)

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=100&after=30d
Query Complete!


In [32]:
# confirm size of query 
not_onion.shape

(100, 9)

In [33]:
# check range of dates included in initial pull 
not_onion['timestamp'].value_counts()

2021-07-05    100
Name: timestamp, dtype: int64

- All 100 posts pulled were from the same day. This shows that r/nottheonion gets a lot more posts per day than r/theonion. 

- This is not surprising considering r/nottheonion has over 19 million members while r/theonion has under 200,000 members at the time of this writing. 

- It is worth noting that submissions on r/theonion can only come from one publication, while submissions on r/nottheonion may come from any reputable publication as long as they satisfy the criterion of being "Oniony" enough. This likely also contributes to the difference in posting frequency between the two subreddits. 


In [36]:
# average posts per day in the Onion subreddit
np.mean(onion_df['timestamp'].value_counts()) 

5.6876675603217155

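r/theonion gets between 5 and 6 posts per day on the days represented in our sample. 
In an attempt to have a similar number of observations from both subreddits we can get 10 posts every 3 days from r/nottheonion, which matches the rate of up to 100 posts every 30 days requested for r/theonion. 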
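
The arithmetic behind this choice can be sketched with the standard library. `mean_posts_per_day` mirrors the `value_counts` average above (it averages only over days that appear in the data), and `planned_posts` gives the upper bound on posts collected under a sampling plan. Both helper names and the toy dates are hypothetical.

```python
from collections import Counter
import datetime as dt

def mean_posts_per_day(dates):
    """Average number of posts per distinct day present in `dates`."""
    counts = Counter(dates)
    return sum(counts.values()) / len(counts)

def planned_posts(total_days, day_window, size):
    """Upper bound on posts when fetching `size` posts every `day_window` days."""
    return (total_days // day_window) * size

# Toy dates, not real data: 3 posts on one day, 1 on another
toy = [dt.date(2021, 8, 1)] * 3 + [dt.date(2021, 8, 2)]
print(mean_posts_per_day(toy))        # 2.0

# Both plans cover 3000 days and share the same upper bound
print(planned_posts(3000, 30, 100))   # 10000
print(planned_posts(3000, 3, 10))     # 10000
```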

Because querying r/nottheonion in a single attempt consistently produced errors, submissions from that subreddit were gathered in pieces. Each subsequent query was adjusted to start from the last date covered by the previous one. Finally, all the separate queries were combined into a single dataframe with all the posts from that subreddit. 
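
A hedged sketch of the piecewise approach: split the full time span into chunks and retry a failed request after a pause. The `fetch` argument is a hypothetical stand-in for whatever function performs the request, so the example runs without network access; it is not the code used to gather the data in this notebook.

```python
import time

def fetch_with_retry(fetch, url, retries=1, wait=5, sleep=time.sleep):
    """Call fetch(url); on an exception, wait and try again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the original error
            sleep(wait)

def chunk_starts(total_days, chunk_days):
    """Start offsets (in days) for pulling a long time span in separate pieces."""
    return list(range(0, total_days, chunk_days))

print(chunk_starts(3000, 300))
# [0, 300, 600, 900, 1200, 1500, 1800, 2100, 2400, 2700]
```

Each start offset corresponds to one value of the `start` argument passed to `query_pushshift` in the cells that follow.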

In [50]:
# initial pull; list dates in ascending order (all but the 5 most recent) 
not_the_onion1= query_pushshift('nottheonion',day_window=3,n=100,start=0,size=10)
not_the_onion1['timestamp'].sort_values()[:-5]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=0d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=3d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=6d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=9d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=12d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=15d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=18d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=21d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=24d
Querying from: https://api.pushshift.io/reddit/search/submi

9    2020-10-11
0    2020-10-11
1    2020-10-11
2    2020-10-11
3    2020-10-11
8    2020-10-11
5    2020-10-11
7    2020-10-11
4    2020-10-11
6    2020-10-11
9    2020-10-14
8    2020-10-14
7    2020-10-14
6    2020-10-14
5    2020-10-14
4    2020-10-14
3    2020-10-14
2    2020-10-14
1    2020-10-14
0    2020-10-14
0    2020-10-17
1    2020-10-17
2    2020-10-17
3    2020-10-17
7    2020-10-17
5    2020-10-17
6    2020-10-17
8    2020-10-17
9    2020-10-17
4    2020-10-17
0    2020-10-20
1    2020-10-20
2    2020-10-20
4    2020-10-20
3    2020-10-20
6    2020-10-20
7    2020-10-20
8    2020-10-20
9    2020-10-20
5    2020-10-20
1    2020-10-23
2    2020-10-23
3    2020-10-23
4    2020-10-23
0    2020-10-23
6    2020-10-23
7    2020-10-23
8    2020-10-23
9    2020-10-23
5    2020-10-23
0    2020-10-26
1    2020-10-26
2    2020-10-26
3    2020-10-26
5    2020-10-26
6    2020-10-26
7    2020-10-26
8    2020-10-26
9    2020-10-26
4    2020-10-26
0    2020-10-29
1    2020-10-29
3    202

In [51]:
# check how many entries pulled from first query 
not_the_onion1.shape

(970, 9)

In [53]:
# second pull request 
not_the_onion2= query_pushshift('nottheonion',day_window=3,n=100,start=300,size=10)
not_the_onion2['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=300d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=303d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=306d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=309d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=312d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=315d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=318d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=321d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=324d
Querying from: https://api.pushshift.io/reddit

4    2020-10-08
3    2020-10-08
2    2020-10-08
1    2020-10-08
0    2020-10-08
Name: timestamp, dtype: object

In [54]:
# check number of posts in second pull request
not_the_onion2.shape

(1000, 9)

In [56]:
# 3rd pull request 
not_the_onion3= query_pushshift('nottheonion',day_window=3,n=100,start=600,size=10)
not_the_onion3['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=600d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=603d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=606d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=609d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=612d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=615d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=618d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=621d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=624d
Querying from: https://api.pushshift.io/reddit

4    2019-12-13
3    2019-12-13
2    2019-12-13
1    2019-12-13
0    2019-12-13
Name: timestamp, dtype: object

In [57]:
# number of posts in 3rd pull request
not_the_onion3.shape

(1000, 9)

In [59]:
# 4th pull request
not_the_onion4= query_pushshift('nottheonion',day_window=3,n=100,start=900,size=10)
not_the_onion4['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=900d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=903d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=906d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=909d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=912d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=915d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=918d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=921d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=924d
Querying from: https://api.pushshift.io/reddit

4    2019-02-16
3    2019-02-16
2    2019-02-16
1    2019-02-16
0    2019-02-16
Name: timestamp, dtype: object

In [60]:
# number of posts in 4th pull request 
not_the_onion4.shape

(995, 9)

In [62]:
# 5th pull request
not_the_onion5= query_pushshift('nottheonion',day_window=3,n=100,start=1200,size=10)
not_the_onion5['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1200d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1203d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1206d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1209d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1212d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1215d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1218d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1221d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1224d
Querying from: https://api.pushshift.

4    2018-04-22
3    2018-04-22
2    2018-04-22
1    2018-04-22
0    2018-04-22
Name: timestamp, dtype: object

In [63]:
# number of posts in 5th pull request
not_the_onion5.shape

(1000, 9)

In [69]:
# 6th pull request
not_the_onion6= query_pushshift('nottheonion',day_window=3,n=100,start=1500,size=10)
not_the_onion6['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1500d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1503d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1506d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1509d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1512d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1515d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1518d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1521d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1524d
Querying from: https://api.pushshift.

4    2017-06-26
3    2017-06-26
2    2017-06-26
1    2017-06-26
0    2017-06-26
Name: timestamp, dtype: object

In [70]:
# number of posts from 6th pull request 
not_the_onion6.shape

(1000, 9)

In [72]:
# 7th pull request 
not_the_onion7= query_pushshift('nottheonion',day_window=3,n=100,start=1800,size=10)
not_the_onion7['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1800d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1803d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1806d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1809d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1812d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1815d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1818d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1821d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=1824d
Querying from: https://api.pushshift.

4    2016-08-30
3    2016-08-30
2    2016-08-30
1    2016-08-30
0    2016-08-30
Name: timestamp, dtype: object

In [73]:
# number of posts from 7th pull request 
not_the_onion7.shape

(1000, 9)

In [75]:
# 8th pull request
not_the_onion8= query_pushshift('nottheonion',day_window=3,n=100,start=2100,size=10)
not_the_onion8['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2100d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2103d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2106d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2109d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2112d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2115d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2118d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2121d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2124d
Querying from: https://api.pushshift.

4    2015-11-04
3    2015-11-04
2    2015-11-04
1    2015-11-04
0    2015-11-04
Name: timestamp, dtype: object

In [76]:
# number of posts in 8th pull request 
not_the_onion8.shape

(1000, 9)

In [79]:
# 9th pull request 
not_the_onion9= query_pushshift('nottheonion',day_window=3,n=100,start=2400,size=10)
not_the_onion9['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2400d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2403d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2406d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2409d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2412d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2415d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2418d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2421d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2424d
Querying from: https://api.pushshift.

4    2015-01-08
3    2015-01-08
2    2015-01-08
1    2015-01-08
0    2015-01-08
Name: timestamp, dtype: object

In [80]:
# number of posts in 9th pull request
not_the_onion9.shape

(1000, 9)

In [82]:
# 10th pull request
not_the_onion10= query_pushshift('nottheonion',day_window=3,n=100,start=2700,size=10)
not_the_onion10['timestamp'].sort_values()[-5:]

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2700d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2703d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2706d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2709d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2712d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2715d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2718d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2721d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=nottheonion&size=10&after=2724d
Querying from: https://api.pushshift.

4    2014-03-14
3    2014-03-14
2    2014-03-14
1    2014-03-14
0    2014-03-14
Name: timestamp, dtype: object

In [83]:
# number of posts from 10th pull request
not_the_onion10.shape

(1000, 9)

In [89]:
# total number of posts from r/nottheonion
sum(len(pull) for pull in [not_the_onion1, not_the_onion2, not_the_onion3, not_the_onion4,
                           not_the_onion5, not_the_onion6, not_the_onion7, not_the_onion8,
                           not_the_onion9, not_the_onion10])

9965

In [93]:
# date of earliest post in our sample of submissions from r/theonion
print(onion_df['timestamp'].sort_values()[:1])
print(len(onion_df))

1    2013-10-03
Name: timestamp, dtype: object
8486


- We have slightly more posts from r/nottheonion, but this reflects the fact that r/nottheonion gets a lot more submissions than r/theonion overall. Therefore this is not cause for concern. 
    
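- The latest date in the oldest r/nottheonion pull (2014-03-14) is about 5 months after the earliest post obtained from r/theonion (2013-10-03). Note that `sort_values()[-5:]` shows the most recent dates in each pull, so the earliest r/nottheonion post may be older than this.
    - This is not a perfect overlap but fairly close considering the entire range of dates is about 8 years. 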

- Thus we have achieved a balance between getting posts from the same time periods and getting a similar number of posts from both subreddits. 
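
The two balance checks discussed above can be sketched as follows. The r/theonion date range and both post counts are taken from the outputs in this notebook; the second date range is a hypothetical placeholder rather than a measured value, and `date_overlap` is an illustrative helper.

```python
import datetime as dt

def date_overlap(range_a, range_b):
    """Return the (start, end) dates shared by two (earliest, latest) ranges, or None."""
    start = max(range_a[0], range_b[0])
    end = min(range_a[1], range_b[1])
    return (start, end) if start <= end else None

# Range observed for r/theonion in the outputs above
onion_range = (dt.date(2013, 10, 3), dt.date(2021, 8, 2))
# Hypothetical placeholder range for the other subreddit (assumption)
other_range = (dt.date(2014, 1, 1), dt.date(2021, 8, 1))

print(date_overlap(onion_range, other_range))

# Ratio of sample sizes reported above: r/nottheonion vs r/theonion
ratio = 9965 / 8486
print(round(ratio, 2))  # 1.17
```

In the notebook itself, the actual ranges could be obtained from the `timestamp` columns of `onion_df` and `nto_df` with `.min()` and `.max()`.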

In [94]:
# combine pulls from r/nottheonion to make one dataframe
nto_df= pd.concat([not_the_onion1,not_the_onion2,not_the_onion3,not_the_onion4,not_the_onion5,
                   not_the_onion6,not_the_onion7,not_the_onion8,not_the_onion9,not_the_onion10],sort=False)

In [95]:
# check number of posts from r/nottheonion
nto_df.shape

(9965, 9)

In [97]:
# save gathered data as csv 
nto_df.to_csv('./data/not_the_onion.csv',index=False)

# Sampling Considerations 

- The samples for both subreddits contain posts from a majority of the time the subreddits have existed. 

- The observations from both subreddits are from the same time period for the most part. 

- The data was gathered by starting at the current date and stepping back a set amount of time before each new query. 
    - Apart from picking the time frame to get submissions from, we did not have control over picking specific types of posts. 
    - The start date for collecting the data was more or less arbitrary (not planned to get specific results). 


- Therefore we have reason to believe that the submissions we gathered approximate a random subset of all the posts from each subreddit, so we can draw conclusions from our sample with reasonable confidence.  
    