### Objective: Build a scraper that collects at least 1,000 posts from each of 2 different subreddits. This data will be the basis of a classification model that predicts which subreddit an unseen post belongs to.

In [21]:
import pandas as pd
import numpy as np
import requests


Scraping directly from Reddit's API was unsuccessful because it caps results at 1,000 unique posts. The scraper returned only about 800 unique posts from each subreddit despite retrieving 2,500 posts in total. An alternative is the Pushshift API. Pushshift is a website that maintains and tracks social media data; it keeps a database of all Reddit posts and does not impose a total cap on retrievable posts. A link to its API documentation can be found here:
https://pushshift.io/api-parameters/
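
To make the shape of a Pushshift query concrete, here is a minimal sketch that assembles a submission-search URL without sending it. The endpoint and parameter names are the ones used by the scraping function in this notebook; the helper `build_query_url` is hypothetical and exists only for illustration.

```python
from urllib.parse import urlencode

# Endpoint as used by the scraping function in this notebook
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_query_url(subreddit, after, before, limit=500, **extra):
    """Return a Pushshift submission-search URL (hypothetical helper)."""
    query = {"subreddit": subreddit, "after": after, "before": before, "limit": limit}
    query.update(extra)
    # urlencode percent-escapes characters such as ">" in the score filter
    return BASE_URL + "?" + urlencode(query)

url = build_query_url("worldnews", 1514764800, 1559347200,
                      sort_type="score", sort="desc", score=">100")
print(url)
```

Building the query from a dictionary keeps each parameter in one place and avoids hand-written `&` and `=` separators.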

In [22]:
p = {"sort_type":['score','score','created_utc'],
     "score":[">100",">100",">100"],
     "sort":["desc","asc","desc"],
    }

def reddit_scrape(subreddit,params=p,before=1559347200,after=1514764800):
    '''
    Scrape Reddit posts from the Pushshift database that match the criteria given in the params dictionary.

    Before default date - 1 Jun 2019
    After default date - 1 Jan 2018

    We will use this date range to request 1,500 posts from each subreddit for our training dataset (one request of up to
    500 posts per entry in the params lists). We will test on posts from 1 Jun 2019 onwards. The 1,500 requested posts may
    contain duplicates; the function drops duplicate titles before returning.

    params must be a dict. Keys must be Pushshift API parameters and values must be lists of equal length.

    Returns a dataframe with columns title, subreddit and score.
    '''
    dic = {'title':[],
           'subreddit':[],
            'score':[]}
    p = params
    keys = list(p)
    # Every parameter list must have the same length: one entry per request
    its = len(p[keys[0]])
    assert all(len(p[k]) == its for k in keys)
    url = "https://api.pushshift.io/reddit/search/submission/?after={}&before={}&limit=500".format(after,before)
    url +="&subreddit={}".format(subreddit)
    for i in range(its):
        # Append the i-th value of every parameter to the base URL
        temp_url = url + "".join("&{}={}".format(k, p[k][i]) for k in keys)
        h = requests.get(temp_url,headers={'User-agent': "Agent1"}).json()
        for d in h['data']:
            dic['title'].append(d['title'])
            dic['subreddit'].append(d['subreddit'])
            dic['score'].append(d['score'])
    df = pd.DataFrame(dic)
    df=df.drop_duplicates('title')
    return df
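
The three requests (highest score, lowest score above the threshold, most recent) can return the same post more than once, which is why the function drops duplicate titles. A minimal sketch of that step on made-up rows, assuming only pandas:

```python
import pandas as pd

# Made-up rows standing in for the combined results of several requests
rows = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "score": [300, 250, 300, 120],
})

# Keep the first occurrence of each title, as the scraping function does
unique_rows = rows.drop_duplicates("title")
print(len(rows), len(unique_rows))
```

Deduplicating on `title` alone also merges distinct posts that happen to share a title, so the number of rows returned can be lower than the number of unique posts.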


In [23]:
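# wn is the worldnews subreddit, til is the todayilearned subreddit
# Training data uses the default date range (1 Jan 2018 to 1 Jun 2019)
wn_train = reddit_scrape("worldnews",p)
til_train = reddit_scrape("todayilearned",p)
# Test data covers posts from 1 Jun 2019 up to timestamp 1575361406 (3 Dec 2019 UTC)
wn_test = reddit_scrape("worldnews",p,after=1559347200,before=1575361406)
til_test = reddit_scrape("todayilearned",p,after=1559347200,before=1575361406)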
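
The `after` and `before` arguments are Unix timestamps in seconds. As a sketch of how the boundary values can be derived from calendar dates, the hypothetical helper `to_epoch` below converts a UTC date to a timestamp using only the standard library.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Convert a UTC calendar date to a Unix timestamp in seconds (hypothetical helper)."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

after = to_epoch(2018, 1, 1)   # start of the training range
before = to_epoch(2019, 6, 1)  # end of training range, start of test range
print(after, before)  # 1514764800 1559347200
```

Setting `tzinfo=timezone.utc` explicitly matters: a naive `datetime` would be interpreted in the local time zone and shift the boundaries by several hours.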

In [24]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train = pd.concat([wn_train, til_train])
train['subreddit'] = train['subreddit'].map({'worldnews':1,'todayilearned':0})

In [25]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
test = pd.concat([wn_test, til_test])
test['subreddit'] = test['subreddit'].map({'worldnews':1,'todayilearned':0})
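
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or differently capitalised subreddit name would silently produce missing labels. The sketch below runs the same encoding on made-up rows and counts unmapped values; the frame names and the `label` column are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the scraped frames
wn = pd.DataFrame({"title": ["a", "b"], "subreddit": ["worldnews", "worldnews"]})
til = pd.DataFrame({"title": ["c"], "subreddit": ["todayilearned"]})

label_map = {"worldnews": 1, "todayilearned": 0}
combined = pd.concat([wn, til], ignore_index=True)
combined["label"] = combined["subreddit"].map(label_map)

# Any value missing from label_map would show up here as NaN
unmapped = int(combined["label"].isna().sum())
print(unmapped)
```

A non-zero count would indicate that the mapping dictionary does not cover every subreddit name in the data.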

In [26]:
train.head()

|  | title | subreddit | score |
|---|---|---|---|
| 0 | Two weeks before his inauguration, Donald J. T... | 1 | 188216 |
| 1 | Mozilla launches 'Facebook Container' extensio... | 1 | 138669 |
| 2 | Italy bans unvaccinated children from school | 1 | 123971 |
| 3 | Bill and Melinda Gates sue company that was gr... | 1 | 123027 |
| 4 | 'We Don't Know a Planet Like This': CO2 Levels... | 1 | 121007 |


Statistics of our train & test datasets

In [27]:
print("Length of training data: {}".format(len(train)))
print("Length of test data: {}".format(len(test)))
print("No. of Worldnews subreddit in training data: {}".format(len(train.loc[train['subreddit']==1])))
print("No. of todayilearned subreddit in training data: {}".format(len(train.loc[train['subreddit']==0])))
print("No. of Worldnews subreddit in test data: {}".format(len(test.loc[test['subreddit']==1])))
print("No. of todayilearned subreddit in test data: {}".format(len(test.loc[test['subreddit']==0])))

Length of training data: 2954
Length of test data: 2873
No. of Worldnews subreddit in training data: 1470
No. of todayilearned subreddit in training data: 1484
No. of Worldnews subreddit in test data: 1425
No. of todayilearned subreddit in test data: 1448
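
The counts above suggest the two classes are close to balanced. A short sketch that turns the reported counts into class shares (the dictionary simply restates the printed numbers):

```python
# Counts copied from the printed output above
reported = {
    "train": {"worldnews": 1470, "todayilearned": 1484},
    "test": {"worldnews": 1425, "todayilearned": 1448},
}

shares = {}
for split, counts in reported.items():
    total = sum(counts.values())
    # Fraction of each split that belongs to each subreddit
    shares[split] = {name: n / total for name, n in counts.items()}
    print(split, total, {name: round(s, 3) for name, s in shares[split].items()})
```

Both splits are within one percentage point of an even split, so a classifier that always predicts one subreddit would score about 50% accuracy.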


In [28]:
train.drop('score',axis=1,inplace=True)
test.drop('score',axis=1,inplace=True)

In [29]:
train.to_csv("train.csv",index=False)
test.to_csv("test.csv",index=False)

In [30]:
til_train.head()

|  | title | subreddit | score |
|---|---|---|---|
| 0 | TIL that in 1916 there was a proposed Amendmen... | todayilearned | 148135 |
| 1 | TIL After Col. Shaw died in battle, Confederat... | todayilearned | 137547 |
| 2 | TIL of Dr. Donald Hopkins. He helped eradicate... | todayilearned | 134330 |
| 3 | TIL A Japanese company has awarded its non-smo... | todayilearned | 131958 |
| 4 | TIL Madonna leaked a fake version of her album... | todayilearned | 124754 |
