# Data Collection

### I used `pushshift` for data collection because it returns up to 1000 articles in a single API call and needs no credentials to set up. Using it, about 4 lakh (~400,000) articles were scraped, covering Jan. 10, 2018 to April 10, 2020. Note that `pushshift` does not offer `upvotes` and `downvotes` in its API. It only gives `score` (upvotes minus downvotes).

In [1]:
#Imports

import pandas as pd
import requests
import json
import csv
import time
import datetime

### The `getPushshiftData` function builds the request URL and fetches up to 1000 submissions between the given timestamps

In [2]:
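def getPushshiftData(after, before, sub):
    # Build the search URL for one page (up to 1000 submissions) in the time window
    url = 'https://api.pushshift.io/reddit/search/submission/?size=1000&after=' + \
        str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    print(url)
    r = requests.get(url)
    r.raise_for_status()  # fail loudly on an HTTP error instead of parsing an error page
    data = r.json()
    return data['data']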
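
The URL in `getPushshiftData` is assembled by string concatenation. As a minimal sketch, the same query string could be built with the standard library's `urlencode`, which also escapes special characters. The helper name `build_pushshift_url` is hypothetical and not part of the original notebook; the endpoint and parameters are the ones used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission/'

def build_pushshift_url(after, before, sub, size=1000):
    # Hypothetical helper: same endpoint and parameters as getPushshiftData,
    # but the query string is produced by urlencode instead of manual concatenation
    params = {'size': size, 'after': after, 'before': before, 'subreddit': sub}
    return BASE_URL + '?' + urlencode(params)

print(build_pushshift_url(1515568144, 1586524763, 'india'))
```

With `requests`, a similar effect is available by passing the dictionary through the `params` argument of `requests.get`.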


### The `writeSubData` function writes the required details of each scraped submission to a CSV file. I scraped 16 fields per article, even though only 'flair', 'title' and 'selftext' will be used for classification. This was done with the EDA step in mind: I believe the other fields can give good insights into the data. For example, we can do a virality analysis of the subreddit

In [3]:
def writeSubData(subm):
    #print(subm)
    subData = []  # list to store data points
    title = subm['title']
    url = subm['url']
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"
    author = subm['author']
    stickied = subm['stickied']
    pinned = subm['pinned']
    over_18 = subm['over_18']
    try:
        selftext = subm['selftext']
    except KeyError:
        selftext = "NaN"  # same placeholder as the missing-flair case
    
    spoiler = subm['spoiler']
    sub_id = subm['id']
    score = subm['score']
    num_crossposts = subm['num_crossposts']
    is_video = subm['is_video']
    # created_utc is epoch seconds (e.g. 1520561700.0); fromtimestamp converts it to local time
    created = datetime.datetime.fromtimestamp(subm['created_utc'])
    numComms = subm['num_comments']
    permalink = subm['permalink']

    #new_line= (sub_id,title,url,author,score,created,numComms,permalink,flair)

    with open('data_final.csv', 'a+', newline='') as csvfile:
        fieldnames = ['sub_id', 'title', 'url', 'author', 'score', 'created', 'numComms', 'permalink', 'flair',
                      'stickied', 'pinned', 'over_18', 'selftext', 'spoiler', 'num_crossposts', 'is_video']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        # Write the header only once, while the file is still empty;
        # otherwise a header line would be repeated before every row
        if csvfile.tell() == 0:
            writer.writeheader()
        writer.writerow({'sub_id': sub_id, 'title': title, 'url': url, 'author': author, 'score': score, 'created': created, 'numComms': numComms, 'permalink': permalink, 'flair': flair, 'stickied': stickied,
                         'pinned': pinned,'over_18': over_18, 'selftext': selftext, 'spoiler': spoiler, 'num_crossposts': num_crossposts, 'is_video': is_video})
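
Not every submission carries every field, which is why `link_flair_text` and `selftext` are wrapped in `try`/`except` above. The sketch below shows one alternative, assuming a missing key should fall back to the string "NaN": `dict.get` with a default. The function `extract_fields` and the sample submission are made up for illustration, and the timestamp is converted as timezone-aware UTC here, which differs from the local-time conversion in `writeSubData`.

```python
import datetime

def extract_fields(subm):
    # Hypothetical helper: pull a few fields out of one submission dict,
    # falling back to "NaN" when an optional key is absent
    return {
        'sub_id': subm['id'],
        'title': subm['title'],
        'flair': subm.get('link_flair_text', 'NaN'),
        'selftext': subm.get('selftext', 'NaN'),
        # created_utc is epoch seconds; make the UTC interpretation explicit
        'created': datetime.datetime.fromtimestamp(
            subm['created_utc'], tz=datetime.timezone.utc),
    }

# Made-up sample record with no flair and no selftext
sample = {'id': 'abc123', 'title': 'Example post', 'created_utc': 1520561700.0}
row = extract_fields(sample)
print(row['flair'], row['created'].isoformat())
```

A timezone-aware value keeps the `created` column independent of the machine the scraper runs on.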


### This code block starts the scraping and writes the results to the CSV file. I will not run it here, as I have already collected the data at this point

In [None]:
begin_time = datetime.datetime.now()
sub = 'india'
before = "1586524763"  # April 10, 2020 1:19:23 PM (UTC)
after = "1515568144"  # Wednesday, January 10, 2018 7:09:04 AM (UTC)
subCount = 0
subStats = {}

data = getPushshiftData(after, before, sub)

while len(data) > 0:
    for submission in data:
        #print (submission)
        writeSubData(submission)
        subCount += 1

    # print(len(data))
    # print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
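    # Advance the window: continue from the timestamp of the last submission fetched
    after = data[-1]['created_utc']
    time.sleep(1)  # brief pause between requests to avoid hammering the API
    data = getPushshiftData(after, before, sub)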

print('finished scraping, total time taken:')

print(datetime.datetime.now() - begin_time)
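
The loop above pages through the time window by reusing the `created_utc` of the last submission as the next `after` value. A minimal, offline sketch of that cursor logic follows. `paginate` and `fake_fetch` are hypothetical names, and the fake fetcher only stands in for `getPushshiftData` so the loop can be run without network access. It assumes, as the loop above does, that each page comes back sorted by ascending `created_utc`.

```python
def paginate(fetch, after, before, sub):
    # Hypothetical generator: keep requesting pages until an empty one comes back,
    # advancing `after` to the timestamp of the last item seen
    data = fetch(after, before, sub)
    while len(data) > 0:
        for submission in data:
            yield submission
        after = data[-1]['created_utc']
        data = fetch(after, before, sub)

def fake_fetch(after, before, sub, page_size=2):
    # Stand-in for getPushshiftData: serves made-up timestamps in pages of two
    stamps = [10, 20, 30, 40, 50]
    hits = [{'created_utc': t} for t in stamps if after < t < before]
    return hits[:page_size]

collected = [s['created_utc'] for s in paginate(fake_fetch, 0, 45, 'india')]
print(collected)
```

The fake fetcher treats `after` as exclusive. Whether the real API does the same is not verified here, so submissions that share the exact timestamp of a page's last item may be worth checking for gaps or duplicates.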