# Gather Data
Subreddits: Oceans | Diving  
Using the Pushshift API for Reddit

**Purpose**  
Here I gather three datasets per subreddit. Each one reaches further back in time, and is therefore larger, than the one before it, so that I can quantify the effect of training-set size on the model.

**Important**  
The data is unbalanced in favor of the oceans subreddit. I mitigate this with a stratified split when training the model. I group posts by date range rather than forcing equal post counts, so that both subreddits cover the same period and the model cannot learn time references that appear in only one subreddit.
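
To illustrate what a stratified split does with unbalanced classes, here is a minimal plain-Python sketch. It is not the code used to train the model; libraries such as scikit-learn offer the same behavior through the `stratify` argument of `train_test_split`. The function name `stratified_split` and the 80/20 label counts are invented for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_idx, test_idx) keeping each label's share in both sets."""
    # Group row indices by their label
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every label for the test set
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Invented, unbalanced labels: 80 oceans posts and 20 diving posts
labels = ['oceans'] * 80 + ['diving'] * 20
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
```

Both the training and test sets keep the original 4:1 ratio between the two subreddits, so the minority class is not under-represented in either one.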

In [41]:
import requests
import datetime
from datetime import timezone
import pandas as pd
import os.path

### Functions

In [42]:
def gather_posts(subreddit, date):
    '''Accumulate all posts since given date.'''
    posts = []
    oldest_time = utc_timestamp

    while oldest_time > date:

        params = {'subreddit': subreddit, 'size': 100, 'before': oldest_time}
        try:
            some_posts = requests.get(url, params=params).json()['data']
            oldest_time = some_posts[-1]['created_utc']
            posts.append(some_posts)
        except (requests.RequestException, ValueError, KeyError, IndexError):
            # Request failed or returned no posts: move the cursor back 2 weeks and retry
            oldest_time -= 1209600
    
    return posts
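
The loop in `gather_posts` pages backwards through time by passing the oldest timestamp seen so far as the `before` parameter. The sketch below shows that cursor pattern in isolation. It swaps the HTTP call for a hypothetical `fetch_page` stand-in (here `fake_fetch`, backed by invented data), assumes `before` is exclusive, and differs from the function above in two ways: it drops posts older than the cutoff from the final page, and it stops on an empty page instead of stepping back two weeks.

```python
def paginate_before(fetch_page, newest_time, cutoff, page_size=100):
    """Collect posts with cutoff <= created_utc < newest_time, newest first.

    fetch_page(before, size) is a hypothetical stand-in for the API call. It
    must return up to `size` posts older than `before`, newest first.
    """
    posts = []
    before = newest_time
    while before > cutoff:
        page = fetch_page(before, page_size)
        if not page:
            # Nothing older is available, so stop paging
            break
        # Keep only posts at or after the cutoff
        posts.extend(p for p in page if p['created_utc'] >= cutoff)
        # Move the cursor to the oldest post on this page
        before = page[-1]['created_utc']
    return posts


# Invented data: one post every 10 seconds, timestamps 990, 980, ..., 0
FAKE_POSTS = [{'created_utc': t} for t in range(990, -1, -10)]

def fake_fetch(before, size):
    """Return up to `size` fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]

result = paginate_before(fake_fetch, newest_time=1000, cutoff=500, page_size=20)
```

Because each page may reach past the cutoff, filtering on `created_utc` keeps the three dataset sizes aligned with their intended date ranges.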

In [43]:
def build_df(gather_posts_results):
    '''Combines posts into a single list and creates a DataFrame'''
    together = [post for group in gather_posts_results for post in group]
    df = pd.DataFrame(together).loc[:, ['subreddit', 'selftext', 'title']]
    return df
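
`build_df` flattens the list of pages and then selects three columns. If a field such as `selftext` were absent from every gathered post, that column selection would fail. The following plain-Python sketch shows the same flatten-and-select step in a form that tolerates a missing field. The helper name `flatten_and_select` and the sample posts are invented for illustration.

```python
from itertools import chain

COLUMNS = ['subreddit', 'selftext', 'title']

def flatten_and_select(pages, columns=COLUMNS):
    """Flatten a list of pages into one list of rows holding only `columns`."""
    rows = []
    for post in chain.from_iterable(pages):
        # dict.get returns None when a post lacks a field
        rows.append({col: post.get(col) for col in columns})
    return rows

# Invented sample: two pages with one post each
pages = [
    [{'subreddit': 'oceans', 'title': 'Kelp forests', 'selftext': '', 'score': 3}],
    [{'subreddit': 'diving', 'title': 'First night dive'}],  # no selftext key
]
rows = flatten_and_select(pages)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`.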

In [44]:
# Credit for file check method here https://linuxize.com/post/python-check-if-file-exists/

def save_if_no_file_exists(df, filepath):
    if os.path.isfile(filepath):
        print(f'Data already gathered at {filepath}')
    else:
        df.to_csv(filepath, index=False)
        print(f'Data saved to {filepath}')

In [45]:
def file_exists(filepath):
    '''Return True if filepath points to an existing file.'''
    return os.path.isfile(filepath)

In [46]:
def gather_and_save_data(subreddit, oldest_date, filepath):
    '''Gather all posts since oldest_date and save them to filepath, unless that file already exists.'''
    if file_exists(filepath):
        print(f'Data already gathered at {filepath}')
        return None
    posts = gather_posts(subreddit, oldest_date)
    posts_df = build_df(posts)
    save_if_no_file_exists(posts_df, filepath)

### Set constants

In [47]:
# API endpoint
url = 'https://api.pushshift.io/reddit/search/submission/'

In [48]:
# UTC timestamps for 00:00:00 UTC on each date
jan_1_2020 = 1577836800
jan_1_2019 = 1546300800
jan_1_2018 = 1514764800

# Current time stamp
# Use a timezone-aware datetime; relabeling a naive local time as UTC shifts the timestamp
utc_time = datetime.datetime.now(timezone.utc)
utc_timestamp = int(utc_time.timestamp())

Credit: https://www.geeksforgeeks.org/get-utc-timestamp-in-python/
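
The hard-coded constants can be checked, or new ones derived, with the standard library instead of an online converter. This is a small sketch; the helper `utc_midnight_timestamp` and the `*_check` names are introduced only for this example.

```python
import datetime
from datetime import timezone

def utc_midnight_timestamp(year, month, day):
    """Unix timestamp for 00:00:00 UTC on the given date."""
    moment = datetime.datetime(year, month, day, tzinfo=timezone.utc)
    return int(moment.timestamp())

jan_1_2019_check = utc_midnight_timestamp(2019, 1, 1)  # 1546300800
jan_1_2018_check = utc_midnight_timestamp(2018, 1, 1)  # 1514764800

# Current time as a Unix timestamp, taken from a timezone-aware datetime
now_timestamp = int(datetime.datetime.now(timezone.utc).timestamp())
```

Building the datetime with `tzinfo=timezone.utc` keeps the result independent of the time zone of the machine running the notebook.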

**Oceans subreddit**

In [49]:
gather_and_save_data('oceans', jan_1_2020, '../data/raw/oceans.csv')
gather_and_save_data('oceans', jan_1_2019, '../data/raw/oceans-medium.csv')
gather_and_save_data('oceans', jan_1_2018, '../data/raw/oceans-large.csv')

Data already gathered at ../data/raw/oceans.csv
Data already gathered at ../data/raw/oceans-medium.csv
Data already gathered at ../data/raw/oceans-large.csv


**Diving subreddit**

In [51]:
gather_and_save_data('diving', jan_1_2020, '../data/raw/diving.csv')
gather_and_save_data('diving', jan_1_2019, '../data/raw/diving-medium.csv')
gather_and_save_data('diving', jan_1_2018, '../data/raw/diving-large.csv')

Data already gathered at ../data/raw/diving.csv
Data already gathered at ../data/raw/diving-medium.csv
Data already gathered at ../data/raw/diving-large.csv
