<span style="font-family:Trebuchet MS; font-size:2em;">Project 3 | NB1: Data Collection</span>

Riley Robertson | Reddit Classification Project | Market Research: Sports Fans in the U.S. and England

## Imports

**Primary Imports**

In [1]:
import requests
import pandas as pd
import datetime as dt
import time

**Alert Code for main API Request function**

I found a Python module online that notifies me via Slack when my functions finish running.

Source: https://github.com/huggingface/knockknock#slack

In [2]:
import knockknock
kk_url = "https://hooks.slack.com/services/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"
kk_channel_name = 'jupyter-notebook'
kk_users = ['rileyrobertsond']

---

## Subreddit info

**NFL** - https://www.reddit.com/r/nfl

**Premier League** - https://www.reddit.com/r/PremierLeague


In [3]:
nfl = 'nfl'
epl = 'PremierLeague'

subs = [nfl, epl]


---

## Single Request Function

In [4]:
# Using Pushshift API (https://pushshift.io/api-parameters)
# returns a list of dictionaries from chosen subreddit
# each dictionary containing 1 reddit submission (post) or comment

def get_request(dict_params, request_type='submission'):
    if request_type == 'submission':
        url = f'https://api.pushshift.io/reddit/{request_type}/search'
        res = requests.get(url, params=dict_params)
        return res.json()['data']

    elif request_type == 'comment':
        return 'Comment scraping development in progress'  # res.json()['data']

    else:
        return 'Enter valid request_type (submission or comment)'
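
One consequence of returning a message string for an unsupported `request_type` is that a caller expecting a list of posts receives text instead, and extending a list with that text adds its individual characters. A minimal sketch of an alternative that raises an error up front is shown here. The names `build_search_url` and `VALID_REQUEST_TYPES` are hypothetical and not part of this notebook, and the comment endpoint is assumed to follow the same URL pattern as the submission endpoint.

```python
# Hypothetical helper (not in the original notebook): validate the request
# type up front and raise instead of returning a message string.
VALID_REQUEST_TYPES = ('submission', 'comment')


def build_search_url(request_type='submission'):
    """Return the Pushshift search URL for a supported request type."""
    if request_type not in VALID_REQUEST_TYPES:
        raise ValueError(
            f'Invalid request_type {request_type!r}; '
            f'expected one of {VALID_REQUEST_TYPES}'
        )
    # Same URL pattern that get_request builds for submissions.
    return f'https://api.pushshift.io/reddit/{request_type}/search'


print(build_search_url('submission'))
```

Raising here means a typo in `request_type` surfaces as an exception, which a `try`/`except` around the request can log like any other failure.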

#### Test

**Submission (post) test**

In [5]:
# parameters setup
params = {'subreddit': epl, 'size': '1', 'is_self': True}

# assignment of function return to variable
get_req = get_request(params)

In [6]:
print(
f'''
 subreddit: {get_req[0]['subreddit']}
     title: {get_req[0]['title'][:50]}...
    author: {get_req[0]['author']}
   created: {get_req[0]['created_utc']}
  comments: {get_req[0]['num_comments']}
       url: {get_req[0]['url']}''')


 subreddit: PremierLeague
     title: When will Premier League teams figure out Thomas T...
    author: Lersbyte
   created: 1620259116
  comments: 0
       url: https://www.reddit.com/r/PremierLeague/comments/n5uajf/when_will_premier_league_teams_figure_out_thomas/
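
The `created` value above is a Unix timestamp in seconds, counted in UTC. A minimal sketch of converting it to a readable date and time with the standard library follows. Passing `tz=dt.timezone.utc` keeps the result in UTC, whereas `fromtimestamp` without a `tz` argument converts to the local time zone of the machine running the notebook. The helper name `utc_date_and_time` is hypothetical.

```python
import datetime as dt


def utc_date_and_time(created_utc):
    """Split a Unix timestamp (seconds) into UTC date and time strings."""
    stamp = dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc)
    return stamp.strftime('%Y-%m-%d'), stamp.strftime('%H:%M:%S')


# Timestamp taken from the test output above.
date_str, time_str = utc_date_and_time(1620259116)
print(date_str, time_str)  # 2021-05-05 23:58:36
```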


---

## Loop Function

I wanted to make the process of scraping as easy as possible, so I built a function that has evolved over the course of my work for OverArmor.

In [7]:
# Function to make [n] requests of 100 posts. 
# Each request will be for a period of [win] days and the periods will not overlap
# Exports .csv of API responses for each subreddit

@knockknock.slack_sender(webhook_url=kk_url, channel=kk_channel_name, user_mentions=kk_users)
def api_requests(n, win, subs_list, output_folder, size=100, is_self=True, is_video=False):

    # For Loop to scrape multiple subreddits:
    for sub in subs_list:
        
        # List instantiation for error/event logging
        report = []
        
        # Onscreen display during loops:        
        cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print(f'API request initiated. Scraping r/{sub} in progress...')
        print(f'   Current time: {cur_datetime}')
        print('')
        
        # Adding updates to event log
        report.append({'event': f'API request from r/{sub} began',
                       'datetime': f'{cur_datetime}',
                       'exception': ''
                       })

        # Setting parameters to use in data requests from the API. Parameters mostly set by function args. 
        params = {
            'subreddit': sub,
            'size': size,
            'is_self': is_self,
            'is_video': is_video,
            'selftext:not': '[removed]'      # thanks to Amanda for posting this in the groupwork channel
        }

        list_data = []
        
        for i in range(1, n+1):               # I used the demo notebook to figure out these time
            params['after'] = f'{i * win}d'   # parameters and loop structure using n and win

            # Try/Except to handle any errors in the get_requests
            try:
                new_data = get_request(params)

            # Adding errors to event log if get_request fails
            except Exception as e:                                              
                cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                report.append({'event': f'REQUEST FAILED: {i}',
                               'datetime': f'{cur_datetime}',
                               'exception': f'{e}'
                               })
                
                # Onscreen display - Failure of single request
                print(f'Request failed: {i}')
                print(f'  Current time: {cur_datetime}')
                print('')
                time.sleep(.25)
                continue
                
            # Adding newly pulled data to list for eventual DataFrame conversion and export
            list_data.extend(new_data)
    
            # Adding completion updates to event log
            cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            report.append({'event': f'Request complete: {i}',
                           'datetime': f'{cur_datetime}',
                           'exception': ''
                           })
            
            # Onscreen updates for every 5 requests to provide timing updates
            if i % 5 == 0:
                print(f'Request complete: {i}')
                print(f'    Current time: {cur_datetime}')
                print('')

            # Saving the running event log after each successful request
            # (fixed path, independent of output_folder)
            df_report = pd.DataFrame(report)
            df_report.to_csv('../git_ignore/output/report.csv', index=False)
            time.sleep(.25)

            
        # Setting up list of desired features to keep in primary output. Others will be excluded:
        features = ['subreddit', 'created_utc', 'author', 'num_comments',
                    'score', 'is_self', 'link_flair_text','title', 'selftext', 'full_link']
            
        # creating two DataFrames from list_data after all requests have been made and completed 
        df_output = pd.DataFrame(list_data)
        df_outputfull = pd.DataFrame(list_data)
        
        # filtering the primary DataFrame
        df_output = df_output[features]
        
        # creating simple date and time columns using utc code
        # https://docs.python.org/3/library/datetime.html#date-objects
        # https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
        df_output['date'] = df_output['created_utc'].map(lambda x: dt.datetime.fromtimestamp(x).strftime('%Y-%m-%d'))
        df_output['time'] = df_output['created_utc'].map(lambda x: dt.datetime.fromtimestamp(x).strftime('%H:%M:%S'))

        # Exporting data with selected/filtered features
        # using if/else statement to accommodate varied input formatting
        df_output.reset_index(inplace=True)
        cur_datetime = dt.datetime.now().strftime('%Y_%m_%d-%H_%M_%S')

        if output_folder[-1] == '/':
            df_output.to_csv(f'{output_folder}{cur_datetime}_data_{sub}.csv', index=False,)
        else:
            df_output.to_csv(f'{output_folder}/{cur_datetime}_data_{sub}.csv', index=False,)     
            
        # Adding last update and then exporting event log as a CSV report
        # using if/else statement to accommodate varied input formatting
        cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        report.append({'event': f'r/{sub} data files saved to output folder',
                       'datetime': f'{cur_datetime}',
                       'exception': ''
                       })

        df_report = pd.DataFrame(report)
        cur_datetime = dt.datetime.now().strftime('%Y_%m_%d-%H_%M_%S')
        
        if output_folder[-1] == '/':
            df_report.to_csv(f'{output_folder}{cur_datetime}_report_{sub}.csv', index=False,)
        else:
            df_report.to_csv(f'{output_folder}/{cur_datetime}_report_{sub}.csv', index=False,)

        report = []

        # Onscreen display - completion of each subreddit scrape
        cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print(f'r/{sub} data and report files saved to output folder')
        print(f'   Current time: {cur_datetime}')
        print('')
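
The header comment of `api_requests` describes each request as covering its own `win`-day period. One way to make that explicit is to send both an `after` and a `before` bound with every request, as in the hedged sketch below. It assumes Pushshift accepts relative day strings such as `'4d'` for `before` in the same way as for `after`; that assumption and the helper name `window_params` are not from this notebook and should be checked against the API documentation.

```python
def window_params(i, win):
    """Return time bounds for the i-th window (1-based) of `win` days.

    Window i spans from i*win days ago up to (i-1)*win days ago, so
    consecutive windows touch but do not overlap.
    """
    if i < 1 or win < 1:
        raise ValueError('i and win must both be >= 1')
    bounds = {'after': f'{i * win}d'}
    newer_edge = (i - 1) * win
    if newer_edge > 0:
        # The first window has no upper bound: it runs up to "now".
        bounds['before'] = f'{newer_edge}d'
    return bounds


for i in range(1, 4):
    print(i, window_params(i, 2))
```

Inside the request loop, `params.update(window_params(i, win))` could stand in for the single `params['after']` assignment.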

#### Test

A quick test with a small number of requests:

In [8]:
api_requests(15, 2, ['nfl', 'PremierLeague'], output_folder='../data/0_from_api/')

API request initiated. Scraping r/nfl in progress...
   Current time: 2021-05-05 20:46:03

Request complete: 5
    Current time: 2021-05-05 20:46:10

Request complete: 10
    Current time: 2021-05-05 20:46:16

Request complete: 15
    Current time: 2021-05-05 20:46:21

r/nfl data and report files saved to output folder
   Current time: 2021-05-05 20:46:22

API request initiated. Scraping r/PremierLeague in progress...
   Current time: 2021-05-05 20:46:22

Request complete: 5
    Current time: 2021-05-05 20:46:27

Request complete: 10
    Current time: 2021-05-05 20:46:33

Request complete: 15
    Current time: 2021-05-05 20:46:39

r/PremierLeague data and report files saved to output folder
   Current time: 2021-05-05 20:46:39



# Scraping

## Early Scrapes

Once my function was initially complete, I tested the waters to see what I would get back. I pulled these smaller datasets into the cleaning process and realized fairly quickly that I would need more data to ensure a balanced sample from both subreddits. As I dropped rows based on certain criteria, posts from r/PremierLeague were dropped at a much higher rate than posts from r/nfl, so I needed enough Premier League content that I wouldn't have to fight to recover posts with insufficient or unusable data.
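
The imbalance described above can be checked by comparing, per subreddit, the share of rows that survive a cleaning rule. The sketch below is illustrative only: the rows are made up, the "non-empty body text" rule is a hypothetical stand-in for the actual cleaning criteria, and `retention_by_subreddit` is not a function from this project.

```python
from collections import Counter


def retention_by_subreddit(rows, keep):
    """Return {subreddit: fraction of rows kept} for a keep(row) predicate."""
    total = Counter(row['subreddit'] for row in rows)
    kept = Counter(row['subreddit'] for row in rows if keep(row))
    return {sub: kept[sub] / total[sub] for sub in total}


# Made-up rows for illustration only; not taken from the scraped data.
toy_rows = [
    {'subreddit': 'nfl', 'selftext': 'Long post body'},
    {'subreddit': 'nfl', 'selftext': 'Another post body'},
    {'subreddit': 'PremierLeague', 'selftext': ''},
    {'subreddit': 'PremierLeague', 'selftext': 'Match thread thoughts'},
]

# Hypothetical criterion: keep posts with non-empty body text.
rates = retention_by_subreddit(toy_rows, keep=lambda row: bool(row['selftext']))
print(rates)
```

A much lower rate for one subreddit is the signal that more raw posts are needed from it to end up with a balanced sample.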

In [9]:
# v1
# api_requests(200, 2, ['nfl', 'PremierLeague'], output_folder='../git_ignore/output')

In [10]:
# v2
# api_requests(400, 2, ['nfl', 'PremierLeague'], output_folder='../git_ignore/output')

In [11]:
# v3
# api_requests(800, 2, ['nfl', 'PremierLeague'], output_folder='../git_ignore/output')

## Final Scrape

The final scrape made 1,000 requests of 100 posts each, for up to 100,000 posts per subreddit. That was enough: after cleaning, I still had roughly 5,000 solid posts from the Premier League subreddit, which was in the range I wanted for building my model.

In [12]:
# v4
# api_requests(1000, 2, ['nfl', 'PremierLeague'], output_folder='../git_ignore/output')
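
The arguments to `api_requests` set upper bounds on how much a run can return. A small sketch of that arithmetic follows; the helper name `scrape_budget` is hypothetical, and the `size` and `pause` defaults mirror the function's default `size=100` and its `time.sleep(.25)` call.

```python
def scrape_budget(n, win, size=100, pause=0.25):
    """Summarize the upper bounds implied by api_requests arguments."""
    return {
        # Each of the n requests returns at most `size` posts.
        'max_posts_per_subreddit': n * size,
        # The last request uses after=f'{n * win}d'.
        'max_lookback_days': n * win,
        # Sleep time alone, ignoring network latency.
        'min_sleep_seconds_per_subreddit': n * pause,
    }


print(scrape_budget(1000, 2))
```

The actual row count can be lower than `max_posts_per_subreddit` when requests fail or return fewer than `size` posts.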

Another advantage of pulling more data than necessary is that I didn't have to come back and pull more later. In addition, the raw data can continue to serve as a source of information for OverArmor as they work on their marketing strategies.

Once I clean and prepare this data for EDA and modeling, I'll have fulfilled OverArmor's first request.

---