<span style="font-family:Trebuchet MS; font-size:2em;">Project 3 | NB1: Data Collection</span>

Riley Robertson | Reddit Classification Project

## Imports

Primary Imports

In [4]:
import requests
import pandas as pd
import datetime as dt
import time

Alerts:

In [166]:
import knockknock
kk_url = "https://hooks.slack.com/services/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"
kk_channel_name = 'jupyter-notebook'
kk_users = ['@rileyrobertsond']

---

## Subreddit info

**NFL** - https://www.reddit.com/r/nfl

**Premier League** - https://www.reddit.com/r/PremierLeague


In [9]:
nfl = 'nfl'
epl = 'PremierLeague'

subs = [nfl, epl]


---

## Single Request Function

In [217]:
# Uses the Pushshift API (https://pushshift.io/api-parameters).
# Returns a list of dictionaries from the chosen subreddit,
# each dictionary containing one Reddit submission (post) or comment.

def get_request(dict_params, request_type='submission'):
    # Reject unsupported types up front so callers never receive a string
    # where a list of dictionaries is expected
    if request_type not in ('submission', 'comment'):
        raise ValueError('Enter valid request_type (submission or comment)')

    # Submissions and comments share the same endpoint pattern
    url = f'https://api.pushshift.io/reddit/{request_type}/search'
    res = requests.get(url, params=dict_params)
    res.raise_for_status()   # surface HTTP errors instead of failing on .json()
    return res.json()['data']

#### Test

**Submission (post) test**

In [218]:
# parameters setup
params = {'subreddit': epl, 'size': '1', 'is_self': True}

# assignment of function return to variable
get_req = get_request(params, 'submission')

In [219]:
print(
f'''
 subreddit: {get_req[0]['subreddit']}
     title: {get_req[0]['title'][:50]}...
    author: {get_req[0]['author']}
   created: {get_req[0]['created_utc']}
  comments: {get_req[0]['num_comments']}
       url: {get_req[0]['url']}''')


 subreddit: PremierLeague
     title: Who is Jesse Marsch the New Head Coach of RB Leipz...
    author: ajmal1979
   created: 1619694586
  comments: 0
       url: https://www.reddit.com/r/PremierLeague/comments/n120et/who_is_jesse_marsch_the_new_head_coach_of_rb/
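
The `created` value in the output above is a Unix epoch timestamp in seconds. A minimal sketch of turning it into readable UTC date and time strings follows; the helper name `utc_date_time` is hypothetical and not part of this notebook.

```python
import datetime as dt

def utc_date_time(epoch_seconds):
    """Split a Unix timestamp into UTC date and time strings (hypothetical helper)."""
    # Passing tz explicitly avoids a silent conversion to the machine's local time zone
    stamp = dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc)
    return stamp.strftime('%Y-%m-%d'), stamp.strftime('%H:%M:%S')

# created_utc value copied from the submission test output
print(utc_date_time(1619694586))   # ('2021-04-29', '11:09:46')
```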


**Comment Test**

In [220]:
# # parameters setup (is_self applies to submissions only, so it is omitted here)
# params = {'subreddit': epl, 'size': '1'}

# # assignment of function return to variable
# get_req = get_request(params, 'comment')

In [221]:
# get_req

In [222]:
# print(
# f'''
#  subreddit: {get_req[0]['subreddit']}
#       body: {get_req[0]['body'][:50]}...
#     author: {get_req[0]['author']}
#    created: {get_req[0]['created_utc']}
#    link_id: {get_req[0]['link_id']}''')

**Comment Parameters**
- reply_delay (Integer) - Restrict based on time elapsed in seconds when comment reply was made
- nest_level (Integer) - Restrict based on nest level of comment. 1 is a top level comment
- sub_reply_delay (Integer) - Restrict based on number of seconds elapsed from when submission was made
- utc_hour_of_week (Integer) - Restrict based on hour of week when comment was made (for aggregations)
- link_id (Integer) - Restrict results based on submission id
- parent_id (Integer) - Restrict results based on parent id
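
As a rough illustration of how these parameters combine, the sketch below builds (but does not send) a comment-search URL restricted to top-level comments on one submission. The helper name `comment_search_url` is hypothetical, the submission id `n120et` is taken from the test post's URL earlier in this notebook, and whether Pushshift expects the bare id or a prefixed form for `link_id` is an assumption to check against the API documentation.

```python
from urllib.parse import urlencode

def comment_search_url(subreddit, link_id, nest_level=1, size=100):
    """Build a Pushshift comment-search URL (hypothetical helper; no request is made)."""
    params = {
        'subreddit': subreddit,
        'link_id': link_id,        # assumed format: bare submission id
        'nest_level': nest_level,  # 1 = top-level comments only
        'size': size,
    }
    return 'https://api.pushshift.io/reddit/comment/search?' + urlencode(params)

print(comment_search_url('PremierLeague', 'n120et'))
```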

---

## Loop Function

In [223]:
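# Function to make [n] requests of up to [size] posts each (default 100) for every subreddit in subs_list.
# Each request is meant to cover its own period of [win] days, with no overlap between periods.
# Exports .csv files of the API responses and an event log for each subreddit.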

@knockknock.slack_sender(webhook_url=kk_url, channel=kk_channel_name, user_mentions=kk_users)
def get_nrequest(n, win, subs_list, request_type='submission',
                 size=100, is_self=True, is_video=False):

    # Setting up list of features to retrieve from subreddit data:
    features = ['subreddit', 'created_utc', 'author', 'num_comments', 
                'score', 'is_self', 'link_flair_text','title', 'selftext', 'full_link']

    # List instantiation for error/event logging
    report = []

    # For Loop to scrape multiple subreddits:
    for sub in subs_list:
        
        # Onscreen display during loops:        
        cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print(f'API request initiated. Scraping r/{sub} in progress...')
        print(f'   Current time: {cur_datetime}')
        print('')
        
        # Adding updates to event log
        report.append({'event': f'API request from r/{sub} began',
                       'datetime': f'{cur_datetime}',
                       'exception': ''
                       })

        # Setting parameters to use in data requests from the API. Parameters mostly set by function args. 
        params = {
            'subreddit': sub,
            'size': size,
            'is_self': is_self,
            'is_video': is_video,
            'selftext:not': '[removed]'      # thanks to Amanda for posting this in the groupwork channel
        }

        list_data = []
        
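        for i in range(1, n+1):               # I used the demo notebook to figure out these time
            params['after'] = f'{i * win}d'   # parameters and loop structure using n and win
            if i > 1:
                # Upper bound so this window does not overlap the previous one
                params['before'] = f'{(i - 1) * win}d'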

            # Try/Except to handle any errors in the get_requests
            try:
                new_data = get_request(params, request_type)

            # Adding errors to event log if get_request fails
            except Exception as e:                                              
                cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                report.append({'event': f'REQUEST FAILED: {i}',
                               'datetime': f'{cur_datetime}',
                               'exception': f'{e}'
                               })
                
                # Onscreen display - Failure of single request
                print(f'Request failed: {i}')
                print(f'  Current time: {cur_datetime}')
                print('')
                time.sleep(.25)
                continue
                
            # Adding newly pulled data to list for eventual DataFrame conversion and export
            list_data.extend(new_data)
    
            # Adding completion updates to event log
            cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            report.append({'event': f'Request complete: {i}',
                           'datetime': f'{cur_datetime}',
                           'exception': ''
                           })
            
            # Onscreen updates for every 5 requests to provide timing updates
            if i % 5 == 0:
                print(f'Request complete: {i}')
                print(f'    Current time: {cur_datetime}')
                print('')
            df_report = pd.DataFrame(report)
            df_report.to_csv(f'../git_ignore/output/report.csv', index=False)
            time.sleep(.25)

        # creating DataFrames from list after all requests have been made
        df_outputfull = pd.DataFrame(list_data)

        # .copy() so the date/time columns are added to an independent DataFrame
        # rather than a slice (avoids pandas' SettingWithCopyWarning)
        df_output = df_outputfull[features].copy()

        # creating simple date and time columns from the created_utc epoch timestamp
        # (note: dt.datetime.fromtimestamp converts to the machine's local time zone)
        # https://docs.python.org/3/library/datetime.html#date-objects
        # https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
        df_output['date'] = df_output['created_utc'].map(lambda x: dt.datetime.fromtimestamp(x).strftime('%Y-%m-%d'))
        df_output['time'] = df_output['created_utc'].map(lambda x: dt.datetime.fromtimestamp(x).strftime('%H:%M:%S'))

        # Exporting primary data output with selected/filtered features
        df_output.reset_index(inplace=True)
        cur_datetime = dt.datetime.now().strftime('%Y_%m_%d-%H_%M_%S')
        df_output.to_csv(f'../git_ignore/output/{cur_datetime}_data_{sub}.csv', index=False,)

        # Exporting secondary output with all features
        cur_datetime = dt.datetime.now().strftime('%Y_%m_%d-%H_%M_%S')
        df_outputfull.to_csv(f'../git_ignore/output/{cur_datetime}_datafull_{sub}.csv', index=False,)
        
        # Exporting event log as a CSV report
        cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        report.append({'event': f'r/{sub} data files saved to output folder', 'datetime': f'{cur_datetime}'})
        df_report = pd.DataFrame(report)
        cur_datetime = dt.datetime.now().strftime('%Y_%m_%d-%H_%M_%S')
        df_report.to_csv(f'../git_ignore/output/{cur_datetime}_report_{sub}.csv', index=False,)
        report = []

        # Onscreen display - completion of each subreddit scrape
        cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print(f'r/{sub} data and report files saved to output folder')
        print(f'   Current time: {cur_datetime}')
        print('')
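
The intent that request windows should not overlap can be sketched separately from the API calls. The generator below is a hypothetical helper (not part of this notebook) that pairs each `after` bound with a `before` bound, assuming Pushshift accepts relative day strings such as `2d` for both parameters.

```python
def day_windows(n, win):
    """Yield (after, before) day-offset strings for n back-to-back windows of win days."""
    for i in range(1, n + 1):
        # Window i covers posts older than (i - 1) * win days and newer than i * win days
        yield f'{i * win}d', f'{(i - 1) * win}d'

windows = list(day_windows(3, 2))
print(windows)   # [('2d', '0d'), ('4d', '2d'), ('6d', '4d')]
```

Each tuple could be written into the request parameters as `after` and `before`, so consecutive requests cover adjacent periods instead of repeatedly returning the newest posts.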

## in progress...

In [1]:
get_nrequest(15, 2, ['nfl', 'PremierLeague'])

NameError: name 'get_nrequest' is not defined

## Testing

In [None]:
# list_subs = [nfl]
# subfields = ['subreddit', 'title', 'selftext', 'created_utc', 'author', 'num_comments', 'score', 'is_self'] 
# for sub in list_subs:
#     params = {
#         'subreddit': sub,
#         'size': '1',
#         'is_self': 'True'
#     }

#     list_data = []
#     for i in range(3):
#         new_data = get_request(params, 'submission')
#         time.sleep(2)
#         list_data.extend(new_data)


# list_data

In [None]:
# get_request(params, 'submission') #['data']

In [None]:
# new_data = get_request(params, 'submission')
# new_data

### Extracting submission tags (link_flairs)

I spent a while working out how to extract the flair text by indexing into the response, only to realize later that there is a dedicated field for it: 'link_flair_text'.

The code below pulled the same text correctly, but I had trouble getting it to work inside the function above. Finding the dedicated field meant I no longer had to troubleshoot retrieving the value via indexing.

In [48]:
request

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'khamis1179',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_9m4csl1l',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1619399784,
  'domain': 'youtu.be',
  'full_link': 'https://www.reddit.com/r/nfl/comments/mymciv/relaxing_waterfall_sound_for_sleep/',
  'gildings': {},
  'id': 'mymciv',
  'is_crosspostable': False,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': False,
  'is_self': False,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media': {'oembed': {'author_name': 'Relaxing Soul Sounds',
    'author_url': 'https://www.youtube.com/

In [112]:
for post in request:
#     print(post['link_flair_type'])
    if post['link_flair_type'] == 'richtext':
        print(post['link_flair_type'])
        print(post['link_flair_richtext'])
        print(post['link_flair_richtext'][-1]['t'])
        print(' ')

richtext
[{'e': 'text', 't': 'Highlight'}]
Highlight
 
richtext
[{'e': 'text', 't': 'Highlight'}]
Highlight
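
Both approaches can be folded into one small helper. This is a sketch only: `flair_text` is a hypothetical name, and it assumes the richtext list holds dictionaries with a `'t'` key, as in the output above.

```python
def flair_text(post):
    """Return a post's flair text, or None if it has no flair (hypothetical helper)."""
    # Prefer the dedicated field when the API response includes it
    text = post.get('link_flair_text')
    if text:
        return text
    # Fall back to the last richtext element that carries text
    for part in reversed(post.get('link_flair_richtext') or []):
        if 't' in part:
            return part['t']
    return None

sample = {'link_flair_type': 'richtext',
          'link_flair_richtext': [{'e': 'text', 't': 'Highlight'}]}
print(flair_text(sample))                         # Highlight
print(flair_text({'link_flair_richtext': []}))    # None
```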
 


In [173]:
# get_nrequest_test(3, 2,['nfl', 'PremierLeague'])

In [213]:
g, h = 1, 2