# Reddit & Quibi: Web API and NLP
## Part 1-A: Gathering Data from Reddit with Pushshift API

Quibi is a new mobile-only streaming platform that launched in April 2020. All of the shows are serial, with episodes under 10 minutes. Big names such as Chrissy Teigen and Sophie Turner have shows on the platform. Quibi stands for "quick bites," as the content is meant to be consumed during the "in between" times of your day. Since Quibi is brand new, I want to analyze how its content compares to that of other popular media: **videos, television, and podcasts**.
- [Videos](https://www.reddit.com/r/videos/): The length of the content is most similar to videos found on services like YouTube and Vimeo
- [Television](https://www.reddit.com/r/cordcutters/): Given the star power and financial backing of each project, they are clearly going for the narrative and production quality of television shows. I'm specifically choosing the cordcutters reddit because as of 2019, [more people pay for streaming than for cable](https://fortune.com/2019/03/19/cord-cutting-record-netflix-deloitte/). This is especially true for Millennials, who are the target demographic for Quibi.
- [Podcasts](https://www.reddit.com/r/podcasts/): As mentioned earlier, the content is meant to be consumed while you're waiting or in between tasks. Podcasts are often consumed in a similar manner. 

To see what people are saying about each of these areas, I'm going to use the Pushshift API to gather 30,000 of the most recent posts for each of the subreddits above.

**Problem Statement**: How can we best segment the Quibi slate to reach audiences that enjoy videos, television, and podcasts?

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time

Since the Pushshift API has a limit of 1000 posts per request, the function below will take in a specific subreddit and number of desired posts, then gather until the number is reached. It will also print out periodic updates of how many have been gathered and the date from the oldest post.

In [2]:
#Function to get posts from reddit
def get_posts(subreddit, num):
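    #Setting the base url and first "before" time
    #Note: t is a global epoch timestamp set in a later cell; it must be defined before this function is called
    base_url = 'https://api.pushshift.io/reddit/submission/search'
    bef_time = t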
    
    #list to hold the dataframes to concat
    to_concat = []
    
    #While loop that keeps gathering until the number of desired posts is reached
    while len(to_concat) < (num / 1000):
        params = {
            'subreddit' : subreddit,
            'size' : 1000,
            'before' : bef_time,
            'lang' : True,
            'author': '![deleted]'
                }
        get = requests.get(base_url, params)
        data = get.json()['data']
        df = pd.DataFrame(data)
        bef_time = df['created_utc'].min()
        to_concat.append(df)
        
        #If statement to print out updates every 5000 posts including the time of the earliest post
        if len(to_concat) % 5 == 0:
            #Converting the epoch time into a more readable, datetime format
            print_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(bef_time))
            print(f'{len(to_concat)*1000} posts have been gathered, oldest post is from {print_time}')
    
    #Once the while loop is done, concat the dataframes together and reset the index
    master = pd.concat(to_concat, axis=0)
    master.reset_index(inplace=True)
    
    #Making sure the posts are unique with unique ID's
    duplicates = master['id'].duplicated().sum()
    
    #Final update confirming how many posts were gathered and if there are duplicates
    print(f'Final DataFrame shape: {master.shape}, there are {duplicates} duplicates')
    
    #Return the final dataframe
    return master
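
The "before" cursor logic in the function above can be tried without hitting the network. The sketch below is a simplified illustration, not the notebook's actual function: `paginate`, `fetch_page`, and `fake_fetch` are hypothetical names, and the in-memory list stands in for the API. It also stops early when a page comes back empty, which is an added assumption about how one might handle a subreddit with fewer posts than requested.

```python
import pandas as pd

def paginate(fetch_page, num, start_before, page_size=1000):
    """Collect up to `num` rows by walking backwards in time.

    `fetch_page(before, size)` is a hypothetical callable returning a list of
    dicts, each with a 'created_utc' key, all strictly older than `before`.
    """
    frames = []
    before = start_before
    gathered = 0
    while gathered < num:
        rows = fetch_page(before, min(page_size, num - gathered))
        if not rows:
            # Nothing older is available, so stop instead of looping forever
            break
        df = pd.DataFrame(rows)
        frames.append(df)
        gathered += len(df)
        # The oldest timestamp on this page becomes the next "before" cursor
        before = int(df['created_utc'].min())
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Stand-in for the API: 25 fake posts with timestamps 1 through 25
fake_posts = [{'id': f'p{ts}', 'created_utc': ts} for ts in range(1, 26)]

def fake_fetch(before, size):
    # Return the newest `size` posts that are strictly older than `before`
    older = [p for p in fake_posts if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]

# Ask for more posts than exist to exercise the early stop
sample = paginate(fake_fetch, num=30, start_before=100, page_size=10)
print(sample.shape)
```

Passing the fetcher in as an argument keeps the cursor logic separate from the HTTP call, so the same loop can be checked against fake data before spending time on real requests.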


Posts are added every day. By setting a constant start time of April 19, 2020, 12am, the code should grab the same posts every time it is run.

In [3]:
#Used this stackoverflow article to help set a permanent start time: 
#https://stackoverflow.com/questions/7241170/how-to-convert-current-date-to-epoch-timestamp
t = int(time.mktime(time.strptime('19 April, 2020', '%d %B, %Y')))

In [4]:
#Setting a consistent start time so every time I run it, it should pull the same posts 
t

1587279600
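
One caveat: `time.mktime` interprets the date in the machine's local time zone, so the same cell can produce a different epoch value on a machine in another zone. The value printed above is 7 hours after midnight UTC, which suggests the notebook ran in a UTC-7 zone. That offset is an inference from the output, not something stated in the notebook. A minimal sketch of pinning the zone explicitly (the variable names here are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Assumption: UTC-7, inferred from the printed value 1587279600
assumed_zone = timezone(timedelta(hours=-7))

# Midnight on April 19, 2020 in the assumed zone, as an integer epoch time
fixed_start = int(datetime(2020, 4, 19, tzinfo=assumed_zone).timestamp())

# The same calendar date at midnight UTC, for comparison
utc_start = int(datetime(2020, 4, 19, tzinfo=timezone.utc).timestamp())

print(fixed_start, utc_start, fixed_start - utc_start)
```

With an explicit `tzinfo`, the timestamp no longer depends on where the code runs.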

### 1. Gathering "Podcast" posts

In [5]:
pod = get_posts('podcasts', 30_000)

5000 posts have been gathered, oldest post is from 2019-12-30 07:56:52
10000 posts have been gathered, oldest post is from 2019-08-27 00:38:53
15000 posts have been gathered, oldest post is from 2019-05-11 12:55:34
20000 posts have been gathered, oldest post is from 2019-01-25 17:23:14
25000 posts have been gathered, oldest post is from 2018-09-20 11:12:18
30000 posts have been gathered, oldest post is from 2018-04-17 08:34:07
Final DataFrame shape: (30000, 92), there are 0 duplicates


Checking it came through correctly:

In [6]:
pod.head(2)

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,category,content_categories,media_embed,removal_reason,secure_media_embed,suggested_sort,rte_mode,author_id,brand_safe,previous_visits
0,0,[],False,cvbk12,,[],,text,t2_u453b,False,...,,,,,,,,,,
1,1,[],False,HydraDominatus1,,[],,text,t2_ppkym3w,False,...,,,,,,,,,,


In [7]:
pod.tail(2)

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,category,content_categories,media_embed,removal_reason,secure_media_embed,suggested_sort,rte_mode,author_id,brand_safe,previous_visits
29998,998,,,lsdinc,,[],,text,,,...,,,,,,,markdown,,True,
29999,999,,,redbulluci,,[],,text,,,...,,,,,,,markdown,,True,


Exporting the raw data to a csv:

In [8]:
pod.to_csv('../datasets/podcasts_raw.csv', index=False)

### 2. Gathering "Television" posts

In [9]:
tv = get_posts('television', 30_000)

5000 posts have been gathered, oldest post is from 2020-03-20 10:27:17
10000 posts have been gathered, oldest post is from 2020-02-15 16:53:54
15000 posts have been gathered, oldest post is from 2020-01-12 07:18:58
20000 posts have been gathered, oldest post is from 2019-12-05 14:00:13
25000 posts have been gathered, oldest post is from 2019-11-03 14:50:39
30000 posts have been gathered, oldest post is from 2019-10-06 18:32:36
Final DataFrame shape: (30000, 88), there are 0 duplicates


Checking it came through correctly:

In [10]:
tv.head(2)

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,event_end,event_is_live,event_start,author_cakeday,poll_data,steward_reports,removed_by,updated_utc,og_description,og_title
0,0,[],False,AppleSauceJake,,[],,text,t2_3uu5v201,False,...,,,,,,,,,,
1,1,[],False,BadLobster0024,,[],,text,t2_4my71vkz,False,...,,,,,,,,,,


In [11]:
tv.tail(2)

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,event_end,event_is_live,event_start,author_cakeday,poll_data,steward_reports,removed_by,updated_utc,og_description,og_title
29998,998,[],False,promo_9movies_io,,[],,text,t2_2hql6pge,False,...,,,,,,[],,1570499000.0,,
29999,999,[],False,cynognathus,Daredevil1,[],,text,t2_5glxo,False,...,,,,,,[],,1570498000.0,,


Exporting the raw data to a csv:

In [12]:
tv.to_csv('../datasets/tv_raw.csv', index=False)

### 3. Gathering "Video" posts

In [13]:
vid = get_posts('videos', 30_000)

5000 posts have been gathered, oldest post is from 2020-04-16 03:16:36
10000 posts have been gathered, oldest post is from 2020-04-12 23:40:41
15000 posts have been gathered, oldest post is from 2020-04-10 07:58:24
20000 posts have been gathered, oldest post is from 2020-04-07 15:44:42
25000 posts have been gathered, oldest post is from 2020-04-05 07:32:45
30000 posts have been gathered, oldest post is from 2020-04-02 15:26:48
Final DataFrame shape: (30000, 73), there are 0 duplicates


Checking it came through correctly:

In [14]:
vid.head(2)

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,secure_media_embed,thumbnail_height,thumbnail_width,author_cakeday,link_flair_text,link_flair_css_class,author_flair_background_color,author_flair_text_color,gilded,link_flair_template_id
0,0,[],False,jokerbaby66,,[],,text,t2_4xcbdvhn,False,...,,,,,,,,,,
1,1,[],False,scratchwax,,[],,text,t2_57cyqoy5,False,...,,,,,,,,,,


In [15]:
vid.tail(2)

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,secure_media_embed,thumbnail_height,thumbnail_width,author_cakeday,link_flair_text,link_flair_css_class,author_flair_background_color,author_flair_text_color,gilded,link_flair_template_id
29998,998,[],False,brwonmagikk,,[],,text,t2_sm7ms,False,...,"{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,R1: Political,removed,,,,
29999,999,[],False,GlassMath6,,[],,text,t2_63mleusz,False,...,"{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,ATN,removed,,,,


This DataFrame was too large to be uploaded to GitHub in one csv so I split it into two exports:

In [16]:
vid.loc[:25000, :]

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,secure_media_embed,thumbnail_height,thumbnail_width,author_cakeday,link_flair_text,link_flair_css_class,author_flair_background_color,author_flair_text_color,gilded,link_flair_template_id
0,0,[],False,jokerbaby66,,[],,text,t2_4xcbdvhn,False,...,,,,,,,,,,
1,1,[],False,scratchwax,,[],,text,t2_57cyqoy5,False,...,,,,,,,,,,
2,2,[],False,michaelforrest,,[],,text,t2_dhs5u,False,...,,,,,,,,,,
3,3,[],False,funnywurld,,[],,text,t2_j54t0,False,...,,,,,,,,,,
4,4,[],False,faiza786,,[],,text,t2_14isn0,False,...,"{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24996,996,[],False,projetoams2020,,[],,text,t2_5ur349ef,False,...,,,,,,,,,,
24997,997,[],False,dannyinternet,,[],,text,t2_va90r,False,...,,,,,,,,,,
24998,998,[],False,reviewmua,,[],,text,t2_3dwd7rkb,False,...,,,,,,,,,,
24999,999,[],False,NaturesClassroomIns,,[],,text,t2_626o0vrf,False,...,"{'content': '&lt;iframe width=""459"" height=""34...",105.0,140.0,,,,,,,


In [17]:
vid.loc[25000:, :]

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,secure_media_embed,thumbnail_height,thumbnail_width,author_cakeday,link_flair_text,link_flair_css_class,author_flair_background_color,author_flair_text_color,gilded,link_flair_template_id
25000,0,[],False,Notsureif0010,,[],,text,t2_7t6ee,False,...,"{'content': '&lt;iframe class=""embedly-embed"" ...",105.0,140.0,,,,,,,
25001,1,[],False,MoustacheSpy,,[],,text,t2_nokse,False,...,,,,,,,,,,
25002,2,[],False,IMian91,,[],,text,t2_a1g78cv,False,...,,,,,,,,,,
25003,3,[],False,Ahmad7Raza,,[],,text,t2_4umewomm,False,...,,,,,,,,,,
25004,4,[],False,chidedneck,,[],,text,t2_7fe8j,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,995,[],False,OranReilly,,[],,text,t2_j206w,False,...,,,,,,,,,,
29996,996,[],False,00jknight,,[],,text,t2_13e2hc,False,...,"{'content': '&lt;iframe width=""459"" height=""34...",105.0,140.0,,,,,,,
29997,997,[],False,HOWMUCHISSTAROUTFIT,,[],,text,t2_3uxqscl6,False,...,"{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,,,,,,
29998,998,[],False,brwonmagikk,,[],,text,t2_sm7ms,False,...,"{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,R1: Political,removed,,,,


In [18]:
vid.loc[25001:, :]

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,secure_media_embed,thumbnail_height,thumbnail_width,author_cakeday,link_flair_text,link_flair_css_class,author_flair_background_color,author_flair_text_color,gilded,link_flair_template_id
25001,1,[],False,MoustacheSpy,,[],,text,t2_nokse,False,...,,,,,,,,,,
25002,2,[],False,IMian91,,[],,text,t2_a1g78cv,False,...,,,,,,,,,,
25003,3,[],False,Ahmad7Raza,,[],,text,t2_4umewomm,False,...,,,,,,,,,,
25004,4,[],False,chidedneck,,[],,text,t2_7fe8j,False,...,,,,,,,,,,
25005,5,[],False,999_Apps,,[],,text,t2_2yx49guu,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,995,[],False,OranReilly,,[],,text,t2_j206w,False,...,,,,,,,,,,
29996,996,[],False,00jknight,,[],,text,t2_13e2hc,False,...,"{'content': '&lt;iframe width=""459"" height=""34...",105.0,140.0,,,,,,,
29997,997,[],False,HOWMUCHISSTAROUTFIT,,[],,text,t2_3uxqscl6,False,...,"{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,,,,,,
29998,998,[],False,brwonmagikk,,[],,text,t2_sm7ms,False,...,"{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,R1: Political,removed,,,,


In [19]:
vid.loc[:25000, :].to_csv('../datasets/video_raw_1.csv', index=False)
vid.loc[25001:, :].to_csv('../datasets/video_raw_2.csv', index=False)
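
Label-based slicing with `.loc` includes both endpoints, which is why the second slice above has to start at 25001 rather than 25000. As a hedged alternative, the sketch below splits a toy DataFrame with positional `.iloc` slicing, where the end is excluded and a single cut point serves both halves. The toy data, the `split_at` value, and the in-memory buffers standing in for CSV files are all assumptions for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the scraped posts
toy = pd.DataFrame({'id': [f'p{i}' for i in range(10)], 'score': range(10)})
split_at = 6  # hypothetical cut point

# iloc excludes the end position, so the halves cannot overlap or leave a gap
first, second = toy.iloc[:split_at], toy.iloc[split_at:]

# Write each half to an in-memory buffer instead of a file on disk
buffers = []
for part in (first, second):
    buf = io.StringIO()
    part.to_csv(buf, index=False)
    buf.seek(0)
    buffers.append(buf)

# Read both halves back and stitch them together
restored = pd.concat([pd.read_csv(b) for b in buffers], ignore_index=True)
print(len(first), len(second), len(restored))
```

Writing both halves with `index=False` keeps the two files in the same shape, so they concatenate cleanly when read back.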