<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

# Problem Statement



We work for a sportswear company that wants to classify Reddit posts. This notebook collects the posts used for that task from two subreddits, cycling and diving.

---

# Table of Contents:
- [Background](#Background)
- [Datasets Used for Analysis](#Datasets-Used-for-Analysis)
- [Import Libraries](#Import-Libraries)
- [Functions](#Functions)
- [Data Collection](#Data-Collection)
- [Combining Subreddit DataFrames](#Combining-Subreddit-DataFrames)

---

# Import Libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import timeit
from time import process_time

---

# Functions

In [2]:
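def collect_data(subreddit, rows):
    """Collect submissions from `subreddit` in batches until at least `rows` are gathered."""
    tic = timeit.default_timer()
    n = 0
    last = ''
    posts = []
    rows = int(rows)
    url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=' + subreddit
    while n < rows:
        # Request the next batch of posts created before the `last` timestamp
        res = requests.get('{}&before={}'.format(url, last))
        data = res.json()['data']
        if not data:
            # No older posts left; stop rather than loop forever
            break
        for post in data:
            posts.append(post)
            n += 1
        # The last post in this batch becomes the cursor for the next request
        last = int(post['created_utc'])
    print('Status code: ', res.status_code)
    toc = timeit.default_timer()
    elapsed_time = toc - tic
    print('Total elapsed time in seconds:', elapsed_time)
    # Note: rows / elapsed_time is posts per second, not HTTP requests per second
    print('Number of requests per second:', rows/elapsed_time)

    return posts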

---

# Data Collection

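Although the Pushshift API documentation states that a single request can return up to 500 results, only 100 were returned per request. To overcome that, the `collect_data` function sends repeated requests, passing the `created_utc` timestamp of the last post received as the `before` parameter, until the requested number of posts has been collected.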
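
The cursor-based pagination behind `collect_data` can be sketched without touching the network. This is a minimal illustration under stated assumptions, not the notebook's implementation: `paginate`, `fetch_page`, `fake_posts` and `fake_fetch` are hypothetical names, and `fetch_page(before)` stands in for one API request that returns a list of post dicts, newest first, each carrying a `created_utc` field.

```python
def paginate(fetch_page, rows):
    """Collect up to `rows` posts by walking backwards in time.

    `fetch_page(before)` is a hypothetical callable standing in for one API
    request. It returns posts created before `before`, or the newest posts
    when `before` is None.
    """
    posts = []
    before = None
    while len(posts) < rows:
        page = fetch_page(before)
        if not page:
            # Nothing older is available: stop instead of looping forever.
            break
        posts.extend(page)
        # The oldest timestamp on this page is the cursor for the next request.
        before = min(int(post['created_utc']) for post in page)
    return posts[:rows]


# Stand-in for the API: 250 made-up posts, served at most 100 per call.
fake_posts = [{'id': i, 'created_utc': 1_000_000 - i} for i in range(250)]


def fake_fetch(before):
    older = [p for p in fake_posts
             if before is None or p['created_utc'] < before]
    return older[:100]


collected = paginate(fake_fetch, 220)
print(len(collected))  # 220
```

Passing the fetcher in as a callable keeps the loop testable offline; in the notebook the equivalent step is the `requests.get` call with the `before` parameter.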

## Subreddit: Cycling

In [3]:
posts = collect_data('cycling', '2000')

Status code:  200
Total elapsed time in seconds: 56.061447
Number of requests per second: 35.67514052928387


In [4]:
cycling_df = pd.DataFrame(posts)

In [5]:
cycling_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 66 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   all_awardings                2000 non-null   object 
 1   allow_live_comments          2000 non-null   bool   
 2   author                       2000 non-null   object 
 3   author_flair_css_class       0 non-null      object 
 4   author_flair_richtext        2000 non-null   object 
 5   author_flair_text            0 non-null      object 
 6   author_flair_type            2000 non-null   object 
 7   author_fullname              2000 non-null   object 
 8   author_is_blocked            2000 non-null   bool   
 9   author_patreon_flair         2000 non-null   bool   
 10  author_premium               2000 non-null   bool   
 11  awarders                     2000 non-null   object 
 12  can_mod_post                 2000 non-null   bool   
 13  contest_mode      

In [6]:
cycling_df = cycling_df[['title','subreddit','selftext']]

In [7]:
cycling_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      2000 non-null   object
 1   subreddit  2000 non-null   object
 2   selftext   2000 non-null   object
dtypes: object(3)
memory usage: 47.0+ KB


## Subreddit: Diving

In [8]:
posts = collect_data('diving', '2000')

Status code:  200
Total elapsed time in seconds: 63.623932599999996
Number of requests per second: 31.4347120379038


In [9]:
diving_df = pd.DataFrame(posts)

In [10]:
diving_df = diving_df[['title','subreddit','selftext']]

In [11]:
diving_df

Unnamed: 0,title,subreddit,selftext
0,A freediver in Florida said he saw his friends...,diving,
1,Lobster Catch and Cook,diving,
2,4 teenage divers killed by sea snake (hydrophi...,diving,
3,Until the tide takes us… [OC],diving,
4,True!,diving,
...,...,...,...
1995,Deco with a coelenterate.,diving,[removed]
1996,"Dos Ojos Cenote, Mexico",diving,
1997,"Manati Cenote, Mexico",diving,
1998,Some footage from a trip to Mexico,diving,


---

# Combining Subreddit DataFrames

In [12]:
df = pd.concat([cycling_df, diving_df])

In [13]:
df

Unnamed: 0,title,subreddit,selftext
0,Why Are Bike Trailers So Expensive? A Cost Ana...,cycling,[removed]
1,chain lube,cycling,I have run out.\nIs Finish Line Teflon Plus Dr...
2,Advice needed!,cycling,I'm a 18M with a body fat percentage of 24.4%....
3,Neoprene socks,cycling,I see a lot on this sub and among the cycling ...
4,Is this bike really Trek?,cycling,Hello all!\n\nI am looking to buy a used road ...
...,...,...,...
1995,Deco with a coelenterate.,diving,[removed]
1996,"Dos Ojos Cenote, Mexico",diving,
1997,"Manati Cenote, Mexico",diving,
1998,Some footage from a trip to Mexico,diving,
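
The output above shows that the combined DataFrame keeps each source frame's original index, so the same row labels occur once for cycling and once for diving. If unique row labels are wanted, `pd.concat` accepts `ignore_index=True`. A small sketch with toy frames (the titles are made-up placeholders, not collected data):

```python
import pandas as pd

# Toy stand-ins for cycling_df and diving_df
a = pd.DataFrame({'title': ['a1', 'a2'], 'subreddit': ['cycling', 'cycling']})
b = pd.DataFrame({'title': ['b1', 'b2'], 'subreddit': ['diving', 'diving']})

kept = pd.concat([a, b])                      # original labels are kept
fresh = pd.concat([a, b], ignore_index=True)  # labels are renumbered

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```

Whether to renumber is a design choice: duplicate labels are harmless for text classification, but they make label-based lookups such as `df.loc[0]` return more than one row.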


In [18]:
df.head(10)

Unnamed: 0,title,subreddit,selftext
0,Why Are Bike Trailers So Expensive? A Cost Ana...,cycling,[removed]
1,chain lube,cycling,I have run out.\nIs Finish Line Teflon Plus Dr...
2,Advice needed!,cycling,I'm a 18M with a body fat percentage of 24.4%....
3,Neoprene socks,cycling,I see a lot on this sub and among the cycling ...
4,Is this bike really Trek?,cycling,Hello all!\n\nI am looking to buy a used road ...
5,Women's cycling pull-over?,cycling,Does anyone know of a nice light pull-over wit...
6,Looking for a good exercise bike for a short g...,cycling,I want to get into cycling to add to my fitnes...
7,Top 5 Best Budget Bike Trailers for Kids (Plus...,cycling,[removed]
8,Best place to watch Paris Roubaix,cycling,Anyone offer some guidance on where to watch i...
9,Packaging options for shipping though Bike Fli...,cycling,I’m going from DC to Denver and staying for a ...


In [20]:
df.tail(10)

Unnamed: 0,title,subreddit,selftext
1990,Dive tours in Miami,diving,[removed]
1991,Try Diving,diving,
1992,Who put that banana there,diving,
1993,"Diving around Dunedin, NZ",diving,May be a long shot but I just recently moved t...
1994,Finally Meeting a Tiger Shark,diving,
1995,Deco with a coelenterate.,diving,[removed]
1996,"Dos Ojos Cenote, Mexico",diving,
1997,"Manati Cenote, Mexico",diving,
1998,Some footage from a trip to Mexico,diving,
1999,First Dives In Cozumel,diving,


In [22]:
df.to_csv('../data/subreddits.csv')
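
By default `to_csv` also writes the index as an unnamed first column, which `pd.read_csv` loads back as a column called `Unnamed: 0` unless told otherwise. The sketch below shows two ways to handle this, using an in-memory buffer and a toy frame instead of the real file (the names `toy`, `buf` and `buf2` are illustrative only):

```python
import io
import pandas as pd

toy = pd.DataFrame({'title': ['t1', 't2'], 'subreddit': ['cycling', 'diving']})

# Option 1: keep the default output and treat the first column as the index on read.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

# Option 2: leave the index out when writing.
buf2 = io.StringIO()
toy.to_csv(buf2, index=False)
buf2.seek(0)
no_index = pd.read_csv(buf2)

print(list(with_index.columns))  # ['title', 'subreddit']
print(list(no_index.columns))    # ['title', 'subreddit']
```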