# Reddit API project
## Sonyah Seiden

### Goals:
- Collect data via an API request
- Build a binary predictor
- Present information to a business-focused audience (non-technical)

### Question:
#### _What characteristics of a post on Reddit contribute most to overall interaction, measured here as the number of comments?_
#### _Characteristics must include_:
- Title of thread
- Subreddit
- Length of time up on Reddit
- Number of comments (yi)

#### _Model developed must incorporate:_
- Classification model
- Natural Language Processing
- Threshold determined by _median_  number of comments
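
A minimal sketch of the median threshold, assuming the cleaned posts end up in a pandas DataFrame with a `num_comments` column. The column name comes from the Reddit JSON; the sample values and the `high_engagement` column name are made up for illustration.

```python
import pandas as pd

# Toy data standing in for scraped posts; the values are invented.
toy_posts = pd.DataFrame({'num_comments': [3, 10, 25, 120, 1995]})

# The median splits the posts into two roughly equal classes.
median_comments = toy_posts['num_comments'].median()

# 1 = above the median number of comments, 0 = at or below it.
toy_posts['high_engagement'] = (toy_posts['num_comments'] > median_comments).astype(int)
```

Using the median rather than the mean keeps a handful of very popular threads from dragging the cutoff upward.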

#### _**BONUS FEATURES**:_
- Use GridSearch with Ridge and Lasso to determine the best hyperparameters for this model
- Turn pitch into a blog post and host on a website

Data dictionary used as reference:
https://github.com/reddit-archive/reddit/wiki/JSON  
(Secondary data dictionary created in cleaning_modeling notebook for selected attributes)

In [2]:
import requests
import time
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
%matplotlib inline

### For the first step in my scraping process, I'm going to make one request to identify where the information I want is stored.

In [3]:
url = "https://www.reddit.com/hot.json"
headers = {'User-agent': 'Bleep Bloop'}
res = requests.get(url, headers=headers)
res.status_code #looking for a 2xx (success) code

200

In [4]:
data_json = res.json()
sorted(data_json.keys()) #checking what keys exist

['data', 'kind']

In [5]:
data_json['data']['children'][0]
#looking inside the first child to find exactly where the information I want is stored

{'data': {'approved_at_utc': None,
  'approved_by': None,
  'archived': False,
  'author': 'capt_geo',
  'author_flair_css_class': None,
  'author_flair_template_id': None,
  'author_flair_text': None,
  'banned_at_utc': None,
  'banned_by': None,
  'can_gild': False,
  'can_mod_post': False,
  'clicked': False,
  'contest_mode': False,
  'created': 1527720026.0,
  'created_utc': 1527691226.0,
  'distinguished': None,
  'domain': 'usatoday.com',
  'downs': 0,
  'edited': False,
  'gilded': 1,
  'hidden': False,
  'hide_score': True,
  'id': '8n93wl',
  'is_crosspostable': False,
  'is_reddit_media_domain': False,
  'is_self': False,
  'is_video': False,
  'likes': None,
  'link_flair_css_class': None,
  'link_flair_text': None,
  'locked': False,
  'media': None,
  'media_embed': {},
  'media_only': False,
  'mod_note': None,
  'mod_reason_by': None,
  'mod_reason_title': None,
  'mod_reports': [],
  'name': 't3_8n93wl',
  'no_follow': False,
  'num_comments': 1995,
  'num_crossposts':

In [13]:
dataset = data_json['data']

In [14]:
dataset['after'] #this shows the name of the last post in my list

't3_8n7npw'
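
To make the nesting concrete, here is a small sketch that pulls only the characteristics named in the question out of a response. The helper name `extract_fields` and the hand-made `sample_json` are hypothetical. The `title` and `subreddit` keys are assumed from the Reddit data dictionary, since the printed output above is cut off before reaching them.

```python
def extract_fields(listing_json, fields=('title', 'subreddit', 'created_utc', 'num_comments')):
    """Return one small dict per post, keeping only the chosen fields."""
    rows = []
    for child in listing_json['data']['children']:
        post = child['data']
        # .get() returns None rather than raising if a field is missing
        rows.append({field: post.get(field) for field in fields})
    return rows

# Hand-made stand-in for res.json(); only the nesting mirrors the real response.
sample_json = {
    'kind': 'Listing',
    'data': {
        'after': 't3_example',
        'children': [
            {'kind': 't3', 'data': {'title': 'Example title', 'subreddit': 'news',
                                    'created_utc': 1527691226.0, 'num_comments': 1995}},
        ],
    },
}

rows = extract_fields(sample_json)
```

Passing `rows` to `pd.DataFrame` would give one row per post with just these columns.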

### I have tested my scrape and will use this structure to build a for loop that collects thousands of posts.

In [12]:
headers = {'User-agent': 'Bleep Bloop'}
#Creating a custom agent for scraping

In [15]:
params = {'after': 't3_8n7npw'} #pulled the name from above
requests.get(url, params=params, headers=headers)
#displaying how to adapt the for loop using params

<Response [200]>

In [17]:
posts = []
after = 't3_8n38jx'
#'t3_8n1ha7'
#'t3_8n0e3h'
#'t3_8n14rc'
url = 'https://www.reddit.com/hot.json'
for i in range(3):
    #Worked my way up to 500 posts with 3 seconds (used that range when I actually scraped)
    #This allowed me to pull more unique posts at once, and avoid too many doubles
    #Tested how long it took/checked for rejections by adjusting the time.sleep() value
    print('iteration{}'.format(i))
    #with no 'after' token, start from the first page
    params = {} if after is None else {'after': after}
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        posts.extend([c['data'] for c in the_json['data']['children']])
        #including a list comprehension allows me to extract data in an organized way
        #this means less cleaning later on!
        pd.DataFrame(posts).to_csv('display.csv', index=False)
        #saving to display.csv for a reference to see how this works.
        #Won't be using this dataset because it will be doubles of posts I already have
        #display.csv is left in the repository for your reference
        after = the_json['data']['after']
        #the 'after' token from this page tells Reddit where the next page starts
    else:
        #stop on any rejection and show the status code
        print(res.status_code)
        break
    time.sleep(0)

iteration0
iteration1
iteration2
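
The same paging logic can be sketched as a reusable function. This is only an illustration, not the code used for the real scrape: `scrape_pages`, `fetch_page`, and the fake two-page listing are hypothetical names and data, and the function assumes that an `after` value of `None` means there are no further pages.

```python
import time

def scrape_pages(fetch_page, n_pages, after=None, pause=0):
    """Collect posts from up to n_pages listing pages, following the 'after' token."""
    posts = []
    for _ in range(n_pages):
        params = {'after': after} if after is not None else {}
        page = fetch_page(params)   # expected to return parsed JSON, or None on failure
        if page is None:
            break
        posts.extend(child['data'] for child in page['data']['children'])
        after = page['data']['after']
        if after is None:           # assumed to mean there is nothing left to page through
            break
        time.sleep(pause)
    return posts, after

# Fake two-page listing so the sketch runs without touching the network.
fake_pages = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}}, {'data': {'name': 't3_b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}], 'after': None}},
}

def fake_fetch(params):
    return fake_pages[params.get('after')]

posts_demo, last_after = scrape_pages(fake_fetch, n_pages=5)
```

With the real API, `fetch_page` could wrap the `requests.get(url, params=params, headers=headers)` call from the loop above, returning `res.json()` when the status code is 200 and `None` otherwise. Returning the final `after` value makes it easy to resume a later run from where this one stopped.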


In [18]:
len(posts) #checking the number of posts

75

In [19]:
len(set([p['name'] for p in posts])) #checking how many are unique

74
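
When the unique count is lower than the total, it can help to see which names repeat. A small sketch using the standard library; `repeated_names` and `demo_posts` are hypothetical names, and the demo data is invented.

```python
from collections import Counter

def repeated_names(posts):
    """Return {name: count} for every post name that appears more than once."""
    counts = Counter(p['name'] for p in posts)
    return {name: n for name, n in counts.items() if n > 1}

demo_posts = [{'name': 't3_a'}, {'name': 't3_b'}, {'name': 't3_a'}]
dupes = repeated_names(demo_posts)
```

Calling `repeated_names(posts)` on the scraped list would show which single post accounts for the gap between 75 and 74.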

In [20]:
posts[74]['name'] #pull the name of the last post to run another loop & avoid unnecessary doubles

't3_8n806v'

### I've run the loop 4 times and scraped a few thousand unique posts. I saved them all to CSVs and am now going to import them and combine them before extracting and cleaning.  
### I am doing this step by step to ensure I do not mess it up, and to keep track of how the size grows with each additional dataset.

In [21]:
dataset = pd.read_csv('my_precious.csv');
dataset.shape

(7500, 85)

In [22]:
dataset_1 = pd.read_csv('my_precious_1.csv');
dataset_1.shape

(5475, 85)

In [23]:
dataset_2 = pd.read_csv('my_precious_2.csv');
dataset_2.shape

(7493, 85)

In [24]:
dataset_3 = pd.read_csv('my_precious_3.csv');
dataset_3.shape

(10655, 85)

In [26]:
dataset_4 = pd.read_csv('my_precious_3.csv')
#same CSV filename as dataset_3 because I overwrote the file, but the earlier posts were saved before this happened
#they exist within unique_posts
dataset_4.shape

(10655, 85)

In [27]:
new_dataset = pd.concat([dataset, dataset_1])
new_dataset.shape

(12975, 85)

In [28]:
new_dataset = pd.concat([new_dataset, dataset_2])
new_dataset.shape

(20468, 85)

In [29]:
new_dataset = pd.concat([new_dataset, dataset_3])
new_dataset.shape

(31123, 85)

In [30]:
unique_posts = new_dataset.drop_duplicates(subset='permalink')
#permalink uniquely identifies each post (as do 'id' and 'name'),
#so dropping repeats on it eliminates all duplicates

In [31]:
len(unique_posts)

11047

In [None]:
#unique_posts.to_csv('unique_posts.csv', index=False)
#Saving this to a CSV right away because I learned that lesson early on.
#Commented it out now to avoid overwriting it and losing any posts
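
An alternative to commenting the save out is a small guard that refuses to replace an existing file. This is a sketch only; `save_new_csv` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import os
import pandas as pd

def save_new_csv(df, path):
    """Write df to path, refusing to replace a file that already exists."""
    if os.path.exists(path):
        raise FileExistsError('{} already exists; pick a new name'.format(path))
    df.to_csv(path, index=False)
    return path
```

Re-running a cell that calls `save_new_csv(unique_posts, 'unique_posts.csv')` would then raise an error instead of silently overwriting earlier posts.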

### After I initially did this and tested the CSV in my other notebook, I decided to add a few more unique posts. I ran the for loop again and saved the result as dataset_4, then reloaded the original unique_posts.csv and appended the new posts directly to it.

In [32]:
unique_dataset_4 = dataset_4.drop_duplicates(subset='permalink')

In [33]:
unique_posts = pd.read_csv('./unique_posts.csv');

In [34]:
unique_posts = pd.concat([unique_posts, unique_dataset_4])

In [35]:
unique_posts.shape

(18018, 85)
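
Combining two frames that were each de-duplicated on their own does not remove posts that appear in both. Whether that matters here depends on how much the two files overlap, so a quick check may be worthwhile. The sketch below uses invented toy frames, and `count_repeats` is a hypothetical helper.

```python
import pandas as pd

def count_repeats(df, key='permalink'):
    """Number of rows whose key value already appeared earlier in df."""
    return int(df.duplicated(subset=key).sum())

# Two toy frames, each unique on its own, sharing one permalink.
old = pd.DataFrame({'permalink': ['/r/a/1', '/r/b/2']})
new = pd.DataFrame({'permalink': ['/r/b/2', '/r/c/3']})
combined = pd.concat([old, new], ignore_index=True)

n_repeats = count_repeats(combined)
deduped = combined.drop_duplicates(subset='permalink')
```

If `count_repeats(unique_posts)` comes back as zero, the combined frame is already unique; otherwise a final `drop_duplicates(subset='permalink')` before saving would restore uniqueness.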

In [36]:
unique_posts.to_csv('unique_posts.csv', index=False)
#resaving full dataset of unique_posts with new set of posts

#### _Now that I have saved all unique posts and have a sufficient number of them, I'm moving over to my cleaning_modeling notebook to keep things organized._