# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills. Collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. If creating a logistic regression, GridSearch Ridge and Lasso for this model and report the best hyperparameter values.
1. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
2. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

### Target outcome: Identify characteristics of a post on Reddit that contribute the most to overall interaction (i.e. number of comments)

### Attributes must include:
- Title of thread
- Subreddit
- Length of time post has been up
- Number of comments

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

# Wrapper

https://praw.readthedocs.io/en/latest/

Data Dictionary used as reference
https://github.com/reddit-archive/reddit/wiki/JSON

In [263]:
import requests
import time
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
%matplotlib inline

In [5]:
url = "http://www.reddit.com/hot.json"

In [6]:
headers = {'User-agent': 'Bleep Bloop'}

In [7]:
## YOUR CODE HERE
res = requests.get(url, headers=headers)

In [8]:
res.status_code

200

In [9]:
data_json = res.json()

In [10]:
sorted(data_json.keys())

['data', 'kind']

In [57]:
dataset = data_json['data'];
#Everything we want is stored in 'data'

In [12]:
sorted(data_json['data'].keys())
#breaks down our information
#after is the ID of the last post in this list

['after', 'before', 'children', 'dist', 'modhash']

In [14]:
(data_json['data']['children'][0]);

In [37]:
(data_json['data']['after'])
#This is the id of the last comment in this post

't3_8m1gat'

In [15]:
[post['data']['name'] for post in data_json['data']['children']];
#names of all the blog posts on reddit
#this is the anchor to use for the next time you hit reddit's API

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do two things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/hot.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

In [33]:
params = {'after': 't5_2s5ti'}


In [40]:
requests.get(url, params=params, headers=headers)

<Response [200]>

In [56]:
posts = []

In [233]:
after = ''
#t3_8mb92u'
#t3_8m9kly'
#t3_8mc2oz'
#t3_8mc68o'
#t3_8mdgts'
#t3_8ma3gf'
#t3_8matgl'
#t3_8ma73u'
#t3_8maqro'
#t3_8mcgvx'
#t3_8mcp0e'
#t3_8mcq8j'
#t3_8mcj7f'
#t3_8mbhqm'
#t3_8mbdn4'
#t3_8md4ye'
#t3_8mbjtc'
#t3_8mcirf'
#t3_8mdcnv'
#t3_8mcvf3'
#t3_8mc317'
#t3_8mcqb5'
#t5_2s5ti'
#t3_8m1gat
for i in range(300): #I increased my range and time.sleep with each additional link
                     #I worked my way up, I wasn't sure how much my computer could take.
    if after == None:
        params = {}
        print('Error. Check id number')
    else:
        params = {'after': after}
    url = 'http://www.reddit.com/hot.json?{}'.format(after)
    res = requests.get(url, params=params, headers = headers)
    if res.status_code ==200:
        data_json = res.json()
        posts.extend(data_json['data']['children'])
        after = data_json['data']['after']
    else:
        print(res.status_code)
        break #stops the for loop if it receives an error code
    time.sleep(3)
print('Done!')
    

In [241]:
posts[3857];
#call the last post to get the name, input that back into the function

##### I did this until I maxed out my requests and ended up with 8,000 unique posts. Decided that was enough and I moved on.

In [231]:
len(posts) #checking the number of posts

31661

In [232]:
len(set([p['data']['name'] for p in posts])) #checking how many are unique

8113

In [240]:
unique_ids = set([p['data']['name'] for p in posts])
#Saving this to an object so I can reference it later and pull only unique data

### Next: build a for loop to pull individual bits of information and save it back to a `DataFrame`
### A reminder of our goal: Identify characteristics of a post on Reddit that contribute the most to overall interaction (i.e. number of comments)¶
### Attributes must include:
- Title of thread
- Subreddit
- Length of time post has been up
- Number of comments (yi)

##### I'm going to go through a single post and pull information I feel could be useful, as well.

In [338]:
[p['data']['preview']['enabled']for p in posts[0:5]]

[True, True, True, True, True]

In [288]:
import time
ms = time.time()*1000.0
print(ms)
#Checking time for length of time up calculation

1527429342827.872


In [340]:
dataset = []
for p in posts:
    thread_info = {}
    thread_info['comments'] = p['data']['num_comments']
    thread_info['title'] = p['data']['title']
    thread_info['subreddit'] = p['data']['subreddit']
    thread_info['subreddit_subscribers'] = p['data']['subreddit_subscribers'] 
    thread_info['subreddit_type'] = p['data']['subreddit_type']
    thread_info['ups'] = p['data']['ups']
    thread_info['downs'] = p['data']['downs']
    thread_info['score'] = p['data']['score']
    thread_info['gilded'] = p['data']['gilded']
    thread_info['can_gild'] = p['data']['can_gild']
    thread_info['time_since_posted'] = (1527429342827.872-(p['data']['created'])) #time pulled above
    thread_info['over_18'] = p['data']['over_18']
    thread_info['distinguished'] = p['data']['distinguished']
    thread_info['stickied'] = p['data']['stickied']
    thread_info['archived'] = p['data']['archived']
    thread_info['locked'] = p['data']['locked']
    thread_info['num_crossposts'] = p['data']['num_crossposts']
    thread_info['pinned'] = p['data']['pinned']
    thread_info['post_categories'] = p['data']['post_categories']
    thread_info['is_video'] = p['data']['is_video']
    thread_info['saved'] = p['data']['saved']
    #thread_info['preview_avail'] = p['data']['preview']['enabled']
    
    dataset.append(thread_info)
pd.DataFrame(dataset).head()

#Putting in multiple attributes to capture popularity
#Will reduce this using a GridSearch to test penalties of model
#Tried to select as many bools as possible to simplify when creating dummies

KeyError: 'preview'

In [348]:
len(pd.DataFrame(dataset)['preview_avail'])

10

In [334]:
#Creating a data dictionary that I fill as I add attributes to my dataset
#Making notes on transformations to keep track of thoughts for data cleaning step
dictionary = pd.DataFrame()
dictionary['variable'] = ['comments',
                          'title',
                          'subreddit',
                          'subreddit_subscribers',
                          'subreddit_type',
                          'ups',
                          'downs',
                          'score',
                          'gilded',
                          'can_gild',
                          'time_since_posted',
                          'over_18',
                          'distinguished',
                          'stickied',
                          'archived',
                          'locked',
                          'num_crossposts',
                          'pinned',
                          'post_categories',
                          'is_video',
                          'saved',
                          'preview_avail'
                         ]

dictionary['description'] = ['Number of comments. Target variable',
                          'Title of post. Apply vectorizer',
                          'subreddit name (Consider lemmatizer)',
                          'Number of subreddit subscribers',
                          'public v. private',
                          'Number of uplikes',
                          'Number of downlikes',
                          'Net score(ups + downs)',
                          'Number of times post is gilded',
                          'Can post be gilded. gild = 0 if false',
                          'Calculated as current_time - time_posted',
                          'If post is 18+',
                          'distinguished',
                          'If post is stickied in post',
                          'If post is archived',
                          'If post is locked in post',
                          'Times has been crossposted to other threads',
                          'If post is pinned',
                          'List of categories, Use LabelEncoder',
                          'If post contains video media',
                          'If post is saved',
                          'If preview of post is visible'
                         ]

dictionary['dtype'] = ['str',
                          'str',
                          'subreddit',
                          'int',
                          'category',
                          'int',
                          'int',
                          'int',
                          'int',
                          'bool',
                          'long',
                          'bool',
                          'bool',
                          'bool',
                          'bool',
                          'bool',
                          'int',
                          'bool',
                          'list of str',
                          'bool',
                          'bool',
                          'bool'
                         ]

In [335]:
def what_is(dataset, variable):
    """
    Works on all standard Sonyah Seiden data dictionaries.
    dataset = dataframe used to house data dictionary
    variable = attribute information is needed on, formatted as string """
    return dataset[dataset['variable']==variable]
    

In [336]:
what_is(dictionary, 'comments')

Unnamed: 0,variable,description,dtype
0,comments,Number of comments. Target variable,str


In [349]:
posts[0]; #using this to reference structure and build for loop

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops of your computer crashes, you don't lose all your data.

In [None]:
# Export to csv


In [2]:
from sklearn.decomposition import TruncatedSVD, PCA

In [3]:
td = TruncatedSVD(n_components = 3,
                 n_iter = 5,
                 random_state = 1994)
pca = PCA()

## Predicting comments using Random Forests + Another Classifier

#### Load in the the data of scraped results

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?

In [None]:
## YOUR CODE HERE

#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 

In [None]:
## YOUR CODE HERE

#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process with a non-tree-based method.

In [None]:
## YOUR CODE HERE

#### Use Count Vectorizer from scikit-learn to create features from the thread titles. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE