# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this project, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles Python's default user agent. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [None]:
import requests
import json

In [None]:
URL = "http://www.reddit.com/r/boardgames.json"

In [None]:
## YOUR CODE HERE

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```
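
Each entry in `data['data']['children']` wraps one post. As a minimal sketch, the helper below flattens a listing into plain dictionaries. The field names `title`, `selftext`, and `subreddit` are assumptions about the listing JSON (they are not named above), the helper `extract_posts` is hypothetical, and the `sample` dictionary is hand-made to mimic the shape of `res.json()`, not real data.

```python
def extract_posts(listing):
    """Flatten a Reddit listing dict into a list of small post dicts."""
    posts = []
    for child in listing["data"]["children"]:
        post = child["data"]
        posts.append({
            "title": post.get("title", ""),
            # Not every thread has body text, so default to an empty string.
            "selftext": post.get("selftext", ""),
            "subreddit": post.get("subreddit", ""),
        })
    return posts


# Hand-made sample that mimics the shape of `res.json()`; not real data.
sample = {
    "data": {
        "after": "t3_abc123",
        "children": [
            {"kind": "t3", "data": {"title": "First post", "selftext": "",
                                    "subreddit": "boardgames"}},
            {"kind": "t3", "data": {"title": "Second post", "selftext": "Body",
                                    "subreddit": "boardgames"}},
        ],
    }
}

posts = extract_posts(sample)
print(posts[0]["title"])
```

With a real response you would call `extract_posts(data)` instead of passing `sample`.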

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
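
The three steps and the pause can be combined into one loop. This is a sketch under a few assumptions: the function names `fetch_page` and `collect_posts` are made up for illustration, `max_pages=10` is an arbitrary cap, and the loop stops early if `after` comes back as `None`, which is assumed to mean there are no further pages.

```python
import time

BASE_URL = "http://www.reddit.com/r/boardgames.json"
HEADERS = {"User-agent": "YOUR NAME Bot 0.1"}


def fetch_page(after=None):
    """Request one page of the listing; `after` is the name from the previous page."""
    import requests  # imported here so collect_posts can be tried without network access
    params = {"after": after} if after else {}
    res = requests.get(BASE_URL, headers=HEADERS, params=params)
    res.raise_for_status()  # surfaces a 429 instead of failing later on bad JSON
    return res.json()


def collect_posts(fetch=fetch_page, max_pages=10, pause=3):
    """Follow `after` names page by page and return the collected post dicts."""
    posts, after = [], None
    for _ in range(max_pages):
        data = fetch(after)
        posts.extend(child["data"] for child in data["data"]["children"])
        after = data["data"]["after"]   # step 1: name of the last post
        if after is None:               # assumed to mean "no more pages"
            break
        time.sleep(pause)               # throttle between requests
    return posts
```

Calling `collect_posts(max_pages=40)` would request up to 40 pages. Passing a different `fetch` function lets you try the loop on canned data before hitting Reddit.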

In [None]:
## YOUR CODE HERE

### Save your results as a CSV
You may want to do this regularly while scraping as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
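
One way to save incrementally is to append each batch to the same file. The sketch below assumes every batch is a list of dictionaries with the same keys; the helper name `append_to_csv` and the file name `boardgames.csv` are hypothetical.

```python
import os

import pandas as pd


def append_to_csv(posts, path):
    """Append a batch of post dicts to `path`, writing the header only once."""
    df = pd.DataFrame(posts)
    write_header = not os.path.exists(path)  # header only for a brand-new file
    df.to_csv(path, mode="a", header=write_header, index=False)


# Example call after each scraped page (file name is a placeholder):
# append_to_csv(page_posts, "boardgames.csv")
```

Because the file is opened in append mode, rerunning the scraper adds to the existing file, so delete or rename it before starting a fresh scrape.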

In [35]:
import pandas as pd

In [36]:
# https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
wpt_df = pd.read_json('whitepeopletwitter.json')
wpt_df.head()
# wpt_df["is_bpt"] = 0

bpt_df = pd.read_json('blackpeopletwitter.json')
# bpt_df["is_bpt"] = 1
bpt_df.head()

Unnamed: 0,0
0,Update to Thirsty Games
1,/r/BlackPeopleTwitter Weekly Discussion Thread...
2,"It was a joke Sheryl, just like my life."
3,When you realize “The War on Drugs” was re...
4,Blame The Video Games!


In [38]:
corpus = []
# Collect the first column (the post title) of every row from both DataFrames
corpus.extend(row[0] for row in wpt_df.values.tolist())
corpus.extend(row[0] for row in bpt_df.values.tolist())
corpus

['You tryin to short me?!',
 'Haters gonna hate',
 'This sh*t is the best sh*t!',
 "That's awful",
 'Very true',
 'Sister Khalifa',
 'Smh girls these days',
 'Useless information',
 '18 to 22',
 "It's his house now",
 '#Priorities',
 'Not the bees!',
 'Slim Jims',
 'The actor role is dominated by men',
 'Some people are idiots lol',
 'A very wise man',
 'People that don’t use turn signals are scary',
 'Forced groups in class are a life lesson',
 'same.',
 'Who doesn’t love the classics?',
 'what a nice looking school of fish',
 'Proven fact',
 'the more you know',
 'ALABAMAAAA',
 'LivePD',
 'What do Dinosaurs Taste Like?',
 'D.A.R.E you to answer',
 'The truth',
 'You don’t mess with someone granddaughter.',
 'I understand now',
 'Pancake-ception',
 'We have a prodigy in our midst',
 'Who wants to hop in the time machine and take it for a spin?',
 'All doctor appointments when you’re single and 30+',
 'Well that took a dark turn',
 'White and british',
 'car boys',
 "Can't go o

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
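
To see how the three feature types differ, the sketch below fits each vectorizer on a tiny made-up corpus (the titles are invented for illustration, not scraped data). Setting `binary=True` on `CountVectorizer` records only whether a word appears, while `TfidfVectorizer` down-weights words that occur in many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented titles, used only to show the API.
toy_titles = [
    "best board games for two players",
    "board game night ideas",
    "two players two dice",
]

vectorizers = {
    "count": CountVectorizer(),              # raw word counts
    "binary": CountVectorizer(binary=True),  # 1 if the word appears, else 0
    "tfidf": TfidfVectorizer(),              # counts reweighted by rarity
}

# Fit each vectorizer and keep the resulting sparse document-term matrix.
matrices = {name: vec.fit_transform(toy_titles) for name, vec in vectorizers.items()}

for name, matrix in matrices.items():
    print(name, matrix.shape)
```

In the third title the word "two" appears twice, so its count feature is 2 while its binary feature is 1. Swapping the vectorizer inside your model and comparing cross-validated scores is one way to judge which representation helps.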

In [39]:
## YOUR CODE HERE
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# Fit the vectorizer on our corpus
cvec = CountVectorizer()
cvec.fit(corpus)

# Transform the corpus
new_corpus = cvec.transform(corpus)

In [43]:
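# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(new_corpus.todense(),
                  columns=cvec.get_feature_names_out())
df.head()
# NOTE: count() tallies non-null cells, so every column reports the number of rows.
# Use df.sum() instead to rank terms by how often they occur.
sorted_df = df.count().sort_values(ascending=False)
sorted_df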

œðÿ                6022
form               6022
franklin           6022
frame              6022
fragrance          6022
fox                6022
found              6022
forward            6022
fortnite           6022
former             6022
forget             6022
fray               6022
forever            6022
foresight          6022
foreshadowing      6022
foreign            6022
forced             6022
for                6022
footballtwitter    6022
foot               6022
franklins          6022
freaking           6022
follow             6022
friendsðÿ          6022
fuckkkk            6022
fucking            6022
fuckin             6022
fucked             6022
fuck               6022
fruit              6022
                   ... 
play               6022
platinum           6022
plate              6022
planning           6022
place              6022
pitty              6022
pit                6022
pistol             6022
pls                6022
plugs              6022
pocket          

## Predicting subreddit using Random Forests + Another Classifier

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.
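
A minimal sketch of the labeling step, assuming each subreddit's posts sit in their own DataFrame with a `title` column. The frames `sub_a` and `sub_b` and the column name `label` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped DataFrames.
sub_a = pd.DataFrame({"title": ["Haters gonna hate", "Very true"]})
sub_b = pd.DataFrame({"title": ["Blame The Video Games!"]})

sub_a["label"] = 0  # class 0: first subreddit
sub_b["label"] = 1  # class 1: second subreddit

# Stack both frames and renumber the rows from 0.
posts = pd.concat([sub_a, sub_b], ignore_index=True)

X = posts["title"]  # text to vectorize
y = posts["label"]  # binary target
print(posts)
```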

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
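
One common definition of the baseline is the accuracy you would get by always predicting the most frequent class. The sketch below computes that for a list of labels; the function name `baseline_accuracy` is made up for illustration.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Three posts from class 0 and two from class 1: always guessing 0 is right 3 times in 5.
print(baseline_accuracy([0, 0, 0, 1, 1]))  # 0.6
```

A model is only useful if it beats this number, so the more unbalanced your two subreddits are, the higher the bar.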

In [None]:
## YOUR CODE HERE

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.
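
As a rough sketch, a vectorizer and a random forest can be chained with `make_pipeline` so that raw titles go in and predictions come out. The titles and labels below are invented, and `n_estimators=100` and `random_state=42` are arbitrary choices, not tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented examples: 0 = a board game subreddit, 1 = a cooking subreddit.
titles = [
    "best board games for two players",
    "new expansion for my favorite card game",
    "which dice tower should I buy",
    "easy weeknight pasta recipe",
    "how long should I roast a chicken",
    "my first sourdough loaf",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes the text, then fits the forest on the word counts.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(titles, labels)

predictions = model.predict(["a recipe for roast potatoes", "two player card games"])
print(predictions)
```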

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
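
The sketch below shows both ideas on invented data: `cross_val_score` for a quick accuracy estimate, then `GridSearchCV` over a `Pipeline`. The step names `vec` and `clf`, the parameter grid, and `cv=3` are illustrative assumptions; with real data you would use more folds and a wider grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Invented examples: 0 = a board game subreddit, 1 = a cooking subreddit.
titles = [
    "best board games for two players",
    "new expansion for my favorite card game",
    "which dice tower should I buy",
    "game night with a heavy strategy game",
    "easy weeknight pasta recipe",
    "how long should I roast a chicken",
    "my first sourdough loaf",
    "quick recipe for chicken soup",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Accuracy on each of 3 stratified folds.
scores = cross_val_score(pipe, titles, labels, cv=3, scoring="accuracy")
print(scores.mean())

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [50, 100],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(titles, labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so vocabulary from the held-out fold does not leak into training.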

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB` or `LogisticRegression`)
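
Because the classifier is just the last step of the pipeline, trying another one is a small change. The sketch below compares two candidates with the same vectorizer on invented data; `max_iter=1000` and `cv=3` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented examples: 0 = a board game subreddit, 1 = a cooking subreddit.
titles = [
    "best board games for two players",
    "new expansion for my favorite card game",
    "which dice tower should I buy",
    "game night with a heavy strategy game",
    "easy weeknight pasta recipe",
    "how long should I roast a chicken",
    "my first sourdough loaf",
    "quick recipe for chicken soup",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean cross-validated accuracy for each classifier.
results = {}
for name, clf in candidates.items():
    pipe = make_pipeline(CountVectorizer(), clf)
    results[name] = cross_val_score(pipe, titles, labels, cv=3).mean()

print(results)
```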

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.