

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Using Reddit's API for Predicting Comments<br>
### Author: Will J. Suh

In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: **_What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_**

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles the default user agent that Python's `requests` library sends. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
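
To make the header requirement concrete, here is a minimal sketch of a helper that sends the custom `User-agent` and fails loudly on a throttled response. The function name `fetch_json` and its `get` parameter are assumptions made for illustration (they are not part of the project); `get` exists only so the helper can be tried without a network connection.

```python
def fetch_json(url, get=None, user_agent='YOUR NAME Bot 0.1'):
    """Request `url` with a custom User-agent and return the parsed JSON.

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised with a stand-in when no network is available.
    """
    if get is None:
        import requests
        get = requests.get
    res = get(url, headers={'User-agent': user_agent})
    if res.status_code != 200:
        # 429 means the request was throttled; any other non-200 code is
        # treated as a failure here as well.
        raise RuntimeError('Request failed with status {}'.format(res.status_code))
    return res.json()
```

Calling `fetch_json(URL)` would then return the same dictionary as `res.json()` in the cells below, or raise an error instead of silently handing back a throttled response.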

In [2]:
import json
import re

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
# train_test_split and cross_val_score live in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier





In [3]:
URL = 'https://www.reddit.com/r/todayilearned.json'

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [4]:
results = requests.get(URL, headers = {'User-agent' : 'Will Bot 0.1'})

In [5]:
data = results.json()

In [6]:
print(len(data['data']['children']))

25


In [None]:
data['data'].keys()

In [8]:
data['data']['after']

't3_9eecnq'

In [9]:
#after = 
new_url = 'http://www.reddit.com/r/todayilearned.json?after=t3_9dkoki'

In [10]:
results = requests.get(new_url, headers = {'User-agent':'Will Bot 0.1'})
new_data = results.json()
new_data['data']['children']


[]

In [11]:
soup = BeautifulSoup(results.text, 'lxml')

In [12]:
len(new_data['data']['children'])

0

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

## First subreddit: /r/todayilearned

In [13]:
import time

In [26]:
url = 'http://www.reddit.com/r/todayilearned.json'
all_posts =[]
for _ in range(40): 
    # construct a list of 1000
    
    # Get the posts by hitting the url, put it in json and store it
    res = requests.get(url, headers={'User-agent': 'will bot'})
    data = res.json()
    
    # save only the posts out of the json into the list_of_posts, then
    # add all the posts to the all_posts list
    list_of_posts = data['data']['children']
    
    for post in list_of_posts:
        current_post = []
        current_post.append(post['data']['selftext'])
        current_post.append(post['data']['title'])
        current_post.append(post['data']['subreddit_name_prefixed'])
        all_posts.append(current_post)
    
    # reassign the after to the current 'after', and then update the url to hit
    after = data['data']['after']
    url = 'http://www.reddit.com/r/todayilearned.json?after=' + after
    
    
    # go to sleep for 3 seconds so you do not overwhelm reddit and get kicked out
    print('The current after: ', after)
    time.sleep(3)

The current after:  t3_9eecnq
The current after:  t3_9eev2n
The current after:  t3_9efh4r
The current after:  t3_9eajh3
The current after:  t3_9e871c
The current after:  t3_9e2xb5
The current after:  t3_9e2ih3
The current after:  t3_9dyqwo
The current after:  t3_9dxamf
The current after:  t3_9dgksr
The current after:  t3_9dqo5o
The current after:  t3_9dmcb4
The current after:  t3_9dqzu2
The current after:  t3_9do3aa
The current after:  t3_9dma8z
The current after:  t3_9df5t1
The current after:  t3_9df4ig
The current after:  t3_9dlqcv
The current after:  t3_9d9dbe
The current after:  t3_9d7veb
The current after:  t3_9d4hnx
The current after:  t3_9cpvk2
The current after:  t3_9czmxp
The current after:  t3_9cwjda


TypeError: must be str, not NoneType
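
The `TypeError` above most likely means that Reddit returned `None` for `data['data']['after']`, which happens when a listing has no further page, so the string concatenation fails. Below is a hedged sketch of the same loop with a guard. The function name `collect_posts` and the `fetch` parameter are illustrative assumptions; `fetch` stands for any callable that takes a URL and returns the parsed JSON, for example `lambda u: requests.get(u, headers={'User-agent': 'will bot'}).json()`.

```python
import time

def collect_posts(fetch, base_url, max_pages=40, pause=3):
    """Page through a listing until max_pages is reached or 'after' runs out."""
    all_posts = []
    url = base_url
    for _ in range(max_pages):
        data = fetch(url)
        for post in data['data']['children']:
            all_posts.append([
                post['data']['selftext'],
                post['data']['title'],
                post['data']['subreddit_name_prefixed'],
            ])
        after = data['data']['after']
        if after is None:
            # No further page: stop instead of concatenating None to a string.
            break
        url = base_url + '?after=' + after
        time.sleep(pause)  # stay within the request rate limit
    return all_posts
```

With this guard the loop ends cleanly once the listing is exhausted, and the posts collected so far are returned rather than left behind by an exception.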

In [27]:
# save this to a dataframe first and then to CSV

first_subreddit = pd.DataFrame(all_posts, columns = ['post', 'title', 'true_y'])

In [28]:
first_subreddit.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | | TIL the monarch butterfly’s life span is 2 to ... | r/todayilearned |
| 1 | | TIL That Bob Hawke is the only Australian prim... | r/todayilearned |
| 2 | | TIL A man called Miyamoto Musashi is considere... | r/todayilearned |
| 3 | | TIL the Barbie Liberation Organization swapped... | r/todayilearned |
| 4 | | TIL that although bamboo are commonly refered ... | r/todayilearned |


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [29]:
pd.DataFrame(first_subreddit).to_csv('./TIL') #renamed to first_subreddit in jupyter notebook

## Second subreddit: /r/askreddit

In [61]:
url2 = 'https://www.reddit.com/r/askreddit.json'
results2 = requests.get(url2, headers = {'User-agent' : 'Will Bot 0.2'})
data2 = results2.json()
print(len(data['data']['children']))

soup = BeautifulSoup(results2.text, 'lxml')

11


In [62]:
url2 = 'https://www.reddit.com/r/askreddit.json'
all_posts =[]
for _ in range(40): 
    # construct a list of 1000
    
    # Get the posts by hitting the url, put it in json and store it
    res = requests.get(url2, headers={'User-agent': 'will bot 0.2'})
    data2 = res.json()
    
    # save only the posts out of the json into the list_of_posts, then
    # add all the posts to the all_posts list
    list_of_posts = data2['data']['children']
    
    for post in list_of_posts:
        current_post = []
        current_post.append(post['data']['selftext'])
        current_post.append(post['data']['title'])
        current_post.append(post['data']['subreddit_name_prefixed'])
        all_posts.append(current_post)
    
    # reassign the after to the current 'after', and then update the url to hit
    after2 = data2['data']['after']
    url2 = 'https://www.reddit.com/r/askreddit.json?after=' + after2
    
    # go to sleep for 3 seconds so you do not overwhelm reddit and get kicked out
    print('The current after: ', after2)
    time.sleep(5)


The current after:  t3_9e3rrq
The current after:  t3_9ec5nk
The current after:  t3_9eefpl
The current after:  t3_9e5rrb
The current after:  t3_9ednp8
The current after:  t3_9efeot
The current after:  t3_9ef7ak
The current after:  t3_9eezj3
The current after:  t3_9efjg0
The current after:  t3_9efexp
The current after:  t3_9ef97c
The current after:  t3_9ef4n8
The current after:  t3_9edmq7
The current after:  t3_9ed3ey
The current after:  t3_9eenw8
The current after:  t3_9efmyu
The current after:  t3_9efjij
The current after:  t3_9efgdm
The current after:  t3_9edjyo
The current after:  t3_9ef917
The current after:  t3_9ef5p0
The current after:  t3_9ef2y6
The current after:  t3_9eax3g
The current after:  t3_9eeyb8
The current after:  t3_9eeu8r
The current after:  t3_9edowb
The current after:  t3_9eawcv
The current after:  t3_9ediz0
The current after:  t3_9eeg0a
The current after:  t3_9eecn6
The current after:  t3_9ee981
The current after:  t3_9ee5a3
The current after:  t3_9ed1c5
The curren

TypeError: must be str, not NoneType

In [63]:
len(all_posts)

956

In [64]:
second_subreddit = pd.DataFrame(all_posts, columns = ['post', 'title', 'true_y'])

In [65]:
second_subreddit.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | We'll use this as a celebration post. Subreddi... | AskReddit has reached 20 million subscribers! | r/AskReddit |
| 1 | | What's a "fact" you thought was true for years... | r/AskReddit |
| 2 | | NSFW what's the most awkward boner you've ever... | r/AskReddit |
| 3 | | Reddit, what's a good icebreaker (for parties,... | r/AskReddit |
| 4 | | What's something people think makes them uniqu... | r/AskReddit |


In [66]:
pd.DataFrame(second_subreddit).to_csv('./askreddit')

## Load CSV and Clean Dataframes

In [67]:
til = pd.read_csv('./TIL')
ask = pd.read_csv('./askreddit')

In [68]:
del til['Unnamed: 0']
del ask['Unnamed: 0']
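
The `Unnamed: 0` column deleted above is the DataFrame index that `to_csv` writes by default. As a small sketch with made-up data, passing `index=False` when saving keeps that column out of the file in the first place:

```python
import io

import pandas as pd

df = pd.DataFrame({'title': ['first', 'second'], 'true_y': ['r/demo', 'r/demo']})

# Default behaviour: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False leaves the index out of the file.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'title', 'true_y']
print(list(without_index.columns))  # ['title', 'true_y']
```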

In [69]:
ask.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | We'll use this as a celebration post. Subreddi... | AskReddit has reached 20 million subscribers! | r/AskReddit |
| 1 | | What's a "fact" you thought was true for years... | r/AskReddit |
| 2 | | NSFW what's the most awkward boner you've ever... | r/AskReddit |
| 3 | | Reddit, what's a good icebreaker (for parties,... | r/AskReddit |
| 4 | | What's something people think makes them uniqu... | r/AskReddit |


In [70]:
ask.shape

(956, 3)

In [71]:
til.shape

(611, 3)

In [72]:
til.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | | TIL the monarch butterfly’s life span is 2 to ... | r/todayilearned |
| 1 | | TIL That Bob Hawke is the only Australian prim... | r/todayilearned |
| 2 | | TIL A man called Miyamoto Musashi is considere... | r/todayilearned |
| 3 | | TIL the Barbie Liberation Organization swapped... | r/todayilearned |
| 4 | | TIL that although bamboo are commonly refered ... | r/todayilearned |


There are two observations in 'AskReddit' where there is a post (954 of the 956 rows have a null `post`, as the null counts below show).

There are no observations in 'Today I Learned' where there is a post.

I will create a feature that combines post and title instead of deleting the post feature.

In [73]:
til.isnull().sum()

post      611
title       0
true_y      0
dtype: int64

In [74]:
ask.isnull().sum()

post      954
title       0
true_y      0
dtype: int64

In [75]:
ask['post'] = ask['post'].fillna('')

In [76]:
til['post'] = til['post'].fillna('')

In [77]:
# Post + Title Feat. Eng.

ask['interact'] = ask['post'] + ask['title']
til['interact'] = til['post'] + til['title']

In [78]:
ask.head()

| | post | title | true_y | interact |
|---|---|---|---|---|
| 0 | We'll use this as a celebration post. Subreddi... | AskReddit has reached 20 million subscribers! | r/AskReddit | We'll use this as a celebration post. Subreddi... |
| 1 | | What's a "fact" you thought was true for years... | r/AskReddit | What's a "fact" you thought was true for years... |
| 2 | | NSFW what's the most awkward boner you've ever... | r/AskReddit | NSFW what's the most awkward boner you've ever... |
| 3 | | Reddit, what's a good icebreaker (for parties,... | r/AskReddit | Reddit, what's a good icebreaker (for parties,... |
| 4 | | What's something people think makes them uniqu... | r/AskReddit | What's something people think makes them uniqu... |


In [79]:
# Join TIL and AskReddit into a dataframe called concat_df

concat_df = pd.concat([til, ask], axis = 0, ignore_index= True)

In [80]:
# clean stop words:

from nltk.corpus import stopwords
stop = stopwords.words('english')

concat_df['interact'] = concat_df['interact'].apply(lambda x: ' '.join([word for word in x.split(' ') if word not in (stop)]))

In [81]:
# Remove "TIL" from concat_df['interact'] because it would be too easy.
# Remove "?" from concat_df['interact'] because almost all of AskReddit ends in a question mark.
# Remove other punctuation marks and lower case everything.

artifact = ['\n\n', '\n', '\t', '#', r'r/', 'https://', 'www.', 'http://', 'reddit.com', 'wiki',
            '.', ',', '-',';','"', ':', "[", "]", "(", ")", '[/', '/', '?', "'", '*', '$', '&', 
            'TIL', 'the', 'reddit', 'it', 'be']
for i in artifact:
    concat_df['interact'] = concat_df['interact'].map(lambda x: x.replace(i, ''))

concat_df['interact'] = concat_df['interact'].str.lower()
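
One caveat about the cell above: `str.replace` removes substrings, not whole words, so entries such as `'the'`, `'it'`, and `'be'` are also cut out of the middle of longer words, and the replacement runs before lowercasing. The sketch below is an alternative that lowercases first and then drops whole tokens only. The function name `clean_text` and the tiny `STOP_WORDS` set are assumptions for illustration; the notebook itself uses nltk's English stop-word list.

```python
import re

# Small illustrative stop-word set; the notebook uses nltk's English list.
STOP_WORDS = {'the', 'it', 'be', 'a', 'is', 'to', 'of'}
# Tokens dropped on purpose because they give the subreddit away.
GIVEAWAYS = {'til', 'reddit'}

def clean_text(text):
    """Lowercase, strip URLs and punctuation, then drop whole-word tokens."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # remove URLs
    text = re.sub(r'[^a-z0-9\s]', '', text)             # remove punctuation
    words = text.split()
    kept = [w for w in words if w not in STOP_WORDS and w not in GIVEAWAYS]
    return ' '.join(kept)

print(clean_text("TIL the therapist said it's better"))
# therapist said its better
```

Because whole tokens are compared, a word like "therapist" survives intact even though it contains "the".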

In [82]:
# one hot encode true_y. 0 == 'AskReddit' / 1 == 'TIL'

one_hot = pd.get_dummies(concat_df['true_y'])


In [83]:
concat_df.tail() # Mostly clean, but substring removal also clips words ('therapy' became 'rapy' in row 1562)

| | post | title | true_y | interact |
|---|---|---|---|---|
| 1562 | | People who have been to therapy have you ever ... | r/AskReddit | people rapy ever thought dating rapist getting... |
| 1563 | | Feeling worthless and empty at times. Not sad,... | r/AskReddit | feeling worthless empty times not sad hollow w... |
| 1564 | | What is the genre of music in the mr green com... | r/AskReddit | what genre music mr green commercial |
| 1565 | | What is your "Better them than me" moment? | r/AskReddit | what better me moment |
| 1566 | | What are some of the creepiest things you have... | r/AskReddit | what creepiest things found internet |


In [84]:
X = concat_df['interact']
y = concat_df['true_y']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [85]:
# vectorize interact feature:
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             lowercase = False,
                             max_features = 5000)

vectorizer.fit(X_train)

X_train_vec = vectorizer.transform(X_train)

X_test_vec = vectorizer.transform(X_test)
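
For the "count or binary features" question in the task list, the difference comes down to `CountVectorizer`'s `binary` flag. A minimal sketch on two made-up documents (not the scraped data) shows what changes:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['what what is this', 'today i learned this']

counts = CountVectorizer().fit(docs)             # raw term counts
binary = CountVectorizer(binary=True).fit(docs)  # 1 if the term occurs at all

# Look up the column for 'what' and read its value for the first document.
count_value = counts.transform(docs)[0, counts.vocabulary_['what']]
binary_value = binary.transform(docs)[0, binary.vocabulary_['what']]

print(int(count_value), int(binary_value))  # 2 1
```

Whether binary features help on short titles, where most words occur only once anyway, is something to check empirically rather than assume.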

In [86]:
X_train_vec_df = pd.DataFrame(X_train_vec.todense(), 
                    columns = vectorizer.get_feature_names())

## Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [119]:
## YOUR CODE HERE

clf = RandomForestClassifier(n_jobs=2, random_state=0)

In [121]:
clf.fit(X_train_vec.todense(), y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [122]:
clf.score(X_train_vec, y_train)

0.9957446808510638

In [89]:
pred = clf.predict(X_test_vec)

In [90]:
clf.score(X_test_vec, y_test)

0.8903061224489796

In [91]:
clf.predict_proba(X_test_vec)[:10]

array([[0.1, 0.9],
       [1. , 0. ],
       [0.9, 0.1],
       [0.1, 0.9],
       [0.2, 0.8],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ]])

In [92]:
np.argsort(clf.feature_importances_)[4900:5000]

array([1956,  682, 1673, 3564, 1961, 4333, 1917, 3576, 4454, 2919, 3522,
       1991, 4060, 2020, 1667, 1670, 4713, 4938,  605, 1437, 1993,  748,
        879, 2899, 1777,  800, 2101, 4875,   85, 2002,  891, 2551, 1550,
       4046,  618, 1619,  587, 2917, 4447, 4943, 1592,  478, 3852, 2360,
       1215, 1803, 3243, 4935, 3677,  914, 4962, 3104, 2999, 4066,  761,
        363, 4138, 2014, 4324, 2058, 1521, 1792, 2824, 4662, 1346, 4886,
       2161, 3534, 2631, 2431, 2685, 4434, 1238, 4707, 2013, 2225, 1568,
        665, 1299, 2365, 4449, 2325, 3160, 1689, 4880,  825, 4951, 1141,
       3854, 2066, 4446, 4982, 4042, 4702, 4893, 4436, 3533, 2056, 4872,
       4870])

In [93]:
feature_names = pd.DataFrame(X_train_vec.todense(), columns=vectorizer.get_feature_names()).head()

In [116]:
# Top 100 words

feature_names.columns[[1956,  682, 1673, 3564, 1961, 4333, 1917, 3576, 4454, 2919, 3522,
       1991, 4060, 2020, 1667, 1670, 4713, 4938,  605, 1437, 1993,  748,
        879, 2899, 1777,  800, 2101, 4875,   85, 2002,  891, 2551, 1550,
       4046,  618, 1619,  587, 2917, 4447, 4943, 1592,  478, 3852, 2360,
       1215, 1803, 3243, 4935, 3677,  914, 4962, 3104, 2999, 4066,  761,
        363, 4138, 2014, 4324, 2058, 1521, 1792, 2824, 4662, 1346, 4886,
       2161, 3534, 2631, 2431, 2685, 4434, 1238, 4707, 2013, 2225, 1568,
        665, 1299, 2365, 4449, 2325, 3160, 1689, 4880,  825, 4951, 1141,
       3854, 2066, 4446, 4982, 4042, 4702, 4893, 4436, 3533, 2056, 4872,
       4870]]

Index(['gil', 'blue', 'er', 'released', 'give', 'takes', 'friend', 'rememred',
       'this', 'north', 'record', 'guys', 'south', 'held', 'english', 'entire',
       'using', 'workers', 'baseball', 'died', 'had', 'break', 'cases', 'non',
       'favore', 'butterflies', 'inappropriate', 'when', '1980s', 'happen',
       'caused', 'male', 'drink', 'song', 'battle', 'egypt', 'band', 'normal',
       'things', 'world', 'earth', 'around', 'series', 'larger', 'country',
       'female', 'plant', 'work', 'rock', 'century', 'ww2', 'parents', 'one',
       'space', 'brish', 'also', 'star', 'head', 'system', 'human', 'done',
       'feel', 'named', 'uned', 'deal', 'who', 'instead', 'reddors', 'means',
       'life', 'mexico', 'that', 'created', 'used', 'he', 'it', 'due', 'black',
       'cy', 'late', 'think', 'known', 'people', 'ever', 'which', 'called',
       'would', 'considered', 'serious', 'if', 'thing', 'you', 'something',
       'us', 'why', 'the', 'redd', 'how', 'whats', 'what'],
      d

### The three most important words were: _'What, whats, and how'._

This makes sense because questions posted to r/AskReddit usually start with 'what', 'what's', or 'how'. ('Why' ranked sixth, behind 'redd' and 'the', which are leftovers from the cleaning step.)

After deeper exploration, other interesting feature words in top 100 were: 'called', 'people', 'used', 'earth', 'city', 'favorite', 'best', 'black', 'ww2', 'blood', 'created', 'love'.
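
The hard-coded index list used above breaks as soon as the vocabulary changes. As a sketch, the same lookup can be done by sorting the importances and mapping indices back to names in one step. The helper name `top_features` and the toy values are assumptions; in the notebook the inputs would be `clf.feature_importances_` and the vectorizer's feature names.

```python
import numpy as np

def top_features(importances, feature_names, n=100):
    """Return the n (name, importance) pairs with the largest importance."""
    importances = np.asarray(importances)
    order = np.argsort(importances)[::-1][:n]  # indices, most important first
    return [(feature_names[i], float(importances[i])) for i in order]

# Toy values standing in for the real importances and vocabulary.
demo = top_features([0.1, 0.6, 0.3], ['how', 'what', 'whats'], n=2)
print(demo)  # [('what', 0.6), ('whats', 0.3)]
```

Returning the list most-important-first also makes the ranking easier to read than the ascending order produced by a plain `np.argsort`.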

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [128]:
## YOUR CODE HERE

#One hot encoded above.
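
Note that the models in this notebook are fit on the string labels in `true_y`, which scikit-learn accepts directly. If an explicit 0/1 target is wanted, a boolean comparison is enough. The sketch below uses a made-up three-row Series and follows the convention stated earlier (0 for AskReddit, 1 for TIL):

```python
import pandas as pd

true_y = pd.Series(['r/AskReddit', 'r/todayilearned', 'r/AskReddit'])

# True/False for "is this a TIL post?", then cast to 1/0.
y_binary = (true_y == 'r/todayilearned').astype(int)

print(y_binary.tolist())  # [0, 1, 0]
```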

#### Thought experiment: What is the baseline accuracy for this model?

In [96]:
# The baseline accuracy is the share of the majority class (AskReddit) in the total dataset:
# a model that always predicts AskReddit would be right this often.
# 978 for AskReddit (Class 0)
# 631 for TIL (Class 1)

# Thus:
ask_reddit_baseline = 926/1537
ask_reddit_baseline

#The model's accuracy scored ~93%. Did a lot better than the baseline.

0.6024723487312947
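
Typing the counts by hand is easy to get out of sync with the data. A sketch of computing the baseline from the labels themselves follows, using the row counts reported by `ask.shape` and `til.shape` earlier (956 and 611); in the notebook, `labels` would simply be `concat_df['true_y']`.

```python
from collections import Counter

# Label counts taken from the ask.shape and til.shape outputs above.
labels = ['r/AskReddit'] * 956 + ['r/todayilearned'] * 611

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]
baseline = majority_count / len(labels)

print(majority_label, round(baseline, 3))  # r/AskReddit 0.61
```

With those counts the baseline is about 61%, close to the value computed in the cell above.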

In [97]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred) 

array([[228,  11],
       [ 32, 121]])
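
Accuracy, precision, and recall can be read directly off this matrix. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predictions, classes in sorted order), so the first row is `r/AskReddit` and the second is `r/todayilearned`, which is treated as the positive class here.

```python
import numpy as np

cm = np.array([[228, 11],
               [32, 121]])

# Unpack in row-major order, with the second class as "positive".
tn, fp, fn, tp = cm.ravel()

accuracy = float((tp + tn) / cm.sum())  # share of all test posts classified correctly
precision = float(tp / (tp + fp))       # of posts predicted TIL, how many were TIL
recall = float(tp / (tp + fn))          # of actual TIL posts, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.8903 0.9167 0.7908
```

The accuracy matches the test score reported for the random forest above. The lower recall shows that the errors are mostly TIL posts mistaken for AskReddit posts.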

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

In [105]:
## YOUR CODE HERE

steps = [
    ('random_forest', RandomForestClassifier())
]

pipe = Pipeline(steps)

param_grid = {
    'random_forest__max_depth' : [30, 50, 70, 80, 100],
    'random_forest__n_estimators' : [20, 30, 50, 70, 80, 100]
}

gs = GridSearchCV(pipe, param_grid, cv = 3)
result = gs.fit(X_train_vec, y_train)


In [106]:
result.best_params_

{'random_forest__max_depth': 100, 'random_forest__n_estimators': 100}

In [107]:
result.best_score_ # Not much difference. 

0.9276595744680851
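
The bonus task asks for the vectorizer to be tuned together with the classifier. One way to wire that up, shown here as a sketch on a tiny made-up corpus, is to put `CountVectorizer` inside the `Pipeline` and fit the grid search on raw text, so the vocabulary is rebuilt within each cross-validation fold. The step names `vec` and `rf` and the grid values are arbitrary choices for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to show the wiring.
texts = ['what is your favorite book', 'what would you do with a million',
         'how do you relax', 'why do people lie',
         'monarch butterflies migrate south', 'bamboo is a kind of grass',
         'the band released a record', 'egypt built pyramids long ago']
labels = ['ask'] * 4 + ['til'] * 4

pipe = Pipeline([
    ('vec', CountVectorizer()),
    ('rf', RandomForestClassifier(random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'vec__binary': [False, True],
    'rf__n_estimators': [10, 20],
}

gs = GridSearchCV(pipe, param_grid, cv=2)
gs.fit(texts, labels)  # raw text in; vectorizing happens inside each fold
print(gs.best_params_)
```

On the real data this would be fit on `X_train` (the text column) rather than on `X_train_vec`.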

## `GradientBoost` with GridSearchCV

In [188]:
## Gradient Boost

gradient_boost = GradientBoostingClassifier()

grad_params = {
    'n_estimators' : [80, 90, 100],
    'learning_rate' : [0.09, 0.1, 0.2, 0.5],
    'max_depth' : [1, 2, 4, 6]
}

g_search = GridSearchCV(gradient_boost, param_grid = grad_params, cv = 3)
g_search.fit(X_train_vec, y_train)
print(g_search.best_params_)
print(g_search.best_score_)

{'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 80}
0.9347079037800687


In [190]:
g_search.score(X_test_vec, y_test) #model is slightly overfitting.

0.9278350515463918

## `Decision Tree` with GridSearchCV

In [110]:
## Decision Tree

dt = DecisionTreeClassifier(max_depth= 20)
cross_val_score(dt, X_train_vec, y_train, cv = 5).mean()

0.9344563726532714

In [111]:
dt.fit(X_train_vec, y_train)
dt.score(X_train_vec, y_train)

0.9744680851063829

In [112]:
dt.score(X_test_vec, y_test) 

0.9311224489795918

In [102]:
dt = DecisionTreeClassifier()

dt_params = {
    'max_depth' : [10, 20, 30, 50, 100, 200],
}

gridsearch = GridSearchCV(dt, param_grid = dt_params, cv = 5)
gridsearch.fit(X_train_vec, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [10, 20, 30, 50, 100, 200]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [103]:
gridsearch.best_params_

{'max_depth': 20}

In [198]:
gridsearch.score(X_test_vec, y_test)

0.9226804123711341

# Executive Summary
---


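1. Random Forest (default settings) scored 99.6% on the training set and 89.0% on the test set; the grid-searched forest reached a 92.8% cross-validation score.
2. Gradient Boosting reached a 93.5% cross-validation score and 92.8% on the test set.
3. Decision Tree (`max_depth=20`) scored 97.4% on the training set and 93.1% on the test set; the grid-searched tree scored 92.3% on the test set.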

Using `feature_importances_`, I found that the most important word was 'what'. This makes sense because most of the posts that belong to r/AskReddit start with 'What'.

In r/todayilearned, all of the posts begin with 'TIL'. I took that keyword out because it would make the classification too easy. Even with this keyword removed, test accuracy of roughly 89% to 93% across three different models is promising!

Observing which keywords were most important provides insight into what kind of questions most people asked. It seems that a lot of people had scientific questions relating to their physical surroundings, judging from keywords such as 'city', 'earth', and 'called'. I also think people were interested in societal questions, judging from keywords such as 'black' and 'love'. Historical questions on AskReddit were popular too: 'WW2' came up once. These readings are tentative, since feature importances alone do not show which subreddit a word points to.

In the future, I would like to experiment with taking out 'what' and 'whats' to see how the accuracy changes. Since we also know what types of questions are popular, I would like to know whether there is a correlation between receiving Reddit Gold and asking or answering scientific, historical, or societal questions.
