# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills. Collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. If creating a logistic regression, GridSearch Ridge and Lasso for this model and report the best hyperparameter values.
1. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
2. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [139]:
import requests
import json
import pandas as pd
import time
import datetime
from datetime import timedelta
import numpy as np

from bs4 import BeautifulSoup

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
from sklearn import model_selection

import matplotlib.pyplot as plt


#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [140]:
URL = "http://www.reddit.com/hot.json"

num_pages = 120

reddit_posts = []

after = None

#cycle through 25 post groups and append to a list
for _ in range(num_pages):
    req_url = URL + "?after=" + after if after else URL
    res = requests.get(req_url, headers={"User-agent": "Sted Bot 0.1"})
    
    data = res.json()["data"]
    children = data["children"]
    after = data["after"]
    time.sleep(3)
    
    for j in range(len(children)):
        post = children[j]["data"]
        reddit_posts.append(post)
    print("cycles: ", _)

cycles:  0
cycles:  1
cycles:  2
cycles:  3
cycles:  4
cycles:  5
cycles:  6
cycles:  7
cycles:  8
cycles:  9
cycles:  10
cycles:  11
cycles:  12
cycles:  13
cycles:  14
cycles:  15
cycles:  16
cycles:  17
cycles:  18
cycles:  19
cycles:  20
cycles:  21
cycles:  22
cycles:  23
cycles:  24
cycles:  25
cycles:  26
cycles:  27
cycles:  28
cycles:  29
cycles:  30
cycles:  31
cycles:  32
cycles:  33
cycles:  34
cycles:  35
cycles:  36
cycles:  37
cycles:  38
cycles:  39
cycles:  40
cycles:  41
cycles:  42
cycles:  43
cycles:  44
cycles:  45
cycles:  46
cycles:  47
cycles:  48
cycles:  49
cycles:  50
cycles:  51
cycles:  52
cycles:  53
cycles:  54
cycles:  55
cycles:  56
cycles:  57
cycles:  58
cycles:  59
cycles:  60
cycles:  61
cycles:  62
cycles:  63
cycles:  64
cycles:  65
cycles:  66
cycles:  67
cycles:  68
cycles:  69
cycles:  70
cycles:  71
cycles:  72
cycles:  73
cycles:  74
cycles:  75
cycles:  76
cycles:  77
cycles:  78
cycles:  79
cycles:  80
cycles:  81
cycles:  82
cycles:  83
cy

Need to create a function that reads in csv files from folder and concatenates

In [141]:
# Initialize main df
df = pd.DataFrame()

In [142]:
#Add each cycle of scraping to dataframe

for i in reddit_posts:
    df = df.append(pd.DataFrame([i]), ignore_index=True)

In [144]:
df.shape ## checking work

(3000, 85)

In [146]:
##generate length of time
df['now'] = time.time()
df['since_posted'] = df['now'] - df['created_utc']

In [147]:
df.columns  ## Checking work

Index(['approved_at_utc', 'approved_by', 'archived', 'author',
       'author_cakeday', 'author_flair_css_class', 'author_flair_template_id',
       'author_flair_text', 'banned_at_utc', 'banned_by', 'can_gild',
       'can_mod_post', 'clicked', 'contest_mode', 'created', 'created_utc',
       'crosspost_parent', 'crosspost_parent_list', 'distinguished', 'domain',
       'downs', 'edited', 'gilded', 'hidden', 'hide_score', 'id',
       'is_crosspostable', 'is_reddit_media_domain', 'is_self', 'is_video',
       'likes', 'link_flair_css_class', 'link_flair_text', 'locked', 'media',
       'media_embed', 'media_metadata', 'media_only', 'mod_note',
       'mod_reason_by', 'mod_reason_title', 'mod_reports', 'name', 'no_follow',
       'num_comments', 'num_crossposts', 'num_reports', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_categories',
       'post_hint', 'preview', 'previous_visits', 'pwls', 'quarantine',
       'removal_reason', 'report_reasons', 'saved', 

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do two things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/hot.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

In [None]:
## YOUR CODE HERE
# Above loop can be run any number of times

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops of your computer crashes, you don't lose all your data.

In [148]:
# Export to csv
# setup time stamp
current_time = datetime.datetime.now()
date = (  str(current_time.year)+'_'+
            str(current_time.month)+'_'+
            str(current_time.day)+'_'+
            str(current_time.hour)+'_'+
            str(current_time.minute))
filename = 'reddit_hot_posts_' + date + '.csv'

# append results to csv
df.to_csv(filename, mode='a')
print('Results added to CSV!')

Results added to CSV!


## Predicting comments using Random Forests + Another Classifier

#### Load in the the data of scraped results

In [149]:
## YOUR CODE HERE
df = pd.read_csv("reddit_hot_posts_2018_6_1_12_48.csv")

#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 

In [None]:
## YOUR CODE HERE

In [150]:
df['since_posted_days'] = df.since_posted/60/60/24

df['since_posted_hours'] = df.since_posted/60/60

df.shape

(3000, 91)

In [151]:
df.since_posted_days.mean()

0.3276354670686266

In [152]:
median = np.median(df.num_comments)
median

19.0

In [153]:
df['is_outlier'] = df['num_comments'] - median

In [154]:
## create 1 or 0 classification for median

def function(x):
    if x > 0:
        return 1
    else:
        return 0

df['is_outlier_bin'] = list(map(function, df['is_outlier']))
# 

In [184]:
# develop a selection of features for modelling

features = [
    #'title',
    #'is_outlier',
    #'since_posted',
    #'subreddit',
    #'num_comments',
    #'is_outlier_bin',
    'gilded',
    'since_posted_days',
    'num_crossposts',
    'ups',
    'since_posted_hours'
    
]
post_df = df[features]

In [185]:
# Create dummies based on some key features

dummies_subreddit = pd.get_dummies(df['subreddit'], prefix = 'subreddit')
dummies_posthint = pd.get_dummies(df['post_hint'], prefix = 'post_hint')

dummies_posthint.shape

(3000, 6)

In [186]:
## Build dateframe to be modelled against

X_columns_concat = [
    
    post_df,
    dummies_subreddit,
    dummies_posthint
    ]
X = pd.concat(X_columns_concat, axis = 1)
#X = pd.concat(['post_df', 'X'], axis = 1)

In [187]:
X.shape  # Checking work

(3000, 1969)

In [188]:
X.columns  #  Checking work

Index(['gilded', 'since_posted_days', 'num_crossposts', 'ups',
       'since_posted_hours', 'subreddit_13ReasonsWhy', 'subreddit_13or30',
       'subreddit_195', 'subreddit_2healthbars', 'subreddit_2mad4madlads',
       ...
       'subreddit_yourmomshousepodcast', 'subreddit_youseeingthisshit',
       'subreddit_youtubehaiku', 'subreddit_zelda', 'post_hint_hosted:video',
       'post_hint_image', 'post_hint_link', 'post_hint_rich:video',
       'post_hint_self', 'post_hint_video'],
      dtype='object', length=1969)

In [189]:
X.to_csv('X_df.csv', mode='a')  ## in case this is needed later, without re running scraping

In [190]:
#### Marks beginning of modelling section

In [191]:
# Develop training and test sets

y = df["is_outlier_bin"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [192]:
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)

0.7493333333333333

In [163]:
log = LogisticRegression()
log.fit(X_train, y_train)
log.score(X_test, y_test)

0.7506666666666667

In [164]:
#importances = forest.feature_importances_
std = np.std([et.feature_importances_ for tree in et.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

Feature ranking:
1. feature 3 (0.107792)
2. feature 1 (0.095077)
3. feature 1963 (0.046568)
4. feature 2 (0.026340)
5. feature 1964 (0.008443)
6. feature 713 (0.003975)
7. feature 1962 (0.003681)
8. feature 283 (0.003123)
9. feature 1838 (0.003034)
10. feature 1965 (0.002951)
11. feature 1310 (0.002882)
12. feature 237 (0.002822)
13. feature 1785 (0.002785)
14. feature 0 (0.002658)
15. feature 612 (0.002641)
16. feature 771 (0.002601)
17. feature 783 (0.002556)
18. feature 696 (0.002375)
19. feature 1374 (0.002304)
20. feature 318 (0.002295)
21. feature 1324 (0.002265)
22. feature 1098 (0.002244)
23. feature 1906 (0.002134)
24. feature 699 (0.002112)
25. feature 887 (0.002096)
26. feature 544 (0.002079)
27. feature 1461 (0.002061)
28. feature 228 (0.002053)
29. feature 188 (0.002049)
30. feature 700 (0.002041)
31. feature 1706 (0.002018)
32. feature 1869 (0.002010)
33. feature 1895 (0.002005)
34. feature 1925 (0.001985)
35. feature 1010 (0.001960)
36. feature 1301 (0.001941)
37. featur

1011. feature 1024 (0.000136)
1012. feature 133 (0.000135)
1013. feature 1722 (0.000135)
1014. feature 1683 (0.000134)
1015. feature 1096 (0.000134)
1016. feature 1773 (0.000134)
1017. feature 1567 (0.000133)
1018. feature 498 (0.000133)
1019. feature 110 (0.000132)
1020. feature 584 (0.000131)
1021. feature 1075 (0.000130)
1022. feature 823 (0.000130)
1023. feature 84 (0.000129)
1024. feature 1402 (0.000128)
1025. feature 806 (0.000127)
1026. feature 1133 (0.000127)
1027. feature 1666 (0.000127)
1028. feature 661 (0.000127)
1029. feature 1542 (0.000126)
1030. feature 399 (0.000126)
1031. feature 712 (0.000125)
1032. feature 646 (0.000124)
1033. feature 1560 (0.000124)
1034. feature 587 (0.000124)
1035. feature 729 (0.000124)
1036. feature 1232 (0.000122)
1037. feature 886 (0.000120)
1038. feature 1949 (0.000120)
1039. feature 881 (0.000120)
1040. feature 209 (0.000119)
1041. feature 1303 (0.000117)
1042. feature 818 (0.000117)
1043. feature 348 (0.000116)
1044. feature 451 (0.000116)


1510. feature 877 (0.000001)
1511. feature 705 (0.000001)
1512. feature 1385 (0.000001)
1513. feature 816 (0.000001)
1514. feature 559 (0.000001)
1515. feature 1842 (0.000001)
1516. feature 1799 (0.000001)
1517. feature 987 (0.000001)
1518. feature 1614 (0.000001)
1519. feature 767 (0.000001)
1520. feature 1564 (0.000001)
1521. feature 1329 (0.000001)
1522. feature 1198 (0.000001)
1523. feature 350 (0.000001)
1524. feature 515 (0.000001)
1525. feature 200 (0.000001)
1526. feature 73 (0.000001)
1527. feature 1012 (0.000001)
1528. feature 935 (0.000001)
1529. feature 277 (0.000001)
1530. feature 1421 (0.000001)
1531. feature 1333 (0.000001)
1532. feature 205 (0.000001)
1533. feature 926 (0.000001)
1534. feature 840 (0.000001)
1535. feature 121 (0.000001)
1536. feature 836 (0.000001)
1537. feature 1901 (0.000001)
1538. feature 799 (0.000001)
1539. feature 1236 (0.000001)
1540. feature 33 (0.000001)
1541. feature 32 (0.000001)
1542. feature 517 (0.000001)
1543. feature 758 (0.000001)
1544.

IndexError: index 1968 is out of bounds for axis 0 with size 1968

In [166]:
num_trees = 30
model = AdaBoostClassifier(n_estimators=num_trees)
results = model_selection.cross_val_score(model, X, y, cv=4)
print(results.mean())

0.6983454613548943


#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 

In [None]:
## YOUR CODE HERE
rf = RandomForestClassifier(bootstrap = True) ## N_features auto takes square root
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

In [168]:
et = ExtraTreesClassifier(bootstrap = True)
et.fit(X_train, y_train)
et.score(X_test, y_test)

0.764

In [169]:
et.decision_path(X)

(<3000x20382 sparse matrix of type '<class 'numpy.int64'>'
 	with 3344035 stored elements in Compressed Sparse Row format>,
 array([    0,  1971,  4026,  5953,  8092, 10093, 12210, 14105, 16172,
        18311, 20382]))

#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.

In [172]:
from sklearn.feature_extraction.text import CountVectorizer

In [199]:
text = []
for i in df['title']:
    text.append(i)

In [200]:
text

['This news paper from the Dominican republic used a picture of Alec Baldwin as Trump',
 'Kim k forgets her baby',
 'I make paintings of cars. This is my latest work! (OC)',
 'How to make air goggle to see underwater',
 'I love Mick Jagger',
 'Bless Online trailer plagiarized How to train your dragon',
 'Twist it, pull it, bop it',
 'WCGW trying to get the ball down field? (sound on)',
 'Current mood.',
 'Give me an R rated Boba Fett movie with some real dark underworld action, and I’ll never ask for anything ever again.',
 "Dylan's addiction to Robitussin was affecting his state of mind but his mother was tired of mopping the floors",
 'How do I look?',
 'Child for sale',
 'Asterix : The Secret of the Magic Potion - Official Poster',
 "50 of us jumped into same games for this, It's time.",
 'About 7,000 years ago, something weird happened to men: the genetic diversity of their Y chromosomes collapsed. It was as if there was only one man left to mate for every 17 women. The collapse ma

In [207]:
#text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# # summarize
print(vectorizer.vocabulary_)
# # encode document
vector = vectorizer.transform(text)
# # summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

{'this': 6849, 'news': 4678, 'paper': 4971, 'from': 2828, 'the': 6823, 'dominican': 2177, 'republic': 5708, 'used': 7184, 'picture': 5143, 'of': 4793, 'alec': 478, 'baldwin': 814, 'as': 690, 'trump': 7047, 'kim': 3805, 'forgets': 2762, 'her': 3239, 'baby': 785, 'make': 4194, 'paintings': 4947, 'cars': 1326, 'is': 3605, 'my': 4587, 'latest': 3916, 'work': 7513, 'oc': 4782, 'how': 3370, 'to': 6912, 'air': 453, 'goggle': 2998, 'see': 6031, 'underwater': 7127, 'love': 4122, 'mick': 4400, 'jagger': 3641, 'bless': 1009, 'online': 4830, 'trailer': 6980, 'plagiarized': 5174, 'train': 6981, 'your': 7603, 'dragon': 2214, 'twist': 7081, 'it': 3621, 'pull': 5434, 'bop': 1068, 'wcgw': 7391, 'trying': 7052, 'get': 2946, 'ball': 815, 'down': 2202, 'field': 2641, 'sound': 6367, 'on': 4826, 'current': 1879, 'mood': 4509, 'give': 2972, 'me': 4313, 'an': 544, 'rated': 5544, 'boba': 1038, 'fett': 2634, 'movie': 4548, 'with': 7484, 'some': 6346, 'real': 5567, 'dark': 1939, 'underworld': 7129, 'action': 373

In [208]:
vector = vector.toarray()
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 

In [210]:
y = df["is_outlier_bin"]

X_train, X_test, y_train, y_test = train_test_split(vector, y)

In [211]:
## YOUR CODE HERE
et_vectors = ExtraTreesClassifier(bootstrap = True)
et_vectors.fit(X_train, y_train)
et_vectors.score(X_test, y_test)

0.588

#### Repeat the model-building process with a non-tree-based method.

In [212]:
## YOUR CODE HERE
log = LogisticRegression()
log.fit(X_train, y_train)
log.score(X_test, y_test)

0.5693333333333334

#### Use Count Vectorizer from scikit-learn to create features from the thread titles. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

Reddit reaches 30 million users daily. There is at least a 25% opportunity to increase add revenue from targeting communities based on post content and community interaction. Through modern data science techniques, we can model these interactions and make effective predictions to maximize add revenue.

Currently advertisements are served up randomly to the entire reddit community. This is neither effective for our advertisers, nor is it helpful for our users who could benefit from adds that interest them specifically.

Using techniques in data science such as natural language processing and logistic regression in combination with well engineered user tracking on the reddit website, we can cater our advertisements directly to those that are interested in the product or service being sold. 

With our help, Facebook and Google have been leading the way, generating record revenue with extensive use of this model.

Our team is ready to help Reddit implement these proven strategies within their communities. It is time to bring value to the advertisements being served to the Reddit community by quickly executing a proven strategy in modern data science. 


![image.png](attachment:image.png)

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE