# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills. Collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [1]:
import requests
import json

In [2]:
URL = "http://www.reddit.com/r/science.json"

In [3]:
res = requests.get(URL, headers={'User-agent': 'MAC Bot 0.1'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [5]:
data = res.json()

In [10]:
data

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'science',
     'selftext': '',
     'author_fullname': 't2_f47vk',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': "One man's bipolar symptoms have been linked to the lunar cycle. Typically, our circadian rhythm (biological clock) is linked to the sun, but this patient's circadian rhythm was also linked to the moon. Every new moon, he experienced insomnia and switched from a depressive episode to a manic episode.",
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/science',
     'hidden': False,
     'pwls': 6,
     'link_flair_css_class': 'psych',
     'downs': 0,
     'thumbnail_height': None,
     'parent_whitelist_status': 'all_ads',
     'hide_score': False,
     'name': 't3_9d4xh9',
     'quarantine': False,
     'link_flair_text_color': 'light',
     'author_fl

In [8]:
print(len(data['data']['children']))

25


In [7]:
data['data']['after']

't3_9czhkg'

In [15]:
data['data']['children'][0]['data']['title']

"One man's bipolar symptoms have been linked to the lunar cycle. Typically, our circadian rhythm (biological clock) is linked to the sun, but this patient's circadian rhythm was also linked to the moon. Every new moon, he experienced insomnia and switched from a depressive episode to a manic episode."

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do two things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

In [16]:
import time
import pandas as pd

The following is a function that will scrape 1000 posts from each subreddit.

In [46]:
df = pd.DataFrame()

for subreddit in ['science', 'philosophy']:
    url = 'http://www.reddit.com/r/' + subreddit + '.json?'
    for _ in range(40):
        res = requests.get(url, headers={'User-agent': 'wnm Bot 0.1'})
        data = res.json()
        for j in range(25):
            entry = data['data']['children'][j]['data']['title']
            to_append = pd.DataFrame({'post': [entry], 'topic': [subreddit]})
            df = df.append(to_append, ignore_index=True)
        url += 'after=' + data['data']['after']
    time.sleep(3)

In [47]:
df.shape

(2000, 2)

In [48]:
df.head()

Unnamed: 0,post,topic
0,A new study of 100 hunter-gatherers cultures s...,science
1,One man's bipolar symptoms have been linked to...,science
2,No evidence that moral reminders reduce cheati...,science
3,The “chicken or egg” paradox was first propose...,science
4,People who are more well-off were made happier...,science


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops of your computer crashes, you don't lose all your data.

In [346]:
# Export to csv
df.to_csv('data.csv')

In [347]:
#Let's change science to 1 and philosophy to 0

df = pd.read_csv('data.csv', index_col = 'Unnamed: 0')

In [348]:
df['topic'].loc[0:999] = 1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [349]:
df['topic'].loc[1000:] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [350]:
df.sample(5)

Unnamed: 0,post,topic
1172,The Monarchy of Fear - Interview with Martha N...,0
847,Scientists pioneer a new way to turn sunlight ...,1
1621,On the Falsehood of Philosophy: a skeptic’s pa...,0
481,First known omnivorous shark species identified,1
405,More than a billion adults around the world ar...,1


Here you can see I have changed "science" posts to 1, and "philosophy" posts to 0.

In [351]:
df['topic'].value_counts()

1    1000
0    1000
Name: topic, dtype: int64

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [352]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

In [353]:
X = df['post']
y = df['topic']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [354]:
#Add 'science' and 'philosophy' to stopwords. Otherwise it would be boring!
to_add = ['science', 'philosophy']
stops = list(stopwords.words('english'))

stops.extend(to_add)

In [355]:
cvec = CountVectorizer(stop_words = stops, ngram_range = (1,2))
cvec.fit(X_train)
X_train_cvec = cvec.transform(X_train)
X_test_cvec = cvec.transform(X_test)

In [356]:
tfidf = TfidfVectorizer(stop_words = stops, ngram_range = (1,2))
tfidf.fit(X_train)
X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [357]:
X_train_cvec.todense().shape

(1500, 1934)

In [358]:
X_train_tfidf.todense().shape

(1500, 1934)

## Predicting subreddit using Random Forests + Another Classifier

In [345]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [None]:
#Did this above

#### Thought experiment: What is the baseline accuracy for this model?

Since 50% of the posts are science, and 50% are philosophy, the baseline accuracy is 50%. If we guessed that all the posts were science (or that all were philosophy, in this case, since the classes are equal), we would achieve 50% accuracy. We want our model to do better than this.

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [359]:
rf_cvec = RandomForestClassifier(random_state = 10)
rf_cvec.fit(X_train_cvec, y_train)
rf_cvec.score(X_test_cvec, y_test)

0.992

In [360]:
rf_tfidf = RandomForestClassifier(random_state = 10)
rf_tfidf.fit(X_train_tfidf, y_train)
rf_tfidf.score(X_test_tfidf, y_test)

0.992

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

In [361]:
cross_val_score(rf_cvec, X_train_cvec, y_train).mean()

0.9873333013332054

In [362]:
cross_val_score(rf_tfidf, X_train_cvec, y_train).mean()

0.9873333013332054

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [370]:
cvec_models = {'rf_cvec': [RandomForestClassifier(random_state = 1), {'n_estimators' : [10, 20, 30],
                                                'max_depth' : [10,20,30]}],
         'et_cvec': [ExtraTreesClassifier(random_state = 1), {'n_estimators' : [20, 30, 40],
                                                'max_depth' : [20, 30, 40]}],
         'bag_cvec': [BaggingClassifier(random_state = 1), {'n_estimators' : [5, 10, 15]}],
              'ada_cvec': [AdaBoostClassifier(random_state = 1), {'n_estimators' : [60,70,80]}],
              'mnb_cvec': [MultinomialNB(), {'alpha': [1.0e-10, 1.0, 2.0]}],
              'knn_cvec': [KNeighborsClassifier(), {'n_neighbors': [3,5,7]}],
              'lg_cvec': [LogisticRegressionCV(random_state = 1), {'Cs': [1, 10, 15]}],
              'svc_cvec': [SVC(random_state = 1), {'C': [0.25, 0.5, 1.0], 'kernel': ['rbf', 'linear', 'poly']}]}

In [371]:
for k, v in cvec_models.items():
    gs = GridSearchCV(v[0], v[1])
    gs.fit(X_train_cvec, y_train)
    print(k, gs.best_params_, gs.score(X_test_cvec, y_test))

rf_cvec {'max_depth': 20, 'n_estimators': 10} 0.992
et_cvec {'max_depth': 20, 'n_estimators': 40} 0.992
bag_cvec {'n_estimators': 5} 0.994
ada_cvec {'n_estimators': 70} 0.996
mnb_cvec {'alpha': 1e-10} 0.992
knn_cvec {'n_neighbors': 3} 0.99
lg_cvec {'Cs': 10} 0.994
svc_cvec {'C': 0.25, 'kernel': 'linear'} 0.994


In [372]:
tfidf_models = {'rf_tfidf': [RandomForestClassifier(random_state = 1), {'n_estimators' : [10, 20, 30],
                                                'max_depth' : [10,20,30]}],
         'et_tfidf': [ExtraTreesClassifier(random_state = 1), {'n_estimators' : [20, 30, 40],
                                                'max_depth' : [20, 30, 40]}],
         'bag_tfidf': [BaggingClassifier(random_state = 1), {'n_estimators' : [5, 10, 15]}],
              'ada_tfidf': [AdaBoostClassifier(random_state = 1), {'n_estimators' : [60, 70, 80]}],
              'mnb_tfidf': [MultinomialNB(), {'alpha': [1.0e-10, 1.0, 2.0]}],
              'knn_tfidf': [KNeighborsClassifier(), {'n_neighbors': [3,5,7]}],
              'lg_tfidf': [LogisticRegressionCV(random_state = 1), {'Cs': [1, 10, 15]}],
              'svc_tfidf': [SVC(random_state = 1), {'C': [0.5, 1.0, 2.0], 'kernel': ['rbf', 'linear', 'poly']}]}

In [373]:
for k, v in tfidf_models.items():
    gs = GridSearchCV(v[0], v[1])
    gs.fit(X_train_tfidf, y_train)
    print(k, gs.best_params_, gs.score(X_test_tfidf, y_test))

rf_tfidf {'max_depth': 30, 'n_estimators': 10} 0.994
et_tfidf {'max_depth': 30, 'n_estimators': 30} 0.994
bag_tfidf {'n_estimators': 5} 0.994
ada_tfidf {'n_estimators': 70} 0.996
mnb_tfidf {'alpha': 1e-10} 0.992
knn_tfidf {'n_neighbors': 3} 0.994
lg_tfidf {'Cs': 10} 0.994
svc_tfidf {'C': 0.5, 'kernel': 'linear'} 0.994


With either vectorizer, the AdaBoost model did the best. Let's check out where it went wrong.

In [374]:
ada_cvec = AdaBoostClassifier(n_estimators = 70)
ada_cvec.fit(X_train_cvec, y_train)

ada_tfidf = AdaBoostClassifier(n_estimators = 70)
ada_tfidf.fit(X_train_cvec, y_train)

results= pd.DataFrame()

results['real'] = y_test
results['preds_ada_cvec'] = mnb_cvec.predict(X_test_cvec)
results['preds_ada_tfidf'] = mnb_tfidf.predict(X_test_tfidf)

In [375]:
results[results['real'] != results['preds_ada_cvec']]

Unnamed: 0,real,preds_ada_cvec,preds_ada_tfidf
1048,0,1,1
1041,0,1,1
1029,0,1,1
33,1,0,0


In [376]:
results[results['real'] != results['preds_ada_tfidf']]

Unnamed: 0,real,preds_ada_cvec,preds_ada_tfidf
1048,0,1,1
1041,0,1,1
1029,0,1,1
33,1,0,0


It's the same. So let's look into them.

In [399]:
print(df['post'].iloc[1048])
print(df['post'].iloc[1041])
print(df['post'].iloc[1029])
print(df['post'].iloc[33])

How We Can Access New Realms Through Kindness – Josia Nakash – Medium
To be happier, focus on what’s within your control
Berlin, Rawls and Nozick, liberty and the collective.
Domestic horses ( Equus caballus ) discriminate between negative and positive human nonverbal vocalisations


In [391]:
features = pd.DataFrame()
features['importances'] = ada_cvec.feature_importances_
features['features'] = cvec.get_feature_names()

In [396]:
features.sort_values('importances', ascending = False).head(10)

Unnamed: 0,importances,features
670,0.085714,first
1665,0.057143,study
185,0.057143,bone
1832,0.057143,used
720,0.057143,future
1526,0.057143,scientists
565,0.057143,every
1571,0.042857,show
143,0.042857,bacteria
1615,0.028571,species


In [400]:
string = "So like, right now for example. The Haitians need to come to America. But some people are all, “What about the strain on our resources?” Well it’s like when I had this garden party for my father’s birthday. I put R.S.V.P. ’cause it was a sit-down dinner. But some people came that like did not R.S.V.P. I was totally buggin’. I had to haul ass to the kitchen, redistribute the food, and squish in extra place settings. But by the end of the day it was, like, the more the merrier. And so if the government could just get to the kitchen and rearrange some things we could certainly party with the Haitians. And in conclusion may I please remind you it does not say R.S.V.P. on the Statue of Liberty! Thank you very much."

In [403]:
cher = cvec.transform([string])

In [404]:
ada_cvec.predict(cher)

array([0])

In [405]:
trump = ['The Woodward book is a scam. I don’t talk the way I am quoted. If I did I would not have been elected President. These quotes were made up. The author uses every trick in the book to demean and belittle. I wish the people could see the real facts - and our country is doing GREAT!']

In [406]:
tweet = cvec.transform(trump)

In [407]:
ada_cvec.predict(tweet)

array([1])

# Executive Summary
---
Put your executive summary in a Markdown cell below.

The goal of the present project was to build and optimize a model that is able to distinguish between Reddit posts on science and philosophy. Reddit, a discussion forum, contain many "subreddits," each on a different topic. The site provides an API for webscraping. which was used in the project. From each of the subreddits r/science and r/philosophy were gathered 1000 posts. The posts were then put through a count vectorizer and a term frequency/inverse document frequency vectorizer, separately. A number of models were built in the process, including random forests, extra trees, bagging classifiers, adaptive boost classifiers, multinomial naive Bayes classifiers, k-nearest neighbors classifiers, logistic regressions, and support vector machine classifiers. Each of the models were optimized under a variety of conditions to predict whether a post in the test set appeared in the science or philosophy subreddits. The adaptive boost model performed the best, with n_estimators = 70. The choice of vectorizer made no difference in the accuracy, nor the specific posts the model mischaracterized. Within the 500 element test set, the model chose incorrectly only 4 times, with 3 philosophy posts incorrectly dubbed as science, and one vice versa. The accuracy of the model was 99.6%. Although common stopwords, including, in this case, "science" and "philosophy," were removed before vectorization, the most important features were still characteristically scientific, including "bacteria," "species," "study," and "future."