# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this project, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
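
As a hedged sketch of the request above, the helper below wraps the call so that a throttled response surfaces as an exception instead of a confusing JSON error later on. The name `fetch_listing`, the `get` parameter (present only so the function can be exercised without network access), and the 10-second timeout are assumptions, not part of the assignment.

```python
import requests

URL = "http://www.reddit.com/r/boardgames.json"
HEADERS = {"User-agent": "YOUR NAME Bot 0.1"}  # placeholder; use your own name


def fetch_listing(url, headers=HEADERS, get=requests.get):
    """Fetch one page of a subreddit listing and return it as a dictionary."""
    res = get(url, headers=headers, timeout=10)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx codes, including 429.
    res.raise_for_status()
    return res.json()
```

Calling `fetch_listing(URL)` should return the same dictionary that `res.json()` produces in the cells that follow.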

In [1]:
import requests
import json

In [2]:
URL = "http://www.reddit.com/r/boardgames.json"

In [3]:
# I ended up using Node and the pushshift.io API

# the code for getting the files is located in the pushshift.js file

# I also made a function for merging all of the files in the merge.js file

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```
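
Each entry in `data['data']['children']` wraps one post under its own `'data'` key. The sketch below pulls a few fields out of a listing; the helper name `extract_posts`, the choice of fields, and the hand-written `sample` dictionary are illustrative assumptions, not real Reddit data.

```python
def extract_posts(listing):
    """Pull a few fields out of each post in a Reddit listing dictionary."""
    posts = []
    for child in listing["data"]["children"]:
        post = child["data"]
        posts.append({
            "name": post["name"],
            "title": post["title"],
            "subreddit": post["subreddit"],
            # Not every thread has a description, so fall back to an empty string.
            "selftext": post.get("selftext", ""),
        })
    return posts


# Hand-written sample in the shape of a listing response (not real data).
sample = {
    "kind": "Listing",
    "data": {
        "after": "t3_example2",
        "children": [
            {"kind": "t3", "data": {"name": "t3_example1", "title": "First post",
                                    "subreddit": "boardgames", "selftext": ""}},
            {"kind": "t3", "data": {"name": "t3_example2", "title": "Second post",
                                    "subreddit": "boardgames", "selftext": "Body text"}},
        ],
    },
}

posts = extract_posts(sample)
print(len(posts))         # 2
print(posts[0]["title"])  # First post
```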

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
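
The three steps and the throttling note can be combined into one loop. This is a minimal sketch rather than a reference solution: `fetch_page`, `collect_posts`, the `max_posts` and `pause` parameters, and the 10-second timeout are all assumed names and values. The page fetcher is passed in as an argument so the loop can be tried out without touching the network.

```python
import time

import requests

URL = "http://www.reddit.com/r/boardgames.json"
HEADERS = {"User-agent": "YOUR NAME Bot 0.1"}  # placeholder name


def fetch_page(after=None):
    """Request one listing page; `after` is the token from the previous page."""
    params = {"after": after} if after else {}
    res = requests.get(URL, params=params, headers=HEADERS, timeout=10)
    res.raise_for_status()
    return res.json()


def collect_posts(fetch, max_posts=100, pause=3):
    """Follow the `after` token until `max_posts` posts have been collected."""
    posts = []
    after = None  # None requests the first page
    while len(posts) < max_posts:
        listing = fetch(after)
        children = listing["data"]["children"]
        if not children:
            break  # nothing more to read
        posts.extend(child["data"] for child in children)
        after = listing["data"]["after"]
        if after is None:
            break  # no token means this was the last page
        time.sleep(pause)  # throttle the loop between requests
    return posts[:max_posts]
```

A call such as `collect_posts(fetch_page, max_posts=500)` would then gather roughly twenty pages at the default page size of 25.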

In [4]:
## YOUR CODE HERE

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
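
One way to save regularly is to append each batch of posts to the same CSV file. The sketch below assumes each post is a flat dictionary with the same keys in every batch; the helper name `append_posts_to_csv` and the file path are hypothetical.

```python
import os

import pandas as pd


def append_posts_to_csv(posts, path):
    """Append a batch of post dictionaries to `path`, creating the file if needed."""
    if not posts:
        return
    df = pd.DataFrame(posts)
    # Write the header row only when the file does not exist yet.
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False)
```

Calling this once per page inside the scraping loop means a crash loses at most the page being fetched at the time.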

In [5]:
import pandas as pd

In [97]:
sm = pd.read_json("./json2/SequelMemes.json")
pm = pd.read_json("./json2/PrequelMemes.json")


In [130]:
# .copy() gives each frame its own data, which avoids pandas' SettingWithCopyWarning
sm_titles = sm[["title"]].copy()
pm_titles = pm[["title"]].copy()

sm_titles["is_sequel_meme"] = 1
pm_titles["is_sequel_meme"] = 0

meme_titles = pd.concat([pm_titles,sm_titles])
print(meme_titles.head())
print(meme_titles.tail())

                                title  is_sequel_meme
0                      Drunk Politics               0
1                 When the Fun Begins               0
2                      Just one Windu               0
3  dlmoisttlotjidnftdsaydihbpjfastmne               0
4                     Drunk Democracy               0
                                                   title  is_sequel_meme
14595                        His swoleness got him #6!!!               1
14596  Looks like someone at my local brewery is a Se...               1
14597  MAGA.... Nah! Time to make the Republic great ...               1
14598           Take On Me except it's Leia slapping Poe               1
14599           Take On Me except it's Leia slapping Poe               1


In [103]:
corpus = meme_titles["title"].tolist()

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
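
To make the difference between count, binary, and TF-IDF features concrete, here is a small sketch on three invented titles (they are not from the scraped data). `binary=True` is a real `CountVectorizer` option that records only whether a term appears.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy titles, invented for illustration only.
titles = [
    "hello there hello there",
    "general kenobi",
    "hello general",
]

count_vec = CountVectorizer()              # raw term counts
binary_vec = CountVectorizer(binary=True)  # 1 if the term appears at all
tfidf_vec = TfidfVectorizer()              # counts reweighted by how rare a term is

counts = count_vec.fit_transform(titles)
binary = binary_vec.fit_transform(titles)
tfidf = tfidf_vec.fit_transform(titles)

# Columns are ordered alphabetically: general, hello, kenobi, there
print(counts.toarray()[0])  # [0 2 0 2]
print(binary.toarray()[0])  # [0 1 0 1]
print(tfidf.toarray()[0].round(2))
```

Swapping one vectorizer for another while keeping the classifier fixed is a simple way to check whether the feature type changes model performance.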

In [125]:
## YOUR CODE HERE
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Fit the vectorizer on our corpus
cvec = CountVectorizer()
cvec.fit(corpus)

# Transform the corpus
new_corpus = cvec.transform(corpus)

In [134]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(new_corpus.toarray(),
                  columns=cvec.get_feature_names_out())
df.head()

   00  000  00001  00100000  007  009  00am  01  01100001  01100100  ...  œìž  œðÿ  širl  šã  šðÿ  žirl  žã  žæ  žè  ˆì
0   0    0      0         0    0    0     0   0         0         0  ...    0    0     0   0    0     0   0   0   0   0
1   0    0      0         0    0    0     0   0         0         0  ...    0    0     0   0    0     0   0   0   0   0
2   0    0      0         0    0    0     0   0         0         0  ...    0    0     0   0    0     0   0   0   0   0
3   0    0      0         0    0    0     0   0         0         0  ...    0    0     0   0    0     0   0   0   0   0
4   0    0      0         0    0    0     0   0         0         0  ...    0    0     0   0    0     0   0   0   0   0


In [None]:
# borrowed from http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

vectorizers = [
    ("cvec", CountVectorizer()),
    ("cvec_sw", CountVectorizer(stop_words='english')),
    ("tfidf", TfidfVectorizer()),
    ("tfidf_sw", TfidfVectorizer(stop_words='english')),
]

classifiers = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
    ("tree", DecisionTreeClassifier()),
    ("rfc", RandomForestClassifier()),
    ("ada", AdaBoostClassifier()),
    ("gcp", GaussianProcessClassifier()),
#     ("mlp", MLPClassifier()),
#     ("bnb", BinomialNB()),
    ("qda", QuadraticDiscriminantAnalysis())
]

# The labels do not depend on the vectorizer, so build them once as a 1-D Series
y = meme_titles["is_sequel_meme"]

for vec_name, vec in vectorizers:
    # Fit the vectorizer and transform the corpus in one step
    new_corpus = vec.fit_transform(corpus)
    # Dense features, as in the original; QDA, for one, does not accept sparse input
    X = pd.DataFrame(new_corpus.toarray(),
                     columns=vec.get_feature_names_out())
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    for clf_name, clf in classifiers:
        clf.fit(X_train, y_train)
        print(f"{vec_name} + {clf_name}")
        print(f"Train data: {clf.score(X_train, y_train)}")
        # Score against the held-out labels (the original passed X_test twice)
        print(f"Test data: {clf.score(X_test, y_test)}")

## Predicting subreddit using Random Forests + Another Classifier

In [109]:
## YOUR CODE HERE
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X = meme_titles[["title"]]
y = meme_titles[["is_sequel_meme"]]

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [110]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)


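ValueError: could not convert string to float: 'When you jump on a meme bandwagon, but instantly have second thoughts.'

The fit fails because `X_train` still holds raw title strings. A random forest needs numeric features, so the titles have to be vectorized before they reach the classifier.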
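
One way to avoid the `ValueError` above is to chain the vectorizer and the classifier in a `Pipeline`, so raw strings are converted to counts before the forest sees them. The sketch below uses a small invented stand-in for `meme_titles`; the titles, labels, `n_estimators=50`, and `random_state=42` are all assumptions chosen to keep the example quick and repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for meme_titles; the titles and labels are invented.
titles = [
    "general kenobi", "hello there", "this is where the fun begins",
    "i have the high ground", "rey and kylo", "porgs everywhere",
    "snoke theory", "the last jedi poster",
] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, stratify=labels, random_state=42
)

model = Pipeline([
    ("cvec", CountVectorizer()),  # strings -> sparse count matrix
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)  # raw strings go in; the pipeline vectorizes them
print(model.score(X_test, y_test))
```

Because the vectorizer is fitted inside the pipeline, its vocabulary comes from the training titles only, which keeps test-set words from leaking into the features.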

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [13]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
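
A common reading of "baseline accuracy" is the score you would get by always predicting the majority class. A minimal sketch with invented labels (the real class balance depends on how many posts you scraped from each subreddit):

```python
import pandas as pd

# Invented labels: 1 = sequel meme, 0 = prequel meme.
y = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])

# Share of each class; the largest share is the majority-class baseline.
class_shares = y.value_counts(normalize=True)
baseline = class_shares.max()
print(baseline)  # 0.6 for these labels
```

A model is only useful if its test accuracy clearly beats this number.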

In [14]:
## YOUR CODE HERE

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [15]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
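
The sketch below shows one possible shape for both bullets: `cross_val_score` for a plain accuracy estimate, then `GridSearchCV` over a `Pipeline`. The toy titles, the step names `vec` and `rf`, and the parameter grid are assumptions picked for illustration; a real search would use the scraped titles and a grid suited to them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in for the scraped titles; invented for illustration.
titles = [
    "general kenobi", "hello there", "this is where the fun begins",
    "i have the high ground", "rey and kylo", "porgs everywhere",
    "snoke theory", "the last jedi poster",
] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 5

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=25, random_state=42)),
])

# Five-fold cross-validated accuracy of the untuned pipeline.
scores = cross_val_score(pipe, titles, labels, cv=5, scoring="accuracy")
print(scores.mean())

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vec__stop_words": [None, "english"],
    "vec__ngram_range": [(1, 1), (1, 2)],
    "rf__max_depth": [None, 5],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(titles, labels)
print(search.best_params_)
print(search.best_score_)
```

Other metrics can be requested by changing `scoring`, for example `"f1"` or `"roc_auc"` for a binary target.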

In [16]:
## YOUR CODE HERE

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB` or `LogisticRegression`)
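
As one hedged example of a second classifier, the sketch below fits a `LogisticRegression` on TF-IDF features and lists the tokens with the most negative and most positive coefficients, which is one rough way to see which words pull a title toward each class. The toy titles and the step names are assumptions; the token order is rebuilt from `vocabulary_` so the code does not depend on which feature-name method your scikit-learn version provides.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the scraped titles; invented for illustration.
titles = [
    "general kenobi", "hello there", "this is where the fun begins",
    "i have the high ground", "rey and kylo", "porgs everywhere",
    "snoke theory", "the last jedi poster",
] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 5

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])
pipe.fit(titles, labels)

vocab = pipe.named_steps["vec"].vocabulary_  # token -> column index
tokens = sorted(vocab, key=vocab.get)        # tokens in column order
coefs = pipe.named_steps["logreg"].coef_[0]  # one weight per token

order = np.argsort(coefs)
print("pull toward class 0:", [tokens[i] for i in order[:3]])
print("pull toward class 1:", [tokens[i] for i in order[-3:]])
```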

In [17]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.