# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills. Collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [26]:
# Import libraries
import requests
import json
import time
import pandas as pd
import numpy as np
import regex as re
from bs4 import BeautifulSoup 
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

In [27]:
URL = "http://www.reddit.com/r/boardgames.json"

In [28]:
res = requests.get(URL, headers={'User-agent': 'Polina Bot 0.1'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [29]:
data = res.json()
print(data['data'].__len__())

5


#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do two things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

In [30]:
from pathlib import Path
import time

# check if my file exists
project_data = Path("./project_data.csv")

if project_data.is_file():  
    # if does, read it
    df = pd.read_csv('./project_data.csv',index_col=0)
    
else:
    # else read data from server
    all_posts =[]
    for subreddit in ['cats','dogs']:
        url = 'http://www.reddit.com/r/'+subreddit+'.json'
        num_posts = 0
        while num_posts < 1000: 
            # construct a list of 500 for each subreddit

            # Get the posts by hitting the url, put it in json and store it
            res = requests.get(url, headers={'User-agent': 'polina'})
            data = res.json()
            
            # save only the posts out of the json into the list_of_posts, then
            # add all the posts to the all_posts list
            list_of_posts = data['data']['children']
            all_posts = all_posts + list_of_posts
            num_posts += len(list_of_posts)
            
            #print('The current number of posts: ', num_posts)    

            # reassign the after to the current 'after', and then update the url to hit
            if data['data']['after'] == None:
                # something goes wrong
                print('Server returned after=None')
                break
            else:   
                after = data['data']['after']
                url = 'http://www.reddit.com/r/'+subreddit+'.json?after=' + after

            # go to sleep for 3 seconds so you do not overwhelm reddit and get kicked out

            time.sleep(3)
    
    # extract data    
    all_posts_data = [x['data']['selftext'] for x in all_posts]
    # extract titles
    all_posts_titles = [x['data']['title'] for x in all_posts]
    all_posts_subreddit = [x['data']['subreddit'] for x in all_posts]
    
    # extract descriptions, and remember, that not all of them have description
    for i in range(len(all_posts)):
        try:
            all_posts_titles[i] += all_posts[i]['data']['secure_media']['oembed']['description']
        except:
            pass
    
    # create dataframe
    df = pd.DataFrame(all_posts_data,columns=['data'])
    df['title'] = all_posts_titles
    df['subreddit'] = all_posts_subreddit
    
    # Export to csv
    df.to_csv('project_data.csv')

Server returned after=None
Server returned after=None


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops of your computer crashes, you don't lose all your data.

> We saved to csv above

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

> Do some preprocessing:
- Lead our text to lower case
- Leave letters only
- Use lemmatizer
- Delete stop words

In [31]:
def preprocessing(raw_text):
    
    # 1. Convert to lowercase
    lower_text = raw_text.lower()
    
    # 2. Remove punctuation
    letters_only = re.sub("[^a-z]",     # The pattern to search for
                          " ",          # The pattern to replace it with
                          lower_text )  # The text to search
    
    # 3. Split and lemmatize words
    words = letters_only.split()
    lemmatizer = WordNetLemmatizer()
    words_lem = [lemmatizer.lemmatize(i) for i in words]
     
    # 4. Remove stop words
    stops = set(stopwords.words('english'))
    meaningful_words = [w for w in words if w not in stops]

    
    # 5. Join the words back into one string separated by space, 
    # and return the result.
    return (' '.join(meaningful_words))
  

> Create new feature with preparing title

In [32]:
df['prep_title']=df['title'].map(preprocessing)

> Create a target column from subreddit: 1 if it 'boardgames' and 0 else.

In [33]:
df.head()

Unnamed: 0,data,title,subreddit,prep_title
0,,Please do us mods a favor and if you report so...,cats,please us mods favor report something claiming...
1,,Drew a nice picture of a cat on my iPad the ot...,cats,drew nice picture cat ipad day black backgroun...
2,,This is my feline. There is no tragic protect ...,cats,feline tragic protect story definitely youthfu...
3,,His name is Meow Meow. He is my pride and joy....,cats,name meow meow pride joy years old ever since ...
4,,My babies. Just wanted to celebrate the love I...,cats,babies wanted celebrate love girls probably re...


In [34]:
df['target'] = df['subreddit'].map(lambda x: 1 if x=='dogs' else 0)
y = df['target']

> Create X from 'title' column by CountVectorizer

In [35]:
cvec = CountVectorizer(stop_words='english')

cvec_X_data = cvec.fit_transform(df['prep_title'])

cvec_X  = pd.DataFrame(cvec_X_data.todense(),
                   columns=cvec.get_feature_names())

> Use cross-validation and logistic regression to estimate result of count vectorizing

In [36]:
kf = KFold(n_splits=5, shuffle=True,random_state=42)
logreg = LogisticRegression(random_state=42)
print('Logreg score for CountVectorizer: ',cross_val_score(logreg,cvec_X,y,cv=kf).mean())

Logreg score for CountVectorizer:  0.9779403763531642


> Create X from 'prep_title' column by TfidVectorizer

In [37]:
tfid = TfidfVectorizer(stop_words='english')

tfid_X_data = tfid.fit_transform(df['prep_title'])

tfid_X  = pd.DataFrame(tfid_X_data.todense(),
                   columns=cvec.get_feature_names())

> Use cross-validation and logistic regression to estimate result of Tfid vectorizing

In [38]:
logreg = LogisticRegression(random_state=42)
print('Logreg score for TfidVectorizer: ',cross_val_score(logreg,tfid_X,y,cv=kf).mean())

Logreg score for TfidVectorizer:  0.9699742969175063


## Predicting subreddit using Random Forests + Another Classifier

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

> We already create out target variable

#### Thought experiment: What is the baseline accuracy for this model?

In [39]:
df['target'].value_counts(normalize = True)

0    0.541054
1    0.458946
Name: target, dtype: float64

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

> Use random forest on both cvec and tfid samples

In [40]:
cvec_rf = RandomForestClassifier(random_state=42)
print('Random forest score for CountVectorizer: ',cross_val_score(cvec_rf,cvec_X,y,cv=kf).mean())

Random forest score for CountVectorizer:  0.9791729986304196


In [41]:
tfid_rf = RandomForestClassifier(random_state=42)
print('Random forest score for TfidVectorizer: ',cross_val_score(tfid_rf,tfid_X,y,cv=kf).mean())

Random forest score for TfidVectorizer:  0.9816269863604811


#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

> We did it above

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

> Use multinomial Naive Bayes

In [42]:
cvec_nb = MultinomialNB()
print('Multinomial NB score for CountVectorizer: ',cross_val_score(cvec_nb,cvec_X,y,cv=kf).mean())

Multinomial NB score for CountVectorizer:  0.9638562128290274


In [43]:
tfid_nb = MultinomialNB()
print('Multinomial NB score for TfidVectorizer: ',cross_val_score(tfid_nb,tfid_X,y,cv=kf).mean())

Multinomial NB score for TfidVectorizer:  0.9607887281664509


> We can see that Random Forest much better on both cvec and tfid samples, than multinomial NB

> Use support vectors machine

In [44]:
from sklearn import svm

svc = svm.SVC()
params = {
     'kernel':['rbf','linear','poly','sigmoid']

}
gs = GridSearchCV(svc,param_grid=params)
gs.fit(cvec_X,y)
print("SVM score for CountVectorizer",gs.best_score_)
gs.best_params_

SVM score for CountVectorizer 0.9822303921568627


{'kernel': 'linear'}

> Do the same but for TfidVectorizer

In [45]:
svc = svm.SVC()
params = {
    'kernel':['linear','poly'],
    'C':[0.5,1,1.5]

}
gs = GridSearchCV(svc,param_grid=params)
gs.fit(tfid_X,y)
print("SVM score for TfidVectorizer",gs.best_score_)
gs.best_params_

SVM score for TfidVectorizer 0.9822303921568627


{'C': 1, 'kernel': 'linear'}

> Use decision tree for CountVectorizer

In [46]:
from sklearn.tree import DecisionTreeClassifier

cvec_tree = DecisionTreeClassifier(random_state = 42)
print('Decision tree score for CountVectorizer: ',cross_val_score(cvec_tree,cvec_X,y,cv=kf).mean())

Decision tree score for CountVectorizer:  0.9834618487458021


> Do the same for TfidVectorizer

In [47]:
tfid_tree = DecisionTreeClassifier(random_state = 42)
print('Decision tree score for CountVectorizer: ',cross_val_score(tfid_tree,tfid_X,y,cv=kf).mean())

Decision tree score for CountVectorizer:  0.9840734695409091


# Executive Summary
---
Put your executive summary in a Markdown cell below.

We can see that two categories 'cats' and 'dogs' may be divided with very high accuracy. All the methods give us pretty high accuracy (logistic regression, multinomial NB, random forest, decision tree, support vectors) on the both samples - count vectorized and Tfold vectorized. But the highest score has decision tree classifier with the count vectorizer.
For 'games' and 'boardgames' the best result is by the multinomial NB on count vectorized data, acc. 89%