# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. If creating a logistic regression, GridSearch Ridge and Lasso for this model and report the best hyperparameter values.
2. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
3. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.
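
For the first bonus item, a minimal sketch of what the grid search could look like. It assumes "Lasso" and "Ridge" map to the `l1` and `l2` penalties of `LogisticRegression`, uses the `liblinear` solver because it supports both, and runs on synthetic data from `make_classification` rather than the Reddit data. The names `X_demo`, `y_demo`, and `search` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and HIGH/LOW labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# l1 is the Lasso-style penalty, l2 the Ridge-style penalty; C is the
# inverse regularization strength (smaller C means stronger regularization).
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_params_` holds the hyperparameter values the bonus asks you to report.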

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
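
If a 429 still shows up occasionally, one option is to wrap the request in a small retry helper. The sketch below is not part of the original notebook: `get_json_with_retry` is a hypothetical helper, and `http_get` is assumed to behave like `requests.get` (returning an object with `.status_code` and `.json()`).

```python
import time


def get_json_with_retry(url, http_get, headers, max_tries=3, wait_seconds=2.0,
                        sleep=time.sleep):
    """Call http_get(url, headers=...) and retry while the status is 429.

    http_get is assumed to behave like requests.get: it returns an object
    with a .status_code attribute and a .json() method.
    """
    for attempt in range(max_tries):
        res = http_get(url, headers=headers)
        if res.status_code == 200:
            return res.json()
        if res.status_code != 429:
            raise RuntimeError("Unexpected status code: {}".format(res.status_code))
        # 429 means "too many requests": wait a little longer each time.
        sleep(wait_seconds * (attempt + 1))
    raise RuntimeError("Still rate limited after {} tries".format(max_tries))
```

With `requests` installed, this could be called as `get_json_with_retry(URL, requests.get, {'User-agent': 'YOUR NAME Bot 0.1'})`.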

In [1]:
# Import libraries 
import requests
import json
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.naive_bayes import MultinomialNB
# Imputer is unused below and was removed from sklearn.preprocessing in scikit-learn 0.22
from time import sleep


In [2]:
import requests
import json

In [3]:
URL = "http://www.reddit.com/hot.json"
res = requests.get(URL, headers = {'User-agent':'Raj Bot 0.1'})


In [4]:
data = res.json()

In [5]:
data.keys()

dict_keys(['kind', 'data'])

In [6]:
data['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [7]:
res.status_code #200 is what you want! 

200

In [8]:
print(len(data['data']['children']))

25


In [9]:
reddit = [child['data'] for child in data['data']['children']]
reddit = pd.DataFrame(reddit)
time_fetched = pd.Timestamp.utcnow()  # named time_fetched so the imported time module is not shadowed
reddit['time fetched'] = time_fetched
reddit.head()

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls,time fetched
0,,,False,Don_Thate,,,[],,,,...,140,Removed the top of my desk for cleaning. Cat d...,46400,https://gfycat.com/ObviousShockingBronco,[],,False,all_ads,6,2018-06-11 22:08:44.971936+00:00
1,,,False,50wpm,,,[],,,,...,140,Tina Fey Says Liz Lemon And Leslie Knope Shoul...,21082,https://www.huffingtonpost.com/entry/tina-fey-...,[],,False,all_ads,6,2018-06-11 22:08:44.971936+00:00
2,,,False,Mewiee,,,[],,,,...,140,Just keep it steady,12994,https://v.redd.it/40wruaq1re311,[],,False,all_ads,6,2018-06-11 22:08:44.971936+00:00
3,,,False,dim_ov,,,[],,,,...,140,Cat saves his buddy from falling off a ledge,20810,https://gfycat.com/ThoroughExemplaryGreatdane,[],,False,all_ads,6,2018-06-11 22:08:44.971936+00:00
4,,,False,dickfromaccounting,,,[],,,,...,140,Time-lapse of rain storm,15971,https://i.imgur.com/LUWQJCQ.gifv,[],,False,all_ads,6,2018-06-11 22:08:44.971936+00:00


In [10]:
reddit.shape

(25, 95)

In [11]:
reddit.columns

Index(['approved_at_utc', 'approved_by', 'archived', 'author',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'banned_at_utc', 'banned_by', 'can_gild', 'can_mod_post', 'category',
       'clicked', 'content_categories', 'contest_mode', 'created',
       'created_utc', 'distinguished', 'domain', 'downs', 'edited', 'gilded',
       'hidden', 'hide_score', 'id', 'is_crosspostable', 'is_original_content',
       'is_reddit_media_domain', 'is_self', 'is_video', 'likes',
       'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media',
       'media_embed', 'media_only', 'mod_note', 'mod_reason_by',
       'mod_reason_title', 'mod_reports', 'name', 'no_follow', 'num_comments',
       'num_cross

In [17]:
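next_post = data['data']['after']
for i in range(0,200):
    # Request the page that follows the last post seen so far
    URL = "http://www.reddit.com/hot.json?after=" + next_post
    res = requests.get(URL, headers = {'User-agent':'Raj Bot 0.1'})
    post = res.json()
    time_now = pd.Timestamp.utcnow()
    next_post = post['data']['after']
    post_df = [child['data'] for child in post['data']['children']]
    post_df = pd.DataFrame(post_df)
    reddit = pd.concat([reddit, post_df], ignore_index = True)
    # Note: this overwrites 'time fetched' on every row, so all posts end up
    # with the time of the most recent request (visible in the output below)
    reddit['time fetched'] = time_now
    if next_post is None:  # no more pages to request
        break
    sleep(1)
reddit.head()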

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_width,time fetched,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,Don_Thate,,,,[],,,...,140.0,2018-06-11 22:15:51.435208+00:00,Removed the top of my desk for cleaning. Cat d...,46400,https://gfycat.com/ObviousShockingBronco,[],,False,all_ads,6.0
1,,,False,50wpm,,,,[],,,...,140.0,2018-06-11 22:15:51.435208+00:00,Tina Fey Says Liz Lemon And Leslie Knope Shoul...,21082,https://www.huffingtonpost.com/entry/tina-fey-...,[],,False,all_ads,6.0
2,,,False,Mewiee,,,,[],,,...,140.0,2018-06-11 22:15:51.435208+00:00,Just keep it steady,12994,https://v.redd.it/40wruaq1re311,[],,False,all_ads,6.0
3,,,False,dim_ov,,,,[],,,...,140.0,2018-06-11 22:15:51.435208+00:00,Cat saves his buddy from falling off a ledge,20810,https://gfycat.com/ThoroughExemplaryGreatdane,[],,False,all_ads,6.0
4,,,False,dickfromaccounting,,,,[],,,...,140.0,2018-06-11 22:15:51.435208+00:00,Time-lapse of rain storm,15971,https://i.imgur.com/LUWQJCQ.gifv,[],,False,all_ads,6.0


In [18]:
reddit.shape

(5275, 99)

In [19]:

# Keep the 4 required features (title, subreddit, num_comments, created_utc) plus id and time fetched
reddit_new = reddit[['title', 'subreddit', 'num_comments', 'created_utc', 'id', 'time fetched']].copy(deep = True)
reddit_new.head()

Unnamed: 0,title,subreddit,num_comments,created_utc,id,time fetched
0,Removed the top of my desk for cleaning. Cat d...,aww,486,1528745000.0,8qc84c,2018-06-11 22:15:51.435208+00:00
1,Tina Fey Says Liz Lemon And Leslie Knope Shoul...,television,990,1528742000.0,8qbqmn,2018-06-11 22:15:51.435208+00:00
2,Just keep it steady,combinedgifs,235,1528739000.0,8qbbb3,2018-06-11 22:15:51.435208+00:00
3,Cat saves his buddy from falling off a ledge,AnimalsBeingBros,211,1528737000.0,8qb37o,2018-06-11 22:15:51.435208+00:00
4,Time-lapse of rain storm,woahdude,175,1528737000.0,8qb0qq,2018-06-11 22:15:51.435208+00:00


In [20]:
# Convert created_utc to readable format
reddit_new['created_utc'] = reddit_new['created_utc'].astype('datetime64[s]')
reddit_new['time fetched'] = reddit_new['time fetched'].astype('datetime64[s]')
reddit_new.head()

Unnamed: 0,title,subreddit,num_comments,created_utc,id,time fetched
0,Removed the top of my desk for cleaning. Cat d...,aww,486,2018-06-11 19:27:23,8qc84c,2018-06-11 22:15:51
1,Tina Fey Says Liz Lemon And Leslie Knope Shoul...,television,990,2018-06-11 18:31:18,8qbqmn,2018-06-11 22:15:51
2,Just keep it steady,combinedgifs,235,2018-06-11 17:37:08,8qbbb3,2018-06-11 22:15:51
3,Cat saves his buddy from falling off a ledge,AnimalsBeingBros,211,2018-06-11 17:10:42,8qb37o,2018-06-11 22:15:51
4,Time-lapse of rain storm,woahdude,175,2018-06-11 17:02:18,8qb0qq,2018-06-11 22:15:51
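
The third required feature, the length of time a thread has been up, can be derived from these two datetime columns. A minimal sketch using two rows copied from the output above; the column name `age_hours` is an illustrative choice, not something from the original notebook.

```python
import pandas as pd

# Two illustrative rows copied from the output above.
sample = pd.DataFrame({
    "created_utc": pd.to_datetime(["2018-06-11 19:27:23", "2018-06-11 18:31:18"]),
    "time fetched": pd.to_datetime(["2018-06-11 22:15:51", "2018-06-11 22:15:51"]),
})

# Age of each post at fetch time, expressed in hours.
age = sample["time fetched"] - sample["created_utc"]
sample["age_hours"] = age.dt.total_seconds() / 3600

print(sample["age_hours"].round(2).tolist())
```

The same two lines should work on `reddit_new` once both columns have been converted to datetimes.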


In [30]:
import pickle

with open('reddit_new.pkl', 'wb') as f:
    pickle.dump(reddit, f)

In [31]:
with open('reddit_new.pkl', 'rb') as f:
     reddit = pickle.load(f)



In [32]:
reddit_new.shape

(5275, 6)

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/hot.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
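
The three steps above can be sketched as a small loop. This is one possible shape, not the required solution: `collect_posts` and `fetch_hot_page` are hypothetical helper names, and the page fetcher is passed in as a function so the loop can be tried without hitting Reddit.

```python
import time


def collect_posts(fetch_page, max_pages=5, pause_seconds=3.0, sleep=time.sleep):
    """Follow Reddit's `after` token from page to page.

    fetch_page(after) is assumed to return the parsed JSON of one listing
    page; `after` is None for the first page.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(child["data"] for child in page["data"]["children"])
        after = page["data"]["after"]
        if after is None:  # no further pages to request
            break
        sleep(pause_seconds)  # stay within the rate limit
    return posts


def fetch_hot_page(after=None):
    """Fetch one page of hot.json; requires the requests package."""
    import requests  # imported here so collect_posts works without it
    params = {"after": after} if after else {}
    res = requests.get("http://www.reddit.com/hot.json", params=params,
                       headers={"User-agent": "YOUR NAME Bot 0.1"})
    res.raise_for_status()
    return res.json()
```

Calling `pd.DataFrame(collect_posts(fetch_hot_page, max_pages=10))` would then give one row per post.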

## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

In [33]:
## YOUR CODE HERE

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [34]:
# Export to csv
reddit_new.to_csv('project3scrape.csv')
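
To save regularly while scraping, one option is to append each batch to the same file. A minimal sketch, assuming every batch has the same columns in the same order; `append_batch` is a hypothetical helper name.

```python
import os
import pandas as pd


def append_batch(batch, path):
    """Append one batch of rows to a CSV, writing the header only once."""
    write_header = not os.path.exists(path)
    # index=False also avoids the extra 'Unnamed: 0' column on reload.
    batch.to_csv(path, mode="a", header=write_header, index=False)
```

Inside a scraping loop this would be called once per page, for example `append_batch(post_df, 'project3scrape.csv')`.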

## Predicting comments using Random Forests + Another Classifier

#### Load in the data of scraped results

In [35]:
## Reading in the scraped data
comment_predictor = pd.read_csv('project3scrape.csv')

In [36]:
# Dropping the first column (the old index written by to_csv) since it is unnecessary
comment_predictor.drop('Unnamed: 0', axis=1, inplace=True)

In [37]:
# Check shape to confirm all rows and columns loaded
comment_predictor.shape

(5275, 6)

In [38]:
# Checking my data frame
comment_predictor

Unnamed: 0,title,subreddit,num_comments,created_utc,id,time fetched
0,Removed the top of my desk for cleaning. Cat d...,aww,486,2018-06-11 19:27:23,8qc84c,2018-06-11 22:15:51
1,Tina Fey Says Liz Lemon And Leslie Knope Shoul...,television,990,2018-06-11 18:31:18,8qbqmn,2018-06-11 22:15:51
2,Just keep it steady,combinedgifs,235,2018-06-11 17:37:08,8qbbb3,2018-06-11 22:15:51
3,Cat saves his buddy from falling off a ledge,AnimalsBeingBros,211,2018-06-11 17:10:42,8qb37o,2018-06-11 22:15:51
4,Time-lapse of rain storm,woahdude,175,2018-06-11 17:02:18,8qb0qq,2018-06-11 22:15:51
5,For us ladies on budget,ShittyLifeProTips,368,2018-06-11 17:13:08,8qb3ze,2018-06-11 22:15:51
6,[THENEEDLEDROP] KIDS SEE GHOSTS Album Review,Kanye,702,2018-06-11 19:01:07,8qc009,2018-06-11 22:15:51
7,RIP net neutrality: Ajit Pai's 'fuck you' to t...,technology,2867,2018-06-11 15:54:30,8qammr,2018-06-11 22:15:51
8,10 Most Downvoted Reddit Comments [OC],dataisbeautiful,368,2018-06-11 19:42:18,8qccrn,2018-06-11 22:15:51
9,Forbidden Ice Cream,forbiddensnacks,117,2018-06-11 17:00:23,8qb0bd,2018-06-11 22:15:51


#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 
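
Both alternatives mentioned above, a 75th-percentile split and several comment levels, are short to express in pandas. A minimal sketch on made-up counts; the bucket labels and variable names are illustrative only.

```python
import pandas as pd

# Made-up comment counts, for illustration only.
num_comments = pd.Series([0, 3, 8, 19, 40, 120, 486, 2867])

# Alternative binary split: HIGH means above the 75th percentile.
cutoff = num_comments.quantile(0.75)
high_75 = (num_comments > cutoff).astype(int)

# Multi-class alternative: four equal-sized buckets by quartile.
levels = pd.qcut(num_comments, q=4, labels=["low", "mid", "high", "very high"])

print(cutoff, high_75.tolist())
print(levels.tolist())
```

Note that a 75th-percentile split makes the classes unbalanced, so the baseline accuracy rises to roughly 75%.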

In [39]:
## Find the median number of comments across all posts

med_comment = comment_predictor['num_comments'].median()
med_comment
# The median is 19 comments

19.0

In [40]:
## Creating binary High vs Low column for the comments
comment_predictor['Class'] = comment_predictor.apply(lambda x: 'HIGH' if x['num_comments'] > med_comment else 'LOW', axis=1)

comment_predictor.head()

# All five rows shown by head() are above the median, so they are labelled HIGH



Unnamed: 0,title,subreddit,num_comments,created_utc,id,time fetched,Class
0,Removed the top of my desk for cleaning. Cat d...,aww,486,2018-06-11 19:27:23,8qc84c,2018-06-11 22:15:51,HIGH
1,Tina Fey Says Liz Lemon And Leslie Knope Shoul...,television,990,2018-06-11 18:31:18,8qbqmn,2018-06-11 22:15:51,HIGH
2,Just keep it steady,combinedgifs,235,2018-06-11 17:37:08,8qbbb3,2018-06-11 22:15:51,HIGH
3,Cat saves his buddy from falling off a ledge,AnimalsBeingBros,211,2018-06-11 17:10:42,8qb37o,2018-06-11 22:15:51,HIGH
4,Time-lapse of rain storm,woahdude,175,2018-06-11 17:02:18,8qb0qq,2018-06-11 22:15:51,HIGH


#### Thought experiment: What is the baseline accuracy for this model?

In [41]:
# Baseline accuracy: the share of the majority class, as a percentage

comment_predictor.Class.value_counts().iloc[0] / len(comment_predictor) * 100

50.38862559241706

##### Since baseline accuracy is the share of the majority class, my baseline accuracy is about 50.4%, the LOW class.
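
A minimal sketch, on made-up counts, of why a median split rarely gives exactly 50/50 classes: posts tied with the median fail the `>` test and fall into LOW, so LOW is always at least as large as HIGH.

```python
import pandas as pd

num_comments = pd.Series([1, 2, 2, 2, 5])  # made-up counts with ties at the median
median = num_comments.median()

# Same rule as the Class column: strictly above the median is HIGH.
labels = (num_comments > median).map({True: "HIGH", False: "LOW"})
shares = labels.value_counts(normalize=True)

print(shares)
```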

#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 

Reference for inspecting feature importances: https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e

In [42]:
# Building random forest model to predict High/Low number of comments using only subreddit feature
# Define X, y
X = comment_predictor['subreddit']
y = comment_predictor['Class'].apply(lambda x: 1 if x == 'HIGH' else 0)

X = pd.get_dummies(X, drop_first = True)


In [43]:
# Set up Train / Test Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [44]:
# RFM

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print('Score on Test data: {}'.format(rf.score(X_test, y_test)))
print('Score on Train data : {}'.format(rf.score(X_train, y_train)))

# How can I meaningfully interpret these results? The test score is lower than the train score,
# which suggests overfitting. What would make this evaluation more reliable: re-drawing the
# training and test sets, or using cross-validation on all the data?

# The model is overfitting: the test score is lower than the train score

Score on Test data: 0.6626658243840808
Score on Train data : 0.7955037919826652
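
One way to get a steadier estimate than a single train/test split is to cross-validate a pipeline that does the one-hot encoding inside each fold. The sketch below uses made-up data and illustrative names (`toy`, `model`); it is one possible setup, not the approach used in this notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: subreddit name and a 0/1 label.
toy = pd.DataFrame({
    "subreddit": ["aww", "aww", "news", "news", "funny", "funny"] * 5,
    "high": [1, 1, 0, 0, 1, 0] * 5,
})

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # unseen subreddits become all zeros
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# Five folds give five accuracy scores; their spread hints at the variance.
scores = cross_val_score(model, toy[["subreddit"]], toy["high"], cv=5)
print(scores.mean(), scores.std())
```

Because the encoder is fitted inside each fold, a subreddit that appears only in the held-out fold does not cause a column mismatch.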


In [45]:
 # Gridsearch Model to get a better test model score 

rf = RandomForestClassifier()
rf_params = {'min_samples_split' : range(2,10),
             'n_estimators' : [50, 50 , 150]}  # 50 is listed twice; 100 was probably intended

gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

0.6375947995666306
{'min_samples_split': 4, 'n_estimators': 50}
0.6797220467466835


## Model is doing slightly better now, with a cross-validated score of about 64% and a test score of about 68%

#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.

In [46]:
## Creating new features to check if cat/funny in title 

X = comment_predictor[['title', 'subreddit']].copy(deep=True)
y = comment_predictor['Class'].apply(lambda x: 1 if x == 'HIGH' else 0)

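# Note: 'in' is a case-sensitive substring test, so 'Cat' is missed and 'category' matches
X['Cat'] = X['title'].map(lambda x: 1 if 'cat' in x else 0)
X['Funny'] = X['title'].map(lambda x: 1 if 'funny' in x else 0)
# get_dummies encodes every string column, so each distinct title also becomes its own column
X = pd.get_dummies(X, drop_first=True) 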

In [47]:
X.head()

Unnamed: 0,Cat,Funny,"title_""Alexa, start a discussion about marketing budgets in video games."" ""Did you say slapfight?""","title_""Art"" by Salvador Larroca..","title_""Attorney Denies Robert De Niro Was Client of Prostitution Ring With Underage Girls; Police Interrogated Star for Nine Hours – True PunditTrue Pundit"" A *Nine* hour interview and his very high priced Lawyers would've been pushing for it to be brief to prevent inadvertent selfincriminating. Right?","title_""Automata"" DUST Trailer - Episode 1 premieres on June 12th","title_""Balancing Act"", Digital, 720x720, Collab with Broken Isnt Bad","title_""Bandit"" in cinemas soon","title_""Canada burned down the White House during the war of 1812"" - Trump","title_""Coffee, tea, or... me?""",...,subreddit_xxfitness,subreddit_yakuzagames,subreddit_yesyesyesno,subreddit_yesyesyesyesno,subreddit_youdontsurf,subreddit_youseeingthisshit,subreddit_youtube,subreddit_youtubehaiku,subreddit_zelda,subreddit_zerocarb
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
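
Row 3 above ("Cat saves his buddy...") has `Cat` equal to 0 because the check is case-sensitive. A minimal sketch of a keyword test that ignores case and matches whole words only; `has_word` is a hypothetical helper, not part of the original notebook.

```python
import re


def has_word(title, word):
    """Return 1 if `word` appears in `title` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return int(re.search(pattern, title, flags=re.IGNORECASE) is not None)


print(has_word("Cat saves his buddy from falling off a ledge", "cat"))
print(has_word("New category added", "cat"))
```

It could replace the lambda as `X['title'].map(lambda t: has_word(t, 'cat'))`. Plurals such as "cats" are not matched by this version.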


In [48]:
# Train Test Validation 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3) 

In [49]:
# RFM

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))    # test score
print(rf.score(X_train, y_train))  # train score

0.6519267214150347
0.9336403033586133


In [50]:
# Model seems to be overfit, using gridsearch to reduce overfitting.

rf = RandomForestClassifier()
rf_params = {'min_samples_split' : range(2,5),
            'n_estimators': [50, 100, 150]}

gs = GridSearchCV(rf, param_grid = rf_params)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_train, y_train))  # training score; use X_test, y_test to check for overfitting

0.6592632719393283
{'min_samples_split': 3, 'n_estimators': 150}
0.995124593716143


In [51]:
# These features are not helping the model

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 

In [52]:
## Gridsearch above, now doing cross validation 

cross_validation = cross_val_score(gs, X_train, y_train, cv=5)

In [53]:
print(cross_validation)

[0.65764547 0.67388363 0.67208672 0.67615176 0.64498645]


#### Receiving accuracy scores of about 64-68%

#### Repeat the model-building process with a non-tree-based method.

In [54]:
comment_predictor.head()

Unnamed: 0,title,subreddit,num_comments,created_utc,id,time fetched,Class
0,Removed the top of my desk for cleaning. Cat d...,aww,486,2018-06-11 19:27:23,8qc84c,2018-06-11 22:15:51,HIGH
1,Tina Fey Says Liz Lemon And Leslie Knope Shoul...,television,990,2018-06-11 18:31:18,8qbqmn,2018-06-11 22:15:51,HIGH
2,Just keep it steady,combinedgifs,235,2018-06-11 17:37:08,8qbbb3,2018-06-11 22:15:51,HIGH
3,Cat saves his buddy from falling off a ledge,AnimalsBeingBros,211,2018-06-11 17:10:42,8qb37o,2018-06-11 22:15:51,HIGH
4,Time-lapse of rain storm,woahdude,175,2018-06-11 17:02:18,8qb0qq,2018-06-11 22:15:51,HIGH


In [55]:
# Going to do a couple different models with different features 

In [56]:
# Using features num_comments and subreddit
X = comment_predictor[['num_comments', 'subreddit']].copy(deep=True) # deep=True (the default) copies the data, so changes to X do not touch comment_predictor
y = comment_predictor['Class'].apply(lambda x: 1 if x == 'HIGH' else 0)
X = pd.get_dummies(X, drop_first=True)
X.head()

Unnamed: 0,num_comments,subreddit_13or30,subreddit_1500isplenty,subreddit_2healthbars,subreddit_2mad4madlads,subreddit_2meirl42meirl4meirl,subreddit_2meirl4meirl,subreddit_3DS,subreddit_3Dprinting,subreddit_3amjokes,...,subreddit_xxfitness,subreddit_yakuzagames,subreddit_yesyesyesno,subreddit_yesyesyesyesno,subreddit_youdontsurf,subreddit_youseeingthisshit,subreddit_youtube,subreddit_youtubehaiku,subreddit_zelda,subreddit_zerocarb
0,486,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,990,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,235,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,211,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,175,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [73]:
# I was running into ValueErrors here caused by a mismatch of features
# between my training and test data, so I used a helper to align them.

# Features: num_comments and subreddit
X = comment_predictor[['num_comments', 'subreddit']].copy(deep = True)
y = comment_predictor['Class'].apply(lambda x: 1 if x == 'HIGH' else 0)

X = pd.get_dummies(X, columns = ['subreddit'], drop_first = True)

# Train / Test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# Make sure the test set has the same feature columns, in the same order, as the train set
def feature_check(train, test):
    test = test.copy()  # avoid modifying the caller's DataFrame
    missing_cols = set(train.columns) - set(test.columns)
    for c in missing_cols:
        test[c] = 0
    test = test[train.columns]
    return test
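
The same alignment can also be done with `DataFrame.reindex`, which adds missing columns and drops extra ones in a single call. A minimal sketch on made-up frames; the variable names are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"subreddit": ["aww", "news", "funny"]})
test = pd.DataFrame({"subreddit": ["news", "zelda"]})  # "zelda" is unseen in train

train_d = pd.get_dummies(train, columns=["subreddit"])
test_d = pd.get_dummies(test, columns=["subreddit"])

# Keep exactly the training columns: add missing ones as 0, drop extras.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_aligned.columns))
```

This only matters when the dummies are created separately for train and test; creating them before the split, as in the cell above, already guarantees matching columns.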

In [74]:
## Now rebuilding the model process using Multinomial Naive Bayes and Logistic Regression

# Train/Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

In [75]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)

1.0

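#### Logistic Regression scores 100% accuracy, but only because `num_comments` is a feature and `Class` is derived from it (target leakage)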
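
A minimal sketch of why a perfect score is expected whenever `num_comments` is among the features: the label is defined by thresholding `num_comments` at the median, so a single split on that column reproduces it. The data and the names `df`, `high`, and `stump` are made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up comment counts.
df = pd.DataFrame({"num_comments": [2, 5, 19, 40, 486, 990]})

# Same labelling rule as the Class column: above the median is HIGH (1).
df["high"] = (df["num_comments"] > df["num_comments"].median()).astype(int)

# A one-split tree on the leaked feature recovers the label exactly.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(df[["num_comments"]], df["high"])
accuracy = stump.score(df[["num_comments"]], df["high"])

print(accuracy)
```

Dropping `num_comments` from `X` gives a score that reflects what the other features can actually predict.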

In [76]:
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb.score(X_test, y_test)

0.8894504106127605

#### About 89% accuracy with Multinomial NB

In [77]:
coefs = pd.DataFrame(logreg.coef_[0], index = X.columns, columns = ['coef'])
coefs['coef'] = np.exp(coefs['coef'])
coefs.sort_values(by='coef', ascending = False, inplace=True)
coefs.head(10)

Unnamed: 0,coef
num_comments,1.817401
subreddit_2healthbars,1.382472
subreddit_exjw,1.372397
subreddit_RealmRoyale,1.35952
subreddit_Fishing,1.355772
subreddit_reddevils,1.327084
subreddit_CasualUK,1.306744
subreddit_firstworldanarchists,1.269877
subreddit_Floof,1.2642
subreddit_marvelstudios,1.241404



In [63]:
# Checking highest coefficients!

coefs = pd.DataFrame(logreg.coef_[0], index = X.columns, columns = ['coef'])
coefs['coef'] = np.exp(coefs['coef'])
coefs.sort_values(by='coef', ascending = False, inplace=True)
coefs.head(10)

Unnamed: 0,coef
num_comments,6.15584
subreddit_2healthbars,3.98474
subreddit_exjw,3.944795
subreddit_RealmRoyale,3.894325
subreddit_Fishing,3.879754
subreddit_reddevils,3.770033
subreddit_CasualUK,3.694124
subreddit_firstworldanarchists,3.560414
subreddit_Floof,3.540259
subreddit_marvelstudios,3.460468


In [64]:
# KNN Model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

1.0

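#### With the number of comments and subreddit features, both KNN and Logistic Regression scored 1.0 on the test set. Because `Class` is derived from `num_comments`, that score reflects target leakage rather than real predictive power.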

#### Use Count Vectorizer from scikit-learn to create features from the thread titles.
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance?
- What text features are the most valuable?

In [65]:
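# Note: this vectorizes the subreddit names; use comment_predictor['title'] to build features from thread titles
X = comment_predictor['subreddit']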
y = comment_predictor['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [66]:
cvec = CountVectorizer(stop_words = 'english', binary=False)
cvec.fit(X_train)
X_train_cvec = cvec.transform(X_train)
X_test_cvec = cvec.transform(X_test)

In [67]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_cvec, y_train)
rf.score(X_test_cvec, y_test)


0.599696739954511

#### Random Forest scored about 60% here, roughly 5 to 8 points below the earlier forests built on subreddit dummies (65-68%)

In [68]:
logreg = LogisticRegression()
logreg.fit(X_train_cvec, y_train)
logreg.score(X_test_cvec, y_test)

0.6724791508718726

#### Logistic Regression dropped from 1.0 to about 67%, way worse. Sad!

In [69]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_cvec, y_train)
knn.score(X_test_cvec, y_test)

0.5625473843821076

#### KNN also dropped, from 1.0 to about 56%

### I had a bit of trouble combining the CountVectorizer output with other features such as number of comments, which is what I originally wanted to do. For some reason I think the value was too low. The error was something along the lines of a minimum of 2 being expected where I had 1.
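
One way to combine text features with other columns is scikit-learn's `ColumnTransformer`, which applies a different transformer to each column and stacks the results. The sketch below is an assumption about what might have worked, not a diagnosis of the error above; the data and names (`toy`, `features`, `model`) are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for the scraped data.
toy = pd.DataFrame({
    "title": ["Cat saves his buddy", "Time-lapse of rain storm",
              "Net neutrality repeal", "Funny cat video"],
    "subreddit": ["aww", "woahdude", "technology", "aww"],
    "high": [1, 0, 1, 0],
})

features = ColumnTransformer([
    # CountVectorizer expects a 1-D column of strings, so pass the name itself.
    ("title_words", CountVectorizer(stop_words="english"), "title"),
    # OneHotEncoder expects 2-D input, so pass a list of column names.
    ("subreddit", OneHotEncoder(handle_unknown="ignore"), ["subreddit"]),
])

model = make_pipeline(features, LogisticRegression())
model.fit(toy[["title", "subreddit"]], toy["high"])
predictions = model.predict(toy[["title", "subreddit"]])

print(predictions)
```

Note the difference between `"title"` (a string, giving a 1-D column) and `["subreddit"]` (a list, giving a 2-D frame); mixing these up is a common source of shape errors.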

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

I sort of stumbled my way through this one before grasping all the concepts, learning how to use them while doing the project. My models didn't perform exceptionally well. Logistic regression and KNN scored highest, but only because `num_comments` was among their features; without it, the best test accuracy was about 68%. I did manage to learn that subreddits with their own followers really get the most traffic. Among those subreddits, it was gaming stuff (which means people like me) and funny posts. My 3rd largest subreddit was white people twitter, which I'm sure is just as hilarious as it sounds.

If I were to give advice to someone making Reddit posts, I would let them know that a lot of it can come down to human activity on subreddits, which can sometimes be unpredictable. Finding a specific subreddit to tie their post into would be really helpful, since those seem to have people that follow them already and are a little less random. If they really wanted some exposure, somehow tying it to video games or something funny may be a good route to take.

Personally, I learned quite a bit while putting the models to use. Even though maybe not all of them are being used correctly, the syntax is starting to sink in. My workflow is still a bit messy because I sometimes struggle to keep track of what I am trying to achieve while I am doing it, but I think with enough practice it will get there. The most interesting thing for me was probably seeing how drastically different models and features perform given the same dataset. I also think the way we defined our High/Low comment class could use some tweaking so it wasn't so black and white for how our posts got sorted. It seems a little too skewed to be helpful for me, but I could be wrong.