

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Using Reddit's API for Predicting Comments<br>
### Author: Will J. Suh

In this project, we will practice two major skills: collecting data via an API request and building a binary classifier.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: **_What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_**

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [46]:
import json
import re

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
# train_test_split and cross_val_score live in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier



In [2]:
URL = 'https://www.reddit.com/r/todayilearned.json'

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [3]:
results = requests.get(URL, headers = {'User-agent' : 'Will Bot 0.1'})

In [4]:
data = results.json()

In [5]:
print(len(data['data']['children']))

25


In [6]:
data['data']

{'modhash': '',
 'dist': 25,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'todayilearned',
    'selftext': '',
    'author_fullname': 't2_x1zt7',
    'saved': False,
    'mod_reason_title': None,
    'gilded': 0,
    'clicked': False,
    'title': 'TIL Eleanor Roosevelt held weekly press conferences and allowed female journalists to attend, forcing many news organizations to hire their first female reporters',
    'link_flair_richtext': [],
    'subreddit_name_prefixed': 'r/todayilearned',
    'hidden': False,
    'pwls': 6,
    'link_flair_css_class': None,
    'downs': 0,
    'thumbnail_height': 108,
    'hide_score': False,
    'name': 't3_9rbr33',
    'quarantine': False,
    'link_flair_text_color': 'dark',
    'author_flair_background_color': None,
    'subreddit_type': 'public',
    'ups': 18319,
    'domain': 'womenshistory.org',
    'media_embed': {},
    'thumbnail_width': 140,
    'author_flair_template_id': None,
    'is_original_conten

In [7]:
data['data']['after']

't3_9raftf'
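
To make the shape of the response concrete, here is a small sketch that pulls the titles and the pagination token out of a listing. The `sample_listing` dictionary is a hand-trimmed stand-in shaped like the output above (only the fields used here, and the second title is a placeholder), and `summarize_listing` is a hypothetical helper name, not part of any library.

```python
# A trimmed stand-in for `data`; a real response carries many more fields.
sample_listing = {
    'data': {
        'dist': 2,
        'children': [
            {'kind': 't3', 'data': {'title': 'TIL Eleanor Roosevelt held weekly press conferences',
                                    'selftext': '',
                                    'subreddit_name_prefixed': 'r/todayilearned'}},
            {'kind': 't3', 'data': {'title': 'TIL a placeholder title',
                                    'selftext': '',
                                    'subreddit_name_prefixed': 'r/todayilearned'}},
        ],
        'after': 't3_9raftf',
    }
}

def summarize_listing(listing):
    """Return (titles, after) from a Reddit listing dictionary."""
    children = listing['data']['children']
    titles = [child['data']['title'] for child in children]
    return titles, listing['data']['after']

titles, after = summarize_listing(sample_listing)
print(len(titles), after)
```

Each post sits under `children[i]['data']`, while `after` sits one level up, next to `children`.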

In [10]:
#after = 
new_url = 'http://www.reddit.com/r/todayilearned.json?after=t3_9raftf'

In [11]:
results = requests.get(new_url, headers = {'User-agent':'Will Bot 0.1'})
new_data = results.json()
new_data['data']['children']


[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'todayilearned',
   'selftext': '',
   'author_fullname': 't2_16be8i04',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'TIL about an outbreak of mass hysteria in Belgium in 1999; hundreds reported cramping, nausea and headaches allegedly from drinking Coca-Cola products. Coca-Cola recalled 30 million products, claiming it had identified the cause of the epidemic. Toxicology experts found no evidence of contamination.',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/todayilearned',
   'hidden': False,
   'pwls': 6,
   'link_flair_css_class': None,
   'downs': 0,
   'thumbnail_height': 73,
   'hide_score': False,
   'name': 't3_9rbbhm',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 159,
   'domain': 'theguardian.com',
   'media_embed': {},
   'thumbnail_width'

In [12]:
soup = BeautifulSoup(results.text, 'lxml')

In [13]:
len(new_data['data']['children'])

25

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

## First subreddit: /r/todayilearned

In [14]:
import time

In [15]:
url = 'http://www.reddit.com/r/todayilearned.json'
all_posts =[]
for _ in range(40): 
    # up to 40 pages of 25 posts each, about 1,000 posts
    
    # Get the posts by hitting the url, put it in json and store it
    res = requests.get(url, headers={'User-agent': 'will bot'})
    data = res.json()
    
    # save only the posts out of the json into the list_of_posts, then
    # add all the posts to the all_posts list
    list_of_posts = data['data']['children']
    
    for post in list_of_posts:
        current_post = []
        current_post.append(post['data']['selftext'])
        current_post.append(post['data']['title'])
        current_post.append(post['data']['subreddit_name_prefixed'])
        all_posts.append(current_post)
    
    # reassign the after to the current 'after', and then update the url to hit
    after = data['data']['after']
    url = 'http://www.reddit.com/r/todayilearned.json?after=' + after
    
    
    # go to sleep for 3 seconds so you do not overwhelm reddit and get kicked out
    print('The current after: ', after)
    time.sleep(3)

The current after:  t3_9raftf
The current after:  t3_9rd79i
The current after:  t3_9re6r3
The current after:  t3_9r6so3
The current after:  t3_9r5s9i
The current after:  t3_9r1n3c
The current after:  t3_9qrni7
The current after:  t3_9qon3j
The current after:  t3_9qs5tz
The current after:  t3_9qv8g8
The current after:  t3_9qvsh0
The current after:  t3_9qszxt
The current after:  t3_9qoige
The current after:  t3_9qk96x
The current after:  t3_9qjbpv
The current after:  t3_9qdip1
The current after:  t3_9qcahn
The current after:  t3_9qbhhy
The current after:  t3_9q80gc
The current after:  t3_9q41pd
The current after:  t3_9q54i3
The current after:  t3_9pytzv
The current after:  t3_9psv8d
The current after:  t3_9psbpp
The current after:  t3_9phrfr
The current after:  t3_9pjjsc
The current after:  t3_9pnjju
The current after:  t3_9poin3


TypeError: must be str, not NoneType
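
The `TypeError` above appears to come from `data['data']['after']` being `None` once the listing runs out of pages, so the string concatenation fails. Below is a minimal sketch of a loop that stops cleanly in that case. `collect_posts` and `fetch_page` are hypothetical names; `fetch_page(url)` is assumed to return the decoded JSON dictionary, for example a thin wrapper around `requests.get(url, headers=...).json()`.

```python
import time

def collect_posts(base_url, fetch_page, max_pages=40, pause=3):
    """Page through a listing until max_pages is reached or `after` is None."""
    all_posts = []
    url = base_url
    for _ in range(max_pages):
        data = fetch_page(url)
        for post in data['data']['children']:
            fields = post['data']
            all_posts.append([fields['selftext'],
                              fields['title'],
                              fields['subreddit_name_prefixed']])
        after = data['data']['after']
        if after is None:
            break  # no more pages: stop instead of concatenating None
        url = base_url + '?after=' + after
        time.sleep(pause)  # stay within the request rate guidelines
    return all_posts
```

Passing the fetch step in as a function keeps the loop easy to try against canned pages before pointing it at Reddit.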

In [16]:
# save this to a dataframe first and then to CSV

first_subreddit = pd.DataFrame(all_posts, columns = ['post', 'title', 'true_y'])

In [17]:
first_subreddit.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | | TIL Eleanor Roosevelt held weekly press confer... | r/todayilearned |
| 1 | | TIL that Algerian women make up 70% of their l... | r/todayilearned |
| 2 | | TIL the only university that would allow the m... | r/todayilearned |
| 3 | | TIL that some black bears are born white and a... | r/todayilearned |
| 4 | | TIL of Mary Babnik Brown, who donated her 34" ... | r/todayilearned |


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [18]:
pd.DataFrame(first_subreddit).to_csv('./TIL') #renamed to first_subreddit in jupyter notebook

## Second subreddit: /r/askreddit

In [19]:
url2 = 'https://www.reddit.com/r/askreddit.json'
results2 = requests.get(url2, headers = {'User-agent' : 'Will Bot 0.2'})
data2 = results2.json()
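# Note: this inspects `data` (the last r/todayilearned page fetched above), not `data2`,
# so the count printed below is not the size of the r/AskReddit page.
print(len(data['data']['children']))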

soup = BeautifulSoup(results2.text, 'lxml')

21


In [20]:
url2 = 'https://www.reddit.com/r/askreddit.json'
all_posts =[]
for _ in range(40): 
    # up to 40 pages of 25 posts each, about 1,000 posts
    
    # Get the posts by hitting the url, put it in json and store it
    res = requests.get(url2, headers={'User-agent': 'will bot 0.2'})
    data2 = res.json()
    
    # save only the posts out of the json into the list_of_posts, then
    # add all the posts to the all_posts list
    list_of_posts = data2['data']['children']
    
    for post in list_of_posts:
        current_post = []
        current_post.append(post['data']['selftext'])
        current_post.append(post['data']['title'])
        current_post.append(post['data']['subreddit_name_prefixed'])
        all_posts.append(current_post)
    
    # reassign the after to the current 'after', and then update the url to hit
    after2 = data2['data']['after']
    url2 = 'https://www.reddit.com/r/askreddit.json?after=' + after2
    
    # go to sleep for 5 seconds so you do not overwhelm reddit and get kicked out
    print('The current after: ', after2)
    time.sleep(5)


The current after:  t3_9r2apo
The current after:  t3_9ralb5
The current after:  t3_9rbsmr
The current after:  t3_9rd1x8
The current after:  t3_9rdlof
The current after:  t3_9rdo57
The current after:  t3_9rdunb
The current after:  t3_9re9nr
The current after:  t3_9rbxrj
The current after:  t3_9rdssk
The current after:  t3_9rd0ib
The current after:  t3_9rbhd9
The current after:  t3_9re8ly
The current after:  t3_9re2xv
The current after:  t3_9rdxz8
The current after:  t3_9rd00x
The current after:  t3_9rcwed
The current after:  t3_9racax
The current after:  t3_9rekgx
The current after:  t3_9rdbpf
The current after:  t3_9rat3z
The current after:  t3_9rebj2
The current after:  t3_9re9ve
The current after:  t3_9re7vi
The current after:  t3_9rczf9
The current after:  t3_9rc3ha
The current after:  t3_9r2d74
The current after:  t3_9rdy4s
The current after:  t3_9rdufi
The current after:  t3_9rdrzt
The current after:  t3_9ra2wd
The current after:  t3_9rdkg1
The current after:  t3_9rdigw
The curren

TypeError: must be str, not NoneType

In [64]:
len(all_posts)

987

In [65]:
second_subreddit = pd.DataFrame(all_posts, columns = ['post', 'title', 'true_y'])

In [66]:
second_subreddit.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | \*\*Please keep all top level-comments as questi... | Halloween Megathread 2018 | r/AskReddit |
| 1 | \*\*Please keep all top level-comments as questi... | The Sexy Halloween Megathread Strikes Again! | r/AskReddit |
| 2 | | What is a fun game you play in your head to ki... | r/AskReddit |
| 3 | | Managers of Reddit, what’s the fastest you’ve ... | r/AskReddit |
| 4 | | What's your biggest "This isn't what it looks ... | r/AskReddit |


In [67]:
pd.DataFrame(second_subreddit).to_csv('./askreddit')

## Load CSV and Clean Dataframes

In [68]:
til = pd.read_csv('./TIL')
ask = pd.read_csv('./askreddit')

In [69]:
del til['Unnamed: 0']
del ask['Unnamed: 0']

In [70]:
ask.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | \*\*Please keep all top level-comments as questi... | Halloween Megathread 2018 | r/AskReddit |
| 1 | \*\*Please keep all top level-comments as questi... | The Sexy Halloween Megathread Strikes Again! | r/AskReddit |
| 2 | | What is a fun game you play in your head to ki... | r/AskReddit |
| 3 | | Managers of Reddit, what’s the fastest you’ve ... | r/AskReddit |
| 4 | | What's your biggest "This isn't what it looks ... | r/AskReddit |


In [71]:
ask.shape

(987, 3)

In [72]:
til.shape

(721, 3)

In [73]:
til.head()

| | post | title | true_y |
|---|---|---|---|
| 0 | | TIL Eleanor Roosevelt held weekly press confer... | r/todayilearned |
| 1 | | TIL that Algerian women make up 70% of their l... | r/todayilearned |
| 2 | | TIL the only university that would allow the m... | r/todayilearned |
| 3 | | TIL that some black bears are born white and a... | r/todayilearned |
| 4 | | TIL of Mary Babnik Brown, who donated her 34" ... | r/todayilearned |


I will create a feature that combines post and title instead of deleting the post feature.

In [74]:
til.isnull().sum()

post      721
title       0
true_y      0
dtype: int64

In [75]:
ask.isnull().sum()

post      984
title       0
true_y      0
dtype: int64

In [76]:
ask['post'] = ask['post'].fillna('')

In [77]:
til['post'] = til['post'].fillna('')

In [78]:
# Post + Title Feat. Eng.

ask['interact'] = ask['post'] + ask['title']
til['interact'] = til['post'] + til['title']

In [79]:
ask.head()

| | post | title | true_y | interact |
|---|---|---|---|---|
| 0 | \*\*Please keep all top level-comments as questi... | Halloween Megathread 2018 | r/AskReddit | \*\*Please keep all top level-comments as questi... |
| 1 | \*\*Please keep all top level-comments as questi... | The Sexy Halloween Megathread Strikes Again! | r/AskReddit | \*\*Please keep all top level-comments as questi... |
| 2 | | What is a fun game you play in your head to ki... | r/AskReddit | What is a fun game you play in your head to ki... |
| 3 | | Managers of Reddit, what’s the fastest you’ve ... | r/AskReddit | Managers of Reddit, what’s the fastest you’ve ... |
| 4 | | What's your biggest "This isn't what it looks ... | r/AskReddit | What's your biggest "This isn't what it looks ... |


In [80]:
# Join TIL and AskReddit into a dataframe called concat_df

concat_df = pd.concat([til, ask], axis = 0, ignore_index= True)

In [81]:
# clean stop words:

from nltk.corpus import stopwords
stop = stopwords.words('english')

concat_df['interact'] = concat_df['interact'].apply(lambda x: ' '.join([word for word in x.split(' ') if word not in (stop)]))

In [82]:
# Use BeautifulSoup to get rid of html artifacts.

concat_df['interact'] = concat_df['interact'].map(lambda x: BeautifulSoup(x,'lxml').get_text())

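# Initialize Lemmatizer
# Note: each element below is a whole document rather than a single token, so
# lemmatize() returns it essentially unchanged; token-level lemmatization would
# need the text to be split into words first.
lemmatizer = WordNetLemmatizer()
concat_df['interact'] = [lemmatizer.lemmatize(doc) for doc in concat_df['interact']]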

# lower case
concat_df['interact'] = [word.lower() for word in concat_df['interact']]

# get rid of all special characters using Regex
concat_df['interact'] = [re.sub('[^A-Za-z0-9]+', ' ', word) for word in concat_df['interact']]
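
`WordNetLemmatizer.lemmatize` expects a single token, so passing a whole document string leaves it essentially unchanged. A hedged sketch of token-level lemmatization follows. `lemmatize_document` is a hypothetical helper, and the toy `strip_plural` function only stands in for `lemmatizer.lemmatize` so the example runs without the WordNet corpus.

```python
def lemmatize_document(text, lemmatize_token):
    """Apply a token-level lemmatizer to every whitespace-separated token."""
    return ' '.join(lemmatize_token(token) for token in text.split())

def strip_plural(token):
    # Toy stand-in for lemmatizer.lemmatize: drops one trailing 's' from longer tokens.
    return token[:-1] if len(token) > 3 and token.endswith('s') else token

print(lemmatize_document('black bears born white', strip_plural))  # black bear born white
```

In the notebook this could be applied as `concat_df['interact'].apply(lambda doc: lemmatize_document(doc, lemmatizer.lemmatize))`, assuming the WordNet data has been downloaded.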

In [135]:
# Remove "TIL" from concat_df['interact'] because it would make classification too easy.
# Remove "?" as well, because almost all AskReddit titles end in a question mark
# (the regex in the previous cell has already replaced it with a space).
# Caution: str.replace('til', '') also strips "til" from inside longer words such as "until".

artifact = ['?', 'til']
for i in artifact:
    concat_df['interact'] = concat_df['interact'].map(lambda x: x.replace(i, ''))

concat_df['interact'] = concat_df['interact'].str.lower()
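
One caveat with `x.replace('til', '')` is that it also deletes those letters inside longer words (for example, `until` becomes `un`). Here is a minimal sketch that uses a word-boundary regular expression instead; `remove_token` is a hypothetical helper name.

```python
import re

def remove_token(text, token):
    """Remove `token` only where it appears as a whole word, then tidy spaces."""
    pattern = r'\b' + re.escape(token) + r'\b'
    without = re.sub(pattern, ' ', text, flags=re.IGNORECASE)
    return ' '.join(without.split())

print(remove_token('til the bridge stayed closed until 1999', 'til'))
# the bridge stayed closed until 1999
```

It could be mapped over the column with `concat_df['interact'].map(lambda x: remove_token(x, 'til'))`.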

In [136]:
# One-hot encode true_y into one indicator column per subreddit.
# Note: `one_hot` is not used below; the models are fit on the string labels in true_y.

one_hot = pd.get_dummies(concat_df['true_y'])


In [137]:
concat_df.head() # It's clean!

| | post | title | true_y | interact |
|---|---|---|---|---|
| 0 | | TIL Eleanor Roosevelt held weekly press confer... | r/todayilearned | eleanor roosevelt held weekly press conferenc... |
| 1 | | TIL that Algerian women make up 70% of their l... | r/todayilearned | algerian women make 70 lawyers 60 judges 65 u... |
| 2 | | TIL the only university that would allow the m... | r/todayilearned | university would allow movie animal house fil... |
| 3 | | TIL that some black bears are born white and a... | r/todayilearned | black bears born white called spirit bears th... |
| 4 | | TIL of Mary Babnik Brown, who donated her 34" ... | r/todayilearned | mary babnik brown donated 34 untreated hair u... |


In [138]:
X = concat_df['interact']
y = concat_df['true_y']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
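
To illustrate the difference between count and binary features, here is a small sketch in plain Python (no scikit-learn). The example documents are made up; in `CountVectorizer`, the corresponding switch is the `binary=True` parameter.

```python
from collections import Counter

docs = ['what is the best best game', 'called the red red red planet']

# Shared vocabulary, sorted so each word maps to a fixed column.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_features(doc):
    """One column per vocabulary word, holding how often it occurs."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

def binary_features(doc):
    """Same columns, but only recording presence (1) or absence (0)."""
    return [min(1, value) for value in count_features(doc)]

print(vocabulary)
print(count_features(docs[0]))   # [2, 0, 1, 1, 0, 0, 1, 1]
print(binary_features(docs[0]))  # [1, 0, 1, 1, 0, 0, 1, 1]
```

For short titles the two representations are often nearly identical, since few words repeat within one title.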

In [139]:
# vectorize interact feature:
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             lowercase = False,
                             max_features = 5000)

vectorizer.fit(X_train)

X_train_vec = vectorizer.transform(X_train)

X_test_vec = vectorizer.transform(X_test)

In [140]:
X_train_vec_df = pd.DataFrame(X_train_vec.todense(), 
                    columns = vectorizer.get_feature_names())

## Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [141]:
# Random forest with default hyperparameters

clf = RandomForestClassifier(n_jobs=2, random_state=0)

In [142]:
clf.fit(X_train_vec.todense(), y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [143]:
clf.score(X_train_vec, y_train)

0.9992193598750976

In [144]:
pred = clf.predict(X_test_vec)

In [145]:
clf.score(X_test_vec, y_test)

0.9203747072599532

In [146]:
clf.predict_proba(X_test_vec)[:10]

array([[0.6, 0.4],
       [1. , 0. ],
       [0.2, 0.8],
       [1. , 0. ],
       [1. , 0. ],
       [0.2, 0.8],
       [1. , 0. ],
       [0.1, 0.9],
       [1. , 0. ],
       [1. , 0. ]])

In [147]:
np.argsort(clf.feature_importances_)[4900:5000]

array([ 388, 1244, 4854, 2586, 1033, 2327, 2269, 2754, 4091, 1902, 4137,
       2649, 3319, 2581, 1844,  905, 1366, 4444, 1601, 2293, 4018,  125,
       1344, 2063,  365, 1998,  949, 1986, 4931,  358,  732, 2644, 2412,
       1470, 2332, 3984, 3140, 2531,  678, 1924, 2419, 1558, 2535, 1991,
       2195, 4799, 1903, 4862, 2251, 3814, 2211, 3893, 2773, 2482, 4650,
       3657, 4968, 4683, 2065, 4413, 4093, 1331, 2468, 2727,  553, 1827,
       2912, 2289,  392, 4285, 4851, 1678,  402, 1184, 2611, 2559, 4969,
       3778,  521, 4865, 2504,    0, 4406, 1745, 3979, 4858, 2630, 4404,
        731,  938, 3359, 4869, 4975, 4943, 3980, 2138, 2120, 3358, 4378,
       4849])

In [148]:
feature_names = pd.DataFrame(X_train_vec.todense(), columns=vectorizer.get_feature_names()).head()

In [149]:
# Top 100 words

feature_names.columns[[2701, 2311, 4851, 3878, 2090,  732, 2020, 1732, 2535,  663,  125,
       2725, 3988, 1773, 3122, 4439,  388, 1467, 3529, 3750, 1845,   34,
        649, 3138, 4397, 2285, 1515, 2559, 1868,    4, 3577, 3354, 1987,
       2263, 4645, 2480,  689, 4015, 3703, 4862, 2664, 4914, 1925, 3623,
       3891, 3271, 4609, 3775,  392, 4799, 1828,  219, 2418, 2493, 1678,
       4941, 4650, 2641, 2732, 2611, 2580, 4404, 2771,  678, 1911,  866,
       4090,  949, 4968, 2649, 1876, 2630, 2644, 4411,  402, 4974, 4869,
       1745,  731, 4917,    0, 2504, 4943, 4858, 3976, 4402, 2288, 2066,
       2910, 4865, 3977, 3776,  938, 2121, 4975, 3356, 2139, 4376, 4849,
       4442]]

Index(['ocd', 'lame', 'whats', 'silphium', 'highly', 'better', 'grant',
       'dropped', 'medical', 'based', '2000', 'ok', 'sophie', 'eating',
       'presents', 'thursday', 'alongside', 'day', 'rewrite', 'sega',
       'experienced', '16', 'banned', 'prime', 'these', 'knocks', 'defense',
       'miles', 'famous', '10', 'roger', 'recruitment', 'germany', 'killian',
       'united', 'lost', 'became', 'speaker', 'scientific', 'white', 'nuclear',
       'woman', 'flying', 'russa', 'singer', 'rabid', 'uk', 'sergio', 'also',
       'war', 'every', '80', 'lights', 'made', 'do', 'worst', 'university',
       'non', 'on', 'named', 'moment', 'thing', 'origin', 'be', 'fired',
       'british', 'stashed', 'canada', 'year', 'not', 'favourite', 'new',
       'north', 'thirds', 'american', 'york', 'why', 'due', 'best', 'women',
       '000', 'man', 'would', 'which', 'some', 'thieves', 'knowledge', 'head',
       'penny', 'who', 'somebody', 'serial', 'called', 'however', 'you', 'red',
       'ii', '
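
Hard-coding the index list is fragile, because the indices change whenever the vectorizer or the forest is refit (the indices in the cell above differ from the `argsort` output before it). Here is a small sketch that pairs names with importances directly. `top_features` is a hypothetical helper, and the names and scores below are invented for illustration.

```python
def top_features(names, importances, k=10):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

names = ['called', 'what', 'year', 'best']
importances = [0.30, 0.45, 0.05, 0.20]
print(top_features(names, importances, k=2))  # [('what', 0.45), ('called', 0.3)]
```

In the notebook, the call would presumably look like `top_features(vectorizer.get_feature_names(), clf.feature_importances_, k=100)`.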

### There are a lot of question words in the top 100 most important words: "how, what's, if, why, which"
### There are some words that denote comparison or superlatives: "similar, highest, better"

This makes sense because when you ask a question (on r/AskReddit), most questions start with 'what', 'what's', and 'how'.

After deeper exploration, other interesting feature words in top 100 were: "October", "Russian", "German", "Canadian", "American".

I wonder if some of the subjects were about the midterm elections.

#### Thought experiment: What is the baseline accuracy for this model?

In [150]:
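# Share of r/todayilearned posts in the combined data (721 of 1,708 rows).
# The majority class is r/AskReddit, so the baseline accuracy of always
# predicting it is 1 - 0.422, or about 0.578.
til_share = 721/1708
til_share

# The random forest scored about 92% on the test set, well above that baseline.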

0.42213114754098363
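
The usual baseline for a classifier is the accuracy of always predicting the most common class. Here is a minimal sketch using the class sizes reported earlier (987 r/AskReddit rows and 721 r/todayilearned rows); `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

labels = ['r/AskReddit'] * 987 + ['r/todayilearned'] * 721
print(round(majority_baseline(labels), 4))  # 0.5779
```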

In [151]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred) 

array([[238,   9],
       [ 25, 155]])
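
Reading the matrix: scikit-learn puts true classes in rows and predicted classes in columns, with labels in sorted order, so row 0 should be `r/AskReddit` and row 1 `r/todayilearned`. Here is a short sketch that computes accuracy, plus precision and recall for `r/todayilearned`, from those four numbers.

```python
matrix = [[238, 9],
          [25, 155]]

# Treat r/todayilearned (second row and column) as the positive class.
tn, fp = matrix[0]
fn, tp = matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))  # 0.9204 0.9451 0.8611
```

The accuracy matches the test score of the random forest above; the lower recall shows that most of its errors are r/todayilearned posts labeled as r/AskReddit.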

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

In [152]:
# Grid search over random forest hyperparameters (the pipeline holds only the classifier)

steps = [
    ('random_forest', RandomForestClassifier())
]

pipe = Pipeline(steps)

param_grid = {
    'random_forest__max_depth' : [30, 50, 70, 80, 100],
    'random_forest__n_estimators' : [20, 30, 50, 70, 80, 100]
}

gs = GridSearchCV(pipe, param_grid, cv = 3)
result = gs.fit(X_train_vec, y_train)


In [153]:
result.best_params_

{'random_forest__max_depth': 100, 'random_forest__n_estimators': 100}

In [154]:
result.best_score_ # mean 3-fold CV accuracy on the training set, about 2 points above the untuned forest's test score

0.9430132708821234
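
The grid search above tunes only the forest on data that was already vectorized. For the bonus task, here is a hedged sketch that puts the vectorizer inside the `Pipeline`, so its settings are tuned too and its vocabulary is refit within each fold. The toy texts, labels, and grid values are made up for illustration; on the real data you would fit on the raw `X_train` strings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up examples standing in for X_train and y_train.
texts = [
    'what is your favourite board game',
    'why do cats purr so loudly',
    'how do you stay focused at work',
    'what is the best advice you ever got',
    'which movie would you watch again',
    'who is the funniest person you know',
    'the bridge was built in 1932 by a local engineer',
    'some black bears are born white and called spirit bears',
    'the university was founded by a former mayor',
    'the first reporters were hired in the same year',
    'a famous singer was once a licensed pilot',
    'the island was named after a british sailor',
]
labels = ['r/AskReddit'] * 6 + ['r/todayilearned'] * 6

pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('random_forest', RandomForestClassifier(random_state=0)),
])

# Step names plus a double underscore address each step's parameters.
param_grid = {
    'vectorizer__binary': [False, True],
    'random_forest__n_estimators': [10, 20],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```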

## `GradientBoost` with GridSearchCV

In [155]:
## Gradient Boost

gradient_boost = GradientBoostingClassifier()

grad_params = {
    'n_estimators' : [80, 90, 100],
    'learning_rate' : [0.09, 0.1, 0.2, 0.5],
    'max_depth' : [1, 2, 4, 6]
}

g_search = GridSearchCV(gradient_boost, param_grid = grad_params, cv = 3)
print(g_search.fit(X_train_vec, y_train))
print(g_search.best_params_)
print(g_search.best_score_)

GridSearchCV(cv=3, error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [80, 90, 100], 'learning_rate': [0.09, 0.1, 0.2, 0.5], 'max_depth': [1, 2, 4, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
{'learning_rate': 0.09, 'max_depth': 6, 'n_estimators': 90}
0.927400468384075


In [156]:
g_search.score(X_test_vec, y_test)

0.9484777517564403

## `Decision Tree` with GridSearchCV

In [157]:
## Decision Tree

dt = DecisionTreeClassifier(max_depth= 20)
cross_val_score(dt, X_train_vec, y_train, cv = 5).mean()

0.9187986381322958

In [158]:
dt.fit(X_train_vec, y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=20,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [159]:
dt.score(X_test_vec, y_test) 

0.955503512880562

In [160]:
dt = DecisionTreeClassifier()

dt_params = {
    'max_depth' : [10, 20, 30, 50, 100, 200],
}

gridsearch = GridSearchCV(dt, param_grid = dt_params, cv = 5)
gridsearch.fit(X_train_vec, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [10, 20, 30, 50, 100, 200]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [161]:
gridsearch.best_params_

{'max_depth': 10}

In [162]:
gridsearch.score(X_test_vec, y_test)

0.9461358313817331

# Executive Summary
---


1. Random Forest had a 99.9% training accuracy and a 92.0% test accuracy. It was overfit.
2. Gradient boosting had a 92.7% cross-validated accuracy on the training set and a 94.8% test accuracy. It performed better.
3. Decision Tree (max_depth of 20) had a 91.9% cross-validated accuracy on the training set and a 95.6% test accuracy. It performed the best.

Using `feature_importances_`, I found that the most important words were 'What' and "tickets". This makes sense because most of the posts that belong to 'AskReddit' start with 'What'. I'm not sure why the word "tickets" came up as the very first important word.

In r/todayilearned, all of the posts begin with 'TIL'. I took that keyword out because it would make the classification too easy. When TIL was left in, I received 99% to 100% accuracy scores.

Observing which keywords were most important provides insight into what kinds of questions most people asked. It seems that a lot of people had scientific questions relating to their physical surroundings, judging from keywords such as 'city', 'earth', and 'called'. I also think people were interested in societal questions, judging from keywords such as 'black' and 'love'. Historical questions on AskReddit were popular too; 'WW2' came up once.

In the future, I would like to experiment with removing 'What' and 'Whats' and see how the accuracy changes. Since we also know what types of questions are popular, I would like to know if there's a correlation between receiving Reddit Gold and asking or answering scientific, historical, or societal questions.
