# Lesson 10: Decision Trees & Random Forests
## Starter code for guided practice & demos
Today's examples use the StumbleUpon dataset to predict "evergreen-ness" of content using decision trees and random forests.

In [54]:
# Imports
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import seaborn as sns
%matplotlib inline

# Config
DATA_DIR = Path('./resources')
np.random.seed(1)

## Activity: "Exploring The StumbleUpon Dataset"
We will be using a dataset from StumbleUpon, a service that recommends webpages to users based upon their interests.  They like to recommend “evergreen” sites, ones that are always relevant.  This usually means websites that avoid topical content and focus on recipes, how-to guides, art projects, etc.  We want to determine important characteristics for “evergreen” websites. Follow these prompts to get started:

**Break into groups.**

1. Prior to looking at the data, brainstorm 3-5 characteristics that would be useful for predicting evergreen websites.
2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?
3. Does being a news site affect evergreen-ness? Compute or plot the percent of evergreen news sites.
4. In general, does category affect evergreen-ness? Plot the rate of evergreen sites for all Alchemy categories.
5. How many articles are there per category?
6. Create a feature for the title containing “recipe”. Is the percentage of evergreen websites higher or lower on pages that have “recipe” in the title?

**Check:** Were you able to plot the requested features? Can you explain how you would approach this type of dataset?

In [55]:
# Import data
df = pd.read_csv(DATA_DIR / 'stumbleupon.tsv', sep='\t')
df.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,1,1,24,0,5424,170,8,0.152941,0.07913,0
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,1,1,40,0,4973,187,9,0.181818,0.125448,1
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,1,1,55,0,2240,258,11,0.166667,0.057613,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,1,0,24,0,2737,120,5,0.041667,0.100858,1
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,1,1,14,0,12032,162,10,0.098765,0.082569,0


In [56]:
# The 'boilerplate' column contains some JSON, let's extract some values inside the JSON as new cols
df['title'] = df.boilerplate.map(lambda x: json.loads(x).get('title', ''))
df['body'] = df.boilerplate.map(lambda x: json.loads(x).get('body', ''))
df.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


### Data dictionary
This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender. A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of number of <embed> usage
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but have a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer Number of <a>|markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if it's url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are "evergreen" sites?

> Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

> A sample of URLs is below, where label = 1 are 'evergreen' websites

In [57]:
df[['url', 'label']].head()

Unnamed: 0,url,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,0
1,http://www.popsci.com/technology/article/2012-...,1
2,http://www.menshealth.com/health/flu-fighting-...,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,1
4,http://bleacherreport.com/articles/1205138-the...,0


### Exercises

> #### 1. In a group, brainstorm 3 - 5 features you could develop that would be useful for predicting evergreen websites.

Answers:
1. Is the site news
2. How frequently is the site updated
3. How much non-markup text there is on the page

> ####  2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?
- e.g. if you believe high-image content websites are likely to be evergreen, how can you build a feature that represents this?
- e.g. if you believe weather content is likely NOT to be evergreen, how might you build a feature that represents that?

> #### Split up and develop 1-3 of the those features independently.

Answers:

> #### 3. Does being a news site affect evergreen-ness? 
Compute or plot the percentage of news related evergreen sites.

In [67]:
# Show average 'label' value and total count for is_news
pd.pivot_table(data=df,values='label',index='is_news',aggfunc=('mean','count'))

Unnamed: 0_level_0,mean,count
is_news,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.516916,4552
?,0.507562,2843


> #### 4. Does category in general affect evergreen-ness? 
Plot the rate of evergreen sites for all Alchemy categories.

In [73]:
pd.pivot_table(data=df,values='label',index='alchemy_category',aggfunc=('mean','count'))

Unnamed: 0_level_0,mean,count
alchemy_category,Unnamed: 1_level_1,Unnamed: 2_level_1
?,0.502135,2342
arts_entertainment,0.371945,941
business,0.711364,880
computer_internet,0.246622,296
culture_politics,0.457726,343
gaming,0.368421,76
health,0.573123,506
law_crime,0.419355,31
recreation,0.684296,1229
religion,0.416667,72


> #### 5. How many articles are there per category?

In [72]:
# See 'count' column in table above.

> Let's try extracting some of the text content.

> #### 6. Create a feature for the title containing 'recipe'. 
Is the % of evegreen websites higher or lower on pages that have recipe in the the title?

In [86]:
df['title_contains_recipe'] = 0
df.loc[(df.title.notnull() & df['title'].str.contains("recipe")),'title_contains_recipe'] = 1

## Demo: "Building decision trees in scikit-learn"
Let's build a decision tree model to predict the "evergreen-ness" of a given website.

In [102]:
from os import system
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Select features
df['recipe'] = df['title'].map(lambda t: 1 if 'recipe' in unicode(t).lower() else 0)
X = df[['image_ratio', 'html_ratio', 'recipe', 'label']].dropna()
y = X['label']
X.drop('label', axis=1, inplace=True)

# Fit the model
model = DecisionTreeClassifier()
model.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

**The example below uses GraphViz. Please see the solution code for this lesson for detailed install instructions!**

In [103]:
# Helper function to visualise decision trees (creates a file tree.png in local directory)
def build_tree_image(model, filename='tree.png'):
    dotfile = open("tree.dot", 'w')
    export_graphviz(model, out_file = dotfile, feature_names = X.columns)
    dotfile.close()
#     system("dot -Tpng tree.dot -o {0}".format(filename))  # comment out this line if you don't have GraphViz yet

build_tree_image(model)

This generates a massive 26MB PNG file! Zoomed out, this looks something like this (I'm using screenshots here to avid loading the whole PNG into the notebook):

<img src='./resources/massive_tree.png' width= 80%>

Let's zoom in a little!

<img src='./resources/massive_tree_zoom.png' width= 80%>

Let's zoom in more...

<img src='./resources/massive_tree_zoom_max.png' width= 80%>

## Activity: "Evaluating decision trees in scikit-learn"
Let's evaluate our decision tree.

1. In your groups from earlier, work on evaluating the decision tree using cross-validation methods.
2. What metrics would work best?  Why?

**Check:** Are you able to evaluate the decision tree model using cross-validation methods?

In [104]:
# Evaluate the decision tree using cross-validation; try using e.g. AUC as the evaluation metric.
from sklearn.cross_validation import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(), X, y, scoring='roc_auc', cv=5)

print "Cross-validated AUC scores: ", scores
print "Average AUC:", scores.mean()

Cross-validated AUC scores:  [ 0.54649123  0.57876318  0.5932724   0.58321256  0.55133026]
Average AUC: 0.570613924016


## Activity: "Adjusting decision trees to avoid overfitting"
1. You can control for overfitting in decision trees by adjusting one of the following parameters:
  - `max_depth`:  Control the maximum number of questions.
  - `min_samples_in_leaf`:  Control the minimum number of records in each node.

2. Test each of these parameters below.

In [107]:
# Control for overfitting in the decision tree model by adjusting the maximum number of questions
# (max_depth) or the minimum number of records in each final node (min_samples_leaf).
max_depth = 2
min_samples_leaf = 5
model = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
model.fit(X, y)

filename = "tree-max_depth_{0}-min_samples_leaf_{1}.png".format(max_depth, min_samples_leaf)
build_tree_image(model, filename='filename')

<img src=filename width= 80%>

## Activity: "Regression with decision trees & random forests"
1. Build a random forest model to predict the evergreen-ness of a website.  Remember to use the parameter n_estimators to control the number of trees used in the model.
2. Take note of the features that give the best splits to determine the most important features.
3. Decision trees and random forests can be used for both classification and regression.  In regression, predictions are made by taking the average value of the samples in the leaf node. You can take the average of the individual trees' predictions.  Build a regression based random forest model.

In [108]:
# How to fit a random forest model in sklearn
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [109]:
# Extracting importance of features
features = X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values(by='Importance Score', inplace=True, ascending=False)

features_df.head()

Unnamed: 0,Features,Importance Score
1,html_ratio,0.499259
0,image_ratio,0.41119
2,recipe,0.089551


In [112]:
# Now evaluate the random forest model using cross-validation.
# Increase the number of estimators and view how that improves predictive performance.
scores = cross_val_score(RandomForestClassifier(), X, y, scoring='roc_auc', cv=5)

print "Cross-validated AUC scores: ", scores
print "Average AUC:", scores.mean()

Cross-validated AUC scores:  [ 0.62825475  0.61739771  0.64161726  0.61662641  0.61307976]
Average AUC: 0.623395177909


In [113]:
model = RandomForestClassifier(n_estimators = 50)
model.fit(X, y)
scores = cross_val_score(RandomForestClassifier(), X, y, scoring='roc_auc', cv=5)

print "Cross-validated AUC scores: ", scores
print "Average AUC:", scores.mean()

Cross-validated AUC scores:  [ 0.60193622  0.61431983  0.66616894  0.62964335  0.61436155]
Average AUC: 0.625285978019


## Activity: "Evaluate random forest using cross-validation"
1. Building upon the previous Guided Practice, continue adding any input variables to the model that you think may be relevant.
2. For each feature:
  - Evaluate the model for improved predictive performance using cross-validation.
  - Evaluate the _importance_ of the feature.

3. **Bonus:** Just like the `recipe` feature, add in similar text features and evaluate their performance.

In [117]:
# Look at all possible features
df.dtypes

url                                object
urlid                               int64
boilerplate                        object
alchemy_category                   object
alchemy_category_score             object
avglinksize                       float64
commonlinkratio_1                 float64
commonlinkratio_2                 float64
commonlinkratio_3                 float64
commonlinkratio_4                 float64
compression_ratio                 float64
embed_ratio                       float64
framebased                          int64
frameTagRatio                     float64
hasDomainLink                       int64
html_ratio                        float64
image_ratio                       float64
is_news                            object
lengthyLinkDomain                   int64
linkwordscore                       int64
news_front_page                    object
non_markup_alphanum_characters      int64
numberOfLinks                       int64
numwords_in_url                   