# Predicting Evergreeness of Content with Decision Trees and Random Forests

# Problem Statement

Let's imagine you were making a recommendation engine.

You want to make recommendations about sites that are always useful (and not just useful for a short period of time).

We call this quality of always being relevant <u>"Evergreen"</u>.

StumbleUpon is a recommendation service with this goal in mind.

Specifically, users have an extension that recommends websites to them and that they can click to indicate a site they like.

The problem is deciding whether the site that users liked should also be recommended to others.

StumbleUpon needs to know if the site a user provided is Evergreen (always useful to others).

Therefore, StumbleUpon needs a classifier that can determine whether a site is Evergreen or not, based on the information they have about it.

## Predicting "Evergreeness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/).

It contains webpages suggested by users, basic meta-data from the website, as well as a user-provided label for whether the site is "Evergreen" or not.

From this data, we can train a classifier to predict whether a website is Evergreen.

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of number of `<embed>` usage
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if at least 3 `<a>`'s text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer|Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0)

# Reading in the Data

Let's first read the data into a pandas dataframe.

The data are not comma-separated as usual, but rather separated by tabs.

We can tell pandas to treat tabs as the delimiter by indicating that the field separator is '\t'.

We do this by passing the argument: sep = '\t'
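
As a small illustration of the tab separator and of the two ways to handle missing values mentioned in the next cell, here is a sketch on a made-up three-row table. The rows, the URLs, and the choice of the median as the fill value are assumptions for demonstration only; they are not taken from the StumbleUpon file.

```python
import io

import pandas as pd

# Hypothetical stand-in for stumbleupon.tsv: tab-separated, one missing image_ratio
tsv_text = (
    "url\timage_ratio\tlabel\n"
    "http://example.com/a\t0.10\t1\n"
    "http://example.com/b\t\t0\n"
    "http://example.com/c\t0.30\t0\n"
)

# sep="\t" tells pandas that fields are separated by tabs
raw = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Option A: drop every row that has a missing value
dropped = raw.dropna()

# Option B: keep every row and fill the gap with the column median
imputed = raw.fillna({"image_ratio": raw["image_ratio"].median()})

print(len(raw), len(dropped), len(imputed))
```

Dropping rows is simpler but discards data; imputing keeps every row at the cost of inventing a plausible value.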


In [None]:
import pandas as pd

#Tell the pandas csv reader that our file has the fields separated by a tab (\t)
#We'll drop rows with missing values for simplicity, but typically you would want to impute those values
df = pd.read_csv("stumbleupon.tsv", sep='\t').dropna()
df.head(10)

# Reminder: What are 'evergreen' sites?

Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

A sample of URLs is below, where label = 1 are 'evergreen' websites

In [None]:
df[['url', 'label']].head()

### Exercises to Get Started

### Exercise: 1. Brainstorm 2 - 4 new features you could develop that would be useful for predicting evergreen websites based on the titles.
 
- Example: One feature could be a dummy variable for if the webpage is a recipe or not. Presumably recipes are always useful
 
###  Exercise: 2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?
- If you believe high-image content websites are likely to be evergreen, how can you build a feature that represents that?

- Example: If you think websites with recipes are likely to be evergreen, you could make a column that corresponds to whether or not the word recipe is in the title.

### Exercise: 3. Implementing your feature ideas

- We can make a feature that looks for keywords in the text field by applying a function to the relevant column

- We can also use a string command to accomplish the same

- Example: Make a feature that codes whether the website has the word recipe in the title

In [None]:
# Option 1: Create a function to check for this

def has_recipe(content):
    try:
        if 'recipe' in str(content).lower():
            return 1
        else:
            return 0
    except Exception:
        return 0
        
df['recipe'] = df['title'].apply(has_recipe)


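# Option 2: string functions
# case=False makes the match case-insensitive like Option 1;
# na=False treats missing titles as no match; astype(int) gives 0/1 instead of True/False

df['recipe'] = df['title'].str.contains('recipe', case=False, na=False).astype(int)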


# Check the results

We can check to see if our function worked by subsetting our data to only have the rows where recipe  = 1 

<i>df[df.recipe == 1]</i>

and then printing the title for those rows

In [None]:
print(df[df.recipe == 1]['title'])

# Create your own Predictors

In the previous example, we thought that recipes would probably be more consistently useful. Can you think of any other keywords that, if they appeared in the title or body of the website, would indicate that the website is likely to be evergreen?

In the space below, feel free to try to implement your ideas from Exercise 1 and 2
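
One way to try several keywords at once is to build one 0/1 column per keyword. The sketch below uses a handful of invented titles in place of `df['title']`, and the keyword list is only a guess to be tested, not a set of known predictors.

```python
import pandas as pd

# Hypothetical titles standing in for df['title']; None mimics a missing title
titles = pd.Series([
    "Easy Chocolate Cake Recipe",
    "Election results: live updates",
    "How to tie a tie",
    None,
])

# Candidate keywords (assumptions to evaluate, not established predictors)
keywords = ["recipe", "how to"]

# One 0/1 column per keyword; matching is case-insensitive and literal (no regex)
features = pd.DataFrame({
    kw.replace(" ", "_"): titles.str.contains(kw, case=False, na=False, regex=False).astype(int)
    for kw in keywords
})

print(features)
```

On the real data you would replace `titles` with `df['title']` and attach the resulting columns to `df` before choosing predictors.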

#  Let's Explore Using Decision Trees To Classify our Websites

### Build a decision tree model to predict the "evergreeness" of a given website.

First, we'll import the library to make the decision tree.

Next, we'll make a decision tree classifier object.

We'll keep the tree simple for now, so we'll specify that the tree can only have two levels at most
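
To see what the depth limit does, the following sketch compares a depth-limited tree with an unconstrained one on synthetic data from `make_classification` (not the StumbleUpon data). It assumes a scikit-learn version that provides `get_depth` and `get_n_leaves` (0.21 or later).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# A tree limited to two levels of splits versus an unconstrained tree
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
deep = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

print(shallow.get_depth(), shallow.get_n_leaves())
print(deep.get_depth(), deep.get_n_leaves())
```

With `max_depth=2` the tree can make at most two splits along any path, so it ends in at most four leaves.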

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=2)

Next, we'll build the dataframe that just has the predictors (image_ratio and recipe).

In this example, we'll use the ratio of images to text on the webpage and whether the site has the word 'recipe' in the title.

And then we'll build the dataframe that just has the outcome (whether the site was labeled as evergreen or not).

In [None]:
# Let's subset the dataframe to only have the predictor columns
predictors = ['image_ratio','recipe']
X = df[predictors]
y = df['label']

# Next we'll fit the model

In [None]:
model.fit(X, y)

# Let's visualize the tree

What we're going to do is create a function that takes our model in.

We'll then create an IO object that acts as a fake file for writing the data to.

We'll then use the export_graphviz function from sklearn to write our model to the graphviz format and save the data in our fake file.

Last, we'll get the data that was written to the file, and then print it.

We can then use Graphviz to visualize it. 

Graphviz is a free program that reads in a graph stored in the "dot" format, and converts it to a labeled image

You don't need to have Graphviz installed on your computer.

Instead, you can copy the printed information and paste in the following website:
    http://www.webgraphviz.com/
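
If you would rather stay inside the notebook, a plain-text rendering of the tree is another option. This is a sketch that assumes scikit-learn 0.21 or later, where `export_text` is available; it uses synthetic data and simply borrows the feature names `image_ratio` and `recipe` as labels.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data; the feature names below are labels for display only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Render the fitted tree's split rules as indented text
rules = export_text(tree, feature_names=["image_ratio", "recipe"])
print(rules)
```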

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO

def build_tree_image(model):
    # Write the tree in Graphviz "dot" format to an in-memory text buffer
    dotfile = StringIO()
    export_graphviz(model, out_file=dotfile, feature_names=list(X.columns))

    # If you had Graphviz installed on your computer, you would uncomment the following lines of code
    # with open("tree.dot", "w") as dotfile_on_disk:
    #     export_graphviz(model, out_file=dotfile_on_disk, feature_names=list(X.columns))
    return dotfile.getvalue()

print(build_tree_image(model))

# Evaluating the contribution of each feature

Although decision trees are non-linear, you can still get a general sense of which predictor contributed the most to your classifications.

Specifically, the feature importance is the (normalized to sum to 1) total reduction in impurity brought about by that feature in the tree. If a feature is able to split the data into groups that are each dominated by one class, then it has a lot of importance.
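
To make "reduction in impurity" concrete, here is a hand calculation for a single split. It assumes Gini impurity (the default criterion of `DecisionTreeClassifier`) and uses ten invented labels; the split on `recipe` is hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1 - p1) ** 2

# Hypothetical node: 5 evergreen (1) and 5 non-evergreen (0) sites
parent = [1] * 5 + [0] * 5

# Hypothetical split on recipe: recipe == 1 goes left, recipe == 0 goes right
left = [1] * 4            # all four recipe pages are evergreen
right = [1] + [0] * 5     # one evergreen page among six non-recipe pages

# Impurity after the split is the size-weighted average of the children
weighted_children = (
    len(left) / len(parent) * gini(left)
    + len(right) / len(parent) * gini(right)
)
decrease = gini(parent) - weighted_children

print(round(gini(parent), 3), round(decrease, 3))  # 0.5 0.333
```

A feature's importance adds up decreases like this one over every node where the feature is used, weighted by the share of samples reaching that node, and the totals are then normalized to sum to 1.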

On our trained model, we can read the "feature\_importances_" attribute.

In [None]:
model.feature_importances_

We can associate the feature with the predictor name using the zip command.

The zip command matches up elements from two lists and pairs them as tuples; wrapping it in list() collects the pairs into a single list.

It appears that recipe had roughly twice the importance of image ratio when classifying whether the webpage is an evergreen site.

In [None]:
list(zip(predictors, model.feature_importances_))

# Evaluating performance

We can see how our classifier performed, similar to how we evaluated the performance of our logistic regression and KNN classifier.

We'll first build a K-fold cross-validation object that splits our data into training and test sets.

For each fold, we'll fit the model on the training set of data, and then test out the performance on the held-out test set.

We'll then see what the average performance was across all folds
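
The fold-by-fold procedure described above can be written out by hand. This sketch uses synthetic data and five shuffled folds (both are assumptions for illustration); it is meant to show the steps that a cross-validation helper performs, not to replace it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

fold_scores = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X_demo):
    # Fit a fresh model on the training rows of this fold
    fold_model = DecisionTreeClassifier(max_depth=2, random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])

    # Score it on the rows that were held out
    proba = fold_model.predict_proba(X_demo[test_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[test_idx], proba))

# Average performance across all folds
print(np.mean(fold_scores))
```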

In [None]:
from sklearn.model_selection import KFold, cross_val_score
#make a cross validation object for 10 folds
kfold = KFold(n_splits=10)

#for each fold, fit the model on the training rows and score it on the held-out rows
#using the kfold object to decide how to split the data
#using the area under the curve as our scoring metric for how well our model is making predictions
cv_scores = cross_val_score(model, X, y, cv=kfold, scoring='roc_auc')

#print off all of the cross-validation scores
print(cv_scores)

#print the average cross-validation score; 
#for Area under the Curve, anything above .50, is an improvement over chance
print(cv_scores.mean())

# Adding More Predictors and Increasing the Depth

Right now, our decision tree is limited to a depth of 2, and only has two predictors.

Try to increase the Area under the Curve Score by increasing the number of predictors.

Below are the variable names to choose from.

In [None]:
print(df.columns)

In [None]:
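from sklearn.model_selection import cross_val_score

# Fill in the names of the columns you want to use as predictors
predictors = ['','','']
X = df[predictors]
y = df['label']

model = DecisionTreeClassifier()
cv_scores = cross_val_score(model, X, y, cv=kfold, scoring='roc_auc')
print(cv_scores.mean())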

#  Adjusting Decision Trees to Avoid Overfitting

Right now, your decision tree is unconstrained in terms of its depth and the minimum samples needed for a leaf to exist.

Let's try out different values for the depth it can take and the minimum number of samples needed for a leaf before we stop splitting.

We've seen the grid search function before.

We give it our classifier and a list of parameters we want it to try on the classifier, and when we fit a model to it, grid search tries out all of the combinations of parameters you give it.

To use grid search, you need to provide a dictionary, where each key is the parameter, and we provide a list containing the different values we want it to try
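
To see how such a dictionary expands into individual combinations, `ParameterGrid` from scikit-learn can enumerate them. The shortened value lists below are chosen only to keep the output small.

```python
from sklearn.model_selection import ParameterGrid

# Each key is a parameter name; each value is the list of settings to try
params = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [6, 10],
}

# Every combination of one value per parameter
grid = list(ParameterGrid(params))

print(len(grid))  # 3 * 2 = 6
for combo in grid:
    print(combo)
```

The number of models grid search fits is the number of combinations multiplied by the number of cross-validation folds, so long value lists get expensive quickly.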

In [None]:
from sklearn.model_selection import GridSearchCV
max_depth_values = [2,3,4,5,6,7,8,9,10,12,13]
min_samples_leaf_values = [6,7,8,9,10,11,12,13,14]

model = DecisionTreeClassifier()
params = {'max_depth':max_depth_values,'min_samples_leaf':min_samples_leaf_values}
clf = GridSearchCV(model, params, cv=5,scoring='roc_auc')
clf.fit(X, y)

clf.cv_results_

# Finding the Best Parameter Values

We can sort through the grid search results by loading them into a dataframe and using its sort_values method.

Right now, our results are stored in cv_results_, a dictionary of arrays with one entry per parameter combination.

For each combination, we have the mean AUC across folds (mean_test_score), the standard deviation of the scores (std_test_score), and the parameter combination itself (params).

We want to order the combinations by the mean score, so we'll tell sort_values to sort by the mean_test_score column. The best combination will be in the last row.

In [None]:
results = pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
results.sort_values('mean_test_score')

# Random Forests

We've been focusing on building decision trees, but we can also build a random forest classifier that uses many different decision trees to classify our data.

We will simply create a RandomForestClassifier object and indicate how many different decision trees we want to include.

If n_estimators were equal to 1, we would expect performance roughly similar to that of a single decision tree.
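
To illustrate how the forest combines its trees, the sketch below fits a forest on synthetic data and checks that its predicted probabilities match the average of the individual trees' predicted probabilities. The data and the fixed `random_state` are assumptions for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
per_tree = np.array([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average the class probabilities over the 20 trees
averaged = per_tree.mean(axis=0)

print(len(forest.estimators_))
print(np.allclose(averaged, forest.predict_proba(X_demo)))
```

Because each tree sees a different bootstrap sample of the rows, their individual errors tend to differ, and averaging smooths them out.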

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
model.fit(X, y)

### Extracting importance of features

Similar to our decision tree context, we can also get the relative importance of each feature in our model

In [None]:
features = predictors
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df.head()

### Exercise: Evaluate the Random Forest model using cross-validation; increase the number of estimators and view how that improves predictive performance.

##  Independent Practice: Evaluate Random Forest Using Cross-Validation

1. Continue adding input variables to the model that you think may be relevant
2. For each feature:
  - Evaluate the model for improved predictive performance using cross-validation
  - Evaluate the _importance_ of the feature
3. **Bonus**: Just like the 'recipe' feature, add in similar text features and evaluate their performance.


Here is a reminder of the features:

In [None]:
print(df.columns)

In [None]:
from sklearn.model_selection import cross_val_score
predictors = ['image_ratio', 'recipe']
# Building a model with more relevant features

model = RandomForestClassifier(n_estimators=50)

# Continue to add features to X
#     Build dummy features, include quantitative features, or add text features
X = df[predictors]

y = df['label']

# Evaluate predictive performance for the given feature set
scores = cross_val_score(model, X, y, cv=10, scoring='roc_auc')

#print the average cross-validated Area Under the Curve for that model
print('Average AUC %f' % scores.mean())
