In [3]:
import pandas as pd
import json

# Read in the dataset
df = pd.read_csv("../../data/stumbleupon.tsv", sep='\t')

# Parse out the title and body of the article
df['title'] = df['boilerplate'].map(lambda x: json.loads(x).get('title', ''))
df['body'] = df['boilerplate'].map(lambda x: json.loads(x).get('body', ''))

# Show a preview of the data
df.head(1)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...


In [4]:
df.size

214455

# Predicting the "Evergreenness" of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender. The columns are described below:

FieldName|Type|Description
---------|----|-----------
url|string|URL of the webpage to be classified
title|string|Title of the article
body|string|Body text of the article
urlid|integer|StumbleUpon's unique identifier for each URL
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize|double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of `<embed>` usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink text
news_front_page|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer|Number of alphanumeric characters in the page's text
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in the URL
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are 'evergreen' sites?

> Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

> A sample of URLs is below, where label = 1 are 'evergreen' websites

In [9]:
df[df['label'] == 1]['url'].sample(3)

6802          http://bunsinmyoven.com/2009/12/16/buttons/
2820    http://www.foodnetwork.com/recipes/paula-deen/...
5739    http://zenhabits.net/2009/02/how-to-declutter-...
Name: url, dtype: object

# Explore the Dataset

> ### Exercise \#1: In pairs, brainstorm 3 - 5 features you could develop that would be useful for predicting evergreen websites.
> ###  Exercise \#2: After looking at the dataset, can you model or quantify any of the characteristics you wanted?
- Ex: If you believe high-image content websites are likely to be evergreen, how can you build a feature that represents that?
- Ex: If you believe weather content is likely NOT to be evergreen, how might you build a feature that represents that?
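
A keyword-based feature like the ones suggested above could be sketched as follows. This is a minimal illustration, not a prescribed solution: `keyword_flag` is a hypothetical helper, and `toy_body` is made-up text standing in for `df['body']`.

```python
import pandas as pd


def keyword_flag(text: pd.Series, keyword: str) -> pd.Series:
    """Return 1 where `text` contains `keyword` (case-insensitive), else 0."""
    # regex=False treats the keyword as a literal string;
    # na=False counts missing text as "does not contain".
    return text.str.contains(keyword, case=False, regex=False, na=False).astype(int)


# Made-up examples standing in for df['body']
toy_body = pd.Series(["Weather forecast for Monday", "A classic bread recipe", None])
weather_flag = keyword_flag(toy_body, "weather")
print(weather_flag.tolist())
```

On the real data, the same helper could be applied as `df['weather'] = keyword_flag(df['body'], 'weather')`, where the column name `weather` is only an example.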

### Split up and develop 1-3 of those features independently.

In [18]:
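# Caveat: on object-dtype columns str.contains yields NaN for missing bodies, and NaN is truthy, so those rows are mapped to 1; passing na=False to str.contains would map them to 0
df['photography'] = df['body'].str.lower().str.contains('photography').map(lambda x: 1 if x else 0)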

In [44]:
df['photography'].value_counts()

0    7163
1     232
Name: photography, dtype: int64

In [43]:
df['body'].str.lower().str.contains('travel').map(lambda x: 1 if x else 0).value_counts()

0    7104
1     291
Name: body, dtype: int64

In [45]:
df['travel'] = df['body'].str.lower().str.contains('travel').map(lambda x: 1 if x else 0)

### Exercise \#3: Does being a news site affect evergreenness?
Compute or plot the percentage of news-related sites that are evergreen.

In [29]:
df.groupby('is_news')['label'].mean()

is_news
1    0.516916
?    0.507562
Name: label, dtype: float64

### Exercise \#4: Does category in general affect evergreenness?
Plot the rate of evergreen sites for all Alchemy categories.

In [32]:
df.groupby('alchemy_category')['label'].mean()

alchemy_category
?                     0.502135
arts_entertainment    0.371945
business              0.711364
computer_internet     0.246622
culture_politics      0.457726
gaming                0.368421
health                0.573123
law_crime             0.419355
recreation            0.684296
religion              0.416667
science_technology    0.456747
sports                0.205263
unknown               0.333333
weather               0.000000
Name: label, dtype: float64

###### Differences in evergreen rate between categories could be validated by comparing confidence intervals
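
As a rough sketch of that idea, the per-category rate can be paired with a normal-approximation 95% interval, p ± 1.96·sqrt(p(1 − p)/n). The helper `rate_with_ci` and the `toy` frame are hypothetical; on the real data the call would be `rate_with_ci(df, 'alchemy_category')`.

```python
import numpy as np
import pandas as pd


def rate_with_ci(frame, group_col, label_col="label", z=1.96):
    """Evergreen rate per group with a normal-approximation interval."""
    stats = frame.groupby(group_col)[label_col].agg(rate="mean", n="count")
    half_width = z * np.sqrt(stats["rate"] * (1 - stats["rate"]) / stats["n"])
    # Clip to [0, 1] because a proportion cannot leave that range
    stats["ci_low"] = (stats["rate"] - half_width).clip(lower=0)
    stats["ci_high"] = (stats["rate"] + half_width).clip(upper=1)
    return stats


# Made-up labels, only to show the shape of the result
toy = pd.DataFrame({
    "alchemy_category": ["business"] * 4 + ["sports"] * 4,
    "label": [1, 1, 1, 0, 0, 0, 0, 1],
})
ci_table = rate_with_ci(toy, "alchemy_category")
print(ci_table)
```

The normal approximation is unreliable for very small groups and collapses to a zero-width interval when the observed rate is exactly 0 or 1, as it is for `weather` with its 4 pages. A Wilson interval would be a more robust choice there.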

### Exercise \#5: How many articles are there per category?

In [34]:
df['alchemy_category'].value_counts()

?                     2342
recreation            1229
arts_entertainment     941
business               880
health                 506
sports                 380
culture_politics       343
computer_internet      296
science_technology     289
gaming                  76
religion                72
law_crime               31
unknown                  6
weather                  4
Name: alchemy_category, dtype: int64

### Exercise \#6: Create a feature for the title containing the word 'recipe'.
Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?

In [33]:
# The exercise asks about the title, so search df['title']; na=False maps missing titles to 0
df['has_recipe'] = df['title'].str.lower().str.contains('recipe', na=False).astype(float)
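
To answer the question in this exercise, the evergreen rate can be compared between pages with and without the flag. The sketch below uses a made-up `toy` frame, so its numbers say nothing about the real dataset; on the real data the last step would be `df.groupby('has_recipe')['label'].mean()`.

```python
import pandas as pd

# Made-up titles and labels standing in for df['title'] and df['label']
toy = pd.DataFrame({
    "title": ["Easy Pancake Recipe", "Election Results Tonight", "Best Soup Recipes", None],
    "label": [1, 0, 1, 0],
})

# 1 if the title mentions 'recipe' (case-insensitive), 0 otherwise or if missing
toy["has_recipe"] = toy["title"].str.contains("recipe", case=False, na=False).astype(int)

# Mean of a 0/1 label is the share of evergreen pages in each group
recipe_rates = toy.groupby("has_recipe")["label"].mean()
print(recipe_rates)
```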

## Demo: Build a decision tree model to predict the "evergreenness" of a given website.

In [39]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

cls = DecisionTreeClassifier()

features = ['image_ratio', 'html_ratio', 'lengthyLinkDomain']
target = 'label'

X = df[features]
y = df[target]
      
# Fit the model on the full dataset
cls.fit(X, y)

# Estimate out-of-sample performance: mean ROC AUC over 5-fold cross-validation
cross_val_score(cls, X, y, scoring='roc_auc', cv=5).mean()

0.52290293873095028

## Decision Trees in Scikit-Learn

### Exercise: Evaluate the decision tree using cross-validation; use AUC as the evaluation metric. Add your custom features in to see if there is an improvement relative to the previous model above.

In [38]:
from sklearn.model_selection import cross_val_score

# Flag bodies that contain 'recipe'; na=False maps missing bodies to 0
df['has_recipe'] = df['body'].str.lower().str.contains('recipe', na=False).astype(float)

new_x = df[features + ['has_recipe']]
cross_val_score(cls, new_x, y, scoring= 'roc_auc', cv=5).mean()

0.62850179664938355

In [47]:
new_x2 = df[features + ['photography']]
cross_val_score(cls, new_x2, y, scoring= 'roc_auc', cv=5).mean()

0.56174384225690011

In [48]:
new_x3 = df[features + ['travel']]
cross_val_score(cls, new_x3, y, scoring= 'roc_auc', cv=5).mean()

0.55898801204906601

##  Adjusting Decision Trees to Avoid Overfitting

### Exercise: Explore the hyperparameters of the decision tree model by adjusting the maximum number of questions (max_depth) or the minimum number of records in each final node (min_samples_leaf). You can do this manually or with GridSearchCV [(documentation)](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [40]:
cls = DecisionTreeClassifier(max_depth=2,
                             min_samples_leaf=5)
cls.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
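
The GridSearchCV route might look like the sketch below. It runs on synthetic data from `make_classification` so that it is self-contained; `X_demo`, `y_demo`, and the grid values are assumptions chosen for illustration, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Candidate values for the two hyperparameters named in the exercise
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

# Every combination is scored by 5-fold cross-validated ROC AUC
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With the notebook's data, `search.fit(X, y)` would replace the synthetic call, and `search.best_estimator_` holds the tree refit with the best combination.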

## Demo: Build a random forest model to predict the evergreenness of a website.

In [54]:
from sklearn.ensemble import RandomForestClassifier

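cls = RandomForestClassifier(n_estimators=10)
# Fit on the three base features in X
cls.fit(X, y)
# cross_val_score fits fresh clones of cls, here on the base features plus has_recipe
cross_val_score(cls, new_x, y, scoring='roc_auc', cv=5).mean()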

0.72384728808417642

## Demo: Extracting importance of features

In [55]:
features = X.columns
feature_importances = cls.feature_importances_

features_df = pd.DataFrame({'features': features, 'importance': feature_importances})
features_df.sort_values('importance', inplace=True, ascending=False)

features_df.head()

            features  importance
1         html_ratio    0.538828
0        image_ratio    0.457448
2  lengthyLinkDomain    0.003724


## Exercise: Evaluate the Random Forest model using cross-validation; increase the number of estimators and view how that improves predictive performance.
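
One possible way to approach this exercise is to loop over a few forest sizes and record the cross-validated AUC for each. The sketch uses synthetic data (`X_demo`, `y_demo`) and an arbitrary list of sizes, both of which are assumptions; with the notebook's data, `new_x` and `y` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

auc_by_size = {}
for n_trees in [5, 20, 50]:
    # random_state fixes the bootstrap samples so runs are repeatable
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X_demo, y_demo, scoring="roc_auc", cv=5)
    auc_by_size[n_trees] = scores.mean()
    print(n_trees, round(auc_by_size[n_trees], 3))
```

Gains from extra trees usually flatten out after some point, so it is worth checking where additional estimators stop paying for their training time.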

## Independent Practice: Evaluate Random Forest Using GridSearch

1. Continue adding input variables to the model that you think may be relevant
2. For each feature:
  - Evaluate the model for improved predictive performance using cross-validation
  - Evaluate the _importance_ of the feature
3. **Bonus**: Just like the 'recipe' feature, add in similar text features and evaluate their performance.
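
The per-feature loop in step 2 could be sketched as below. Everything here is illustrative: the data is synthetic, and the column names `base_a`, `base_b`, `cand_a`, and `cand_b` are hypothetical placeholders for the existing inputs and the candidate features under evaluation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with placeholder column names
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo = pd.DataFrame(X_arr, columns=["base_a", "base_b", "cand_a", "cand_b"])
base_features = ["base_a", "base_b"]

rows = []
for candidate in ["cand_a", "cand_b"]:
    cols = base_features + [candidate]
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    # Cross-validated AUC with the candidate added to the base features
    auc = cross_val_score(forest, demo[cols], y_demo, scoring="roc_auc", cv=5).mean()
    # Refit on all rows to read the candidate's importance
    forest.fit(demo[cols], y_demo)
    importance = dict(zip(cols, forest.feature_importances_))[candidate]
    rows.append({"feature": candidate, "cv_auc": auc, "importance": importance})

report = pd.DataFrame(rows).sort_values("cv_auc", ascending=False)
print(report)
```

Comparing each `cv_auc` against the score of the base features alone shows whether a candidate actually helps, since a feature can rank high in importance without improving cross-validated performance.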
