In [16]:
import pandas as pd
import json

data = pd.read_csv("https://github.com/ga-students/DAT-NYC-37/blob/master/lessons/lesson-13/assets/dataset/stumbleupon.tsv?raw=true", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

data.head(3)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...


## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

The columns are described in the table below.

#### What are 'evergreen' sites?

Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

*In the sample rows shown above and in the column descriptions below, `label = 1` marks 'evergreen' websites.*

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of `<embed>` usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer|Number of alphanumeric characters in the page's text
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

---

### Objective: Predict if a given site will be evergreen based on the above features

**Problem:** Some of the above features are text-only (`title`, `url`, `body`). How can we use the modeling techniques we've covered so far with text-based features?

**Solution:** Transform text features into many numerical features.
  - Count Vectorization
  - Term frequency/inverse document frequency (TF-IDF) Vectorization

---

## Demo: Understanding Count Vectorization - A Simple Example

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Count vectorization can be thought of as a simple word count: build a vocabulary from all documents, then count how many times each vocabulary word appears in each document.
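
To make the idea concrete, here is a minimal plain-Python sketch of what a count vectorizer does: lowercase and tokenize each document, drop stop words, build a sorted vocabulary, and count. It is not scikit-learn's implementation. The helper names (`tokenize`, `count_vectorize`) and the three-word stop list are assumptions chosen for illustration; scikit-learn's built-in English stop list is far longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "that", "in"}

def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def count_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

titles = [
    "IBM Sees Electronic Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races",
    "The Chicago Bulls won",
]

vocabulary, matrix = count_vectorize(titles)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one title and each column is one vocabulary word. "electronic" appears twice in the first title, so that cell holds 2.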

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

# Input data
titles = [
    "IBM Sees Electronic Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races",
    "The Chicago Bulls won"
]

# Define a CountVectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Apply the transform
count_vectorized_titles = count_vectorizer.fit_transform(titles)

# Get the results
print "Feature names: \n", count_vectorizer.get_feature_names()
print "Feature counts: \n", count_vectorized_titles.todense()
print

# Represent Count Vectorized results as a dataframe so we can preview it more easily.
df1 = pd.DataFrame(
    columns=count_vectorizer.get_feature_names(),
    index=['Article1', 'Article2', 'Article3'],
    data=count_vectorized_titles.todense()
)

print df1

Feature names: 
[u'advantages', u'air', u'batteries', u'breathing', u'bulls', u'calls', u'chicago', u'electronic', u'eliminates', u'fully', u'futuristic', u'gun', u'ibm', u'races', u'sees', u'starting', u'won']
Feature counts: 
[[0 1 1 1 0 1 0 2 0 0 0 0 1 0 1 0 0]
 [1 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 0]
 [0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1]]

          advantages  air  batteries  breathing  bulls  calls  chicago  \
Article1           0    1          1          1      0      1        0   
Article2           1    0          0          0      0      0        0   
Article3           0    0          0          0      1      0        1   

          electronic  eliminates  fully  futuristic  gun  ibm  races  sees  \
Article1           2           0      0           0    0    1      0     1   
Article2           1           1      1           1    1    0      1     0   
Article3           0           0      0           0    0    0      0     0   

          starting  won  
Article1         0    0  
Article2         1    0  
Article3         0    1  

In [9]:
# TODO: Apply count vectorization to all titles

## Demo: Term-frequency, Inverse document frequency (Tf-Idf)

An alternative bag-of-words approach to CountVectorizer is a Term Frequency - Inverse Document Frequency (TF-IDF) representation.

TF-IDF uses the product of two intermediate values, the **Term Frequency** and **Inverse Document Frequency**.

- **Term Frequency** is equivalent to CountVectorizer features, just the number of times a word appears in the document (i.e. count).

- **Document Frequency** is the percentage of documents that a particular word appears in. For example, “the” would be close to 100% while “Syria” is much lower.

- **Inverse Document Frequency** is, conceptually, just 1/Document Frequency, so words that appear in few documents receive larger weights.
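
As a worked example, the sketch below computes TF-IDF by hand for two short titles. It assumes the smoothed, logarithmic variant that scikit-learn's `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, rather than the raw reciprocal. The function name `tfidf` and the pre-tokenized input are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: number of documents that contain each word.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # term frequency: raw count within this document
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Titles already lowercased, tokenized, and stripped of stop words.
docs = [
    "ibm sees electronic calls air breathing batteries".split(),
    "fully electronic futuristic starting gun eliminates advantages races".split(),
]

vocabulary, rows = tfidf(docs)
for word, weight in zip(vocabulary, rows[0]):
    print("%-12s %.6f" % (word, weight))
```

"electronic" appears in both titles, so it gets the lowest non-zero weight in each row (about 0.279 in the first), while words unique to the first title get about 0.392.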


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "IBM Sees Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races"
]

# Define a TfidfVectorizer and apply the transform
tfidf_vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf=False)
tfidf_vectorized_titles = tfidf_vectorizer.fit_transform(titles)

print "Feature names: \n", tfidf_vectorizer.get_feature_names()


# Represent Tf-Idf vectorized results as a dataframe so we can preview it more easily.
pd.DataFrame(
    columns=tfidf_vectorizer.get_feature_names(),
    index=['Article1', 'Article2'],
    data=tfidf_vectorized_titles.todense()
)

Feature names: 
[u'advantages', u'air', u'batteries', u'breathing', u'calls', u'electronic', u'eliminates', u'fully', u'futuristic', u'gun', u'ibm', u'races', u'sees', u'starting']


Unnamed: 0,advantages,air,batteries,breathing,calls,electronic,eliminates,fully,futuristic,gun,ibm,races,sees,starting
Article1,0.0,0.392044,0.392044,0.392044,0.392044,0.278943,0.0,0.0,0.0,0.0,0.392044,0.0,0.392044,0.0
Article2,0.364996,0.0,0.0,0.0,0.0,0.259698,0.364996,0.364996,0.364996,0.364996,0.0,0.364996,0.0,0.364996


In [18]:
# TODO: Determine Tf-Idf of title 
# Find the words with 10 highest and lowest inverse document frequency

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf=False)

# Fit and transform the dataset's titles (not the two-item `titles` demo list above)
all_titles = data['title'].dropna()
tfidf_vectorized_titles = tfidf_vectorizer.fit_transform(all_titles)

feature_names         = tfidf_vectorizer.get_feature_names()  # Returns all feature names
inverse_document_freq = tfidf_vectorizer.idf_  # Returns the inverse document frequency of each feature

tfidf = pd.DataFrame({
    "feature_names": feature_names,
    "inverse_document_freq": inverse_document_freq
})

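# Ascending sort: head() holds the lowest IDF (most common words), tail() the highest IDF (rarest words)
sorted_tfidf = tfidf.sort_values(by='inverse_document_freq')

sorted_tfidf.tail(10)  # 10 highest IDF (not displayed: a notebook shows only the cell's last expression)
sorted_tfidf.head(10)  # 10 lowest IDF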

Unnamed: 0,feature_names,inverse_document_freq
7510,recipe,3.54232
2015,com,3.556185
7513,recipes,3.920619
3623,food,4.196644
1809,chocolate,4.34255
8616,sports,4.354111
9805,video,4.477725
6292,news,4.517999
1011,best,4.584061
3375,fashion,4.714114


### Demo: Use of Count Vectorizer/TfIdf with ngrams

We can use the `ngram_range` parameter to extract n-grams: sequences of n consecutive words.
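
As a rough illustration of what `ngram_range=(1, 2)` produces, the plain-Python sketch below builds 1-grams and 2-grams from a title. It is not scikit-learn's code; `word_ngrams` and the tiny stop list are hypothetical names introduced here. It removes stop words before forming n-grams, which is why a 2-gram can join two words that were separated by a stop word in the original text.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "that", "in"}

def word_ngrams(text, ngram_range=(1, 2)):
    """Return all n-grams of the text for every n in the inclusive range."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in STOP_WORDS]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the filtered token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

titles = [
    "IBM Sees Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races",
]

vocabulary = sorted({gram for title in titles for gram in word_ngrams(title)})
print(len(vocabulary))
print(vocabulary)
```

Note that "advantages races" is in the vocabulary even though "in" sits between the two words in the original title.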

In [19]:
# Note the inclusion of `ngram_range` and `max_features`
count_vectorizer = CountVectorizer(stop_words='english', max_features=1000, ngram_range=(1, 2))
count_vectorized_titles = count_vectorizer.fit_transform(titles)

print "Feature names: \n", count_vectorizer.get_feature_names()
print "Feature counts: \n", count_vectorized_titles.todense()
print

# Represent Count Vectorized results as a dataframe so we can preview it more easily.
pd.DataFrame(
    columns=count_vectorizer.get_feature_names(),
    index=['Article1', 'Article2'],
    data=count_vectorized_titles.todense()
)

Feature names: 
[u'advantages', u'advantages races', u'air', u'air breathing', u'batteries', u'breathing', u'breathing batteries', u'calls', u'calls air', u'electronic', u'electronic calls', u'electronic futuristic', u'eliminates', u'eliminates advantages', u'fully', u'fully electronic', u'futuristic', u'futuristic starting', u'gun', u'gun eliminates', u'ibm', u'ibm sees', u'races', u'sees', u'sees electronic', u'starting', u'starting gun']
Feature counts: 
[[0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0]
 [1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1]]



Unnamed: 0,advantages,advantages races,air,air breathing,batteries,breathing,breathing batteries,calls,calls air,electronic,...,futuristic starting,gun,gun eliminates,ibm,ibm sees,races,sees,sees electronic,starting,starting gun
Article1,0,0,1,1,1,1,1,1,1,1,...,0,0,0,1,1,0,1,1,0,0
Article2,1,1,0,0,0,0,0,0,0,1,...,1,1,1,0,0,1,0,0,1,1


In [7]:
# Q: Repeat the Tf-Idf vectorization for the `title` column to include both 1-grams and 2-grams

In [8]:
# Q: The `body` column contains the actual text of the article. Perform both TfIdf and Count Vectorization 
#  for this column (1-gram).

---

# Review Exercise

## Exercise Demo: Build a random forest model to predict evergreenness of a website using the title features

In [29]:
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

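# Build the model using cross_val_score (instead of train/test split).
# Note: `sklearn.cross_validation` was removed in scikit-learn 0.20; `sklearn.model_selection` replaces it.
from sklearn.model_selection import cross_val_score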

# 1. We need to fill NaN's with an empty string, otherwise the count vectorizer will fail.
titles = data['title'].fillna('')

# 2. Use `fit` to learn the vocabulary of the titles
count_vectorizer.fit(titles)

# 3. Use `transform` to generate the sample X word matrix - one column per feature (word or n-grams)
# Hint: Steps 2 & 3 can be combined by using `count_vectorizer.fit_transform(titles)`
X = count_vectorizer.transform(titles).toarray()
y = data['label']

# 4. Define our RandomForestClassifier model. It will fit 30 decision trees, each on a random subsample of the dataset.
rf_model = RandomForestClassifier(n_estimators=30)
    
# 5. Split, train & evaluate the model in one fell swoop.
# K-fold (5) cross validation, followed by fitting a model, followed by scoring with AUC
scores = cross_val_score(rf_model, X, y, cv=5, scoring='roc_auc')
print "Cross-validated AUC scores: %s (avg. AUC %0.3f, stdev: %0.3f)" % (scores, scores.mean(), scores.std())

Cross-validated AUC scores: [ 0.80783352  0.79203173  0.80865722  0.80841842  0.81503735] (avg. AUC 0.806, stdev: 0.008)
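
The `roc_auc` scores above can be read as the probability that a randomly chosen evergreen page receives a higher predicted score than a randomly chosen non-evergreen page. The sketch below computes that quantity directly by comparing every positive/negative pair. It is a quadratic-time illustration, not how scikit-learn computes the metric, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]           # 1 = evergreen, 0 = non-evergreen
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities of "evergreen"
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the averages reported in this notebook sit well above chance.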


### Exercise: Build a random forest model to predict evergreenness of a website using the title features and quantitative features

In [22]:
# To make our lives easier, let's define a simple utility function to convert an array or series of
# text documents into a count vectorized dataframe:
def vectorize_text(documents, vectorizer_algorithm=CountVectorizer, max_features=1000, ngram_range=(1, 2)):
    vectorizer = vectorizer_algorithm(stop_words='english', max_features=max_features, ngram_range=ngram_range)
    vectorized_results = vectorizer.fit_transform(documents)
    
    return pd.DataFrame(
        columns=vectorizer.get_feature_names(),
        data=vectorized_results.todense()
    )

In [26]:
# Example Usage
title_documents = data['title'].fillna('')

print "Preview of documents to be vectorized (input): "
print title_documents.head()

count_vectorized_titles = vectorize_text(title_documents)

print "\nVectorized Output sample: "
count_vectorized_titles.head()

count_vectorized_titles

Preview of documents to be vectorized (input): 
0    IBM Sees Holographic Calls Air Breathing Batte...
1    The Fully Electronic Futuristic Starting Gun T...
2    Fruits that Fight the Flu fruits that fight th...
3                  10 Foolproof Tips for Better Sleep 
4    The 50 Coolest Jerseys You Didn t Know Existed...
Name: title, dtype: object

Vectorized Output sample: 


Unnamed: 0,000,10,10 best,10 things,10 ways,100,101,11,12,13,...,year old,years,yes,yoga,yogurt,york,york best,york village,yummy,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
## TODO: We want to repeat the above, but with these features as well:

# Step 1: Prepare the input data by selecting the relevant quantitative columns
quantitative_features = [
    'numberOfLinks',
    'linkwordscore',
    'embed_ratio',
    'image_ratio',
    'html_ratio'
]

# Horizontally concatenate the count_vectorized_titles features and the quantitative features into a single DF
# (both frames use the default integer index, so rows line up)
X = pd.concat([count_vectorized_titles, data[quantitative_features]], axis=1)

X.head()

Unnamed: 0,000,10,10 best,10 things,10 ways,100,101,11,12,13,...,york,york best,york village,yummy,zucchini,numberOfLinks,linkwordscore,embed_ratio,image_ratio,html_ratio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,170,24,0.0,0.003883,0.245831
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,187,40,0.0,0.088652,0.20349
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,258,55,0.0,0.120536,0.226402
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,120,24,0.0,0.035343,0.265656
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,162,14,0.0,0.050473,0.228887


In [32]:
# Repeat the process in previous exercise, only with our new dataframe
# WARNING: It may take several minutes to run this cell!

rf_model = RandomForestClassifier(n_estimators=15)

scores = cross_val_score(rf_model, X.values, y, scoring='roc_auc')

print "Count Vectorized AUC scores (`title` + additional features) %s (avg. AUC %0.3f)" % (scores, scores.mean())

Count Vectorized AUC scores (`title` + additional features) [ 0.78596334  0.80964822  0.79658543] (avg. AUC 0.797)


<span style="color: #F00;">**Note: ** The scores are lower here. *Why?* We're __overfitting__! </span>

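**Key takeaway:** Decision trees are susceptible to overfitting. Random forests mitigate this drawback by training many decision trees (`n_estimators`), each on a random subset of our data, and averaging their predictions. However, if the number of trees in our forest is too small, the averaged prediction is still noisy and we'll still overfit. Note that this run used only 15 trees (versus 30 for the title-only model) and 3 folds instead of 5, so the two scores are not directly comparable.

**Solution:** Increasing `n_estimators` (the number of decision trees our Random Forest will generate) will generally improve the performance of our model, with diminishing returns, but will take a long time to train (possibly many hours).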
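
A toy simulation can show why more trees help. The sketch below treats each "tree" as a noisy guess around a true value and measures how much the averaged guess varies for forests of different sizes. It is a simplification under stated assumptions: the true value 0.7 and noise level 0.2 are invented, and real trees in a forest are correlated with each other, so the actual benefit is smaller than this independent-noise model suggests.

```python
import random
import statistics

def forest_prediction(n_trees, rng, true_value=0.7, noise=0.2):
    """Average n_trees noisy 'tree' predictions scattered around true_value."""
    trees = [rng.gauss(true_value, noise) for _ in range(n_trees)]
    return sum(trees) / n_trees

rng = random.Random(0)  # fixed seed so the run is repeatable

# Spread of the averaged prediction over many simulated forests of each size.
spread = {}
for n_trees in (1, 15, 200):
    predictions = [forest_prediction(n_trees, rng) for _ in range(500)]
    spread[n_trees] = statistics.pstdev(predictions)
    print("%3d trees: std of averaged prediction = %.4f" % (n_trees, spread[n_trees]))
```

The spread shrinks as the number of trees grows, which is the variance reduction that `n_estimators` buys at the cost of training time.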

In [33]:
# Repeat the process in previous exercise, only with our new dataframe
# WARNING: It may take several minutes to run this cell!

rf_model = RandomForestClassifier(n_estimators=200)

scores = cross_val_score(rf_model, X.values, y, scoring='roc_auc')

print "Count Vectorized CV AUC scores: %s (avg AUC %0.3f)" % (scores, scores.mean())

KeyboardInterrupt: 

### Exercise: Build a random forest model to predict evergreenness of a website using only the features extracted from the `body` column

In [34]:
## TODO

body_documents = data['body'].fillna('')
X = vectorize_text(body_documents)

# Same as before, but with a different input
rf_model = RandomForestClassifier(n_estimators=10)

scores = cross_val_score(rf_model, X.values, y, scoring='roc_auc')

print "Count Vectorized CV AUC scores: %s (avg AUC %0.3f)" % (scores, scores.mean())

Count Vectorized CV AUC scores: [ 0.82210802  0.84087714  0.83048093] (avg AUC 0.831)


### You do: Repeat above exercises using `TfidfVectorizer` instead of `CountVectorizer` - is there an improvement?

In [36]:
## TODO

body_documents = data['body'].fillna('')
X = vectorize_text(body_documents, vectorizer_algorithm=TfidfVectorizer)

# Same as before, but with a different input
rf_model = RandomForestClassifier(n_estimators=30)

scores = cross_val_score(rf_model, X.values, y, scoring='roc_auc')

print "Tf-Idf Vectorized CV AUC scores: %s (avg AUC %0.3f)" % (scores, scores.mean())

Tf-Idf Vectorized CV AUC scores: [ 0.84328726  0.85724506  0.84801366] (avg AUC 0.850)


### Vectorization Review

**Core concept:**
Transforming text features using **Count Vectorization** & **TfIdf Vectorization**

**Why do we care?**
- Features aren't always numerical, boolean, or categorical. They can also be text!

**How do we handle such text features?**

Transform a text feature into a new dataframe containing many numerical features. The resulting transformed dataframe has a single numerical feature *per unique word from across all items in the original text*. 

Two major forms of text vectorization:
  - **Count vectorization:** Simple count of word occurrences for that document.
  - **Term Frequency - Inverse Document Frequency (Tf-Idf):** Similar to Count Vectorization, but each value is divided by the % of documents containing that word. Frequent words that only occur within a small set of documents will have a large TfIdf value. Words that are common to all documents will have a lower TfIdf value. TfIdf is good at finding "topical" keywords such as "recipe".

*Once vectorization is applied, we can apply modeling techniques as long as they scale well to datasets with large numbers of features.*
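
Scaling matters because a vectorized text matrix is mostly zeros: each document uses only a handful of the vocabulary's words. scikit-learn's vectorizers return a sparse matrix for this reason, and calling `.todense()` or `.toarray()` as we did above gives up that saving. The plain-Python sketch below stores only the non-zero counts to show the difference on three short titles; the variable names and the dict-per-row layout are assumptions for illustration, not how SciPy stores sparse matrices.

```python
import re
from collections import Counter

# Titles with stop words already removed by hand to keep the sketch short.
titles = [
    "IBM Sees Electronic Electronic Calls Air Breathing Batteries",
    "Fully Electronic Futuristic Starting Gun Eliminates Advantages Races",
    "Chicago Bulls won",
]

tokenized = [re.findall(r"\b\w\w+\b", t.lower()) for t in titles]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
column = {word: j for j, word in enumerate(vocabulary)}

# Sparse form: one {column_index: count} dict per document, zeros omitted.
sparse_rows = [
    {column[word]: count for word, count in Counter(doc).items()}
    for doc in tokenized
]

dense_cells = len(titles) * len(vocabulary)
stored_cells = sum(len(row) for row in sparse_rows)
print("dense cells: ", dense_cells)
print("stored cells:", stored_cells)
```

Even on this tiny example only 18 of 51 cells need storing; with thousands of documents and features the gap grows much wider.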

### Questions:

- What are some drawbacks of these text vectorization techniques?
- Why would you use TfIdf Vectorizer instead of CountVectorizer? Why not?
- Why did we use Random Forests in our above examples?

### Additional Resources:

- http://www.tfidf.com/
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html