# spaCy Demo

If you haven't installed spaCy yet, use:
```
conda install spacy
python -m spacy.en.download
```
This downloads about 500 MB of data.

Another popular package, `nltk`, can be installed as follows (you can skip this for now):

```
conda install nltk
python -m nltk.downloader all
```

This also downloads a lot of data.

## Load StumbleUpon dataset

In [1]:
# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
import json

df = pd.read_csv('../../data/stumbleupon.tsv', sep='\t', encoding='utf-8')

df['title'] = df['boilerplate'].map(lambda x: json.loads(x).get('title', ''))
df['body'] = df['boilerplate'].map(lambda x: json.loads(x).get('body', ''))

df.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
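
One detail worth checking in the cell above: `dict.get('title', '')` only falls back to `''` when the key is absent, so a record whose title is an explicit JSON `null` would still yield `None` (which may be why later cells call `.fillna(u'')`). The sketch below uses made-up records shaped like the `boilerplate` column and a hypothetical helper named `extract_field`; it shows one way to normalise those cases to empty strings.

```python
import json


def extract_field(raw, field):
    """Return `field` from a JSON string, or '' if it is missing, null, or unparsable."""
    try:
        record = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw is not a string (e.g. NaN); ValueError: malformed JSON
        return ''
    if not isinstance(record, dict):
        return ''
    # `or ''` also covers the case where the key exists but holds null
    return record.get(field) or ''


# Invented sample records, not rows from the real dataset
samples = [
    '{"title": "10 Foolproof Tips for Better Sleep", "body": "There was a period"}',
    '{"title": null, "body": "no title here"}',
    '{"body": "key missing entirely"}',
]

titles = [extract_field(s, 'title') for s in samples]
print(titles)

# Plain .get() with a default still returns None for the null title
null_title = json.loads(samples[1]).get('title', '')
print(null_title is None)
```

With a helper like this, the column could be built with `df['boilerplate'].map(lambda x: extract_field(x, 'title'))`, assuming that empty strings are an acceptable stand-in for missing titles.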


In [2]:
# Load spacy

import spacy
nlp = spacy.load('en')

In [3]:
# Try replacing this sentence with another sentence to see how it's parsed
title = u"IBM Sees Holographic Calls, Air Breathing Battery"
parsed = nlp(title)

for word in parsed:
    print('Word: {}'.format(word))
    print('\t Phrase type: {}'.format(word.dep_))
    print('\t Is the word a known entity type? {}'.format(word.ent_type_ if word.ent_type_ else "No"))
    print('\t Lemma: {}'.format(word.lemma_))
    print('\t Parent of this word: {}'.format(word.head.lemma_))

Word: IBM
	 Phrase type: nsubj
	 Is the word a known entity type? ORG
	 Lemma: ibm
	 Parent of this word: see
Word: Sees
	 Phrase type: compound
	 Is the word a known entity type? No
	 Lemma: see
	 Parent of this word: calls
Word: Holographic
	 Phrase type: compound
	 Is the word a known entity type? No
	 Lemma: holographic
	 Parent of this word: calls
Word: Calls
	 Phrase type: ROOT
	 Is the word a known entity type? No
	 Lemma: calls
	 Parent of this word: calls
Word: ,
	 Phrase type: punct
	 Is the word a known entity type? No
	 Lemma: ,
	 Parent of this word: calls
Word: Air
	 Phrase type: compound
	 Is the word a known entity type? ORG
	 Lemma: air
	 Parent of this word: battery
Word: Breathing
	 Phrase type: compound
	 Is the word a known entity type? ORG
	 Lemma: breathing
	 Parent of this word: battery
Word: Battery
	 Phrase type: appos
	 Is the word a known entity type? ORG
	 Lemma: battery
	 Parent of this word: calls
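
The "Phrase type" and "Parent" lines above together describe a dependency tree: each token points at a head token, and the root is the token that is its own parent (here `Calls`). As a rough illustration, the sketch below rebuilds that tree in plain Python from the printed output. The head indices are inferred by matching each printed parent lemma to a token, and the helper names are hypothetical; none of this calls spaCy.

```python
# (text, dependency label, index of head token), transcribed from the output above.
# Head indices are an inference: 'see' -> Sees (1), 'calls' -> Calls (3), 'battery' -> Battery (7).
tokens = [
    ("IBM", "nsubj", 1),
    ("Sees", "compound", 3),
    ("Holographic", "compound", 3),
    ("Calls", "ROOT", 3),
    (",", "punct", 3),
    ("Air", "compound", 7),
    ("Breathing", "compound", 7),
    ("Battery", "appos", 3),
]


def find_root(tokens):
    """Return the index of the token that is its own head."""
    for i, (_text, _dep, head) in enumerate(tokens):
        if head == i:
            return i
    return None


def children_of(tokens):
    """Map each token index to the indices of its direct dependents."""
    children = {i: [] for i in range(len(tokens))}
    for i, (_text, _dep, head) in enumerate(tokens):
        if head != i:
            children[head].append(i)
    return children


def path_to_root(tokens, i):
    """Follow head pointers from token i up to the root, collecting the words."""
    path = [tokens[i][0]]
    while tokens[i][2] != i:
        i = tokens[i][2]
        path.append(tokens[i][0])
    return path


root = find_root(tokens)
children = children_of(tokens)
print(tokens[root][0])
print([tokens[c][0] for c in children[root]])
print(path_to_root(tokens, 5))
```

Read this way, the parse treats `Calls` as the root and `Sees` as part of a compound, which is probably not the reading a person would give this headline; short title-cased text can be hard for a parser.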


## Investigate Page Titles

Let's see if we can find organizations in our page titles.

In [4]:
def references_organization(title):
    parsed = nlp(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

df['references_organization'] = df['title'].fillna(u'').map(references_organization)

# Take a look
df[df['references_organization']][['title']].head()

Unnamed: 0,title
0,IBM Sees Holographic Calls Air Breathing Batte...
1,The Fully Electronic Futuristic Starting Gun T...
3,10 Foolproof Tips for Better Sleep
6,fashion lane American Wild Child
10,Business Financial News Breaking US Internatio...


In [5]:
df[df['references_organization']]['title'].iloc[1]

u'The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races'

>## Exercise:

>Write a function to identify titles that mention an organization (ORG) or a person (PERSON).

In [6]:
# Add the code to this function
def references_org_person(title):
    parsed = nlp(title)
    is_org = any([word.ent_type_ == 'ORG' for word in parsed])
    is_person = any([word.ent_type_ == 'PERSON' for word in parsed])
    return is_org or is_person

df['references_org_person'] = df['title'].fillna(u'').map(references_org_person)

# Take a look
df[df['references_org_person']][['title']].head()

Unnamed: 0,title
0,IBM Sees Holographic Calls Air Breathing Batte...
1,The Fully Electronic Futuristic Starting Gun T...
3,10 Foolproof Tips for Better Sleep
4,The 50 Coolest Jerseys You Didn t Know Existed...
6,fashion lane American Wild Child


## Predicting "Greenness" Of Content

This dataset comes from [stumbleupon](https://www.stumbleupon.com/), a web page recommender.

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|URL of the webpage to be classified
title|string|Title of the article
body|string|Body text of the article
urlid|integer|StumbleUpon's unique identifier for each URL
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize|double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of `<embed>` usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink text
news_front_page|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer|Number of alphanumeric characters in the page's text
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in the URL
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

> ### Let's try extracting some of the text content.
> ### Create a feature for whether the title contains 'recipe'. Is the percentage of evergreen websites higher or lower for pages that have 'recipe' in the title?

In [7]:
df['has_recipe'] = df['title'].str.lower().str.contains('recipe')
df.groupby('has_recipe')['label'].mean()

has_recipe
False    0.455730
True     0.912206
Name: label, dtype: float64

### Demo: Use of the Count Vectorizer

In [8]:
titles = df['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_features=1000, 
    ngram_range=(1, 2), 
    stop_words='english',
    binary=True,
)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix: one column per feature (word or n-gram)
X = vectorizer.transform(titles)
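
To make the settings above more concrete, here is a minimal pure-Python sketch of what a binary bag of unigrams and bigrams looks like. It is an approximation under stated assumptions, not scikit-learn's implementation: the stop-word set is a tiny invented list (the built-in `'english'` list is much longer), `max_features` is ignored, and the two titles are made up. It does follow two behaviours of `CountVectorizer`: the default token pattern keeps only tokens of two or more word characters, and stop words are dropped before n-grams are formed.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters
STOP_WORDS = {"the", "for", "a", "of", "and", "to", "your"}  # illustrative only


def ngrams(text, ngram_range=(1, 2)):
    """Lowercase, tokenize, drop stop words, then emit all n-grams in the range."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def fit_vocabulary(texts):
    """Assign a column index to every distinct n-gram, in sorted order."""
    terms = sorted({g for text in texts for g in ngrams(text)})
    return {term: idx for idx, term in enumerate(terms)}


def transform_binary(texts, vocabulary):
    """Build one row per text with 1 where an n-gram is present (binary=True)."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for g in ngrams(text):
            if g in vocabulary:
                row[vocabulary[g]] = 1
        rows.append(row)
    return rows


example_titles = ["Chocolate Cake Recipe", "Easy Chocolate Recipes for the Holidays"]
vocabulary = fit_vocabulary(example_titles)
rows = transform_binary(example_titles, vocabulary)

print(sorted(vocabulary))
print(rows)
```

Note that `recipe` and `recipes` end up in separate columns, which matches the real output further down where both appear as distinct tokens.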

In [9]:
# Let's find the most frequent words
top_words = pd.DataFrame(X.toarray()).sum().sort_values(ascending=False).head()

# Look the feature names up once rather than on every iteration
feature_names = vectorizer.get_feature_names()
for word in top_words.index:
    print('Feature: {}, Token: {}'.format(word, feature_names[word]))

Feature: 715, Token: recipe
Feature: 217, Token: com
Feature: 721, Token: recipes
Feature: 363, Token: food
Feature: 192, Token: chocolate


### Demo: Build a random forest model to predict evergreenness of a website using the title features

In [10]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=20)
    
# Use `transform` to generate the sample-by-word matrix: one column per feature (word or n-gram)
# Note: we already fit the vectorizer so we don't need to repeat that step
X = vectorizer.transform(titles).toarray()
y = df['label']

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

CV AUC [ 0.7888958   0.80497036  0.80298833], Average AUC 0.798951494351
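
The three values printed above are one AUC per cross-validation fold. The `roc_auc` score can be read as the probability that a randomly chosen evergreen page receives a higher predicted score than a randomly chosen non-evergreen page, with ties counting as half. A minimal pure-Python sketch of that definition follows; the labels and scores are invented for illustration, and `pairwise_auc` is a hypothetical helper rather than anything from scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented labels (1 = evergreen) and predicted probabilities
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1]

auc = pairwise_auc(labels, scores)
print(auc)
```

On this scale 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so an average near 0.80 means the title features alone separate the two classes fairly well.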


In [11]:
# Let's train a final model and apply it to a new article title
model.fit(X, y)

title = '4 Sweet Gifts for Your Valentine'
title_as_features = vectorizer.transform([title])
model.predict_proba(title_as_features)[:, 1]

array([ 0.75416667])

### Exercise: Build a random forest model to predict evergreenness of a website using the title features and quantitative features

In [12]:
features = [
    'image_ratio', 
    'html_ratio',  
    'avglinksize'
]

# Concatenate the quantitative features with the title features
X = pd.concat([df[features], pd.DataFrame(vectorizer.transform(df['title'].fillna('')).toarray())], axis=1)
y = df['label']

model = RandomForestClassifier(n_estimators=100)
cross_val_score(model, X, y, scoring='roc_auc', cv=5).mean()

0.81502877777265004

### Exercise: Build a random forest model to predict evergreenness of a website using the body features

In [13]:
# Try this yourself

### Exercise: Use `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
    max_features=1000, 
    stop_words='english',
)

In [15]:
titles = df['title'].fillna('')
tfidf.fit(titles)

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=u'english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [16]:
X = tfidf.transform(titles)

# Let's find the highest weighted words
top_words = pd.DataFrame(X.toarray()).sum().sort_values(ascending=False).head()

# Look the feature names up once rather than on every iteration
feature_names = tfidf.get_feature_names()
for word in top_words.index:
    print('Feature: {}, Token: {}'.format(word, feature_names[word]))

Feature: 198, Token: com
Feature: 721, Token: recipe
Feature: 722, Token: recipes
Feature: 356, Token: food
Feature: 180, Token: chocolate
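
With the settings printed earlier (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), scikit-learn documents the weight of a term as its count in the document multiplied by `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term; each row is then scaled to unit L2 length. The sketch below works through that arithmetic in plain Python on three invented, pre-tokenized titles. `tfidf_rows` is a hypothetical helper, and tokenization and stop-word removal are assumed to have happened already.

```python
import math


def tfidf_rows(docs):
    """Return (idf, rows) where rows[i] maps each term of docs[i] to its L2-normalised weight."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Smoothed inverse document frequency
    idf = {t: math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for t, d in doc_freq.items()}

    rows = []
    for doc in docs:
        weights = {}
        for term in doc:
            # Adding idf once per occurrence is the same as raw count times idf
            weights[term] = weights.get(term, 0.0) + idf[term]
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows


# Invented, already-tokenized titles
docs = [
    ["chocolate", "cake", "recipe"],
    ["chocolate", "recipe"],
    ["travel", "tips"],
]

idf, rows = tfidf_rows(docs)
print(idf["cake"], idf["chocolate"])
print(rows[0])
```

Within the first title, `cake` gets a larger weight than `chocolate` because it appears in fewer documents. The ranking printed above is a different quantity: it sums weights over all titles, so a very frequent token such as `com` can still come out on top even though its per-title weight is discounted.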


In [17]:
model = RandomForestClassifier(n_estimators=20)

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

CV AUC [ 0.80618878  0.8184249   0.81189199], Average AUC 0.812168558812


In [18]:
model.fit(X, y)

# Try the model out on a new title
title = '4 Sweet Gifts for Your Valentine'
title_as_features = tfidf.transform([title])
model.predict_proba(title_as_features)[:, 1]

array([ 0.9])