# DS_HK_14 | Class 12 | NLP with Classification

# Spacy Demo

If you haven't installed spacy yet, use:
```
pip install -U spacy
python -m spacy download en
```

Windows users might have to install **Microsoft Visual C++ Compiler for Python 2.7** first:

https://www.microsoft.com/en-us/download/confirmation.aspx?id=44266

The spaCy model download (`python -m spacy download en`) fetches about 500 MB of data.

Another popular package, `nltk`, can be installed as follows (you can skip this for now):

```
pip install nltk
python -m nltk.downloader all
```

This also downloads a lot of data.

## Load StumbleUpon dataset

In [50]:
# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
import json

data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t',
                  encoding="utf-8")
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
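The two `map` calls above parse each `boilerplate` JSON string and pull out one field. Below is a minimal sketch of that step on made-up strings. The sample strings and the helper name `extract_field` are assumptions for illustration, not part of the dataset or the notebook. It also shows that `.get(field, '')` only falls back to `''` when the key is absent; a key that is present with a `null` value still comes back as `None`, which is presumably why later cells call `fillna` on the title column.

```python
import json

# Hypothetical boilerplate strings shaped like the dataset's `boilerplate` column
with_title = '{"title": "10 Foolproof Tips for Better Sleep", "body": "There was a period"}'
without_title = '{"url": "example com", "body": "some text"}'
null_title = '{"title": null, "body": "some text"}'


def extract_field(raw, field):
    """Parse one JSON string and return `field`, or '' if the key is absent or null."""
    value = json.loads(raw).get(field, '')
    return value if value is not None else ''


for raw in (with_title, without_title, null_title):
    print(repr(extract_field(raw, 'title')))
```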


In [52]:
## Load spacy

import spacy
nlp_toolkit = spacy.load("en")
nlp_toolkit

<spacy.lang.en.English at 0x1159f4d10>

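Another way to load `spacy` is to pass the full model package name instead of the `en` shortcut (this assumes the shortcut points to `en_core_web_sm`, the small English model):
```
import spacy
nlp_toolkit = spacy.load("en_core_web_sm")
```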

In [55]:
title = u"I really dislike 'Bloomberg LP'"
parsed = nlp_toolkit(title)

# print() calls work in both Python 2 and Python 3
for word in parsed:
    print("Word: {}".format(word))
    print("\t Phrase type: {}".format(word.dep_))
    print("\t Is the word a known entity type? {}".format(
        word.ent_type_ if word.ent_type_ else "No"))
    print("\t Lemma: {}".format(word.lemma_))
    print("\t Parent of this word: {}".format(word.head.lemma_))

Word: I
	 Phrase type: nsubj
	 Is the word a known entity type? No
	 Lemma: -PRON-
	 Parent of this word: dislike
Word: really
	 Phrase type: advmod
	 Is the word a known entity type? No
	 Lemma: really
	 Parent of this word: dislike
Word: dislike
	 Phrase type: ROOT
	 Is the word a known entity type? No
	 Lemma: dislike
	 Parent of this word: dislike
Word: '
	 Phrase type: punct
	 Is the word a known entity type? No
	 Lemma: '
	 Parent of this word: dislike
Word: Bloomberg
	 Phrase type: compound
	 Is the word a known entity type? No
	 Lemma: bloomberg
	 Parent of this word: lp
Word: LP
	 Phrase type: dobj
	 Is the word a known entity type? No
	 Lemma: lp
	 Parent of this word: dislike
Word: '
	 Phrase type: punct
	 Is the word a known entity type? No
	 Lemma: '
	 Parent of this word: dislike


## Investigate Page Titles

Let's see if we can find organizations in our page titles.

In [26]:
def references_organization(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

data['references_organization'] = data['title'].fillna(u'').map(references_organization)





In [28]:
# Take a look
data[data.references_organization][['title']].head(20)

Unnamed: 0,title
0,IBM Sees Holographic Calls Air Breathing Batte...
1,The Fully Electronic Futuristic Starting Gun T...
3,10 Foolproof Tips for Better Sleep
6,fashion lane American Wild Child
10,Business Financial News Breaking US Internatio...
12,9 Foods That Trash Your Teeth
14,French Onion Steaks with Red Wine Sauce french...
15,Izabel Goulart Swimsuit by Kikidoll 2012 Sport...
16,Liquid Mountaineering The Awesomer
21,BBC Food Recipes Blueberry and lemon traybake


## Exercise:

Write a function to identify titles that mention an organization (ORG) and a person (PERSON).

.
.
.
.
.
.
.
.

In [25]:
list1 = [True, False]
list2 = [False, False]
list3 = [True, True]

print(any(list1))
print(any(list2))
print(any(list3))

True
False
True
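`any` returns `True` as soon as it meets one truthy element. The sketch below applies the same check to a list of entity labels, using a generator expression instead of a list. The `labels` list is a hypothetical stand-in for the `word.ent_type_` values of one parsed title, so no spaCy model is needed to run it.

```python
# Hypothetical entity labels standing in for word.ent_type_ values of one title
labels = ["", "ORG", "", "PERSON"]

# A generator expression lets any() stop at the first match
# without building an intermediate list
has_org = any(label == "ORG" for label in labels)
has_person = any(label == "PERSON" for label in labels)
has_gpe = any(label == "GPE" for label in labels)

print(has_org, has_person, has_gpe)
print(has_org and has_person)  # both entity types present in the same title
```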


In [30]:
def references_org_person(title):
    parsed = nlp_toolkit(title)
    contains_org = any([word.ent_type_ == 'ORG'  for word in parsed])
    contains_person = any([word.ent_type_ == 'PERSON' for word in parsed])
    return contains_org and contains_person

data['references_org_person'] = data['title'].fillna(u'').map(references_org_person)

# Take a look
data[data.references_org_person][['title']].head(10)

Unnamed: 0,title
29,Genevieve Morton Swimsuit by Tyler Rose Swimwe...
44,Alyssa Miller Swimsuit by Charlie by Matthew Z...
89,4 Surprising Foods to Cook on the Grill Whiske...
91,Heidi s Favorite Snacks Heidi Klum on AOL heid...
105,Chicken and Spinach Casserole Martha Stewart R...
115,BBC News UK Sweet message in a bottle
126,Jessica Gomes Swimsuit by Beach Bunny Swimwear...
136,Most Beautiful Woman By Day Inventor By Night ...
140,Ferrero Rocher Tart Trissalicious
141,Pasta Primavera The Pioneer Woman Cooks pasta ...


### BACK TO SLIDE 34

## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

The columns are described below.

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Number of <embed> usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with a URL with a domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 <a> tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

> ### Let's try extracting some of the text content.
> ### Create a feature for the title containing 'recipe'. Is the percentage of evergreen websites higher or lower among pages that have 'recipe' in the title?

In [31]:
# Option 1: Create a function to check for this

# def has_recipe(text_in):
#     try:
#         if 'recipe' in str(text_in).lower():
#             return 1
#         else:
#             return 0
#     except: 
#         return 0
        
# data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions

# data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)


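# Option 3: string functions
# Note: unlike Options 1 and 2, str.contains is case-sensitive by default and
# returns NaN for missing titles; pass case=False, na=False to match them.
data['recipe'] = data['title'].str.contains('recipe')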
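To answer the question about evergreen percentages, one option is to group by the new flag and average the 0/1 label. The sketch below runs on a small made-up frame: `toy`, its titles, and its labels are assumptions for illustration, and the real numbers come from running the same two lines on `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "title": ["Apple Pie Recipe", "easy chicken recipe", "Stock market news",
              None, "recipe for disaster in sports"],
    "label": [1, 1, 0, 0, 0],
})

# case=False also matches 'Recipe'; na=False maps missing titles to False
toy["recipe"] = toy["title"].str.contains("recipe", case=False, na=False)

# The mean of a 0/1 label within each group is the share of evergreen pages
evergreen_share = toy.groupby("recipe")["label"].mean()
print(evergreen_share)
```

On the real data, `data.groupby('recipe')['label'].mean()` gives the same comparison.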

In [37]:
data.sample(10)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body,references_organization,references_org_person,recipe
797,http://sugarcrafter.net/2009/08/20/sesame-chic...,5154,"{""title"":""Sesame Chicken Sugarcrafter "",""body""...",arts_entertainment,0.722537,2.573529,0.45122,0.195122,0.146341,0.097561,...,82,2,0.329268,0.07377,1,Sesame Chicken Sugarcrafter,August 20 2009 Print E mail Filed under Asian ...,False,False,False
1607,http://www.amateurgourmet.com/,7965,"{""title"":""The Amateur Gourmet A Funny Food Blo...",recreation,0.647761,1.980392,0.398058,0.174757,0.009709,0.009709,...,103,0,0.019417,0.117318,1,The Amateur Gourmet A Funny Food Blog with Rec...,March 16 2012 By Adam Roberts 0 Comments You h...,True,False,False
2416,http://www.buzzfeed.com/rchemel/florida-state-...,6425,"{""title"":""Florida State Freeze Play VIDEO flor...",sports,0.442106,1.648026,0.439076,0.07563,0.029412,0.016807,...,476,8,0.191176,0.099265,0,Florida State Freeze Play VIDEO florida state ...,hb set header buzz link buzz null form video c...,False,False,False
3057,http://www.cbc.ca/stevenandchris/2012/01/sweet...,9823,"{""title"":""Steven and Chris Mini Sweet Potato C...",science_technology,0.420955,1.566667,0.416667,0.197917,0.083333,0.010417,...,96,6,0.104167,0.054393,1,Steven and Chris Mini Sweet Potato Cheesecakes...,1 2 package phyllo covered in a damp towel abo...,False,False,False
5293,http://o5.com/cleanse-your-body-with-vegetable...,9285,"{""title"":""Cleanse your body with vegetable jui...",?,?,3.224138,0.512821,0.264957,0.119658,0.025641,...,117,6,0.239316,0.062827,1,Cleanse your body with vegetable juices,Written by Jamie Tired and sluggish Try taking...,False,False,False
1669,http://www.huffingtonpost.com/2012/05/02/salte...,6159,"{""title"":""Salted Caramel Banana Bread Puddings...",business,0.707981,1.947059,0.67581,0.241895,0.052369,0.027431,...,401,3,0.209476,0.0626,1,Salted Caramel Banana Bread Puddings salted ca...,Salted Caramel Banana Bread Puddings Sang An 1...,False,False,False
3722,http://sportsillustrated.cnn.com/2013_swimsuit...,8929,"{""url"":""sportsillustrated cnn 2013 swimsuit mo...",recreation,0.655718,0.073529,0.0,0.0,0.0,0.0,...,79,4,0.088608,0.038462,0,Genevieve Morton Swimsuit Photos Sports Illust...,genevieve morton si swimsuit 2013 photos and ...,True,False,False
644,http://www.medindia.net/news/new-plastics-heal...,894,"{""title"":""New Plastics Heals Like Skin After I...",health,0.776904,1.973366,0.781818,0.364773,0.107955,0.022727,...,880,10,0.267045,0.083481,0,New Plastics Heals Like Skin After It Bleeds W...,A new genre of plastics that imitate the human...,True,False,False
3164,http://www.ivillage.com/cucumber-black-eyed-pe...,2802,"{""title"":""Cucumber Black Eyed Pea Salad EW "",""...",?,?,2.232295,0.756906,0.367403,0.066298,0.022099,...,362,5,0.129834,0.011494,1,Cucumber Black Eyed Pea Salad EW,An easy salad to serve with grilled chicken or...,False,False,False
5070,http://www.womansday.com/Recipes/Good-Old-Fash...,6083,"{""title"":""Good Old Fashioned Apple Pie at Woma...",?,?,2.252941,0.651429,0.262857,0.108571,0.017143,...,175,6,0.022857,0.025641,1,Good Old Fashioned Apple Pie at WomansDay com ...,Photo Charles Schiller Yield 1 pie Servings 8 ...,True,True,False


### Demo: Use of the Count Vectorizer

In [39]:
titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=False)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
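To make the shape of the resulting matrix concrete, here is a minimal sketch on three made-up titles. The names `toy_titles`, `toy_vectorizer`, and `toy_X` are assumptions for illustration. With `ngram_range=(1, 2)` every single word and every pair of adjacent words becomes a column, and each cell holds the count of that term in that title.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles, not taken from the dataset
toy_titles = ["chocolate cake recipe", "chicken recipe", "best chocolate cake"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_X = toy_vectorizer.fit_transform(toy_titles)  # sparse matrix: titles x terms

# vocabulary_ maps each term (unigram or bigram) to its column index
for term, column in sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(column, term)

print(toy_X.toarray())
```

`fit_transform` returns a sparse matrix; `.toarray()` is convenient for inspection, but scikit-learn's random forest also accepts the sparse form directly, which saves memory on larger vocabularies.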

### Demo: Build a random forest model to predict evergreenness of a website using the title features

In [40]:
vectorizer.transform(titles).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [41]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']

from sklearn.model_selection import cross_val_score

# cv=3 makes the fold count explicit (newer scikit-learn versions default to 5)
scores = cross_val_score(model, X, y, cv=3, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

CV AUC [ 0.78852784  0.80821542  0.80886048], Average AUC 0.801867912845


In [57]:
# In scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
vectorizer.get_feature_names()[:10]

[u'000',
 u'10',
 u'10 best',
 u'10 things',
 u'10 ways',
 u'100',
 u'101',
 u'11',
 u'12',
 u'13']

#### What features are the most important?

In [58]:
# What features of these are most important?
model.fit(X, y)

all_feature_names = vectorizer.get_feature_names()
feature_importances = pd.DataFrame({'Features' : all_feature_names, 'Importance Score': model.feature_importances_})
feature_importances.sort_values('Importance Score', ascending=False).head()

Unnamed: 0,Features,Importance Score
719,recipe,0.050203
724,recipes,0.02389
193,chocolate,0.013942
184,chicken,0.013808
153,cake,0.012926


In [59]:
feature_importances['Importance Score'].sum()

0.9999999999999996

### Exercise: Build a random forest model to predict evergreenness of a website using the title features and quantitative features

In [None]:
## TODO
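One possible starting point, offered as a sketch rather than the intended solution: stack the title count matrix and the numeric columns side by side with `scipy.sparse.hstack`. The data below is made up; `toy_numeric` stands in for quantitative columns such as `numberOfLinks` and `image_ratio`. On the real data, columns that show `?` in the sample above would need cleaning before they can be used as numbers.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for titles, numeric columns, and labels
toy_titles = ["apple pie recipe", "chicken soup recipe", "garden tips",
              "election news today", "stock market update", "game score tonight"]
toy_numeric = np.array([[82, 0.10], [96, 0.25], [40, 0.05],
                        [170, 0.01], [476, 0.02], [162, 0.03]])
toy_labels = [1, 1, 1, 0, 0, 0]

# Text features (sparse) and numeric features, joined column-wise
title_counts = CountVectorizer().fit_transform(toy_titles)
X_combined = hstack([title_counts, csr_matrix(toy_numeric)]).tocsr()

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_combined, toy_labels)
print(X_combined.shape)
```

Tree-based models do not need the numeric columns rescaled, so the two blocks can be combined as they are.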



### Exercise: Build a random forest model to predict evergreenness of a website using the body features

In [None]:
## TODO

### Exercise: Use `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?

In [None]:
## TODO
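A minimal sketch of one way to set this up, not the reference answer: wrap `TfidfVectorizer` and the random forest in a pipeline and cross-validate it. The titles and labels below are made up, so the scores say nothing about the real dataset. To compare the two vectorizers, run the same pipeline on `titles` and `data['label']` and swap in `CountVectorizer`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: six evergreen-style and six news-style titles
toy_titles = [
    "apple pie recipe", "chicken soup recipe", "chocolate cake recipe",
    "banana bread recipe", "garlic pasta recipe", "lemon tart recipe",
    "election news today", "stock market news", "football score news",
    "weather news update", "celebrity news photos", "tech news roundup",
]
toy_labels = [1] * 6 + [0] * 6

# The pipeline refits the vectorizer inside each fold, so the vocabulary
# and IDF weights are learned from that fold's training titles only
tfidf_forest = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words='english'),
    RandomForestClassifier(n_estimators=20, random_state=0),
)

toy_scores = cross_val_score(tfidf_forest, toy_titles, toy_labels,
                             cv=3, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(toy_scores, toy_scores.mean()))
```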