# Class 18 - Solution Code

Pipelines

In [1]:
import pandas as pd
import json

data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head(1)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...


### Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of number of <embed> usage
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### Review: Use of the Count Vectorizer

We previously used the Count Vectorizer to extract text features for this classification task.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

titles = data['title'].fillna('')

vectorizer = CountVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

# # Use `fit` to learn the vocabulary of the titles
# vectorizer.fit(titles)

# # Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
# X = vectorizer.transform(titles)
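
As a reminder of what `fit` and `transform` produce, here is a minimal sketch on a few hypothetical toy titles (they are not taken from the StumbleUpon data, and the names `toy_titles`, `toy_vectorizer` and `toy_X` are only for illustration). It uses the same `ngram_range`, `stop_words` and `binary` settings as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles, for illustration only
toy_titles = [
    "easy chocolate brownies recipe",
    "chocolate cake recipe",
    "stock market news today",
]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                 stop_words='english',
                                 binary=True)

# `fit` learns the vocabulary (unigrams and bigrams) from the titles
toy_vectorizer.fit(toy_titles)

# `transform` returns a sparse matrix: one row per title, one column per term
toy_X = toy_vectorizer.transform(toy_titles)

print(toy_X.shape)
print(sorted(toy_vectorizer.vocabulary_)[:5])
```

Because `binary=True`, each entry only records whether a term appears in a title, not how many times.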

### Review: Build a model to predict evergreenness of a website

Then we used those features to build a classification model.

In [3]:
from sklearn.linear_model import LogisticRegression

# The L1 penalty needs a solver that supports it, such as liblinear
model = LogisticRegression(penalty='l1', solver='liblinear')
# y = data['label']

# from sklearn.model_selection import cross_val_score

# scores = cross_val_score(model, X, y, scoring='roc_auc')
# print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

### Demo: Pipelines
 
Often we will want to combine these steps to evaluate on some future dataset. For that incoming, future dataset, we need to make sure we perform the **exact same** transformations on the data. If `has_brownies_in_text` is column 19, we need to make sure it is column 19 when it comes to evaluation time. 
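
To make the column-order point concrete, here is a minimal sketch on hypothetical toy titles. It assumes a `CountVectorizer` term column plays the role of a feature like `has_brownies_in_text`; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, for illustration only
train_titles = ["chocolate brownies recipe", "stock market news"]
new_titles = ["apple pie and brownies"]

vec = CountVectorizer(binary=True)
vec.fit(train_titles)

# Column index learned for "brownies" at training time
brownies_col = vec.vocabulary_['brownies']

# Reusing the fitted vectorizer keeps the same columns for new data
X_new_ok = vec.transform(new_titles)
print(X_new_ok.shape[1] == len(vec.vocabulary_), X_new_ok[0, brownies_col])

# Refitting on the new data instead builds a different vocabulary,
# so "brownies" lands in another column
refit = CountVectorizer(binary=True).fit(new_titles)
print(brownies_col, refit.vocabulary_['brownies'])
```

A model trained on the first layout would read the wrong columns if it were fed the refit matrix, which is the mistake a pipeline is meant to prevent.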

Pipelines combine all of the pre-processing steps and model building into a single object.

Rather than manually applying the transformers and then feeding their output into the model, pipelines tie these steps together. Like models in scikit-learn, they are equipped with `fit` and `predict` or `predict_proba` methods, but they also ensure the same data transformations are applied at training and prediction time.

[Docs](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for Pipeline
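
As a rough illustration of what a pipeline saves us from doing by hand, the sketch below fits a vectorizer and a model manually, then fits the same two steps as a `Pipeline`, and checks that the predicted probabilities agree. The toy titles and labels are hypothetical, and the model uses default regularization rather than the L1 penalty used elsewhere in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = evergreen, 0 = not), for illustration only
train_titles = ["chocolate brownies recipe", "easy pasta recipe",
                "election results tonight", "stock market falls today"]
train_labels = [1, 1, 0, 0]
new_titles = ["brownies and pasta", "market results"]

# Manual route: every step has to be repeated, in order, on new data
manual_vec = CountVectorizer(binary=True)
manual_model = LogisticRegression(solver='liblinear')
manual_model.fit(manual_vec.fit_transform(train_titles), train_labels)
manual_probs = manual_model.predict_proba(manual_vec.transform(new_titles))

# Pipeline route: one object remembers the steps and their order
toy_pipeline = Pipeline([
    ('features', CountVectorizer(binary=True)),
    ('model', LogisticRegression(solver='liblinear')),
])
toy_pipeline.fit(train_titles, train_labels)
pipeline_probs = toy_pipeline.predict_proba(new_titles)

# Both routes should give the same probabilities
print(np.allclose(manual_probs, pipeline_probs))
```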

In [4]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('features', vectorizer),
        ('model', model)   
    ])

In [5]:
# Split the data into a training set
training_data = data[:6000]
X_train = training_data['title'].fillna('')
y_train = training_data['label']

# These rows are rows obtained in the future, unavailable at training time
X_new = data[6000:]['title'].fillna('')
y_new = data[6000:]['label']

In [6]:
# Fit the full pipeline
# This means we perform the steps laid out above
# First we fit the vectorizer, 
# and then feed the output of that into the fit function of the model
pipeline.fit(X_train, y_train)

Pipeline(steps=[('features', CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
  ...ty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [7]:
# Here again we apply the full pipeline for predictions
# The text is transformed automatically to match the features from the pipeline
pipeline.predict_proba(X_new)

array([[ 0.54496132,  0.45503868],
       [ 0.40246279,  0.59753721],
       [ 0.01265229,  0.98734771],
       ..., 
       [ 0.29678178,  0.70321822],
       [ 0.61249958,  0.38750042],
       [ 0.63559493,  0.36440507]])
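
To read an array like the one above, it helps to know which column belongs to which class. The sketch below, on hypothetical toy data, looks up the column order through the fitted model's `classes_` attribute and pulls out the probability of the positive (evergreen) class.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = evergreen, 0 = not), for illustration only
train_titles = ["chocolate brownies recipe", "easy pasta recipe",
                "election results tonight", "stock market news"]
train_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ('features', CountVectorizer(binary=True)),
    ('model', LogisticRegression(solver='liblinear')),
])
toy_pipeline.fit(train_titles, train_labels)

probs = toy_pipeline.predict_proba(["brownies recipe", "market news"])

# Columns of `predict_proba` follow the order of `classes_`
classes = toy_pipeline.named_steps['model'].classes_
print(classes)

# Each row sums to 1 across the classes
print(probs.sum(axis=1))

# Probability of the evergreen class (label 1) for each title
evergreen_prob = probs[:, list(classes).index(1)]
print(evergreen_prob)
```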

In [8]:
pipeline.score(X_new, y_new)

0.76415770609319
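
For a classifier, `score` reports mean accuracy on the data it is given. If we wanted the cross-validated AUC from the review section instead, one option is to pass the whole pipeline to `cross_val_score`, so the vectorizer is refit inside each fold. This is a minimal sketch on hypothetical toy data; the fold count and titles are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = evergreen, 0 = not), for illustration only
toy_titles = [
    "chocolate brownies recipe", "easy pasta recipe", "classic bread recipe",
    "homemade pizza recipe", "garden soup recipe", "lemon cake recipe",
    "election results tonight", "stock market news", "storm warning tonight",
    "sports results today", "traffic news today", "market closes lower today",
]
toy_labels = [1] * 6 + [0] * 6

toy_pipeline = Pipeline([
    ('features', CountVectorizer(binary=True)),
    ('model', LogisticRegression(solver='liblinear')),
])

# The pipeline is cloned and refit per fold, so each fold's vocabulary
# is learned from that fold's training titles only
scores = cross_val_score(toy_pipeline, toy_titles, toy_labels,
                         cv=3, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
```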

In [9]:
pipeline.get_params()

{'features': CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
         dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
         lowercase=True, max_df=1.0, max_features=1000, min_df=1,
         ngram_range=(1, 2), preprocessor=None, stop_words='english',
         strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None),
 'features__analyzer': u'word',
 'features__binary': True,
 'features__decode_error': u'strict',
 'features__dtype': numpy.int64,
 'features__encoding': u'utf-8',
 'features__input': u'content',
 'features__lowercase': True,
 'features__max_df': 1.0,
 'features__max_features': 1000,
 'features__min_df': 1,
 'features__ngram_range': (1, 2),
 'features__preprocessor': None,
 'features__stop_words': 'english',
 'features__strip_accents': None,
 'features__token_pattern': u'(?u)\\b\\w\\w+\\b',
 'features__tokenizer': None,
 'features__vocabulary': None,
 'model': LogisticRegression(C=1.0, class_
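
The keys in this dictionary follow the pattern `<step name>__<parameter name>`. The same keys can be passed to `set_params` to change a setting of one step without rebuilding the pipeline. A minimal sketch, with a stand-in pipeline and arbitrary example values:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in pipeline with the same step names as above
toy_pipeline = Pipeline([
    ('features', CountVectorizer(max_features=1000, binary=True)),
    ('model', LogisticRegression(solver='liblinear')),
])

# Keys follow the pattern <step name>__<parameter name>
toy_pipeline.set_params(features__max_features=500, model__C=0.1)

print(toy_pipeline.get_params()['features__max_features'])
print(toy_pipeline.named_steps['model'].C)
```

Parameter grids for tools such as `GridSearchCV` use the same key names when the estimator being searched is a pipeline.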

### Exercise: Add a `MaxAbsScaler` scaling step to the pipeline, placed after the vectorization

[Docs](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) for `MaxAbsScaler`

In [10]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()

In [11]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('features', vectorizer),
        ('scaling', scaler),
        ('model', model)   
    ])

pipeline.fit(X_train, y_train)

Pipeline(steps=[('features', CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
  ...ty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [12]:
pipeline.predict_proba(X_new)

array([[ 0.54500589,  0.45499411],
       [ 0.4024352 ,  0.5975648 ],
       [ 0.01265275,  0.98734725],
       ..., 
       [ 0.29679762,  0.70320238],
       [ 0.61249995,  0.38750005],
       [ 0.63447925,  0.36552075]])

In [13]:
pipeline.score(X_new, y_new)

0.76415770609319
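
The score matches the one from the pipeline without scaling. One likely reason is that the vectorizer was created with `binary=True`: every feature is 0 or 1, so each column's maximum absolute value is already 1 and `MaxAbsScaler` has nothing to rescale. The sketch below checks this on hypothetical toy titles and contrasts it with raw counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy titles with repeated words, for illustration only
toy_titles = ["recipe recipe brownies", "market news", "brownies news news news"]

# Binary indicators: every column's maximum absolute value is already 1
X_binary = CountVectorizer(binary=True).fit_transform(toy_titles)
X_binary_scaled = MaxAbsScaler().fit_transform(X_binary)
print(np.allclose(X_binary.toarray(), X_binary_scaled.toarray()))

# Raw counts: columns with repeated words are rescaled into [0, 1]
X_counts = CountVectorizer(binary=False).fit_transform(toy_titles)
X_counts_scaled = MaxAbsScaler().fit_transform(X_counts)
print(X_counts.max(), X_counts_scaled.max())
```

With count features (`binary=False`) the scaling step would change the inputs, so it could also change the fitted model.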

In [14]:
pipeline.get_params()

{'features': CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
         dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
         lowercase=True, max_df=1.0, max_features=1000, min_df=1,
         ngram_range=(1, 2), preprocessor=None, stop_words='english',
         strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None),
 'features__analyzer': u'word',
 'features__binary': True,
 'features__decode_error': u'strict',
 'features__dtype': numpy.int64,
 'features__encoding': u'utf-8',
 'features__input': u'content',
 'features__lowercase': True,
 'features__max_df': 1.0,
 'features__max_features': 1000,
 'features__min_df': 1,
 'features__ngram_range': (1, 2),
 'features__preprocessor': None,
 'features__stop_words': 'english',
 'features__strip_accents': None,
 'features__token_pattern': u'(?u)\\b\\w\\w+\\b',
 'features__tokenizer': None,
 'features__vocabulary': None,
 'model': LogisticRegression(C=1.0, class_