# GUIDED PRACTICE: Pipelines with StumbleUpon Evergreen

## 1. Introduction

We will use the StumbleUpon dataset to build our first Pipeline. StumbleUpon is a site that recommends pages and content to its users based on their interests. Some of the recommended pages are relevant only for a short time (news, cooking recipes, etc.), while others hold their interest over time and can be recommended to users long after they were published. Pages can therefore be classified as "ephemeral" or "evergreen".

The goal is to build a classifier that assigns pages to these two categories in order to improve the site's recommendation system.

Along the way, we will try to show why pipelines are useful.

**Note:** this practice is based on a [Kaggle challenge](https://www.kaggle.com/c/stumbleupon).

## 2. "Simple" pipelines

* First we import the data, packages, etc.

In [1]:
from sklearn.pipeline import Pipeline
import pandas as pd
import json

data = pd.read_csv("train.tsv", sep='\t')
data.boilerplate.head()

0    {"title":"IBM Sees Holographic Calls Air Breat...
1    {"title":"The Fully Electronic Futuristic Star...
2    {"title":"Fruits that Fight the Flu fruits tha...
3    {"title":"10 Foolproof Tips for Better Sleep "...
4    {"title":"The 50 Coolest Jerseys You Didn t Kn...
Name: boilerplate, dtype: object

In [2]:
data.boilerplate[1]

'{"title":"The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races","body":"And that can be carried on a plane without the hassle too The Omega E Gun Starting Pistol Omega It s easy to take for granted just how insanely close some Olympic races are and how much the minutiae of it all can matter The perfect example is the traditional starting gun Seems easy You pull a trigger and the race starts Boom What people don t consider When a conventional gun goes off the sound travels to the ears of the closest runner a fraction of a second sooner than the others That s just enough to matter and why the latest starting pistol has traded in the mechanical boom for orchestrated electronic noise Omega has been the watch company tasked as the official timekeeper of the Olympic Games since 1932 At the 2010 Vancouv

* From the `boilerplate` field we extract the `title` and `body` subfields and add them to `data` as new columns.
* We fill the missing values with `''`.

* We check the resulting values.

In [3]:
data['title'] = data.boilerplate.apply(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.apply(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
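The extraction step can be tried on toy strings, without the dataset. The records below are made up for illustration. The point of the sketch is that `.get('title', '')` only supplies the default when the key is absent, so a JSON `null` still comes back as `None`, which is presumably why the next cell calls `fillna('')`.

```python
import json

# Hypothetical records imitating the shape of the `boilerplate` field.
records = [
    '{"title": "Better Sleep", "body": "Ten tips"}',
    '{"title": null, "body": "No title here"}',   # key present, value null
    '{"body": "Key missing entirely"}',            # key absent
]

# Same expression as in the notebook, applied to each toy record.
titles_demo = [json.loads(r).get('title', '') for r in records]
print(titles_demo)  # ['Better Sleep', None, '']
```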


In [4]:
titles = data['title'].fillna('')
body = data['body'].fillna('')
y = data['label']
titles[0:3]

#y[0:3]
#y.value_counts() / len(y)

0    IBM Sees Holographic Calls Air Breathing Batte...
1    The Fully Electronic Futuristic Starting Gun T...
2    Fruits that Fight the Flu fruits that fight th...
Name: title, dtype: object

* We will use the `CountVectorizer` class to extract a bag-of-words vector from the titles.

    **Parameters:**

    1. `max_features`: Keeps only the top X terms, ordered by their frequency across the corpus.
    2. `ngram_range`: tuple (min_n, max_n): The lower and upper bounds on the n-gram size; for example, `(1, 2)` uses single words and pairs of words.
    3. `stop_words`: With `'english'`, discards articles and other English words with little predictive power. Custom lists can also be used.
    4. `binary`: Each entry is 0 or 1 (presence or absence); repeated occurrences are not accumulated.
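As a minimal sketch of how `ngram_range` and `binary` shape the output, the toy sentences below (not taken from the dataset) are vectorized with unigrams and bigrams. The names `docs`, `vec_demo` and `X_demo` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["air breathing air", "ibm sees air"]  # toy sentences, not from the dataset

# Unigrams and bigrams, with presence/absence instead of raw counts.
vec_demo = CountVectorizer(ngram_range=(1, 2), binary=True)
X_demo = vec_demo.fit_transform(docs)

# vocabulary_ maps each term to its column; columns follow alphabetical order.
vocab = sorted(vec_demo.vocabulary_)
print(vocab)
print(X_demo.toarray())
```

Although 'air' occurs twice in the first sentence, its entry is 1 because `binary=True`. Without that flag the same entry would hold the count 2.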

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1000,
                             stop_words='english',
                             binary=True)
vectorizer.fit(['IBM Sees Holographic Calls Air Breathing'])

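# Note: get_feature_names() was removed in scikit-learn 1.2;
# on recent versions call vectorizer.get_feature_names_out() instead.
vectorizer.get_feature_names()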

['air', 'breathing', 'calls', 'holographic', 'ibm', 'sees']

* We test the vectorizer on a sentence.

* The `todense` method lets us view the resulting matrix in dense rather than sparse format.

* Comparing against the vocabulary learned above shows which entries of the matrix are set.

* Examples:

  * 'air' is the first feature. It is present in the test sentence, so we get a 1.
  * 'breathing' is the second. It is absent from the test sentence, so we get a 0.

In [6]:
vectorizer.transform(['IBM Sees Holographic Air']).todense()

matrix([[1, 0, 0, 1, 1, 1]], dtype=int64)

* We fit the vectorizer again, this time on all the titles.
* The matrix `X` holds the most frequent terms in our dataset (single words only, since `ngram_range` keeps its default of `(1, 1)`, and at most 1000 of them because of `max_features`). This is what we feed to our classifier.
* We instantiate a logistic regression model from `sklearn` and evaluate it with cross-validation.
* We then print the scores obtained. There are 3 because 3-fold was the default strategy of `cross_val_score` in the scikit-learn version used here (from version 0.22 on, the default is 5 folds); the `cv` parameter controls this.

In [7]:
vectorizer.fit(titles)

X = vectorizer.transform(titles)

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression()
scores = cross_val_score(model, X, y)

print('CV scores: {}'.format(scores))

print('Average CVScore: {:0.3f} +/- {:0.3f}'.format(scores.mean(),
                                                    scores.std()))

CV scores: [ 0.75709651  0.75659229  0.75121753]
Average CVScore: 0.755 +/- 0.003
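The number of scores is set by the `cv` argument of `cross_val_score`. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the StumbleUpon features) to show that passing `cv` explicitly makes the result independent of the library's default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the vectorized titles.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

scores3 = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=3)
scores5 = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5)

print(len(scores3), len(scores5))  # 3 5
```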


* Now we create a pipeline with the steps executed above:
    1. The text vectorizer
    2. The logistic regression model


* First we split the data into a training set and a test set.

In [9]:
training_data = data[:6000]
X_train = training_data['title'].fillna('')
y_train = training_data['label']

X_new = data[6000:]['title'].fillna('')
y_new = data[6000:]['label']

X_new.head()

6000    Nomskulls Skull Cupcake Mold nomskulls - skull...
6001                    Medicine Ball Exercises Workouts 
6002    Spanakopita Greek Spinach Pie Recipe spanakopi...
6003    Dishes Made With The Drink PHOTOS 28 amazing w...
6004    The best sports clips of 2012 Classic YouTube ...
Name: title, dtype: object
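The split above slices by position, which keeps the original row order. An alternative that is not used in this notebook is `train_test_split`, which shuffles and can preserve the class balance. The sketch below runs on toy lists; `titles_toy` and `labels_toy` are hypothetical stand-ins for the real columns.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for data['title'] and data['label'].
titles_toy = ['title %d' % i for i in range(20)]
labels_toy = [0, 1] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    titles_toy, labels_toy,
    test_size=0.25,        # hold out a quarter of the rows
    stratify=labels_toy,   # keep the class balance in both parts
    random_state=0,        # reproducible shuffle
)
print(len(X_tr), len(X_te))  # 15 5
```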

* We import and create the pipeline.
* We fit it on the training set and run it on the test set.

In [10]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('vec', vectorizer),
        ('model', LogisticRegression())])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_new)
pred

array([1, 1, 1, ..., 1, 0, 0], dtype=int64)
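One practical benefit of a pipeline is that it can be cross-validated as a single estimator: the vectorizer is then refit on each training fold, so its vocabulary never sees the held-out titles. The following is a minimal sketch on a made-up corpus; the texts, labels and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus: recipes labelled 1 ("evergreen"), news labelled 0 ("ephemeral").
toy_texts = [
    "chocolate cake recipe", "easy pasta recipe", "best brownie recipe",
    "simple bread recipe", "quick soup recipe", "classic pie recipe",
    "election results today", "stock market news today", "breaking news update",
    "football scores today", "weather news tonight", "celebrity news today",
]
toy_labels = [1] * 6 + [0] * 6

toy_pipeline = Pipeline([
    ('vec', CountVectorizer(binary=True)),
    ('model', LogisticRegression()),
])

# Raw text goes in; vectorizing and fitting both happen inside each fold.
toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=3)
print(toy_scores)
```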

* Let's compare the predictions with the labels.
* `predict` already returns class labels (0 or 1), so the predictions can be passed directly to the classification report together with the true labels.

In [11]:
# pred_bool = pred[:, 0] < 0.5  # only needed if `pred` held probabilities rather than class labels
from sklearn.metrics import classification_report
print(classification_report(y_new, pred))

             precision    recall  f1-score   support

          0       0.71      0.85      0.78       675
          1       0.83      0.68      0.75       720

avg / total       0.77      0.76      0.76      1395
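To make the columns of the report concrete, here is a small sketch that computes precision and recall for class 1 on toy arrays. The values of `y_true_toy` and `y_hat_toy` are invented for the example.

```python
from sklearn.metrics import precision_score, recall_score

y_true_toy = [0, 0, 1, 1, 1]   # toy labels
y_hat_toy = [0, 1, 1, 1, 0]    # toy predictions

# For class 1: 2 true positives, 1 false positive, 1 false negative.
p = precision_score(y_true_toy, y_hat_toy)  # TP / (TP + FP) = 2/3
r = recall_score(y_true_toy, y_hat_toy)     # TP / (TP + FN) = 2/3
print(round(p, 3), round(r, 3))  # 0.667 0.667
```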



## 3. Combining pipelines and GridSearchCV

Let's now see how to use pipelines together with hyperparameter tuning via `GridSearchCV`.

In [12]:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

We build a pipeline with three steps:

1. A text vectorizer: `CountVectorizer`
2. A transformer that reweights the count matrix: `TfidfTransformer`
3. A classifier based on Support Vector Machines

Note that in this case the steps are instantiated inline rather than beforehand.

In [13]:
pipeline = Pipeline([
   ('vect', CountVectorizer()), 
   ('tfidf', TfidfTransformer()), 
   ('svc', svm.SVC()), 
])

* We define the parameters to search over.
  - Note how the parameters are named: `<step name>__<parameter>`, with a double underscore.
* So the parameters we use in `GridSearchCV` are
  - for `CountVectorizer` (called `vect` in the pipeline): `max_df` and `ngram_range`
  - for `SVC` (called `svc` in the pipeline): `kernel` and `C`
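The same `step__parameter` naming works outside a grid search. As a minimal sketch, `set_params` and `get_params` on a pipeline accept and expose these names; `demo_pipeline` is an illustrative name, built with the same three steps.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svc', SVC()),
])

# "<step name>__<parameter>" reaches into the named step.
demo_pipeline.set_params(vect__ngram_range=(1, 2), svc__C=0.5)

print(demo_pipeline.named_steps['vect'].ngram_range)  # (1, 2)
print(demo_pipeline.named_steps['svc'].C)             # 0.5
print('svc__kernel' in demo_pipeline.get_params())    # True
```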

In [14]:
parameters = {
    'vect__max_df': np.linspace(0.01,0.7,5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'svc__kernel': ['linear'],
    'svc__C': np.linspace(0.01,0.7,10)
}
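The size of a search can be checked before launching it. The sketch below rebuilds the same grid under the name `grid` (an illustrative copy) and counts its combinations with `ParameterGrid`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = {
    'vect__max_df': np.linspace(0.01, 0.7, 5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'svc__kernel': ['linear'],
    'svc__C': np.linspace(0.01, 0.7, 10),
}

n_candidates = len(ParameterGrid(grid))
print(n_candidates)      # 5 * 2 * 1 * 10 = 100
print(n_candidates * 3)  # 300 fits with 3-fold cross-validation
```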

In [15]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=2)

In [16]:
print("Performing grid search...")
grid_search.fit(X_train, y_train)

Performing grid search...
Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:   41.1s
[Parallel(n_jobs=3)]: Done 156 tasks      | elapsed:  2.8min
[Parallel(n_jobs=3)]: Done 300 out of 300 | elapsed:  5.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=3,
       param_grid={'vect__max_df': array([ 0.01  ,  0.1825,  0.355 ,  0.5275,  0.7   ]), 'vect__ngram_range': ((1, 1), (1, 2)), 'svc__kernel': ['linear'], 'svc__C': array([ 0.01   ,  0.08667,  0.16333,  0.24   ,  0.31667,  0.39333,
        0.47   ,  0.54667,  0.62333,  0.7    ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

* And we print the best parameters

In [17]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t %s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.766
Best parameters set:
	 svc__C: 0.54666666666666663
	 svc__kernel: 'linear'
	 vect__max_df: 0.1825
	 vect__ngram_range: (1, 1)


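**BONUS:** Is the computation time much better with `RandomizedSearchCV`? In the run recorded below, sampling 50 of the 100 candidates halves the number of fits (150 instead of 300) and the search takes about 2.5 min instead of 5.1 min, while reaching the same best score of 0.766.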

In [18]:
from sklearn.model_selection import RandomizedSearchCV
rand_search = RandomizedSearchCV(pipeline, parameters, n_jobs=3, verbose=2, n_iter=50)

In [19]:
print("Performing randomized search...") 
rand_search.fit(X_train, y_train)

Performing randomized search...
Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:   42.9s
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed:  2.5min finished


RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
          fit_params=None, iid=True, n_iter=50, n_jobs=3,
          param_distributions={'vect__max_df': array([ 0.01  ,  0.1825,  0.355 ,  0.5275,  0.7   ]), 'vect__ngram_range': ((1, 1), (1, 2)), 'svc__kernel': ['linear'], 'svc__C': array([ 0.01   ,  0.08667,  0.16333,  0.24   ,  0.31667,  0.39333,
        0.47   ,  0.54667,  0.62333,  0.7    ])},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=2)

In [20]:
print("Best score: %0.3f" % rand_search.best_score_)
print("Best parameters set:")
best_parameters_rand = rand_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t %s: %r" % (param_name, best_parameters_rand[param_name]))

Best score: 0.766
Best parameters set:
	 svc__C: 0.54666666666666663
	 svc__kernel: 'linear'
	 vect__max_df: 0.35499999999999998
	 vect__ngram_range: (1, 1)
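What the randomized search draws can be inspected with `ParameterSampler`. This sketch assumes the same grid as above, copied under the illustrative name `grid`. Because every entry is a list or array rather than a distribution, the candidates are sampled without replacement, so no combination is tried twice.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

grid = {
    'vect__max_df': np.linspace(0.01, 0.7, 5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'svc__kernel': ['linear'],
    'svc__C': np.linspace(0.01, 0.7, 10),
}

# Draw 50 of the 100 possible combinations.
sampled = list(ParameterSampler(grid, n_iter=50, random_state=0))

# Turn each candidate into a hashable key to count distinct combinations.
distinct = {tuple(sorted(c.items(), key=lambda kv: kv[0])) for c in sampled}
print(len(sampled), len(distinct))  # 50 50
```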


## 4. Preprocessing

Sometimes the classes available in sklearn's preprocessing module fall short: we may need a transformation that the module does not provide.

Let's look at two very simple examples of how to extend the module's functionality. There are two basic ways to do it:

1. Extend the base classes (`BaseEstimator` and `TransformerMixin`)
2. Use `FunctionTransformer`

### 4.1. Extending the base classes in scikit-learn


In this example, we create a very simple transformer that returns its input multiplied by a given factor:

In [22]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

In [23]:
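class FeatureMultiplier(BaseEstimator, TransformerMixin):
    """Multiply every feature by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        return X * self.factor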

In [24]:
fm = FeatureMultiplier(2)
test = np.diag([1,2,3,4])
print(test)

[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]


In [25]:
fm.transform(test)

array([[2, 0, 0, 0],
       [0, 4, 0, 0],
       [0, 0, 6, 0],
       [0, 0, 0, 8]])
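Inheriting from the two base classes is what lets such a transformer plug into the rest of scikit-learn. The sketch below redefines the class so that it is self-contained. It lists the mixin first, the order that scikit-learn's developer guide recommends, which differs from the order used above. The names `fm_demo` and `chain` are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class FeatureMultiplier(TransformerMixin, BaseEstimator):
    """Same idea as above: multiply the input by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X * self.factor


fm_demo = FeatureMultiplier(3)
print(fm_demo.get_params())              # {'factor': 3}, provided by BaseEstimator
print(fm_demo.fit_transform(np.eye(2)))  # fit + transform, provided by TransformerMixin

# Two custom steps chained; their parameters follow the step__param naming.
chain = Pipeline([('double', FeatureMultiplier(2)), ('triple', FeatureMultiplier(3))])
chain.set_params(double__factor=10)
out = chain.fit_transform(np.eye(2))     # 10 * 3 * identity
print(out)
```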

* Suppose we want to build a transformer that extracts the length of each text body...

In [32]:
class len_body(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, body):
        # Character length of each text.
        return [len(w) for w in body]

In [34]:
# largo = len_body()
# largo.transform(body)
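To feed a length feature to an estimator inside a pipeline, the transformer needs to return a 2-D array with one row per sample. The following is a sketch of one such variant on toy data. The class name `TextLength`, the texts and the labels are all hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class TextLength(TransformerMixin, BaseEstimator):  # hypothetical name
    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X):
        # One row per text, one column holding its character length.
        return np.array([len(t) for t in X]).reshape(-1, 1)


toy_bodies = ["a", "bb", "ccc", "d" * 30, "e" * 35, "f" * 40]  # toy data
toy_targets = [0, 0, 0, 1, 1, 1]

lengths = TextLength().fit_transform(toy_bodies)
print(lengths.shape)  # (6, 1)

length_model = Pipeline([('len', TextLength()), ('model', LogisticRegression())])
length_model.fit(toy_bodies, toy_targets)
preds = length_model.predict(toy_bodies)
```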

### 4.2. Using `FunctionTransformer` from the preprocessing module

* Let's create a custom transformer that returns the logarithm of one plus the input, `log(1 + x)`, via `np.log1p`.

In [35]:
from sklearn.preprocessing import FunctionTransformer

In [36]:
transformer = FunctionTransformer(np.log1p)

In [37]:
X = np.array([[0,1],[2,3]])

In [38]:
transformer.transform(X)

array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])
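`FunctionTransformer` also accepts an `inverse_func`, which makes the transformation reversible. As a small sketch, pairing `np.log1p` with `np.expm1` recovers the original values; `log_tf` and `X_small` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.expm1 computes exp(x) - 1, the inverse of np.log1p.
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_small = np.array([[0.0, 1.0], [2.0, 3.0]])
X_log = log_tf.fit_transform(X_small)
X_back = log_tf.inverse_transform(X_log)

print(np.allclose(X_back, X_small))  # True
```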

* And let's replicate the transformer that measures the length of the texts

In [40]:
def ll(x):
    # Character length of each text in x.
    return [len(w) for w in x]

largo2 = FunctionTransformer(ll, validate=False)

In [42]:
largo2.transform(body)[:10]

[6215, 3092, 1391, 2594, 11642, 2425, 892, 1534, 869, 2973]

In [48]:
# Exporting the model to a binary (pickle) file

import pickle
def save_model(model, title='model'):
    with open(title + '.pkl', 'wb') as fid:
        pickle.dump(model, fid)
        
    print('Model save')

In [60]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [50]:
save_model(a, title='teste')  # `a` is not defined in the cells shown; it must already hold a model for this call to run

Model save


In [61]:
save_model(pipeline, 'pipeline')

Model save


In [62]:
def openpkl(path=''):
    # Load a pickled model from '<path>.pkl'; the with-block closes the file.
    with open(path + '.pkl', 'rb') as fid:
        return pickle.load(fid)

teste_modelo = openpkl('pipeline')
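A quick way to check that saving and loading preserve a fitted pipeline is a round trip through a temporary file, comparing predictions before and after. The sketch below uses a made-up toy corpus and a hypothetical file name, so that it does not depend on the dataset.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, for illustration only.
rt_texts = ["cake recipe", "pasta recipe", "bread recipe",
            "news today", "market news", "breaking news"]
rt_labels = [1, 1, 1, 0, 0, 0]

rt_model = Pipeline([
    ('vec', CountVectorizer()),
    ('model', LogisticRegression()),
]).fit(rt_texts, rt_labels)

# Write to a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pipeline.pkl')
    with open(path, 'wb') as fid:
        pickle.dump(rt_model, fid)
    with open(path, 'rb') as fid:
        restored = pickle.load(fid)

same = bool((restored.predict(rt_texts) == rt_model.predict(rt_texts)).all())
print(same)  # True
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a trusted source, and preferably with the same scikit-learn version that wrote them.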

In [63]:
teste_modelo.predict(X_train)

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [64]:
X_train

0       IBM Sees Holographic Calls Air Breathing Batte...
1       The Fully Electronic Futuristic Starting Gun T...
2       Fruits that Fight the Flu fruits that fight th...
3                     10 Foolproof Tips for Better Sleep 
4       The 50 Coolest Jerseys You Didn t Know Existed...
5                               Genital Herpes Treatment 
6                       fashion lane American Wild Child 
7       Racing For Recovery by Dean Johnson racing for...
8                      Valet The Handbook 31 Days 31 days
9             Cookies and Cream Brownies How Sweet It Is 
10      Business Financial News Breaking US Internatio...
11      A Tip of the Cap to The Greatest Iron Man of T...
12                         9 Foods That Trash Your Teeth 
13                                                       
14      French Onion Steaks with Red Wine Sauce french...
15      Izabel Goulart Swimsuit by Kikidoll 2012 Sport...
16                    Liquid Mountaineering The Awesomer 
17            

In [65]:
import xgboost

In [66]:
pipeline2 = Pipeline([
        ('vec', vectorizer),
        ('model', xgboost.XGBClassifier())])

In [67]:
pipeline2.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vec', CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        st...
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))])