In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Pipelining estimators

In this section we study how different estimators may be chained.

## A simple example: feature extraction and selection before an estimator

### 2.1 Standardize your data

Preprocessing might be required before learning a model. For instance, a user could be interested in creating hand-crafted features, or an algorithm might make some a priori assumptions about the data.

In our case, the solver used by `LogisticRegression` expects the data to be normalized. Thus, we need to standardize the data before training the model. To see why this matters, we first train the model on the unscaled data and check the number of iterations it requires.

In [4]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)

print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))

Accuracy score of the LogisticRegression is 0.95
LogisticRegression required 1535 iterations to be fitted


The `MinMaxScaler` transformer is used to normalize the data. This scaler should be applied in the following way: learn the statistics on the training set (the `fit` method), then scale both the training and testing sets with those statistics (the `transform` method). Finally, we train and test the model on the scaled datasets.

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

scaler = MinMaxScaler()
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)

X_train_scaled = scaler.fit_transform(X_train)
clf.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
accuracy = clf.score(X_test_scaled, y_test)

print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))

Accuracy score of the LogisticRegression is 0.96
LogisticRegression required 189 iterations to be fitted


With scaled data, the model converged in far fewer iterations than with unscaled data (189 versus 1535).
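
The split between `fit` and `transform` described above can be seen on a toy array. This is a minimal sketch; the values and the names `X_train_toy`, `X_test_toy`, and `scaler_toy` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy one-feature data (hypothetical values).
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.0], [12.0]])

scaler_toy = MinMaxScaler()
scaler_toy.fit(X_train_toy)  # learns the min and max from the training set only

# Statistics learned during fit.
print(scaler_toy.data_min_, scaler_toy.data_max_)

# The training set is mapped onto [0, 1].
print(scaler_toy.transform(X_train_toy).ravel())

# The testing set reuses the training statistics, so a value above the
# training maximum is mapped above 1.
print(scaler_toy.transform(X_test_toy).ravel())
```

Values outside `[0, 1]` on the testing set are expected here: they simply reflect that the scaler was not allowed to see the testing data.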

### 2.2 The wrong preprocessing patterns

We showed how to preprocess the data and train a machine learning model properly. It is also instructive to look at wrong ways of preprocessing data. There are two potential mistakes which are easy to make but also easy to spot.

The first pattern is to scale the data before splitting the full set into training and testing sets, so the scaler's statistics are computed from samples that later end up in the testing set.

In [5]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_train_prescaled, X_test_prescaled, y_train_prescaled, y_test_prescaled = train_test_split(
    X_scaled, y, stratify=y, random_state=42)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train_prescaled)
accuracy = clf.score(X_test_prescaled, y_test_prescaled)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Accuracy score of the LogisticRegression is 0.96


The second pattern is to standardize the training and testing sets independently. This amounts to calling `fit` on both the training and testing sets, so the two sets are scaled with different statistics.

In [6]:
scaler = MinMaxScaler()
X_train_prescaled = scaler.fit_transform(X_train)
X_test_prescaled = scaler.fit_transform(X_test)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train)
accuracy = clf.score(X_test_prescaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Accuracy score of the LogisticRegression is 0.96
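
On the digits data both mistakes leave the accuracy unchanged, so the problem is easier to see on a toy array. The sketch below uses made-up values and hypothetical names (`scaler_ok`, `scaler_leak`, `test_wrong`) to show what each wrong pattern does to the scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy one-feature data (hypothetical values).
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[5.0], [8.0], [12.0]])

# Correct: fit on the training set, reuse the statistics on the testing set.
scaler_ok = MinMaxScaler().fit(X_train_toy)
train_ok = scaler_ok.transform(X_train_toy).ravel()
test_ok = scaler_ok.transform(X_test_toy).ravel()

# Wrong pattern 1: fit on the full data before splitting.
# The learned maximum now comes from a testing sample.
scaler_leak = MinMaxScaler().fit(np.vstack([X_train_toy, X_test_toy]))
print(scaler_ok.data_max_, scaler_leak.data_max_)

# Wrong pattern 2: refit the scaler on the testing set.
# The raw value 5.0 no longer maps to the same scaled value in both sets.
test_wrong = MinMaxScaler().fit_transform(X_test_toy).ravel()
print(train_ok[1], test_ok[0], test_wrong[0])
```

With the correct procedure the raw value 5.0 is scaled identically in both sets, whereas refitting on the testing set sends it to the bottom of the range.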


### 2.3 Keep it simple, stupid: use the pipeline connector from `scikit-learn`

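Both previous patterns mishandle the testing set: the first leaks testing-set statistics into training, and the second scales the two sets inconsistently. Neither changes the reported accuracy on this dataset (0.96 in both cases), but such mistakes are difficult to prevent when the preprocessing is done by hand. For this reason, `scikit-learn` provides the `Pipeline` object, which sequentially connects several transformers and a final classifier (or regressor). We can create a pipeline as: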

In [7]:
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[('scaler', MinMaxScaler()),
                       ('clf', LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42))])

This pipeline contains the parameters of both the scaler and the classifier. Naming each estimator in the pipeline can be tedious, so `make_pipeline` names each estimator automatically, using the lowercase class name.

In [8]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42, max_iter=1000))
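
The automatically generated names can be inspected through the `steps` and `named_steps` attributes. A small sketch, using a separate pipeline named `pipe_named` (a name chosen here for illustration) so that it does not interfere with `pipe`:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe_named = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Each step is a (name, estimator) pair; the name is the lowercase class name.
step_names = [name for name, _ in pipe_named.steps]
print(step_names)

# Individual steps can be retrieved by name.
print(pipe_named.named_steps['minmaxscaler'])
```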

The pipeline exposes the same API as a single estimator: we use `fit` to train the classifier and `score` to check the accuracy. Internally, calling `fit` calls `fit_transform` on every transformer in the pipeline, and calling `score` (or `predict` and `predict_proba`) calls `transform` on every transformer. This corresponds to the normalization procedure in Sect. 2.1.

In [9]:
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))

Accuracy score of the Pipeline is 0.96
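
To check that the pipeline really performs the manual procedure, one can compare the predictions of both approaches. The following is a sketch under the assumption that the same data split and the same estimator settings are used on both sides; the variable names (`X_digits`, `pipe_check`, and so on) are chosen here to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, stratify=y_digits, random_state=42)

# Manual procedure: fit the scaler on the training set, reuse it on the testing set.
scaler_manual = MinMaxScaler()
clf_manual = LogisticRegression(max_iter=1000, random_state=42)
clf_manual.fit(scaler_manual.fit_transform(X_tr), y_tr)
manual_pred = clf_manual.predict(scaler_manual.transform(X_te))

# Pipeline: the same two steps, chained.
pipe_check = make_pipeline(
    MinMaxScaler(), LogisticRegression(max_iter=1000, random_state=42))
pipe_pred = pipe_check.fit(X_tr, y_tr).predict(X_te)

# Both approaches should give the same predictions.
print((manual_pred == pipe_pred).all())
```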


We can check all the parameters of the pipeline using `get_params()`.

In [10]:
pipe.get_params()

{'memory': None,
 'steps': [('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
  ('logisticregression',
   LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
             intercept_scaling=1, max_iter=1000, multi_class='auto',
             n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
             tol=0.0001, verbose=0, warm_start=False))],
 'minmaxscaler': MinMaxScaler(copy=True, feature_range=(0, 1)),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=1000, multi_class='auto',
           n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
           tol=0.0001, verbose=0, warm_start=False),
 'minmaxscaler__copy': True,
 'minmaxscaler__feature_range': (0, 1),
 'logisticregression__C': 1.0,
 'logisticregression__class_weight': None,
 'logisticregression__dual': False,
 'logisticregression__fit_intercept': True,
 'logisticregression__int
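
The `<step name>__<parameter>` keys shown by `get_params()` can also be passed to `set_params()` to change a parameter of a given step. A minimal sketch, with `pipe_params` as an illustrative name and arbitrary parameter values:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe_params = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Address a parameter of a step with the "<step name>__<parameter>" syntax.
pipe_params.set_params(logisticregression__C=0.1,
                       minmaxscaler__feature_range=(-1, 1))

print(pipe_params.get_params()['logisticregression__C'])
print(pipe_params.named_steps['minmaxscaler'].feature_range)
```

The same naming scheme is what a parameter grid uses when a pipeline is tuned with a grid search.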

### Exercise

* Use the breast cancer dataset of the first exercise to train a `SGDClassifier`, which you can import from `sklearn.linear_model`. Make a pipeline with this classifier and a `StandardScaler` transformer imported from `sklearn.preprocessing`. Train and test this pipeline.

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import balanced_accuracy_score

In [40]:
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0, test_size=0.3)

In [41]:
pipe = make_pipeline(StandardScaler(),
                     SGDClassifier(max_iter=1000))

In [44]:
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))

Accuracy score of the Pipeline is 0.95




In [47]:
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
balanced_accuracy_score(y_test, y_pred)



0.9610105140186915

In [46]:
# %load solutions/15_exo_01.py
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import balanced_accuracy_score

X_breast, y_breast = load_breast_cancer(return_X_y=True)

X_breast_train, X_breast_test, y_breast_train, y_breast_test = train_test_split(
    X_breast, y_breast, stratify=y_breast, random_state=0,
    test_size=0.3)

pipe_breast = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))
pipe_breast.fit(X_breast_train, y_breast_train)
y_pred = pipe_breast.predict(X_breast_test)
balanced_accuracy_score(y_breast_test, y_pred)




0.9531980140186915

* Load the titanic dataset as before. Use the `'survived'` column as target and the columns `['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']` as data.
* Create a `preprocessor` using `make_column_transformer`. It combines two different pipelines: one for the categorical data and another for the numerical data.
* The numerical data should be treated by pipelining a `StandardScaler` and a `SimpleImputer` (use mean imputation).
* The categorical data should be treated by pipelining a `SimpleImputer` (with a constant strategy) and a one-hot encoder.
* Finally, pipeline this `preprocessor` with a `RandomForestClassifier` to predict the survival.

In [49]:
# %load solutions/15_exo_02.py
import os
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score

titanic = pd.read_csv(os.path.join('datasets', 'titanic3.csv'))
target = titanic['survived']
titanic = titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

categorical_columns = ['pclass', 'sex', 'embarked']
numerical_columns = ['age', 'sibsp', 'parch', 'fare']

categorical_pipeline = make_pipeline(
    SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore')
)
numerical_pipeline = make_pipeline(
    StandardScaler(), SimpleImputer()
)

preprocessor = make_column_transformer(
    (categorical_pipeline, categorical_columns),
    (numerical_pipeline, numerical_columns)
)

pipe = make_pipeline(preprocessor, RandomForestClassifier(n_estimators=100))
print(pipe.fit(titanic, target).score(titanic, target))

print(cross_val_score(pipe, titanic, target, cv=5))


0.9679144385026738


[0.50763359 0.74045802 0.66030534 0.69465649 0.62835249]


* Create a pipeline out of a `StandardScaler` and a `Ridge` regression and apply it to the Boston housing dataset (load it using `sklearn.datasets.load_boston`; this loader was removed in scikit-learn 1.2, so it requires an older release). Try adding the `sklearn.preprocessing.PolynomialFeatures` transformer as a second preprocessing step, and grid-search the degree of the polynomials (try 1, 2 and 3).

In [None]:
# %load solutions/15A_ridge_grid.py
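
As a possible starting point, and not the content of the solution file above, the grid search over the polynomial degree could be sketched as follows. This sketch assumes `load_diabetes` as a stand-in regression dataset, since `load_boston` is not available in recent scikit-learn releases; the names `pipe_ridge`, `param_grid`, and `grid` are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Stand-in dataset (assumption): any regression dataset would do here.
X_reg, y_reg = load_diabetes(return_X_y=True)

# Scaling first, polynomial expansion second, then the regressor.
pipe_ridge = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())

# The step name generated by make_pipeline is the lowercase class name.
param_grid = {'polynomialfeatures__degree': [1, 2, 3]}

grid = GridSearchCV(pipe_ridge, param_grid=param_grid, cv=5)
grid.fit(X_reg, y_reg)

print(grid.best_params_)
print(grid.best_score_)
```

Because the whole pipeline is passed to `GridSearchCV`, the scaler and the polynomial expansion are refitted on the training part of each fold, which avoids the leakage patterns discussed earlier.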