Pipelines can be used to chain multiple estimators into one. This is useful because there is often a fixed sequence of steps in processing the data, for example feature selection, normalisation and classification. Pipeline serves three purposes here:
    a) Convenience and Encapsulation
    You only have to call fit and predict once on your data to fit a whole sequence of 
    estimators.
    b) Joint Parameter Selection 
    You can grid search over the parameters of all estimators in the pipeline at
    once.
    c) Safety
    Pipelines help avoid leaking statistics from your test data into the trained model in 
    cross-validation, by ensuring that the same samples are used to train the transformers and
    predictors.
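
The three points above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the iris dataset, the step names 'scale' and 'clf', and the choice of StandardScaler and LogisticRegression are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; any (X, y) classification set would do.
X, y = load_iris(return_X_y=True)

# Scaling and classification are wrapped in a single estimator.
iris_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# One call to fit trains every step; one call to predict applies them all.
iris_pipe.fit(X, y)
predictions = iris_pipe.predict(X)

# During cross-validation the scaler is refit on each training fold only,
# so no statistics from the held-out fold reach the model.
scores = cross_val_score(iris_pipe, X, y, cv=5)
print(scores.mean())
```

Passing the whole pipeline to cross_val_score, rather than scaling X once up front, is what keeps the held-out fold out of the scaler's statistics.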

All estimators in a pipeline, except the last one, must be transformers (i.e., must have a
transform method). The last estimator may be of any type (transformer, classifier, etc.).
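
As a rough check of this rule, the sketch below tests which estimators have a transform method and then builds a pipeline whose last step is itself a transformer. The iris data and the step names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Intermediate steps need a transform method; a classifier such as SVC has none.
has_transform = {type(est).__name__: hasattr(est, 'transform')
                 for est in (StandardScaler(), PCA(), SVC())}
print(has_transform)

X, y = load_iris(return_X_y=True)

# A pipeline whose last step is a transformer can transform data itself.
transform_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
])
X_reduced = transform_pipe.fit_transform(X)
print(X_reduced.shape)
```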

The pipeline is built using a list of (key, value) pairs, where the key is a string containing
the name you want to give this step and value is an estimator object.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe 

Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

The utility function make_pipeline is a shorthand for constructing pipelines; it takes a 
variable number of estimators and returns a pipeline, filling in the names automatically.

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer
make_pipeline(Binarizer(), MultinomialNB()) 

Pipeline(memory=None,
     steps=[('binarizer', Binarizer(copy=True, threshold=0.0)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
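
To see the automatically generated names directly, the step names can be read from the steps attribute. The second pipeline below, with StandardScaler used twice, is a made-up case added only to show how repeated classes are kept distinct.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer, StandardScaler

# Each step is named after its class, in lowercase.
nb_pipe = make_pipeline(Binarizer(), MultinomialNB())
step_names = [name for name, _ in nb_pipe.steps]
print(step_names)

# When a class appears more than once, the names are numbered to stay unique.
repeated = make_pipeline(StandardScaler(), StandardScaler())
repeated_names = [name for name, _ in repeated.steps]
print(repeated_names)
```

Because the names are derived from the class names, make_pipeline is convenient for quick experiments, while the explicit Pipeline constructor is the better fit when the step names need to be chosen by hand.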

The estimators of a pipeline are stored as a list in the steps attribute:

In [6]:
pipe.steps[0]

('reduce_dim',
 PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False))

and as a dict in named_steps:

In [8]:
pipe.named_steps['reduce_dim'] 

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

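Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:

In [10]: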
pipe.set_params(clf__C=10) 

Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
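
The effect of set_params can be checked by reading the values back. The sketch below rebuilds the same pipe so that it runs on its own; setting reduce_dim__n_components is an extra illustration that is not in the cell above.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# get_params lists every nested parameter under its <step>__<parameter> key.
nested_keys = sorted(k for k in pipe.get_params() if k.startswith('clf__'))
print(nested_keys[:3])

# set_params updates the estimator object held by the step in place.
pipe.set_params(clf__C=10, reduce_dim__n_components=2)
print(pipe.named_steps['clf'].C)
print(pipe.get_params()['reduce_dim__n_components'])
```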

Attributes of named_steps map to keys, enabling tab completion in interactive environments.

In [11]:
pipe.named_steps.reduce_dim is pipe.named_steps['reduce_dim']

True

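The <estimator>__<parameter> syntax used with set_params above is particularly important for grid searches, where it lets parameters of any step be varied: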

In [12]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim__n_components=[2, 5, 10], clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
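
The cell above only constructs the search. A minimal sketch of actually running it follows; the digits dataset, the 500-sample subset, and cv=3 are assumptions chosen so the example has enough features for ten components and finishes quickly.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# A small subset of the digits data (64 features) keeps the search fast.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
param_grid = dict(reduce_dim__n_components=[2, 5, 10], clf__C=[0.1, 10, 100])

# Every combination is cross-validated, with PCA refit on each training fold.
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

After fitting, best_params_ holds the winning combination under the same <estimator>__<parameter> keys that were used in param_grid.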

Individual steps may also be replaced as parameters, and non-final steps may be ignored
by setting them to None:

In [13]:
from sklearn.linear_model import LogisticRegression
param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

In [14]:
grid_search

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'reduce_dim': [None, PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False), PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)], 'clf': [SVC(C=1.0, cache_size..., solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)], 'clf__C': [0.1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
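
A search over whole steps can be run in the same way. The sketch below is a reduced version of the grid above, under assumed settings: a 300-sample subset of the digits data, cv=3, a smaller grid, and max_iter=2000 for LogisticRegression to help it converge. Newer scikit-learn releases also accept the string 'passthrough' in place of None for a skipped step.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# None skips dimensionality reduction; each clf candidate replaces the step.
param_grid = dict(
    reduce_dim=[None, PCA(5)],
    clf=[SVC(), LogisticRegression(max_iter=2000)],
    clf__C=[0.1, 10],
)
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

# The winning configuration records which estimator filled each step.
best = grid_search.best_params_
print(type(best['clf']).__name__, best['reduce_dim'], best['clf__C'])
```

Searching over clf and clf__C together works here because both candidate classifiers have a C parameter; a parameter that only one of them had would need a list of separate grids instead.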
      

# New Example