# 4.1.1. Pipeline: chaining estimators

- **Pipeline** can be used to chain multiple estimators into one.
- This is useful because data processing often follows a fixed sequence of steps, for example feature selection, normalization and classification.

Pipeline serves two purposes here:

- **Convenience**: You only have to call `fit` and `predict` once on your data to fit a whole sequence of estimators.
- **Joint parameter selection**: You can grid search over parameters of all estimators in the pipeline at once.
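
As a minimal sketch of the convenience point, the snippet below fits a PCA + SVC pipeline and checks that it predicts the same labels as running the two steps by hand. The iris dataset and `n_components=2` are illustrative assumptions, not choices made in the text above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# One fit call and one predict call on the pipeline...
pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('svm', SVC())])
pipe_pred = pipe.fit(X, y).predict(X)

# ...do the same work as chaining the steps manually.
pca = PCA(n_components=2)
svm = SVC()
X_reduced = pca.fit_transform(X)   # intermediate step: fit, then transform
svm.fit(X_reduced, y)              # final step: fit on the transformed data
manual_pred = svm.predict(pca.transform(X))

print(np.array_equal(pipe_pred, manual_pred))
```

The manual version needs four calls and two intermediate variables; the pipeline hides that bookkeeping behind a single estimator.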

Pipeline structure:

- All estimators in a pipeline, except the last one, must be transformers (i.e. must have a `transform` method).
- The last estimator may be any type (transformer, classifier, etc.).
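
The structure rule can be illustrated with a short sketch: the type of the last step decides what the pipeline as a whole can do. The step names, the use of `StandardScaler`, and the iris dataset are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Last step is a transformer: the pipeline itself behaves like a transformer.
transformer_pipe = Pipeline([('scale', StandardScaler()),
                             ('reduce_dim', PCA(n_components=2))])
X_new = transformer_pipe.fit_transform(X)
print(X_new.shape)

# Last step is a classifier: the pipeline itself behaves like a classifier.
classifier_pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC())])
labels = classifier_pipe.fit(X, y).predict(X)
print(labels.shape)
```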

## Relevant classes and functions
- `sklearn.pipeline.Pipeline` - you choose the **key** (name) of each step yourself
- `sklearn.pipeline.make_pipeline` - automatically fills in a **key** for each step

## 4.1.1.1. Usage - list of (key, value) pairs
- **key** = any string you want to name the step as
- **value** = estimator object



In [2]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from pprint import pprint
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
clf = Pipeline(estimators)
pprint(clf)

Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])


With `make_pipeline`, the names (keys) are filled in automatically from the lowercased class name of each estimator:

In [4]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer
pipe = make_pipeline(Binarizer(), MultinomialNB()) 
pipe

Pipeline(steps=[('binarizer', Binarizer(copy=True, threshold=0.0)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
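
A small sketch of how the automatic names can be inspected. The second pipeline, which repeats `StandardScaler`, is a hypothetical example added here to show how repeated classes are kept apart.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer, StandardScaler

pipe = make_pipeline(Binarizer(), MultinomialNB())
names = [name for name, _ in pipe.steps]
print(names)

# Two steps of the same class get a numeric suffix so the names stay unique.
dup = make_pipeline(StandardScaler(), StandardScaler())
dup_names = [name for name, _ in dup.steps]
print(dup_names)
```

These generated names are also the prefixes to use with the `<name>__<parameter>` syntax, so explicit names via `Pipeline` may be easier to read when parameters are tuned later.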

In [6]:
dir(clf) 

['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_final_estimator',
 '_get_param_names',
 '_pairwise',
 '_pre_transform',
 'classes_',
 'decision_function',
 'fit',
 'fit_transform',
 'get_params',
 'inverse_transform',
 'named_steps',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'score',
 'set_params',
 'steps',
 'transform']

In [8]:
clf.steps

[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)),
 ('svm',
  SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False))]

In [17]:
# the estimators of a pipeline are stored as a list in the "steps" attribute
print(clf.steps[0])
print(clf.named_steps['reduce_dim'])  # dict-style access by step name

('reduce_dim', PCA(copy=True, n_components=None, whiten=False))
PCA(copy=True, n_components=None, whiten=False)


In [23]:
print(clf.steps[1])     # the (name, estimator) tuple
print(clf.steps[1][0])  # the step name
print(clf.steps[1][1])  # the estimator object

('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))
svm
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)


In [24]:
# Parameters of the estimators in the pipeline can be set using the <step name>__<parameter> syntax:
clf.set_params(svm__C=10)

Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])
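
As a hedged sketch, the same double-underscore names can also be read back through `get_params()`, and the change is visible on the step object itself. Setting `reduce_dim__n_components=2` is an extra illustrative value, not one used in the cell above.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])

# Several nested parameters can be set in a single call.
clf.set_params(svm__C=10, reduce_dim__n_components=2)

# The values are visible through get_params() and on the steps themselves.
print(clf.get_params()['svm__C'])
print(clf.named_steps['svm'].C)
print(clf.named_steps['reduce_dim'].n_components)
```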

In [29]:
# This is particularly important for doing grid searches:
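# GridSearchCV lives in sklearn.model_selection since scikit-learn 0.18
# (the old sklearn.grid_search module was removed in 0.20)
from sklearn.model_selection import GridSearchCV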
params = dict(reduce_dim__n_components=[2, 5, 10],
              svm__C=[0.1, 10, 100])
grid_search = GridSearchCV(clf, param_grid=params)
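
The grid search object above is only constructed, not fitted. The sketch below shows one way it might be run end to end. It assumes scikit-learn 0.18 or later (for `sklearn.model_selection`), and the digits subset and `cv=3` are arbitrary choices to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small subset of the digits data (64 features) so the search finishes quickly.
digits = load_digits()
X, y = digits.data[:300], digits.target[:300]

clf = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])
params = dict(reduce_dim__n_components=[2, 5, 10],
              svm__C=[0.1, 10, 100])

# Every combination of the two parameter lists is cross-validated.
grid_search = GridSearchCV(clf, param_grid=params, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

Because the PCA step is refit inside each cross-validation fold, the dimensionality reduction never sees the held-out data, which is one practical reason to search over the pipeline rather than over the classifier alone.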