<font color="#CC3D3D"><p>
# ML Workflow Optimization

## Pipeline: chaining estimators   
- Pipeline can be used to chain multiple estimators into one.
- Pipeline serves two purposes:
  - Convenience and encapsulation
  - Joint parameter selection
- All estimators in a pipeline, except the last one, must be transformers. 
  - The last estimator may be any type (transformer, classifier, etc.)
- Training and prediction procedure of the pipeline
<br>
<img align="left" src="http://drive.google.com/uc?export=view&id=1pIde-P6d7EnjL3xYo8eE3cWAUEvzV7tS" >
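The training and prediction procedure can also be written out by hand. The following is a minimal sketch, assuming a two-step pipeline of `MinMaxScaler` and `SVC` on the breast cancer dataset (the variable names are chosen for illustration only). During `fit`, each transformer is fitted and applied in turn, and the final estimator is fitted on the transformed data. During `predict`, the data only passes through `transform` before reaching the final estimator.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version of the steps a pipeline performs
scaler = MinMaxScaler()
svm = SVC()
X_train_scaled = scaler.fit_transform(X_train)        # fit: fit and apply the transformer
svm.fit(X_train_scaled, y_train)                      # fit: fit the final estimator
manual_pred = svm.predict(scaler.transform(X_test))   # predict: transform only, then predict

# The same steps wrapped in a Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe_pred = pipe.fit(X_train, y_train).predict(X_test)

# Both approaches should produce the same predictions
print(np.array_equal(manual_pred, pipe_pred))
```

Note that the scaler is never refitted on the test data; it reuses the minimum and maximum learned from the training set.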

### Building Pipelines

In [1]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [2]:
# load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

In [3]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

<font color = "blue"><p>
The **Pipeline** is built from a list of **(key, value)** pairs, where the **key** is a string naming the step and the **value** is an estimator object.
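Because every step has a name, it can be looked up again later. A small sketch of the common access patterns, using the same two-step pipeline (the variable names are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# The (name, estimator) pairs are stored in the `steps` attribute
step_names = [name for name, _ in pipe.steps]
print(step_names)

# A step can be retrieved by name through `named_steps` ...
svm_step = pipe.named_steps["svm"]
# ... or by indexing the pipeline with the name or the position
scaler_step = pipe["scaler"]
first_step = pipe[0]
print(type(svm_step).__name__, type(scaler_step).__name__)
```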

In [4]:
pipe.fit(X_train, y_train).score(X_test, y_test)

0.972027972027972

<font color = "blue"><p>
A single call to **fit** trains the whole sequence of estimators, and a single call to **predict** (or **score**) applies all of them to new data.

### Using Pipelines in Grid-searches

In [5]:
from sklearn.model_selection import GridSearchCV

In [6]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

<font color = "blue"><p>
Parameters of the estimators in the pipeline should be specified using the **estimator__parameter** syntax: the step name, a double underscore, then the parameter name (for example, `svm__C`).
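The same naming scheme works outside of a grid search. As a brief sketch, `set_params` and `get_params` accept and return these double-underscore keys (the chosen values are arbitrary):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# Set parameters of the "svm" step through the pipeline
pipe.set_params(svm__C=10, svm__gamma=0.1)

# get_params() lists every tunable parameter under its step-prefixed name
params = pipe.get_params()
svm_keys = sorted(key for key in params if key.startswith("svm__"))
print(svm_keys)
```

Any key returned by `get_params()` can be used in a parameter grid.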

In [7]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(
    grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.98
Test set score: 0.97
Best parameters: {'svm__C': 1, 'svm__gamma': 1}
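With the default `refit=True`, `best_estimator_` holds the whole pipeline refitted on the full training set with the best parameters, so the fitted steps can be inspected individually. The sketch below uses a smaller grid than the one above purely to keep it quick; the grid values are an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
# Reduced grid, chosen only for illustration
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is a fitted Pipeline; its steps are accessible by name
best_pipe = grid.best_estimator_
best_svm = best_pipe.named_steps["svm"]
best_scaler = best_pipe.named_steps["scaler"]
print(best_svm.C, best_svm.gamma)
print(best_scaler.data_min_.shape)  # one learned minimum per input feature
```

Because the scaler is part of the pipeline, it is refitted on the training portion of every cross-validation split, so the validation folds do not influence the scaling.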


### Convenient Pipeline creation with *make_pipeline* 

In [8]:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), 
                      ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

In [9]:
print("Pipeline steps:\n{}".format(pipe_short.steps))

Pipeline steps:
[('minmaxscaler', MinMaxScaler()), ('svc', SVC(C=100))]


<font color = "blue"><p>
**make_pipeline** does not require, and does not permit, naming the estimators. Instead, their names are set automatically to the **lowercase of their types**.

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), 
                     StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))

Pipeline steps:
[('standardscaler-1', StandardScaler()), ('pca', PCA(n_components=2)), ('standardscaler-2', StandardScaler())]
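Generated names such as `standardscaler-1` contain a hyphen, so they cannot be written as Python keyword arguments. A short sketch of one way to address such a step, using dictionary unpacking (the parameter changed here is arbitrary):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())

# "standardscaler-1" is not a valid keyword argument name,
# so the parameter is passed through dictionary unpacking
pipe.set_params(**{"standardscaler-1__with_mean": False})

print(pipe.named_steps["standardscaler-1"].with_mean)
print(pipe.named_steps["standardscaler-2"].with_mean)
```

In a parameter grid this is not an issue, since the keys of a dictionary literal are ordinary strings.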


### Combining Features with *FeatureUnion* 
<img align='left' src='https://image.slidesharecdn.com/featureengineeringpipelines1-161106200348/95/feature-engineering-pipelines-11-638.jpg?cb=1478462927' width=600 height=400>

In [11]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [12]:
# create feature union
features = []

features.append(('pca', PCA(n_components=3)))
features.append(('univ_select', SelectKBest(k=10)))
feature_union = FeatureUnion(features)


# create pipeline
estimators = []
estimators.append(('features', feature_union))
estimators.append(('scaler', MinMaxScaler()))
estimators.append(("svm", SVC()))
pipe = Pipeline(estimators)

In [13]:
pipe.fit(X_train, y_train).score(X_test, y_test)

0.958041958041958
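`FeatureUnion` applies each transformer to the same input and concatenates the results column-wise. The sketch below checks this against a manual `np.hstack`; `svd_solver="full"` is an addition made here only so that the two PCA fits are deterministic and comparable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_breast_cancer(return_X_y=True)

# svd_solver="full" is set only to make the PCA output deterministic
union = FeatureUnion([
    ("pca", PCA(n_components=3, svd_solver="full")),
    ("univ_select", SelectKBest(k=10)),
])
X_union = union.fit_transform(X, y)

# Manual equivalent: transform separately, then stack the columns
X_pca = PCA(n_components=3, svd_solver="full").fit_transform(X)
X_best = SelectKBest(k=10).fit_transform(X, y)
X_manual = np.hstack([X_pca, X_best])

print(X.shape, X_union.shape)
print(np.allclose(X_union, X_manual))
```

The combined matrix has 3 + 10 = 13 columns: three principal components followed by the ten selected original features.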

In [14]:
# Do grid search
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[9, 10, 11],
                  svm__C=[0.1, 1, 10],
                  svm__gamma=[0.1, 1, 10])
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
print(grid_search.fit(X_train, y_train).score(X_test, y_test))
print(grid_search.best_estimator_)

0.951048951048951
Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('pca', PCA(n_components=1)),
                                                ('univ_select',
                                                 SelectKBest())])),
                ('scaler', MinMaxScaler()), ('svm', SVC(C=10, gamma=10))])


### Wrap-up example:

<img align='left' src='http://zacstewart.com/images/pipelines-of-featureunions-of-pipelines/featureunion-pipelines.svg' width=500 height=400>

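```
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# EssayExractor, LengthTransformer and MispellingCountTransformer are custom
# transformers that are not defined in this notebook.
pipeline = Pipeline([
  ('extract_essays', EssayExractor()),
  ('features', FeatureUnion([
    ('ngram_tf_idf', Pipeline([
      ('counts', CountVectorizer()),
      ('tf_idf', TfidfTransformer())
    ])),
    ('essay_length', LengthTransformer()),
    ('misspellings', MispellingCountTransformer())
  ])),
  ('classifier', MultinomialNB())
])
```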
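The custom transformers in this example are not shown. As a rough sketch of what one of them might look like, the `LengthTransformer` below is a hypothetical implementation that maps each document to its character count; the toy documents and labels are invented for illustration, and the essay-extraction and misspelling steps are left out.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class LengthTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: maps each document to its character count."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self follows the scikit-learn convention
        return self

    def transform(self, X):
        # One row per document, one column holding the text length
        return np.array([[len(doc)] for doc in X], dtype=float)


# Toy data, invented for illustration
docs = ["short text",
        "a much longer piece of text about pipelines",
        "tiny",
        "another fairly long document with many words"]
labels = [0, 1, 0, 1]

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("ngram_tf_idf", Pipeline([
            ("counts", CountVectorizer()),
            ("tf_idf", TfidfTransformer()),
        ])),
        ("essay_length", LengthTransformer()),
    ])),
    ("classifier", MultinomialNB()),
])
pipeline.fit(docs, labels)
predictions = pipeline.predict(["tiny text", "a long document about many pipelines"])

# The union yields the tf-idf columns plus one extra column for the length
features = pipeline.named_steps["features"].transform(docs)
print(features.shape)
```

Inheriting from `TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params` and `set_params`, which is what lets the transformer take part in pipelines and grid searches.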

### References:
- Using ColumnTransformer to combine data processing steps  
https://towardsdatascience.com/using-columntransformer-to-combine-data-processing-steps-af383f7d5260
- ML Data Pipelines with Custom Transformers in Python  
https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65

<font color="#CC3D3D"><p>
# End