# FLIP(01): Advanced Data Science
**(Module 02: Machine Learning)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 00 - MLPipeline

# Pipeline
## 1. Pipeline and FeatureUnion: combining estimators

### Pipeline: chaining estimators
<font color = 'blue'><b>Pipeline</b></font> can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. 

<br/><font color = 'blue'><b>Pipeline</b></font> serves three purposes here:

<br/><font size = 2.5><b>Convenience and encapsulation</b></font></br>
    <br/><font size = 2>You only have to call fit and predict once on your data to fit a whole sequence of estimators.</br></font>
<br/><font size = 2.5><b>Joint parameter selection</b></font></br>
    <br/><font size = 2> You can grid search over parameters of all estimators in the pipeline at once.</br></font>    
<br/><font size = 2.5><b>Safety</b></font></br>
    <br/><font size = 2> Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.</br></font>       

<font size = 2>All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).</font>
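
The safety point can be illustrated with a minimal sketch. It assumes the iris dataset, a StandardScaler and a LogisticRegression, none of which appear in the text above; they are chosen only for illustration. Because the scaler is a step inside the pipeline, cross_val_score refits it on the training portion of each fold, so statistics from the held-out fold should not reach the model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data only; any (X, y) pair would do.
X, y = load_iris(return_X_y=True)

# The scaler is a pipeline step, so it is refit on the training part of every fold.
leak_free = Pipeline([('scale', StandardScaler()),
                      ('clf', LogisticRegression(max_iter=1000))])

scores = cross_val_score(leak_free, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset before calling cross_val_score would instead compute the mean and variance from all samples, including those later used for testing.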

<b>Usage</b>

The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe 

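The utility function <font color = 'blue'>make_pipeline</font> is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the step names automatically from the estimator class names:

In [None]: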
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer
make_pipeline(Binarizer(), MultinomialNB()) 

The estimators of a pipeline are stored as a list in the steps attribute:

In [None]:
pipe.steps[0]

and as a dict in named_steps:

In [None]:
pipe.named_steps['reduce_dim']

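Parameters of the estimators in the pipeline can be accessed using the &lt;estimator&gt;__&lt;parameter&gt; syntax, that is, the step name, a double underscore, and the parameter name: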

In [None]:
pipe.set_params(clf__C=10) 
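
As a small sketch of how the double-underscore names behave, the snippet below sets two parameters and reads them back. The variable name demo_pipe is hypothetical and used only here; the steps mirror the ones defined earlier.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# demo_pipe is an illustrative name, not part of the original notebook.
demo_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Several nested parameters can be set in one call.
demo_pipe.set_params(clf__C=10, reduce_dim__n_components=2)

# get_params() exposes the same nested names as dictionary keys.
params = demo_pipe.get_params()
print(params['clf__C'])
print(params['reduce_dim__n_components'])

# The change is applied to the step object itself.
print(demo_pipe.named_steps['clf'].C)
```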

Attributes of named_steps map to keys, enabling tab completion in interactive environments:

In [None]:
pipe.named_steps.reduce_dim is pipe.named_steps['reduce_dim']

The double-underscore parameter syntax is particularly important for grid searches, because it lets one parameter grid address every step of the pipeline:

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
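
The cell above only constructs the search. To see it run end to end, here is a sketch that assumes the digits dataset and 3-fold cross-validation, neither of which is specified in the text above; the names search_pipe and search are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumed data: the digits set, chosen because it has enough features for PCA(10).
X, y = load_digits(return_X_y=True)

search_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                  clf__C=[0.1, 10, 100])

# cv=3 is an assumption made to keep the run short.
search = GridSearchCV(search_pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

# Best combination found across both steps, and its mean cross-validated score.
print(search.best_params_)
print(round(search.best_score_, 3))
```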

Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to None (recent scikit-learn releases also accept the string 'passthrough' for this):

In [None]:
from sklearn.linear_model import LogisticRegression
param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

<b>Notes</b>

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has: if the last estimator is a classifier, the Pipeline can be used as a classifier, and if the last estimator is a transformer, so is the pipeline.
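
The equivalence described in this note can be checked with a short sketch. It assumes the iris dataset and PCA with two components, chosen only for illustration; the names chained, manual_pca and manual_clf are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Route 1: let the pipeline fit and chain both steps.
chained = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])
pipe_pred = chained.fit(X, y).predict(X)

# Route 2: do the same by hand, fitting and transforming before the classifier.
manual_pca = PCA(n_components=2)
X_reduced = manual_pca.fit_transform(X)
manual_clf = SVC().fit(X_reduced, y)
manual_pred = manual_clf.predict(manual_pca.transform(X))

# Both routes should give the same predictions.
print(np.array_equal(pipe_pred, manual_pred))
```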

<b>Caching transformers: avoid repeated computation</b>

Fitting transformers may be computationally expensive. With its memory parameter set, <font color = 'blue'>Pipeline</font> will cache each transformer after calling fit. This avoids refitting a transformer within a pipeline when its parameters and input data are unchanged. A typical example is a grid search in which the transformers can be fitted only once and reused for each configuration.

The parameter memory is needed in order to cache the transformers. memory can be either a string containing the directory where to cache the transformers or a <font color = 'blue'>joblib.Memory</font> object:

In [None]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
pipe 

# Clear the cache directory when you don't need it anymore
rmtree(cachedir)
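
The cell above passes a directory string. As a sketch of the other option, the snippet below passes a joblib.Memory object and runs a small grid search in which only the classifier parameter varies. The digits dataset, the fixed n_components=10 and cv=3 are assumptions made for illustration.

```python
from shutil import rmtree
from tempfile import mkdtemp

from joblib import Memory
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=0)
try:
    cached = Pipeline([('reduce_dim', PCA(n_components=10)), ('clf', SVC())],
                      memory=memory)
    # Only clf__C varies, so the PCA fitted on a given fold can be
    # served from the cache for the other values of C.
    search = GridSearchCV(cached, param_grid={'clf__C': [0.1, 10, 100]}, cv=3)
    search.fit(X, y)
    best_C = search.best_params_['clf__C']
finally:
    # Clear the cache directory even if fitting fails.
    rmtree(cachedir)

print(best_C)
```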

<font color = 'red'>Warning: Side effect of caching transformers</font>
<br/><font color = 'red'>When a </font>Pipeline <font color = 'red'>is used without caching enabled, the original estimator instance can be inspected directly, for example:</font></br>

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
pca1 = PCA()
svm1 = SVC()
pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
pipe.fit(digits.data, digits.target)

# The pca instance can be inspected directly
print(pca1.components_) 

<font color = 'red'>Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. In the following example, accessing fitted attributes of the </font>PCA <font color = 'red'>instance </font>pca2 <font color = 'red'>will raise an </font>AttributeError <font color = 'red'>since</font> pca2 <font color = 'red'>remains an unfitted transformer. Instead, use the attribute </font>named_steps <font color = 'red'>to inspect estimators within the pipeline:</font>

In [None]:
cachedir = mkdtemp()
pca2 = PCA()
svm2 = SVC()
cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
                       memory=cachedir)
cached_pipe.fit(digits.data, digits.target)

print(cached_pipe.named_steps['reduce_dim'].components_)

# Remove the cache directory
rmtree(cachedir)
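
The side effect can also be confirmed without triggering the AttributeError. This sketch assumes the digits dataset and uses hasattr to test whether each PCA object has been fitted; the names original_pca and fitted_pca are hypothetical.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

cachedir = mkdtemp()
original_pca = PCA(n_components=5)
try:
    cached_pipe = Pipeline([('reduce_dim', original_pca), ('clf', SVC())],
                           memory=cachedir)
    cached_pipe.fit(X, y)

    # The pipeline holds a fitted clone; the object passed in stays unfitted.
    fitted_pca = cached_pipe.named_steps['reduce_dim']
    original_is_fitted = hasattr(original_pca, 'components_')
    step_is_fitted = hasattr(fitted_pca, 'components_')
finally:
    rmtree(cachedir)

print(original_is_fitted, step_is_fitted)
```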

## 2. FeatureUnion: composite feature spaces

<br/><font color = 'blue'><b>FeatureUnion</b></font> combines several transformer objects into a new transformer that combines their output. A <font color = 'blue'><b>FeatureUnion</b></font> takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.</br>

<br/><font color = 'blue'><b>FeatureUnion</b></font> serves the same purposes as <font color = 'blue'><b>Pipeline</b></font> - convenience and joint parameter estimation and validation.

<br/><font color = 'blue'><b>FeatureUnion</b></font> and <font color = 'blue'><b>Pipeline</b></font> can be combined to create complex models.

(A <font color = 'blue'>FeatureUnion</font> has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
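
As a sketch of how the two can be combined, the snippet below uses a FeatureUnion as the first step of a Pipeline. The iris dataset, SelectKBest and the linear SVC are assumptions chosen for illustration, as are the step names.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two feature sources: 2 principal components plus the single best original feature.
features = FeatureUnion([('pca', PCA(n_components=2)),
                         ('univ_select', SelectKBest(k=1))])

# The union acts as one transformer step in front of the classifier.
model = Pipeline([('features', features), ('svm', SVC(kernel='linear'))])
model.fit(X, y)

stacked = model.named_steps['features'].transform(X)
print(stacked.shape)
print(model.score(X, y))
```

Nested parameters follow the same double-underscore convention, for example features__pca__n_components in a parameter grid.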

<b>Usage</b>

A <font color = 'blue'>FeatureUnion</font> is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
combined 
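
To see the end-to-end concatenation, the sketch below fits a union on data and compares shapes. It assumes the iris dataset, fixed n_components values and an RBF kernel, none of which are specified above; with 2 and 3 components the combined output should have 2 + 3 = 5 columns.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

# Component counts are fixed so the output width is predictable.
union = FeatureUnion([('linear_pca', PCA(n_components=2)),
                      ('kernel_pca', KernelPCA(n_components=3, kernel='rbf'))])

# Each transformer is fit independently; outputs are stacked column-wise.
X_combined = union.fit_transform(X)
print(X.shape, X_combined.shape)
```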

Like pipelines, feature unions have a shorthand constructor called <font color = 'blue'>make_union</font> that does not require explicit naming of the components.
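
A minimal sketch of the shorthand follows; the variable names auto_named and names are hypothetical. The step names are generated from the lowercased class names.

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

# No (name, transformer) pairs are needed; names are derived automatically.
auto_named = make_union(PCA(), KernelPCA())

names = [name for name, _ in auto_named.transformer_list]
print(names)
```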

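As with <font color = 'blue'>Pipeline</font>, individual steps may be replaced using set_params, and ignored by setting them to 'drop' (older scikit-learn releases used None for this, which was later deprecated for FeatureUnion):

In [None]:
combined.set_params(kernel_pca='drop')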
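
The effect of ignoring a transformer can be checked from the output shape. This is a sketch assuming the iris dataset and fixed n_components values. It uses the string 'drop', which is what recent scikit-learn releases expect in a FeatureUnion; on older releases None plays the same role.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

union = FeatureUnion([('linear_pca', PCA(n_components=2)),
                      ('kernel_pca', KernelPCA(n_components=3))])
before = union.fit_transform(X).shape

# Disable the kernel PCA branch; only the linear PCA columns remain.
union.set_params(kernel_pca='drop')
after = union.fit_transform(X).shape

print(before, after)
```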