<center><img src=img/MScAI_brand.png width=70%></center>

# Scikit-Learn: Pipelines

### Pipelines

Often, we'll have several preprocessing steps and then our main model. For example, we might have a dataset with missing values that would also benefit from including the square of each feature -- ideas we discussed in the previous notebook/video.


In [2]:
import numpy as np
X = np.array([[ np.nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   np.nan, 6  ],
              [ 8,   8,   1  ]])
y = np.array([14, 16, -1,  8, -5])

### Pipelines
* A *pipeline* is a sequence of Scikit-Learn estimators
* Each step before the last transforms `X`
* All but the last must have `transform()`: each step's output becomes the input to the next step's `fit()` and `transform()`.

<center><img src=img/sklearn_pipeline.svg width=80%></center>

### Motivation

* **Convenience and encapsulation**: Call `fit` and `predict` just once
* **Joint parameter selection**: grid search over parameters of pipeline components together
* **Safety**: avoid errors leaking test data into training.
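
As a rough sketch of the joint parameter selection and safety points, the cell below grid-searches over the imputation strategy and the polynomial degree together. The data (`X_demo`, `y_demo`) is synthetic and made up purely for illustration. Because the whole pipeline is refitted inside each cross-validation fold, the imputer only ever sees the training part of that fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo[:, 0] ** 2 + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=60)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # knock out roughly 5% of entries

pipe = make_pipeline(SimpleImputer(), PolynomialFeatures(), LinearRegression())

# make_pipeline names each step after its lowercased class name;
# "<step>__<param>" addresses a parameter of that step.
param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "polynomialfeatures__degree": [1, 2, 3],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```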



In [1]:
from sklearn.pipeline import make_pipeline

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())

In [4]:
model.fit(X, y)

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('polynomialfeatures',
                 PolynomialFeatures(degree=2, include_bias=True,
                                    interaction_only=False, order='C')),
                ('linearregression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

This is equivalent to writing:

```python
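# Fit each step in turn, feeding its output to the next.
# New names are used so the original X and model are not overwritten.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
X_poly = PolynomialFeatures(degree=2).fit_transform(X_imputed)
lr = LinearRegression()
lr.fit(X_poly, y)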
```
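
The manual version also has to keep the fitted transformers around, because any query point must pass through the same imputation and feature expansion before prediction. The sketch below (variable names such as `pipe`, `imputer`, `poly` and `lr` are chosen here for illustration) checks that the two routes agree on a query point.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]])
y = np.array([14, 16, -1, 8, -5])

# Route 1: the pipeline handles every step for us
pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     PolynomialFeatures(degree=2),
                     LinearRegression())
pipe.fit(X, y)

# Route 2: fit and keep each step by hand
imputer = SimpleImputer(strategy="mean").fit(X)
poly = PolynomialFeatures(degree=2).fit(imputer.transform(X))
lr = LinearRegression().fit(poly.transform(imputer.transform(X)), y)

# A query point in the original format, including a missing value
X_query = np.array([[np.nan, 0.5, 3.5]])
manual = lr.predict(poly.transform(imputer.transform(X_query)))
print(np.allclose(pipe.predict(X_query), manual))
```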

### `ColumnTransformer`

Often we'll want to transform some columns with e.g. one-hot encoding, and others with e.g. missing value imputation, and leave other columns alone. 

See [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) for this.
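
A minimal sketch of the idea, using a small made-up array `X_mixed` (not data from this notebook): column 0 holds category codes, column 1 has a missing value, and column 2 is left alone. Passing `sparse_threshold=0` is a choice made here so that the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up data: [category code, numeric with a gap, numeric to keep as-is]
X_mixed = np.array([[0, 1.0,    10],
                    [1, np.nan, 20],
                    [2, 3.0,    30],
                    [1, 5.0,    40]])

ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(), [0]),                # one-hot encode column 0
        ("impute", SimpleImputer(strategy="mean"), [1]), # fill gaps in column 1
    ],
    remainder="passthrough",  # keep column 2 unchanged
    sparse_threshold=0,       # always return a dense array
)
X_out = ct.fit_transform(X_mixed)

# Output columns: three one-hot columns, the imputed column, the passthrough column
print(X_out.shape)
print(X_out[1])
```

A `ColumnTransformer` is itself a transformer, so it can be used as a step inside `make_pipeline` like any other.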

### Reference

https://scikit-learn.org/stable/modules/compose.html


### Exercise 1

Confirm that our trained `model` can handle `np.nan` transparently and makes sensible predictions, by passing in a query point in the *original* format.

### Exercise 2

We mentioned that "convenience and encapsulation" are part of the motivation for using pipelines. One situation where this is especially true is when we are *saving* models to disk: with a pipeline, the preprocessing steps and the model are saved together as a single object. Scikit-Learn doesn't provide its own method for this, but instead relies on the `pickle` module built into the standard library. For any object `c`, we can write:

```python
import pickle
# "wb" = write binary; the with-block closes the file for us
with open("data/my_object.pkl", "wb") as f:
    pickle.dump(c, f)
# later...
with open("data/my_object.pkl", "rb") as f:
    c2 = pickle.load(f)
```

1. We have to open the file ourselves -- we pass a `file` object, not a filename, to `dump` and to `load`.
2. We have to read/write using *binary mode* (`"wb"` and `"rb"`) because the pickle format is binary, not plain-text.

So, the exercise is to save our trained Scikit-Learn pipeline model to disk, and then read it in again. Confirm that the model we read in from disk gives the same results as the model we wrote to disk.

See https://scikit-learn.org/stable/tutorial/basic/tutorial.html#model-persistence.

### Exercise 3 

Check how large the saved model is on disk, e.g. using `ls` on Unix or `dir` on Windows. For the curious: how does the size compare to a bare LR model?


### Solution 1

In [11]:
model.predict([[ np.nan, 0.5, 3.5]])

array([13.91644421])

### Solution 2

In [10]:
import pickle
model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())
model.fit(X, y)
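# Use with-blocks so the files are closed (and flushed) before we move on
with open("data/LR_pipeline.pkl", "wb") as f:
    pickle.dump(model, f)
with open("data/LR_pipeline.pkl", "rb") as f:
    model2 = pickle.load(f)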
model2.predict([[ np.nan, 0.5, 3.5]])

array([13.91644421])
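
Rather than comparing one prediction by eye, we can check the round trip programmatically over several points. The sketch below is one possible way to do it: it writes to a temporary directory (an assumption made here so that it runs without an existing `data/` folder) and compares the predictions of the original and reloaded pipelines on all training rows.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]])
y = np.array([14, 16, -1, 8, -5])

model = make_pipeline(SimpleImputer(strategy="mean"),
                      PolynomialFeatures(degree=2),
                      LinearRegression())
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "LR_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        model2 = pickle.load(f)

# The reloaded pipeline should reproduce the original predictions
same = np.allclose(model.predict(X), model2.predict(X))
print(same)
```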

### Solution 3

For me, the file is 1252 bytes:
```
$ ls -l LR_pipeline.pkl
-rw-r--r--  1 jmmcd  staff  1252 28 Oct 19:18 LR_pipeline.pkl
```
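
For the curious part of the exercise, one way to compare sizes without touching the disk is to measure the length of the byte string returned by `pickle.dumps`, which matches what `pickle.dump` would write with the same protocol. This is only a sketch: the bare LR model here is fitted on mean-imputed data (it cannot accept `np.nan` directly), and the exact byte counts will vary with the Python and Scikit-Learn versions.

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]])
y = np.array([14, 16, -1, 8, -5])

pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)

# A bare LR model needs the missing values filled in beforehand
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
bare_lr = LinearRegression().fit(X_filled, y)

pipe_bytes = len(pickle.dumps(pipe))
lr_bytes = len(pickle.dumps(bare_lr))
print(f"pipeline: {pipe_bytes} bytes, bare LR: {lr_bytes} bytes")
```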
