# 4. Transformers and Pipelines

## Summary of commands

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, GridSearchCV
from mymetrics import root_mean_squared_log_error

hs = pd.read_csv('data/housing_sample.csv')
X = hs[['YearBuilt', 'GrLivArea', 'GarageArea']].values
y = hs.pop('SalePrice').values

kf = KFold(n_splits=5, shuffle=True)
dtr = DecisionTreeRegressor()

grid = {'max_depth': range(2, 11), 'min_samples_split': [5, 10, 20, 50, 100]}
gs = GridSearchCV(estimator=dtr, param_grid=grid, cv=kf, scoring=root_mean_squared_log_error)
gs.fit(X, y)
df_results = pd.DataFrame(gs.cv_results_)
gs.best_params_
gs.best_estimator_

## Transformers

Transformers are a special class of estimators that transform either the input or output data independently. Transformations are applied to the data before the machine learning happens. Many transformers are found in the [preprocessing module][1].

Although transformers don't do machine learning themselves, they still learn something from data and use the same three-step process - import, instantiate, fit. The `SimpleImputer` transformer from the `impute` module imputes (fills) missing data. Let's look at the number of missing values in each column.

[1]: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

In [None]:
hs.isna().sum()

If we wish to use the `LotFrontage` column, we will need to impute the missing values (or drop the rows containing them from the DataFrame). There are many strategies for imputing missing data. As the name suggests, the `SimpleImputer` only provides simple strategies, which are set during instantiation. Set the `strategy` parameter to 'mean', 'median', 'constant', or 'most_frequent'. If you select 'constant', you'll have to provide that constant with the `fill_value` parameter. Let's complete the three-step process below by choosing to fill in missing values with the mean.

In [None]:
X = hs[['LotFrontage']].values

from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='mean')
si.fit(X)

### `fit` doesn't fill the missing values
Calling the `fit` method does not fill in the missing values. It only learns the mean of each column, which you can access with the `statistics_` attribute.

In [None]:
si.statistics_

### Complete the transformation with the `transform` method
To actually fill the missing data (to be returned as a new copy), use the `transform` method after you have used the `fit` method.

In [None]:
X_filled = si.transform(X)
X_filled[:5]

Verify that there are no more missing values.

In [None]:
np.isnan(X_filled).sum()

The original data was not changed.

In [None]:
np.isnan(X).sum()
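The split between learning and filling is easier to see on a tiny array. The sketch below uses a made-up two-column array (`X_demo` is not part of the housing data) and is only meant to illustrate what `fit` stores and what `transform` returns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

si_demo = SimpleImputer(strategy='mean')
si_demo.fit(X_demo)

# fit only learns the column means, ignoring missing values: [2., 15.]
learned_means = si_demo.statistics_

# transform returns a new array with the learned means filled in
X_demo_filled = si_demo.transform(X_demo)

print(learned_means)
print(X_demo_filled)
print(np.isnan(X_demo).sum())  # the original array still has its 2 missing values
```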

### Skip a step - fit and transform with `fit_transform`
It is very common to call the `transform` method right after `fit`. scikit-learn gives all transformers a `fit_transform` method that both learns from the data and returns the new transformed dataset. The three-step process now becomes:

In [None]:
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='mean')
X_filled = si.fit_transform(X)
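The other strategies work the same way. As a rough sketch, the code below applies each strategy to a made-up single-column array (`X_demo` is hypothetical) and records the value used to fill the one missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column; the last value is missing
X_demo = np.array([[1.0], [1.0], [4.0], [10.0], [np.nan]])

filled = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that replaced the missing entry
    filled[strategy] = imputer.fit_transform(X_demo)[-1, 0]

# The 'constant' strategy needs the fill_value parameter
imputer = SimpleImputer(strategy='constant', fill_value=0)
filled['constant'] = imputer.fit_transform(X_demo)[-1, 0]

print(filled)  # mean -> 4.0, median -> 2.5, most_frequent -> 1.0, constant -> 0.0
```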

### Now do machine learning
Once the missing values have been imputed, you can proceed to do machine learning in the same manner as we did before.

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
kf = KFold(n_splits=5, shuffle=True)
cross_val_score(lr, X_filled, y, cv=kf, scoring=root_mean_squared_log_error)

## Using a pipeline to automate the process
Notice how we had to assign the transformed data to a new variable name, `X_filled`. If you had many transformations and wanted to apply them in succession, this would quickly become cumbersome. Instead, you can use the `Pipeline` meta-estimator found in the `pipeline` module. To use it, construct a list of all the transformers you want to apply to your dataset. If you want to do machine learning as well, include the model as the last step.

### Instantiate pipeline with a list of two-item tuples
Specifically, the `Pipeline` estimator must be instantiated with a list of two-item tuples, where the first item in the tuple is a string naming that step in the pipeline and the second is the instantiated estimator. Let's create this list of two-item tuples.

In [None]:
si = SimpleImputer(strategy='mean')
lr = LinearRegression()

step1 = ('impute', si)
step2 = ('lin_reg', lr)

steps = [step1, step2]

### Three-step process with pipeline
The same three-step process works with the pipeline. It imputes the missing values and then learns the linear regression parameters.

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps)
pipe.fit(X, y)

You can make predictions:

In [None]:
pipe.predict(X)
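Calling `predict` on a pipeline runs `transform` on each transformer and then `predict` on the final step. The sketch below checks this on made-up data (`X_demo` and `y_demo` are hypothetical) by pulling the fitted steps out of the pipeline through its `named_steps` attribute and repeating the work by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data where y is exactly 2 * x once the gap is filled with the mean
X_demo = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

pipe_demo = Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('lin_reg', LinearRegression())])
pipe_demo.fit(X_demo, y_demo)

# Fitted steps are available by the names given in the tuples
imputer = pipe_demo.named_steps['impute']
model = pipe_demo.named_steps['lin_reg']

# Doing the two steps by hand gives the same predictions as the pipeline
manual_pred = model.predict(imputer.transform(X_demo))
pipe_pred = pipe_demo.predict(X_demo)

print(imputer.statistics_)  # mean learned by the imputer: [3.]
print(np.allclose(manual_pred, pipe_pred))
```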

You can do cross validation:

In [None]:
cross_val_score(pipe, X, y, cv=kf, scoring=root_mean_squared_log_error)

## Grid searching with the pipeline
Completing a grid search is a little different with a pipeline. You can search over the hyperparameters of any step in the pipeline. To uniquely identify a hyperparameter, prefix it with the name you gave its step during instantiation, followed by two underscores. For instance, 'impute__strategy' refers to the `strategy` hyperparameter of the step named 'impute'.

In [None]:
X = hs[['YearBuilt', 'LotFrontage', 'GrLivArea', 'GarageArea']].values

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

si = SimpleImputer(strategy='mean')
dtr = DecisionTreeRegressor()

step1 = ('impute', si)
step2 = ('tree', dtr)
steps = [step1, step2]
pipe = Pipeline(steps)

grid = {'impute__strategy': ['mean', 'median'],
        'tree__max_depth': range(2, 11),
        'tree__min_samples_split': [5, 10, 50, 100]}
gs = GridSearchCV(estimator=pipe, param_grid=grid, cv=kf, scoring=root_mean_squared_log_error)
gs.fit(X, y)
gs.best_params_
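If you are unsure which names are valid keys for the grid, the pipeline's `get_params` method lists them. The sketch below builds a throwaway pipeline (`pipe_demo` is hypothetical) with the same step names, lists the double-underscore names, and sets two of them with `set_params`.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

pipe_demo = Pipeline([('impute', SimpleImputer()),
                      ('tree', DecisionTreeRegressor())])

# Every hyperparameter of every step appears as '<step name>__<parameter name>'
param_names = sorted(pipe_demo.get_params().keys())
step_params = [name for name in param_names if '__' in name]
print(step_params)

# The same names work with set_params, which is what the grid search relies on
pipe_demo.set_params(impute__strategy='median', tree__max_depth=3)
print(pipe_demo.named_steps['impute'].strategy)
print(pipe_demo.named_steps['tree'].max_depth)
```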

## Adding another transformer
Standardization is one of the most common transformation techniques. It subtracts the mean from each column and divides by the standard deviation. All the data is now scaled as the number of standard deviations away from the mean. Each resulting column will have a mean of 0 and standard deviation of 1. 
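As a quick check of that description, the sketch below standardizes a made-up array (`X_demo` is hypothetical) with `StandardScaler` and compares the result with the same computation done by hand. Note that `StandardScaler` divides by the population standard deviation, which is also the default for NumPy's `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two columns on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

ss_demo = StandardScaler()
X_scaled = ss_demo.fit_transform(X_demo)

# Same computation by hand: subtract the column mean, divide by the column std (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(ss_demo.mean_)                            # column means learned during fit: [2., 200.]
print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```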

Standardization matters for machine learning models that depend on the relative size of column values. Penalized regression (Lasso and Ridge), k-nearest neighbors, and support vector machines are all sensitive to feature scale and usually perform better on standardized inputs. Below, we use the `StandardScaler` transformer from the `preprocessing` module. We also use Ridge regression and search over `alpha`, the size of the penalty.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

si = SimpleImputer(strategy='mean')
ss = StandardScaler()
ridge = Ridge()

step1 = ('impute', si)
step2 = ('standardize', ss)
step3 = ('ridge', ridge)

steps = [step1, step2, step3]
pipe = Pipeline(steps)

grid = {'impute__strategy': ['mean', 'median'],
        'ridge__alpha': np.logspace(-5, 5)}
gs = GridSearchCV(estimator=pipe, param_grid=grid, cv=kf, scoring=root_mean_squared_log_error)
gs.fit(X, y)
gs.best_params_
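The results of a pipeline grid search can be inspected the same way as before, with the step-prefixed names appearing in the `cv_results_` columns. The sketch below is self-contained and runs on synthetic data. The arrays, the smaller grid, and the built-in 'neg_mean_squared_error' scorer are assumptions chosen for illustration and stand in for the housing data and the custom scorer used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical), with a few values removed afterwards
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)
X_demo[::10, 0] = np.nan  # give the imputer something to fill

pipe_demo = Pipeline([('impute', SimpleImputer()),
                      ('standardize', StandardScaler()),
                      ('ridge', Ridge())])
grid_demo = {'impute__strategy': ['mean', 'median'],
             'ridge__alpha': [0.1, 1.0, 10.0]}
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
gs_demo = GridSearchCV(pipe_demo, grid_demo, cv=kf_demo,
                       scoring='neg_mean_squared_error')
gs_demo.fit(X_demo, y_demo)

# One row per parameter combination; parameter columns carry the step prefix
results = pd.DataFrame(gs_demo.cv_results_)
summary = results[['param_impute__strategy', 'param_ridge__alpha',
                   'mean_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))

# The best pipeline is refit on all the data and its steps can be inspected
best_alpha = gs_demo.best_estimator_.named_steps['ridge'].alpha
print(best_alpha)
```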

## Exercise

Practice using some of the transformers separately and then together in a pipeline that ends with a machine learning algorithm. Finally, practice grid searching on different sections of the pipeline.