# Grid Searching Pipelines

The pipelines from the previous chapter can be grid searched to find the best-performing hyperparameter values. scikit-learn lets us tune not only the hyperparameters of the machine learning estimator at the end of the pipeline, but those of the transformers as well. Let's begin by reading in our data and selecting the same features as in the previous chapter.

In [None]:
import pandas as pd
housing = pd.read_csv('../data/housing_sample.csv')
X = housing[['GrLivArea', 'GarageArea', 'LotFrontage', 'OverallQual']]
y = housing['SalePrice']
X.head()

Let's rebuild the three steps of our previous pipeline by importing and instantiating each estimator with its default hyperparameters. Even though the `SimpleImputer` and `StandardScaler` are not machine learning models, we can consider the values used during instantiation as hyperparameters.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
si = SimpleImputer()
ss = StandardScaler()
dtr = DecisionTreeRegressor()
steps = [('si', si), ('ss', ss), ('dtr', dtr)]
pipe = Pipeline(steps)

### Cross Validated Scores

We were able to use this pipeline to get cross-validated scores but did not tune any of the hyperparameters. Let's repeat our cross-validated results.

In [None]:
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=123)
cross_val_score(pipe, X, y, cv=kf).mean()

## Using `GridSearchCV` on a pipeline

It is possible to use the `GridSearchCV` meta-estimator on a pipeline, but a special naming rule applies. The hyperparameter names that appear in the grid are still strings, but each must be prefixed with the **name** of its pipeline step, separated from it by two underscores. For instance, to search the `max_depth` hyperparameter of the decision tree, we use the string `dtr__max_depth`. The grid below searches hyperparameters of both the decision tree and the `SimpleImputer`. The `StandardScaler` does have hyperparameters (`with_mean` and `with_std`) that control whether the mean is subtracted and whether the result is divided by the standard deviation, but we choose not to deviate from the defaults.

In [None]:
from sklearn.model_selection import GridSearchCV
grid = {'dtr__max_depth': range(4, 10),
        'dtr__min_samples_leaf': [10, 20, 30, 40],
        'si__strategy': ['mean', 'median', 'most_frequent']}
gs = GridSearchCV(pipe, grid, cv=kf)
gs.fit(X, y);
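
If you are unsure which names are valid in a grid, one option is to ask the pipeline itself. The sketch below rebuilds the same three steps under a hypothetical variable name, `demo_pipe`, and lists every key returned by `get_params` that contains a double underscore. The `set_params` method follows the same naming rule, so it is a quick way to check a name before running a full search.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline

# Rebuild the same three-step pipeline (demo_pipe is a hypothetical name)
demo_pipe = Pipeline([('si', SimpleImputer()),
                      ('ss', StandardScaler()),
                      ('dtr', DecisionTreeRegressor())])

# Keys of the form '<step name>__<hyperparameter>' can be used in a grid
searchable = sorted(name for name in demo_pipe.get_params() if '__' in name)
print(searchable)

# set_params accepts the same double-underscore names
demo_pipe.set_params(dtr__max_depth=5, si__strategy='median')
```

A misspelled step or hyperparameter name passed to `set_params` raises a `ValueError`, which is the same kind of error a grid search reports for an invalid key.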

Let's retrieve the best hyperparameter combination and its mean cross-validated score, which is $R^2$ by default for a regressor.

In [None]:
gs.best_params_

In [None]:
gs.best_score_

The results are easier to read when visualized in a table.

In [None]:
df_results = pd.DataFrame(gs.cv_results_)
# Mean test score for each depth (rows) and strategy / leaf-size pair (columns)
(df_results
 .pivot_table(index='param_dtr__max_depth',
              columns=['param_si__strategy', 'param_dtr__min_samples_leaf'],
              values='mean_test_score')
 .round(3)
 .style.background_gradient(cmap='coolwarm', axis=None))

We can assign the best pipeline to its own variable for use in the future.

In [None]:
pipe_best = gs.best_estimator_
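
With the default `refit=True`, the best estimator is a complete pipeline that has already been refit on all of the data passed to `fit`, so it can make predictions directly. The following is a minimal sketch on synthetic data, not the housing data. The names `X_toy`, `y_toy`, `toy_pipe`, `toy_gs`, and `toy_best` are hypothetical, and the scaling step is left out for brevity.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the housing sample (an assumption)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(150, 3))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=150)

toy_pipe = Pipeline([('si', SimpleImputer()),
                     ('dtr', DecisionTreeRegressor(random_state=0))])
toy_gs = GridSearchCV(toy_pipe, {'dtr__max_depth': [2, 4, 6]}, cv=5)
toy_gs.fit(X_toy, y_toy)

# best_estimator_ is a fitted Pipeline using the winning hyperparameters
toy_best = toy_gs.best_estimator_

# The missing value is filled by the 'si' step before the tree sees it
new_rows = np.array([[0.5, np.nan, -1.0]])
print(toy_best.predict(new_rows))
```

Because imputation happens inside the pipeline, new rows can be passed in raw form, missing values included.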

### All steps of the pipeline can be searched

Each step of the pipeline may be part of the grid search. Reference a hyperparameter by joining the name of the step and the name of the hyperparameter with two consecutive underscores, as in `si__strategy`.
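
As an illustration, the grid below includes one or more hyperparameters from all three steps, including the `with_mean` and `with_std` options of `StandardScaler`. This is only a sketch: it runs on synthetic data rather than the housing data, and the names `X_demo`, `y_demo`, `full_pipe`, `full_grid`, `kf_demo`, and `gs_demo` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + rng.normal(size=200)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject missing values

full_pipe = Pipeline([('si', SimpleImputer()),
                      ('ss', StandardScaler()),
                      ('dtr', DecisionTreeRegressor(random_state=0))])

# One or more hyperparameters from every step: 2 * 2 * 2 * 2 = 16 combinations
full_grid = {'si__strategy': ['mean', 'median'],
             'ss__with_mean': [True, False],
             'ss__with_std': [True, False],
             'dtr__max_depth': [3, 5]}

kf_demo = KFold(n_splits=5, shuffle=True, random_state=123)
gs_demo = GridSearchCV(full_pipe, full_grid, cv=kf_demo)
gs_demo.fit(X_demo, y_demo)
print(gs_demo.best_params_)
```

Decision trees are largely insensitive to feature scaling, so the scaler settings should make little difference to the scores here. They are included only to show the naming rule applied to every step.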

## Exercises

### Exercise 1

<span  style="color:green; font-size:16px">Create different pipelines and grid search them to find the best parameters.</span>

### Exercise 2

<span  style="color:green; font-size:16px">Make a grid with a large number of combinations and use randomized search to choose the hyperparameters.</span>