# 3. Model Selection

## Summary of Commands

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from mymetrics import root_mean_squared_log_error

hs = pd.read_csv('data/housing_sample.csv')
X = hs[['GrLivArea']].values
y = hs.pop('SalePrice').values

lr = LinearRegression()
lr.fit(X, y)

kf = KFold(n_splits=5, shuffle=True)
cross_val_score(lr, X, y, cv=kf, scoring=root_mean_squared_log_error)

## Hyperparameter Tuning

You can change the 'specifications' of how an estimator is constructed during instantiation (step 2). These specifications are called **hyperparameters** and once set during instantiation are not changed. They differ from **model parameters** which are learned during training (step 3). Using different values for these hyperparameters can lead to drastically different results.

Linear regression does not have many hyperparameters, so in this notebook we will work with decision trees instead, which have many more. Two hyperparameters commonly set for decision trees are `max_depth` (the maximum depth of a tree) and `min_samples_split` (the minimum number of samples a node must contain for a split to be possible).
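
To make the distinction concrete, here is a minimal sketch on synthetic data (not the housing sample; all `*_demo` names are made up for illustration). The hyperparameters passed at instantiation can be read back with `get_params` and are left untouched by `fit`, while the depth of the fitted tree and the slope of the linear model are learned from the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only: y is roughly 3 * x plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# Hyperparameters are fixed when the estimator is instantiated
dtr_demo = DecisionTreeRegressor(max_depth=3, min_samples_split=20)
hyper_before = dtr_demo.get_params()

# Fitting learns the tree structure but does not alter the hyperparameters
dtr_demo.fit(X_demo, y_demo)
hyper_after = dtr_demo.get_params()
learned_depth = dtr_demo.get_depth()  # learned from data, capped by max_depth

# For linear regression, the learned model parameters are the coefficients
lr_demo = LinearRegression().fit(X_demo, y_demo)
slope = lr_demo.coef_[0]
```

Comparing `hyper_before` with `hyper_after` shows that training leaves the hyperparameters as they were set.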

### Default hyperparameter values

All estimators provide default values for their hyperparameters, which is why you can instantiate them without setting any explicitly. Let's construct our first decision tree estimator with the default values. The default for `max_depth` is `None` (no limit on depth) and the default for `min_samples_split` is 2. There are several other hyperparameters that also have default values, but we will not inspect them. We will also use a couple more input variables.

In [None]:
hs.head(3)

In [None]:
X = hs[['YearBuilt', 'GrLivArea', 'GarageArea']].values
X[:5]

Import and instantiate the estimator.

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
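
If you want to check the defaults yourself, one option is to read them from a freshly instantiated estimator with `get_params`. The variable name `default_params` below is only illustrative.

```python
from sklearn.tree import DecisionTreeRegressor

# get_params returns a dictionary of every hyperparameter and its current value
default_params = DecisionTreeRegressor().get_params()

print(default_params['max_depth'])          # None: no limit on tree depth
print(default_params['min_samples_split'])  # 2: a node needs at least two samples to split
```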

Instead of fitting the model directly with the `fit` method, we can go straight to cross validation, which fits a new model on each iteration.

In [None]:
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(dtr, X, y, cv=kf, scoring=root_mean_squared_log_error)
scores

We can take the mean of the cross validated scores to get a single overall performance score.

In [None]:
scores.mean()
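
To see what "fits a new model for each iteration" means, the sketch below reproduces `cross_val_score` with an explicit loop. It runs on synthetic data, and `rmsle_scorer` is a hypothetical stand-in for `root_mean_squared_log_error` from `mymetrics`, whose implementation is not shown in this notebook. It assumes the scorer takes `(estimator, X, y)`. The `random_state` values are fixed only so that both approaches see identical splits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

def rmsle_scorer(estimator, X, y):
    """Hypothetical scorer: root mean squared log error of the predictions."""
    y_pred = estimator.predict(X)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y)) ** 2))

# Synthetic, strictly positive target for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 3000, size=(300, 1))
y_demo = 50_000 + 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=300)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeRegressor(random_state=0)

# Manual cross validation: a fresh, unfitted copy of the estimator per fold
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    model = clone(tree)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(rmsle_scorer(model, X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The same procedure in a single call
auto_scores = cross_val_score(tree, X_demo, y_demo, cv=kf_demo, scoring=rmsle_scorer)
```

With the seeds fixed, `manual_scores` and `auto_scores` agree fold by fold.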

### Select new hyperparameters
Let's instantiate our decision tree with a different set of hyperparameters by setting `max_depth` to 5 and `min_samples_split` to 50.

In [None]:
dtr = DecisionTreeRegressor(max_depth=5, min_samples_split=50)
scores = cross_val_score(dtr, X, y, cv=kf, scoring=root_mean_squared_log_error)
scores

In [None]:
scores.mean()

## Search many hyperparameter combinations with `GridSearchCV`
The second set of hyperparameters from above returned a better score than the defaults. To help find the best combination of hyperparameters, use `GridSearchCV` from the `model_selection` module. To work with it, you must first create a **parameter grid**: a dictionary mapping each hyperparameter name (as a string) to the values you want to test.

In [None]:
grid = {'max_depth': range(2, 11), 'min_samples_split': [5, 10, 20, 50, 100]}

### Three-step process for `GridSearchCV`

`GridSearchCV` is a meta-estimator, that is, an estimator that fits other estimators. It follows the same three-step process as the other estimators, but must be instantiated with the estimator you would like to search and the parameter grid. It also performs cross validation and accepts the same `cv` and `scoring` parameters as `cross_val_score`, which are likewise set during instantiation.

During the `fit` method, every single combination of hyperparameters is scored with cross validation. So, in our specific example, we have 9 possible values for `max_depth` and 5 for `min_samples_split` leading to 45 different combinations of hyperparameters. Each of these combinations will have a mean cross validated score which will be used to rank the models.

In [None]:
from sklearn.model_selection import GridSearchCV
dtr = DecisionTreeRegressor()
gs = GridSearchCV(estimator=dtr, param_grid=grid, cv=kf, scoring=root_mean_squared_log_error)
gs.fit(X, y)
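
The count of 45 combinations can be checked with `ParameterGrid` from `model_selection`, which enumerates the combinations of a grid dictionary. The names `grid_demo`, `combos`, `n_combos` and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

grid_demo = {'max_depth': range(2, 11), 'min_samples_split': [5, 10, 20, 50, 100]}

# Each element is a dictionary holding one combination of hyperparameter values
combos = list(ParameterGrid(grid_demo))
n_combos = len(combos)   # 9 values * 5 values = 45 combinations

# With 5-fold cross validation, every combination is fit once per fold
n_fits = n_combos * 5    # 225 fits during the search

print(combos[0])
```

On top of those 225 fits, `GridSearchCV` fits the best combination once more on all the data, since `refit=True` is the default.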

### Retrieving the best combination

The best combination of hyperparameters is found with the `best_params_` attribute.

In [None]:
gs.best_params_
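
One point worth keeping in mind is that `GridSearchCV` ranks combinations by treating a higher score as better. For an error metric such as root mean squared log error, the scorer therefore needs to return the negated error for `best_params_` to be the combination with the lowest error. Whether `root_mean_squared_log_error` from `mymetrics` already does this is not shown in this notebook. The sketch below uses synthetic data and a hypothetical `neg_rmsle_scorer` to illustrate the convention.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

def neg_rmsle_scorer(estimator, X, y):
    """Hypothetical scorer: negated RMSLE, so that higher means better."""
    y_pred = estimator.predict(X)
    return -np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y)) ** 2))

# Synthetic, strictly positive target for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 3000, size=(300, 1))
y_demo = 50_000 + 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=300)

gs_demo = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [2, 4, 8]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring=neg_rmsle_scorer,
)
gs_demo.fit(X_demo, y_demo)

# The winning combination is the one with the highest mean test score
mean_scores = gs_demo.cv_results_['mean_test_score']
best_rmsle = -gs_demo.best_score_   # flip the sign back to read it as an error
```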

### Analyzing the results
The results are contained in the `cv_results_` attribute which is a large dictionary that can be converted to a DataFrame for easier interpretation.

In [None]:
df_results = pd.DataFrame(gs.cv_results_)
df_results
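
The full results table has many columns, so it can help to sort by `rank_test_score` and keep only a few of them. The sketch below runs on synthetic data and, as an assumption, uses scikit-learn's built-in `'neg_mean_squared_log_error'` scoring string in place of the custom scoring function used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic, strictly positive target for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 3000, size=(300, 1))
y_demo = 50_000 + 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=300)

gs_demo = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [2, 4, 8], 'min_samples_split': [5, 50]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring='neg_mean_squared_log_error',
)
gs_demo.fit(X_demo, y_demo)

# Keep the most informative columns and put the best-ranked combination first
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
ranked = pd.DataFrame(gs_demo.cv_results_)[cols].sort_values('rank_test_score')
ranked.head()
```

The `std_test_score` column shows how much each combination's score varied across the folds, which is useful when the top few means are close.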

If you searched over exactly two hyperparameters, you can pivot the results into a DataFrame with one hyperparameter along each axis.

In [None]:
# Keyword arguments are required by recent pandas versions
df_results.pivot(index='param_max_depth', columns='param_min_samples_split', values='mean_test_score').round(3)

### Getting the best model
The best model is saved to the `best_estimator_` attribute. With the default `refit=True`, it is a decision tree retrained on all the data with the best hyperparameters found, so you can use it to make future predictions.

In [None]:
gs.best_estimator_
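
As a sketch of making predictions, the example below fits a small search on synthetic data with a single made-up feature and predicts for two new observations. For brevity it relies on the estimator's default score (R squared) rather than the scoring function used above. Calling `predict` on the fitted `GridSearchCV` object delegates to `best_estimator_`, so both calls return the same values.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 3000, size=(300, 1))
y_demo = 50_000 + 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=300)

gs_demo = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [2, 4, 8]},
    cv=5,
)
gs_demo.fit(X_demo, y_demo)

# Two hypothetical new observations with the same single feature
X_new = np.array([[1000.0], [2000.0]])

preds_from_best = gs_demo.best_estimator_.predict(X_new)
preds_from_search = gs_demo.predict(X_new)  # delegates to best_estimator_
```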

## Summary

In [None]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, GridSearchCV
from mymetrics import root_mean_squared_log_error

hs = pd.read_csv('data/housing_sample.csv')
X = hs[['YearBuilt', 'GrLivArea', 'GarageArea']].values
y = hs.pop('SalePrice').values

kf = KFold(n_splits=5, shuffle=True)
dtr = DecisionTreeRegressor()

grid = {'max_depth': range(2, 11), 'min_samples_split': [5, 10, 20, 50, 100]}
gs = GridSearchCV(estimator=dtr, param_grid=grid, cv=kf, scoring=root_mean_squared_log_error)
gs.fit(X, y)
gs.best_params_

In [None]:
df_results = pd.DataFrame(gs.cv_results_)
gs.best_estimator_

## Exercise
Practice using `GridSearchCV` with different combinations of hyperparameters using different regressors.