# Model Tuning: Grid Search + Pipeline

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     cross_val_score, RandomizedSearchCV)

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Objectives

- Explain what hyperparameters are
- Describe the purpose of grid searching
- Implement grid searching for model optimization

# Model Tuning

![](https://imgs.xkcd.com/comics/machine_learning.png)

## Hyperparameters

Many of the models we have looked at are really *families* of models: each is configured by **hyperparameters**, settings that we choose before training, and each choice of values yields a different model.

For example, the $k$-nearest-neighbors algorithm allows us to make:

- a 1-nearest-neighbor model
- a 2-nearest-neighbors model
- a 3-nearest-neighbors model
- etc.

Or, for another example, the decision tree algorithm allows us to make:

- a classifier that branches according to information gain
- a classifier that branches according to Gini impurity
- a regressor that branches according to mean squared error
- etc.

Depending on the sort of problem and data at hand, it is natural to experiment with different values of these hyperparameters to try to improve model performance.
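
As a minimal sketch of the "family of models" idea, the cell below builds a few members of each family by passing different hyperparameter values to the constructor. The variable names (`knn_models`, `tree_entropy`, `tree_gini`) are illustrative and not part of the lesson's code.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Each value of n_neighbors gives a different member of the k-NN family.
knn_models = [KNeighborsClassifier(n_neighbors=k) for k in (1, 2, 3)]

# The branching criterion is a hyperparameter of the decision tree.
tree_entropy = DecisionTreeClassifier(criterion='entropy')  # information gain
tree_gini = DecisionTreeClassifier(criterion='gini')        # Gini impurity

# Hyperparameters are fixed at construction time, before any data is seen.
for model in knn_models:
    print(model.get_params()['n_neighbors'])
print(tree_entropy.get_params()['criterion'])
print(tree_gini.get_params()['criterion'])
```

None of these models has been fit yet: the hyperparameters describe *which* model we will train, not what it learns.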

> We can think of these **hyperparameters** as _dials_ of the base model

<img width=60% src='images/dials.png'/>

### Difference from Parametric / Non-Parametric Models

Contrast the notion of hyperparameters with the distinction between parametric and non-parametric models.

A linear regression model is parametric in the sense that we start with a given model *form* and then search for the optimal parameters to fill in that form. But *those* parameters are not the sort we might tweak for the purposes of improving model performance. On the contrary, there is one best set of parameters, and training the model is a matter of finding those optimal values. Hyperparameters, by contrast, are values we choose ourselves before training begins.
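
To make the distinction concrete, here is a small sketch on made-up, noise-free data (the toy arrays `X_toy` and `y_toy` are assumptions for illustration). We choose `fit_intercept` ourselves, while `coef_` and `intercept_` are found by fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 with no noise (illustrative assumption).
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2 * X_toy.ravel() + 1

# fit_intercept is a hyperparameter: we pick it before fitting.
lin_reg = LinearRegression(fit_intercept=True)

# coef_ and intercept_ are parameters: fitting recovers them from the data.
lin_reg.fit(X_toy, y_toy)
print(lin_reg.coef_, lin_reg.intercept_)
```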

## Data Example

![Penguins](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/69530276d74b99df81cc385f4e95c644da69ebfa/man/figures/lter_penguins.png)

> Image source: @allison_horst [github.com/allisonhorst/penguins](https://github.com/allisonhorst/penguins)

In [2]:
penguins = sns.load_dataset('penguins')

![Bill length & depth](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/69530276d74b99df81cc385f4e95c644da69ebfa/man/figures/culmen_depth.png)

> Image source: @allison_horst [github.com/allisonhorst/penguins](https://github.com/allisonhorst/penguins)

In [3]:
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


In [4]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [5]:
penguins.describe()

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


### Data Prep

We'll try to predict species given the other columns' values. First we'll drop the rows with missing values, and then we'll dummy out `island` and `sex`:

In [6]:
penguins.isna().sum().sum()

19

In [7]:
penguins = penguins.dropna()

In [8]:
y = penguins.pop('species')
y

0      Adelie
1      Adelie
2      Adelie
4      Adelie
5      Adelie
        ...  
338    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 333, dtype: object

In [9]:
penguins

|   | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|
| 0 | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
| ... | ... | ... | ... | ... | ... | ... |
| 338 | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |


In [10]:
# Note we're dedicating a lot of data to the testing set just for demonstrative purposes
X_train, X_test, y_train, y_test = train_test_split(
    penguins, y, test_size=0.5, random_state=42)

In [11]:
X_train_cat = X_train.select_dtypes('object')

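# In scikit-learn >= 1.2, use `sparse_output=False` and
# `get_feature_names_out()` in place of the calls below
ohe = OneHotEncoder(sparse=False)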

dums = ohe.fit_transform(X_train_cat)
dums_df = pd.DataFrame(dums,
                       columns=ohe.get_feature_names(),
                       index=X_train_cat.index)

In [12]:
dums_df.head()

|   | x0_Biscoe | x0_Dream | x0_Torgersen | x1_Female | x1_Male |
|---|---|---|---|---|---|
| 160 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 237 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 121 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 179 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


In [13]:
X_train_nums = X_train.select_dtypes('float64')

ss = StandardScaler()

ss.fit(X_train_nums)
nums_df = pd.DataFrame(ss.transform(X_train_nums),
                      index=X_train_nums.index)
nums_df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 160 | 0.362748 | 0.903276 | -0.472344 | -0.094599 |
| 237 | 0.973499 | -0.977375 | 1.408317 | 2.512546 |
| 2 | -0.725152 | 0.445820 | -0.472344 | -1.185963 |
| 121 | -1.221387 | 1.360731 | -0.255345 | -0.882806 |
| 179 | 1.030757 | 0.954104 | -0.110678 | -0.519018 |
| ... | ... | ... | ... | ... |
| 194 | 1.297960 | 1.004932 | -0.400011 | -0.822175 |
| 77 | -1.316816 | 1.157417 | -1.268008 | -0.397756 |
| 112 | -0.839667 | 0.293335 | -0.617010 | -1.246594 |
| 277 | 0.267318 | -1.079032 | 1.335984 | 0.936132 |


In [14]:
X_train_clean = pd.concat([nums_df, dums_df], axis=1)

In [15]:
X_train_clean.head()

|   | 0 | 1 | 2 | 3 | x0_Biscoe | x0_Dream | x0_Torgersen | x1_Female | x1_Male |
|---|---|---|---|---|---|---|---|---|---|
| 160 | 0.362748 | 0.903276 | -0.472344 | -0.094599 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 237 | 0.973499 | -0.977375 | 1.408317 | 2.512546 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | -0.725152 | 0.44582 | -0.472344 | -1.185963 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 121 | -1.221387 | 1.360731 | -0.255345 | -0.882806 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 179 | 1.030757 | 0.954104 | -0.110678 | -0.519018 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


#### Preparing the Test Set

In [16]:
X_test_cat = X_test.select_dtypes('object')

test_dums = ohe.transform(X_test_cat)
test_dums_df = pd.DataFrame(test_dums,
                            columns=ohe.get_feature_names(),
                            index=X_test_cat.index)

In [17]:
X_test_nums = X_test.select_dtypes('float64')

test_nums = ss.transform(X_test_nums)
test_nums_df = pd.DataFrame(test_nums,
                           index=X_test_nums.index)

In [18]:
X_test_clean = pd.concat([test_nums_df,
                 test_dums_df], axis=1)

In [19]:
X_test_clean.head()

|   | 0 | 1 | 2 | 3 | x0_Biscoe | x0_Dream | x0_Torgersen | x1_Female | x1_Male |
|---|---|---|---|---|---|---|---|---|---|
| 30 | -0.877839 | -0.214949 | -1.702007 | -1.185963 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 317 | 0.534522 | -1.282345 | 1.48065 | 0.784554 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 79 | -0.381604 | 1.004932 | -0.472344 | -0.276493 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 201 | 1.088015 | 0.090021 | -0.255345 | -0.670597 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 63 | -0.572464 | 0.547477 | -0.689343 | -0.215862 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### Trying Different Models & Values

#### Decision Tree

In [20]:
dt = DecisionTreeClassifier(random_state=10)

dt.fit(X_train_clean, y_train)

DecisionTreeClassifier(random_state=10)

In [21]:
dt.score(X_test_clean, y_test)

0.9700598802395209

##### Changing the branching criterion

In [22]:
dt2 = DecisionTreeClassifier(criterion='entropy',
                          random_state=10)

dt2.fit(X_train_clean, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=10)

In [23]:
dt2.score(X_test_clean, y_test)

0.9700598802395209

##### Changing the max_depth

In [24]:
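# max_depth=None (the default) lets the tree grow until its leaves are pure
# or too small to split; try a small integer such as 2 or 3 to limit the depth
dt3 = DecisionTreeClassifier(max_depth=None, criterion='entropy',
                             random_state=10)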

dt3.fit(X_train_clean, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=10)

In [25]:
dt3.score(X_test_clean, y_test)

0.9700598802395209
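
The manual experiments above can also be written as a loop. The sketch below is not the lesson's code: it uses scikit-learn's built-in iris data as a stand-in so that it runs on its own, and it compares depths by cross-validated accuracy rather than by test-set score.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris instead of the prepared penguins frames.
X_demo, y_demo = load_iris(return_X_y=True)

# Try a few depths by hand and record the mean cross-validated accuracy.
mean_scores = {}
for depth in [1, 2, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=10)
    mean_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

for depth, score in mean_scores.items():
    print(depth, round(score, 3))
```

A loop like this gets unwieldy as soon as we vary two or more hyperparameters at once, which is the situation grid searching is designed for.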

# Automatically Searching with Grid Search

It's not a bad idea to experiment with the values of your models' hyperparameters a bit as you're getting a feel for your models' performance. But there are more systematic ways of going about the search for optimal hyperparameters. One method of hyperparameter tuning is **grid searching**. 

The idea is to build multiple models with different hyperparameter values and then see which one performs the best. The hyperparameters and the values to try form a sort of *grid* along which we are looking for the best performance. For example:


    1           | 'minkowski' | 'uniform'
    3           | 'manhattan' | 'distance'
    5           |
    ______________________________________
    n_neighbors | metric      | weights

Scikit-Learn has a [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class whose `fit()` method runs this procedure. Note that this can be quite computationally expensive since:

- A model is constructed for each combination of hyperparameter values that we input; and
- Each model is cross-validated.

In [26]:
3 * 2 * 2 * 5  # n_neighbors values * metrics * weights * cross-validation folds

60
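
The same count can be checked programmatically. This sketch uses `ParameterGrid` from `sklearn.model_selection` to enumerate the combinations in the k-NN grid drawn above; the names `knn_grid`, `n_combinations`, `n_folds`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The k-NN grid from the diagram: 3 x 2 x 2 combinations.
knn_grid = {
    'n_neighbors': [1, 3, 5],
    'metric': ['minkowski', 'manhattan'],
    'weights': ['uniform', 'distance'],
}

n_combinations = len(ParameterGrid(knn_grid))
n_folds = 5  # five-fold cross-validation
n_fits = n_combinations * n_folds
print(n_combinations, n_fits)  # 12 60
```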

### `GridSearchCV`

In [27]:
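# Define the parameter grid: a dict mapping each hyperparameter name
# to the list of values to try (left as None here, to be filled in)

grid = None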

**Question: How many models will we be constructing with this grid?**

In [28]:
dt_grid = DecisionTreeClassifier(random_state=10)

In [29]:
# Initialize the grid search object with five-fold cross-validation

gs = GridSearchCV(estimator=dt_grid, param_grid=grid, cv=5, verbose=2)

TypeError: 'NoneType' object is not iterable

In [None]:
gs.fit(X_train_clean, y_train)

In [None]:
gs.best_params_

In [None]:
gs.best_score_

In [None]:
gs.best_estimator_

In [None]:
gs.cv_results_

In [None]:
pd.DataFrame(gs.cv_results_)
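
Since `grid` is left unfilled above, here is one way the cells could look once it is defined. This is a sketch under assumptions: the grid values in `example_grid` are hypothetical choices, and the built-in iris data stands in for `X_train_clean` and `y_train` so that the cell runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris rather than X_train_clean / y_train.
X_demo, y_demo = load_iris(return_X_y=True)

# Hypothetical grid: 2 criteria x 3 depths x 2 split sizes = 12 combinations.
example_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 4],      # None is the default
    'min_samples_split': [2, 10],   # 2 is the default
}

gs_demo = GridSearchCV(estimator=DecisionTreeClassifier(random_state=10),
                       param_grid=example_grid, cv=5)
gs_demo.fit(X_demo, y_demo)  # 12 combinations x 5 folds = 60 fits, plus one refit

print(gs_demo.best_params_)
results = pd.DataFrame(gs_demo.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].head())
```

Each row of `cv_results_` corresponds to one combination of hyperparameter values, so the frame has as many rows as the grid has combinations.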

### Choice of Grid Values

Which values should you pick for your grid? Intuitively, you should try both "large" and "small" values, but of course what counts as large and small will really depend on the type of hyperparameter.

- **Always include the default value in your first search.**
- For a k-nearest neighbors model, 1 or 3 would be a small value for the number of neighbors and 15 or 17 would be a large value.
- For a decision tree model, what counts as a small `max_depth` will really depend on the size of your training data. A `max_depth` of 5 would likely have little effect on a very small dataset but, at the same time, it would probably significantly decrease the variance of a model where the dataset is large.
- For a logistic regression's regularization constant, you may want to try a set of values that are exponentially separated, like \[1, 10, 100, 1000\].
- **If a grid search finds optimal values at the ends of your hyperparameter ranges, you might try another grid search with more extreme values.**
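
For exponentially separated values, `np.logspace` saves typing them out by hand. A minimal sketch (the name `C_values` is illustrative):

```python
import numpy as np

# Four values evenly spaced on a log scale, from 10**0 up to 10**3.
C_values = np.logspace(0, 3, num=4)
print(C_values.tolist())
```

The resulting array can be used directly as the list of values for a hyperparameter in a grid.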

In [None]:
grid_1 = {'max_depth': [1, 3, 5], 'min_samples_split': [2, 3, 4, 5, 6]}

# Best values were 5 and 6

grid_2 = {'max_depth': [5, 10, 15], 'min_samples_split': [6, 8, 10]}

# New best values were 10 and 8

grid_3 = {'max_depth': [8, 9, 10, 11, 12], 'min_samples_split': [7, 8, 9]}
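
The "best value at the end of the range" check can be automated. The helper below, `at_grid_edge`, is a hypothetical function written for this illustration (it is not part of scikit-learn), and it assumes every value in the grid is numeric. The best values are the ones quoted in the comments above.

```python
def at_grid_edge(best_params, grid):
    """Return the hyperparameter names whose best value is the smallest
    or largest value that was tried (numeric values only)."""
    edges = []
    for name, values in grid.items():
        if best_params[name] in (min(values), max(values)):
            edges.append(name)
    return edges

grid_1 = {'max_depth': [1, 3, 5], 'min_samples_split': [2, 3, 4, 5, 6]}
best_1 = {'max_depth': 5, 'min_samples_split': 6}
print(at_grid_edge(best_1, grid_1))  # ['max_depth', 'min_samples_split']

grid_2 = {'max_depth': [5, 10, 15], 'min_samples_split': [6, 8, 10]}
best_2 = {'max_depth': 10, 'min_samples_split': 8}
print(at_grid_edge(best_2, grid_2))  # []
```

An empty list suggests the next search can zoom in around the current best values, as `grid_3` does. A flagged name is only a hint: some hyperparameters cannot sensibly be pushed further (for example, `min_samples_split` cannot go below 2).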

# Better Process: Pipelines

> **Pipelines** can keep our code neat and clean all the way from gathering & cleaning our data, to creating models & fine-tuning them!

![](https://imgs.xkcd.com/comics/data_pipeline.png)

The `Pipeline` class from [Scikit-Learn's API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is especially convenient since it allows us to use our other Estimators that we know and love!

## Advantages of `Pipeline`

### Reduces Complexity

> You can focus on particular parts of the pipeline one at a time and debug or adjust parts as needed.

### Convenient

> The pipeline summarizes your fine-detail steps. That way you can focus on the big-picture aspects.

### Flexible

> You can use pipelines with different models and with GridSearch.

### Prevent Mistakes

> We can focus on one section at a time.
>
> We can also ensure that data leakage doesn't occur between our training dataset and our validation/testing datasets!
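
To illustrate the leakage point, the sketch below cross-validates a scaler and a model together as one pipeline. It uses the built-in iris data and illustrative names (`leak_free`, `scores`), so treat it as an assumption-laden example rather than the lesson's own code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Scaler and model travel together, so within each cross-validation split
# the scaler is fit on the training folds only and then applied to the
# held-out fold.
leak_free = Pipeline([
    ('ss', StandardScaler()),
    ('log_reg', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(leak_free, X_demo, y_demo, cv=5)
print(scores.mean())
```

Had we scaled all of `X_demo` once before cross-validating, the scaler's means and standard deviations would have been computed using the held-out rows too.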

## Example of Using `Pipeline`

In [None]:
# Getting some data
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=27)

### Without the Pipeline class

In [None]:
# Define transformers (will adjust/massage the data)
imputer = SimpleImputer(strategy="median") # replaces missing values
std_scaler = StandardScaler() # scales the data

# Define the classifier (predictor) to train
dt_clf = DecisionTreeClassifier(random_state=42)

# Have the classifier (and full pipeline) learn/train/fit from the data
X_train_filled = imputer.fit_transform(X_train)
X_train_scaled = std_scaler.fit_transform(X_train_filled)
dt_clf.fit(X_train_scaled, y_train)

# Predict using the trained classifier (still need to do the transformations)
X_test_filled = imputer.transform(X_test)
X_test_scaled = std_scaler.transform(X_test_filled)
y_pred = dt_clf.predict(X_test_scaled)
print(y_pred)

> Note that if we were to add more steps in this process, we'd have to change both the *training* and *testing* processes.

### With `Pipeline` Class

In [None]:
pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
        ('dt_clf', DecisionTreeClassifier(random_state=42)),
])


# Train the pipeline (transformations & predictor)
pipeline.fit(X_train, y_train)

# Predict using the pipeline (includes the transformers & trained predictor)
predicted = pipeline.predict(X_test)
print(predicted)

In [None]:
pipeline['imputer']

In [None]:
pipeline.named_steps

In [None]:
pipeline['dt_clf'].feature_importances_

> If we need to change our process, we change it _just once_ in the Pipeline
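
As a sketch of what "changing it just once" can look like, the cell below adjusts a pipeline with `set_params`. The step names (`imputer`, `scaler`, `tree`) and the swap to `MinMaxScaler` are illustrative assumptions, and iris is used so the cell runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

demo_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeClassifier(random_state=42)),
])

# Change one hyperparameter of one step with the <step>__<param> syntax...
demo_pipe.set_params(imputer__strategy='mean')
# ...or replace a whole step by name.
demo_pipe.set_params(scaler=MinMaxScaler())

# Fitting and predicting are unchanged: still one call each.
demo_pipe.fit(X_demo, y_demo)
print(demo_pipe.named_steps)
```

The `<step>__<param>` naming used here is the same convention a grid search uses to address hyperparameters inside a pipeline.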

## Grid Searching a Pipeline

> Let's first get our data prepared like we did before

In [None]:
penguins = sns.load_dataset('penguins')
penguins = penguins.dropna()

In [None]:
y = penguins.pop('species')
X_train, X_test, y_train, y_test = train_test_split(
    penguins, y, test_size=0.5, random_state=42)

In [None]:
X_train_nums = X_train.select_dtypes('float64')

ss = StandardScaler()

ss.fit(X_train_nums)
nums_df = pd.DataFrame(ss.transform(X_train_nums),
                      index=X_train_nums.index)

In [None]:
X_train_cat = X_train.select_dtypes('object')

ohe = OneHotEncoder(sparse=False)

dums = ohe.fit_transform(X_train_cat)
dums_df = pd.DataFrame(dums,
                       columns=ohe.get_feature_names(),
                       index=X_train_cat.index)

> Intermediary step to treat categorical and numerical data differently

### Using `ColumnTransformer`

In [None]:
X_train_nums.columns

In [None]:
numerical_pipeline = Pipeline(steps=[('ss', StandardScaler())])

# In scikit-learn >= 1.2, use `sparse_output=False` in place of `sparse=False`
categorical_pipeline = Pipeline(steps=[('ohe', OneHotEncoder(sparse=False,
                                                            handle_unknown='ignore'))])
transformer = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, X_train_nums.columns),
    ('cat', categorical_pipeline, X_train_cat.columns)])

In [None]:
model_pipe = Pipeline(steps=[('col_tr', transformer),
                             ('log_reg', LogisticRegression(random_state=42))])

> Finally, we can fit the full pipeline

In [None]:
model_pipe.fit(X_train, y_train)

In [None]:
model_pipe.score(X_train, y_train)

In [None]:
model_pipe.score(X_test, y_test)

> Performing grid search on the full pipeline

In [None]:
model_pipe.named_steps

In [None]:
model_pipe.named_steps['log_reg']

In [None]:
model_pipe['col_tr'].named_transformers_

In [None]:
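# Keys in the grid use the `<step>__<param>` pattern, e.g. 'log_reg__C'
# (left as None here, to be filled in)
pipe_grid = None
gs_pipe = GridSearchCV(estimator=model_pipe, param_grid=pipe_grid, verbose=2)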

In [None]:
gs_pipe.fit(X_train, y_train)

In [None]:
pd.DataFrame(gs_pipe.cv_results_)

In [None]:
gs_pipe.best_params_

In [None]:
gs_pipe.best_estimator_.score(X_test, y_test)
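
Because `pipe_grid` is left unfilled above, here is a self-contained sketch of a grid search over a pipeline of the same shape. Everything named `demo_*` is an assumption for illustration: the synthetic frame stands in for the penguins data, and the grid values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame (assumption) standing in for the penguins data.
rng = np.random.default_rng(0)
n = 120
demo_X = pd.DataFrame({
    'length': rng.normal(size=n),
    'mass': rng.normal(size=n),
    'island': rng.choice(['A', 'B', 'C'], size=n),
})
demo_y = (demo_X['length'] + demo_X['mass'] > 0).astype(int)

demo_transformer = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[('ss', StandardScaler())]), ['length', 'mass']),
    ('cat', Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))]),
     ['island']),
])
demo_pipe = Pipeline(steps=[('col_tr', demo_transformer),
                            ('log_reg', LogisticRegression(random_state=42))])

# Keys follow <step>__<param>; nested steps chain more double underscores.
demo_pipe_grid = {
    'log_reg__C': [0.1, 1, 10],
    'col_tr__num__ss__with_mean': [True, False],
}

gs_demo = GridSearchCV(estimator=demo_pipe, param_grid=demo_pipe_grid, cv=5)
gs_demo.fit(demo_X, demo_y)
print(gs_demo.best_params_)
```

Note that the grid can reach into the preprocessing steps as well as the model, so choices such as whether to center the numeric columns can be tuned alongside `C`.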

# Grid Search Exercise

Use a classifier of your choice to predict the category of price range for the phones in this dataset. Try tuning some hyperparameters using a grid search, and then write up a short paragraph about your findings.

In [None]:
phones_train = pd.read_csv('data/train.csv')
phones_test = pd.read_csv('data/test.csv')

# Level Up: Random Searching

It is also possible to search for good hyperparameter values randomly. This is a nice choice if computation time is an issue or if you are tuning over continuous hyperparameters.

### `RandomizedSearchCV` with `LogisticRegression`

In [None]:
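# l1_ratio must lie in [0, 1], so we sample it uniformly on that interval;
# an unbounded distribution such as stats.expon can draw values above 1
log_reg_grid = {'C': stats.uniform(loc=0, scale=10),
               'l1_ratio': stats.uniform(loc=0, scale=1)}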

In [None]:
rs = RandomizedSearchCV(estimator=LogisticRegression(penalty='elasticnet',
                                                    solver='saga',
                                                    max_iter=1000,
                                                    random_state=42),
                        param_distributions=log_reg_grid,
                       random_state=42, n_iter=100, verbose=2)

rs.fit(X_train_clean, y_train)

rs.best_params_
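
To see what a random search actually tries, we can draw a few candidates ourselves with `ParameterSampler`, the utility that `RandomizedSearchCV` uses to generate its candidates. The distributions in `demo_distributions` are illustrative assumptions; in particular, `stats.loguniform` is one common choice for a regularization constant that spans several orders of magnitude.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

# Continuous distributions: C on a log scale, l1_ratio restricted to [0, 1].
demo_distributions = {
    'C': stats.loguniform(1e-3, 1e2),
    'l1_ratio': stats.uniform(loc=0, scale=1),
}

# Draw five candidate settings; random_state makes the draws repeatable.
samples = list(ParameterSampler(demo_distributions, n_iter=5, random_state=42))
for sample in samples:
    print(sample)
```

Unlike a grid, the number of candidates (`n_iter`) is chosen directly, so the cost of the search does not grow with the number of hyperparameters being tuned.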