# `GridSearchCV`

Let's learn about a quick and easy way to conduct a cross-validation grid search without having to write a `for` loop.

## What we will accomplish

In this notebook we will:
- Work on a synthetic regression example,
- Refresh ourselves on hyperparameter tuning using cross-validation,
- Demonstrate what we would do using `KFold` and a `for` loop, and
- Compare that to what we would do using `GridSearchCV`.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

## Hyperparameter tuning

Over the course of our supervised learning content we talked a lot about choosing the "best" hyperparameter values (think $\alpha$ in Ridge/Lasso regression, or $k$ in $k$NN) using cross-validation. What this entails is fitting the model with each potential set of hyperparameter values on each cross-validation split and then recording the performance on the holdout set.

Typically you <i>tune</i> hyperparameters, i.e. find the "best" values, by setting up grids for your potential hyperparameter values. For example, in a $k$NN setting you would set up a grid that starts at your minimum number of neighbors (like $k=1$) and incrementally increases to your maximum number of neighbors (like $k=50$).
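As a minimal sketch of what such a grid looks like when there are two hyperparameters, the snippet below builds every pair of values with the standard library. The names `n_neighbors_grid`, `weights_grid` and `grid_points` are illustrative only; with 50 values for $k$ and 2 weightings there are $50 \times 2 = 100$ grid points to evaluate.

```python
from itertools import product

# Illustrative grids for two kNN hyperparameters
n_neighbors_grid = range(1, 51)          # k = 1, 2, ..., 50
weights_grid = ["uniform", "distance"]   # two weighting schemes

# Every (n_neighbors, weights) pair is one grid point to evaluate
grid_points = list(product(n_neighbors_grid, weights_grid))

print(len(grid_points))   # total number of grid points
print(grid_points[:3])    # the first few grid points
```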

### A synthetic regression example

Let's do just that with this synthetic data set.

In [2]:
np.random.seed(403940)
X_train = np.random.randn(500,2)
X_train[:,0] = 1.3*X_train[:,0] - 2
X_train[:,1] = .8*X_train[:,1] + 1.2

y_train = X_train[:,0] + 2.3*X_train[:,1] - 2 + np.random.randn(500)*1.3

We will fit a $k$NN regression model, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html</a>, on these data. In particular we will use a `for` loop and `KFold` to find the optimal values for `n_neighbors` and `weights`, where optimal indicates the pair with lowest average cross-validation mean squared error.

In [3]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

For `n_neighbors` we will go from `1` to `50` and for `weights` we will examine `'uniform'` and `'distance'`, the two string options accepted by `sklearn` (a user-defined function is also allowed, but we will not use one here).

In [4]:
n_neighbors = range(1,51)
weights = ['uniform', 'distance']

## this array will track performance across all splits, n_neighbors, and weights
cv_mses = np.zeros((5, len(n_neighbors), len(weights)))

kfold = KFold(5, shuffle=True, random_state = 30293)

## enumerate keeps track of the cv split (i),
## the neighbor index (j) and the weight index (k)
for i, (train_index, test_index) in enumerate(kfold.split(X_train, y_train)):
    X_tt = X_train[train_index, :]
    y_tt = y_train[train_index]
    X_ho = X_train[test_index, :]
    y_ho = y_train[test_index]

    for j, neighbors in enumerate(n_neighbors):
        for k, weighting in enumerate(weights):
            ## make the model object
            knn = KNeighborsRegressor(n_neighbors = neighbors,
                                         weights = weighting)

            ## fit the model
            knn.fit(X_tt, y_tt)

            ## get the prediction
            pred = knn.predict(X_ho)

            ## store the mse
            cv_mses[i,j,k] = mean_squared_error(y_ho, pred)

In [5]:
cv_mses

array([[[2.75474562, 2.75474562],
        [1.99994236, 2.09735104],
        [1.87835382, 1.92426462],
        [1.90543114, 1.90421841],
        [1.79600825, 1.80271275],
        [1.69080425, 1.73611774],
        [1.72373852, 1.74347667],
        [1.64698842, 1.6762637 ],
        [1.6188231 , 1.66280987],
        [1.61049348, 1.66230556],
        [1.58338572, 1.64534584],
        [1.57833326, 1.62499989],
        [1.5395133 , 1.5988052 ],
        [1.52037592, 1.58720613],
        [1.49018358, 1.56351529],
        [1.54302363, 1.57719165],
        [1.54535418, 1.57532226],
        [1.56890101, 1.58199961],
        [1.59530466, 1.5928034 ],
        [1.6080759 , 1.59151122],
        [1.61693128, 1.59549096],
        [1.62756032, 1.59831609],
        [1.62256692, 1.59834839],
        [1.6307974 , 1.60060228],
        [1.64711433, 1.60803054],
        [1.65801783, 1.60835818],
        [1.65543865, 1.60513952],
        [1.6643452 , 1.60855608],
        [1.68621437, 1.62061204],
        [1.675

In [6]:
np.mean(cv_mses, axis=0)

array([[3.40597303, 3.40597303],
       [2.48933939, 2.58118603],
       [2.16764589, 2.28241016],
       [2.07896939, 2.17419534],
       [1.96631815, 2.05821748],
       [1.90499419, 1.98915227],
       [1.88519734, 1.95839293],
       [1.85779432, 1.92633648],
       [1.87010324, 1.92586594],
       [1.86275467, 1.91984766],
       [1.87659966, 1.92196352],
       [1.88392735, 1.91861041],
       [1.87968395, 1.90893559],
       [1.89579771, 1.91203652],
       [1.88883957, 1.90580837],
       [1.92221424, 1.92047625],
       [1.91350006, 1.9133372 ],
       [1.91358352, 1.91170776],
       [1.93375882, 1.91990663],
       [1.94956733, 1.9264907 ],
       [1.95552602, 1.92722104],
       [1.97012458, 1.9354523 ],
       [1.99354604, 1.94853973],
       [1.99196601, 1.9464596 ],
       [1.99616217, 1.94808583],
       [2.00806386, 1.95309745],
       [2.01658474, 1.95590844],
       [2.02940948, 1.96279616],
       [2.04447708, 1.97085423],
       [2.05598166, 1.97690823],
       [2.

In [7]:
## This will get you the index where the minimum occurs
np.unravel_index(np.mean(cv_mses, axis=0).argmin(), np.mean(cv_mses, axis=0).shape)

(7, 0)
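To see what `np.unravel_index` is doing here, consider a small made-up table of average errors (these numbers are illustrative, not the notebook's results). `argmin` returns a position in the flattened array, and `np.unravel_index` converts that position back into a (row, column) pair.

```python
import numpy as np

# Made-up average errors: rows are values of k, columns are weightings
avg_errors = np.array([[3.4, 3.5],
                       [2.1, 2.6],
                       [1.8, 1.9],
                       [2.0, 1.95]])

# Position of the minimum in the flattened array
flat_position = avg_errors.argmin()

# Convert the flat position back into (row, column) indices
row, col = np.unravel_index(flat_position, avg_errors.shape)

print(flat_position, (row, col))
```

Here the smallest entry, `1.8`, sits at flat position `4`, which corresponds to row `2` and column `0`.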

In [8]:
## This is the min avg cv
np.mean(cv_mses, axis=0)[7,0]

1.8577943177900278

In [9]:
## this was the number of neighbors that got us there
n_neighbors[7]

8

In [10]:
## this was the weighting
weights[0]

'uniform'

This example involved a two-dimensional grid search, which was not too difficult to write up. But what if we wanted to tune a random forest model, which can involve far more than two hyperparameters?

At a certain point, explicitly writing a `for` loop becomes tedious.
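A rough sketch of why this gets out of hand: the number of grid points is the product of the grid lengths, so each additional hyperparameter multiplies the work and adds another nested loop. The grid below is a hypothetical choice of values for four random forest hyperparameters, used only for counting.

```python
from math import prod

# Hypothetical grids for four random forest hyperparameters
rf_grid = {
    "n_estimators": [100, 250, 500],
    "max_depth": range(1, 11),
    "min_samples_leaf": [1, 5, 10, 20],
    "max_features": ["sqrt", "log2"],
}

# The number of grid points is the product of the grid lengths
n_grid_points = prod(len(values) for values in rf_grid.values())

# With 5-fold cross-validation every grid point is fit 5 times
n_fits = 5 * n_grid_points

print(n_grid_points, n_fits)
```

With these grids there are $3 \times 10 \times 4 \times 2 = 240$ grid points, so 5-fold cross-validation needs 1200 model fits and the handwritten version would need five nested loops.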

Luckily, `sklearn` provides a model selection object called `GridSearchCV`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</a>, which will implement the grid search hyperparameter tuning for you.

Let's demonstrate how now.

In [11]:
## import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [13]:
## when defining a GridSearchCV object you first
## place in an empty model object,
## Then a dictionary containing the grids for each of the parameters you
## want to test into param_grid,
## Then a string with how you want to "score" the model in scoring,
## Then how many splits you want to use in cv, the default is 5
grid_cv = GridSearchCV(KNeighborsRegressor(),
                          param_grid = {'n_neighbors':range(1,51),
                                           'weights':['uniform', 'distance']},
                          scoring = 'neg_mean_squared_error',
                          cv = 5)


## Then you call to fit
grid_cv.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': range(1, 51),
                         'weights': ['uniform', 'distance']},
             scoring='neg_mean_squared_error')

When you call `.fit` the `GridSearchCV` object goes through the same `for` loop procedure we manually coded above and finds the `scoring` value for each of the grid points you have fed into the object.

When it is done we can access the results.

In [14]:
## You can find the hyperparameter grid point that
## gave the best performance like so
## .best_params_
grid_cv.best_params_

{'n_neighbors': 10, 'weights': 'uniform'}

In [15]:
## You can find the best score like so
## .best_score_
grid_cv.best_score_

-1.9408206698529198

In [16]:
## You can get all of the results with cv_results_
grid_cv.cv_results_

{'mean_fit_time': array([0.0006247 , 0.00045681, 0.000284  , 0.00024347, 0.00028338,
        0.00023279, 0.00022817, 0.00024595, 0.00025039, 0.00028114,
        0.00026884, 0.00028048, 0.00021338, 0.00019336, 0.00019073,
        0.00018363, 0.00018039, 0.00016966, 0.00016308, 0.00015502,
        0.00018144, 0.00015039, 0.00014362, 0.00014496, 0.00014234,
        0.0001442 , 0.00014706, 0.00014281, 0.00014281, 0.00014567,
        0.0001667 , 0.00015607, 0.0001513 , 0.00014482, 0.00015025,
        0.00014353, 0.00014439, 0.0001442 , 0.00014496, 0.00014157,
        0.00014577, 0.00014639, 0.00017762, 0.00016217, 0.00014882,
        0.00014796, 0.00014672, 0.00014596, 0.00014768, 0.00014839,
        0.00014682, 0.00015092, 0.00017204, 0.00014901, 0.00014796,
        0.00014453, 0.00014486, 0.00014591, 0.00014534, 0.00014572,
        0.00015473, 0.00016556, 0.00015249, 0.00014505, 0.00014539,
        0.00014315, 0.00014982, 0.00014348, 0.00014582, 0.00014386,
        0.00014377, 0.00018959,
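The raw `cv_results_` dictionary is hard to read. One common way to inspect it, sketched below, is to load it into a `pandas` `DataFrame` and sort by `rank_test_score`. The snippet is self-contained and uses a small stand-in data set and a smaller grid than the notebook, so the numbers it prints will differ from the ones above; `search` and `results` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Small stand-in data set (not the notebook's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 2.3 * X[:, 1] + rng.normal(scale=1.3, size=200)

search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": range(1, 11),
                                  "weights": ["uniform", "distance"]},
                      scoring="neg_mean_squared_error",
                      cv=5)
search.fit(X, y)

# One row per grid point, sorted so the best-ranked grid point comes first
results = pd.DataFrame(search.cv_results_)
columns = ["param_n_neighbors", "param_weights",
           "mean_test_score", "rank_test_score"]
results = results[columns].sort_values("rank_test_score")

print(results.head())
```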

In [17]:
## Accessing best_estimator_ returns the model with the
## best avg cv performance after it has been refit on the
## entire data set (this refit happens because refit=True by default)
grid_cv.best_estimator_

KNeighborsRegressor(n_neighbors=10)

In [18]:
grid_cv.best_estimator_.predict(X_train)

array([ 2.66406182e+00, -2.45968771e+00, -4.89492604e+00, -4.76496435e-01,
       -4.45621552e+00, -1.25865530e+00,  1.36711840e+00,  5.08317141e-01,
       -1.02505020e+00,  1.03691282e-01, -2.65138839e+00, -3.18878434e+00,
       -2.02163229e+00, -6.34562885e-01, -9.39628106e-01, -7.23541915e-01,
        7.41504489e-01, -2.77229871e+00, -1.64875801e+00, -1.94042017e+00,
       -2.45204726e+00, -8.09182741e-01, -1.35923745e+00,  1.29652798e+00,
       -3.95679706e+00, -3.73020639e+00,  1.98105267e+00, -1.91736169e+00,
       -1.12763929e+00, -3.58315507e+00, -1.54321779e+00,  2.22248803e+00,
       -4.95989624e+00, -6.40415392e-01, -5.81709063e+00, -1.44056473e+00,
       -2.07722656e+00,  3.21889574e-01, -1.29334651e+00, -2.82134424e+00,
       -4.52088407e+00, -1.18745183e+00, -3.47863494e+00, -1.65165548e+00,
       -1.96751391e+00, -4.56189396e-01, -4.55661743e+00, -6.21850596e+00,
       -1.88342299e+00, -4.06844198e+00, -1.92134892e+00, -1.62113676e+00,
       -2.75825189e+00, -
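Because the best model is refit on the entire data set by default, the fitted `GridSearchCV` object can also be used for prediction directly. The sketch below, on a small stand-in data set rather than the notebook's data, checks that predicting with the search object matches predicting with its `best_estimator_`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Small stand-in data set (not the notebook's data)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 2.3 * X[:, 1] + rng.normal(scale=1.3, size=200)
X_new = rng.normal(size=(5, 2))

search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": [2, 4, 8, 16]},
                      scoring="neg_mean_squared_error",
                      cv=5)   # refit=True is the default
search.fit(X, y)

# Both calls use the best model refit on all of X, y
pred_from_search = search.predict(X_new)
pred_from_model = search.best_estimator_.predict(X_new)

print(np.allclose(pred_from_search, pred_from_model))
```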

Before we end this notebook you may have a couple of questions.

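First, you may have noticed that our score was `'neg_mean_squared_error'` instead of something like `'mean_squared_error'`. This is because `GridSearchCV` treats larger scores as better, so `sklearn` offers error metrics such as the mean squared error only in negated form. This is also why `best_score_` above is negative: its negative is the average cross-validation mean squared error.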
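As a small illustration of the sign convention, the sketch below scores a $k$NN model with `cross_val_score` on a stand-in data set (not the notebook's data). The scores come back as non-positive numbers, and flipping the sign recovers the mean squared error on each holdout set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Small stand-in data set (not the notebook's data)
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 2.3 * X[:, 1] + rng.normal(scale=1.3, size=200)

# One negated MSE per cross-validation split
neg_mses = cross_val_score(KNeighborsRegressor(n_neighbors=8),
                           X, y,
                           scoring="neg_mean_squared_error",
                           cv=5)

# Flip the sign to get the usual mean squared errors
mses = -neg_mses

print(neg_mses)
print(mses.mean())
```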

To see what metrics are available as scoring options we can run the following code.

In [19]:
from sklearn.metrics import SCORERS

In [20]:
SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_wei

Another thing you may notice is that the model chosen by the `GridSearchCV` is not the same as the model we found with our handwritten `for` loop.

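If you compare the performance, their average cross-validation scores are close, indicating that this is likely due to `GridSearchCV` using a different cross-validation split than ours. With `cv = 5` it uses five unshuffled folds for a regression model, while our `KFold` object shuffled the data first. We can rectify that by passing the same `KFold` object to `cv`, like so.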

In [21]:
grid_cv = GridSearchCV(KNeighborsRegressor(), 
                         param_grid = {'n_neighbors':range(1,51),
                                          'weights':['uniform', 'distance']}, 
                         scoring = 'neg_mean_squared_error',
                         cv = KFold(5, shuffle=True, random_state = 30293))

grid_cv.fit(X_train, y_train)

GridSearchCV(cv=KFold(n_splits=5, random_state=30293, shuffle=True),
             estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': range(1, 51),
                         'weights': ['uniform', 'distance']},
             scoring='neg_mean_squared_error')

In [22]:
grid_cv.best_params_

{'n_neighbors': 8, 'weights': 'uniform'}

In [23]:
grid_cv.best_score_

-1.8577943177900278
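The agreement above is not a coincidence. The self-contained sketch below, which uses a small stand-in data set and a short grid rather than the notebook's, runs both the handwritten loop and `GridSearchCV` with the same `KFold` object and checks that the average cross-validation mean squared errors match at every grid point.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Small stand-in data set (not the notebook's data)
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 2.3 * X[:, 1] + rng.normal(scale=1.3, size=200)

k_values = [2, 4, 8, 16]
kfold = KFold(5, shuffle=True, random_state=30293)

# Handwritten grid search: one MSE per (split, k) pair
manual_mses = np.zeros((5, len(k_values)))
for i, (train_idx, test_idx) in enumerate(kfold.split(X, y)):
    for j, k in enumerate(k_values):
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(X[train_idx], y[train_idx])
        pred = knn.predict(X[test_idx])
        manual_mses[i, j] = mean_squared_error(y[test_idx], pred)
manual_avg = manual_mses.mean(axis=0)

# GridSearchCV with the same splits
search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": k_values},
                      scoring="neg_mean_squared_error",
                      cv=kfold)
search.fit(X, y)

# Scores are negated MSEs, so flip the sign before comparing
grid_avg = -search.cv_results_["mean_test_score"]

print(np.allclose(manual_avg, grid_avg))
```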

You now have a good base understanding of `GridSearchCV`. If you would like to learn more you can experiment on your own, or read the documentation, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</a>.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).