- Parameters and hyperparameters
- Hyperparameter optimization: the validation dataset
- Cross validation
- Grid search and randomized search

## Hyperparameters 💪

___

![](https://cdn-images-1.medium.com/max/1600/0*K-0v6zWiCXt2_FHB)

___

Previously, we learnt to prepare our data in order to train machine learning models.

Today, we will see that we can *boost* our model results by playing with **hyperparameters**.

# I. Cross validation

## I.1. Evaluating the performance of a model

Until now, to evaluate the performance of a model, we have been checking its score on a test set. But this process is actually quite problematic: **in real life, we do not have a test set**! The test data are real-life data for which we do not have a label (i.e. no `y_test`) and for which we need a prediction. And since we do not have a `y_test`, there is no way to compute the score of the model!

> 🔍 For instance, if we need to predict tomorrow's weather, we can train a model with the data from the previous days and then make a prediction; but we do not yet know what tomorrow's weather will be, so we have no way to evaluate the accuracy of the model's prediction!

That's why we will use the **cross-validation technique**.

## I.2. The validation dataset

We are going to split our data into three different sets:
- the training set
- the test set
- **the validation set**

A common split is, for example, 60% for training, 20% for validation and 20% for testing.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1ahZYvfiqQumVw-z0FkDVue7Pt9krAGCk" width="400px">
</p>

This new **validation set** is another set of observations used to **evaluate the performance of the model** and then **tune its hyperparameters** (not the weights, but the options we give to the model before fitting).

> ➡️ In other words, **the validation set is a subset used to evaluate your model's performance along the way, while you are choosing and building the best model**.

The validation dataset is commonly noted `X_val` (for the features) and `y_val` (for the corresponding targets).
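
As a minimal sketch of such a 60% - 20% - 20% split, one option is to call `train_test_split` twice. The toy arrays `X_demo` and `y_demo` below are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations, 2 features, binary target
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# First split: set 20% of the data aside as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Second split: 25% of the remaining 80% gives a 20% validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The second `test_size` is 0.25 rather than 0.2 because it applies to the 80% that remains after the first split.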

## I.3. The cross-validation technique

The **cross-validation** technique allows you to train and evaluate your model on all the labeled data available (except the test set).

The process of the cross-validation is the following:
- You split your data into a training set and a test set (for example 80%-20%)
- You split your training set into `cv` folds (for example `cv=10`)
- You train on all folds except one and evaluate on the fold kept aside; you do that `cv` times, keeping a different fold aside each time
- For each training, you compute the error on the fold that was kept aside
- The error is then averaged over the `cv` folds and is called the **cross-validation error**.

This way, **you get a more reliable estimate of your model's performance**, since **every observation outside the test set** is used **for both training and validation**.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=18iNqOegmMWxGkGbVD2rYPzsH4D97jRNC" width="700px">
</p>

> 🔦 **Hint**: This method works great on small to medium-sized datasets (< 10,000 - 100,000 rows). It is however not appropriate for big datasets (> 100,000 - 1M rows, depending on your model complexity), because the model has to be trained `cv` times.

### I.3.A. Implementing a K-Fold cross-validation

Scikit-learn provides an easy tool to split data into a train set and a validation set according to a given number of splits (also called **folds**).

In [45]:
import numpy as np
from sklearn.model_selection import KFold

In order to understand what the `KFold` does, let's work on a very small and fictional dataset.

In [46]:
# Let's work on fictional data
data = np.array([2, 4, 6, 5, 1, 7, 3, 9, 11])               # The features
y = np.array(['R', 'R', 'R', 'B', 'B', 'B', 'B', 'B', 'B']) # The target

In [47]:
# Instantiating the KFold object
kfold = KFold(n_splits=3, shuffle=True, random_state=0)

⚠️ The `.split()` method of the `KFold` object returns arrays of **indices**, not of values.

In [48]:
# See the results of the kfold split by calling the method `.split()`
list(kfold.split(data))

[(array([0, 3, 4, 5, 6, 8]), array([1, 2, 7])),
 (array([0, 1, 2, 3, 5, 7]), array([4, 6, 8])),
 (array([1, 2, 4, 6, 7, 8]), array([0, 3, 5]))]

In [49]:
# Looping to see the split values
counter = 1
for train, val in kfold.split(data, y):
    print(f"Fold n°{counter}")
    counter += 1
    print(f"X_train : {data[train]}, y_train : {y[train]}")
    print(f"X_val : {data[val]}, y_val : {y[val]}")
    print("----------")

Fold n°1
X_train : [ 2  5  1  7  3 11], y_train : ['R' 'B' 'B' 'B' 'B' 'B']
X_val : [4 6 9], y_val : ['R' 'R' 'B']
----------
Fold n°2
X_train : [2 4 6 5 7 9], y_train : ['R' 'R' 'R' 'B' 'B' 'B']
X_val : [ 1  3 11], y_val : ['B' 'B' 'B']
----------
Fold n°3
X_train : [ 4  6  1  3  9 11], y_train : ['R' 'R' 'B' 'B' 'B' 'B']
X_val : [2 5 7], y_val : ['R' 'B' 'B']
----------


In a more visual way:

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1NMSWJv8XMzY8tAM8pVibu_NashNYGhQA">
</p>

The `KFold` object only gives out the `K` splits of the data. It does not fit or evaluate any model, which is something we have to do ourselves. Thankfully, it is quite easy to do with a simple **for loop**!

In [50]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [60]:
scores = []
for train, val in kfold.split(data, y):
    
    # Creating the train set and validation set
    X_train = data[train].reshape(-1, 1)
    y_train = y[train]
    X_val = data[val].reshape(-1, 1)
    y_val = y[val]
    
    # Selecting the model, fitting it, and making a prediction
    svc = SVC()
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_val)
    
    # Filling a list of accuracy scores
    acc = accuracy_score(y_val, y_pred)
    scores.append(acc)

In [61]:
# We get a list of the model's scores on each fold
scores

[0.3333333333333333, 0.3333333333333333, 0.3333333333333333]

In [62]:
# The mean accuracy score gives us a good idea of the performance of our model
np.mean(scores)

0.3333333333333333

The score is not very good, which can be explained by two factors:
- We are working on a tiny, fictional dataset, which does not allow our classification model to perform well.
- More importantly, the K-Fold split is not **stratified**.

### I.3.B. Implementing a Stratified K-Fold cross-validation

The `StratifiedKFold` object in `scikit-learn` works like `KFold`. The main difference is that, when calling the method `.split()`, we must give it the target (the `y`), as it uses it to create the folds.

In [54]:
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=3)

In [55]:
# StratifiedKFold gives K arrays of indexes
list(skfold.split(data, y))

[(array([1, 2, 5, 6, 7, 8]), array([0, 3, 4])),
 (array([0, 2, 3, 4, 7, 8]), array([1, 5, 6])),
 (array([0, 1, 3, 4, 5, 6]), array([2, 7, 8]))]

In [56]:
# Looping to see the split data
counter = 1
for train, val in skfold.split(data, y):
    print(f"Fold n°{counter}")
    counter += 1
    print(f"X_train : {data[train]}, y_train : {y[train]}")
    print(f"X_val : {data[val]}, y_val : {y[val]}")
    print("----------")

Fold n°1
X_train : [ 4  6  7  3  9 11], y_train : ['R' 'R' 'B' 'B' 'B' 'B']
X_val : [2 5 1], y_val : ['R' 'B' 'B']
----------
Fold n°2
X_train : [ 2  6  5  1  9 11], y_train : ['R' 'R' 'B' 'B' 'B' 'B']
X_val : [4 7 3], y_val : ['R' 'B' 'B']
----------
Fold n°3
X_train : [2 4 5 1 7 3], y_train : ['R' 'R' 'B' 'B' 'B' 'B']
X_val : [ 6  9 11], y_val : ['R' 'B' 'B']
----------


What do you notice, compared to the splits created by the `KFold`?

➡️ The splits given by the `StratifiedKFold` object are **stratified**.

> From the [Wikipedia definition](https://en.wikipedia.org/wiki/Stratified_sampling): *Stratification is the process of dividing members of the population into homogeneous subgroups before sampling.*

In other words, the `StratifiedKFold` created subgroups that all contain the same distribution of the `"R"` and `"B"` classes. More importantly, this distribution (1/3 of `"R"` and 2/3 of `"B"`) is consistent with the initial distribution of classes in the dataset.
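
To check this class distribution without reading the printed arrays, here is a small sketch that counts the classes in each validation fold. It reuses the same fictional data; the variable `fold_counts` is introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same fictional data as above
data = np.array([2, 4, 6, 5, 1, 7, 3, 9, 11])
y = np.array(['R', 'R', 'R', 'B', 'B', 'B', 'B', 'B', 'B'])

skfold = StratifiedKFold(n_splits=3)

fold_counts = []
for train, val in skfold.split(data, y):
    # Count how many times each class appears in the validation fold
    classes, counts = np.unique(y[val], return_counts=True)
    fold_counts.append({str(c): int(n) for c, n in zip(classes, counts)})

# Each validation fold holds 1 "R" and 2 "B", like the full dataset (3 "R", 6 "B")
print(fold_counts)
```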

In a more visual way:

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1NuwKkKmp0yX2Z3S2q8nMYg32KkBVaBlR">
</p>

In [57]:
# Let's actually do the cross-validation process now

scores = []
for train, val in skfold.split(data, y):
    
    # Creating the train set and validation set
    X_train = data[train].reshape(-1, 1)
    y_train = y[train]
    X_val = data[val].reshape(-1, 1)
    y_val = y[val]
    
    # Fitting the model on the data
    svc = SVC()
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_val)
    
    # Getting the model's score on each fold
    acc = accuracy_score(y_val, y_pred)
    scores.append(acc)

In [58]:
# We get a list of the model's score on each fold
scores

[0.3333333333333333, 0.6666666666666666, 0.6666666666666666]

In [59]:
# The mean score gives a good idea of the model's performance !
np.mean(scores)

0.5555555555555555

The score is better, because we have trained and validated the model on subsets of the data that are actually **representative of the whole dataset**.
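
The manual loop is useful to understand what happens, but scikit-learn also offers `cross_val_score`, which runs the same fit-and-score procedure in one call. The sketch below, on the same fictional data, checks that both approaches give the same per-fold accuracies; the names `manual_scores` and `auto_scores` are introduced for the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

data = np.array([2, 4, 6, 5, 1, 7, 3, 9, 11]).reshape(-1, 1)
y = np.array(['R', 'R', 'R', 'B', 'B', 'B', 'B', 'B', 'B'])
skfold = StratifiedKFold(n_splits=3)

# Manual loop, as in the cells above
manual_scores = []
for train, val in skfold.split(data, y):
    svc = SVC().fit(data[train], y[train])
    manual_scores.append(accuracy_score(y[val], svc.predict(data[val])))

# Same evaluation in one call: the splitter is passed through `cv`
auto_scores = cross_val_score(SVC(), data, y, cv=skfold, scoring="accuracy")

print(manual_scores)
print(auto_scores)
```

The exact values depend on the default settings of `SVC` in your scikit-learn version, so they may differ from the scores printed above.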

### I.3.C. Recap: `KFold` vs `StratifiedKFold`

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1gVp5H4P_ObplwLRZSEiRLSnJrwwpGW9k">
</p>

From now on, we will **always use the cross-validation technique** in order to evaluate the performance of our model. For classification tasks, it is also generally better to use a **stratified** method of sub-sampling the data, so that each fold keeps the class distribution of the whole dataset.

But how do you apply it in order to **find the optimal hyperparameters**? And what do we mean exactly by **hyperparameters**?

# II. Previous results

Let's work on a real dataset. We cleaned our **Titanic dataset**, and saved it in **pickle format**.

Let's load it again.

In [64]:
import pickle
import os

data_path = os.path.join("..", "..", "05-Data-Preparation", "00-Lectures", "data_cleaned.pkl")

with open(data_path, "rb") as f:
    data_cleaned = pickle.load(f)

data_cleaned.head(n=3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Embarked_Val_C | Embarked_Val_Q | Embarked_Val_S | isAlone | Title | Title_val_Master. | Title_val_Miss. | Title_val_Mr. | Title_val_Mrs. | Title_val_Rare_Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | ... | 0 | 0 | 1 | 0 | Mr. | 0 | 0 | 1 | 0 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 1 | 0 | 0 | 0 | Mrs. | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | ... | 0 | 0 | 1 | 1 | Miss. | 0 | 1 | 0 | 0 | 0 |


In [65]:
# Picking the columns we want to work on
features_to_use = ["Pclass", "Age", "Fare", "SibSp", "Parch",
                   "Sex", "Embarked_Val_C", "Embarked_Val_Q",
                   "isAlone", "Title_val_Mr.", "Title_val_Mrs.",
                   "Title_val_Rare_Title", "Title_val_Miss."] 

In [66]:
# Creating the features X and the target y
X = data_cleaned[features_to_use]
y = data_cleaned['Survived']

## II.1. Let's use the cross-validation technique to evaluate the performance of an SVM classifier on our dataset

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

Obviously, we still need to split our dataset with a `train_test_split`.

In [107]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Let's now evaluate the performance of our model with a cross-validation.

❓ But do we apply the cross-validation on `X` or on `X_train`?

➡️ The cross validation is **always applied to the training set**. **The test set must remain unseen** until the very end of the process, the final evaluation.

In [69]:
# Cross val with a SVC
# Let's use the StratifiedKFold right away

skfold = StratifiedKFold(n_splits=5)

scores = []
for train, val in skfold.split(X_train, y_train):
    
    # Creating the train set and validation set
    X_train_val = X_train.iloc[train]
    y_train_val = y_train.iloc[train]
    X_val = X_train.iloc[val]
    y_val = y_train.iloc[val]
    
    # Fitting the model on the data
    svc = SVC()
    svc.fit(X_train_val, y_train_val)
    y_pred = svc.predict(X_val)
    
    # Getting the model's score on each fold
    acc = accuracy_score(y_val, y_pred)
    scores.append(acc)

In [70]:
np.mean(scores)

0.7440220652387268

We obtain an accuracy score of 74.4% with an SVC.
Let's improve that score!

---

# III. Hyperparameters

By reading the scikit-learn documentation for [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), we observe that the model can take various optional parameters.

For example `C`, `kernel`, etc.

<img src="https://drive.google.com/uc?export=view&id=1AeUUZSyCwYvc-mwzOBH4exV0ZvyL3WZq" width="100%">

We call them **hyperparameters**.

> 🔦 **Hint**: Hyperparameters are parameters that are not directly learnt by the model. They are passed as arguments to the model constructor. They are like the "options" or settings of the model.
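
To make the distinction between hyperparameters and learnt parameters concrete, here is a minimal sketch on hypothetical toy data (`X_demo`, `y_demo`). A linear kernel is chosen only because it exposes the learnt weights through `coef_`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])

svc = SVC(kernel="linear", C=10)

# Hyperparameters: chosen by us, available before any training
print(svc.get_params()["C"], svc.get_params()["kernel"])

svc.fit(X_demo, y_demo)

# Parameters: learnt from the data, available only after `.fit()`
print(svc.coef_, svc.intercept_)
```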

By default, the hyperparameter `C` is set to `1.0`.

What if we retrained our model with new values of `C`? With `C=0.1`? `C=10`? `C=100`?

In [71]:
# Looping through different values of the parameter `C`
# And still evaluating the results with a cross-validation

skfold = StratifiedKFold(n_splits=5)  # Let's use stratification right away

for C in [0.1, 1, 10, 100, 1000]:
    print(f"C={C}")
    scores = []
    
    for train, val in skfold.split(X_train, y_train):
        # Creating the train set and the validation set
        X_train_val = X_train.iloc[train]
        y_train_val = y_train.iloc[train]
        X_val = X_train.iloc[val]
        y_val = y_train.iloc[val]
        
        # Fitting the model with a varying `C` parameter on the data
        svc = SVC(C=C)
        svc.fit(X_train_val, y_train_val)
        y_pred = svc.predict(X_val)
        
        # Getting the scores
        acc = accuracy_score(y_val, y_pred)
        scores.append(acc)
    
    print(f"Mean accuracy on 5 folds : {np.mean(scores)}")


C=0.1
Mean accuracy on 5 folds : 0.6244779693386227
C=1
Mean accuracy on 5 folds : 0.7440220652387268
C=10
Mean accuracy on 5 folds : 0.753851951664358
C=100
Mean accuracy on 5 folds : 0.7412248624415241
C=1000
Mean accuracy on 5 folds : 0.7312562474983811


Our accuracy changes! In our case, setting the value of `C` to 10 leads to an **increase of the accuracy**. Interesting.. 😏

This could mean we can **search for the optimal values of hyperparameters** (that maximize the accuracy).

---

## III.1. Tuning hyperparameters

Training a model can take some time, and we cannot test every possible value of every hyperparameter in order to find the optimal ones (this would be too costly in time and energy).

Instead, we can **follow a search strategy** such as:
- Testing every combination from a specified grid of values: grid search, implemented by `GridSearchCV`
- Testing combinations sampled at random: randomized search, implemented by `RandomizedSearchCV`

## III.2. `GridSearchCV`: Exhaustive Grid Search

The grid search provided by `GridSearchCV` exhaustively generates candidates from a grid of parameter values specified with the `param_grid` parameter.

For instance, the following `param_grid`:

``` python
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
```

specifies that two grids should be explored:
- one with a **linear kernel** and **C** values in [1, 10, 100, 1000]
- a second one with an **RBF kernel**, and the cross-product of **C** values ranging in [1, 10, 100, 1000] and **gamma** values in [0.001, 0.0001]
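
To see which candidates such a `param_grid` generates, one option is scikit-learn's `ParameterGrid` helper, which enumerates the combinations without fitting anything. A minimal sketch (the variable `candidates` is introduced for the illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

candidates = list(ParameterGrid(param_grid))

# 4 linear candidates + 4 * 2 RBF candidates = 12 candidates
print(len(candidates))

for params in candidates:
    print(params)
```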

When searching for the hyperparameters, we use the training set and the **cross-validation** technique to choose the optimal hyperparameters, hence the `CV` at the end of the class name.

In [73]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1, 10, 100], 
              'gamma': [0.001, 0.01, 0.1]}

clf = SVC()
grid = GridSearchCV(clf,
                   param_grid,
                   cv=3,       # To compare the hyperparameters (on the train set),
                               # we use the cross-validation technique.
                               # 3 is the number of folds of the cross-validation.
                   verbose=1,  # Adds some "prints" (logs) detailing what is happening
                               # in the backend. The higher the setting, the more logs.
                   n_jobs=-1   # Runs the fits in parallel on all available CPU cores
                  )
grid.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:    0.4s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

### All results: `cv_results_`

What if we want to inspect all the results computed by the grid search, for every combination tested?

In [74]:
grid.cv_results_

{'mean_fit_time': array([0.013304  , 0.01612131, 0.02281213, 0.01000094, 0.01259081,
        0.02087164, 0.02064967, 0.02612297, 0.02734367]),
 'std_fit_time': array([0.00229677, 0.00048986, 0.0007653 , 0.00125901, 0.00064471,
        0.00408818, 0.00380952, 0.0033561 , 0.00799329]),
 'mean_score_time': array([0.00608134, 0.00878922, 0.00975386, 0.00394114, 0.00394376,
        0.00563534, 0.00261768, 0.00288335, 0.00463994]),
 'std_score_time': array([5.00377837e-04, 2.38371811e-03, 4.68590354e-03, 8.98845871e-04,
        1.16538358e-03, 1.94929950e-03, 7.79261277e-05, 2.58881544e-05,
        7.65181078e-04]),
 'param_C': masked_array(data=[1, 1, 1, 10, 10, 10, 100, 100, 100],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'param_gamma': masked_array(data=[0.001, 0.01, 0.1, 0.001, 0.01, 0.1, 0.001, 0.01, 0.1],
              mask=[False, False, False, False, False, False, False,

In [75]:
# To visualize these cv_results more easily, we can transform the dict into a DataFrame
import pandas as pd

pd.DataFrame(grid.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_gamma | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.013304 | 0.002297 | 0.006081 | 0.0005 | 1 | 0.001 | `{'C': 1, 'gamma': 0.001}` | 0.746835 | 0.708861 | 0.649789 | 0.701828 | 0.03993 | 9 |
| 1 | 0.016121 | 0.00049 | 0.008789 | 0.002384 | 1 | 0.01 | `{'C': 1, 'gamma': 0.01}` | 0.7173 | 0.738397 | 0.687764 | 0.714487 | 0.020766 | 8 |
| 2 | 0.022812 | 0.000765 | 0.009754 | 0.004686 | 1 | 0.1 | `{'C': 1, 'gamma': 0.1}` | 0.742616 | 0.7173 | 0.729958 | 0.729958 | 0.010335 | 5 |
| 3 | 0.010001 | 0.001259 | 0.003941 | 0.000899 | 10 | 0.001 | `{'C': 10, 'gamma': 0.001}` | 0.822785 | 0.827004 | 0.78903 | 0.81294 | 0.016994 | 1 |
| 4 | 0.012591 | 0.000645 | 0.003944 | 0.001165 | 10 | 0.01 | `{'C': 10, 'gamma': 0.01}` | 0.801688 | 0.763713 | 0.772152 | 0.779184 | 0.016281 | 3 |
| 5 | 0.020872 | 0.004088 | 0.005635 | 0.001949 | 10 | 0.1 | `{'C': 10, 'gamma': 0.1}` | 0.725738 | 0.704641 | 0.729958 | 0.720113 | 0.011075 | 6 |
| 6 | 0.02065 | 0.00381 | 0.002618 | 7.8e-05 | 100 | 0.001 | `{'C': 100, 'gamma': 0.001}` | 0.839662 | 0.810127 | 0.78903 | 0.81294 | 0.020766 | 1 |
| 7 | 0.026123 | 0.003356 | 0.002883 | 2.6e-05 | 100 | 0.01 | `{'C': 100, 'gamma': 0.01}` | 0.805907 | 0.751055 | 0.780591 | 0.779184 | 0.022415 | 3 |
| 8 | 0.027344 | 0.007993 | 0.00464 | 0.000765 | 100 | 0.1 | `{'C': 100, 'gamma': 0.1}` | 0.738397 | 0.71308 | 0.708861 | 0.720113 | 0.013043 | 6 |


### Optimal hyperparameters: `best_params_`

In [76]:
grid.best_params_

{'C': 10, 'gamma': 0.001}

### Best score found (mean score on all folds used as validation set): `best_score_`

In [77]:
# All the mean scores found for each point of the param_grid
grid.cv_results_["mean_test_score"]

array([0.70182841, 0.71448664, 0.72995781, 0.81293952, 0.77918425,
       0.72011252, 0.81293952, 0.77918425, 0.72011252])

In [78]:
# Best score found
grid.best_score_

0.8129395218002813

### Best estimator: `best_estimator_`

In [79]:
grid.best_estimator_

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
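
Since the search was run with `refit=True` (the default, visible in the output of the `GridSearchCV` cell above), this best estimator has been refitted on the whole training set. It can therefore be used for the final evaluation on the test set, which has stayed unseen so far. The sketch below illustrates that last step on hypothetical synthetic data from `make_classification`, not on the Titanic dataset, so its numbers are not comparable with the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the real features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_train, y_train)  # the search only ever sees the training set

# The search object delegates `.score()` to the refitted best estimator
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_, test_accuracy)
```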

### Hyperparameters tested: `param_grid`

In [80]:
grid.param_grid

{'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]}

## III.3. `RandomizedSearchCV` : Randomized Search

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. 

**`RandomizedSearchCV` implements a randomized search over parameters**, where **each setting is sampled from a distribution of possible parameter values**. This has two main benefits over an exhaustive search:
- A budget (e.g. a number of search iterations) can be chosen independently of the number of parameters and possible values
- Adding parameters that do not influence the performance does not decrease efficiency

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for `GridSearchCV`. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the `n_iter` parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified:

``` python
random_grid = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf', 'linear']
}
```

This example uses the `scipy.stats` module, which contains many useful distributions for sampling parameters, such as `expon`, `gamma`, `uniform` or `randint`. In principle, any object that provides an `rvs` (random variate sample) method to sample a value can be passed. Consecutive calls to the `rvs` method should provide independent random samples from the possible parameter values.

> 🔦 **Hint**: In contrast to `GridSearchCV`, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by `n_iter`.
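
To get a feel for what gets sampled, the sketch below draws a few candidate settings with scikit-learn's `ParameterSampler`, the counterpart of `ParameterGrid` for distributions. The `random_state` is fixed only to make the draw repeatable, and `samples` is a name introduced for the illustration.

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

param_dist = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf', 'linear'],
}

# Draw 5 candidate settings: one value per hyperparameter for each candidate
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))

for candidate in samples:
    print(candidate)
```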

In [99]:
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats

clf = SVC()

param_dist = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1)
    }

n_iter_search = 50 # n_iter is the number of hyperparameter settings that are tried
grid = RandomizedSearchCV(clf,
                         param_distributions=param_dist,
                         n_iter=n_iter_search,
                         verbose=1,
                         cv=5, 
                         n_jobs=-1)

grid.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:    3.8s finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=SVC(C=1.0, cache_size=200, class_weight=None,
                                 coef0=0.0, decision_function_shape='ovr',
                                 degree=3, gamma='auto_deprecated',
                                 kernel='rbf', max_iter=-1, probability=False,
                                 random_state=None, shrinking=True, tol=0.001,
                                 verbose=False),
                   iid='warn', n_iter=50, n_jobs=-1,
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a23beca20>,
                                        'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a23becbe0>},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring=None, verbose=1)

Again, we can retrieve all information from our classifier search.

### All results: `cv_results_`

What if we want to inspect all the results computed by the random search, for every setting sampled?

In [100]:
grid.cv_results_

{'mean_fit_time': array([0.04259734, 0.05489106, 0.08129706, 0.06270065, 0.03509617,
        0.0495028 , 0.04700789, 0.05259528, 0.0507885 , 0.0670177 ,
        0.05155001, 0.04905071, 0.04042497, 0.05466361, 0.06375461,
        0.06093712, 0.05347896, 0.11769476, 0.03962045, 0.05049601,
        0.03568912, 0.06046133, 0.05123944, 0.01982584, 0.07772617,
        0.05772867, 0.03116751, 0.05205765, 0.0683208 , 0.05970421,
        0.05134821, 0.05301933, 0.07111435, 0.03898697, 0.08681226,
        0.04507689, 0.03253412, 0.04118514, 0.03665118, 0.07701507,
        0.05429244, 0.07025409, 0.0600204 , 0.07002354, 0.05429382,
        0.0431911 , 0.08901358, 0.02432766, 0.04371715, 0.03723125]),
 'std_fit_time': array([0.00893756, 0.01205306, 0.01507677, 0.0185296 , 0.0052873 ,
        0.01155773, 0.00870881, 0.01092525, 0.01473729, 0.01542342,
        0.01083921, 0.01876623, 0.00788362, 0.01099752, 0.01208729,
        0.0142953 , 0.01268239, 0.04385797, 0.002584  , 0.0110039 ,
        0.002

### Optimal hyperparameters: `best_params_`

In particular, we can check the best hyperparameters found.

In [101]:
grid.best_params_

{'C': 155.32250835495762, 'gamma': 0.0013170314227432364}

### Best score found (mean score on all folds used as validation set): `best_score_`

In [102]:
grid.best_score_

0.8143459915611815

### Best estimator: `best_estimator_`

In [86]:
grid.best_estimator_

SVC(C=77.81749106356084, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.013763799748375237,
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
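
The `C` and `gamma` of this estimator do not match the `best_params_` printed above, which suggests that the two outputs come from different runs of the search: no `random_state` was set, so each run samples different candidates. A minimal sketch, on hypothetical synthetic data, of fixing the seed so that two runs agree (the helper `run_search` is introduced for the illustration):

```python
import scipy.stats
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Hypothetical synthetic data, only used to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=150, n_features=5, random_state=0)

param_dist = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
}

def run_search(seed):
    """Run a small randomized search whose sampling is controlled by `seed`."""
    search = RandomizedSearchCV(SVC(), param_distributions=param_dist,
                                n_iter=10, cv=3, random_state=seed)
    search.fit(X_demo, y_demo)
    return search

first, second = run_search(0), run_search(0)

# Same seed, same sampled candidates, hence the same best hyperparameters
print(first.best_params_ == second.best_params_)

# Within one run, `best_estimator_` always carries the values of `best_params_`
print(first.best_estimator_.C == first.best_params_['C'])
```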

## III.4. `GridSearchCV` vs `RandomizedSearchCV`

Intuitively, people are more likely to choose to work with a GridSearch rather than a RandomizedSearch, probably because it seems less "scary": humans do not like random things.

But in reality, the RandomizedSearch has several advantages over the GridSearch:
- In terms of **computational time**, the main drawback of the GridSearch is that it scales poorly when the number of hyperparameters grows. With as few as four hyperparameters, it can become impractical, because the number of evaluations required grows exponentially with the number of hyperparameters. The RandomizedSearch does not have this problem, since we choose the number of iterations of the search ourselves (with the parameter `n_iter`).
- In terms of **finding a good combination of hyperparameters**, the RandomizedSearch is, for the same budget, often more likely to find a good combination, since it is not restricted to the rigid pattern of a grid and therefore tries more distinct values of each hyperparameter.
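
The difference in cost can be made concrete with a back-of-the-envelope count of model fits. The two helper functions below are hypothetical and only do the arithmetic; they assume that every hyperparameter in the grid has the same number of candidate values.

```python
def grid_search_fits(n_values_per_param, n_params, cv):
    """Number of model fits for a full grid: one per candidate per fold."""
    return (n_values_per_param ** n_params) * cv

def random_search_fits(n_iter, cv):
    """Number of model fits for a randomized search: fixed by `n_iter`."""
    return n_iter * cv

# 5 candidate values per hyperparameter, 5-fold cross-validation
for n_params in range(1, 5):
    print(n_params,
          grid_search_fits(5, n_params, cv=5),
          random_search_fits(50, cv=5))
```

With four hyperparameters, the grid needs 3125 fits, while the randomized search stays at 250 whatever the number of hyperparameters. The same arithmetic gives the 27 fits (3 values, 2 hyperparameters, 3 folds) and the 250 fits (50 candidates, 5 folds) reported in the logs above.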

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1C2bh--ETFTMS9d3TPl2c_WlmFtQPviUv">
</p>

> 📚 **Digging deeper**: In the paper [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/v13/bergstra12a.html) by Bergstra and Bengio, the authors show empirically and theoretically that random search is more efficient for hyperparameter optimization than grid search.