- Parameters and hyperparameters
- Hyperparameter optimization: the validation dataset
- Cross-validation
- Grid search and random search

In [18]:
import warnings
warnings.filterwarnings('ignore')

# Hyperparameters & Optimization

___

So far, we have learned how to handle most of the steps of a machine learning project:
- data visualization
- data preparation
- model training and regularization

But how do we choose the regularization parameters carefully, and optimally?

# I. Parameters vs hyperparameters

In machine learning, we distinguish between parameters and hyperparameters:
- parameters are learned by the model while fitting: we can't modify them directly
- hyperparameters can be tuned: we can modify them directly

> Optimizing hyperparameters may sometimes lead to significant performance improvements

Up to now, we have used the word "parameters" for both, without distinction, for pedagogical reasons.

Can you think of any hyperparameters you already know?

Examples of hyperparameters:
- learning rate alpha
- number of iterations of gradient descent
- regularization type: L1 or L2
- regularization value: `C` in logistic regression in scikit-learn
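
To make the distinction concrete, here is a minimal sketch. It assumes synthetic data from `make_classification` (not the course dataset), and the names `X_demo`, `y_demo` and `model` are only illustrative. The hyperparameters are the values we pass to the constructor, while the parameters (`coef_` and `intercept_`) only exist once the model has been fitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (assumption: not the course data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: chosen by us, before fitting
model = LogisticRegression(C=0.5, max_iter=200)
hyperparams = model.get_params()
print("C:", hyperparams["C"], "| max_iter:", hyperparams["max_iter"])

# Parameters: learned by the model during fit
model.fit(X_demo, y_demo)
print("weights:", model.coef_)
print("intercept:", model.intercept_)
```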

### Hyperparameter optimization

There are several ways to optimize hyperparameters, but the idea is always the same in the end: **finding the set of hyperparameters that maximizes a given metric**.

How would you do that using only a train set and a test set?

Here are the possibilities:
1. training the model and optimizing the hyperparameters on the train set
2. training the model on the train set, optimizing the hyperparameters on the test set

What are the limits of such approaches?
1. We train the model and optimize on the same dataset: this might lead to severe overfitting
2. We optimize on the test dataset: this will bias the final evaluation

So what is the solution?

Introducing a **validation dataset**: a dataset used only for hyperparameter optimization.

The typical dataset split then becomes:
- train set ~ 60% of the data: fitting the model
- validation set ~ 20% of the data: tuning the hyperparameters
- test set ~ 20% of the data: evaluating the performance of the model


<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1ahZYvfiqQumVw-z0FkDVue7Pt9krAGCk" width="400px">
</p>
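
One possible way to obtain such a 60/20/20 split is to call `train_test_split` twice. The sketch below uses random placeholder data, and the `_demo` variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples, 3 features, binary target
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 3))
y_all = rng.integers(0, 2, size=1000)

# First split: keep 20% of the data aside as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% is 20% of the whole dataset
X_train_demo, X_valid_demo, y_train_demo, y_valid_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_valid_demo), len(X_test_demo))
```

The second `test_size` is `0.25` rather than `0.2` because it applies to what is left after the first split.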

# II. Cross validation

In practice, most people do not use this kind of static split between train and validation sets: they use the cross-validation method.

## II.1. The cross-validation method

The **cross-validation** technique allows you to both train your model and tune hyperparameters on all your available labeled data, except the test set (of course).

The cross-validation process is the following:
- You split your data into a training set and a test set (e.g. 80%-20%)
- You split your training set into cross-validation folds (for example `k=10` folds)
- You train on all folds but one, and you do that `k` times, keeping a different fold aside each time
- After each training, you compute your evaluation metric on the fold that was kept aside
- The error is then averaged over the `k` folds and is called the **cross-validation error**.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=18iNqOegmMWxGkGbVD2rYPzsH4D97jRNC" width="700px">
</p>

- Finally, once you find the best set of hyperparameters, you train your model on the entire train set, using those hyperparameters

> Using cross-validation, you can expect to have a more robust model evaluation: you optimize your model not only on a subset of the train set, but on the whole train set
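
The loop described above can be sketched by hand with `KFold`. This is only an illustration on synthetic data (`X_cv` and `y_cv` are made-up names), and it averages an accuracy rather than an error. The last lines use `cross_val_score`, the scikit-learn helper that runs the same loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training set (assumption, for illustration)
X_cv, y_cv = make_classification(n_samples=300, n_features=5, random_state=0)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, valid_idx in kfold.split(X_cv):
    # Train on k-1 folds, evaluate on the fold kept aside
    model = LogisticRegression(max_iter=200)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[valid_idx], y_cv[valid_idx]))

print("mean accuracy over folds:", np.mean(fold_scores))

# Same computation with the scikit-learn helper
helper_scores = cross_val_score(LogisticRegression(max_iter=200),
                                X_cv, y_cv, cv=kfold, scoring="accuracy")
print("with cross_val_score:", helper_scores.mean())
```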

## II.2. Stratified cross validation

For classification problems, your classes may be imbalanced.

This means that at least one class is overrepresented compared to the rest.

As a consequence, the splitting may bias the training or the evaluation.

Let's consider an example of binary classification (red or blue), with the following balance:
- 1/3 blue
- 2/3 red

Now let's consider a cross validation on 9 samples

![](images/cross_val.png)

As one can see, this kind of cross-validation might not be ideal.

> the training and evaluation on each fold might be biased: sometimes blue is overrepresented, sometimes red

The solution is stratified cross-validation, which keeps the same class balance in every fold.

![](images/stratified_cross_val.png)
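
The same 9-sample situation can be reproduced as a small sketch with `KFold` and `StratifiedKFold`. The encoding (blue is `0`, red is `1`) and the helper `blue_counts` are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 9 samples: 3 blue (encoded as 0) followed by 6 red (encoded as 1)
y_colors = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])
X_colors = np.zeros((9, 1))  # the features do not matter for the splitting

def blue_counts(splitter):
    """Number of blue samples in the held-out fold of each split."""
    return [int((y_colors[test_idx] == 0).sum())
            for _, test_idx in splitter.split(X_colors, y_colors)]

plain = blue_counts(KFold(n_splits=3))
stratified = blue_counts(StratifiedKFold(n_splits=3))

print("KFold, blue samples per held-out fold:", plain)
print("StratifiedKFold, blue samples per held-out fold:", stratified)
```

Without shuffling, the plain split puts all the blue samples in a single held-out fold, while the stratified split keeps one blue sample out of three in each of them.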

In order to optimize the model, this cross-validation technique will be used **together with a hyperparameter search** method.

# III. Hyperparameter optimization

Let's look at the logistic regression signature in scikit-learn and check the hyperparameters:
```python
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
```

We have for example the following hyperparameters:
- `penalty` is the regularization type: it can be `'l1'`, `'l2'`, `'elasticnet'`, `'none'`
- `C` is the inverse of the regularization strength: the higher, the less regularized, e.g. `0.01`, `0.1`, `1.`
- `max_iter` is the maximum number of iterations: e.g. `10`, `100`, `1000`

How would you test all those values?

There are two main ways of optimizing those hyperparameters:
- Grid Search
- Random Search

Let's have a look at them, and use them with scikit-learn.

But first, let's reuse the Titanic survival prediction task to apply those methods.

In [13]:
from sklearn.model_selection import train_test_split
import pickle

# Import the data
with open('titanic_cleaned.pkl', 'rb') as file:
    X, y = pickle.load(file)
    
# Split it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

X.head()

| PassengerId | Pclass | Age | Fare | SibSp | Parch | Sex | Embarked_Q | Embarked_S | Title_Miss. | Title_Mr. | Title_Mrs. | Title_Others |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.824744 | -0.589288 | -0.499958 | 0.431108 | -0.474059 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | -1.571327 | 0.644485 | 0.788503 | 0.431108 | -0.474059 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0.824744 | -0.280845 | -0.486376 | -0.474932 | -0.474059 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | -1.571327 | 0.413153 | 0.422623 | 0.431108 | -0.474059 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5 | 0.824744 | 0.413153 | -0.483861 | -0.474932 | -0.474059 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |


## III.1. `GridSearchCV`: Explicit Grid Search

The scikit-learn grid search, `sklearn.model_selection.GridSearchCV`, evaluates each provided combination of hyperparameters using cross-validation.

`GridSearchCV` stands for Grid Search Cross-Validation.

Why grid? Because all the hyperparameter combinations are tested, just like trying all the points in a grid of hyperparameters.
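
To get a feel for the size of such a grid, `sklearn.model_selection.ParameterGrid` can enumerate the combinations without training anything. The sketch below is only illustrative: the grid values and the name `param_grid_demo` are chosen for the example.

```python
from sklearn.model_selection import ParameterGrid

# An example grid: 2 x 3 x 3 values
param_grid_demo = {'penalty': ['l2', 'none'],
                   'C': [0.01, 0.1, 1.],
                   'max_iter': [10, 100, 1000]}

# Every point of the grid is one combination of hyperparameters
combinations = list(ParameterGrid(param_grid_demo))
print("number of combinations:", len(combinations))
print("one of them:", combinations[0])

# With 5-fold cross-validation, each combination is fitted 5 times
n_fits = len(combinations) * 5
print("number of fits with cv=5:", n_fits)
```

The number of fits grows multiplicatively with each hyperparameter added to the grid, which is worth keeping in mind before launching a search.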

How to implement it?

That's pretty easy thanks to scikit-learn. The signature of `GridSearchCV` is the following:
```python
class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
```

with:
- `estimator`: a model instance, e.g. `LogisticRegression()`
- `param_grid`: a dict with all hyperparams and values to test
- `scoring`: the metric to maximize, e.g. `'accuracy'`
- `cv`: the number of cross validation folds, e.g. `5`

Let's implement it.

In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters we want to test
param_grid = {'penalty': ['l2', 'none'], 
              'C': [0.01, 0.1, 1.],
              'max_iter': [10, 100, 1000]}
# Define the gridsearch object
grid = GridSearchCV(LogisticRegression(),
                    param_grid,
                    scoring='accuracy',
                    cv=5
                   )
# Fit and wait
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1.0], 'max_iter': [10, 100, 1000],
                         'penalty': ['l2', 'none']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

Once the fit has finished, a lot of information is available:
- all results in `.cv_results_`: quite hard to read but useful if you need anything
- optimal set of hyperparameters in `.best_params_`
- best score in `.best_score_`
- best estimator in `.best_estimator_` if you need to retrieve the object itself
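
Since `cv_results_` is a plain dictionary of arrays, one way to read it is to loop over a few of its keys. The sketch below assumes synthetic data and a deliberately small grid over `C` only; the names `search` and `cv_results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=200),
                      {'C': [0.01, 0.1, 1.]},
                      scoring='accuracy',
                      cv=5)
search.fit(X_demo, y_demo)

# One entry per tested combination
cv_results = search.cv_results_
for params, mean, std, rank in zip(cv_results['params'],
                                   cv_results['mean_test_score'],
                                   cv_results['std_test_score'],
                                   cv_results['rank_test_score']):
    print(f"rank {rank}: {params} -> {mean:.3f} (+/- {std:.3f})")

# With refit=True (the default), best_estimator_ is refitted on all the data given to fit
print(search.best_estimator_.predict(X_demo[:5]))
```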

In [40]:
# The best params here
grid.best_params_

{'C': 1.0, 'max_iter': 100, 'penalty': 'l2'}

In [24]:
# The best score
grid.best_score_

0.8143405889884763

## III.2. `RandomizedSearchCV`: a randomized search

A grid search is quite explicit and straightforward: it tests all the specified values. But sometimes, this is not the most efficient approach.

What if we test `C` in `0.1`, `1` but the optimal value turns out to be closer to `0.7`? With grid search we would never know.

A randomized search can tackle this issue: we ask it to test random values between `0.1` and `1`.

It might find a best value of `0.6` or `0.85`: not the absolute optimum, but better than `1`.

How to implement it? Same as a grid search. The only difference is that we give not a list of values, but a **distribution** of values, and a **number of combinations of hyperparameters** to test.

To do so, we will use the `scipy` library. Let's look at an example.

In [47]:
from sklearn.model_selection import RandomizedSearchCV
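import scipy.stats  # import the submodule explicitly so that scipy.stats is available

param_dist = {
    'penalty': ['l2', 'none'],
    'C': scipy.stats.expon(scale=0.5),
    # Caution: expon yields floats, whereas max_iter is meant to be an integer;
    # newer scikit-learn versions may reject non-integer values
    'max_iter': scipy.stats.expon(scale=500)
    }

n_iter_search = 50 # Number of hyperparameter combinations to try out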
randsearch = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=param_dist,
                                n_iter=n_iter_search,
                                scoring='accuracy',
                                cv=5,
                                random_state=0
                               )

randsearch.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=LogisticRegression(C=1.0, class_weight=None,
                                                dual=False, fit_intercept=True,
                                                intercept_scaling=1,
                                                l1_ratio=None, max_iter=100,
                                                multi_class='auto', n_jobs=None,
                                                penalty='l2', random_state=None,
                                                solver='lbfgs', tol=0.0001,
                                                verbose=0, warm_start=False),
                   iid='deprecated', n_iter=50, n_jobs=None,
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fc69d53d5d0>,
                                        'max_iter': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fc69d0ace10>,
                                        

You then have access to the same attributes as with a `GridSearchCV`:
- `cv_results_`
- `best_params_`
- `best_score_`
- `best_estimator_`

In [50]:
print('best params are:', randsearch.best_params_)
print('best score is:', randsearch.best_score_)

best params are: {'C': 1.575085018411753, 'max_iter': 75.61542857136058, 'penalty': 'l2'}
best score is: 0.8171574903969271
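
The best `max_iter` printed above is not an integer, because `expon` is a continuous distribution. A possible variant, sketched here on synthetic data, draws `max_iter` from the integer-valued `scipy.stats.randint` and `C` from `scipy.stats.loguniform`. The ranges and the names `param_dist_demo` and `search` are assumptions chosen for the example, not tuned values.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

param_dist_demo = {
    'C': stats.loguniform(1e-2, 1e1),     # continuous, spread over orders of magnitude
    'max_iter': stats.randint(50, 1000),  # integers in [50, 1000)
}

search = RandomizedSearchCV(LogisticRegression(),
                            param_distributions=param_dist_demo,
                            n_iter=10,
                            scoring='accuracy',
                            cv=5,
                            random_state=0)
search.fit(X_demo, y_demo)

print('best params are:', search.best_params_)
print('best score is:', search.best_score_)
```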


## III.3. `GridSearchCV` vs `RandomizedSearchCV`

Intuitively, people tend to choose a grid search rather than a randomized search, probably because it seems less "scary": humans do not like random things.

Most people use grid search most of the time. Random search is quite powerful though, even if less intuitive.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1C2bh--ETFTMS9d3TPl2c_WlmFtQPviUv">
</p>
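
One way to see why random search can be powerful is to compare, for the same budget of 9 combinations, how many distinct values of `C` each method actually tries. This is a small sketch using `ParameterGrid` and `ParameterSampler`, with illustrative ranges that are not taken from the search above.

```python
from scipy import stats
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid: 3 x 3 = 9 combinations
grid_points = list(ParameterGrid({'C': [0.01, 0.1, 1.],
                                  'max_iter': [10, 100, 1000]}))

# Random: 9 combinations drawn from distributions over the same ranges
random_points = list(ParameterSampler({'C': stats.loguniform(1e-2, 1.),
                                       'max_iter': stats.randint(10, 1001)},
                                      n_iter=9,
                                      random_state=0))

distinct_c_grid = len({point['C'] for point in grid_points})
distinct_c_random = len({point['C'] for point in random_points})

print("distinct C values with the grid:", distinct_c_grid)
print("distinct C values with random sampling:", distinct_c_random)
```

With the same number of fits, the grid only explores 3 values of `C`, while random sampling explores 9. This matters when one hyperparameter influences the score much more than the others.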