# Hyperparameter tuning with scikit-learn

This notebook contains a few examples of how hyperparameter tuning works with scikit-learn.

Author: Umberto Michelucci (umberto.michelucci@toelt.ai).

In [1]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV, GridSearchCV
import pandas as pd

First, we define which hyperparameters we want to tune and which values to try for each of them.

In [2]:
param_grid = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}
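
To make the size of this search space concrete, the short sketch below expands the grid with scikit-learn's `ParameterGrid` helper. It only enumerates the combinations and does not train anything; the variable name `candidates` is introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}

# Every combination of the listed values is one candidate for the search.
candidates = list(ParameterGrid(param_grid))
print(f"{len(candidates)} candidates in total")
for params in candidates[:3]:
    print(params)
```

Three values for each of two hyperparameters give 3 x 3 = 9 candidates, which is the number of configurations any grid-based search starts from.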

Then we need to define what kind of model we want to test.

In [3]:
base_estimator = RandomForestClassifier(random_state=0)

The following cell will generate some *fake* data to use for our tuning.

The `make_classification` function is a utility provided by scikit-learn. This function is specifically used for generating a random n-class classification problem. It is an essential tool for creating synthetic datasets, which are particularly useful for testing and benchmarking machine learning algorithms.

Here are the key aspects of the `make_classification` function:

1. **Purpose**: The primary purpose of `make_classification` is to generate a random, multiclass classification problem. This helps in creating controlled datasets to test classification algorithms and to understand their behavior under various scenarios.

2. **Customizability**: It offers a high degree of customizability. Users can specify several parameters such as the number of samples, the number of features (total, informative, redundant, and repeated), the number of classes, and the level of class separation. This flexibility allows for the creation of datasets with specific characteristics, tailored to particular testing needs.

3. **Output**: The function outputs a tuple containing two arrays: the first array is the generated features (X), and the second array is the class labels for each feature vector (y).

4. **Parameters**:
   - `n_samples`: The number of samples to generate.
   - `n_features`: The total number of features. These include informative, redundant, and noise features.
   - `n_informative`: The number of informative features, i.e., features actually used to build the classification model.
   - `n_redundant`: The number of redundant features, which are linear combinations of the informative features.
   - `n_classes`: The number of classes (or labels) in the dataset.
   - `weights`: The proportions of samples assigned to each class.
   - `class_sep`: A parameter that controls the degree of class separation.
   - And many more parameters to control various aspects of the dataset.


5. **Applications**: It is widely used in machine learning for algorithm development, testing, and comparison. Researchers and practitioners often use it to simulate various data distributions and imbalances to evaluate the performance of classification algorithms under controlled conditions.

6. **Ease of Use**: Like many scikit-learn functions, `make_classification` is user-friendly, making it accessible for both beginners and experienced practitioners in the field of machine learning.
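
As a rough illustration of the parameters listed above, the following sketch generates a small, imbalanced three-class dataset. The specific values (500 samples, 10 features, the class weights) are arbitrary choices made for this example, not settings used elsewhere in the notebook.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Arbitrary illustrative settings: 3 classes with unequal proportions.
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    n_classes=3,
    weights=[0.6, 0.3, 0.1],
    class_sep=1.5,
    random_state=0,
)

print(X_demo.shape, y_demo.shape)
class_counts = Counter(y_demo.tolist())
print(class_counts)  # class 0 should be the most frequent
```

Fixing `random_state` makes the generated data reproducible, which is useful when comparing tuning strategies on the same dataset.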

In [4]:
X, y = make_classification(n_samples=1000, random_state=0)

## Hyperparameter Search - Halving Grid Search

`HalvingGridSearchCV` is a class provided by scikit-learn, a widely used Python library for machine learning. It is part of the `model_selection` module and is still experimental, which is why `enable_halving_search_cv` has to be imported before it. It finds the best parameters for a given model through a process known as "successive halving," a more resource-efficient alternative to the traditional grid search.

Key characteristics and functionalities of `HalvingGridSearchCV` include:

1. **Purpose**: The primary objective of `HalvingGridSearchCV` is to identify the best hyperparameters for a given model. It does this by systematically working through multiple combinations of parameter values, evaluating each combination's performance.

2. **Successive Halving Algorithm**: Unlike the traditional grid search, which evaluates all combinations of parameter values with the full amount of resources, `HalvingGridSearchCV` employs the successive halving algorithm. This algorithm initially evaluates all hyperparameter combinations with a small amount of resources, then repeatedly keeps only the best-scoring fraction of the combinations (one half when `factor=2`) and allocates more resources to those that remain.

3. **Resource Efficiency**: This approach significantly reduces computation time and resource usage, making it more efficient, especially when dealing with large datasets or complex models.

4. **Parameters and Attributes**:
   - It accepts parameters similar to those of `GridSearchCV`, like `estimator` for the model, `param_grid` for the grid of parameters, `scoring` for the scoring method, and `cv` for the cross-validation strategy.
   - `factor`: Determines the rate at which the number of configurations is reduced at each iteration.
   - `resource`: The resource that is increased at each iteration (e.g., the number of iterations for an iterative algorithm).
   - `min_resources`: The minimum amount of resource allocated to each configuration during the first iteration.

5. **Cross-Validation**: It uses cross-validation to evaluate the performance of each set of parameters, ensuring a thorough and unbiased assessment.

6. **Applications**: `HalvingGridSearchCV` is particularly useful in scenarios where the parameter space is large, and a traditional grid search would be too time-consuming or computationally expensive.

7. **Results**: It provides detailed results that include the best parameters found, the score of the best parameters, and the complete results of the search process.


Finally, we can run the actual search. By default, the resource is defined in terms of the number of samples. That is, each iteration uses an increasing number of samples to train on. You can, however, specify a different parameter to use as the resource with the `resource` parameter. Here is an example where the resource is the number of estimators of a random forest:

In [5]:
sh = HalvingGridSearchCV(base_estimator, param_grid, cv=5,
                    factor=2, resource='n_estimators',
                    max_resources=30).fit(X, y)

Now we can retrieve the best estimator, which carries the best parameters found by the search.

In [6]:
sh.best_estimator_
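
To see how the budget was spent during this search, the fitted object exposes a few attributes. The sketch below repeats the search from the cells above so that it runs on its own, and then prints the number of iterations, the candidates and resources per iteration, and the best result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

# Same data, grid and search settings as in the cells above.
X, y = make_classification(n_samples=1000, random_state=0)
param_grid = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}
sh = HalvingGridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                         cv=5, factor=2, resource='n_estimators',
                         max_resources=30).fit(X, y)

print("iterations:            ", sh.n_iterations_)
print("candidates per iter.:  ", sh.n_candidates_)
print("n_estimators per iter.:", sh.n_resources_)
print("best parameters:       ", sh.best_params_)
print("best CV score:         ", sh.best_score_)
```

The first entry of `n_candidates_` is the full grid, and no entry of `n_resources_` exceeds the `max_resources=30` limit set above.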

As mentioned above, the amount of resources used at each iteration depends on the `min_resources` parameter. If you have a lot of resources available but start with a low amount, some of them might be wasted (i.e. never used). Let us try a different example.

In [7]:
param_grid = {'kernel': ('linear', 'rbf'),
              'C': [1, 10, 100]}
base_estimator = SVC(gamma='scale')

In [8]:
X, y = make_classification(n_samples=1000)

In [9]:
sh = HalvingGridSearchCV(base_estimator, param_grid, cv=5,
                          factor=2, min_resources=20).fit(X, y)

In [10]:
sh.best_estimator_

With `min_resources=20`, the search process uses at most 80 resources, while the maximum amount available is `n_samples=1000`: the iterations use 20, 40 and 80 samples, after which too few candidates remain to continue. Here, we have `min_resources = r_0 = 20`. By default, `HalvingGridSearchCV` sets `min_resources` to `'exhaust'`. This means that `min_resources` is chosen automatically so that the last iteration can use as many resources as possible, within the `max_resources` limit.
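
The arithmetic behind the figure of 80 can be sketched in plain Python. The helper `halving_schedule` below is hypothetical (it is not part of scikit-learn) and is a simplified reading of the documented behaviour: it assumes an integer `factor`, a fixed integer `min_resources`, and no aggressive elimination.

```python
import math


def halving_schedule(n_candidates, factor, min_resources, max_resources):
    """Return (iteration, n_candidates, n_resources) tuples for a halving search.

    Hypothetical helper for illustration only, not a scikit-learn function.
    """
    # Number of iterations needed to shrink the candidate pool:
    # 1 + floor(log_factor(n_candidates)), computed with integers.
    k = 0
    while factor ** (k + 1) <= n_candidates:
        k += 1
    n_required_iterations = 1 + k

    schedule = []
    for iteration in range(n_required_iterations):
        n_resources = min_resources * factor ** iteration
        if n_resources > max_resources:
            break  # not enough resources left for another round
        schedule.append((iteration, n_candidates, n_resources))
        n_candidates = math.ceil(n_candidates / factor)
    return schedule


# 2 kernels x 3 values of C = 6 candidates, as in the SVC example.
schedule = halving_schedule(n_candidates=6, factor=2,
                            min_resources=20, max_resources=1000)
for iteration, n_cand, n_res in schedule:
    print(f"iter {iteration}: {n_cand} candidates, {n_res} samples each")
```

Under these assumptions the search ends after three rounds at 80 samples, so most of the 1000 available samples are never used. This is the waste that the `'exhaust'` setting is meant to avoid.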

There are many more options, and the official documentation is the best place to explore them.

## Analysis of the results

The `cv_results_` attribute contains useful information for analyzing the results of a search. It can be converted to a pandas dataframe with `df = pd.DataFrame(sh.cv_results_)`, where `sh` is the fitted search object.

The `cv_results_` attribute in scikit-learn, particularly in model selection tools like `GridSearchCV` and `HalvingGridSearchCV`, is a dictionary that holds a lot of detailed information about the results of the cross-validation process. It is generated after the fitting process (`fit` method) of the model selection tool is completed.

Key aspects of the `cv_results_` attribute include:

1. **Detailed Results**: It contains detailed outcomes for each parameter combination that was evaluated during the cross-validation. This information is invaluable for analyzing and understanding the performance of different hyperparameter configurations.

2. **Structure**: The dictionary includes various keys, each corresponding to different aspects of the results. Common keys include:
   - `mean_test_score`: The mean test score across the cross-validation folds.
   - `std_test_score`: The standard deviation of the test score across the folds.
   - `mean_train_score` and `std_train_score`: The same metrics for the training set (if `return_train_score` is set to True).
   - `params`: A list of parameter settings corresponding to each result.
   - `rank_test_score`: The rank of each parameter setting based on the mean test score.
   - `split{i}_test_score` (where `i` is the fold index, starting at 0): The test score for each split.
   - `mean_fit_time` and `std_fit_time`: The mean and standard deviation of the time taken to fit the model on the training folds for each parameter setting.
   - `mean_score_time` and `std_score_time`: The mean and standard deviation of the time taken to score the model on the test folds for each parameter setting.

3. **Analysis and Visualization**: This attribute is often used for analyzing the results in detail. You can sort and filter the results to identify the best performing parameter combinations. It is also commonly used for visualizing the results, such as plotting the scores against different hyperparameters to see how they affect the model performance.

4. **Decision Making**: The information in `cv_results_` is critical for making informed decisions about which hyperparameters are the most effective for a given model and dataset.

5. **Post-Processing**: Researchers and practitioners can export this data into data frames (e.g., using Pandas) for easier manipulation and more advanced analyses.
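
A minimal sketch of such post-processing is shown below. It fits a small halving search on synthetic data (the names `X_demo`, `search` and `results` are introduced here for illustration), keeps only the rows from the last iteration, and sorts them by mean test score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)
search = HalvingGridSearchCV(
    SVC(gamma='scale'),
    {'kernel': ('linear', 'rbf'), 'C': [1, 10, 100]},
    cv=5, factor=2, min_resources=20, random_state=0,
).fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)

# Keep the candidates that survived to the last iteration, best first.
columns = ['iter', 'n_resources', 'params', 'mean_test_score', 'std_test_score']
last_iter = results[results['iter'] == results['iter'].max()]
ranked = last_iter.sort_values('mean_test_score', ascending=False)[columns]
print(ranked)
```

For a halving search it makes sense to compare scores within one iteration only, because candidates in different iterations were trained with different amounts of resources.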


In [11]:
df = pd.DataFrame(sh.cv_results_)
df

| | iter | n_resources | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 20 | 0.000621 | 0.000165 | 0.000423 | 6.4e-05 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.75 | ... | 0.6 | 0.2 | 9 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0 | 20 | 0.0005 | 0.000185 | 0.000374 | 6.1e-05 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.5 | ... | 0.65 | 0.122474 | 6 | 1.0 | 1.0 | 1.0 | 0.875 | 1.0 | 0.975 | 0.05 |
| 2 | 0 | 20 | 0.000368 | 1.1e-05 | 0.000307 | 3e-06 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 0.75 | ... | 0.6 | 0.2 | 9 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 0 | 20 | 0.000345 | 7e-05 | 0.000273 | 4.8e-05 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.5 | ... | 0.7 | 0.1 | 4 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0 | 20 | 0.000332 | 1.2e-05 | 0.00028 | 1.1e-05 | 100 | linear | {'C': 100, 'kernel': 'linear'} | 0.75 | ... | 0.6 | 0.2 | 9 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 5 | 0 | 20 | 0.000319 | 1.4e-05 | 0.000267 | 9e-06 | 100 | rbf | {'C': 100, 'kernel': 'rbf'} | 0.5 | ... | 0.7 | 0.1 | 4 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 6 | 1 | 40 | 0.000332 | 1.6e-05 | 0.000266 | 1.2e-05 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.5 | ... | 0.725 | 0.145774 | 1 | 1.0 | 1.0 | 1.0 | 0.96875 | 1.0 | 0.99375 | 0.0125 |
| 7 | 1 | 40 | 0.000316 | 1e-05 | 0.000251 | 2e-06 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.625 | ... | 0.725 | 0.165831 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 8 | 1 | 40 | 0.000295 | 7e-06 | 0.000244 | 3e-06 | 100 | rbf | {'C': 100, 'kernel': 'rbf'} | 0.625 | ... | 0.725 | 0.165831 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 9 | 2 | 80 | 0.00037 | 1e-05 | 0.000259 | 4e-06 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.5625 | ... | 0.625 | 0.189572 | 7 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |


Each row corresponds to a given parameter combination (a candidate) and a given iteration. The iteration is given by the `iter` column. The `n_resources` column tells you how many resources were used. A candidate that survives several iterations therefore appears in several rows.
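
One way to follow each candidate through the iterations is to pivot the results so that every candidate becomes a column and every iteration a row. The sketch below refits a small halving search so that it runs on its own; `search`, `results` and `params_str` are names chosen for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)
search = HalvingGridSearchCV(
    SVC(gamma='scale'),
    {'kernel': ('linear', 'rbf'), 'C': [1, 10, 100]},
    cv=5, factor=2, min_resources=20, random_state=0,
).fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)
# Dictionaries cannot be used as column labels, so use their string form.
results['params_str'] = results['params'].apply(str)

mean_scores = results.pivot(index='iter', columns='params_str',
                            values='mean_test_score')
print(mean_scores.T)  # NaN marks iterations a candidate did not reach
print(results.groupby('iter')['n_resources'].first())
```

Candidates that were eliminated early show `NaN` in the later iterations, which makes the elimination pattern easy to read.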

If you are interested in how Hyperband, a method that builds on successive halving, works, you can refer to L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization, Journal of Machine Learning Research 18, 2018.

## Hyperparameter Search - Classical Grid Search

In [12]:
sh = GridSearchCV(estimator=base_estimator, param_grid=param_grid, cv=5).fit(X, y)

In [13]:
df = pd.DataFrame(sh.cv_results_)
df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.012224 | 0.001255 | 0.001083 | 0.000148 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.82 | 0.83 | 0.845 | 0.87 | 0.88 | 0.849 | 0.022891 | 3 |
| 1 | 0.005698 | 0.000157 | 0.002706 | 0.000123 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.85 | 0.81 | 0.86 | 0.875 | 0.855 | 0.85 | 0.021679 | 1 |
| 2 | 0.053159 | 0.008238 | 0.000891 | 4.8e-05 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 0.825 | 0.83 | 0.845 | 0.87 | 0.88 | 0.85 | 0.021679 | 1 |
| 3 | 0.007648 | 0.000139 | 0.002575 | 4.4e-05 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.78 | 0.785 | 0.82 | 0.845 | 0.785 | 0.803 | 0.025417 | 5 |
| 4 | 0.412888 | 0.060072 | 0.000922 | 3.8e-05 | 100 | linear | {'C': 100, 'kernel': 'linear'} | 0.82 | 0.825 | 0.845 | 0.87 | 0.88 | 0.848 | 0.023791 | 4 |
| 5 | 0.011064 | 0.000119 | 0.002721 | 9.5e-05 | 100 | rbf | {'C': 100, 'kernel': 'rbf'} | 0.765 | 0.8 | 0.825 | 0.83 | 0.785 | 0.801 | 0.024372 | 6 |


In [14]:
sh.best_estimator_
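
For comparison with the halving search, the sketch below runs the exhaustive grid search on its own and counts the fits it performs. The names `gs`, `X_demo` and `n_fits` are introduced for this example, and the fixed `random_state` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10, 100]}

gs = GridSearchCV(estimator=SVC(gamma='scale'), param_grid=param_grid,
                  cv=5).fit(X_demo, y_demo)

# Every candidate is fitted once per fold on the full training folds.
n_fits = len(gs.cv_results_['params']) * gs.n_splits_
print("cross-validation fits:", n_fits)
print("best parameters:      ", gs.best_params_)
print("best CV score:        ", gs.best_score_)
```

Unlike the halving search, every one of these fits uses all the training data available in its fold, so the cost grows directly with the size of the grid.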

## Appendix A - The Halving Grid Search Algorithm

The Halving Grid Search algorithm, implemented as `HalvingGridSearchCV` in scikit-learn, is an efficient method for hyperparameter tuning. It is a part of the broader family of algorithms known as Successive Halving algorithms. The key idea behind this approach is to iteratively refine the search for the best hyperparameters by allocating resources more efficiently. Here's a detailed description of how it works:

1. **Initial Setup**:
   - The algorithm begins with a predefined grid of hyperparameter values. Each combination of hyperparameters in this grid is considered a candidate.
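   - It also requires the definition of a resource, a quantity that controls how much training each candidate receives. In scikit-learn this is the number of training samples by default, or any estimator parameter that accepts positive integers (such as the number of trees in a forest).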

2. **Initial Evaluation**:
   - In the first iteration, all hyperparameter candidates are trained with a small, equal amount of the resource. In scikit-learn this initial level is controlled by `min_resources`, which can be set by the user or derived automatically.
   - Once trained, each candidate is evaluated using a specified metric (like accuracy, F1-score, etc.).

3. **Selection and Halving**:
   - After evaluating all candidates, a subset of the best-performing candidates is selected. With a factor of 2 this selection cuts the number of candidates in half, hence the term "halving"; in general, roughly one candidate in every `factor` is kept.
   - The key aspect of this step is that only a fraction of candidates (the most promising ones) continue to the next round.

4. **Resource Doubling**:
   - For the next iteration, the amount of resource allocated to each remaining candidate is multiplied by the factor specified by the user (doubled when `factor=2`).
   - These candidates are then re-trained with this increased resource and re-evaluated.

5. **Iterative Process**:
   - Steps 3 and 4 are repeated. In each iteration, the number of candidates is reduced and the resources allocated to each remaining candidate are increased, both by the chosen factor (halved and doubled, respectively, when `factor=2`).
   - This process continues until a stopping criterion is met, which could be a maximum resource limit or a minimum number of candidates remaining.

6. **Final Selection**:
   - The final set of hyperparameters is chosen based on the best performance observed in the last iteration of the process.
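
The steps above can be sketched from scratch in a few lines. This is a toy version written for illustration, not scikit-learn's implementation: the function `successive_halving` and its arguments are hypothetical, the resource is fixed to the number of samples, subsamples are drawn uniformly at random, and the loop stops once at most `factor` candidates remain or the data cannot support another round.

```python
import math

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC


def successive_halving(estimator, param_grid, X, y, factor=2,
                       min_resources=100, cv=5, seed=0):
    """Toy successive halving with the number of samples as the resource."""
    rng = np.random.default_rng(seed)
    candidates = list(ParameterGrid(param_grid))  # step 1: the full grid
    n_resources = min_resources
    history = []
    while True:
        # Steps 2 and 4: score each surviving candidate on a subsample.
        idx = rng.choice(len(X), size=n_resources, replace=False)
        scores = [
            cross_val_score(clone(estimator).set_params(**params),
                            X[idx], y[idx], cv=cv).mean()
            for params in candidates
        ]
        history.append((len(candidates), n_resources))
        ranking = np.argsort(scores)[::-1]  # best candidate first
        # Steps 5 and 6: stop and return the best of the last round.
        if len(candidates) <= factor or n_resources * factor > len(X):
            return candidates[ranking[0]], history
        # Step 3: keep the top 1/factor of the candidates.
        n_keep = math.ceil(len(candidates) / factor)
        candidates = [candidates[i] for i in ranking[:n_keep]]
        n_resources *= factor


X_toy, y_toy = make_classification(n_samples=1000, random_state=0)
toy_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10, 100]}
best_params, history = successive_halving(SVC(gamma='scale'), toy_grid,
                                          X_toy, y_toy)
print(history)
print(best_params)
```

Each entry of `history` is a pair (number of candidates, number of samples), so the shrinking pool and the growing budget can be read side by side.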

### Advantages of the Halving Grid Search Algorithm:

- **Efficiency**: By allocating resources more effectively and focusing on promising candidates, it significantly reduces the computational cost compared to traditional grid search, which is beneficial for large hyperparameter spaces.
- **Flexibility**: It works with any type of model and is applicable to a wide range of hyperparameter optimization problems.
- **Scalability**: Because the early rounds use only a small share of the resource, most of the budget is spent on a few promising candidates, which helps as the dataset and the model grow.

### Key Considerations:

- **Resource Definition**: The choice of resource and how it impacts model training is crucial. The resource could be the number of iterations, number of trees (in ensemble methods), or any other measure that directly affects the training extent of the model.
- **Initial Resource Allocation**: The amount of resource allocated in the initial step can impact the algorithm's effectiveness. Too little resource might not be sufficient to reveal the potential of certain candidates.
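
The effect of the initial allocation can be checked directly by running the same search with two settings of `min_resources` and comparing the resulting schedules. This is a sketch on synthetic data; the names `schedules` and `X_demo` are introduced for this example, and `random_state=0` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)
grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10, 100]}

schedules = {}
for min_resources in (20, 'exhaust'):
    search = HalvingGridSearchCV(SVC(gamma='scale'), grid, cv=5, factor=2,
                                 min_resources=min_resources,
                                 random_state=0).fit(X_demo, y_demo)
    # Pair the number of candidates with the samples used in each iteration.
    schedules[min_resources] = list(zip(search.n_candidates_,
                                        search.n_resources_))

for setting, schedule in schedules.items():
    print(f"min_resources={setting!r}: {schedule}")
```

A small fixed starting value leaves the final round with far fewer samples than are available, whereas `'exhaust'` picks the starting value so that the last round uses as much of the data as it can.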
