# Hyperparameter Tuning

Hyperparameter tuning is the process of finding the best hyperparameters for a model. Hyperparameters are the parameters that are not learned by the model, and they are set before the training process. In this notebook, I will show you how to use the `optuna` library to perform hyperparameter tuning for a simple machine learning model.

**Types of Hyperparameters**

There are two types of hyperparameters:

Grid Search: In this method, we define a grid of hyperparameters and then train the model with each combination of hyperparameters. This method is simple but computationally expensive.
Random Search: In this method, we define a range of hyperparameters and then train the model with random combinations of hyperparameters. This method is less computationally expensive than grid search.
Bayesian Optimization: In this method, we use the results of the previous iterations to select the next hyperparameters to try. This method is more computationally expensive than random search but less computationally expensive than grid search.
Gradient-based Optimization: In this method, we use the gradient of the loss function with respect to the hyperparameters to find the best hyperparameters. This method is computationally expensive but can be very effective.

**Optuna**

Optuna is a hyperparameter optimization framework that is designed for machine learning. It is easy to use and supports various optimization algorithms, including grid search, random search, Bayesian optimization, and gradient-based optimization. In this notebook, I will show you how to use Optuna to perform hyperparameter tuning for a simple machine learning model.


# Cross Validation

Cross-validation is a technique used to evaluate the performance of a machine learning model. It is used to estimate the performance of a model on an independent dataset. In this notebook, I will show you how to use the `sklearn` library to perform cross-validation for a simple machine learning model.

**Types of Cross-Validation**

There are several types of cross-validation:

**K-Fold Cross-Validation:** In this method, the dataset is divided into k subsets, and the model is trained k times, each time using a different subset as the test set and the remaining subsets as the training set. The performance of the model is then averaged over the k iterations.

**Stratified K-Fold Cross-Validation:** In this method, the dataset is divided into k subsets, and the model is trained k times, each time using a different subset as the test set and the remaining subsets as the training set. The performance of the model is then averaged over the k iterations. This method is used when the dataset is imbalanced.

**Leave-One-Out Cross-Validation:** In this method, the model is trained k times, each time using all the data except one sample as the training set and the remaining sample as the test set. The performance of the model is then averaged over the k iterations.

**Leave-P-Out Cross-Validation:** In this method, the model is trained k times, each time using all the data except p samples as the training set and the remaining p samples as the test set. The performance of the model is then averaged over the k iterations.

**Stratified Shuffle Split Cross-Validation:** In this method, the dataset is divided into k subsets, and the model is trained k times, each time using a different subset as the test set and the remaining subsets as the training set. This method is used when the dataset is imbalanced.

**Time Series Cross-Validation:** In this method, the dataset is divided into k subsets, and the model is trained k times, each time using a different subset as the test set and the remaining subsets as the training set. This method is used when the dataset is a time series.
In this notebook, I will show you how to use the `sklearn` library to perform k-fold cross-validation for a simple machine learning model.


In [2]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [7]:
# load the data
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target

In [10]:
# define the model
model = RandomForestClassifier()

para_grid = {'n_estimators': [100, 200, 300, 400, 500],
             'max_features': ['auto', 'sqrt', 'log2'],
                'max_depth': [10, 20, 30, 40, 50],
             'criterion': ['gini', 'entropy']
             }

# set up the grid
grid_model = GridSearchCV(model, 
                          para_grid, 
                          cv=5, 
                          scoring='accuracy',
                          verbose=1,
                          n_jobs=-1)

#fit the grid
grid_model.fit(X, y)


# print the best parameters
print(grid_model.best_params_)



Fitting 5 folds for each of 150 candidates, totalling 750 fits


250 fits failed out of a total of 750.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
219 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParame

{'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 400}


In [15]:
# define the model

from sklearn.model_selection import RandomizedSearchCV

model = RandomForestClassifier()

para_grid = {'n_estimators': [100, 200, 300, 400, 500],
             'max_features': ['auto', 'sqrt', 'log2'],
                'max_depth': [10, 20, 30, 40, 50],
             'criterion': ['gini', 'entropy']
             }

# set up the grid
grid_model = RandomizedSearchCV(model, 
                          para_grid, 
                          cv=5, 
                          scoring='accuracy',
                          verbose=1,
                          n_jobs=-1)

#fit the grid
grid_model.fit(X, y)


# print the best parameters
print(grid_model.best_params_)



Fitting 5 folds for each of 10 candidates, totalling 50 fits


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
23 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameter

{'n_estimators': 200, 'max_features': 'sqrt', 'max_depth': 30, 'criterion': 'entropy'}
