`Hyperparameter Tuning in scikit-learn with GridSearchCV`


Hyperparameter tuning is a critical step in developing a machine learning model. It means finding the combination of hyperparameters that gives the best performance on held-out data. Hyperparameters are not learned from the data; they are set before training (for example, the number of trees in a random forest), and they can significantly affect how well the model performs.

Here's why hyperparameter tuning is important:

1. **Improved Model Performance**: Choosing the right hyperparameters can lead to significant improvements in your model's accuracy, precision, recall, and other performance metrics.

2. **Avoiding Overfitting and Underfitting**: Properly tuned hyperparameters help prevent overfitting (model learning noise in the training data) and underfitting (model lacking the capacity to capture the underlying patterns in the data).

3. **Generalization**: Optimized hyperparameters result in a model that generalizes well to unseen data, making it more robust and useful for real-world applications.

4. **Efficient Resource Utilization**: A systematic search replaces ad hoc trial and error, so compute is spent on a defined set of candidate settings rather than on repeated manual retraining.
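
To make the overfitting and underfitting point concrete, here is a minimal sketch that varies a single hyperparameter and compares training accuracy with cross-validated accuracy. The choice of `DecisionTreeClassifier`, the `max_depth` values, and the iris dataset are assumptions made for illustration; they are not part of the walkthrough in the rest of this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for one hyperparameter: the maximum tree depth.
depths = [1, 2, 3, 5, 10]

# For each depth, fit on the training folds and score on both the
# training folds and the held-out folds (5-fold cross-validation).
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for depth, train, val in zip(depths, mean_train, mean_val):
    print(f"max_depth={depth:>2}  train={train:.3f}  validation={val:.3f}")
```

A low score on both columns suggests underfitting, while a training score well above the validation score suggests overfitting.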

Here are the steps to perform hyperparameter tuning in scikit-learn:

1. **Define a Range of Hyperparameters to Tune**:

   - Determine which hyperparameters you want to tune. These may include learning rate, the number of hidden layers in a neural network, the depth of a decision tree, regularization strength, etc.

2. **Choose a Scoring Metric**:

   - Select a metric to evaluate the model's performance, such as accuracy, mean squared error, F1-score, etc. This metric will guide the tuning process.

3. **Create a Validation Set**:

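   - Split your data into training and test sets, and hold out part of the training data for validation. The validation data is used to compare hyperparameter combinations. With `GridSearchCV` or `RandomizedSearchCV`, the `cv` argument creates these validation folds from the training data automatically, so only the train/test split has to be done by hand.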

4. **Set Up the Hyperparameter Search**:

   - Use techniques like Grid Search or Random Search to search through a range of hyperparameter values. Scikit-learn provides `GridSearchCV` and `RandomizedSearchCV` classes for this purpose.

5. **Define the Model**:

   - Instantiate your machine learning model with default hyperparameters or initial values.

6. **Perform Hyperparameter Tuning**:

   - Use the hyperparameter search tool (e.g., `GridSearchCV` or `RandomizedSearchCV`) to fit your model with different combinations of hyperparameters on the training data.

   ```python
   from sklearn.model_selection import GridSearchCV

   # Define hyperparameters and their possible values
   param_grid = {'parameter_name': [value1, value2, ...]}

   # Create a grid search instance
   grid_search = GridSearchCV(estimator, param_grid, scoring='desired_metric', cv=5)

   # Fit the grid search to the training data
   grid_search.fit(X_train, y_train)
   ```

7. **Select the Best Hyperparameters**:

   - After hyperparameter tuning, you can access the best hyperparameters and the best model from the search.

   ```python
   best_params = grid_search.best_params_
   best_model = grid_search.best_estimator_
   ```

8. **Evaluate the Model**:

   - Check the best model's cross-validated score (`grid_search.best_score_`), or evaluate it on a separate validation set, to confirm it performs well.

9. **Test on the Test Set**:

   - Finally, test the best model on a separate test set to assess its generalization performance.

Hyperparameter tuning can be a computationally expensive process, especially when trying a large number of hyperparameter combinations. However, it's crucial for optimizing your model's performance and ensuring it works effectively in real-world applications.
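
When the grid gets too large, `RandomizedSearchCV` (mentioned in step 4) samples a fixed number of combinations instead of trying all of them. The following is a minimal sketch, assuming a random forest on the iris dataset; the sampling ranges, `n_iter=10`, and the `random_state` values are illustrative assumptions rather than recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=14)

# Distributions to sample from instead of fixed lists of values.
# The ranges are illustrative assumptions.
param_distributions = {
    "n_estimators": randint(5, 50),       # integers in [5, 50)
    "min_samples_split": randint(2, 6),   # integers in [2, 6)
    "min_samples_leaf": randint(1, 3),    # integers in [1, 3)
}

# n_iter caps the number of sampled combinations, so the cost is
# n_iter * cv fits (plus one final refit), however large the space is.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=14),
    param_distributions=param_distributions,
    n_iter=10, cv=5, scoring="accuracy", random_state=14,
)
random_search.fit(X_train, y_train)

print(random_search.best_params_)
print(f"Mean CV accuracy: {random_search.best_score_:.3f}")
```

The fitted object exposes the same `best_params_`, `best_score_`, and `best_estimator_` attributes as `GridSearchCV`, so the rest of the workflow stays the same.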


In [34]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(context='notebook', style='darkgrid',
              palette='dark', font_scale=1.2)
%matplotlib inline


In [35]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [36]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=14)

# Finding the accuracy with arbitrarily chosen hyperparameters


In [37]:
clf = RandomForestClassifier(
    n_estimators=1, min_samples_split=2, min_samples_leaf=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy-score:{accuracy}")

Accuracy-score:0.9333333333333333


`With those parameters, the accuracy came out to about 93%`
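
A single train/test split with 30 test samples gives a fairly noisy accuracy estimate. As a rough cross-check, the sketch below scores the same baseline settings with 5-fold cross-validation on the full dataset. The `random_state=0` argument is an added assumption so that repeated runs give the same numbers; the cell above does not set it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Same baseline settings as above; random_state is an added assumption.
baseline = RandomForestClassifier(
    n_estimators=1, min_samples_split=2, min_samples_leaf=1, random_state=0)

# Five accuracy values, one per held-out fold.
scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread across folds gives a sense of how much a one-split accuracy figure can move by chance.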


# Now with Hyperparameter Tuning


In [38]:
# param_grid maps each hyperparameter to the candidate values to try.
# GridSearchCV evaluates every combination of these values (5 * 3 * 2 = 30 here),
# so the more values you list, the longer the search takes.
param_grid = {
    'n_estimators': (5, 10, 2, 20, 14),
    'min_samples_split': (2, 3, 5),
    'min_samples_leaf': (1, 2)
}

grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(n_estimators=1),
             param_grid={'min_samples_leaf': (1, 2),
                         'min_samples_split': (2, 3, 5),
                         'n_estimators': (5, 10, 2, 20, 14)})
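
The cost of a search like the one above can be estimated before running it. This short sketch uses scikit-learn's `ParameterGrid` to count the combinations in the same `param_grid`; the variable names `n_candidates` and `cv` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': (5, 10, 2, 20, 14),
    'min_samples_split': (2, 3, 5),
    'min_samples_leaf': (1, 2)
}

# Number of distinct hyperparameter combinations: 5 * 3 * 2.
n_candidates = len(ParameterGrid(param_grid))
cv = 5  # same number of folds as in the search above

print(f"Candidates: {n_candidates}")
# Each candidate is fitted once per fold; one final refit follows.
print(f"Cross-validation fits: {n_candidates * cv}")
```

Here that is 30 candidates and 150 cross-validation fits, which is cheap on iris but grows multiplicatively with every value added to the grid.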

In [39]:
print(f"Best Params:{grid_search.best_params_}")
print(f"Best Estimator:{grid_search.best_estimator_}")


Best Params:{'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10}
Best Estimator:RandomForestClassifier(min_samples_split=5, n_estimators=10)
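
Beyond the single best combination, the fitted search also stores the score of every candidate in `cv_results_`. The sketch below repeats the search in a self-contained form and views the results as a pandas DataFrame. The `random_state=14` on the classifier is an added assumption for repeatability, so the ranking it prints may differ from the output shown above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=14)

param_grid = {
    'n_estimators': (5, 10, 2, 20, 14),
    'min_samples_split': (2, 3, 5),
    'min_samples_leaf': (1, 2)
}

# random_state on the classifier is an added assumption for repeatability.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=14), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# One row per hyperparameter combination, sorted best first.
results = pd.DataFrame(grid_search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score').head())
print(f"Best mean CV accuracy: {grid_search.best_score_:.3f}")
```

If several candidates have nearly the same mean score, the differences between them are probably within noise, and the simpler or cheaper setting is a reasonable choice.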


In [40]:
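# Use the best estimator found by the search as the model.
# No extra clf.fit(X_train, y_train) is needed: with the default refit=True,
# GridSearchCV already refits the best estimator on the full training set.
clf = grid_search.best_estimator_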
grid_search_y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, grid_search_y_pred)
print(f"Accuracy-score after GridSearchCV:{accuracy}")

Accuracy-score after GridSearchCV:0.9666666666666667


`After GridSearchCV the accuracy rises to about 97%, which on this 30-sample test set is one more correct prediction than before`


# Note

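- GridSearchCV can also lead to overfitting: picking the combination with the best cross-validation score can fit the quirks of the validation folds, especially on small datasets, so always confirm the result on a held-out test set.
- GridSearchCV can be a computationally expensive process, especially when trying a large number of hyperparameter combinations.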
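
One common way to check for the overfitting mentioned above is nested cross-validation, where the whole search is itself cross-validated. The following is a minimal sketch under stated assumptions: the smaller grid, the fold counts, and `random_state=0` are illustrative choices, not settings used earlier in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

# A reduced grid (assumption) to keep the nested search quick.
param_grid = {
    'n_estimators': (5, 10, 20),
    'min_samples_split': (2, 5),
}

# Inner loop: picks hyperparameters with 3-fold CV on each outer training split.
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=param_grid, cv=3)

# Outer loop: scores the whole "search, then refit" procedure on folds
# that the inner search never saw.
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

If the nested score is clearly lower than the search's own `best_score_`, the tuned result is probably optimistic. On the cost side, `GridSearchCV` accepts an `n_jobs` argument (for example `n_jobs=-1`) to run the fits in parallel across CPU cores.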
