# Hyperparameter Tuning with GridSearchCV

## 1. Introduction 

When training machine learning models, choosing the right hyperparameters is crucial for achieving good performance. In this tutorial, we will see how to evaluate a model's performance and how to tune hyperparameters with GridSearchCV and RandomizedSearchCV.

In [21]:
# Basic imports 
import numpy as np
import pandas as pd

## 2. train_test_split

In [3]:
from sklearn.datasets import load_iris

# Load the iris dataset
X, y = load_iris(return_X_y=True, as_frame=True)

In [4]:
X.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


We will use a Support Vector Machine classifier (SVC) as our model.

In [5]:
from sklearn.svm import SVC

# Instantiate the model
svm = SVC()

We split our data into two sets: a training set (X_train, y_train) and a test set (X_test, y_test). First we train our model on the training set, and then we evaluate its performance on the test set.

In [6]:
from sklearn.model_selection import train_test_split

# Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train the model
svm.fit(X_train, y_train)

# Evaluate the model
svm.score(X_test, y_test)


1.0
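A single split gives only one score, and that score depends on which samples end up in the test set. The sketch below shows the sizes produced by the default split and, as an optional variation that is not part of the original cells, a stratified split. The `test_size=0.2` and `stratify=y` arguments are choices made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# Default behaviour: 25% of the samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape)

# Illustrative variation: stratify=y keeps the class proportions
# of y in both the training and the test subset
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_test_s.value_counts().sort_index())
```

With 150 iris samples, the default split leaves 112 rows for training and 38 for testing, and the stratified 20% split puts 10 samples of each class in the test set.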

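## 3. KFold

KFold splits the data into n_splits consecutive folds. Each fold is used once as the test set while the remaining folds form the training set. The first n_samples % n_splits folds have size n_samples // n_splits + 1, and the other folds have size n_samples // n_splits, where n_samples is the number of samples.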

In [7]:
from sklearn.model_selection import KFold

In [8]:
# Instantiate a KFold object with 4 splits.
kf = KFold(n_splits=4)

In [9]:
# Let's split a simple array
kf.split(np.arange(8))

<generator object _BaseKFold.split at 0x00000208A29A6320>

In [10]:
# We can use iteration to get the index values
for train_index, test_index in kf.split(np.arange(8)):
    print(train_index, test_index)

[2 3 4 5 6 7] [0 1]
[0 1 4 5 6 7] [2 3]
[0 1 2 3 6 7] [4 5]
[0 1 2 3 4 5] [6 7]


In [11]:
# Let's try it on a 2-D matrix
X = np.arange(12).reshape(4,3)
X

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [12]:
kf = KFold(n_splits=3)
indexes = kf.split(X)
for i, (train_index, test_index) in enumerate(indexes):
    print(f'Split {i+1}')
    print(train_index, test_index)
    print()
    print('train set: ')
    print(X[train_index])
    print('test set: ')
    print(X[test_index])
    print('\n')

    

Split 1
[2 3] [0 1]

train set: 
[[ 6  7  8]
 [ 9 10 11]]
test set: 
[[0 1 2]
 [3 4 5]]


Split 2
[0 1 3] [2]

train set: 
[[ 0  1  2]
 [ 3  4  5]
 [ 9 10 11]]
test set: 
[[6 7 8]]


Split 3
[0 1 2] [3]

train set: 
[[0 1 2]
 [3 4 5]
 [6 7 8]]
test set: 
[[ 9 10 11]]
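The fold-size rule stated at the start of this section can be checked numerically. The sketch below compares the test-fold sizes produced by KFold with the sizes predicted by the rule. `expected_fold_sizes` is a hypothetical helper written for this illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold


def expected_fold_sizes(n_samples, n_splits):
    """Fold sizes predicted by the rule (hypothetical helper, not a sklearn function)."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional sample
    return [base + 1] * extra + [base] * (n_splits - extra)


n_samples, n_splits = 10, 3
kf = KFold(n_splits=n_splits)

# Size of the test fold in each split
actual = [len(test_index) for _, test_index in kf.split(np.arange(n_samples))]

print(actual)
print(expected_fold_sizes(n_samples, n_splits))
```

For 10 samples and 3 splits, both lists are `[4, 3, 3]`. The 4-row matrix above follows the same rule: 4 % 3 = 1 fold of size 2, followed by two folds of size 1.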




Now back to the iris dataset. We will use 5 splits and keep track of the test scores.

In [15]:
# Load dataset
X, y = load_iris(return_X_y=True, as_frame=True)

# Define K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores=[]

for train_index, test_index in kf.split(X,y):
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]

    # Train the model
    svm = SVC()
    svm.fit(X_train, y_train)

    # Evaluate the model
    scores.append(svm.score(X_test, y_test))

scores


[1.0, 1.0, 0.9333333333333333, 0.9333333333333333, 0.9666666666666667]

## 4. cross_val_score

Instead of manually splitting data, we can automate the process with sklearn's cross_val_score.

In [16]:
from sklearn.model_selection import cross_val_score

# Split the data, train and evaluate the model
scores = cross_val_score(SVC(), X, y, cv=5, scoring='accuracy')
scores

array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])

In [77]:
# Compute the average performance
print(f'Average accuracy: {scores.mean():.4f}')

Average accuracy: 0.9667
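These scores differ from the ones obtained with the manual loop because the splits are different. When cv is an integer and the estimator is a classifier, cross_val_score uses StratifiedKFold without shuffling, whereas the manual loop used a shuffled KFold. The sketch below illustrates this. Passing a splitter object through `cv` is a standard option, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True, as_frame=True)

# cv=5 with a classifier is equivalent to an unshuffled StratifiedKFold
scores_int = cross_val_score(SVC(), X, y, cv=5, scoring='accuracy')
scores_strat = cross_val_score(
    SVC(), X, y, cv=StratifiedKFold(n_splits=5), scoring='accuracy'
)
print(np.allclose(scores_int, scores_strat))

# Passing the shuffled KFold from the manual loop reuses those splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(SVC(), X, y, cv=kf, scoring='accuracy')
print(scores_kf)
```

Passing an explicit splitter is useful whenever the evaluation protocol has to match a manual experiment.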


At this point, it is tempting to change the value of a hyperparameter and see if we can do any better.

In [19]:
cross_val_score(SVC(C=10), X, y, cv=5, scoring='accuracy').mean()

0.9800000000000001

This seems to be better. And now what? Maybe try C=100 or change another hyperparameter like kernel or gamma? GridSearchCV is a tool from sklearn that will help us automate this process.
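Before automating the search, it may help to see what the manual version looks like for a single hyperparameter. The following is a minimal sketch, where the candidate values for C are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True, as_frame=True)

# Mean cross-validated accuracy for each candidate value of C
mean_scores = {}
for C in [0.1, 1, 10, 100]:
    scores = cross_val_score(SVC(C=C), X, y, cv=5, scoring='accuracy')
    mean_scores[C] = scores.mean()

# Candidate with the highest mean accuracy
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_C)
```

With two or three hyperparameters this turns into nested loops, which is the bookkeeping a grid search takes over.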

## 5. GridSearchCV 

GridSearchCV tries all possible combinations of the hyperparameter values in the grid and selects the best combination based on cross-validation performance. Another option is RandomizedSearchCV, which only tries a fixed number of randomly sampled combinations and is preferred when the search grid is large.

In [23]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid 
param_grid = {
    'C' : [0.1, 1, 10], 
    'kernel' : ['linear', 'rbf'], 
    'gamma' : ['scale', 'auto']
}

clf = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
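The cost of an exhaustive search grows with the product of the list lengths in the grid. As a rough sketch, the number of candidates can be counted with ParameterGrid, a scikit-learn utility that enumerates the combinations. The variable names below are chosen for this illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}

# Number of hyperparameter combinations: 3 * 2 * 2
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (cv=5)
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

This grid has 12 candidates, so the search runs 60 cross-validation fits, plus one final fit of the best candidate on the full data because refit=True by default.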

In [24]:
clf.fit(X,y)

In [25]:
dir(clf)

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__sklearn_clone__',
 '__sklearn_tags__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_build_request_for_signature',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_doc_link_module',
 '_doc_link_template',
 '_doc_link_url_param_generator',
 '_estimator_type',
 '_format_results',
 '_get_default_requests',
 '_get_doc_link',
 '_get_metadata_request',
 '_get_param_names',
 '_get_routed_params_for_fit',
 '_get_scorers',
 '_get_tags',
 '_more_tags',
 '_parameter_constraints',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_run_sea

In [22]:
results = pd.DataFrame(clf.cv_results_)
results


In [29]:
results = results[['param_C', 'param_gamma', 'param_kernel', 'mean_test_score']]
results

| | param_C | param_gamma | param_kernel | mean_test_score |
|---|---|---|---|---|
| 0 | 0.1 | scale | linear | 0.973333 |
| 1 | 0.1 | scale | rbf | 0.92 |
| 2 | 0.1 | auto | linear | 0.973333 |
| 3 | 0.1 | auto | rbf | 0.946667 |
| 4 | 1.0 | scale | linear | 0.98 |
| 5 | 1.0 | scale | rbf | 0.966667 |
| 6 | 1.0 | auto | linear | 0.98 |
| 7 | 1.0 | auto | rbf | 0.98 |
| 8 | 10.0 | scale | linear | 0.973333 |
| 9 | 10.0 | scale | rbf | 0.98 |


In [30]:
clf.best_params_

{'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
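RandomizedSearchCV, mentioned at the start of this section, has a very similar interface. The following is a minimal sketch, not a cell from the original notebook. The log-uniform range for C, `n_iter=10` and `random_state=42` are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True, as_frame=True)

# Lists are sampled uniformly; distributions are sampled through their rvs method
param_distributions = {
    'C': loguniform(1e-1, 1e2),   # assumed range, for illustration only
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}

# n_iter sets how many random combinations are evaluated
search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=10, cv=5,
    scoring='accuracy', random_state=42
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Unlike the grid search, the cost here is set by n_iter rather than by the size of the search space, which is why it scales better to large or continuous spaces.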

We can also use GridSearchCV and RandomizedSearchCV to select the best combination of model and hyperparameters. To automate this process with a loop, we need a data structure that stores, for each model, an instance of it and the param_grid to be searched. A good option is a dictionary with the following format: 'model_name' : {'model': \<model_instance\>, 'params': \<param_grid\>}

In [88]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

mp = {
    'svm': {
        'model': SVC(),
        'params': {
            'C' : [0.1, 1, 10], 
            'kernel' : ['linear', 'rbf'], 
            'gamma' : ['scale', 'auto']
        }
    },

    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [5, 10, 100],
            'max_depth': [None, 5, 10 ,20]
        }
    },

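    'logistic': {
        'model': LogisticRegression(),
        'params': {
            # l2 is supported by both solvers below; 'elasticnet' would
            # require solver='saga' together with an l1_ratio value
            'penalty': ['l2'],
            'C': [0.1, 1, 10],
            'solver': ['lbfgs', 'liblinear']
        }
    }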
}

In [89]:
results = []
# Use a separate loop variable so that the mp dictionary is not overwritten
for model_name, config in mp.items():
    clf = GridSearchCV(config['model'], config['params'], cv=5)
    clf.fit(X,y)
    
    results.append({
        'model': model_name, 
        'best_params': clf.best_params_,
        'best_score': clf.best_score_
    })
    

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [91]:
pd.DataFrame(results)

| | model | best_params | best_score |
|---|---|---|---|
| 0 | svm | {'C': 1, 'gamma': 'scale', 'kernel': 'linear'} | 0.98 |
| 1 | random_forest | {'max_depth': 5, 'n_estimators': 100} | 0.966667 |
| 2 | logistic | {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'} | 0.98 |
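The convergence warning printed during the search suggests either increasing max_iter or scaling the data. One possible way to do both is to wrap the scaler and the model in a pipeline and search over the pipeline. This is a sketch under assumptions: `max_iter=1000` is an arbitrary value, and the step name `logisticregression` is the one make_pipeline generates automatically from the class name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)

# The scaler is refitted on the training part of every fold,
# so no information from the validation part leaks into it
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'logisticregression__C': [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

A pipeline entry like this could also be stored under the 'model' key of the dictionary above, provided the keys of its 'params' grid use the prefixed names.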


## 6. Conclusion

In this tutorial, we explored different approaches to model evaluation and hyperparameter tuning, including K-Fold Cross-Validation, cross_val_score(), GridSearchCV, and RandomizedSearchCV.
<ul>
<li>GridSearchCV provides an exhaustive search for the best hyperparameters but can be computationally expensive.</li>

<li>RandomizedSearchCV is a more efficient alternative when dealing with large parameter spaces.</li>

<li>Using cross-validation techniques ensures a more reliable estimation of model performance.</li>
</ul>
By effectively applying these techniques, you can improve model performance while efficiently managing computational resources.