# Hyperparameter Tuning using GridSearchCV

Link to the YouTube video tutorial: https://www.youtube.com/watch?v=HdlDYng8g9s&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=17

# Load the dataset

In [233]:
from sklearn import datasets

# load the iris dataset to the variable called iris
iris = datasets.load_iris()

# Data exploration

In [234]:
import pandas as pd

# create a dataframe called df from the data attribute of the dataset, using the feature_names attribute of the dataset as the column names
df = pd.DataFrame(iris.data, columns = iris.feature_names)

# show the df dataframe
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [235]:
# add a new column called flower to the df dataframe, holding the integer class labels from the target attribute of the dataset
df['flower'] = iris.target

# show the df dataframe
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [236]:
# map the integer labels of the flower column to the text labels available in the target_names attribute of the dataset, using apply()
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])

# show the df dataframe
df 

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
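
The same label mapping can also be written without apply(), by indexing the target_names array directly with the integer labels. The following is a minimal sketch of that alternative, not part of the original tutorial; the names df_alt and labels_apply are introduced only for illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df_alt = pd.DataFrame(iris.data, columns=iris.feature_names)

# index the array of class names with the integer labels (NumPy integer-array indexing)
df_alt['flower'] = iris.target_names[iris.target]

# the apply()-based mapping used above, built separately for comparison
labels_apply = pd.Series(iris.target).apply(lambda x: iris.target_names[x])

# check that both approaches give the same labels
print((df_alt['flower'] == labels_apply).all())

# count how many samples belong to each class
print(df_alt['flower'].value_counts())
```

Both approaches should produce identical labels; the indexing form simply avoids calling a Python function once per row.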


# Data preprocessing

Split the dataset into train and test sets using the train_test_split method

In [237]:
from sklearn.model_selection import train_test_split

# split the dataset into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.3)

# Develop machine learning model (SVM)

In [238]:
from sklearn import svm

# create the SVM model with the specified parameters
model = svm.SVC(kernel='rbf',C=30,gamma='auto')

# train the SVM model
model.fit(X_train,Y_train)

# show the accuracy of the trained model
model.score(X_test,Y_test)

'''
Because train_test_split splits the dataset randomly, the accuracy of the trained model changes every time we run the script.
A single random split therefore gives an unreliable estimate of the model's performance. Hence, we use K-fold cross validation.
K-fold cross validation is a better approach because every sample of the dataset is used for testing exactly once. Here, we use cross_val_score to perform K-fold cross validation.
'''

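
To illustrate the point about random splits, the sketch below repeats the 70/30 split with several seeds and compares the resulting accuracies with the per-fold scores from cross_val_score. It is only an illustration under assumptions: the seeds 0 to 4 and the variable names split_scores and cv_scores are not part of the original notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()

# accuracy of the same model on five different random 70/30 splits
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    model = svm.SVC(kernel='rbf', C=30, gamma='auto')
    model.fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

# accuracy of the same model on each of the 5 cross-validation folds
cv_scores = cross_val_score(
    svm.SVC(kernel='rbf', C=30, gamma='auto'), iris.data, iris.target, cv=5)

print('single-split accuracies:', split_scores)
print('cross-validation accuracies:', cv_scores)
print('mean cross-validation accuracy:', cv_scores.mean())
```

Passing random_state to train_test_split makes an individual split reproducible, but the score still depends on which seed was chosen, which is why the averaged cross-validation score is usually preferred for comparing models.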

# Perform hyperparameter tuning by involving cross validation through cross_val_score, with same number of folds but on different parameter values of the machine learning model

1) Hyperparameter tuning is the process of selecting the optimal set of hyperparameters for a machine learning model.  <br  /> 
2) It is an important step in the model development process, as the choice of hyperparameters can have a significant impact on the model's performance.  <br  /> 
3) A hyperparameter is a parameter that is set before the learning process begins. These parameters are tunable and can directly affect how well a model trains. Some examples of hyperparameters in machine learning (source: https://deepai.org/machine-learning-glossary-and-terms/hyperparameter): <br  /> 
    1) Learning Rate
    2) Number of Epochs
    3) Momentum
    4) Number of branches in a decision tree
    5) Number of clusters in a clustering algorithm (like k-means)
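
As a small illustration of the difference between hyperparameters and learned parameters, the sketch below inspects an SVM before and after training. The values kernel='rbf', C=10 and gamma='auto' are chosen only for the example.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()

# hyperparameters are chosen by us before training
model = svm.SVC(kernel='rbf', C=10, gamma='auto')

# get_params() returns the hyperparameters of the (still untrained) model
hyperparams = model.get_params()
print(hyperparams['kernel'], hyperparams['C'], hyperparams['gamma'])

# learned parameters only exist after fit(), for example the support vectors
model.fit(iris.data, iris.target)
print('number of support vectors per class:', model.n_support_)
```

Hyperparameter tuning searches over values such as C and kernel; the support vectors are then found by the training algorithm for each choice.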

### Perform hyperparameter tuning by specifying the parameter values of the machine learning model manually

In [239]:
from sklearn.model_selection import cross_val_score

# perform cross validation with 5 folds on SVM model, with the specified model parameters
print(cross_val_score(svm.SVC(kernel='linear',C=10,gamma='auto'), iris.data, iris.target, cv=5))

# perform cross validation with 5 folds on SVM model, with the specified model parameters
print(cross_val_score(svm.SVC(kernel='rbf',C=10,gamma='auto'), iris.data, iris.target, cv=5))

# perform cross validation with 5 folds on SVM model, with the specified model parameters
print(cross_val_score(svm.SVC(kernel='rbf',C=20,gamma='auto'), iris.data, iris.target, cv=5))

[1.         1.         0.9        0.96666667 1.        ]
[0.96666667 1.         0.96666667 0.96666667 1.        ]
[0.96666667 1.         0.9        0.96666667 1.        ]


### Perform hyperparameter tuning by specifying the parameter values of the machine learning model using for loop

**Issue of using for loop in hyperparameter tuning:**  <br />
Every additional hyperparameter requires another nested for loop, so the code quickly becomes inconvenient when there are many parameters to consider in hyperparameter tuning


In [240]:
import numpy as np

kernels = ['rbf', 'linear']
C = [1,10,20]

# create an empty dictionary called avg_scores
avg_scores = {}

# get each value of the kernels variable/list at each iteration
for kval in kernels:
    # get each value of the C variable/list at each iteration
    for cval in C:
        # assign the values stored in kval and cval to the model's parameters respectively. Then, perform cross-validation with 5 folds.
        cv_scores = cross_val_score(svm.SVC(kernel=kval,C=cval,gamma='auto'), iris.data, iris.target, cv=5)
        # calculate the average of the model's accuracy on the 5 folds using average(). Then, save the result to the avg_scores dictionary with the key of format 'kernelvalue_Cvalue' 
        avg_scores[kval + '_' + str(cval)] = np.average(cv_scores)

# show the average of the model's accuracy on the 5 folds, at different parameter values
print(avg_scores)

{'rbf_1': 0.9800000000000001, 'rbf_10': 0.9800000000000001, 'rbf_20': 0.9666666666666668, 'linear_1': 0.9800000000000001, 'linear_10': 0.9733333333333334, 'linear_20': 0.9666666666666666}
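
One way to avoid writing one for loop per hyperparameter is to let ParameterGrid from sklearn.model_selection enumerate the combinations. This is a sketch of the same search as above with a single loop; it is an alternative formulation, not the tutorial's own code.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import ParameterGrid, cross_val_score

iris = datasets.load_iris()

# ParameterGrid yields one dictionary per combination of the listed values
grid = ParameterGrid({'kernel': ['rbf', 'linear'], 'C': [1, 10, 20]})

avg_scores = {}
for params in grid:
    # unpack the current combination into the model's arguments
    cv_scores = cross_val_score(
        svm.SVC(gamma='auto', **params), iris.data, iris.target, cv=5)
    avg_scores[params['kernel'] + '_' + str(params['C'])] = np.average(cv_scores)

print(avg_scores)
```

Adding another hyperparameter then only means adding another key to the dictionary passed to ParameterGrid. The order of the keys in avg_scores may differ from the nested-loop version.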


### Perform hyperparameter tuning by specifying the parameter values of the machine learning model using GridSearchCV

GridSearchCV uses the same concept as combining cross validation with for loops in hyperparameter tuning  <br />

**Issue of using GridSearchCV in hyperparameter tuning:**  <br />
The computation cost becomes very high if the range of parameter values we consider in hyperparameter tuning is large, because GridSearchCV evaluates every combination of the parameter values.
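
To make the cost concrete, the number of models trained during the search is the number of parameter combinations multiplied by the number of folds. The sketch below counts this for the grid used in this notebook and for a larger grid; the larger grid (big_grid) is hypothetical and only meant to show how fast the count grows.

```python
from math import prod

def count_fits(param_grid, cv):
    """Return (number of combinations, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

# the grid used in this notebook: 3 values of C x 2 kernels
small_grid = {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}
print(count_fits(small_grid, cv=5))  # (6, 30)

# a hypothetical larger grid: 10 x 10 x 3 combinations
big_grid = {
    'C': list(range(1, 11)),
    'gamma': [10 ** -k for k in range(10)],
    'kernel': ['rbf', 'poly', 'sigmoid'],
}
print(count_fits(big_grid, cv=5))  # (300, 1500)
```

With the default refit=True, GridSearchCV additionally trains the best combination once more on the whole dataset after the search.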


In [241]:
from sklearn.model_selection import GridSearchCV

'''
Create a classifier/model called clf with different parameters using GridSearchCV.
The classifier is an SVM model with gamma fixed at auto, and different C (1,10,20) and kernel (rbf,linear) values.
When the classifier is fitted on a dataset, it conducts 5-fold cross validation for every combination of these values,
and does not return the training scores.
'''
clf = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1,10,20],
    'kernel': ['rbf','linear']
    }, 
    cv=5, 
    return_train_score=False)

# train the classifier according to the mentioned rules
clf.fit(iris.data, iris.target)

# show the cross-validation results
clf.cv_results_

# save the cross-validation results in the format of dataframe to the dataframe called df_clf
df_clf = pd.DataFrame(clf.cv_results_)

# show the df_clf dataframe
df_clf

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.008801,0.007385,0.0,0.0,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.0,0.0,0.006295,0.00771,1,linear,"{'C': 1, 'kernel': 'linear'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
2,0.000198,0.000397,0.004189,0.006077,10,rbf,"{'C': 10, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
3,0.003083,0.00323,0.000654,0.000858,10,linear,"{'C': 10, 'kernel': 'linear'}",1.0,1.0,0.9,0.966667,1.0,0.973333,0.038873,4
4,0.0,0.0,0.002046,0.004092,20,rbf,"{'C': 20, 'kernel': 'rbf'}",0.966667,1.0,0.9,0.966667,1.0,0.966667,0.036515,5
5,0.000197,0.000394,0.003055,0.006111,20,linear,"{'C': 20, 'kernel': 'linear'}",1.0,1.0,0.9,0.933333,1.0,0.966667,0.042164,6


In [242]:
# show only the columns of interest of the df_clf dataframe
df_clf[['param_C','param_kernel','mean_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,rbf,0.98
1,1,linear,0.98
2,10,rbf,0.98
3,10,linear,0.973333
4,20,rbf,0.966667
5,20,linear,0.966667


In [243]:
# show the attributes and methods of the clf classifier
dir(clf)

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__sklearn_clone__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_build_request_for_signature',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_doc_link_module',
 '_doc_link_template',
 '_doc_link_url_param_generator',
 '_estimator_type',
 '_format_results',
 '_get_default_requests',
 '_get_doc_link',
 '_get_metadata_request',
 '_get_param_names',
 '_get_routed_params_for_fit',
 '_get_scorers',
 '_get_tags',
 '_more_tags',
 '_parameter_constraints',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run

In [244]:
# show the best score of the model from the cross validation (the data available in the best_score_ attribute of clf)
print('The best score of the model from the cross validation:\n',clf.best_score_)

# show the best parameter values combination of the model from the cross validation (the data available in the best_params_ attribute of clf)
print('\nThe best parameter values combination of the model from the cross validation:\n',clf.best_params_)

The best score of the model from the cross validation:
 0.9800000000000001

The best parameter values combination of the model from the cross validation:
 {'C': 1, 'kernel': 'rbf'}
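
After the search, the fitted GridSearchCV object can be used directly for prediction. With the default refit=True, it retrains the best combination on the whole dataset and exposes it as best_estimator_. A minimal sketch, repeating the search above so that the block is self-contained (the names best_model and predictions are only for illustration):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

clf = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1, 10, 20],
    'kernel': ['rbf', 'linear']
    },
    cv=5,
    return_train_score=False)
clf.fit(iris.data, iris.target)

# the model retrained on the full dataset with the best parameter values
best_model = clf.best_estimator_
print(best_model)

# clf.predict() delegates to best_estimator_; here on the first 5 samples
predictions = clf.predict(iris.data[:5])
print(iris.target_names[predictions])
```

Predicting on samples that were also used for training is done here only to show the call; it does not measure how well the model generalizes.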


### Perform hyperparameter tuning by specifying the parameter values of the machine learning model using RandomizedSearchCV

RandomizedSearchCV uses a similar concept to GridSearchCV in hyperparameter tuning  <br />

**Advantage of using RandomizedSearchCV over GridSearchCV in hyperparameter tuning:**  <br />
RandomizedSearchCV reduces the computation cost of GridSearchCV. Instead of trying every combination of parameter values, it tries only a fixed number of randomly sampled combinations. The number of combinations to try is set with the n_iter parameter.


In [245]:
from sklearn.model_selection import RandomizedSearchCV

'''
Create a classifier/model called rs with different parameters using RandomizedSearchCV.
The classifier is an SVM model with gamma fixed at auto, and different C (1,10,20) and kernel (rbf,linear) values.
When the classifier is fitted on a dataset, it conducts 5-fold cross validation for every sampled combination,
and does not return the training scores.
'''
rs = RandomizedSearchCV(svm.SVC(gamma='auto'), {
    'C': [1,10,20],
    'kernel': ['rbf','linear']
    }, 
    cv=5, 
    return_train_score=False,
    # only run 2 iterations, each trying 1 randomly chosen combination of the parameter values
    # (as opposed to GridSearchCV, which tries all 6 combinations, as shown in the result above).
    # so whenever we run the script, the combinations of parameter values shown in the result may change.
    n_iter=2)

# train the classifier according to the mentioned rules
rs.fit(iris.data,iris.target)

# save only the columns of interest of the cross-validation results as a dataframe called df_rs
df_rs = pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','mean_test_score']]

# show the df_rs dataframe
df_rs

Unnamed: 0,param_C,param_kernel,mean_test_score
0,10,rbf,0.98
1,1,rbf,0.98


# Choose the best machine learning model for a given problem

1) Concepts involved in this section:
    1) Define the parameter grid as a dictionary
    2) Perform hyperparameter tuning for different models
    3) Select the best model

## Define the parameter grid as a dictionary

In [246]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# create/initialize a dictionary called model_params to define the parameter grid of each model
model_params = {
    # try an SVM model with these parameter values
    'svm':{
        # the arguments passed to the model are the parameters we keep fixed
        'model': svm.SVC(gamma='auto'),
        # the parameters we want to tune
        'params': {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }
    },
    # try a random forest model with these parameter values
    'random_forest':{
        'model': RandomForestClassifier(),
        # the parameters we want to tune
        'params': {
            'n_estimators': [1,5,10],
        }
    },
    # try a logistic regression model with these parameter values
    'logistic_regression':{
        # the arguments passed to the model are the parameters we keep fixed
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        # the parameters we want to tune
        'params': {
            'C': [1,5,10],
        }
    }
}

## Perform hyperparameter tuning for different models

In [247]:
scores = []

'''
Use a for loop to create a classifier/model called clf_new for each model and its parameter grid using GridSearchCV.
When the classifier is fitted on a dataset, it conducts 5-fold cross validation for every combination of parameter values,
and does not return the training scores.
'''

for model_name, mp in model_params.items():
    clf_new = GridSearchCV(mp['model'],mp['params'], cv=5, return_train_score=False)

    clf_new.fit(iris.data,iris.target)

    # at each iteration, append a dictionary holding the model name, its best score and its best parameter values
    scores.append({
        'model': model_name,
        'best_score': clf_new.best_score_,
        'best_params': clf_new.best_params_
    })

# show the data available in scores variable
scores

[{'model': 'svm',
  'best_score': 0.9800000000000001,
  'best_params': {'C': 1, 'kernel': 'rbf'}},
 {'model': 'random_forest',
  'best_score': 0.9600000000000002,
  'best_params': {'n_estimators': 5}},
 {'model': 'logistic_regression',
  'best_score': 0.9666666666666668,
  'best_params': {'C': 5}}]

## Select the best model

**According to the df dataframe below, the conclusion is:** <br />
The best model for the iris dataset problem is the SVM model, because it achieves the highest mean cross-validation score (0.98) among the models tried, with the parameters given in the best_params column.

In [248]:
# save the data available in the scores variable as a dataframe called df, manually specifying the column names
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])

# show the df dataframe
df

Unnamed: 0,model,best_score,best_params
0,svm,0.98,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.96,{'n_estimators': 5}
2,logistic_regression,0.966667,{'C': 5}
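
The best model can also be picked programmatically rather than by reading the table. The sketch below rebuilds the scores list from the output shown above (the values are copied from this notebook's results) and selects the row with the highest best_score; the names df_scores and best_row are only for illustration.

```python
import pandas as pd

# scores copied from the output of the tuning loop above
scores = [
    {'model': 'svm', 'best_score': 0.9800000000000001,
     'best_params': {'C': 1, 'kernel': 'rbf'}},
    {'model': 'random_forest', 'best_score': 0.9600000000000002,
     'best_params': {'n_estimators': 5}},
    {'model': 'logistic_regression', 'best_score': 0.9666666666666668,
     'best_params': {'C': 5}},
]

df_scores = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

# idxmax() returns the index label of the row with the highest score
best_row = df_scores.loc[df_scores['best_score'].idxmax()]
print(best_row['model'], best_row['best_score'], best_row['best_params'])
```

Because RandomForestClassifier is randomized, its score can change between runs, so the numbers in a fresh run may differ slightly from the ones copied here.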


### Extra information: Visualize how the dictionary works in specifying the parameters of GridSearchCV at each iteration

In [249]:
# loop through the items of the model_params dictionary; at each iteration GridSearchCV would receive mp['model'] and mp['params']
for iteration, (model_name, mp) in enumerate(model_params.items(), start=1):
    print('Iteration: ' + str(iteration) + '\n')
    print(model_name) # show the key of the outer dictionary (the model name)
    print('\n')
    print(mp) # show the inner dictionary stored under that key
    print('\n')
    print(mp['model']) # show the estimator stored under the 'model' key of the inner dictionary
    print('\n')
    print(mp['params']) # show the parameter grid stored under the 'params' key of the inner dictionary
    print('\n---------------\n')

Iteration: 1

svm


{'model': SVC(gamma='auto'), 'params': {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}}


SVC(gamma='auto')


{'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}

---------------

Iteration: 2

random_forest


{'model': RandomForestClassifier(), 'params': {'n_estimators': [1, 5, 10]}}


RandomForestClassifier()


{'n_estimators': [1, 5, 10]}

---------------

Iteration: 3

logistic_regression


{'model': LogisticRegression(solver='liblinear'), 'params': {'C': [1, 5, 10]}}


LogisticRegression(solver='liblinear')


{'C': [1, 5, 10]}

---------------

