
# Hyperparameter Tuning Using GridSearchCV API

## Parameter vs Hyperparameter

### Parameter (model parameter)
  * A parameter is a configuration variable that is internal to the model and whose value can be estimated from the data
  * They are required by the model when making predictions
  * They are estimated or learned from data
  * They are often not set manually by the practitioner
  * They are often saved as part of the learned model
  * Some examples of model parameters include:
    * The weights in an artificial neural network
    * The support vectors in a support vector machine
    * The coefficients in a linear regression or logistic regression
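
As a small illustration of the examples above, the sketch below fits two scikit-learn models on the iris data and inspects what they learned. The variable names (`log_reg`, `svc`) and the `max_iter=1000` setting are choices made for this sketch only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The coefficients of a logistic regression are learned during fit()
log_reg = LogisticRegression(max_iter=1000).fit(X, y)
print(log_reg.coef_.shape)        # one row per class, one column per feature
print(log_reg.intercept_.shape)   # one intercept per class

# The support vectors of an SVM are also picked from the data during fit()
svc = SVC(kernel='linear').fit(X, y)
print(svc.support_vectors_.shape) # (number of support vectors, number of features)
```

None of these values were set by hand. They exist only after `fit()` has seen the data.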

### Hyperparameter
  * Hyperparameters are external to the model, and their values cannot be estimated from the data
  * They are often specified by the practitioner (for example, by comparing model scores on held-out validation data)
  * They are often tuned for a given predictive modeling problem.
  * They can often be set using heuristics
  * Some examples of model hyperparameters include:
    * The learning rate for training a neural network
    * The C and sigma hyperparameters for support vector machines (scikit-learn's SVC exposes the RBF kernel width as gamma)
    * The k in k-nearest neighbors
    * The number of trees (n_estimators) in the random forest algorithm
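
In scikit-learn, hyperparameters like the ones listed above are passed to the estimator's constructor before any data is seen. The sketch below shows this, with values picked arbitrarily for illustration. `get_params()` returns them without the model having been fitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hyperparameters are chosen by the practitioner when the estimator is created
svc = SVC(kernel='rbf', C=10, gamma='auto')
knn = KNeighborsClassifier(n_neighbors=3)
forest = RandomForestClassifier(n_estimators=50)

# get_params() lists the hyperparameters; no call to fit() is needed
print(svc.get_params()['C'], svc.get_params()['gamma'], svc.get_params()['kernel'])
print(knn.get_params()['n_neighbors'])
print(forest.get_params()['n_estimators'])
```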

## Problem Statement

* For the iris flower dataset in the sklearn library, we are going to find the best model and the best hyperparameters using the GridSearchCV or RandomizedSearchCV API for hyperparameter tuning

## Reference
[What is the Difference Between a Parameter and a Hyperparameter?](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/)

## Load the IRIS flower data from sklearn.datasets

In [0]:
from sklearn.datasets import load_iris

iris = load_iris()
dir(iris)

['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']

## Understanding the data
* iris.DESCR > Complete description of the dataset
* iris.data > The data to learn from. Each sample is an array of 4 feature values. There are 150 samples in total
* iris.feature_names > Array of all 4 feature names ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
* iris.filename > CSV file name
* iris.target > The classification labels. Every sample has one classification label (0, 1 or 2). Here 0 is setosa, 1 is versicolor and 2 is virginica
* iris.target_names > The meaning of the labels. It is an array >> ['setosa', 'versicolor', 'virginica']
* From the above details it is clear that our data is 'iris.data' and our labels are 'iris.target'
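
The attributes described above can be checked directly. This is a minimal sketch that prints the shapes and the number of samples per class. `np.bincount` is used here only as a convenient way to count labels.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)           # (number of samples, number of features)
print(iris.target.shape)         # one label per sample
print(iris.feature_names)
print(iris.target_names)
print(np.bincount(iris.target))  # number of samples for labels 0, 1 and 2
```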

Let's create a dataframe of our features (iris.data) and labels (iris.target)

In [6]:
import pandas as pd

df = pd.DataFrame(iris.data)
df.head()

Unnamed: 0,0,1,2,3
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [14]:
# adding target and flower columns to the dataframe (the output below shows both)
df['target'] = iris.target
df['flower'] = iris.target
df.head()

Unnamed: 0,0,1,2,3,target,flower
0,5.1,3.5,1.4,0.2,0,0
1,4.9,3.0,1.4,0.2,0,0
2,4.7,3.2,1.3,0.2,0,0
3,4.6,3.1,1.5,0.2,0,0
4,5.0,3.6,1.4,0.2,0,0


In [15]:
# updating the flower column target values with target_names using lambda function
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df.head()

Unnamed: 0,0,1,2,3,target,flower
0,5.1,3.5,1.4,0.2,0,setosa
1,4.9,3.0,1.4,0.2,0,setosa
2,4.7,3.2,1.3,0.2,0,setosa
3,4.6,3.1,1.5,0.2,0,setosa
4,5.0,3.6,1.4,0.2,0,setosa


## Approach 1: Use train_test_split and manually tune parameters by trial and error

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data,iris.target,test_size=0.3)

print("len of X_train is %s" % (len(X_train)))
print("len of X_test is %s" % (len(X_test)))
print("len of y_train is %s" % (len(y_train)))
print("len of y_test is %s" % (len(y_test)))

len of X_train is 105
len of X_test is 45
len of y_train is 105
len of y_test is 45
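
The split above is random, so scores change from run to run. As a hedged variation, the sketch below fixes `random_state` and passes `stratify` so that each class keeps the same share in both parts. The seed value 42 and the names `X_tr`, `X_te`, `y_tr`, `y_te` are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# random_state makes the split repeatable; stratify keeps class proportions equal
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

print(len(X_tr), len(X_te))
print(np.bincount(y_tr))  # samples per class in the training part
print(np.bincount(y_te))  # samples per class in the test part
```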


### Let's train the model using the SVM algorithm
* **Here kernel, gamma and C are hyperparameters**
* Gamma: With a high value of gamma, the decision boundary depends only on the points close to it, whereas with a low value of gamma the SVM also considers far-away points when deciding the decision boundary
* Regularization parameter (C): A large C tends to overfit, which leads to lower bias and higher variance. A small C tends to underfit, which leads to higher bias and lower variance
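
One way to see the effect of gamma and C described above is to compare training and test accuracy over a few values. The sketch below does that; the value ranges and the fixed `random_state=0` are assumptions made for illustration. A training score far above the test score would suggest overfitting, while low scores on both would suggest underfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = []
for gamma in [0.01, 1, 100]:
    for C in [0.1, 1, 100]:
        model = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_tr, y_tr)
        # store accuracy on the data the model saw and on the data it did not see
        results.append((gamma, C, model.score(X_tr, y_tr), model.score(X_te, y_te)))

for gamma, C, train_acc, test_acc in results:
    print(f"gamma={gamma:<6} C={C:<5} train={train_acc:.3f} test={test_acc:.3f}")
```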

In [21]:
from sklearn.svm import SVC

model = SVC(kernel='rbf',C= 30, gamma='auto')
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.9555555555555556

## Approach 2: Let's train the model using K Fold Cross Validation
* Manually try supplying models with different hyperparameters to the cross_val_score function with 5 fold cross validation
* Here, along with kernel, gamma and C, K = 5 is now also a hyperparameter
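
To make clear what 5 fold cross validation does, the sketch below writes the loop out by hand and compares it with `cross_val_score`. It assumes the documented scikit-learn behaviour that an integer `cv` with a classifier uses stratified folds without shuffling, so both results should agree. The name `manual_scores` is specific to this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # train a fresh model on 4 folds and score it on the remaining fold
    fold_model = SVC(kernel='linear', C=10, gamma='auto')
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

library_scores = cross_val_score(SVC(kernel='linear', C=10, gamma='auto'), X, y, cv=5)

print(np.array(manual_scores))
print(library_scores)
```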

In [25]:
from sklearn.model_selection import cross_val_score

cross_val_score(SVC(kernel='linear',C=10,gamma='auto'),iris.data,iris.target,cv =5)

array([1.        , 1.        , 0.9       , 0.96666667, 1.        ])

In [26]:
cross_val_score(SVC(kernel='rbf',C=10,gamma= 'auto'),iris.data,iris.target,cv = 5)

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [27]:
cross_val_score(SVC(kernel='rbf',C=20, gamma='auto'),iris.data,iris.target,cv=5)

array([0.96666667, 1.        , 0.9       , 0.96666667, 1.        ])

Since the above approach is tiresome and manual, we can try to automate it using a for loop

In [30]:
import numpy as np
# let's create lists of kernel and C values to try
kernel = ['rbf','linear']
C = [1, 10, 20]

for k in kernel:
  for c in C:
    score = cross_val_score(SVC(kernel=k,C=c, gamma='auto'),iris.data,iris.target,cv=5)
    print("For kernel: %s , and C: %s average score is: %s" % (k,c,np.average(score)))

For kernel: rbf , and C: 1 average score is: 0.9800000000000001
For kernel: rbf , and C: 10 average score is: 0.9800000000000001
For kernel: rbf , and C: 20 average score is: 0.9666666666666668
For kernel: linear , and C: 1 average score is: 0.9800000000000001
For kernel: linear , and C: 10 average score is: 0.9733333333333334
For kernel: linear , and C: 20 average score is: 0.9666666666666666


From the above results we can say that rbf with C=1 or C=10, or linear with C=1, gives the best performance

## Approach 3: Using GridSearchCV
* We can use a sklearn API like GridSearchCV to automate the hyperparameter tuning
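
Before running the search, it can help to see which candidates a grid contains. The sketch below uses scikit-learn's `ParameterGrid` helper to list them for the same values of C and kernel used in this notebook. The names `param_grid` and `combinations` are chosen for this sketch.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}

# every combination of the listed values becomes one candidate
combinations = list(ParameterGrid(param_grid))
for params in combinations:
    print(params)

# with cv=5 each candidate is fitted once per fold
print(len(combinations), "candidates x 5 folds =", len(combinations) * 5, "fits")
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is the cost that a randomized search tries to limit.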

In [32]:
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(SVC(gamma='auto'), {'C': [1,10,20],'kernel': ['rbf','linear']}, cv=5, return_train_score=False)
clf.fit(iris.data, iris.target)
clf.cv_results_

{'mean_fit_time': array([0.00096803, 0.00042634, 0.00054212, 0.00050712, 0.00065579,
        0.00044713]),
 'mean_score_time': array([0.00047021, 0.00029387, 0.00028801, 0.00030537, 0.00034809,
        0.00027599]),
 'mean_test_score': array([0.98      , 0.98      , 0.98      , 0.97333333, 0.96666667,
        0.96666667]),
 'param_C': masked_array(data=[1, 1, 10, 10, 20, 20],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['rbf', 'linear', 'rbf', 'linear', 'rbf', 'linear'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 20, 'kernel': 'rbf'},
  {'C': 20, 'kernel': 'linear'}],
 'rank_test_score': array([1, 1, 1, 4, 5, 6], dtype=int32),
 'split0_test_score': array([0.96666667, 0.96

Let's put the above results in a dataframe for better visualization

In [33]:
df =pd.DataFrame(clf.cv_results_)
df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.000968,0.00039,0.00047,0.000171,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.000426,2.3e-05,0.000294,1.7e-05,1,linear,"{'C': 1, 'kernel': 'linear'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
2,0.000542,2.3e-05,0.000288,9e-06,10,rbf,"{'C': 10, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
3,0.000507,7.6e-05,0.000305,3.3e-05,10,linear,"{'C': 10, 'kernel': 'linear'}",1.0,1.0,0.9,0.966667,1.0,0.973333,0.038873,4
4,0.000656,0.000116,0.000348,5.9e-05,20,rbf,"{'C': 20, 'kernel': 'rbf'}",0.966667,1.0,0.9,0.966667,1.0,0.966667,0.036515,5
5,0.000447,1.3e-05,0.000276,1.8e-05,20,linear,"{'C': 20, 'kernel': 'linear'}",1.0,1.0,0.9,0.933333,1.0,0.966667,0.042164,6


In [35]:
# Visualize important columns only
df[['param_C','param_kernel','mean_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,rbf,0.98
1,1,linear,0.98
2,10,rbf,0.98
3,10,linear,0.973333
4,20,rbf,0.966667
5,20,linear,0.966667


In [37]:
# get the best parameters
clf.best_params_

{'C': 1, 'kernel': 'rbf'}

In [38]:
#get best score
clf.best_score_

0.9800000000000001
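
The best score above is a cross-validation score on data that was also used to pick the hyperparameters. A hedged sketch of a stricter check follows: tune on a training split only, then score the refitted `best_estimator_` on a test split the search never saw. The split settings and the names `search` and `best_model` are assumptions of this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# the search only sees the training split
search = GridSearchCV(SVC(gamma='auto'), {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}, cv=5)
search.fit(X_tr, y_tr)

# with the default refit=True, best_estimator_ is retrained on all of X_tr
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)

print(search.best_params_)
print(test_accuracy)
print(best_model.predict(X_te[:5]))
```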

## Using RandomizedSearchCV
* RandomizedSearchCV reduces the number of iterations by trying only a random selection of the parameter combinations. This is useful when you have too many parameters to try and your training time is long. It helps reduce the cost of computation

In [41]:
from sklearn.model_selection import RandomizedSearchCV

rs = RandomizedSearchCV(SVC(gamma='auto'), {'C': [1,10,20], 'kernel': ['rbf','linear']}, cv=5, return_train_score=False, n_iter=2)
rs.fit(iris.data, iris.target)
pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','mean_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,linear,0.98
1,1,rbf,0.98


**Note that since our 'n_iter' parameter is 2, the API tries only two combinations of the given hyperparameters and returns the results. In the previous step, the GridSearchCV API tried all 6 combinations**
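
RandomizedSearchCV can also sample from a distribution instead of a fixed list. The sketch below is one possible setup, not part of the original notebook: it draws C from `scipy.stats.loguniform` over an assumed range of 0.1 to 100 and fixes `random_state` so the sampled candidates are repeatable. The names `param_distributions` and `rs_dist` are specific to this sketch.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_distributions = {
    'C': loguniform(0.1, 100),    # continuous range, sampled on a log scale
    'kernel': ['rbf', 'linear'],  # lists are sampled uniformly
}

rs_dist = RandomizedSearchCV(
    SVC(gamma='auto'), param_distributions, n_iter=4, cv=5, random_state=0
)
rs_dist.fit(X, y)

print(pd.DataFrame(rs_dist.cv_results_)[['param_C', 'param_kernel', 'mean_test_score']])
```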

## Similarly we can try different models with different hyperparameters

In [0]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}

In [43]:
model_params.items()

dict_items([('svm', {'model': SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False), 'params': {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}}), ('random_forest', {'model': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False), 'params': {'n_estimators': [1, 5, 10]}}), ('logistic_regression', {'model': Logisti

In [44]:
scores = []
# model_params.items() yields (key, value) pairs: model_name receives the key and mp receives the value (a dict holding the model and its parameter grid)
for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(iris.data, iris.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df

Unnamed: 0,model,best_score,best_params
0,svm,0.98,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.96,{'n_estimators': 5}
2,logistic_regression,0.966667,{'C': 5}


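**Based on the above, I can conclude that SVM with C=1 and kernel='rbf' is the best model for solving my problem of iris flower classification. Note that several SVM settings tie at 0.98 in the earlier grid search, and GridSearchCV reports the first of them**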