<a href="https://colab.research.google.com/github/soujanya-vattikolla/ML-Basics-Definitions/blob/main/HyperparameterTuning(GridSearchCV).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Hyperparameter Tuning

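Hyperparameters are settings chosen before training (for example, `C` and `kernel` for an SVM) rather than learned from the data. The process of choosing the values that give the best model performance is called Hyperparameter Tuning.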

**GridSearchCV** finds the best hyperparameters by evaluating every combination in a given grid with cross-validation and keeping the combination with the highest score.
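A grid search tries every combination of the listed values. As a small sketch, scikit-learn's `ParameterGrid` can enumerate those combinations without fitting anything. The grid below reuses the SVM values that appear later in this notebook; the variable names are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: the same values as the SVM entry used later in this notebook.
param_grid = {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}

# ParameterGrid expands the dictionary into one dict per combination.
candidates = list(ParameterGrid(param_grid))

print(len(candidates))  # 3 values of C x 2 kernels
for params in candidates:
    print(params)
```

With `cv=5`, each of these candidates is fitted five times, so the cost of a grid search grows quickly as more values are added.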

**RandomizedSearchCV** is another class in the sklearn library that serves the same purpose as GridSearchCV, but instead of running an exhaustive search it evaluates only a fixed number of randomly sampled combinations, which saves computation time and resources.

Exercise: Machine Learning Finding Optimal Model and Hyperparameters

For the digits dataset in sklearn.datasets, try the following classifiers and find the one that gives the best performance. Also find the optimal parameters for that classifier.

In [46]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

In [47]:
# Load digits dataset
import pandas as pd
from sklearn.datasets import load_digits

In [48]:
digits_dataset = load_digits()

In [49]:
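# Create a DataFrame of the 64 pixel features (one column per pixel).
# Pass only the data: a second positional argument would become the row index.

digit_df = pd.DataFrame(digits_dataset.data)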
digit_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


In [50]:
digit_df['target'] = digits_dataset.target
digit_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,55,56,57,58,59,60,61,62,63,target
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0,1
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0,2
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0,3
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0,4


In [51]:
# different models with different hyperparameters

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params': {
            'C': [1, 10, 20],
            'kernel': ['rbf', 'linear']
        }
    },
    'randomforest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [1, 5, 10]
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(solver='liblinear', multi_class='auto'),
        'params': {
            'C': [1, 5, 10]
        }
    },
    'naive_bayes_gaussian': {
        'model': GaussianNB(),
        'params': {}
    },
    'naive_bayes_MultinomialNB': {
        'model': MultinomialNB(),
        'params': {}
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini', 'entropy'],
        }
    }
}

In [52]:
# Use GridSearchCV

from sklearn.model_selection import GridSearchCV

In [53]:
scores = []

# Run a 5-fold cross-validated grid search for each model and record its best result
for model_name, modelparams in model_params.items():
    classf = GridSearchCV(modelparams['model'], modelparams['params'], cv=5, return_train_score=False)
    classf.fit(digits_dataset.data, digits_dataset.target)
    scores.append({
        'model': model_name,
        'best_score': classf.best_score_,    # mean cross-validated accuracy of the best combination
        'best_params': classf.best_params_
    })

In [54]:
# creating a dataframe for the scores

scores_df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
scores_df

Unnamed: 0,model,best_score,best_params
0,svm,0.947697,"{'C': 1, 'kernel': 'linear'}"
1,randomforest,0.898743,{'n_estimators': 10}
2,logistic_regression,0.922114,{'C': 1}
3,naive_bayes_gaussian,0.806928,{}
4,naive_bayes_MultinomialNB,0.87035,{}
5,decision_tree,0.812482,{'criterion': 'entropy'}


From the scores above, 'svm' with parameters (C=1, kernel='linear') is the best model, with a mean cross-validated accuracy of about 95% (0.9477).
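The score reported by GridSearchCV is an average over cross-validation folds of the same data used to pick the parameters. One common way to double-check the chosen model is to hold out a test set that the search never sees. The following is a minimal sketch of that idea, not part of the original exercise; the 80/20 split, `random_state=0`, and the variable names are assumptions made for illustration.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

digits = load_digits()

# Hold out 20% of the samples; test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0, stratify=digits.target
)

# Search the SVM grid on the training portion only.
search = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
search.fit(X_train, y_train)

# best_estimator_ is refit on all of X_train with the best parameters (refit=True is the default).
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, search.best_score_, test_accuracy)
```

If the held-out accuracy is close to `best_score_`, the cross-validated estimate is probably a fair summary of how the model generalizes.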

**RandomizedSearchCV**

RandomizedSearchCV reduces the number of iterations by evaluating only a fixed number of randomly chosen parameter combinations instead of every combination in the grid.

This is useful when there are too many parameter combinations to try or when training is slow, because it reduces the cost of computation.
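As a rough sketch of how this looks in code, the example below runs RandomizedSearchCV on the digits dataset with the same SVM grid used earlier. The values `n_iter=2` and `random_state=0` are assumptions chosen only to keep the example small and repeatable.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV

digits = load_digits()

random_search = RandomizedSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    n_iter=2,        # evaluate only 2 of the 6 possible combinations
    cv=5,
    random_state=0,  # illustrative seed so the sampled combinations are repeatable
)
random_search.fit(digits.data, digits.target)

# cv_results_['params'] lists the combinations that were actually tried.
for params, score in zip(random_search.cv_results_['params'],
                         random_search.cv_results_['mean_test_score']):
    print(params, score)

print(random_search.best_params_, random_search.best_score_)
```

Because only a sample of the grid is evaluated, the best combination found here is not guaranteed to match the one an exhaustive GridSearchCV would find; increasing `n_iter` trades more computation for better coverage.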
