<a href="https://colab.research.google.com/github/raj-vijay/ml/blob/master/14_Hyperparameter_Tuning_Diabetes_Prediction_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Hyperparameter tuning**

Hyperparameter optimization, or tuning, is the problem of choosing a set of optimal hyperparameters for a machine learning algorithm. A hyperparameter is a parameter whose value controls the learning process and is set before training starts. By contrast, the values of other parameters, such as node weights, are learned from the data.

The same kind of machine learning model can require different constraints, weights, or learning rates to generalize to different data patterns. These settings are called hyperparameters, and they have to be tuned so that the model can solve the machine learning problem optimally.

Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model, that is, one that minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate how well each candidate generalizes.
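As a minimal sketch of such an objective function, the snippet below maps one hyperparameter value (C for logistic regression) to an estimated loss, taken here as 1 minus the mean 5-fold cross-validated accuracy. The synthetic data from make_classification and the names X_demo, y_demo, and objective are assumptions for illustration only; they are not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

def objective(C):
    """Return an estimated loss (1 - mean cross-validated accuracy) for a given C."""
    model = LogisticRegression(C=C, solver='liblinear')
    scores = cross_val_score(model, X_demo, y_demo, cv=5)  # accuracy per fold
    return 1.0 - scores.mean()

# Evaluate the objective for a few candidate values of C
for C in (0.01, 1.0, 100.0):
    print("C = {}: estimated loss = {:.3f}".format(C, objective(C)))
```

A tuner only needs to call this function repeatedly and keep the hyperparameter value with the lowest returned loss.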

**Grid search**

Grid search, or a parameter sweep, is a hyperparameter optimization technique that involves an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. 

A grid search algorithm is typically guided by a performance metric, measured by cross-validation on the training set or evaluation on a held-out validation set.
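The exhaustive search itself can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions, not how scikit-learn implements it: grid_search, toy_loss, and the penalty_weight hyperparameter are hypothetical names, and the toy loss stands in for a cross-validated metric.

```python
from itertools import product

def grid_search(objective, grid):
    """Evaluate objective on every combination in grid; return (best_params, best_loss)."""
    names = list(grid)
    best_params, best_loss = None, float('inf')
    # product(...) enumerates the full Cartesian product of the candidate values
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Hypothetical toy loss with its minimum at C=1.0, penalty_weight=0.1
def toy_loss(C, penalty_weight):
    return (C - 1.0) ** 2 + (penalty_weight - 0.1) ** 2

demo_grid = {'C': [0.1, 1.0, 10.0], 'penalty_weight': [0.01, 0.1, 1.0]}
best_params, best_loss = grid_search(toy_loss, demo_grid)
print(best_params, best_loss)
```

The number of evaluations is the product of the list lengths (here 3 x 3 = 9), which is why grid search becomes expensive as more hyperparameters are added.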

![alt text](https://raw.githubusercontent.com/raj-vijay/ml/master/images/Grid_Search.jpg)

**Pima Indians Diabetes Database**

The Pima Indians Diabetes dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases.

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. 

In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset is imported from Kaggle.

https://www.kaggle.com/uciml/pima-indians-diabetes-database

Install the Kaggle package to access the diabetes dataset from Kaggle.

In [1]:
!pip install kaggle



Create the .kaggle directory in the home directory (/root on Colab) to hold the Kaggle authentication JSON.

In [0]:
!mkdir -p ~/.kaggle

Copy the uploaded kaggle.json file to ~/.kaggle/kaggle.json.

In [0]:
!cp /content/kaggle.json ~/.kaggle/kaggle.json

Protect the Kaggle JSON file, since it contains an API credential.

chmod 600 (chmod a+rwx,u-x,g-rwx,o-rwx) sets permissions so that the (U)ser / owner can read and write but not execute, while the (G)roup and (O)thers cannot read, write, or execute.

In [0]:
!chmod 600 /root/.kaggle/kaggle.json
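The permission bits described above can also be set and checked from Python. The following is a small sketch on a throwaway temporary file (a stand-in for kaggle.json, so no real credential is touched); it assumes a POSIX system such as the Colab VM, and demo_path is a hypothetical name.

```python
import os
import stat
import tempfile

# Create a throwaway file as a stand-in for kaggle.json
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

# Equivalent of `chmod 600`: owner read/write only
os.chmod(demo_path, 0o600)

# Read the permission bits back and show them in octal and ls-style form
file_mode = os.stat(demo_path).st_mode
mode = stat.S_IMODE(file_mode)
print(oct(mode), stat.filemode(file_mode))

# Clean up the temporary file
os.remove(demo_path)
```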

Download the diabetes dataset from Kaggle.

In [5]:
!kaggle datasets download -d uciml/pima-indians-diabetes-database

Downloading pima-indians-diabetes-database.zip to /content
  0% 0.00/8.91k [00:00<?, ?B/s]
100% 8.91k/8.91k [00:00<00:00, 8.01MB/s]


In [6]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the diabetes dataset into a DataFrame: df
df = pd.read_csv('pima-indians-diabetes-database.zip', compression='zip', header=0, sep=',', quotechar='"')
print(df)

     Pregnancies  Glucose  ...  Age  Outcome
0              6      148  ...   50        1
1              1       85  ...   31        0
2              8      183  ...   32        1
3              1       89  ...   21        0
4              0      137  ...   33        1
..           ...      ...  ...  ...      ...
763           10      101  ...   63        0
764            2      122  ...   27        0
765            5      121  ...   30        0
766            1      126  ...   47        1
767            1       93  ...   23        0

[768 rows x 9 columns]


In [0]:
# Split the DataFrame into the feature matrix X and the target vector y
X = df.drop('Outcome', axis = 1)
y = df['Outcome']

In [0]:
# Import necessary modules
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

In [30]:
# Set up the hyperparameter grid: 15 log-spaced values of C (the inverse of
# regularization strength) from 1e-5 to 1e8
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
print(c_space)

[1.00000000e-05 8.48342898e-05 7.19685673e-04 6.10540230e-03
 5.17947468e-02 4.39397056e-01 3.72759372e+00 3.16227766e+01
 2.68269580e+02 2.27584593e+03 1.93069773e+04 1.63789371e+05
 1.38949549e+06 1.17876863e+07 1.00000000e+08]
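To make the spacing of these candidates concrete, the short check below (a sketch; c_space_demo, exponents, and ratios are illustrative names) confirms that np.logspace(-5, 8, 15) is the same as raising 10 to 15 evenly spaced exponents between -5 and 8, so each value is a constant factor larger than the previous one.

```python
import numpy as np

c_space_demo = np.logspace(-5, 8, 15)

# logspace(a, b, n) is 10 raised to n evenly spaced exponents between a and b
exponents = np.linspace(-5, 8, 15)
print(np.allclose(c_space_demo, 10.0 ** exponents))

# Neighbouring values differ by a constant factor of 10 ** (13 / 14)
ratios = c_space_demo[1:] / c_space_demo[:-1]
print(ratios[0], 10 ** (13 / 14))
```

Spacing the candidates on a log scale covers many orders of magnitude with few points, which suits a hyperparameter like C whose useful range is not known in advance.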


In [0]:
# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(solver = 'liblinear')

In [0]:
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

In [36]:
# Fit it to the data
logreg_cv.fit(X,y)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': array([1.00000000e-05, 8.48342898e-05, 7.19685673e-04, 6.10540230e-03,
       5.17947468e-02, 4.39397056e-01, 3.72759372e+00, 3.16227766e+01,
       2.68269580e+02, 2.27584593e+03, 1.93069773e+04, 1.63789371e+05,
       1.38949549e+06, 1.17876863e+07, 1.00000000e+08])},
             pre_dispatch='2*n_jobs',

In [39]:
# Print the tuned parameters and the best mean cross-validated accuracy
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))


Tuned Logistic Regression Parameters: {'C': 2275.845926074791}
Best score is 0.7708333333333334
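The best score reported by GridSearchCV is the mean cross-validated score of the winning candidate, and the per-candidate scores are available in the cv_results_ attribute. The sketch below shows how they might be inspected. It runs on synthetic data from make_classification with a smaller grid, so the numbers it prints are unrelated to the diabetes results above; X_demo, y_demo, and demo_cv are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_cv = GridSearchCV(LogisticRegression(solver='liblinear'),
                       {'C': np.logspace(-3, 3, 7)}, cv=5)
demo_cv.fit(X_demo, y_demo)

# One row per candidate: mean and standard deviation of the fold scores
results = demo_cv.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print("C = {:<8.3g} mean accuracy = {:.3f} (+/- {:.3f})".format(params['C'], mean, std))

# best_score_ is the mean test score of the best-ranked candidate
print(demo_cv.best_params_, demo_cv.best_score_)
```

Looking at the full table rather than only the winner helps judge whether neighbouring values of C perform about equally well, in which case the exact best value matters little.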
