### All Techniques of Hyperparameter Optimization
   * GridSearchCV
   * RandomizedSearchCV
   * Bayesian Optimization: Automated Hyperparameter Tuning (Hyperopt)
   * Sequential Model-Based Optimization (tuning a scikit-learn estimator with skopt)
   * Optuna: Automated Hyperparameter Tuning
   * Genetic Algorithms (TPOT Classifier)


In [1]:
# Importing library
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [2]:
df = pd.read_csv('diabetes.csv')
df.head()

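| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |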


**Feature scaling is not needed here. A random forest is an ensemble of decision trees, and each tree splits on feature thresholds, so rescaling a feature does not change the splits.**
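
To make the point about scaling concrete, here is a minimal sketch in plain Python, not scikit-learn's implementation. The helper names `gini` and `best_split` and the toy readings are assumptions made for illustration. It finds the best single threshold by Gini impurity on a raw feature and on a min-max scaled copy, and shows that both produce the same partition of the rows.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, left_mask) minimising the weighted Gini impurity."""
    best = None
    for t in sorted(set(values))[:-1]:  # every distinct value except the largest
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t)
    threshold = best[1]
    return threshold, [x <= threshold for x in values]


# Toy glucose readings and outcomes (made-up numbers, not taken from diabetes.csv)
glucose = [85, 89, 99, 137, 148, 183]
outcome = [0, 0, 0, 1, 1, 1]

# Min-max scale the same feature to [0, 1]
lo, hi = min(glucose), max(glucose)
glucose_scaled = [(g - lo) / (hi - lo) for g in glucose]

threshold_raw, partition_raw = best_split(glucose, outcome)
threshold_scaled, partition_scaled = best_split(glucose_scaled, outcome)

print(threshold_raw, partition_raw)
print(threshold_scaled, partition_scaled)
```

The threshold value changes with the scale, but the rows sent left and right are identical, which is all a tree uses.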

In [3]:
df.isnull().sum().sum()

0

In [4]:
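# unique() shows 0 in these columns. A 0 here is physiologically impossible and marks a
# missing measurement (it does not mean the patient died), so replace it with the column median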
df.Glucose = np.where(df['Glucose']==0, df.Glucose.median(), df.Glucose)
df.SkinThickness = np.where(df.SkinThickness == 0, df.SkinThickness.median(), df.SkinThickness)
df.Insulin = df.Insulin.replace(0, df.Insulin.median())
df.BMI = np.where(df.BMI == 0, df.BMI.median(), df.BMI)
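
Note that the cell above takes the median over the whole column, zeros included, so the placeholder zeros pull the fill value down. One possible alternative, sketched here in plain Python under the assumption that the median of the observed (non-zero) readings is the better fill value, is to exclude the zeros first. The helper `fill_zeros_with_median` and the readings are hypothetical.

```python
from statistics import median


def fill_zeros_with_median(values):
    """Replace 0 placeholders with the median of the non-zero entries."""
    observed = [v for v in values if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


# Made-up insulin readings; 0 marks a missing measurement
insulin = [0, 0, 0, 94, 168, 88, 0]

print(median(insulin))                  # median with the zeros included
print(fill_zeros_with_median(insulin))  # zeros replaced by the non-zero median
```

With more than half of the entries missing, the plain median is itself 0, so replacing zeros with it would change nothing.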

In [5]:
df.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64
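
The class counts above give a useful reference point for the accuracy scores that follow. As a small worked example (plain arithmetic, using only the two counts printed above), a model that always predicts the majority class would already reach this accuracy on the full dataset:

```python
# Class counts taken from the value_counts() output above
counts = {0: 500, 1: 268}

total = sum(counts.values())
baseline_accuracy = max(counts.values()) / total

print(f"{baseline_accuracy:.3f}")  # 0.651
```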

In [6]:
# Separate the independent features (X) from the dependent target (y)
X = df.drop(['Outcome'], axis=1)
y = df['Outcome']

In [7]:
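# Split the data into train (80%) and test (20%) sets.
# No random_state is passed, so the split (and every score below) changes from run to run.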
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [8]:
X_test.shape

(154, 8)

#### Without Hyperparameter Tuning:

In [9]:
# Fit a random forest with default hyperparameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)

In [10]:
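# Note: sklearn metrics take (y_true, y_pred). With (pred, y_test) as here, accuracy is
# unchanged, but the confusion matrix is transposed (rows are predictions), the precision
# and recall columns of the report are swapped, and `support` counts predictions.
print(accuracy_score(pred, y_test))
print(confusion_matrix(pred, y_test))
print(classification_report(pred, y_test))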

0.7727272727272727
[[90 23]
 [12 29]]
              precision    recall  f1-score   support

           0       0.88      0.80      0.84       113
           1       0.56      0.71      0.62        41

    accuracy                           0.77       154
   macro avg       0.72      0.75      0.73       154
weighted avg       0.80      0.77      0.78       154
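
The metric calls above pass `pred` first and `y_test` second, while scikit-learn's convention is `(y_true, y_pred)`. The following sketch, in plain Python rather than scikit-learn, rebuilds label lists from the four counts printed above and shows what the argument order changes. The helpers `confusion` and `precision_recall` are illustrative stand-ins, not library functions.

```python
def confusion(first, second):
    """2x2 count matrix: rows index the first argument, columns the second."""
    m = [[0, 0], [0, 0]]
    for a, b in zip(first, second):
        m[a][b] += 1
    return m


def precision_recall(y_true, y_pred, label=1):
    """Precision and recall for one class, given true labels first."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    return tp / predicted, tp / actual


# (true, predicted) pairs rebuilt from the matrix printed above:
# 90 true-0/pred-0, 12 true-0/pred-1, 23 true-1/pred-0, 29 true-1/pred-1
pairs = [(0, 0)] * 90 + [(0, 1)] * 12 + [(1, 0)] * 23 + [(1, 1)] * 29
y_true = [t for t, _ in pairs]
y_pred = [p for _, p in pairs]

print(confusion(y_true, y_pred))         # rows = actual class
print(confusion(y_pred, y_true))         # rows = predicted class, as printed above
print(precision_recall(y_true, y_pred))  # precision, recall for class 1
print(precision_recall(y_pred, y_true))  # the same two numbers, swapped
```

Accuracy is the same either way, but for class 1 the precision is about 0.71 and the recall about 0.56 when the true labels come first, the reverse of what the report above shows.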



#### Manual Hyperparameter Tuning:

In [11]:
# Fit the model with hand-picked hyperparameters
rfc = RandomForestClassifier(n_estimators=150, criterion='gini')  # change the values and re-check the accuracy; repeat as needed
rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)

In [12]:
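# (pred, y_test) order: matrix rows are predictions, and precision/recall are swapped
# relative to sklearn's (y_true, y_pred) convention
print(accuracy_score(pred, y_test))
print(confusion_matrix(pred, y_test))
print(classification_report(pred, y_test))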

0.7272727272727273
[[86 26]
 [16 26]]
              precision    recall  f1-score   support

           0       0.84      0.77      0.80       112
           1       0.50      0.62      0.55        42

    accuracy                           0.73       154
   macro avg       0.67      0.69      0.68       154
weighted avg       0.75      0.73      0.74       154



### Hyperparameter Tuning

#### 1. RandomizedSearchCV

In [13]:
rfc = RandomForestClassifier()

In [14]:
params = {'n_estimators':[100, 150, 200, 250, 300, 350, 400, 450, 500],
          'criterion': ['gini','entropy'],
          'min_samples_split': [2,3,4,5,6,7,8,9,10],
          'min_samples_leaf': [1,3,5,7,9,11,13,15],
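          'max_features': ['auto','sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3; drop it on newer versions
          'max_depth': [10,20,30,40,50,60,70,80,90,100]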
         }

In [15]:
# n_iter sets how many random parameter combinations are tried (default=10); increase it for a wider search
from sklearn.model_selection import RandomizedSearchCV
random_cv = RandomizedSearchCV(estimator=rfc, param_distributions=params, n_iter=100, cv=3, verbose=2, random_state=42) 
random_cv.fit(X_train, y_train)
pred = random_cv.predict(X_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END criterion=gini, max_depth=100, max_features=auto, min_samples_leaf=7, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=100, max_features=auto, min_samples_leaf=7, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=100, max_features=auto, min_samples_leaf=7, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=10, max_features=sqrt, min_samples_leaf=5, min_samples_split=7, n_estimators=350; total time=   0.5s
[CV] END criterion=gini, max_depth=10, max_features=sqrt, min_samples_leaf=5, min_samples_split=7, n_estimators=350; total time=   0.7s
[CV] END criterion=gini, max_depth=10, max_features=sqrt, min_samples_leaf=5, min_samples_split=7, n_estimators=350; total time=   1.1s
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=7, min_samples_split=4, n_estimators=450; t
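
The log above reports 100 candidates and 300 fits because a randomized search evaluates only a sample of the possible combinations. The sketch below, using the standard library only and not scikit-learn's sampler, counts the full grid for these candidate lists and draws 100 distinct combinations. The `decode` helper is hypothetical, `max_depth` is written with 80 and 90 as separate entries, and drawing without replacement is assumed here because every parameter is given as a list.

```python
import math
import random

# Candidate values, as in the params dict above
params = {
    'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_leaf': [1, 3, 5, 7, 9, 11, 13, 15],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
}


def decode(index, space):
    """Turn a flat index into one parameter combination (mixed-radix decoding)."""
    combo = {}
    for name, values in space.items():
        index, pos = divmod(index, len(values))
        combo[name] = values[pos]
    return combo


# Size of the exhaustive grid
total = math.prod(len(values) for values in params.values())

# Draw 100 distinct combinations, mirroring n_iter=100
rng = random.Random(42)
candidates = [decode(i, params) for i in rng.sample(range(total), 100)]

cv = 3
print(total)                 # combinations in the full grid
print(len(candidates) * cv)  # model fits actually performed
```

The full grid has 38,880 combinations, so an exhaustive search with 3 folds would need 116,640 fits instead of 300.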

In [16]:
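# (pred, y_test) order: matrix rows are predictions, and precision/recall are swapped
# relative to sklearn's (y_true, y_pred) convention
print(f'Accuracy score : {accuracy_score(pred, y_test)}')
print(f'Confusion matrix :\n {confusion_matrix(pred, y_test)}')
print(f'Classification report :\n {classification_report(pred, y_test)}')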

Accuracy score : 0.7792207792207793
Confusion matrix :
 [[90 22]
 [12 30]]
Classification report :
               precision    recall  f1-score   support

           0       0.88      0.80      0.84       112
           1       0.58      0.71      0.64        42

    accuracy                           0.78       154
   macro avg       0.73      0.76      0.74       154
weighted avg       0.80      0.78      0.79       154



In [17]:
# To check best parameter
random_cv.best_params_

{'n_estimators': 100,
 'min_samples_split': 8,
 'min_samples_leaf': 9,
 'max_features': 'log2',
 'max_depth': 60,
 'criterion': 'gini'}

In [18]:
# Best estimator, refit on the training data with the best parameters
random_cv.best_estimator_

RandomForestClassifier(max_depth=60, max_features='log2', min_samples_leaf=9,
                       min_samples_split=8)

#### 2. GridSearchCV

In [19]:
from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier()
# No need to fit rfc here: GridSearchCV clones the estimator and fits it for every candidate

In [20]:
# Narrow grid around the best values found by RandomizedSearchCV
param_grid={'n_estimators': [100],
 'min_samples_split': [9,10,11,12],
 'min_samples_leaf': [8,9,10,11],
 'max_depth': [60],
 'max_features': ['log2'],
 'criterion': ['gini']}

In [21]:
grid_cv = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3, verbose=2)
grid_cv.fit(X_train, y_train)
pred = grid_cv.predict(X_test)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=8, min_samples_split=9, n_estimators=100; total time=   0.2s
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=8, min_samples_split=9, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=8, min_samples_split=9, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=8, min_samples_split=10, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=8, min_samples_split=10, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=8, min_samples_split=10, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_depth=60, max_features=log2, min_samples_leaf=8, min_samples_split=11, n_estimators=100; to
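
The "16 candidates, totalling 48 fits" line follows directly from the grid: a grid search tries every combination of the listed values, once per fold. A minimal sketch with `itertools.product` (not scikit-learn's own enumeration, whose ordering may differ) reproduces the count:

```python
from itertools import product

# Same grid as param_grid above
param_grid = {
    'n_estimators': [100],
    'min_samples_split': [9, 10, 11, 12],
    'min_samples_leaf': [8, 9, 10, 11],
    'max_depth': [60],
    'max_features': ['log2'],
    'criterion': ['gini'],
}

names = list(param_grid)
# Cartesian product of all value lists: one dict per candidate
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

cv = 3
print(len(candidates))       # number of candidates
print(len(candidates) * cv)  # number of fits
```

Only `min_samples_split` and `min_samples_leaf` have more than one value, so the grid has 4 × 4 = 16 candidates.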

In [22]:
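# (pred, y_test) order: matrix rows are predictions, and precision/recall are swapped
# relative to sklearn's (y_true, y_pred) convention
print(f'Accuracy score : {accuracy_score(pred, y_test)}')
print(f'Confusion matrix :\n {confusion_matrix(pred, y_test)}')
print(f'Classification report :\n {classification_report(pred, y_test)}')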

Accuracy score : 0.7662337662337663
Confusion matrix :
 [[89 23]
 [13 29]]
Classification report :
               precision    recall  f1-score   support

           0       0.87      0.79      0.83       112
           1       0.56      0.69      0.62        42

    accuracy                           0.77       154
   macro avg       0.72      0.74      0.72       154
weighted avg       0.79      0.77      0.77       154



In [23]:
grid_cv.best_params_

{'criterion': 'gini',
 'max_depth': 60,
 'max_features': 'log2',
 'min_samples_leaf': 10,
 'min_samples_split': 11,
 'n_estimators': 100}

In [24]:
grid_cv.best_estimator_

RandomForestClassifier(max_depth=60, max_features='log2', min_samples_leaf=10,
                       min_samples_split=11)

#### 3. Bayesian Optimization
Install the hyperopt package with `pip install hyperopt`.

Reference link:
   https://towardsdatascience.com/hyperopt-hyperparameter-tuning-based-on-bayesian-optimization-7fa32dffaf29

In [25]:
from hyperopt import hp,fmin,tpe,STATUS_OK,Trials

# fmin: function that runs the optimization (it minimizes the objective)
# tpe: the optimizer to be used (Tree-structured Parzen Estimator)
# hp: for defining the search space

In [26]:
para = {'criterion': hp.choice('criterion', ['entropy', 'gini']),  # hp.choice picks one option from a list
        'max_depth': hp.quniform('max_depth', 10, 1200, 10),  # quniform samples multiples of q, returned as floats (e.g. 1020.0)
        'max_features': hp.choice('max_features', ['auto', 'sqrt','log2', None]),  # 'auto' was removed in scikit-learn 1.3
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),  # hp.uniform samples a float; sklearn reads a float here as a fraction of the samples
        'min_samples_split' : hp.uniform ('min_samples_split', 0, 1),
        'n_estimators' : hp.choice('n_estimators', [100,150,200,300,500,750,850,1000])
    }
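
Hyperopt documents `hp.quniform(label, low, high, q)` as returning a value like `round(uniform(low, high) / q) * q`. The sketch below mimics that formula with the standard library so the behaviour can be inspected without running an optimization. The `quniform` function here is a stand-in written for illustration, not hyperopt's implementation, and it returns a float on purpose to match the `1020.0` style of value that hyperopt reports.

```python
import random


def quniform(rng, low, high, q):
    """Stand-in for hp.quniform: a uniform draw rounded to a multiple of q, as a float."""
    return float(round(rng.uniform(low, high) / q) * q)


rng = random.Random(0)
draws = [quniform(rng, 10, 1200, 10) for _ in range(5)]
print(draws)

# RandomForestClassifier expects an integer max_depth, so cast before use
depths = [int(d) for d in draws]
print(depths)
```

Every draw is a multiple of 10 between 10 and 1200, but it arrives as a float, which is why a cast to `int` is needed before passing it as `max_depth`.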

In [27]:
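from sklearn.model_selection import cross_val_score

# Objective function: build a model from one sampled parameter set and score it
def parameter(para):
    rfc = RandomForestClassifier(criterion = para['criterion'], 
                                   max_depth = int(para['max_depth']),  # quniform returns a float; max_depth must be an int
                                   max_features = para['max_features'],
                                   min_samples_leaf = para['min_samples_leaf'],
                                   min_samples_split = para['min_samples_split'],
                                   n_estimators = para['n_estimators'], 
                                   )
    
    # Mean accuracy over 5-fold cross-validation on the training set
    accuracy = cross_val_score(rfc, X_train, y_train, cv = 5).mean()

    # fmin minimizes the loss, so return the accuracy as a negative value to maximize it
    return {'loss': -accuracy, 'status': STATUS_OK }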
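
The sign flip in the objective is what lets a minimizer maximize accuracy. A tiny illustration with hypothetical scores (the three accuracies below are made up, not results from this notebook):

```python
# Hypothetical mean cross-validation accuracies for three parameter sets
accuracies = {"set_a": 0.74, "set_b": 0.77, "set_c": 0.75}

# A minimizer only sees the loss, which is the negated accuracy
losses = {name: -acc for name, acc in accuracies.items()}

best_by_loss = min(losses, key=losses.get)
best_by_accuracy = max(accuracies, key=accuracies.get)

print(best_by_loss, losses[best_by_loss])
print(best_by_loss == best_by_accuracy)
```

The same reading applies to the progress bar of `fmin`: a reported best loss of about -0.770 corresponds to a cross-validated accuracy of about 0.770.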

In [28]:
from sklearn.model_selection import cross_val_score
# fn: function to be optimized
# space: search space
# algo: optimizer algorithm
# max_evals: number of iterations
trials = Trials()
best_para = fmin(fn= parameter, 
            space= para,
            algo= tpe.suggest,
            max_evals = 100,
            trials= trials)
best_para

100%|█████████████████████████████████████████████| 100/100 [10:23<00:00,  6.23s/trial, best loss: -0.7703718512594963]


{'criterion': 0,
 'max_depth': 1020.0,
 'max_features': 2,
 'min_samples_leaf': 0.005840497515923141,
 'min_samples_split': 0.12861683969032578,
 'n_estimators': 1}

**For every `hp.choice` parameter, `fmin` returns the index of the selected option, not the option itself. Look each index up in the search space defined above:**
##### para:
       {'criterion': hp.choice('criterion', ['entropy', 'gini']),
        'max_depth': hp.quniform('max_depth', 10, 1200, 10),
        'max_features': hp.choice('max_features', ['auto', 'sqrt','log2', None]),
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
        'min_samples_split' : hp.uniform ('min_samples_split', 0, 1),
        'n_estimators' : hp.choice('n_estimators', [100,150,200,300,500,750,850,1000])}
So `criterion: 0` is 'entropy', `max_features: 2` is 'log2', and `n_estimators: 1` is 150. The remaining values are used as returned, with `max_depth` cast to an integer.
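
The index lookup can be done programmatically instead of by eye. The sketch below uses plain Python with the option lists copied from the search-space cell and the `best_para` values copied from the output above; `decode_best` is a hypothetical helper name. Hyperopt also provides a `space_eval` function for this purpose, which is not used here so that the example runs without the library.

```python
# Option lists copied from the hp.choice entries of the search space
choices = {
    "criterion": ["entropy", "gini"],
    "max_features": ["auto", "sqrt", "log2", None],
    "n_estimators": [100, 150, 200, 300, 500, 750, 850, 1000],
}

# Values returned by fmin, copied from the output above
best_para = {
    "criterion": 0,
    "max_depth": 1020.0,
    "max_features": 2,
    "min_samples_leaf": 0.005840497515923141,
    "min_samples_split": 0.12861683969032578,
    "n_estimators": 1,
}


def decode_best(best, choices):
    """Replace hp.choice indices with the options they point to."""
    decoded = dict(best)
    for name, options in choices.items():
        decoded[name] = options[best[name]]
    decoded["max_depth"] = int(decoded["max_depth"])  # quniform returns a float
    return decoded


print(decode_best(best_para, choices))
```

The decoded dictionary can then be passed to the estimator as keyword arguments, for example `RandomForestClassifier(**decoded)`.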

In [36]:
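rfc = RandomForestClassifier(n_estimators=150,       # hp.choice index 1 -> 150
                             criterion='entropy',    # hp.choice index 0 -> 'entropy'
                             max_depth=1020,         # quniform value 1020.0 cast to int
                             max_features='log2',    # hp.choice index 2 -> 'log2'
                             min_samples_leaf=0.005840497515923141, 
                             min_samples_split=0.12861683969032578)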
rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)

In [37]:
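# (pred, y_test) order: matrix rows are predictions, and precision/recall are swapped
# relative to sklearn's (y_true, y_pred) convention
print(f"Accuracy score: {accuracy_score(pred, y_test)}")
print(f"Confusion matrix: {confusion_matrix(pred, y_test)}")
print(f"Classification report: {classification_report(pred, y_test)}")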

Accuracy score: 0.7727272727272727
Confusion matrix: [[91 24]
 [11 28]]
Classification report:               precision    recall  f1-score   support

           0       0.89      0.79      0.84       115
           1       0.54      0.72      0.62        39

    accuracy                           0.77       154
   macro avg       0.72      0.75      0.73       154
weighted avg       0.80      0.77      0.78       154

