## Model Tuning

<b>Hyperparameters</b>
Machine learning model:
<ul>
    <li>parameters: learned from data
    <ul>
        <li>CART example: split-point of a node, split-feature of a node, ...</li>
    </ul>
    </li>
    <li>hyperparameters: not learned from data, set prior to training
    <ul>
        <li>CART example: max_depth, min_samples_leaf, splitting criterion, ...</li>
    </ul>
    </li>
</ul>
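
To make the distinction concrete, here is a minimal sketch on synthetic data (the `make_classification` dataset and the names `X_demo`, `y_demo`, and `tree` are illustrative assumptions, not part of this notebook's data). It sets two hyperparameters before training and then reads the split learned for the root node.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the liver patient dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# Hyperparameters: chosen by us before training
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=0.1, random_state=1)
hyperparams = tree.get_params()

# Parameters: learned from the data during fit
tree.fit(X_demo, y_demo)
root_feature = tree.tree_.feature[0]      # index of the root node's split-feature
root_threshold = tree.tree_.threshold[0]  # the root node's split-point

print('max_depth (set by us):', hyperparams['max_depth'])
print('root split (learned): feature', root_feature, 'at', round(root_threshold, 3))
```

Refitting on different data would change the learned split but leave `max_depth` and `min_samples_leaf` exactly as they were set.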
    
    
<b>What is hyperparameter tuning?</b>
- Problem: search for a set of optimal hyperparameters for a learning algorithm.
- Solution: find a set of optimal hyperparameters that results in an optimal model.
- Optimal model: yields an optimal score.
- Score: in sklearn, defaults to accuracy (classification) and <i>R^2</i> (regression).
- Cross validation is used to estimate the generalization performance.    
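
As a rough illustration of the last two points, the sketch below (synthetic data; `X_demo`, `clf`, and the other names are assumptions made for this example) checks that a classifier's `.score()` is its accuracy and uses `cross_val_score` to estimate generalization performance from the training set alone.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

# For classifiers, .score() returns accuracy
default_score = clf.score(X_te, y_te)
accuracy = accuracy_score(y_te, clf.predict(X_te))

# 5-fold CV on the training set: one score per fold, no test data involved
cv_scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=1),
                            X_tr, y_tr, cv=5)

print('score() == accuracy:', default_score == accuracy)
print('Mean CV accuracy: {:.3f}'.format(cv_scores.mean()))
```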
    
<b>Example Hyperparameter Tuning Approaches</b>
- Grid Search
- Random Search
- Bayesian Optimization
- Genetic Algorithms
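
This notebook works through grid search only. For comparison, random search could look roughly like the sketch below, which samples a fixed number of candidates from distributions instead of trying every combination. The synthetic data, the sampling ranges, and `n_iter=10` are all illustrative assumptions.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# Distributions to sample from (ranges are assumptions for this sketch)
param_distributions = {
    'max_depth': randint(2, 6),              # integers 2, 3, 4, 5
    'min_samples_leaf': uniform(0.05, 0.2),  # floats in [0.05, 0.25]
}

random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=10,           # number of sampled candidates
    scoring='roc_auc',
    cv=5,
    random_state=1)
random_search.fit(X_demo, y_demo)

print('Best sampled hyperparameters:', random_search.best_params_)
```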
    
    
    
    

<b>Grid search cross validation</b>
- Manually set a grid of discrete hyperparameter values.
- Set a metric for scoring model performance.
- Search exhaustively through the grid.
- For each set of hyperparameters, evaluate the corresponding model's CV score.
- The optimal hyperparameters are those of the model achieving the best CV score.
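
The steps above can be sketched by hand with `ParameterGrid` and `cross_val_score`. This is only an illustration of the idea on synthetic data (the dataset and variable names are assumptions); `GridSearchCV` packages the same loop and adds conveniences such as refitting the best model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# 1. Manually set a grid of discrete hyperparameter values
grid = {'max_depth': [2, 3, 4], 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]}

best_score, best_params = -1.0, None
# 3. Search exhaustively through the grid
for params in ParameterGrid(grid):
    model = DecisionTreeClassifier(random_state=1, **params)
    # 2. and 4. Score each candidate with the chosen metric, averaged over 5 folds
    score = cross_val_score(model, X_demo, y_demo, scoring='roc_auc', cv=5).mean()
    # 5. Keep the hyperparameters with the best CV score
    if score > best_score:
        best_score, best_params = score, params

print('Best hyperparameters:', best_params)
print('Best CV ROC AUC: {:.3f}'.format(best_score))
```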

In [10]:
#import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import GridSearchCV


file_path = '/Users/joycemungai/datacamp/machine_learning_python/datasets/'

#### Tree hyperparameters
Tune the hyperparameters of a classification tree. Because the Indian Liver Patient dataset is imbalanced, use the ROC AUC score as the metric instead of accuracy.
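
To see why accuracy can mislead on imbalanced data, consider the small sketch below. The 90/10 class split is a made-up example, not the actual class balance of this dataset. A model that always predicts the majority class gets high accuracy but an ROC AUC of 0.5, no better than chance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 90 positives, 10 negatives
y_true = np.array([1] * 90 + [0] * 10)
X_dummy = np.zeros((100, 1))  # features are ignored by DummyClassifier

# Always predict the most frequent class
majority = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_true)

acc = accuracy_score(y_true, majority.predict(X_dummy))
auc = roc_auc_score(y_true, majority.predict_proba(X_dummy)[:, 1])

print('Accuracy: {:.2f}'.format(acc))  # 0.90
print('ROC AUC:  {:.2f}'.format(auc))  # 0.50
```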

In [5]:
#load dataset
lipd = pd.read_csv(file_path + 'indian_liver_patient/indian_liver_patient_preprocessed.csv')

lipd.head()

Unnamed: 0.1,Unnamed: 0,Age_std,Total_Bilirubin_std,Direct_Bilirubin_std,Alkaline_Phosphotase_std,Alamine_Aminotransferase_std,Aspartate_Aminotransferase_std,Total_Protiens_std,Albumin_std,Albumin_and_Globulin_Ratio_std,Is_male_std,Liver_disease
0,0,1.247403,-0.42032,-0.495414,-0.42887,-0.355832,-0.319111,0.293722,0.203446,-0.14739,0,1
1,1,1.062306,1.218936,1.423518,1.675083,-0.093573,-0.035962,0.939655,0.077462,-0.648461,1,1
2,2,1.062306,0.640375,0.926017,0.816243,-0.115428,-0.146459,0.478274,0.203446,-0.178707,1,1
3,3,0.815511,-0.372106,-0.388807,-0.449416,-0.36676,-0.312205,0.293722,0.329431,0.16578,1,1
4,4,1.679294,0.093956,0.179766,-0.395996,-0.295731,-0.177537,0.755102,-0.930414,-1.713237,1,1


In [7]:
# define X and y
X = lipd.loc[:, ~lipd.columns.isin(['Unnamed: 0','Liver_disease'])] 
# Drop the unnamed column to prevent a convergence warning in LogisticRegression
y = lipd['Liver_disease']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [21]:
# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

# Check the default hyperparameters
dt.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': 1,
 'splitter': 'best'}

#### Set the tree's hyperparameter grid

In [22]:
# Define params_dt
params_dt = {
    'max_depth':[2,3,4],
    'min_samples_leaf':[0.12,0.14,0.16,0.18]
            
}
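
A quick way to gauge the cost of this grid is to count its combinations. The sketch below assumes 5-fold cross-validation when counting fits.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above; float min_samples_leaf values are interpreted
# by scikit-learn as fractions of the training samples
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]
}

n_candidates = len(ParameterGrid(params_dt))  # 3 * 4 combinations
n_folds = 5                                   # assumption: 5-fold CV
n_fits = n_candidates * n_folds               # CV fits, excluding the final refit

print('Candidates:', n_candidates)  # 12
print('CV fits:   ', n_fits)        # 60
```

Each extra value or hyperparameter multiplies the number of fits, which is why grids are usually kept small.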

In [24]:
# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)
# Fit the grid search on the training set
grid_dt.fit(X_train, y_train)


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4],
                         'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
             scoring='roc_auc')
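
Once fitted, a `GridSearchCV` object exposes the winning combination and the per-candidate CV scores. The sketch below shows these attributes on a stand-in search over synthetic data (`demo_grid`, `X_demo`, and `y_demo` are illustrative names). The same attributes should be available on `grid_dt`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

demo_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={'max_depth': [2, 3, 4],
                'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
    scoring='roc_auc',
    cv=5)
demo_grid.fit(X_demo, y_demo)

# Best combination and its mean CV score
print('Best hyperparameters:', demo_grid.best_params_)
print('Best CV ROC AUC: {:.3f}'.format(demo_grid.best_score_))

# One row per candidate, with mean and std of the fold scores
results = pd.DataFrame(demo_grid.cv_results_)
cols = ['param_max_depth', 'param_min_samples_leaf',
        'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head())
```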

In [25]:
# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test,y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.731
