# Automated Hyperparameter Tuning and EDA

This notebook focuses on automated hyperparameter tuning techniques. We skip Grid Search and Randomized Search, as they are already covered in many other notebooks.
 
### Automated hyperparameter tuning helps because we don't have to rely on time- and resource-intensive grid search to get good results.

The three hyperparameter optimization libraries we use are:

1. Scikit-optimize

2. Hyperopt

3. Optuna

Documentation for the libraries:

https://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html

http://hyperopt.github.io/hyperopt/

https://optuna.readthedocs.io/en/stable/


### We do some basic EDA before starting the optimization. We skip feature engineering, since the focus here is on hyperparameter tuning.

Let's import all the necessary libraries

In [None]:
!pip install -U scikit-learn==0.23
!pip install scikit-optimize==0.8.1

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from scipy.stats import norm
from skopt import gp_minimize,space
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold,cross_val_score
from sklearn.metrics import accuracy_score
from skopt.utils import use_named_args
from hyperopt import hp,Trials,tpe,fmin
from hyperopt.pyll.base import scope
from hyperopt.plotting import main_plot_history
import optuna
import warnings
warnings.filterwarnings("ignore")

Importing the input data into a dataframe

In [None]:
dataset = pd.read_csv("../input/heart-disease-uci/heart.csv")

Checking the dimensions of the dataset imported

In [None]:
dataset.shape

Let's have a look at a few rows to get a sense of the data

In [None]:
dataset.head()

We check for missing values. There are none in this dataset.

In [None]:
dataset.isnull().sum()

Let's check the percentage of males and females

In [None]:
f , ax = plt.subplots()
plt.pie(dataset["sex"].value_counts(),explode=[0,.1],labels=["Male","Female"],startangle=90,shadow=True,autopct = '%1.1f%%')

In this dataset, a higher proportion of females than males have heart disease.

In [None]:
f,ax = plt.subplots(figsize=(10,7))
sns.countplot(x="sex",hue="target",data=dataset,ax=ax)  # keyword arguments; positional data arguments are not accepted by newer seaborn
bars = ax.patches
half = int(len(bars)/2)
ax.set_xticklabels(["female","male"])
ax.legend(["absence","presence"])
for first,second in (zip(bars[:half],bars[half:])):
    height1= first.get_height()
    height2= second.get_height()
    total = height1 + height2
    ax.text(first.get_x()+first.get_width()/2,height1+2,'{0:.0%}'.format(height1/total),ha="center")
    ax.text(second.get_x()+second.get_width()/2,height2+2,'{0:.0%}'.format(height2/total),ha="center")
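
The percentages annotated on the bars can also be computed directly as a table. The sketch below uses a small hypothetical DataFrame (`toy`) with the same column names as the notebook, so the numbers are illustrative and not taken from the heart disease data.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names
# (sex: 0 = female, 1 = male; target: 0 = absence, 1 = presence).
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# normalize="index" converts each row (one per sex) into proportions summing to 1,
# which matches the per-sex percentages drawn above the bars.
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```

Running the same `pd.crosstab` call on `dataset` should give the figures shown in the plot.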

The 35 to 45 age band has the highest percentage of affected cases.

In [None]:
dataset.loc[:,"age_band"] = pd.cut(dataset.age,bins=[25,35,45,60,80])
f,ax = plt.subplots(figsize=(10,8))
sns.countplot(x="age_band",hue="target",data=dataset,ax=ax)  # keyword arguments; positional data arguments are not accepted by newer seaborn
bars = ax.patches
half = int(len(ax.patches)/2)
ax.legend(["absence","presence"])

for first,second in zip(bars[:half],bars[half:]):
    height1 =  first.get_height()
    height2 = second.get_height()
    total_height= height1+height2
    ax.text(first.get_x()+first.get_width()/2, height1+1,'{0:.0%}'.format(height1/total_height), ha ='center')
    ax.text(second.get_x()+second.get_width()/2, height2+1,'{0:.0%}'.format(height2/total_height), ha ='center')
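
As a side note on how `pd.cut` builds the age bands: by default the intervals are closed on the right, so a value equal to the lowest bin edge falls outside every band. The ages below are hypothetical and only illustrate that behaviour.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([25, 29, 35, 36, 45, 60, 77])

# Same bin edges as the notebook; intervals are (25, 35], (35, 45], (45, 60], (60, 80].
bands = pd.cut(ages, bins=[25, 35, 45, 60, 80])
print(bands.tolist())
```

An age of exactly 25 becomes `NaN` here, while 35 lands in the first band. Passing `include_lowest=True` would make the first interval include its left edge.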

Except for the 60 to 80 age band, all bands are heavily skewed towards males.

In [None]:
f,ax= plt.subplots()
sns.countplot(x="age_band",hue="sex",data=dataset,ax=ax)  # keyword arguments; positional data arguments are not accepted by newer seaborn
ax.legend(["female","male"])


Cholesterol levels tend to be higher in cases where there is no disease, contrary to common expectation.

In [None]:
f,ax = plt.subplots(figsize=(10,7))
sns.boxplot(x="target",y="chol",data=dataset,ax=ax)  # keyword arguments; positional data arguments are not accepted by newer seaborn
ax.set_xticklabels(["absence","presence"])

Let's move on to modelling. First we split the dataset into train and test sets.

In [None]:
y= dataset["target"]
dataset.drop(["target","age_band"],axis=1,inplace=True)
X_train,X_test,y_train,y_test = train_test_split(dataset,y,test_size=0.3,random_state=42)
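
The split above is not stratified. One optional variation, sketched here on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's variables), is to pass `stratify` so that the train and test sets keep roughly the same class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())
```

Whether this matters depends on how balanced the target is; with a small dataset it can make test scores a little less sensitive to the particular split.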

## Scikit-optimize Hyperparameter Tuning
We use a RandomForest classifier for tuning.

Below we create the parameter space and the objective function to be minimized.

In [None]:
param_space_skopt =[
    space.Integer(3,10,name="max_depth"),
    space.Integer(50,1000,name="n_estimators"),
    space.Categorical(["gini","entropy"],name="criterion"),
    space.Real(0.1,1,name="max_features"),
    space.Integer(2,10,name="min_samples_leaf")
]

model = RandomForestClassifier()

@use_named_args(param_space_skopt)
def objective_skopt(**params_skopt):
    model.set_params(**params_skopt)
    skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=42)  # random_state only takes effect when shuffle=True
    scores = -np.mean(cross_val_score(model,X_train,y_train,cv=skf,scoring="accuracy"))
    return scores
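
The objective is negated because the optimizer minimizes: the lowest negative accuracy corresponds to the highest accuracy. A minimal standalone sketch of that idea follows, using synthetic data from `make_classification` and a hypothetical helper `neg_cv_accuracy` in place of the notebook's data and objective.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

def neg_cv_accuracy(max_depth, n_estimators):
    """Return the negated mean 5-fold accuracy, suitable for a minimizer."""
    model = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_estimators, random_state=0
    )
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    acc = cross_val_score(model, X_demo, y_demo, cv=skf, scoring="accuracy")
    return -np.mean(acc)

loss = neg_cv_accuracy(max_depth=3, n_estimators=20)
print("loss:", loss, "accuracy:", -loss)
```

Flipping the sign of the returned value recovers the cross-validated accuracy.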

We call the gp_minimize function.

In [None]:
result = gp_minimize(objective_skopt,dimensions= param_space_skopt, n_calls=25, n_initial_points=10,verbose=True,random_state=42)  # n_initial_points replaces the deprecated n_random_starts

Check the best cross-validation score found. The sign is flipped back to get the accuracy.

In [None]:
-result.fun

We plot the best objective value found so far against the number of calls to the objective function.

In [None]:
from skopt.plots import plot_convergence
plot_convergence(result)
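
A convergence plot of this kind shows the running minimum of the objective after each call. The sketch below reproduces that calculation on a hypothetical list of loss values; in the notebook the per-call values would presumably come from `result.func_vals`.

```python
from itertools import accumulate

# Hypothetical negated accuracies, one per call to the objective.
losses = [-0.78, -0.81, -0.80, -0.84, -0.83]

# Running minimum: the best (lowest) loss seen up to and including each call.
running_best = list(accumulate(losses, min))
print(running_best)
```

The curve can only stay flat or go down, which is why later calls that score worse do not show up as bumps.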

Here we are testing the best parameters on our test set.

In [None]:
model_skopt =RandomForestClassifier(n_estimators= result.x[1],criterion=result.x[2],max_depth=result.x[0],min_samples_leaf=result.x[4],max_features=result.x[3],random_state=42)
model_skopt.fit(X_train,y_train)
y_pred_skopt = model_skopt.predict(X_test)
skopt_score = accuracy_score(y_test,y_pred_skopt)
skopt_score
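
Indexing `result.x` by position works but is easy to get wrong if the search space is reordered. A possible alternative, sketched with hypothetical values standing in for `result.x`, is to pair each value with its dimension name first.

```python
# Names in the same order as the search space definition.
names = ["max_depth", "n_estimators", "criterion", "max_features", "min_samples_leaf"]

# Hypothetical best point standing in for result.x.
best_x = [7, 310, "gini", 0.42, 3]

# Pair names with values so parameters can be looked up by name.
best_params = dict(zip(names, best_x))
print(best_params)
```

The resulting dictionary could then be unpacked into the classifier with `RandomForestClassifier(**best_params, random_state=42)`.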

## Hyperopt Hyperparameter Tuning

Below we define the parameter space and the objective function.

In [None]:
param_space_hopt = {
    "max_depth":scope.int(hp.quniform("max_depth",3,10,1)),
              "n_estimators":scope.int(hp.quniform("n_estimators",50,1000,1)),
               "criterion":hp.choice("criterion",["gini","entropy"]),
               "max_features":hp.uniform("max_features",0.1,1),
               "min_samples_leaf":scope.int(hp.quniform("min_samples_leaf",2,10,1))
              }

def objective_hopt(params_hopt):
    model_hopt = RandomForestClassifier(**params_hopt)
    skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=42)  # random_state only takes effect when shuffle=True
    scores = -np.mean(cross_val_score(model_hopt,X_train,y_train,cv=skf,scoring="accuracy"))
    return scores

trial_hopt = Trials()
hyopt = fmin(fn=objective_hopt,space = param_space_hopt, algo=tpe.suggest,max_evals=25,trials=trial_hopt) 

Let's check the best parameters

In [None]:
hyopt
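
When reading this output, note that `fmin` reports `hp.choice` parameters as the index of the chosen option and `hp.quniform` parameters as floats. The sketch below decodes a hypothetical result dictionary (`best`) by hand; Hyperopt's `space_eval` helper is the library route to the same result.

```python
# Options in the same order as passed to hp.choice.
criterion_choices = ["gini", "entropy"]

# Hypothetical dictionary in the shape returned by fmin.
best = {
    "criterion": 1,
    "max_depth": 6.0,
    "max_features": 0.37,
    "min_samples_leaf": 4.0,
    "n_estimators": 420.0,
}

decoded = dict(best)
# Map the choice index back to the option it refers to.
decoded["criterion"] = criterion_choices[best["criterion"]]
# quniform values arrive as floats, so cast the integer-valued ones.
for key in ("max_depth", "min_samples_leaf", "n_estimators"):
    decoded[key] = int(decoded[key])

print(decoded)
```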

We plot the scores against the calls to the objective function.

In [None]:
main_plot_history(trial_hopt)

In [None]:
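# fmin returns the index of an hp.choice option, so map it back to the criterion name
criterion_hopt = ["gini","entropy"][hyopt["criterion"]]
model_hopt =RandomForestClassifier(n_estimators= int(hyopt["n_estimators"]),criterion=criterion_hopt,max_depth=int(hyopt["max_depth"]),min_samples_leaf=int(hyopt["min_samples_leaf"]),max_features=hyopt["max_features"],random_state=42)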
model_hopt.fit(X_train,y_train)
y_pred_hyopt = model_hopt.predict(X_test)
hyopt_score = accuracy_score(y_test,y_pred_hyopt)
hyopt_score

## Optuna Hyperparameter Tuning

We define the objective function below.

In [None]:
def optimization_optuna(trial_optuna):
    
    n_estimators = trial_optuna.suggest_int("n_estimators",50,1000)
    max_depth = trial_optuna.suggest_int("max_depth",3,10)
    criterion = trial_optuna.suggest_categorical("criterion",["entropy","gini"])
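    min_samples_leaf = trial_optuna.suggest_int("min_samples_leaf",2,10)
    # suggest_float replaces suggest_uniform, which is deprecated in recent Optuna releases
    max_features = trial_optuna.suggest_float("max_features",0.1,1)

    # Pass the sampled value as min_samples_leaf so it matches the parameter used for the final model
    model_optuna = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,criterion=criterion,
                                         min_samples_leaf=min_samples_leaf,max_features=max_features)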
    skf = StratifiedKFold(n_splits=5)
    score = cross_val_score(model_optuna,X_train,y_train,cv=skf,scoring="accuracy")
    return np.mean(score)

In Optuna we can specify the direction in which the objective function is optimized. Earlier we negated the score because those libraries minimize the objective function.

Here we choose maximize, since the objective returns the accuracy score directly, without negation.
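
Maximizing a score and minimizing its negation pick the same point, which is why both conventions lead to the same tuned parameters. A small library-free sketch of that equivalence follows, using a hypothetical one-dimensional score function `accuracy_like` and plain random search.

```python
import random

def accuracy_like(x):
    """Hypothetical score with a single peak at x = 0.3."""
    return 1.0 - (x - 0.3) ** 2

random.seed(0)
candidates = [random.random() for _ in range(200)]

# "maximize" direction: pick the candidate with the highest score.
best_max = max(candidates, key=accuracy_like)
# "minimize" direction: pick the candidate with the lowest negated score.
best_min = min(candidates, key=lambda x: -accuracy_like(x))

print(best_max, best_min)
```

Both searches land on the same candidate; only the sign of the reported objective value differs.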

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(optimization_optuna,n_trials=25)  # optimize returns None; the results are stored on the study object

Let's check the best parameters.

In [None]:
study.best_params

We evaluate the best parameters on the test data.

In [None]:
model_optuna =RandomForestClassifier(n_estimators= study.best_params["n_estimators"],criterion=study.best_params["criterion"],max_depth=study.best_params["max_depth"],min_samples_leaf=study.best_params["min_samples_leaf"],max_features=study.best_params["max_features"],random_state=42)
model_optuna.fit(X_train,y_train)
y_pred_optuna = model_optuna.predict(X_test)
optuna_score = accuracy_score(y_test,y_pred_optuna)
optuna_score

We visualize how the score evolves over successive calls to the objective function.

In [None]:
optuna.visualization.plot_optimization_history(study)