# <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">Introduction</p>

> Developing a model is never enough. It's important that you tune it to get the best out of it.

### What is a hyper-parameter?

Parameters that define the model architecture are referred to as hyperparameters, and the process of searching for the ideal model architecture is referred to as hyperparameter optimization (or tuning).

In simpler terms:

- **Hyper-parameters** are the parameters that are set by the model developer before training starts (e.g. learning rate, regularization parameter).
- **Model parameters** are the ones that are learned during model training (e.g. the weights of a neural network).
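
To make the distinction concrete, here is a minimal sketch on made-up data. The toy dataset and the names `X_toy`, `y_toy` and `clf` are assumptions for illustration only and are not used elsewhere in this notebook. `C` is a hyper-parameter we pick before training, while `coef_` and `intercept_` only exist after `fit()` has learned them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, made up purely for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# Hyper-parameter: C (inverse regularization strength) is chosen by us before training.
clf = LogisticRegression(C=0.5, max_iter=1000)
hyper_params = clf.get_params()
has_weights_before_fit = hasattr(clf, "coef_")

# Model parameters: the weights and the bias are learned by fit().
clf.fit(X_toy, y_toy)

print("hyper-parameter C:", hyper_params["C"])
print("weights exist before fit:", has_weights_before_fit)
print("learned weights:", clf.coef_.ravel())
print("learned bias:", clf.intercept_)
```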

### Hyper-parameter optimization

For simplicity, assume lambda = f(x, y), and suppose the plot of lambda against x and y looks like the one below. We want to find the x and y for which lambda is minimized (or maximized).

<img src="https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/ac3f2f5a-9199-4bb7-8ce6-47e4dc307a0e.png">

If the rule by which (x, y) are related to lambda is known, one can use calculus or deterministic optimization algorithms to optimize lambda. If the function/rule is unknown, stochastic algorithms are used instead. This process is called hyper-parameter optimization. In the next section we'll look into some of the details of the techniques used for hyper-parameter optimization.

[Image Credit](https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/ac3f2f5a-9199-4bb7-8ce6-47e4dc307a0e.png)
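
The two situations can be sketched side by side. The function `f` below is a hypothetical stand-in for lambda (it is not taken from the plot), chosen so that its minimum is known to be at (1, -2). When the rule is known we can follow its gradient. When it is treated as a black box we can only sample points and keep the best one.

```python
import numpy as np

def f(x, y):
    # Hypothetical objective for illustration; its minimum is at (1, -2).
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def grad_f(x, y):
    # Analytical gradient, available only because the rule f is known.
    return np.array([2.0 * (x - 1.0), 2.0 * (y + 2.0)])

# Deterministic optimization: plain gradient descent from (0, 0).
point = np.array([0.0, 0.0])
for _ in range(200):
    point = point - 0.1 * grad_f(*point)

# Stochastic optimization: treat f as a black box, sample at random, keep the best.
rng = np.random.default_rng(0)
samples = rng.uniform(-5.0, 5.0, size=(2000, 2))
values = f(samples[:, 0], samples[:, 1])
best_random = samples[np.argmin(values)]

print("gradient descent:", point)
print("random search:   ", best_random)
```

Gradient descent lands essentially on the optimum, while random search only gets close. In hyper-parameter tuning the "function" is a full model training run, so no gradient of this kind is usually available.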

### Imports

In [None]:
import pandas as pd 
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings
warnings.filterwarnings("ignore")

Disclaimer - This notebook is more focused on how to do hyper-parameter tuning, especially using Bayesian Optimization. The model that we'll be using in this notebook is XGBoost.

## HR Analytics Dataset

In [None]:
hr_train = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
hr_test = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
hr_train.head()

In [None]:
def prepare_hr_data(train, test):
        
    features_remove=['enrollee_id','target']
    y = train['target']
    X = train.drop(features_remove, axis=1)
    
    test_id = test['enrollee_id']
    test = test.drop(['enrollee_id'],axis=1)
    
    print(len(X.columns), len(test.columns))
    
    train_len = len(X)

    merged = pd.concat([X, test])
    
    categorical_features = []
    for c in merged.columns:
        col_type = merged[c].dtype
        if col_type == "object" or col_type.name == "category":
            # an option in case the data(pandas dataframe) isn't passed with the categorical column type
            # X[c] = X[c].astype('category')
            categorical_features.append(c)
            
            
    merged = pd.get_dummies(data=merged, columns=categorical_features)
    #merged = merged.fillna(merged.mean())
    
    merged = merged.rename(columns={'experience_<1':'experience_less_than_1',
'experience_>20':'experience_more_than_20',
'company_size_<10':'company_size_less_than_10',
'last_new_job_>4':'last_new_job_more_than_4'})
    
    train_x = merged.iloc[:train_len,:]
    test = merged.iloc[train_len:, :]
    
    return train_x, y, test_id, test

In [None]:
train_X, train_y, test_id, test = prepare_hr_data(hr_train, hr_test)
target = 'target'
IDcol = "enrollee_id"
train = train_X.copy()   # copy, so that adding the target column does not leak it into train_X
train[target] = train_y
train.head()

## Hyper-parameter optimization techniques

Here are some of the major techniques used in the ML domain:
- Manual Search
- Grid Search
- Random Search
- Bayesian Optimization
- Gradient Based Optimization

### Manual Search 

As the name suggests, we use our knowledge and experience to pick the value of a hyper-parameter, then we train the model and evaluate it. Based on the results we change the value of the hyper-parameter and retrain the model. The process continues until satisfactory results are obtained. This is rarely used since it takes a lot of time.

### Grid Search

In this method we define a search space (grid) for each hyper-parameter that we want to tune. The model is trained on each combination of the values that we defined in the grid and then its performance is evaluated. The hyper-parameter values corresponding to the best model performance are picked as final.
Pseudo code snippet -

> clf = RandomForestClassifier() <br/>
> grid_search = {'max_depth': [5,10], 'n_estimators':[100, 200]} <br/>
> model = GridSearchCV(estimator = clf, param_grid = grid_search, cv = 4, verbose= 5, n_jobs = -1) <br/>
> model.fit(train_X,train_y)
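
A runnable version of the same idea, as a sketch on synthetic data: `X_demo` and `y_demo` are made up with `make_classification` and are not the HR dataset. The point to notice is that the number of candidates equals the product of the grid sizes, which is why grid search gets expensive quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 2 values x 2 values = 4 combinations, each evaluated with 3-fold CV.
param_grid = {"max_depth": [3, 5], "n_estimators": [20, 40]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print("combinations tried:", n_candidates)
print("best parameters:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 4))
```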

### Random Search

In this method too we define a search space (grid) for each hyper-parameter that we want to tune, but instead of doing an exhaustive search it searches the grid space randomly, i.e. it tries randomly selected combinations of parameters. Pseudo code snippet -

> clf = RandomForestClassifier() <br/>
> param_grid = {'n_estimators':range(50,500,50)} <br/>
> rnd_search = RandomizedSearchCV(clf, param_grid, scoring='roc_auc', n_iter=4, cv=2) <br/>
> rnd_search.fit(train_X,train_y) <br/>

<img src="https://miro.medium.com/max/3192/1*Q3GY243UjUA7r-pLudRFTQ.png">

<br/>
We can see here that random search can do better because of the way the values are picked. In this example, grid search tested only three unique values for each hyperparameter, whereas random search tested nine unique values for each.

[Image Credit](https://miro.medium.com/max/3192/1*Q3GY243UjUA7r-pLudRFTQ.png)
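
The counting argument can be checked with a few lines of NumPy. This is only a sketch: the grid values and the [0, 1] range are arbitrary choices for illustration. Both strategies spend nine trials, but they differ in how many distinct values of each individual hyper-parameter get tested.

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)

# Grid search: 3 values per hyper-parameter -> 3 x 3 = 9 trials.
grid_axis = [0.1, 0.5, 0.9]
grid_trials = list(itertools.product(grid_axis, grid_axis))

# Random search: 9 trials, each hyper-parameter sampled uniformly from [0, 1].
random_trials = rng.uniform(0.0, 1.0, size=(9, 2))

# Count the distinct values tried for the first hyper-parameter.
unique_grid_x = len({x for x, _ in grid_trials})
unique_rand_x = len(np.unique(random_trials[:, 0]))

print("trials (grid, random):", len(grid_trials), len(random_trials))
print("unique values of the first hyper-parameter (grid, random):", unique_grid_x, unique_rand_x)
```

If only one of the two hyper-parameters really matters, the random strategy has explored it three times as densely for the same budget.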

### Bayesian Optimization 

Thanks to [fernando](https://github.com/fmfn) for this beautiful explanation and animation.

Bayesian optimization works by constructing a posterior distribution of functions (a Gaussian process) that best describes the function you want to optimize. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not. As you iterate over and over, the algorithm balances its needs of exploration and exploitation, taking into account what it knows about the target function. At each step a Gaussian process is fitted to the known samples (points previously explored), and the posterior distribution, combined with an exploration strategy (such as UCB (Upper Confidence Bound) or EI (Expected Improvement)), is used to determine the next point that should be explored (see the gif below).



<img src="https://github.com/fmfn/BayesianOptimization/raw/master/examples/bayesian_optimization.gif">

This process is designed to minimize the number of steps required to find a combination of parameters that are close to the optimal combination. To do so, this method uses a proxy optimization problem (finding the maximum of the acquisition function) that, albeit still a hard problem, is cheaper (in the computational sense) and common tools can be employed. Therefore Bayesian Optimization is most adequate for situations where sampling the function to be optimized is a very expensive endeavor.
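
The loop described above can be sketched in a few lines with scikit-learn's `GaussianProcessRegressor`. This is a simplified illustration and not how the `bayes_opt` package is implemented internally. The 1-D `target` function, the candidate grid and the value of `kappa` are all assumptions chosen for the demo, and the acquisition function is maximized by brute force over the grid rather than with a proper optimizer.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def target(x):
    # Hypothetical "expensive" function to maximize; its maximum (3.0) is at x = 2.
    return -(x - 2.0) ** 2 + 3.0

rng = np.random.default_rng(0)
bounds = (-5.0, 5.0)
candidates = np.linspace(bounds[0], bounds[1], 501).reshape(-1, 1)

# Step 0: a few random initial observations.
X_obs = rng.uniform(bounds[0], bounds[1], size=(3, 1))
y_obs = target(X_obs).ravel()

kappa = 2.5  # larger kappa -> more exploration, smaller -> more exploitation
for _ in range(12):
    # Fit a Gaussian process to every point explored so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_obs, y_obs)

    # Upper Confidence Bound: posterior mean plus kappa times posterior std.
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + kappa * std

    # Evaluate the target at the candidate with the highest UCB.
    x_next = candidates[np.argmax(ucb)].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, target(x_next).ravel())

best_x = X_obs[np.argmax(y_obs), 0]
best_y = y_obs.max()
print("best x found:", best_x, "with value:", best_y)
```

Only 15 evaluations of `target` are spent in total (3 random plus 12 guided), which is the budget-saving behaviour that makes the method attractive when each evaluation is a full model training run.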

## Bayesian Optimization in Action [on XGBoost Model]

We'll see Bayesian optimization in action because it usually needs far fewer model evaluations than grid or random search.

In [None]:
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics   # Additional scikit-learn functions
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV   # Performing grid / randomized search
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils import shuffle
from bayes_opt import BayesianOptimization as BO
import scipy as sp
import matplotlib as mp
from matplotlib.pylab import rcParams
%matplotlib inline

In [None]:
def xgb_bo(X, y, categoricals_features = None):
    
    # For any optimization problem we have one objective function and a search space.
    
    # Define the evaluation function / objective function and parameter space.
    # Here our objective function is the cross-validated AUC of the XGBoost model.
    
    def xgb_eval(learning_rate, max_depth, min_child_weight, max_delta_step, 
                 min_split_loss, subsample, colsample_bytree, reg_lambda, reg_alpha, scale_pos_weight):
        
        xgb_params = {'objective': 'binary:logistic',
                      'booster': 'gbtree',
                      'eval_metric': 'auc',
                      'learning_rate': learning_rate,
                      'max_depth': int(max_depth),
                      'min_child_weight': min_child_weight,
                      'max_delta_step': max_delta_step,
                      'min_split_loss': min_split_loss,
                      'subsample': subsample,
                      'colsample_bytree': colsample_bytree,  # a fraction in (0, 1]; int() would truncate it to 0 or 1
                      'reg_lambda': reg_lambda,  
                      'reg_alpha': reg_alpha,
                      'scale_pos_weight' : scale_pos_weight  # kept as a float, as in the final model below
                    }
        
        dtrain = xgb.DMatrix(X, y)
        cv_xgb = xgb.cv(xgb_params, dtrain)
        
        
        
        return cv_xgb['test-auc-mean'].iloc[-1]
    
    ### Define the search space
    
    xgb_param_space = {
        'learning_rate': (0.001, 0.1),
        'max_depth': (3, 80),
        'min_child_weight': (0.1, 10),
        'max_delta_step': (0,10),
        'min_split_loss': (0,1),
        'subsample': (0.2, 1),
        'colsample_bytree': (0.2, 1),
        'reg_lambda': (0, 50),
        'reg_alpha': (0, 50),
        'scale_pos_weight': (0.2, 5)
    }
    
    return xgb_eval, xgb_param_space


def run_bayes_opt(eval_func, param_space):
    
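    """
    Run Bayesian optimization of 'eval_func' over 'param_space'.
    'init_points' is the number of random initial evaluations.
    'n_iter' is the number of optimization steps after the random initializations.
    Note: passing 'acq' and 'alpha' to maximize() works on older bayes_opt releases;
    newer releases may expect the acquisition function to be configured differently.
    """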
    
    bo = BO(eval_func, param_space)
    
    n_iter = 20
    init_points = 5
    
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore')
        
        bo.maximize(init_points = init_points,
                   n_iter = n_iter,
                   acq = 'ucb',
                   alpha = 1e-6)
        
    return bo

In [None]:
bayes_op = True
if bayes_op:
    # run Bayesian optimization
    xgb_eval, xgb_param_space = xgb_bo(train_X, train_y)
    # use a separate name for the result so that the xgb_bo function is not overwritten
    optimizer = run_bayes_opt(xgb_eval, xgb_param_space)
    max_bo_params = optimizer.max['params']

Optimized parameters from Bayesian Optimization

In [None]:
max_bo_params

In [None]:
xgb_model = XGBClassifier(
    n_estimators = 1000,
    objective = 'binary:logistic',
    booster = 'gbtree',
    learning_rate = max_bo_params['learning_rate']/2,
    max_depth = int(max_bo_params['max_depth']),
    min_child_weight = max_bo_params['min_child_weight'],
    max_delta_step = max_bo_params['max_delta_step'],
    min_split_loss = max_bo_params['min_split_loss'],
    subsample = max_bo_params['subsample'],
    colsample_bytree = max_bo_params['colsample_bytree'],
    reg_lambda = max_bo_params['reg_lambda'],
    reg_alpha = max_bo_params['reg_alpha'],
    scale_pos_weight = max_bo_params['scale_pos_weight'], 
    seed=27)

The above xgb_model uses the parameters optimized by Bayesian Optimization. One can increase n_iter and init_points to get better results.

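The parameter n_estimators couldn't be optimized by the Bayesian optimization above. A likely reason is that the native `xgb.cv` API takes the number of boosting rounds from its `num_boost_round` argument rather than from the parameter dictionary, so it would have to be passed there instead.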

So, for tuning n_estimators we'll use randomized search. (Note that this step takes a lot of time to run.)

In [None]:
param_test = {  
    'n_estimators':range(50,1000,50)
}
rnd_search = RandomizedSearchCV(xgb_model, param_test, scoring='roc_auc', n_iter=4, cv=4)
rnd_search.fit(train_X,train_y)

In [None]:
rnd_search.best_params_, rnd_search.best_score_

### Final Model with optimized parameters

In [None]:
xgb_final = XGBClassifier(
    n_estimators = rnd_search.best_params_['n_estimators'],
    objective = 'binary:logistic',
    booster = 'gbtree',
    learning_rate = max_bo_params['learning_rate']/2,
    max_depth = int(max_bo_params['max_depth']),
    min_child_weight = max_bo_params['min_child_weight'],   # same (untruncated) values that were tuned above
    max_delta_step = max_bo_params['max_delta_step'],
    min_split_loss = max_bo_params['min_split_loss'],
    subsample = max_bo_params['subsample'],
    colsample_bytree = max_bo_params['colsample_bytree'],
    reg_lambda = max_bo_params['reg_lambda'],
    reg_alpha = max_bo_params['reg_alpha'],
    scale_pos_weight = max_bo_params['scale_pos_weight'], 
    seed=27)

In [None]:
def modelfit(alg, dtrain, val, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50, verbose=True):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics=["auc", "logloss"], early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    val_predictors = [x for x in val.columns if x not in [IDcol,target]]
    eval_s = [(dtrain[predictors], dtrain[target]),(val[predictors], val[target])]
    alg.fit(dtrain[predictors], dtrain[target],early_stopping_rounds=early_stopping_rounds,eval_metric=["auc", "logloss"],eval_set=eval_s,verbose=verbose)
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
    val_predictions = alg.predict(val[val_predictors])
        
    #Print model report:
    print ("\nModel Report")
    print ("Train Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
    print ("Validation Accuracy : %.4g" % metrics.accuracy_score(val[target].values, val_predictions))
    
    
    # retrieve performance metrics
    results = alg.evals_result()
    epochs = len(results['validation_0']['logloss'])
    x_axis = range(0, epochs)
    # plot log loss
    fig, ax = plt.subplots()
    ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
    ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
    ax.legend()
    plt.ylabel('Log Loss')
    plt.title('XGBoost Log Loss')
    plt.show()
    #plt.savefig("log.png")
    fig, ax = plt.subplots()
    ax.plot(x_axis, results['validation_0']['auc'], label='Train')
    ax.plot(x_axis, results['validation_1']['auc'], label='Test')
    ax.legend()
    plt.ylabel('Area under curve')
    plt.title('XGBoost AUC value')
    plt.show()
    #plt.savefig('error.png')
    
    
    return

In [None]:
def evaluate_model_performnce(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    
    # Visualizing model performance
    ax= plt.subplot()
    sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells

    # labels, title and ticks
    ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
    ax.set_title('Confusion Matrix'); 

    tn, fp, fn, tp = cm.ravel()
    #print(tn, fp, fn, tp)
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = ((tp+tn)/(tp+tn+fp+fn))*100
    print("Precision : ",precision)
    print("Recall : ",recall)
    print("F1 Score : ",f1)
    print("Validation Accuracy : ",accuracy)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    print("Accuracy Score : ", accuracy)

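    # Note: y_pred holds hard class labels here, so this AUC is computed from a single
    # threshold; pass predicted probabilities instead to get the usual ROC AUC.
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)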
    auc = metrics.auc(fpr, tpr)
    print("AUC Value : ", auc)
    
    return accuracy, auc, f1
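
As a sanity check of the formulas used in the function above, here is a small sketch on hand-made labels. The arrays `y_true` and `y_hat` are invented for illustration. The values derived from the confusion matrix are compared with scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision, recall, f1, accuracy:", precision, recall, f1, accuracy)
print("sklearn agrees:",
      np.isclose(precision, precision_score(y_true, y_hat)),
      np.isclose(recall, recall_score(y_true, y_hat)),
      np.isclose(f1, f1_score(y_true, y_hat)))
```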

Running the experiment

In [None]:
#xgb_final = XGBClassifier()

In [None]:
# Prepare train and validation set
df = train
train, val = train_test_split(df, test_size=0.2)
# Training the XGBoost Model
predictors = [x for x in train.columns if x not in [target, IDcol]]
modelfit(xgb_final, train, val, predictors, useTrainCV=False, early_stopping_rounds=30)

### Happy Learning!!