# Hyperparamter Tuning - XGBoost
1. Import the customer churn data (I have already cleaned it)
2. Split the data into test and train sets
3. Build data matrices - as XGBoost uses DMatrix
4. Find the logloss of the model with default parameters
5. Tune the parameters
6. Find the logloss of the model with tuned parameters

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
df = pd.DataFrame(X, columns=cancer.feature_names)
df['target'] = y

In [3]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [4]:
# Checking the dimension of the data

df.shape

(569, 31)

In [5]:
# Splitting the data into train and test datasets
# test:train = 3:7
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [6]:
# XGBoost uses an internal data structure DMatrix - which optimizes both memory effieciency and speed
# Hence, rather than using pandas dataframe, we will use data matrix - DMatrix

import xgboost as xgb

dm_train = xgb.DMatrix(X_train, label=y_train)
dm_test = xgb.DMatrix(X_test, label=y_test)

## Building Model

Ideal case would include an exhaustive gridsearch on all the parameters. However, such an approach is computationally intensive. Hence, we will focus on few important parameters and tune them sequentially. Following are the parameters that we will tune in this process:
1. max_depth
2. min_child_weight
3. subsample
4. colsample_bytree
5. eta
6. num_boost_rounds
7. early_stopping_rounds

We will use logistic loss function to assess the accuracy of predictions, as this is a classification problem

In [7]:
# We will set num_boost_rounds to 100, early_stopping_rounds to 10, and objective to binary:logistic.
# All the other values at this stage are default values.
# We will tune our model by chaning the default values.

params = {'max_depth':6, 'min_child_weight':1, 'eta':0.3, 'subsample':1, 
          'colsample_bytree':1, 'objective':'binary:logistic',}

# We will use logloss function to evaluate the model's performance
params['eval_metric'] = "logloss"

xgmodel = xgb.train(params, dtrain = dm_train, num_boost_round = 100, evals = [(dm_test,"Test")], 
                    early_stopping_rounds = 10)

print("Best Logloss: {:.3f} | Rounds: {}".format(xgmodel.best_score,xgmodel.best_iteration+1))

[0]	Test-logloss:0.48345
[1]	Test-logloss:0.36545
[2]	Test-logloss:0.28797
[3]	Test-logloss:0.23170
[4]	Test-logloss:0.19662
[5]	Test-logloss:0.17002
[6]	Test-logloss:0.14550
[7]	Test-logloss:0.13175
[8]	Test-logloss:0.12037
[9]	Test-logloss:0.11088
[10]	Test-logloss:0.10224
[11]	Test-logloss:0.09588
[12]	Test-logloss:0.09096
[13]	Test-logloss:0.08960
[14]	Test-logloss:0.08519
[15]	Test-logloss:0.08142
[16]	Test-logloss:0.07903
[17]	Test-logloss:0.07691
[18]	Test-logloss:0.07538
[19]	Test-logloss:0.07396
[20]	Test-logloss:0.07265
[21]	Test-logloss:0.07262
[22]	Test-logloss:0.07007
[23]	Test-logloss:0.07007
[24]	Test-logloss:0.06937
[25]	Test-logloss:0.06889
[26]	Test-logloss:0.06774
[27]	Test-logloss:0.06804
[28]	Test-logloss:0.06711
[29]	Test-logloss:0.06791
[30]	Test-logloss:0.06578
[31]	Test-logloss:0.06600
[32]	Test-logloss:0.06535
[33]	Test-logloss:0.06362
[34]	Test-logloss:0.06389
[35]	Test-logloss:0.06352
[36]	Test-logloss:0.06433
[37]	Test-logloss:0.06281
[38]	Test-logloss:0.06

Here, we found that the tenth round gave the best result and the results did not improve in the next 10 rounds. Hence, the iteration stopped at round 19 and we did not reach the maximum number of boosting rounds (100). Finding a suitable evidence to stop the iterations is important. Stopping the iterations when results do not improve prevents overfittig and the inefficient utilization of resources. We will use cross validation to tune the parameters within the params dictionary

In [8]:
# Parameters: max-depth and min_child_weight
# I realized that the optimal values are in the following ranges through multiple iterations

gridsearch_params = [(max_depth, min_child_weight)
                    for max_depth in range(1,4)
                    for min_child_weight in range(17,21)]

In [9]:
logloss_min = float("Inf")
best_params = None

for max_depth, min_child_weight in gridsearch_params:
    
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    
    xg_cvresults = xgb.cv(params, dtrain = dm_train, num_boost_round = 100,
                      seed = 0, nfold=10, metrics = {'logloss'}, early_stopping_rounds = 10,)
    
    logloss_mean = xg_cvresults['test-logloss-mean'].min()
    
    print("max_depth: {} | min_child_weight: {} with Logloss: {:.3}\n".format(max_depth,min_child_weight,logloss_mean))
    
    if logloss_mean < logloss_min:
        logloss_min = logloss_mean
        best_params = (max_depth, min_child_weight)

        
print("Best Parameters: max_depth: {} | min_child_weight: {} with Logloss: {:.3f}". format(best_params[0], 
                                                                                  best_params[1], logloss_min))

max_depth: 1 | min_child_weight: 17 with Logloss: 0.199

max_depth: 1 | min_child_weight: 18 with Logloss: 0.206

max_depth: 1 | min_child_weight: 19 with Logloss: 0.211

max_depth: 1 | min_child_weight: 20 with Logloss: 0.217

max_depth: 2 | min_child_weight: 17 with Logloss: 0.195

max_depth: 2 | min_child_weight: 18 with Logloss: 0.201

max_depth: 2 | min_child_weight: 19 with Logloss: 0.204

max_depth: 2 | min_child_weight: 20 with Logloss: 0.219

max_depth: 3 | min_child_weight: 17 with Logloss: 0.195

max_depth: 3 | min_child_weight: 18 with Logloss: 0.201

max_depth: 3 | min_child_weight: 19 with Logloss: 0.204

max_depth: 3 | min_child_weight: 20 with Logloss: 0.219

Best Parameters: max_depth: 2 | min_child_weight: 17 with Logloss: 0.195


In [10]:
# Updating the parameters with the best values: max_depth = 2 and min_child_weight = 19

params['max_depth'] = 2
params['min_child_weight'] = 19

In [11]:
# Parameters: subsample and colsample_bytree
# I found that the optimal values are in the following ranges through multiple iterations

gridsearch_params = [
    (subsample, colsample)
    for subsample in [i/10. for i in range(7,11)]
    for colsample in [i/10. for i in range(1,5)]
]

In [12]:
logloss_min = float("Inf")
best_params = None

for subsample, colsample in (gridsearch_params):
    
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample
    
    xg_cvresults = xgb.cv(params, dtrain = dm_train, num_boost_round = 100,
                      seed = 0, nfold=10, metrics = {'logloss'}, early_stopping_rounds = 10,)
    
    logloss_mean = xg_cvresults['test-logloss-mean'].min()
    
    print("subsample: {} | colsample: {} with Logloss: {:.3f}\n".format(subsample,colsample,logloss_mean))
    
    if logloss_mean < logloss_min:
        logloss_min = logloss_mean
        best_params = (subsample, colsample)
        
print("Best Parameters: subsample: {} | colsample: {} with Logloss: {:.3f}". format(best_params[0], 
                                                                           best_params[1], logloss_min))

subsample: 0.7 | colsample: 0.1 with Logloss: 0.279

subsample: 0.7 | colsample: 0.2 with Logloss: 0.271

subsample: 0.7 | colsample: 0.3 with Logloss: 0.262

subsample: 0.7 | colsample: 0.4 with Logloss: 0.259

subsample: 0.8 | colsample: 0.1 with Logloss: 0.256

subsample: 0.8 | colsample: 0.2 with Logloss: 0.250

subsample: 0.8 | colsample: 0.3 with Logloss: 0.235

subsample: 0.8 | colsample: 0.4 with Logloss: 0.236

subsample: 0.9 | colsample: 0.1 with Logloss: 0.228

subsample: 0.9 | colsample: 0.2 with Logloss: 0.223

subsample: 0.9 | colsample: 0.3 with Logloss: 0.212

subsample: 0.9 | colsample: 0.4 with Logloss: 0.220

subsample: 1.0 | colsample: 0.1 with Logloss: 0.215

subsample: 1.0 | colsample: 0.2 with Logloss: 0.211

subsample: 1.0 | colsample: 0.3 with Logloss: 0.210

subsample: 1.0 | colsample: 0.4 with Logloss: 0.204

Best Parameters: subsample: 1.0 | colsample: 0.4 with Logloss: 0.204


In [13]:
# Updating the parameters with the best values: subsample = 0.9 and colsample = 0.4

params['subsample'] = 0.9
params['colsample_bytree'] = 0.4

In [14]:
# Parameter: eta

logloss_min = float("Inf")
best_params = None

for eta in [0.3, 0.2, 0.1, 0.05, 0.01, 0.005]:
    
    params['eta'] = eta
    
    xg_cvresults = xgb.cv(params, dtrain = dm_train, num_boost_round = 100,
                      seed = 0, nfold=10, metrics = {'logloss'}, early_stopping_rounds = 10,)
    
    logloss_mean = xg_cvresults['test-logloss-mean'].min()
    print("eta: {} with Logloss: {:.3}\n".format(eta,logloss_mean))
    
    if logloss_mean < logloss_min:
        logloss_min = logloss_mean
        best_params = eta
        
print("Best Parameter: eta: {} with Logloss: {:.3f}". format(best_params, logloss_min))

eta: 0.3 with Logloss: 0.22

eta: 0.2 with Logloss: 0.22

eta: 0.1 with Logloss: 0.224

eta: 0.05 with Logloss: 0.218

eta: 0.01 with Logloss: 0.333

eta: 0.005 with Logloss: 0.454

Best Parameter: eta: 0.05 with Logloss: 0.218


In [15]:
# Updating the eta parameter with the best value

params['eta'] = 0.3

In [16]:
# Setting the optimum paramters

params = {'colsample_bytree': 0.4,
          'eta': 0.3,
          'eval_metric': 'logloss',
          'max_depth': 2,
          'min_child_weight': 19,
          'objective':'binary:logistic',
          'subsample': 0.9}

In [17]:
# Finding the optimal number of rounds for the model with new parameters

xgmodel_tuned = xgb.train(params, dtrain = dm_train, 
                          num_boost_round=100, evals=[(dm_test,"Test")], early_stopping_rounds=10)


print("Best Logloss: {:.3f} in {} rounds". format(xgmodel_tuned.best_score, xgmodel_tuned.best_iteration+1))

[0]	Test-logloss:0.52934
[1]	Test-logloss:0.41815
[2]	Test-logloss:0.33513
[3]	Test-logloss:0.27513
[4]	Test-logloss:0.23717
[5]	Test-logloss:0.21494
[6]	Test-logloss:0.19976
[7]	Test-logloss:0.19125
[8]	Test-logloss:0.19040
[9]	Test-logloss:0.19004
[10]	Test-logloss:0.18961
[11]	Test-logloss:0.18915
[12]	Test-logloss:0.18874
[13]	Test-logloss:0.18877
[14]	Test-logloss:0.18936
[15]	Test-logloss:0.18900
[16]	Test-logloss:0.18865
[17]	Test-logloss:0.18906
[18]	Test-logloss:0.18901
[19]	Test-logloss:0.18903
[20]	Test-logloss:0.18905
[21]	Test-logloss:0.18893
[22]	Test-logloss:0.18923
[23]	Test-logloss:0.18906
[24]	Test-logloss:0.18903
[25]	Test-logloss:0.18894
[26]	Test-logloss:0.18904
Best Logloss: 0.189 in 17 rounds


In [18]:
from IPython.display import Markdown as md
md("With the tuned parameters we would need {} rounds to achieve the best result The improvement after parameter tuning is marginal in our case. ".format(xgmodel_tuned.best_iteration+1))

With the tuned parameters we would need 17 rounds to achieve the best result The improvement after parameter tuning is marginal in our case. 

Logloss of our model decreased from 0.424 to 0.417 However, we were able to see how parameters can be tuned.

Here we have used only a few combination of parameters. We can further improve the impact of tuning; however, doing so would be computationally more expensive. More combination of parameters and wider ranges of values for each of those paramaters would have to be tested.