# XGB HyperParameter Tuning using Optuna

#### Some params to tune

The parameter descriptions below are taken from the XGBoost docs.

- booster [default=gbtree]
Which booster to use. Can be **gbtree**, **gblinear** or **dart**; gbtree and dart use tree based models while gblinear uses linear functions.

- num_feature [set automatically by XGBoost, no need to be set by user]: 
Feature dimension used in boosting, set to maximum dimension of the feature.

- eta [default=0.3, alias: learning_rate]
Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. range: [0,1]

- gamma [default=0, alias: min_split_loss]
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. range: [0,∞]

- max_depth [default=6]
Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguide growing policy when tree_method is set as hist or gpu_hist and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree. range: [0,∞]

- min_child_weight [default=1]
Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be. range: [0,∞]

- max_delta_step [default=0]
Maximum delta step we allow each leaf output to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update. range: [0,∞]

- max_leaves [default=0]
Maximum number of nodes to be added. Only relevant when grow_policy=lossguide is set.

- max_bin [default=256]
Only used if tree_method is set to hist or gpu_hist. Maximum number of discrete bins to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time.

- lambda [default=1, alias: reg_lambda]
L2 regularization term on weights. Increasing this value will make the model more conservative.

- alpha [default=0, alias: reg_alpha]
L1 regularization term on weights. Increasing this value will make the model more conservative.

- colsample_bytree, colsample_bylevel, colsample_bynode [default=1]
This is a family of parameters for subsampling of columns.

    - All colsample_by* parameters have a range of (0, 1], the default value of 1, and specify the fraction of columns to be subsampled.

    - colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

    - colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.

    - colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.

    - colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8 features to choose from at each split.
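The cumulative effect of the colsample_by* parameters can be checked with a few lines of arithmetic. This is only a sketch: `features_per_split` is a hypothetical helper, and truncating with `int()` is an assumption about rounding, not necessarily what XGBoost does internally.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    """Approximate number of columns left to choose from at each split."""
    remaining = n_features
    # Each stage subsamples from the columns kept by the previous stage.
    for fraction in (bytree, bylevel, bynode):
        remaining = int(remaining * fraction)
    return remaining


# The example from the docs: 64 features, 0.5 at every stage.
print(features_per_split(64, bytree=0.5, bylevel=0.5, bynode=0.5))  # 8
```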

## Importing Dependencies

In [None]:
import os
import optuna
import pandas as pd
import numpy as np
import xgboost as xgb
import sklearn.metrics as metrics
import sklearn.preprocessing as prep
import sklearn.model_selection as ms
from functools import partial

In [None]:
train_csv = pd.read_csv("../input/tps-september-xgb-in-gpu-baseline/train_mean_filling.csv")

### Creating feature and target set

In [None]:
target = "claim"
features = [f for f in train_csv.columns if f not in ["id", target]]
print(features)

# Cross Validation

In [None]:
# crossvalidation utility
class CrossValidation:
    def __init__(self, df, shuffle,random_state=None):
        self.df = df
        self.random_state = random_state
        self.shuffle = shuffle
        if shuffle is True:
            self.df = df.sample(frac=1,
                random_state=self.random_state).reset_index(drop=True)

    def hold_out_split(self,percent,stratify=None):
        if stratify is not None:
            y = self.df[stratify]
            train,val = ms.train_test_split(self.df, test_size=percent/100,
                stratify=y, random_state=self.random_state)
            return train,val
        size = len(self.df) - int(len(self.df)*(percent/100))
        train = self.df.iloc[:size,:]
        val = self.df.iloc[size:,:]
        return train,val

    def kfold_split(self, splits, stratify=None):
        # scikit-learn raises an error if random_state is set while shuffle is False
        rs = self.random_state if self.shuffle else None
        if stratify is not None:
            kf = ms.StratifiedKFold(n_splits=splits,
                shuffle=self.shuffle,
                random_state=rs)
            y = self.df[stratify]
            for train, val in kf.split(X=self.df,y=y):
                t = self.df.iloc[train,:]
                v = self.df.iloc[val, :]
                yield t,v
        else:
            kf = ms.KFold(n_splits=splits, shuffle=self.shuffle,
                random_state=rs)
            for train, val in kf.split(X=self.df):
                t = self.df.iloc[train,:]
                v = self.df.iloc[val, :]
                yield t,v
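To make the k-fold idea concrete, here is a small pure-Python sketch of how row indices can be partitioned so that every row lands in exactly one validation fold. `kfold_indices` is a hypothetical helper written for illustration; it is not scikit-learn's implementation, and its shuffling will not match `ms.KFold`.

```python
import random


def kfold_indices(n_samples, n_splits, shuffle=True, seed=None):
    """Yield (train_indices, val_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


demo_splits = list(kfold_indices(10, 5, seed=42))
for fold, (train, val) in enumerate(demo_splits, start=1):
    print(f"fold {fold}: {len(train)} train rows, validation rows {sorted(val)}")
```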

In [None]:
seed = 42
folds = 5

In [None]:
cv = CrossValidation(train_csv,
                     shuffle=True,
                     random_state=seed
                    )

In [None]:
del train_csv

# Hyperparameter Tuning using Optuna

### Run Optuna on the whole training set and pick the best hyperparameters from all trials

In [None]:
def tuning_params(trial):
    n_estimators = trial.suggest_int("n_estimators", 1000, 11000, step=1000)
    learning_rate = trial.suggest_float("learning_rate", 1e-2, 0.25, log=True)
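    # suggest_loguniform is deprecated in recent Optuna; suggest_float(..., log=True) is the equivalent
    reg_lambda = trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True)
    reg_alpha = trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True)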
    subsample = trial.suggest_float("subsample", 0.1, 1.0)
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.1, 1.0)
    max_depth = trial.suggest_int("max_depth", 1, 7)

    # Score the sampled hyperparameters on every fold and average the AUC.
    fold_aucs = []
    for train_, val_ in cv.kfold_split(folds):
        trainX = train_[features]
        trainY = train_[target]
        valX = val_[features]
        valY = val_[target]

        model = xgb.XGBClassifier(
            seed=1,
            tree_method="gpu_hist",
            gpu_id=0,
            predictor="gpu_predictor",
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            reg_lambda=reg_lambda,
            reg_alpha=reg_alpha,
            subsample=subsample,
            colsample_bytree=colsample_bytree,
            max_depth=max_depth,
            use_label_encoder=False
        )
        model.fit(trainX, trainY, 
                  early_stopping_rounds=300, 
                  eval_set=[(valX, valY)],
                  eval_metric="auc",
                  verbose=1000
                 )
        
        # ROC AUC ranks scores, so use the positive-class probability, not hard labels.
        predY = model.predict_proba(valX)[:, 1]
        fold_aucs.append(metrics.roc_auc_score(valY, predY))

    # Return after the loop; returning inside it would stop after the first fold.
    return float(np.mean(fold_aucs))
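ROC AUC is a ranking metric: it measures how often a positive example is scored above a negative one. The sketch below computes it by brute force on four made-up samples to show how thresholded labels can lose ranking information that probabilities keep. `roc_auc` is a hypothetical helper for illustration only; in the notebook the metric comes from `metrics.roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
proba = [0.1, 0.3, 0.4, 0.8]                     # made-up predicted probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]   # hard labels at a 0.5 threshold

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)  # 1.0 0.75
```

The probabilities rank both positives above both negatives, but the 0.5 threshold collapses three of the four samples to the same label, so the AUC computed from labels is lower.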

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(tuning_params, n_trials=5, gc_after_trial=True)
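The search space above uses log-scaled ranges for `learning_rate`, `reg_lambda` and `reg_alpha`. As a rough illustration of what a log scale means, the sketch below draws values uniformly in log space with the standard library. It only mirrors the idea; `sample_log_uniform` is a hypothetical helper, and Optuna's default TPE sampler is more sophisticated than plain random draws.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-8, 100.0, rng) for _ in range(10_000)]

# The range spans 10 decades (1e-8 to 1e2); 8 of them lie below 1.0,
# so roughly 80% of the draws should fall below 1.0.
share_below_one = sum(s < 1.0 for s in samples) / len(samples)
print(f"share of samples below 1.0: {share_below_one:.2f}")
```

On a linear scale almost every draw from this range would be larger than 1, which is why regularization strengths are usually searched on a log scale.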

In [None]:
print(study.best_params)

### Set up an Optuna study for one fold, get its best hyperparameters, then repeat for the other folds

In [None]:
def tuning_fold_params(train_, val_, trial):
    n_estimators = trial.suggest_int("n_estimators", 1000, 11000, step=1000)
    learning_rate = trial.suggest_float("learning_rate", 1e-2, 0.25, log=True)
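    # suggest_loguniform is deprecated in recent Optuna; suggest_float(..., log=True) is the equivalent
    reg_lambda = trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True)
    reg_alpha = trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True)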
    subsample = trial.suggest_float("subsample", 0.1, 1.0)
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.1, 1.0)
    max_depth = trial.suggest_int("max_depth", 1, 7)
    
    trainX = train_[features]
    trainY = train_[target]
    valX = val_[features]
    valY = val_[target]

    model = xgb.XGBClassifier(
        seed=1,
        tree_method="gpu_hist",
        gpu_id=0,
        predictor="gpu_predictor",
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        reg_lambda=reg_lambda,
        reg_alpha=reg_alpha,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        max_depth=max_depth,
        use_label_encoder=False
    )
    model.fit(trainX, trainY, 
              early_stopping_rounds=300, 
              eval_set=[(valX, valY)],
              eval_metric="auc",
              verbose=1000
             )

    # ROC AUC ranks scores, so use the positive-class probability, not hard labels.
    predY = model.predict_proba(valX)[:, 1]
    val_auc = metrics.roc_auc_score(valY, predY)
    return val_auc

In [None]:
best_params = []
for fold, (train_, val_) in enumerate(cv.kfold_split(folds)):
    print("Trial Fold: ", fold+1)
    # Use a fresh study per fold so best_params reflects this fold only,
    # not the best trial carried over from earlier folds.
    study = optuna.create_study(direction="maximize")
    trial_fn = partial(tuning_fold_params, train_, val_)
    study.optimize(trial_fn, n_trials=5, gc_after_trial=True)
    best_params.append(study.best_params)
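Optuna calls the objective with a single `trial` argument, which is why the fold data is bound in advance with `functools.partial`. A minimal sketch of that binding, using a stand-in objective and placeholder strings instead of real data frames and trials:

```python
from functools import partial


def demo_objective(train_, val_, trial):
    # Stand-in for tuning_fold_params: just report what it received.
    return f"train={train_}, val={val_}, trial={trial}"


# Bind the first two positional arguments; the result takes only `trial`.
demo_trial_fn = partial(demo_objective, "fold-0-train", "fold-0-val")
demo_result = demo_trial_fn("trial-0")
print(demo_result)  # train=fold-0-train, val=fold-0-val, trial=trial-0
```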

In [None]:
for best in best_params:
    print(best)
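The per-fold results still have to be reduced to one configuration before training a final model. One possible heuristic, not part of the original notebook, is to take the median of each parameter across folds. `aggregate_params` is a hypothetical helper and the values below are placeholders, not tuning results.

```python
from statistics import median


def aggregate_params(per_fold_params):
    """Combine per-fold parameter dicts by taking the median of each value."""
    combined = {}
    for key in per_fold_params[0]:
        values = [params[key] for params in per_fold_params]
        mid = median(values)
        # Keep integer parameters (e.g. max_depth) as integers.
        if all(isinstance(v, int) for v in values):
            mid = int(round(mid))
        combined[key] = mid
    return combined


# Placeholder values for illustration only.
example_params = [
    {"max_depth": 3, "learning_rate": 0.05},
    {"max_depth": 5, "learning_rate": 0.02},
    {"max_depth": 4, "learning_rate": 0.10},
]
print(aggregate_params(example_params))  # {'max_depth': 4, 'learning_rate': 0.05}
```

The median is less sensitive than the mean to a single fold that picked an extreme value, though refitting with the parameters of the best-scoring fold would be an equally reasonable choice.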