# Overview

In this notebook, we will run an XGBoost Classifier on the training data set that was created in the aggregation notebook. We will use Bayesian Optimization to find the best hyper-parameters, fit a model with those hyper-parameters, and then generate predictions for the testing data set that was created in the aggregation notebook. Finally, we will find the most important features from the XGBoost Classifier and re-run the classifier on only those features to see whether this improves model performance.

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from bayes_opt import BayesianOptimization
import re
import gc

In [2]:
train = pd.read_csv('/kaggle/input/msba-6420-predictive-analytics-project/Alldata_v3/train.csv')
train = train.rename(columns = lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

In [3]:
gc.collect()

21

## Bayesian Optimizer for XGBoost Model

We will tune the following hyper-parameters using Bayesian Optimization to reduce the likelihood that the model overfits. We use Bayesian Optimization because it typically needs far fewer model evaluations than grid search. Rather than evaluating every combination of hyper-parameters, as grid search does, Bayesian Optimization fits a surrogate model to the results observed so far and chooses the next point to evaluate by maximizing an acquisition function. The surrogate model is a Gaussian Process, which is flexible and gives us uncertainty estimates.
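
To make the surrogate/acquisition loop concrete, here is a minimal sketch on a one-dimensional toy problem. It is not the internals of the `bayes_opt` package: the RBF kernel, its length scale, the upper-confidence-bound acquisition with `kappa=2.0`, and `toy_objective` (a stand-in for the cross-validated AUC) are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_obs, k_cross)        # K^-1 k*, shape (n_obs, n_new)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_cross * solved, axis=0)    # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

def toy_objective(x):
    """Hypothetical stand-in for the cross-validated AUC; peaks at x = 0.3."""
    return -(x - 0.3) ** 2

def bayes_opt_sketch(n_iter=10, kappa=2.0):
    grid = np.linspace(0.0, 1.0, 101)               # candidate hyper-parameter values
    x_obs = np.array([0.0, 1.0])                    # two initial points
    y_obs = toy_objective(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        ucb = mean + kappa * std                    # acquisition: optimism under uncertainty
        x_next = grid[np.argmax(ucb)]               # next point to evaluate
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, toy_objective(x_next))
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best], x_obs

best_x, best_y, evaluated = bayes_opt_sketch()
```

Each iteration costs one objective evaluation, so the 12 evaluations here play the same role as `init_points=2` plus `n_iter=10` in the tuning run, whereas a grid search would need one evaluation per grid combination.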

* colsample_bytree: the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
* colsample_bylevel: the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
* max_depth: this is the maximum depth of the tree. Increasing this value will make the model more complex and more likely to overfit.
* reg_alpha: L1 regularization to control overfitting.
* reg_lambda: L2 regularization to control overfitting.
* gamma: this specifies the minimum loss reduction required to make a split. The larger gamma is, the more conservative the algorithm will be.
* min_child_weight: minimum sum of weights of all observations required in a child. Higher values are more likely to reduce overfitting, but values that are too high can result in underfitting.
* n_estimators: number of trees (or rounds) in an XGBoost model. The more trees, the more likely the model will overfit.
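
The two column-sampling ratios are nested: in XGBoost, `colsample_bytree` is applied once per tree and `colsample_bylevel` is then applied at each depth level to the columns already chosen for that tree. The sketch below only illustrates that nesting. The feature names are hypothetical, and the rounding rule (truncate, keep at least one column) is an assumption rather than a statement about XGBoost's internals.

```python
import random

def sample_columns(columns, ratio, rng):
    """Pick a random subset of columns; keeping at least one is an assumption."""
    n_keep = max(1, int(len(columns) * ratio))
    return rng.sample(columns, n_keep)

rng = random.Random(27)
all_columns = [f"f{i}" for i in range(100)]   # hypothetical feature names

# colsample_bytree = 0.8: drawn once for the whole tree
tree_columns = sample_columns(all_columns, 0.8, rng)

# colsample_bylevel = 0.5: drawn again at each depth level, from the tree's columns only
level_columns = {depth: sample_columns(tree_columns, 0.5, rng) for depth in range(3)}
```

With these ratios, each level of a tree sees 40 of the original 100 columns, which is why setting both values low compounds quickly.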

We will keep the following hyper-parameters constant:
* base_score: the initial prediction score of all instances (the global bias). We will set it to 0.5, which is the default in the XGBoost version used here.
* max_delta_step: maximum delta step we allow each tree’s weight estimation to be. We will set it to 0, so there will be no constraint on the step.
* nthread: the number of parallel threads used to run XGBoost. We will set it to 4.
* learning_rate: the step size shrinkage used in each update to prevent overfitting. We will set it to 0.01.
* subsample: the fraction of observations to be randomly sampled for each tree. We will set it to 0.85.
* seed: this ensures we can generate reproducible results. This can be set to any number.
* scale_pos_weight: this accounts for class imbalance by controlling the balance of positive and negative weights. We will set it to 2, which up-weights the positive class, because our data is highly imbalanced.
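
For reference, the XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value for `scale_pos_weight`. A minimal sketch of that heuristic follows; the label counts are made up for illustration and are not taken from our training data.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive labels, a common starting point."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive labels, ratio is undefined")
    return negatives / positives

# Hypothetical labels: 92 negatives and 8 positives
toy_labels = [0] * 92 + [1] * 8
ratio = suggested_scale_pos_weight(toy_labels)
```

A value of 2 is milder than this ratio would usually suggest for highly imbalanced data, so it can be read as a conservative choice.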

After finding the best values of the tuned hyper-parameters above, we will fit a model with these hyper-parameters on the training data set and then generate predictions for the testing set.
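
The models in this notebook are scored with `eval_metric` set to `auc`. As a reminder of what that number means, the sketch below computes ROC AUC directly from its definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The labels and scores are toy values, and the quadratic pairwise loop is for clarity only, not something to run on the full data set.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

Because AUC depends only on the ranking of the scores, it is not affected by the class imbalance in the way that accuracy is.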

The following sources were used to understand these hyper-parameters and bayesian optimization:
* Bayesian Optimization Sources:
    * https://towardsdatascience.com/hyperparameter-optimization-in-gradient-boosting-packages-with-bayesian-optimization-aaf1b27e7b90
* XGBoost Sources:
    * https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning/notebook#A-Guide-on-XGBoost-hyperparameters-tuning
    * https://coderzcolumn.com/tutorials/machine-learning/bayes-opt-bayesian-optimization-for-hyperparameters-tuning
    * https://www.kaggle.com/code/christianlillelund/house-prices-xgboost-bayesianoptimization/notebook#Submission
    * https://www.kaggle.com/code/willkoehrsen/model-tuning-results-random-vs-bayesian-opt#Implementation
    * https://www.kaggle.com/code/snehithatiger/classification-using-random-search-xgboost
    * https://github.com/dmlc/xgboost/blob/master/demo/guide-python/cross_validation.py

In [4]:
xgb.set_config(verbosity=0)

def xgb_classifier(colsample_bytree, colsample_bylevel, n_estimators, max_depth, reg_alpha,
                   reg_lambda, min_child_weight, gamma):
    params = {"base_score": 0.5,
              "booster": 'gbtree',
              "colsample_bylevel": colsample_bylevel,
              "colsample_bytree": colsample_bytree,
              "objective" : "binary:logistic",
              "eval_metric" : "auc",
              "max_delta_step": 0,
              "max_depth" : int(max_depth),
              "reg_alpha" : reg_alpha,
              "reg_lambda" : reg_lambda,
              "gamma": gamma,
              "nthread" : 4,
              "min_child_weight" : min_child_weight,
              "learning_rate" : 0.01,
              "subsample" : 0.85,
              "seed" : 27,
              "verbosity" : 2,
              "n_estimators": int(n_estimators),
              "tree_method":'gpu_hist',
              "random_state": 0,
              "scale_pos_weight": 2
             }
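    # Note: xgb.cv does not read "n_estimators" from params. The number of boosting
    # rounds is set by num_boost_round (capped at 1000 here) together with early stopping.
    cv_result = xgb.cv(params,
                       xgb.DMatrix(train.drop(columns=['SK_ID_CURR','TARGET']), train['TARGET']),
                       num_boost_round=1000,
                       early_stopping_rounds=20,
                       stratified=True,
                       nfold=3)
    # Mean test AUC across the 3 folds at the last row of the CV history
    return cv_result['test-auc-mean'].iloc[-1]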
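
The call above passes `early_stopping_rounds=20`, so boosting stops once the cross-validated AUC has failed to improve for 20 consecutive rounds. The sketch below shows the general patience rule on a toy score history, with a patience of 3 so the example stays short. It is an illustration of the idea, not XGBoost's exact implementation, and the scores are invented.

```python
def best_round_with_patience(scores, patience):
    """Return (best_index, best_score) as seen by a patience-based stopping rule.

    Iteration stops once `patience` consecutive rounds fail to beat the best score.
    """
    best_index, best_score = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score > best_score:
            best_index, best_score = i, score
        elif i - best_index >= patience:
            break  # no improvement for `patience` rounds
    return best_index, best_score

# Hypothetical per-round validation AUC values
toy_auc = [0.70, 0.74, 0.76, 0.77, 0.768, 0.769, 0.765, 0.80]
best_index, best_score = best_round_with_patience(toy_auc, patience=3)
```

Note that the final value of 0.80 is never reached: stopping is greedy, which is the trade-off for not training all rounds.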

In [5]:
xgbBO = BayesianOptimization(xgb_classifier, {  'colsample_bytree':(0, 1),
                                                'colsample_bylevel': (0,1),
                                                'max_depth': (2, 6),
                                                'reg_alpha': (0.0, 1.0),
                                                'reg_lambda': (0.0, 1.0),
                                                'min_child_weight': (0, 30),
                                                'n_estimators': (100, 2000),
                                                'gamma': (0.0, 0.1)
                                                })

xgbBO.maximize(n_iter=10, init_points=2)

|   iter    |  target   | colsam... | colsam... |   gamma   | max_depth | min_ch... | n_esti... | reg_alpha | reg_la... |
-------------------------------------------------------------------------------------------------------------------------
Parameters: { "n_estimators" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.



In [6]:
result = np.array([i['target'] for i in xgbBO.res])
xgbBO.res[result.argmax()]['params']

{'colsample_bylevel': 0.7340040345842987,
 'colsample_bytree': 0.7764728602209735,
 'gamma': 0.07087099958722323,
 'max_depth': 5.3139635424804466,
 'min_child_weight': 13.384729804665435,
 'n_estimators': 779.5130010546751,
 'reg_alpha': 0.8643051625067838,
 'reg_lambda': 0.016702462769826898}

In [7]:
result

array([0.76643033, 0.77330533, 0.732362  , 0.78021933, 0.78096233,
       0.78466367, 0.78483533, 0.76604533, 0.783836  , 0.78510933,
       0.78307567, 0.78088767])

## Build XGBoost Model using Best Hyperparameters from Bayesian Optimization

In [8]:
colsample_bytree = xgbBO.max["params"]["colsample_bytree"]
colsample_bylevel = xgbBO.max["params"]["colsample_bylevel"]
max_depth = int(xgbBO.max["params"]["max_depth"])
reg_alpha = xgbBO.max["params"]["reg_alpha"]
reg_lambda = xgbBO.max["params"]["reg_lambda"]
min_child_weight = xgbBO.max["params"]["min_child_weight"]
n_estimators = xgbBO.max["params"]["n_estimators"]
gamma = xgbBO.max["params"]["gamma"]
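
Because the optimizer searches continuous ranges, integer-valued hyper-parameters such as `max_depth` and `n_estimators` come back as floats and have to be cast before they are passed to the classifier. A small helper like the hypothetical `cast_integer_params` below keeps that casting in one place. The input values are the best parameters reported above.

```python
INTEGER_PARAMS = ("max_depth", "n_estimators")

def cast_integer_params(params, integer_keys=INTEGER_PARAMS):
    """Truncate integer-valued hyper-parameters; leave the rest unchanged."""
    return {key: int(value) if key in integer_keys else value
            for key, value in params.items()}

best_params = {
    "colsample_bylevel": 0.7340040345842987,
    "colsample_bytree": 0.7764728602209735,
    "gamma": 0.07087099958722323,
    "max_depth": 5.3139635424804466,
    "min_child_weight": 13.384729804665435,
    "n_estimators": 779.5130010546751,
    "reg_alpha": 0.8643051625067838,
    "reg_lambda": 0.016702462769826898,
}
model_params = cast_integer_params(best_params)
```

`int()` truncates rather than rounds, so 779.51 becomes 779, which matches what the cells in this notebook do.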

In [9]:
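# Note: `is_unbalance` is a LightGBM parameter and `nthreads` is not an XGBoost
# parameter (the sklearn wrapper uses `n_jobs`), so XGBoost ignores both. As a result
# this model is fit with the default scale_pos_weight=1, not the value of 2 used during tuning.
clf = xgb.XGBClassifier(booster= 'gbtree',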
    objective= "binary:logistic",
    eval_metric = "auc", 
    is_unbalance= True,
    nthreads = 4,
    learning_rate = 0.01,
    subsample = 0.85,
    seed = 27,
    verbosity = 2,
    tree_method='gpu_hist',
    gamma= gamma,
    max_depth= max_depth,
    min_child_weight= min_child_weight,
    n_estimators= int(n_estimators),
    reg_alpha= reg_alpha,
    reg_lambda= reg_lambda,
    colsample_bytree = colsample_bytree,
    colsample_bylevel = colsample_bylevel) 

clf.fit(train.drop(columns=['SK_ID_CURR','TARGET']), train['TARGET'], eval_metric= 'auc',verbose=2)



Parameters: { "is_unbalance", "nthreads" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=0.7340040345842987, colsample_bynode=1,
              colsample_bytree=0.7764728602209735, enable_categorical=False,
              eval_metric='auc', gamma=0.07087099958722323, gpu_id=0,
              importance_type=None, interaction_constraints='',
              is_unbalance=True, learning_rate=0.01, max_delta_step=0,
              max_depth=5, min_child_weight=13.384729804665435, missing=nan,
              monotone_constraints='()', n_estimators=779, n_jobs=2, nthreads=4,
              num_parallel_tree=1, predictor='auto', random_state=27,
              reg_alpha=0.8643051625067838, reg_lambda=0.016702462769826898,
              scale_pos_weight=1, seed=27, subsample=0.85,
              tree_method='gpu_hist', ...)

In [10]:
del train
gc.collect()

132

## Evaluate XGBoost Performance on Testing Data

In [11]:
test = pd.read_csv('/kaggle/input/msba-6420-predictive-analytics-project/Alldata_v3/test.csv')
test = test.rename(columns = lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

In [12]:
prediction = clf.predict_proba(test.drop(columns=['SK_ID_CURR']))
result = pd.DataFrame({'SK_ID_CURR':test['SK_ID_CURR'],
              'TARGET':pd.DataFrame(prediction)[1]})

result.to_csv("Result_XGB.csv",index=False)

In [13]:
del test
gc.collect()

42

# Feature Selection: Find Top Features from XGBoost Model

Now we will find the best features from the XGBoost model with the best hyper-parameters. When calculating the importance of the features, we will use the metric 'weight', which shows the number of times the feature is used to split data.

After finding the most important features, we will re-run the model on only the top 400 of them to see whether this improves our model's performance.
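
To illustrate what the 'weight' metric counts and how a top-k cut works, here is a minimal sketch on a made-up model. Each inner list stands for the features used at the split nodes of one tree. The feature names and the helpers `weight_importance` and `top_k_features` are hypothetical.

```python
from collections import Counter

def weight_importance(trees):
    """Count how many times each feature is used to split, across all trees."""
    counts = Counter()
    for split_features in trees:
        counts.update(split_features)
    return counts

def top_k_features(importance, k):
    """Return the k feature names with the highest importance."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical model with three small trees
toy_trees = [
    ["feat_a", "feat_b", "feat_a"],
    ["feat_a", "feat_c"],
    ["feat_b", "feat_a"],
]
importance = weight_importance(toy_trees)
top_two = top_k_features(importance, 2)
```

A feature that is never used in a split gets no entry at all, so the ranking can contain fewer features than the training data has columns.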

In [14]:
train = pd.read_csv('/kaggle/input/msba-6420-predictive-analytics-project/Alldata_v3/train.csv')
train = train.rename(columns = lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

In [15]:
gc.collect()

21

In [16]:
# Same configuration as in In [9]; `is_unbalance` and `nthreads` are again ignored by XGBoost.
clf = xgb.XGBClassifier(booster= 'gbtree',
    objective= "binary:logistic",
    eval_metric = "auc", 
    is_unbalance= True,
    nthreads = 4,
    learning_rate = 0.01,
    subsample = 0.85,
    seed = 27,
    verbosity = 2,
    tree_method='gpu_hist',
    gamma= gamma,
    max_depth= max_depth,
    min_child_weight= min_child_weight,
    n_estimators= int(n_estimators),
    reg_alpha= reg_alpha,
    reg_lambda= reg_lambda,
    colsample_bytree = colsample_bytree,
    colsample_bylevel = colsample_bylevel) 

In [17]:
clf.fit(train.drop(columns=['SK_ID_CURR','TARGET']), train['TARGET'], eval_metric= 'auc',verbose=2)

XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=0.7340040345842987, colsample_bynode=1,
              colsample_bytree=0.7764728602209735, enable_categorical=False,
              eval_metric='auc', gamma=0.07087099958722323, gpu_id=0,
              importance_type=None, interaction_constraints='',
              is_unbalance=True, learning_rate=0.01, max_delta_step=0,
              max_depth=5, min_child_weight=13.384729804665435, missing=nan,
              monotone_constraints='()', n_estimators=779, n_jobs=2, nthreads=4,
              num_parallel_tree=1, predictor='auto', random_state=27,
              reg_alpha=0.8643051625067838, reg_lambda=0.016702462769826898,
              scale_pos_weight=1, seed=27, subsample=0.85,
              tree_method='gpu_hist', ...)

In [18]:
importance = clf.get_booster().get_score(importance_type='weight')

In [19]:
keys = list(importance.keys())
values = list(importance.values())

In [20]:
important_features = pd.DataFrame({'features': keys, 'importance': values})

In [21]:
most_important = important_features.sort_values(by='importance',ascending=False).reset_index(drop=True)[:400]['features'].to_list()

In [22]:
train_important_features = train.filter(most_important)

## Re-run XGBoost Model on Best Features

In [23]:
# Same configuration as in In [9]; `is_unbalance` and `nthreads` are again ignored by XGBoost.
clf = xgb.XGBClassifier(booster= 'gbtree',
    objective= "binary:logistic",
    eval_metric = "auc", 
    is_unbalance= True,
    nthreads = 4,
    learning_rate = 0.01,
    subsample = 0.85,
    seed = 27,
    verbosity = 2,
    tree_method='gpu_hist',
    gamma= gamma,
    max_depth= max_depth,
    min_child_weight= min_child_weight,
    n_estimators= int(n_estimators),
    reg_alpha= reg_alpha,
    reg_lambda= reg_lambda,
    colsample_bytree = colsample_bytree,
    colsample_bylevel = colsample_bylevel) 

clf.fit(train_important_features, train['TARGET'], eval_metric= 'auc',verbose=2)

XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=0.7340040345842987, colsample_bynode=1,
              colsample_bytree=0.7764728602209735, enable_categorical=False,
              eval_metric='auc', gamma=0.07087099958722323, gpu_id=0,
              importance_type=None, interaction_constraints='',
              is_unbalance=True, learning_rate=0.01, max_delta_step=0,
              max_depth=5, min_child_weight=13.384729804665435, missing=nan,
              monotone_constraints='()', n_estimators=779, n_jobs=2, nthreads=4,
              num_parallel_tree=1, predictor='auto', random_state=27,
              reg_alpha=0.8643051625067838, reg_lambda=0.016702462769826898,
              scale_pos_weight=1, seed=27, subsample=0.85,
              tree_method='gpu_hist', ...)

In [24]:
del train_important_features
gc.collect()

58

## Evaluate XGBoost Performance on Testing Data

In [25]:
test = pd.read_csv('/kaggle/input/msba-6420-predictive-analytics-project/Alldata_v3/test.csv')
test = test.rename(columns = lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

In [26]:
prediction = clf.predict_proba(test.filter(most_important))
result = pd.DataFrame({'SK_ID_CURR':test['SK_ID_CURR'],
              'TARGET':pd.DataFrame(prediction)[1]})

result.to_csv("Result_XGB_top400.csv",index=False)