# 1. Introduction

In this Google Colab notebook, we apply what we have learned about supervised learning algorithms. The task for this milestone is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal. Further information on the task is available on the competition page by **drivendata.org**: "[Richter's Predictor: Modeling Earthquake Damage](https://www.drivendata.org/competitions/57/nepal-earthquake/)".

The authors of this project are:

- [Raúl Barba Rojas](mailto:Raul.Barba@alu.uclm.es)
- [Diego Guerrero Del Pozo](mailto:Diego.Guerrero@alu.uclm.es)
- [Marvin Schmidt](mailto:Marvin.Schmidt@alu.uclm.es)

# 2. Preparations

## 2.1. Importing libraries

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## 2.2. Importing training data

All the datasets from the DrivenData competition can be accessed in [this GitHub repository](https://github.com/alan-flint/Richter-DrivenData).

In this section, we simply load the datasets as pandas dataframes, so that we can work with them to achieve the desired results.

---

There are two CSV files related to the training dataset:

1. `train_values.csv`: this file contains the values of the different features with which the training will be performed.
2. `train_labels.csv`: this file contains the labels for the output feature that we are trying to predict, which is called `damage_grade`.

Thus, we first need to download the datasets from the GitHub repository and load them as dataframes:

In [2]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_values.csv
df_train_values= pd.read_csv("train_values.csv", index_col = "building_id")
df_train_values

--2022-12-13 09:54:31--  https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_values.csv
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_values.csv [following]
--2022-12-13 09:54:31--  https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_values.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23442727 (22M) [text/plain]
Saving to: ‘train_values.csv’


2022-12-13 09:54:33 (241 MB/s) - ‘train_values.csv’ saved [23442727/23442727]



| building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | roof_type | ... | has_secondary_use_agriculture | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 802906 | 6 | 487 | 12198 | 2 | 30 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28830 | 8 | 900 | 2812 | 2 | 10 | 8 | 7 | o | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 94947 | 21 | 363 | 8973 | 2 | 10 | 5 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 590882 | 22 | 418 | 10694 | 2 | 10 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 201944 | 11 | 131 | 1488 | 3 | 30 | 8 | 9 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688636 | 25 | 1335 | 1621 | 1 | 55 | 6 | 3 | n | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 669485 | 17 | 715 | 2060 | 2 | 0 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 602512 | 17 | 51 | 8163 | 3 | 55 | 6 | 7 | t | r | q | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 151409 | 26 | 39 | 1851 | 2 | 10 | 14 | 6 | t | r | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [3]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_labels.csv
df_train_labels = pd.read_csv("train_labels.csv", index_col = "building_id")
df_train_labels

--2022-12-13 09:54:34--  https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_labels.csv
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_labels.csv [following]
--2022-12-13 09:54:35--  https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_labels.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2330792 (2.2M) [text/plain]
Saving to: ‘train_labels.csv’


2022-12-13 09:54:36 (141 MB/s) - ‘train_labels.csv’ saved [2330792/2330792]



| building_id | damage_grade |
|---|---|
| 802906 | 3 |
| 28830 | 2 |
| 94947 | 3 |
| 590882 | 2 |
| 201944 | 3 |
| ... | ... |
| 688636 | 2 |
| 669485 | 3 |
| 602512 | 3 |
| 151409 | 2 |


Once we have loaded both datasets we need to join them, obtaining the complete training dataset:

In [4]:
df_train_values.join(df_train_labels).to_csv("train_full.csv")
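
As a small illustration of why this join is safe, `DataFrame.join` aligns rows on the index (here `building_id`), not on their position in the file. The sketch below uses made-up toy frames that only borrow a few ids from the output above; the values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv (hypothetical data)
values = pd.DataFrame(
    {"age": [30, 10, 10]},
    index=pd.Index([802906, 28830, 94947], name="building_id"),
)
labels = pd.DataFrame(
    {"damage_grade": [2, 3, 3]},
    index=pd.Index([28830, 94947, 802906], name="building_id"),  # different row order
)

# join() matches rows by index value, so the differing row order does not matter
full = values.join(labels)
print(full)
```

Because the match is done on `building_id`, the labels end up next to the right building even if the two files were sorted differently.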

## 2.3. Importing testing data

In order to evaluate our findings, we also need the testing data, as well as the template for the submission file. These datasets can also be accessed from [this GitHub repository](https://github.com/alan-flint/Richter-DrivenData).

1. `test_values.csv`: this file contains the values of the different features with which the testing will be performed.
2. `submission_format.csv`: this file contains "empty" labels for all the buildings we're trying to predict the damage grade for. It's a template file to be modified later, in which every label for ``damage_grade`` is ``1``.

In [5]:
from sklearn.preprocessing import StandardScaler

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/test_values.csv
test_values = pd.read_csv('test_values.csv', index_col='building_id')
test_values = pd.get_dummies(test_values)

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/submission_format.csv
submission_format = pd.read_csv('submission_format.csv', index_col='building_id')

--2022-12-13 09:54:39--  https://github.com/alan-flint/Richter-DrivenData/raw/master/input/test_values.csv
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/test_values.csv [following]
--2022-12-13 09:54:39--  https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/test_values.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7815385 (7.5M) [text/plain]
Saving to: ‘test_values.csv’


2022-12-13 09:54:40 (241 MB/s) - ‘test_values.csv’ saved [7815385/7815385]

--2022-12-13 09:54:40--  https://github.com/alan-flint/Richter-DrivenDa

# 3. Model implementation

## 3.1. XGBoost

In this section, we try to obtain the best possible solution using XGBoost as a model. A priori, we expect better results than with the baseline, since gradient-boosted trees are a more expressive model that tends to do well on this kind of prediction task. XGBoost has also historically performed well in this competition, so we use it to improve our score.

We will use the features that were selected in the baseline:

In [6]:
df_train_values_subset = pd.get_dummies(df_train_values)

selected_features = ['geo_level_1_id',
                     'geo_level_2_id',
                     'geo_level_3_id',
                     'foundation_type_r',
                     'age',
                     'area_percentage',
                     'height_percentage',
                     'has_superstructure_mud_mortar_stone',
                     'ground_floor_type_v',
                     'other_floor_type_q']

df_train_values_subset = df_train_values_subset[selected_features]
df_train_values_subset

| building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | foundation_type_r | age | area_percentage | height_percentage | has_superstructure_mud_mortar_stone | ground_floor_type_v | other_floor_type_q |
|---|---|---|---|---|---|---|---|---|---|---|
| 802906 | 6 | 487 | 12198 | 1 | 30 | 6 | 5 | 1 | 0 | 1 |
| 28830 | 8 | 900 | 2812 | 1 | 10 | 8 | 7 | 1 | 0 | 1 |
| 94947 | 21 | 363 | 8973 | 1 | 10 | 5 | 5 | 1 | 0 | 0 |
| 590882 | 22 | 418 | 10694 | 1 | 10 | 6 | 5 | 1 | 0 | 0 |
| 201944 | 11 | 131 | 1488 | 1 | 30 | 8 | 9 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688636 | 25 | 1335 | 1621 | 1 | 55 | 6 | 3 | 1 | 0 | 0 |
| 669485 | 17 | 715 | 2060 | 1 | 0 | 6 | 5 | 1 | 0 | 1 |
| 602512 | 17 | 51 | 8163 | 1 | 55 | 6 | 7 | 1 | 0 | 1 |
| 151409 | 26 | 39 | 1851 | 1 | 10 | 14 | 6 | 0 | 1 | 0 |


Now, let us split the initial dataset into train and test datasets.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(df_train_values_subset, df_train_labels.damage_grade, random_state=1)
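
One optional variation, which the cell above does not use, is to pass `stratify` to `train_test_split` so that both parts keep the same class proportions. The following is only a sketch on toy labels; the 10% / 60% / 30% class shares are made up for the example and are not the real distribution of `damage_grade`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical shares): 10% grade 1, 60% grade 2, 30% grade 3
y = np.array([1] * 100 + [2] * 600 + [3] * 300)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

train_share = np.bincount(y_tr)[1:] / len(y_tr)
test_share = np.bincount(y_te)[1:] / len(y_te)
print("train shares:", train_share)
print("test shares: ", test_share)
```

With imbalanced labels, this avoids a hold-out set in which the rarest class happens to be under-represented.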

### 3.1.1. XGBoost without hyperparameter tuning

A first approach is to simply train the model and check the result, without going into hyperparameter optimization yet:

In [8]:
import time

from xgboost import XGBClassifier
from sklearn.metrics import f1_score

# Creates XGBoost Classifier
model = XGBClassifier(
    n_estimators = 2048,
    random_state = 0
)

%time model.fit(X_train, Y_train) # train the model
Y_pred = model.predict(X_test)    # obtain the test predictions

# F1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

CPU times: user 6min 57s, sys: 2 s, total: 6min 59s
Wall time: 7min
F1 score:     0.7290


As we can see, the F1 score is not bad, but let's see how far we can push it with hyperparameter tuning.
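
A side note on the metric: for a single-label multiclass problem like this one, the micro-averaged F1 score is equal to plain accuracy, because every misclassified sample counts once as a false positive and once as a false negative. The sketch below checks this on a handful of made-up labels (the values are hypothetical, not taken from the dataset).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted damage grades
y_true = [1, 2, 2, 3, 3, 3, 2, 1]
y_pred = [1, 2, 3, 3, 2, 3, 2, 2]

micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

# Both numbers are the share of correctly classified samples (5 out of 8 here)
print(micro_f1, accuracy)
```

This means that tools which score classifiers by accuracy by default are measuring the same quantity as the `f1_score(..., average = 'micro')` call used in this notebook.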

### 3.1.2. XGBoost with hyperparameter tuning

Now that we have an initial reference value for XGBoost, we can start tuning its hyperparameters to obtain better results. We will do the hyperparameter tuning in two steps:

1. We will perform random search to obtain possible good values for the hyperparameters.

2. We will perform grid search to obtain the "optimal" values for the hyperparameters.

#### 3.1.2.1. Random Search

In this subsection we apply random search to start tuning the hyperparameters. Let us define the parameters to be optimized. For each of them we give a few candidate values, and the idea is for the algorithm to find "good" combinations among them:

In [9]:
param_dist = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [64, 128, 256, 512, 1024, 2048],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [3, 7, 9, None],
    "reg_lambda": [1, 1.5, 2],
    "gamma": [0, 0.1, 0.3],
}
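
To get a feeling for the size of this search space, the sketch below counts the number of distinct combinations and draws a few random candidates with scikit-learn's `ParameterSampler`, which is the kind of sampling a randomized search relies on. The dictionary is copied from the cell above; `n_iter=5` is only chosen to keep the printout short.

```python
import math
from sklearn.model_selection import ParameterSampler

param_dist = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [64, 128, 256, 512, 1024, 2048],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [3, 7, 9, None],
    "reg_lambda": [1, 1.5, 2],
    "gamma": [0, 0.1, 0.3],
}

# Number of distinct combinations an exhaustive search would have to evaluate
n_combinations = math.prod(len(values) for values in param_dist.values())
print("combinations:", n_combinations)

# Draw a few random candidates from the space (illustrative n_iter)
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for candidate in candidates:
    print(candidate)
```

The space contains 5832 combinations, so a random search with `n_iter = 100` evaluates roughly 1.7% of them. This is why random search is used as a cheap first pass before a narrower grid search.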

Now we perform the random search. We use stratified k-fold cross-validation (`StratifiedKFold`) because the classes are imbalanced, so that every fold keeps the same class proportions:

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold


kfold = StratifiedKFold(n_splits = 5) # To prevent unbalanced data problems

model = XGBClassifier(random_state = 0, tree_method = 'gpu_hist')

random_search_models = RandomizedSearchCV(
    estimator = model, 
    param_distributions = param_dist, 
    n_iter = 100, 
    cv = kfold, 
    random_state = 0,
    n_jobs = -1
)

random_search_models.fit(X_train, Y_train)
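
To illustrate what `StratifiedKFold` does, here is a minimal sketch on toy labels (the 60 / 30 / 10 class counts are hypothetical). Note that stratification only preserves the class ratio inside each fold; it does not rebalance the classes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (hypothetical): 60 / 30 / 10 samples per class
y = np.array([1] * 60 + [2] * 30 + [3] * 10)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

kfold = StratifiedKFold(n_splits=5)

# Count how many samples of each class land in every validation fold
fold_counts = [
    np.bincount(y[val_idx], minlength=4)[1:].tolist()
    for _, val_idx in kfold.split(X, y)
]
print(fold_counts)
```

Each validation fold should contain 12, 6 and 2 samples of the three classes, that is, the same 60% / 30% / 10% ratio as the full toy dataset.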

Let us define a function to show a report of the results. It was not developed by us; the author can be found at this [link](https://colab.research.google.com/drive/1qk_2pqwj69Xrnj9_5i-M32tD8BceLCPE?usp=sharing#scrollTo=zEAgBk9PViSx):

In [None]:
def report(results, n_top=3): 
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

We can obtain a report to see the best models and parameters:

In [None]:
report(random_search_models.cv_results_, n_top = 5)

In [None]:
random_search_models.best_params_

Now we can take the best model and use it to obtain predictions:

In [None]:
best_random_model = random_search_models.best_estimator_ # Best model, already refit on X_train (refit=True is the default)

best_random_model.fit(X_train, Y_train) # Redundant given the refit above, kept for explicitness
Y_pred = best_random_model.predict(X_test) # Test predictions

# f1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

Before running grid search to refine the model further, we can use the "best" model obtained with random search to see how it scores in the competition:

In [None]:
# Apply feature reduction
test_values_subset = test_values[selected_features]

# Obtain the predictions
predictions = best_random_model.predict(test_values_subset)

# Create the submission file
xgboost_submission = pd.DataFrame(data=predictions,
                             columns=submission_format.columns, # only one column: 'damage_grade' 
                             index=submission_format.index)
xgboost_submission.to_csv('xgboost_submission_baseline.csv')
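
Before uploading, it can be worth checking that the file matches the template. The helper below, `check_submission`, is a hypothetical function written for this example (it is not part of the competition tooling), and the toy ids and grades are made up. It assumes that the only valid grades are 1, 2 and 3.

```python
import pandas as pd

def check_submission(submission, template):
    """Hypothetical helper: basic sanity checks before uploading a submission."""
    assert submission.index.equals(template.index), "building_id index differs"
    assert list(submission.columns) == list(template.columns), "columns differ"
    # Assumption: damage grades are restricted to 1, 2 and 3
    assert submission["damage_grade"].isin([1, 2, 3]).all(), "unexpected grade"
    return True

# Toy template and predictions (hypothetical ids and values)
template = pd.DataFrame(
    {"damage_grade": [1, 1, 1]},
    index=pd.Index([11, 22, 33], name="building_id"),
)
submission = pd.DataFrame(
    data=[2, 3, 2],
    columns=template.columns,
    index=template.index,
)

print(check_submission(submission, template))
```

In the notebook, the same call could be made with `xgboost_submission` and `submission_format` right before `to_csv`.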

After submitting the CSV file, we obtained a score of `0.7342`, which corresponds to rank `#735` in the whole competition. That is a good spot, but it can be improved, so let us use grid search to improve this result.

However, there is something we must pay attention to. We trained the model on the GPU by setting `tree_method = 'gpu_hist'`. This method builds histograms of the feature values, so it finds the splits approximately in exchange for faster execution. Thus, the results may be slightly better with a non-approximate tree method or simply without a GPU. Let's compare the results:

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

model = XGBClassifier(
    random_state = 0,
    subsample = 0.7,
    reg_lambda =  1.5,
    n_estimators = 256,
    max_depth = 9,
    learning_rate = 0.1,
    gamma = 0.1,
    colsample_bytree = 0.8
)

model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)    # obtain the test predictions

# F1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

Now let's generate the submission file:

In [None]:
# apply feature reduction
test_values_subset = test_values[selected_features]

# obtain the predictions
predictions = model.predict(test_values_subset)

# create the submission file
xgboost_submission_v2 = pd.DataFrame(data=predictions,
                             columns=submission_format.columns, # only one column: 'damage_grade' 
                             index=submission_format.index)
xgboost_submission_v2.to_csv('xgboost_submission_v2.csv')

After this change, we obtained a score of `0.7364`, which corresponds to rank `#695`.

We can make small variations to try to achieve better results (this will be done properly later with grid search), for instance by increasing the number of estimators:

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

model = XGBClassifier(
    random_state = 0,
    subsample = 0.7,
    reg_lambda =  1.5,
    n_estimators = 475,
    max_depth = 9,
    learning_rate = 0.1,
    gamma = 0.1,
    colsample_bytree = 0.8,
)

model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)    # obtain the test predictions

# F1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

Let's obtain the CSV file to submit:

In [None]:
# apply feature reduction
test_values_subset = test_values[selected_features]

# obtain the predictions
predictions = model.predict(test_values_subset)

# create the submission file
xgboost_submission_v3 = pd.DataFrame(data=predictions,
                             columns=submission_format.columns, # only one column: 'damage_grade' 
                             index=submission_format.index)
xgboost_submission_v3.to_csv('xgboost_submission_v3.csv')

#### 3.1.2.2. Grid Search

Another possible improvement is to apply grid search around the best parameters found so far, in order to complete the hyperparameter tuning.

First, let us define the parameters that we will try (the values come from the top 5 models of the previous run):

- subsample: 0.7, 1.
- reg_lambda: 1, 1.5, 2.
- n_estimators: 475 (we decided to use a value in the middle).
- max_depth: 7, 9.
- learning_rate: 0.1.
- gamma: 0.1, 0.3.
- colsample_bytree: 0.6, 0.8, 1.

In [None]:
param_dist = {"n_estimators": [475],
              "max_depth": [7, 9],
              "subsample": [0.7, 1],
              "reg_lambda" : [1, 1.5, 2],
              "gamma" : [0.1, 0.3],
              "colsample_bytree": [0.6, 0.8, 1],
              "learning_rate" : [0.1] # GridSearchCV expects a list of values, even for a single value
            }
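
As a quick check of how expensive this grid is, the sketch below counts the candidates with scikit-learn's `ParameterGrid`. The dictionary repeats the values listed above under the illustrative name `param_grid`, and `n_folds = 5` matches the 5-fold cross-validation used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [475],
    "max_depth": [7, 9],
    "subsample": [0.7, 1],
    "reg_lambda": [1, 1.5, 2],
    "gamma": [0.1, 0.3],
    "colsample_bytree": [0.6, 0.8, 1],
    "learning_rate": [0.1],  # single values still go in a list
}

grid = ParameterGrid(param_grid)
n_folds = 5

# Every candidate is trained once per fold during cross-validation
print(len(grid), "candidates,", len(grid) * n_folds, "cross-validation fits")
```

The grid has 72 candidates, which means 360 model fits with 5 folds, plus one final refit of the best candidate on the whole training split.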

Now, we can apply the grid search:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

kfold = StratifiedKFold(n_splits = 5) # To prevent unbalanced data problems

model = XGBClassifier(random_state = 0, tree_method = 'gpu_hist')

grid_search_models = GridSearchCV(
    estimator = model, 
    param_grid= param_dist, 
    cv = kfold,
    n_jobs = -1
)

# Fit the grid search model
grid_search_models.fit(X_train, Y_train)

We can show the report of the top 5 models:

In [None]:
report(grid_search_models.cv_results_, n_top = 5)

Let us create the best model obtained by grid search:

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

model = XGBClassifier(
    random_state = 0,
    colsample_bytree = 0.6,
    gamma = 0.1,
    max_depth = 9,
    n_estimators = 475,
    reg_lambda = 2,
    subsample = 1,
    learning_rate = 0.1
)

model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)    # obtain the test predictions

# F1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

Let us obtain the competition results:

In [None]:
# apply feature reduction
test_values_subset = test_values[selected_features]

# obtain the predictions
predictions = model.predict(test_values_subset)

# create the submission file
xgboost_submission = pd.DataFrame(data=predictions,
                             columns=submission_format.columns, # only one column: 'damage_grade' 
                             index=submission_format.index)
xgboost_submission.to_csv('xgboost_submission_grid_search.csv')

As a result of the previous code, we obtained a score of `0.7378`, which corresponds to rank `#668`. However, we still believe that it can be improved, so we will try a few more things.

#### 3.1.2.3. Normalizing the values

We also wanted to see whether the same model would give different results with normalized data. We therefore ran the model obtained with grid search on standardized features (zero mean and unit variance, via `StandardScaler`) to obtain another result, since we are trying to reach the highest possible score.

The first step is to normalize the data:

In [None]:
numeric_features = ['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id',
                    'age', 'area_percentage', 'height_percentage']

df_train_values_normalized_subset = df_train_values_subset.copy()
test_values_normalized = test_values.copy()

# Fit the scaler on the training data only and reuse its statistics for the test data,
# so that both datasets are scaled in exactly the same way
scaler = StandardScaler()
df_train_values_normalized_subset[numeric_features] = scaler.fit_transform(df_train_values_normalized_subset[numeric_features])
test_values_normalized[numeric_features] = scaler.transform(test_values_normalized[numeric_features])

Now we obtain the new train and test data:

In [None]:
from sklearn.model_selection import train_test_split

X_normalized_train, X_normalized_test, Y_normalized_train, Y_normalized_test = train_test_split(df_train_values_normalized_subset, df_train_labels.damage_grade, random_state=1)

Now let us obtain the predictions:

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

model = XGBClassifier(
    random_state = 0,
    colsample_bytree = 0.6,
    gamma = 0.1,
    max_depth = 9,
    n_estimators = 475,
    reg_lambda = 2,
    subsample = 1,
    learning_rate = 0.1
)

model.fit(X_normalized_train, Y_normalized_train)

Y_pred = model.predict(X_normalized_test) # Obtain the test predictions

# f1-score
f1 = f1_score(Y_normalized_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

Now, we can obtain the results in the competition:

In [None]:
# apply feature reduction
test_values_subset = test_values_normalized[selected_features]

# obtain the predictions
predictions = model.predict(test_values_subset)

# create the submission file
xgboost_submission = pd.DataFrame(data=predictions,
                             columns=submission_format.columns, # only one column: 'damage_grade' 
                             index=submission_format.index)
xgboost_submission.to_csv('xgboost_submission_grid_search_normalized.csv')
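
As a side note on what to expect from this experiment: tree-based models split on thresholds, so rescaling a feature with a monotonic transformation such as standardization should not change which samples end up on each side of a split. The sketch below illustrates this with scikit-learn's `DecisionTreeClassifier` as a lightweight stand-in (it is not XGBoost), on randomly generated toy data that is purely hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: three integer-valued features and a binary target
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 3)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Standardize with a single scaler fitted on the data
X_scaled = StandardScaler().fit_transform(X)

# Same tree settings, once on the raw and once on the standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

same_predictions = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("identical predictions:", same_predictions)
```

If the same holds for the boosted trees used here, large gains from normalization alone are unlikely, as long as train and test data are scaled with the same statistics.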

#### 3.1.2.4. Bayes Search

Another way of tuning the hyperparameters is Bayesian search, which is supposed to be faster than the previous methods and could also lead to better hyperparameter values.

Let's install the necessary libraries:

In [None]:
!pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.callbacks import DeadlineStopper, DeltaYStopper
from skopt.space import Real, Categorical, Integer

Let us define the hyperparameters to be optimized:

In [None]:
search_spaces = {'learning_rate': Real(0.01, 1.0, 'uniform'),
                 'max_depth': Integer(2, 12),
                 'subsample': Real(0.1, 1.0, 'uniform'),
                 'colsample_bytree': Real(0.1, 1.0, 'uniform'), # subsample ratio of columns by tree
                 'reg_lambda': Real(1e-9, 100., 'uniform'), # L2 regularization
                 'reg_alpha': Real(1e-9, 100., 'uniform'), # L1 regularization
                 'n_estimators': Integer(64, 4096),
                 'gamma' : Real(0, 100, 'uniform'),
                 'min_child_weight' : Integer(0, 100),
   }

As can be seen above, we defined wide ranges for the parameters, because the grid search did not perform quite as well as we expected. Thus, we want to find the hyperparameters from scratch, making use of Bayes Search.
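
One consequence of such wide ranges is worth illustrating. With a `'uniform'` prior over `[1e-9, 100]`, almost all sampled values are larger than 1, so small regularization values are hardly ever tried. The sketch below compares this with a log-uniform alternative using plain NumPy. It is only an illustration of the two sampling schemes, not of scikit-optimize internals, although `Real` does accept `'log-uniform'` as a prior.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high, n = 1e-9, 100.0, 100_000

# Uniform prior, as used in the search space above
uniform_samples = rng.uniform(low, high, size=n)

# Log-uniform alternative: sample the exponent uniformly instead of the value
log_uniform_samples = 10 ** rng.uniform(np.log10(low), np.log10(high), size=n)

share_uniform_below_1 = np.mean(uniform_samples < 1.0)
share_log_uniform_below_1 = np.mean(log_uniform_samples < 1.0)

print("uniform, share below 1:    ", share_uniform_below_1)
print("log-uniform, share below 1:", share_log_uniform_below_1)
```

Only about 1% of the uniform samples fall below 1, compared with roughly 80% of the log-uniform ones. For parameters such as `reg_lambda` and `reg_alpha`, which span many orders of magnitude, a log-uniform prior may therefore explore the range more evenly.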

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

kfold = StratifiedKFold(n_splits = 6)                              # To prevent unbalanced data problems

model = XGBClassifier(
    random_state = 0,
    tree_method = 'gpu_hist'
)

bayes_search_models = BayesSearchCV(
  estimator = model,
  search_spaces = search_spaces,
  n_jobs = -1,
  cv = kfold,
  n_iter = 70,
  scoring = 'f1_micro'
)

bayes_search_models.fit(X_train, Y_train)

## 3.2. Pre-evaluation

Let's obtain the report:

In [None]:
report(bayes_search_models.cv_results_, n_top = 5)

Let us try the best model obtained with Bayes Search:

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

model = XGBClassifier(
    random_state = 0,
    colsample_bytree = 0.5213550800955983,
    gamma = 9.586621746368602,
    max_depth = 12,
    min_child_weight = 44,
    n_estimators = 510,
    reg_lambda = 43.000232118939216,
    reg_alpha = 1e-09,
    subsample = 0.8367536123980525,
    learning_rate = 0.07311758457326237
)

model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)    # obtain the test predictions

# F1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

We can see that this model is actually worse than the previous ones, most likely because the search did not finish executing (it did not converge).