# Optimizing an XGBoost model

This notebook collects the results of the search for the best classification model using XGBoost (eXtreme Gradient Boosting). XGBoost is a boosting method, so the idea is to build a robust model from several "weak" models. It is called *extreme* gradient boosting because it is generally considerably faster than other gradient boosting implementations and tends to perform well on structured data.

To find the best possible model, we search for the best hyperparameters for:

* The type of booster to use.
* The step size (learning rate) of the boosting method.
* The minimum loss reduction required to split a branch further, when the booster is tree-based.
* The maximum depth of the trees, when the booster is tree-based.

### Data preparation

In [2]:
# Data structures
import pandas as pd
import numpy as np

# Hyperparameter optimization library
import optuna

# Accuracy
from sklearn.metrics import accuracy_score

# Model
import xgboost as xgb
from xgboost import XGBClassifier

# Load the data
from data_and_submissions import *

# Helper methods for training with CV
from train_cv_methods import *

We use the following data partition:

* 60% train $\sim$ 50 samples
* 20% validation $\sim$ 18 samples (defined when cross-validation is applied during fitting)
* 20% test $\sim$ 18 samples
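The approximate counts above follow from simple arithmetic. The sketch below is illustrative only: it assumes a total of 86 samples (the 68 train rows plus 18 test rows reported in this notebook), a test share that is rounded up, and 4-fold cross-validation. The fold count used inside `train_GridSearchCV` is not shown here, so `n_folds = 4` is an assumption.

```python
import math

n_total = 86          # assumption: 68 train rows + 18 test rows
test_fraction = 0.20  # share held out for the final test
n_folds = 4           # assumption: 4-fold cross-validation

n_test = math.ceil(n_total * test_fraction)  # held-out test rows
n_train_val = n_total - n_test               # rows passed to the CV search
n_val = n_train_val // n_folds               # validation rows per fold
n_train = n_train_val - n_val                # training rows per fold

print(n_test, n_train_val, n_val, n_train)
```

Under these assumptions each fold trains on 51 rows and validates on 17, which is close to the rounded figures in the list.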

In [2]:
X_train, X_test, y_train, y_test, test_kaggle = load_data()
print("Size of the train dataset:", X_train.shape)
print("Size of the test dataset:", X_test.shape)

Size of the train dataset: (68, 410)
Size of the test dataset: (18, 410)


### Model

Hyperparameter search with ``GridSearchCV`` from ``sklearn``:

In [3]:
import warnings
warnings.filterwarnings("ignore") # Suppress version warnings
xgb.set_config(verbosity=0)

# Define the model and the hyperparameter grid
model_XGB = XGBClassifier(eval_metric="logloss", random_state=0, use_label_encoder=False)
param_grid_XGB = {
    "booster": ["gbtree", "gblinear", "dart"],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    "gamma": [0.0001, 0.001, 0.01, 0.1, 1],
    "max_depth": np.arange(0, 21, 2) # 0 = no limit
}

In [5]:
# Run the grid search with cross-validation
cv_results_XGB = train_GridSearchCV(model_XGB, param_grid_XGB, X_train, X_test, y_train, y_test)
top_acc = top_acc_GridSearchCV(cv_results_XGB["mean_test_score"])
models_same_acc_GridSearchCV(cv_results_XGB, top_acc)

[{'booster': 'gblinear', 'gamma': 0.1, 'learning_rate': 1, 'max_depth': 0}]
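The helpers `top_acc_GridSearchCV` and `models_same_acc_GridSearchCV` come from the local module `train_cv_methods`, whose source is not shown. As a rough sketch of what such helpers might do, the hypothetical functions below pick the best mean cross-validation score and list every parameter set that reaches it. They assume a dictionary shaped like scikit-learn's `cv_results_`, with `"params"` and `"mean_test_score"` keys. The toy scores are invented for illustration.

```python
def top_score(mean_test_scores):
    """Return the highest mean cross-validation score."""
    return max(mean_test_scores)


def params_with_score(cv_results, score, tol=1e-12):
    """Return every parameter set whose mean CV score equals `score`."""
    return [
        params
        for params, s in zip(cv_results["params"], cv_results["mean_test_score"])
        if abs(s - score) <= tol
    ]


# Toy stand-in for GridSearchCV.cv_results_ (values are made up)
toy_cv_results = {
    "params": [
        {"booster": "gbtree", "learning_rate": 0.1},
        {"booster": "gblinear", "learning_rate": 1},
        {"booster": "dart", "learning_rate": 0.1},
    ],
    "mean_test_score": [0.70, 0.75, 0.75],
}

best = top_score(toy_cv_results["mean_test_score"])
tied = params_with_score(toy_cv_results, best)
print(best, tied)
```

Returning all tied parameter sets, rather than only the first, makes it visible when several configurations are indistinguishable by cross-validation score.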

In [5]:
model_XGB_opt = XGBClassifier(eval_metric="logloss", booster="gblinear", gamma=0.1, learning_rate=1, max_depth=0,
                              random_state=0, use_label_encoder=False)
model_XGB_opt.fit(X_train, y_train)

# Prediction on the test partition
y_pred_XGBoost = model_XGB_opt.predict(X_test)

# Accuracy on the test partition
accuracy = accuracy_score(y_test, y_pred_XGBoost)
print("Accuracy: {:0.2f}%".format(accuracy * 100))

Accuracy: 66.67%


Search with the ``optuna`` library, trying 2 hyperparameter search methods:

* **GridSampler:** equivalent to the previous sklearn grid search. We use it so that the results are comparable.
* **TPE:** the Tree-structured Parzen Estimator, an algorithm for a "smart" hyperparameter search. It should save trials by choosing which combinations to test based on earlier results. In our case we allow it to try 10% of the number of possible combinations.
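The 10% budget can be checked with a short calculation. The sketch below enumerates the grid defined for ``GridSearchCV`` (3 boosters, 5 learning rates, 5 gamma values, 11 depths) and rounds 10% of its size up to a whole number of trials. The variable names are illustrative.

```python
import math
from itertools import product

search_space = {
    "booster": ["gbtree", "gblinear", "dart"],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    "gamma": [0.0001, 0.001, 0.01, 0.1, 1],
    "max_depth": list(range(0, 21, 2)),  # 0, 2, ..., 20 -> 11 values
}

# Count every combination of the four hyperparameters
n_combinations = sum(1 for _ in product(*search_space.values()))

# 10% of the grid, rounded up to a whole number of trials
tpe_budget = math.ceil(n_combinations / 10)

print(n_combinations, tpe_budget)
```

Note that the stop value matters: `range(0, 20, 2)` ends at 18 and yields only 10 depth values, which gives 750 combinations instead of 825.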

In [7]:
def objectiveXGBoost_Grid(trial):
    '''
    Objective function to optimize with a GridSampler.
    The goal is to maximize accuracy, measured here on the test partition.
    '''
    booster = trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"])
    learning_rate = trial.suggest_categorical("learning_rate", [0.0001, 0.001, 0.01, 0.1, 1])
    gamma = trial.suggest_categorical("gamma", [0.0001, 0.001, 0.01, 0.1, 1])
    max_depth = trial.suggest_int("max_depth", 0, 20)
    
    modelXGBoost_optuna = XGBClassifier(eval_metric="logloss", booster=booster, learning_rate=learning_rate, gamma=gamma,
                                        max_depth=max_depth, random_state=0, use_label_encoder=False)
    
    modelXGBoost_optuna.fit(X_train, y_train)

    y_pred_XGBoost_optuna = modelXGBoost_optuna.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred_XGBoost_optuna)
    return accuracy

In [8]:
# Trial with GridSampler
optuna.logging.set_verbosity(optuna.logging.WARNING)

search_space = {"booster": ["gbtree", "gblinear", "dart"], 
                "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
                "gamma": [0.0001, 0.001, 0.01, 0.1, 1],
                "max_depth": range(0, 20, 2)  # 0, 2, ..., 18 (the stop value 20 is excluded)
               }
sampler = optuna.samplers.GridSampler(search_space)
study_Grid = optuna.create_study(direction="maximize", sampler=sampler)
study_Grid.optimize(objectiveXGBoost_Grid)

In [9]:
study_Grid.best_trial

FrozenTrial(number=3, values=[0.8333333333333334], datetime_start=datetime.datetime(2022, 7, 2, 21, 37, 5, 606018), datetime_complete=datetime.datetime(2022, 7, 2, 21, 37, 5, 809093), params={'booster': 'gblinear', 'learning_rate': 0.001, 'gamma': 1, 'max_depth': 0}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear', 'dart')), 'learning_rate': CategoricalDistribution(choices=(0.0001, 0.001, 0.01, 0.1, 1)), 'gamma': CategoricalDistribution(choices=(0.0001, 0.001, 0.01, 0.1, 1)), 'max_depth': IntUniformDistribution(high=20, low=0, step=1)}, user_attrs={}, system_attrs={'search_space': OrderedDict([('booster', ['dart', 'gblinear', 'gbtree']), ('gamma', [0.0001, 0.001, 0.01, 0.1, 1]), ('learning_rate', [0.0001, 0.001, 0.01, 0.1, 1]), ('max_depth', [0, 2, 4, 6, 8, 10, 12, 14, 16, 18])]), 'grid_id': 460}, intermediate_values={}, trial_id=3, state=TrialState.COMPLETE, value=None)

In [6]:
# Define and train the model
modelXGBoost_optuna_Grid = XGBClassifier(eval_metric="logloss", booster="gblinear", learning_rate=0.001, gamma=1,
                                         max_depth=0, random_state=0, use_label_encoder=False)  
modelXGBoost_optuna_Grid.fit(X_train, y_train)

# Prediction on the test partition
y_pred_XGBoost_optuna_Grid = modelXGBoost_optuna_Grid.predict(X_test)

# Accuracy on the test partition
accuracy = accuracy_score(y_test, y_pred_XGBoost_optuna_Grid)
print("Accuracy: {:0.2f}%".format(accuracy * 100))

Accuracy: 83.33%


In [11]:
def objectiveXGBoost_TPE(trial):
    '''
    Objective function to optimize with a TPE sampler.
    The goal is to maximize accuracy, measured here on the test partition.
    '''
    booster = trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"])
    learning_rate = trial.suggest_categorical("learning_rate", [0.0001, 0.001, 0.01, 0.1, 1])
    gamma = trial.suggest_categorical("gamma", [0.0001, 0.001, 0.01, 0.1, 1])
    max_depth = trial.suggest_int("max_depth", 0, 20, 2)
    
    modelXGBoost_optuna = XGBClassifier(eval_metric="logloss", booster=booster, learning_rate=learning_rate, gamma=gamma,
                                        max_depth=max_depth, random_state=0, use_label_encoder=False)
    
    modelXGBoost_optuna.fit(X_train, y_train)

    y_pred_XGBoost_optuna = modelXGBoost_optuna.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred_XGBoost_optuna)
    return accuracy

In [15]:
# Trial with TPE
optuna.logging.set_verbosity(optuna.logging.WARNING)

sampler = optuna.samplers.TPESampler(seed=0)  # Ensure reproducible results
study_TPE = optuna.create_study(direction="maximize", sampler=sampler)
study_TPE.optimize(objectiveXGBoost_TPE, n_trials=83)
# n_trials = (3 x 5 x 5 x 11) * 0.1 = 82.5 ~ 83

In [16]:
study_TPE.best_trial

FrozenTrial(number=46, values=[0.8333333333333334], datetime_start=datetime.datetime(2022, 7, 2, 21, 54, 30, 210936), datetime_complete=datetime.datetime(2022, 7, 2, 21, 54, 30, 460878), params={'booster': 'gblinear', 'learning_rate': 0.001, 'gamma': 0.1, 'max_depth': 14}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear', 'dart')), 'learning_rate': CategoricalDistribution(choices=(0.0001, 0.001, 0.01, 0.1, 1)), 'gamma': CategoricalDistribution(choices=(0.0001, 0.001, 0.01, 0.1, 1)), 'max_depth': IntUniformDistribution(high=20, low=0, step=2)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=46, state=TrialState.COMPLETE, value=None)

In [7]:
# Define and train the model
modelXGBoost_optuna_TPE = XGBClassifier(eval_metric="logloss", booster="gblinear", learning_rate=0.001, gamma=0.1,
                                        max_depth=14, random_state=0, use_label_encoder=False) 
modelXGBoost_optuna_TPE.fit(X_train, y_train)

# Prediction on the test partition
y_pred_XGBoost_optuna_TPE = modelXGBoost_optuna_TPE.predict(X_test)

# Accuracy on the test partition
accuracy = accuracy_score(y_test, y_pred_XGBoost_optuna_TPE)
print("Accuracy: {:0.2f}%".format(accuracy * 100))

Accuracy: 83.33%


Search with ``optuna`` using ``OptunaSearchCV``:

In [18]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Define the model and the search distributions
model_XGB = XGBClassifier(eval_metric="logloss", random_state=0, use_label_encoder=False)
param_grid_XGB = {
    "booster": optuna.distributions.CategoricalDistribution(["gbtree", "gblinear", "dart"]),
    "learning_rate": optuna.distributions.CategoricalDistribution([0.0001, 0.001, 0.01, 0.1, 1]),
    "gamma": optuna.distributions.CategoricalDistribution([0.0001, 0.001, 0.01, 0.1, 1]),
    "max_depth": optuna.distributions.IntUniformDistribution(0, 20, 2) # 0 = no limit
}

optuna_search = optuna.integration.OptunaSearchCV(model_XGB, param_grid_XGB, cv=4, n_trials=792, refit=True, random_state=0)
# Full grid size: 3 x 5 x 5 x 11 = 825 combinations (n_trials is set to 792 above)
optuna_search.fit(X_train, y_train)

OptunaSearchCV(cv=4,
               estimator=XGBClassifier(base_score=None, booster=None,
                                       colsample_bylevel=None,
                                       colsample_bynode=None,
                                       colsample_bytree=None,
                                       enable_categorical=False,
                                       eval_metric='logloss', gamma=None,
                                       gpu_id=None, importance_type=None,
                                       interaction_constraints=None,
                                       learning_rate=None, max_delta_step=None,
                                       max_depth=None, min_child_weight=None,
                                       missing=nan, mo...
                                       validate_parameters=None,
                                       verbosity=None),
               n_trials=792,
               param_distributions={'booster': CategoricalDistribution(cho

In [19]:
top_acc = top_acc_OptunaSearchCV(optuna_search.trials_)
models_same_acc_OptunaSearchCV(optuna_search.trials_, top_acc)

[{'booster': 'gblinear', 'learning_rate': 1, 'gamma': 0.01, 'max_depth': 20},
 {'booster': 'gblinear', 'learning_rate': 1, 'gamma': 0.01, 'max_depth': 20},
 {'booster': 'gblinear', 'learning_rate': 1, 'gamma': 0.01, 'max_depth': 18},
 {'booster': 'gblinear', 'learning_rate': 1, 'gamma': 1, 'max_depth': 18},
 {'booster': 'gblinear', 'learning_rate': 1, 'gamma': 0.01, 'max_depth': 18},
 {'booster': 'gblinear', 'learning_rate': 1, 'gamma': 0.01, 'max_depth': 18},
 {'booster': 'gblinear', 'learning_rate': 1, 'gamma': 0.01, 'max_depth': 18},
 {'booster': 'gblinear', 'learning_rate': 1, 'gamma': 0.01, 'max_depth': 16}]
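Several entries in this list are repeated, and all of them use `gblinear`. To our understanding, `gamma` and `max_depth` are tree-booster parameters that the linear booster does not use, so these entries likely describe the same effective model. The sketch below, with the hypothetical helpers `effective_params` and `unique_models`, drops those parameters for `gblinear` and removes duplicates. The set `TREE_ONLY_PARAMS` encodes that assumption.

```python
TREE_ONLY_PARAMS = {"gamma", "max_depth"}  # assumption: unused by gblinear


def effective_params(params):
    """Drop tree-only parameters when the booster is linear."""
    if params.get("booster") == "gblinear":
        return {k: v for k, v in params.items() if k not in TREE_ONLY_PARAMS}
    return dict(params)


def unique_models(param_sets):
    """Return distinct effective parameter sets, preserving first-seen order."""
    seen = set()
    unique = []
    for params in param_sets:
        eff = effective_params(params)
        key = tuple(sorted(eff.items()))
        if key not in seen:
            seen.add(key)
            unique.append(eff)
    return unique


# A subset of the tied parameter sets reported above
tied = [
    {"booster": "gblinear", "learning_rate": 1, "gamma": 0.01, "max_depth": 20},
    {"booster": "gblinear", "learning_rate": 1, "gamma": 0.01, "max_depth": 18},
    {"booster": "gblinear", "learning_rate": 1, "gamma": 1, "max_depth": 18},
    {"booster": "gblinear", "learning_rate": 1, "gamma": 0.01, "max_depth": 16},
]
print(unique_models(tied))
```

If the assumption holds, the tied entries collapse to a single configuration, so refitting any one of them should be enough.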

In [10]:
optunaCV_opt = XGBClassifier(eval_metric="logloss", booster="gblinear", learning_rate=1, gamma=0.01,
                             max_depth=20, random_state=0, use_label_encoder=False)
optunaCV_opt.fit(X_train, y_train)

# Prediction on the test partition
y_pred_XGB_optuna = optunaCV_opt.predict(X_test)

# Accuracy on the test partition
accuracy = accuracy_score(y_test, y_pred_XGB_optuna)
print("Accuracy: {:0.2f}%".format(accuracy * 100))

Accuracy: 66.67%


We compare the predictions on ``test_kaggle`` of the two models that share the highest accuracy:

In [11]:
y_pred_model1 = modelXGBoost_optuna_Grid.predict(test_kaggle)
y_pred_model2 = modelXGBoost_optuna_TPE.predict(test_kaggle)

In [12]:
results = {"Optuna & GridSampler": y_pred_model1, "Optuna & TPE": y_pred_model2}

results_df = pd.DataFrame(results)    
results_df["All the same"] = results_df.eq(results_df.iloc[:, 0], axis=0).all(1)
results_df[~results_df["All the same"]]

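| | Optuna & GridSampler | Optuna & TPE | All the same |
|---|---|---|---|
| 7890 | 1 | 0 | False |
| 10602 | 0 | 1 | False |
| 19861 | 1 | 0 | False |
| 21314 | 1 | 0 | False |
| 23272 | 1 | 0 | False |
| 27814 | 1 | 0 | False |
| 37473 | 1 | 0 | False |
| 43662 | 1 | 0 | False |
| 46080 | 1 | 0 | False |
| 51295 | 1 | 0 | False |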
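The `All the same` column flags rows where every model predicts the same label as the first column. The same row-wise check can be sketched in plain Python. The predictions below are toy values for illustration, not the notebook's output.

```python
# Toy predictions keyed by model name (values are made up for illustration)
predictions = {
    "Optuna & GridSampler": [1, 0, 1, 1, 0],
    "Optuna & TPE":         [1, 1, 1, 0, 0],
}

columns = list(predictions.values())

# A row is "all the same" when every column agrees with the first one
all_same = [all(v == row[0] for v in row) for row in zip(*columns)]

# Positions where at least one model disagrees
disagreements = [i for i, same in enumerate(all_same) if not same]

print(disagreements)
```

Comparing against the first column extends naturally to more than two models, which is why the notebook's expression uses `.all(1)` across columns.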


The models produce different predictions, so we generate the Kaggle submission for both:

In [13]:
create_submission(y_pred_model1, "opt_XGBoost_GridSampler")

(119748, 2)


In [14]:
create_submission(y_pred_model2, "opt_XGBoost_TPE")

(119748, 2)
