# XGBoost - Individual benchmark

XGBoost (Extreme Gradient Boosting) is a machine learning algorithm based on gradient boosting, designed to be fast and to scale to large datasets.  
It builds a sequence of decision trees in which each new tree corrects the errors made by the previous ones.  
XGBoost combines regularization, native handling of missing values, and parallelized training, which makes it a strong performer on complex tabular problems.  
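
To make the "each new tree corrects the errors of the previous ones" idea concrete, here is a minimal sketch of boosting for squared error, using one-split trees (stumps) on a single feature. It is a toy illustration under simplifying assumptions, not XGBoost's actual implementation: there is no regularization, no second-order gradient information, and no histogram-based split search. The function names and the toy data are hypothetical.

```python
# Toy gradient boosting for squared error with depth-1 trees (stumps).
# Illustrative only: XGBoost's real training procedure is more elaborate.

def fit_stump(x, residuals):
    """Find the split on x that best fits the residuals (least squares)."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    prediction = [sum(y) / len(y)] * len(y)  # start from the mean of y
    mse_history = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        prediction = [pi + learning_rate * predict_stump(stump, xi)
                      for xi, pi in zip(x, prediction)]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y)
        mse_history.append(mse)
    return prediction, mse_history


# Hypothetical toy data: a noisy step function.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

prediction, mse_history = boost(x, y)
print(f"MSE after round 1: {mse_history[0]:.4f}")
print(f"MSE after round {len(mse_history)}: {mse_history[-1]:.4f}")
```

The training error never increases from one round to the next, because each stump is fitted to what the current ensemble still gets wrong and is added with a small step (`learning_rate`).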

In [1]:
import os
import sys
from pathlib import Path


# Resolve project paths relative to the notebook's working directory.
CWD = Path.cwd()
PROJECT_ROOT = CWD.parent.parent
DB_PATH = (PROJECT_ROOT / "mlflow.db").resolve()
ARTIFACT_ROOT = (PROJECT_ROOT / "artifacts").resolve()
ARTIFACT_ROOT.mkdir(parents=True, exist_ok=True)

# Point MLflow at the local SQLite store and the local artifact directory.
os.environ["MLFLOW_TRACKING_URI"] = f"sqlite:///{DB_PATH.as_posix()}"
os.environ["MLFLOW_ARTIFACT_URI"] = ARTIFACT_ROOT.as_uri()

# Make the project's `src` package importable from the notebook.
sys.path.append(str(PROJECT_ROOT))

import mlflow

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

print("CWD =", CWD)
print("Tracking URI =", mlflow.get_tracking_uri())
print("Artifacts root (env) =", os.environ["MLFLOW_ARTIFACT_URI"])


CWD = c:\Users\yoann\Documents\open classrooms\projet 8\livrables\pret a dépenser\notebooks\02_benchmark
Tracking URI = sqlite:///C:/Users/yoann/Documents/open classrooms/projet 8/livrables/pret a dépenser/mlflow.db
Artifacts root (env) = file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts
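
The two URIs printed above look different because they are built differently: the tracking URI is a plain string concatenation around `as_posix()`, while the artifact URI comes from `as_uri()`, which percent-encodes spaces and non-ASCII characters. The sketch below reproduces that difference on a hypothetical path, using `PureWindowsPath` so that it behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical project location containing a space, as in the notebook's path.
project_root = PureWindowsPath(r"C:\Users\someone\my project")

db_path = project_root / "mlflow.db"
artifact_root = project_root / "artifacts"

# Plain string formatting: forward slashes, but no percent-encoding.
tracking_uri = f"sqlite:///{db_path.as_posix()}"

# as_uri() builds a file: URI and percent-encodes the space as %20.
artifact_uri = artifact_root.as_uri()

print(tracking_uri)   # sqlite:///C:/Users/someone/my project/mlflow.db
print(artifact_uri)   # file:///C:/Users/someone/my%20project/artifacts
```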


In [2]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier

from src.modeling.train import train_with_cv
from src.modeling.prepare_for_model import prepare_application_for_model
from src.tracking import mlflow_tracking

EXPERIMENT_NAME = "home_credit_benchmarking"
exp_id = mlflow_tracking.get_or_create_experiment(EXPERIMENT_NAME, ARTIFACT_ROOT)
mlflow.set_experiment(EXPERIMENT_NAME)


<Experiment: artifact_location='file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts', creation_time=1771138249350, experiment_id='1', last_update_time=1771138249350, lifecycle_stage='active', name='home_credit_benchmarking', tags={}>

In [3]:
DATA_PATH = PROJECT_ROOT / "data" / "processed" / "train_split.csv"
df = pd.read_csv(DATA_PATH)

X_boost, y = prepare_application_for_model(df, model_type="boosting")

print("X_boost:", X_boost.shape, "| y:", y.shape)

X_boost: (215257, 1656) | y: (215257,)


In [4]:
from src.modeling.prepare_xgboost import prepare_xgb

X_xgb = prepare_xgb(X_boost)

print("X_xgb unique dtypes:", X_xgb.dtypes.unique())

X_xgb unique dtypes: [dtype('float32')]


In [5]:
params_xgb = {
    "n_estimators": 300,
    "learning_rate": 0.05,
    "max_depth": 4,
    "subsample": 0.8,
    "colsample_bytree": 0.8,

    "objective": "binary:logistic",
    "eval_metric": "auc",

    # class imbalance
    "scale_pos_weight": 10,

    # fast and stable
    "tree_method": "hist",

    "random_state": 42,
    "n_jobs": -1,
}

model_xgb = XGBClassifier(**params_xgb)
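
The value `scale_pos_weight=10` is hard-coded here. The XGBoost documentation suggests the ratio of negative to positive examples as a typical starting point for this parameter. The helper below is a minimal sketch of that calculation; its name and the toy labels are hypothetical, and it assumes binary labels encoded as 0 and 1.

```python
def suggested_scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for 0/1 labels."""
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("labels contain no positive example")
    return negatives / positives


# Hypothetical toy labels: 92 negatives and 8 positives.
toy_labels = [0] * 92 + [1] * 8
ratio = suggested_scale_pos_weight(toy_labels)
print(ratio)  # 11.5
```

In this notebook the same function could be called on `y` to check how close the chosen value of 10 is to the observed class ratio.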


In [6]:
results = train_with_cv(
    model=model_xgb,
    model_name="XGBoost",
    X=X_xgb,
    y=y,
    model_type="boosting",

    threshold=0.5,
    n_splits=5,
    random_state=42,
    log_fold_metrics=True,

    # business
    cost_fn=10,
    cost_fp=1,
    fbeta_beta=3,

    use_lgb_categorical=False,
    fit_params={}
)

results


===== Entraînement (benchmark CV) : XGBoost =====

--- Fold 1/5 ---
   → AUC=0.7894 | Recall@0.50=0.6605 | F1@0.50=0.3133 | F3@0.50=0.5407 | Cost=20685
   → TN=30691 FP=8885 FN=1180 TP=2296 | fit=123.73s | pred=0.49s

--- Fold 2/5 ---
   → AUC=0.7767 | Recall@0.50=0.6300 | F1@0.50=0.2970 | F3@0.50=0.5147 | Cost=21939
   → TN=30497 FP=9079 FN=1286 TP=2190 | fit=115.38s | pred=0.45s

--- Fold 3/5 ---
   → AUC=0.7826 | Recall@0.50=0.6426 | F1@0.50=0.3045 | F3@0.50=0.5258 | Cost=21381
   → TN=30615 FP=8961 FN=1242 TP=2233 | fit=97.10s | pred=0.65s

--- Fold 4/5 ---
   → AUC=0.7864 | Recall@0.50=0.6558 | F1@0.50=0.3072 | F3@0.50=0.5345 | Cost=21042
   → TN=30494 FP=9082 FN=1196 TP=2279 | fit=111.45s | pred=0.50s

--- Fold 5/5 ---
   → AUC=0.7757 | Recall@0.50=0.6247 | F1@0.50=0.2962 | F3@0.50=0.5113 | Cost=22055
   → TN=30561 FP=9015 FN=1304 TP=2171 | fit=100.53s | pred=0.73s

===== Résultats finaux (CV) =====
AUC                         : 0.7822 ± 0.0053
Recall@0.50              : 0.6427 

{'model': 'XGBoost',
 'auc_mean': 0.7821614968951536,
 'auc_std': 0.005321885960992504,
 'recall_mean_fixed_threshold': 0.6427458668278266,
 'recall_std_fixed_threshold': 0.013949161314424583,
 'precision_mean_fixed_threshold': 0.1987698203320216,
 'precision_std_fixed_threshold': 0.004212888191168557,
 'f1_mean_fixed_threshold': 0.30363687194771427,
 'f1_std_fixed_threshold': 0.006421266085381616,
 'fbeta_3_mean_fixed_threshold': 0.5253891385498691,
 'fbeta_3_std_fixed_threshold': 0.011237807763337698,
 'business_cost_mean_fixed_threshold': 21420.4,
 'business_cost_std_fixed_threshold': 521.0019577698341,
 'threshold': 0.5,
 'time_sec': 557.525759935379}
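
The reported metrics can be re-derived from the confusion-matrix counts in the fold logs, which is a useful sanity check. The sketch below assumes the business cost is `cost_fn * FN + cost_fp * FP` and that the reported standard deviations are population standard deviations. Both assumptions are inferred from the logged numbers, since the implementation of `train_with_cv` is not shown here. The function name `fold_metrics` is hypothetical.

```python
from statistics import mean, pstdev

# (TN, FP, FN, TP) copied from the five fold logs above.
FOLDS = [
    (30691, 8885, 1180, 2296),
    (30497, 9079, 1286, 2190),
    (30615, 8961, 1242, 2233),
    (30494, 9082, 1196, 2279),
    (30561, 9015, 1304, 2171),
]

# Same values as the cost_fn, cost_fp and fbeta_beta arguments in cell 6.
COST_FN, COST_FP, BETA = 10, 1, 3


def fold_metrics(tn, fp, fn, tp, beta=BETA):
    """Recall, precision, F-beta and business cost for one fold."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    cost = COST_FN * fn + COST_FP * fp  # assumed cost formula
    return {"recall": recall, "precision": precision,
            "fbeta": fbeta, "cost": cost}


metrics = [fold_metrics(*fold) for fold in FOLDS]
costs = [m["cost"] for m in metrics]
recalls = [m["recall"] for m in metrics]

print(costs)  # [20685, 21939, 21381, 21042, 22055]
print(f"cost   : {mean(costs):.1f} +/- {pstdev(costs):.1f}")
print(f"recall : {mean(recalls):.4f}")
print(f"F3 (fold 1): {metrics[0]['fbeta']:.4f}")
```

With `beta=3`, recall counts nine times as much as precision in the F-beta score, which is consistent with the cost setting where a false negative is ten times as expensive as a false positive.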