# CatBoost - Individual benchmark

CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex and designed to handle categorical variables efficiently.  
Where many models require manual encoding beforehand (One-Hot, for example), CatBoost consumes categorical columns directly: it encodes them internally with ordered target statistics, which limits target leakage and the overfitting that naive target encoding tends to cause.
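
To make the idea of built-in statistical encoding concrete, here is a simplified sketch of an ordered target statistic: each row's category is replaced by a smoothed mean of the targets of earlier rows that share the same category, so a row never sees its own label. This is an illustration only, not CatBoost's implementation (which, among other things, works over random permutations of the data). The function name, `prior`, and `prior_weight` are assumptions made for the example.

```python
from collections import defaultdict


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each category using only the targets of rows seen earlier."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running row count per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean of the *previous* targets for this category.
        value = (sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight)
        encoded.append(value)
        # The current row's target only becomes visible to later rows.
        sums[cat] += y
        counts[cat] += 1
    return encoded


# Toy data (hypothetical): a two-level category and a binary target.
cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
encoded = ordered_target_stats(cats, ys)
print(encoded)
```

With this toy input the encoding is `[0.5, 0.5, 0.75, 0.5, 0.25]`: the first occurrence of each category falls back to the prior, and later occurrences drift toward the observed target rate.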

CatBoost is valued for its stability, ease of use, and strong performance on tabular data.  
It handles missing values in numerical features natively, is designed to limit overfitting, and performs well with little tuning, which makes it a robust and reliable model for benchmarks such as Home Credit.

In [1]:
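import os
import sys
from pathlib import Path

# The notebook lives two levels below the project root (notebooks/02_benchmark).
CWD = Path.cwd()
PROJECT_ROOT = CWD.parent.parent
DB_PATH = (PROJECT_ROOT / "mlflow.db").resolve()
ARTIFACT_ROOT = (PROJECT_ROOT / "artifacts").resolve()
ARTIFACT_ROOT.mkdir(parents=True, exist_ok=True)

# Point MLflow at a local SQLite backend before it is imported.
os.environ["MLFLOW_TRACKING_URI"] = f"sqlite:///{DB_PATH.as_posix()}"
# Kept in the environment for reference; the artifact root is also passed
# explicitly when the experiment is created in the next cell.
os.environ["MLFLOW_ARTIFACT_URI"] = ARTIFACT_ROOT.as_uri()

# Make the project's `src` package importable from the notebook.
sys.path.append(str(PROJECT_ROOT))

import mlflow

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

print("CWD =", CWD)
print("Tracking URI =", mlflow.get_tracking_uri())
print("Artifacts root (env) =", os.environ["MLFLOW_ARTIFACT_URI"])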

CWD = c:\Users\yoann\Documents\open classrooms\projet 8\livrables\pret a dépenser\notebooks\02_benchmark
Tracking URI = sqlite:///C:/Users/yoann/Documents/open classrooms/projet 8/livrables/pret a dépenser/mlflow.db
Artifacts root (env) = file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts
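
The two printed locations differ in form: the tracking URI embeds the path as-is (spaces included), while the artifact root is percent-encoded. This follows from using `as_posix()` for one and `as_uri()` for the other. A minimal sketch with a hypothetical folder name shows the difference. Nothing is created on disk.

```python
from pathlib import Path

# Hypothetical absolute path containing a space.
db_path = Path.cwd() / "open classrooms" / "mlflow.db"

# as_posix(): the path is inserted verbatim, with forward slashes.
tracking_uri = f"sqlite:///{db_path.as_posix()}"

# as_uri(): a file:// URI in which the space is percent-encoded.
artifact_uri = db_path.parent.as_uri()

print(tracking_uri)
print(artifact_uri)
```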


In [2]:
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier

from src.modeling.train import train_with_cv
from src.modeling.prepare_for_model import prepare_application_for_model
from src.tracking import mlflow_tracking

EXPERIMENT_NAME = "home_credit_benchmarking"
exp_id = mlflow_tracking.get_or_create_experiment(EXPERIMENT_NAME, ARTIFACT_ROOT)
mlflow.set_experiment(EXPERIMENT_NAME)


<Experiment: artifact_location='file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts', creation_time=1771138249350, experiment_id='1', last_update_time=1771138249350, lifecycle_stage='active', name='home_credit_benchmarking', tags={}>

In [3]:

DATA_PATH = PROJECT_ROOT / "data" / "processed" / "train_split.csv"
df = pd.read_csv(DATA_PATH)

X_boost, y = prepare_application_for_model(df, model_type="boosting")
print("X_boost:", X_boost.shape, "| y:", y.shape)

X_boost: (215257, 1656) | y: (215257,)


In [4]:
from src.modeling.prepare_catboost import prepare_catboost

X_cb = prepare_catboost(X_boost)
cat_cols = X_cb.select_dtypes(include=["object"]).columns.tolist()
cat_features_idx = [X_cb.columns.get_loc(c) for c in cat_cols]
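
The internals of `prepare_catboost` are not shown here. As a rough sketch of what such a step typically has to guarantee, the example below uses a toy frame with hypothetical column names: CatBoost expects categorical values to be strings or integers, so missing values in object columns are replaced with a sentinel string before the column indices are collected. The sentinel `"missing"` is an assumption made for the example.

```python
import pandas as pd

# Toy frame with hypothetical columns; object dtype is set explicitly.
X = pd.DataFrame({
    "contract_type": pd.Series(["cash", None, "revolving"], dtype=object),
    "amount": [1000.0, 2500.0, 400.0],
    "occupation": pd.Series(["laborer", "staff", None], dtype=object),
})

# Identify categorical columns first, then make them CatBoost-friendly.
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()
for col in cat_cols:
    # Replace missing categories with a sentinel and force string values.
    X[col] = X[col].fillna("missing").astype(str)

# Positional indices, in the same form as the notebook passes to `cat_features`.
cat_features_idx = [X.columns.get_loc(c) for c in cat_cols]
print(cat_cols, cat_features_idx)
```

Column names could be passed to `cat_features` instead of positions when the model is fitted on a DataFrame. Indices remain valid as long as the column order does not change between preparation and training.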


In [5]:
params_cb = {
    "iterations": 200,
    "learning_rate": 0.05,
    "depth": 4,
    "random_state": 42,
    "auto_class_weights": "Balanced",
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "verbose": False,
}

model_cb = CatBoostClassifier(**params_cb)

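results = train_with_cv(
    model=model_cb,
    model_name="CatBoost",
    X=X_cb,
    y=y,
    model_type="boosting",
    threshold=0.5,   # fixed decision threshold for recall, F1, F-beta and cost
    n_splits=5,
    random_state=42,
    log_fold_metrics=True,
    cost_fn=10,      # cost of a false negative
    cost_fp=1,       # cost of a false positive
    fbeta_beta=3,    # beta of the F-beta score (reported as F3)
    fit_params={"cat_features": cat_features_idx},  # categorical column indices for CatBoost
    use_lgb_categorical=False,
)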

results


===== Entraînement (benchmark CV) : CatBoost =====

--- Fold 1/5 ---
   → AUC=0.7768 | Recall@0.50=0.7138 | F1@0.50=0.2814 | F3@0.50=0.5460 | Cost=21626
   → TN=27900 FP=11676 FN=995 TP=2481 | fit=77.73s | pred=0.33s

--- Fold 2/5 ---
   → AUC=0.7622 | Recall@0.50=0.6815 | F1@0.50=0.2692 | F3@0.50=0.5217 | Cost=22828
   → TN=27818 FP=11758 FN=1107 TP=2369 | fit=79.55s | pred=0.38s

--- Fold 3/5 ---
   → AUC=0.7690 | Recall@0.50=0.6961 | F1@0.50=0.2761 | F3@0.50=0.5338 | Cost=22186
   → TN=27950 FP=11626 FN=1056 TP=2419 | fit=82.13s | pred=0.29s

--- Fold 4/5 ---
   → AUC=0.7756 | Recall@0.50=0.7068 | F1@0.50=0.2802 | F3@0.50=0.5418 | Cost=21789
   → TN=27977 FP=11599 FN=1019 TP=2456 | fit=81.50s | pred=0.37s

--- Fold 5/5 ---
   → AUC=0.7631 | Recall@0.50=0.6852 | F1@0.50=0.2744 | F3@0.50=0.5273 | Cost=22440
   → TN=28076 FP=11500 FN=1094 TP=2381 | fit=78.13s | pred=0.28s

===== Résultats finaux (CV) =====
AUC                         : 0.7693 ± 0.0061
Recall@0.50              : 0.6967

{'model': 'CatBoost',
 'auc_mean': 0.7693483535421575,
 'auc_std': 0.00607752405910324,
 'recall_mean_fixed_threshold': 0.6966678974426903,
 'recall_std_fixed_threshold': 0.012289603875870561,
 'precision_mean_fixed_threshold': 0.1722891375786085,
 'precision_std_fixed_threshold': 0.002701182806877842,
 'f1_mean_fixed_threshold': 0.2762563482655649,
 'f1_std_fixed_threshold': 0.0043826035569942134,
 'fbeta_3_mean_fixed_threshold': 0.534102028387738,
 'fbeta_3_std_fixed_threshold': 0.008959968845595852,
 'business_cost_mean_fixed_threshold': 22173.8,
 'business_cost_std_fixed_threshold': 435.2343736425238,
 'threshold': 0.5,
 'time_sec': 414.024441242218}
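
The per-fold figures above can be re-derived from the confusion matrices. The sketch below recomputes fold 1 in plain Python, assuming the business cost is `cost_fn * FN + cost_fp * FP` with the weights passed to `train_with_cv` (10 and 1). That formula is inferred from the printed numbers, not read from the implementation of `train_with_cv`, and the helper name `fold_metrics` is illustrative.

```python
def fold_metrics(fp, fn, tp, beta=3.0, cost_fn=10, cost_fp=1):
    """Recompute threshold-based metrics from one confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    b2 = beta ** 2
    # F-beta weights recall beta^2 times more than precision.
    fbeta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
    # Assumed cost formula: each FN costs cost_fn, each FP costs cost_fp.
    cost = cost_fn * fn + cost_fp * fp
    return {"recall": recall, "precision": precision,
            "f1": f1, "fbeta": fbeta, "cost": cost}


# Fold 1 counts, copied from the log above (TN is not needed by these metrics).
fold1 = fold_metrics(fp=11676, fn=995, tp=2481)
print({name: round(value, 4) for name, value in fold1.items()})

# Mean cost over the five folds, from the Cost column of the log.
fold_costs = [21626, 22828, 22186, 21789, 22440]
mean_cost = sum(fold_costs) / len(fold_costs)
print(mean_cost)
```

Rounded to four decimals, the recomputed recall (0.7138), F1 (0.2814), F3 (0.5460) and cost (21626) agree with the fold 1 line of the log, and the mean of the five fold costs matches `business_cost_mean_fixed_threshold` (22173.8).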