# Logistic regression - Individual benchmark
Logistic regression is a probabilistic classification model. It estimates the probability of default by applying the logistic function to a linear combination of the input variables, which makes it simple, fast, and highly interpretable. It is used here as the baseline model: more complex models are compared against it, while it provides an explainable reference that is consistent with business practice.
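
As a minimal sketch of that idea, the snippet below turns a linear combination of features into a probability with the logistic (sigmoid) function. The function name, weights, bias, and feature values are hypothetical and chosen only for illustration; they are not taken from the model fitted in this notebook.

```python
import math


def default_probability(features, weights, bias):
    """Return sigmoid(bias + sum_i weights[i] * features[i])."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and one hypothetical, already scaled applicant.
weights = [0.8, -1.2]
bias = -0.5
applicant = [1.0, 0.5]

p = default_probability(applicant, weights, bias)
print(f"P(default) = {p:.3f}")
```

Because each coefficient shifts the log-odds additively, its sign and magnitude can be read directly, which is where the interpretability of the model comes from.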

In [1]:
import os
import sys
from pathlib import Path

# The notebook runs from notebooks/02_benchmark, so the project root is two levels up.
CWD = Path.cwd()
PROJECT_ROOT = CWD.parent.parent
DB_PATH = (PROJECT_ROOT / "mlflow.db").resolve()
ARTIFACT_ROOT = (PROJECT_ROOT / "artifacts").resolve()
ARTIFACT_ROOT.mkdir(parents=True, exist_ok=True)

# Expose the tracking store (local SQLite file) and the artifact root
# through environment variables.
os.environ["MLFLOW_TRACKING_URI"] = f"sqlite:///{DB_PATH.as_posix()}"
os.environ["MLFLOW_ARTIFACT_URI"] = ARTIFACT_ROOT.as_uri()

# Make the project's `src` package importable from the notebook.
sys.path.append(str(PROJECT_ROOT))

import mlflow

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

print("CWD =", CWD)
print("Tracking URI =", mlflow.get_tracking_uri())
print("Artifacts root (env) =", os.environ["MLFLOW_ARTIFACT_URI"])

CWD = c:\Users\yoann\Documents\open classrooms\projet 8\livrables\pret a dépenser\notebooks\02_benchmark
Tracking URI = sqlite:///C:/Users/yoann/Documents/open classrooms/projet 8/livrables/pret a dépenser/mlflow.db
Artifacts root (env) = file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts


In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Project helpers from the `src` package at the project root
from src.modeling.train import train_with_cv
from src.modeling.prepare_for_model import (
    make_preprocessor,
    prepare_application_for_model,
)
from src.tracking import mlflow_tracking

EXPERIMENT_NAME = "home_credit_benchmarking"
exp_id = mlflow_tracking.get_or_create_experiment(EXPERIMENT_NAME, ARTIFACT_ROOT)
mlflow.set_experiment(EXPERIMENT_NAME)


<Experiment: artifact_location='file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts', creation_time=1771138249350, experiment_id='1', last_update_time=1771138249350, lifecycle_stage='active', name='home_credit_benchmarking', tags={}>

In [3]:


# --- Load split ---
df = pd.read_csv(PROJECT_ROOT / "data" / "processed" / "train_split.csv")

# --- Prepare X/y ---
X_skl, y = prepare_application_for_model(df, model_type="sklearn")

# --- Preprocessor ---
preprocessor, cols = make_preprocessor(X_skl)
print({k: len(v) for k, v in cols.items()})

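# --- Model ---
params_lr = {
    "max_iter": 1000,
    "class_weight": "balanced",  # reweight classes inversely to their frequency
    "solver": "saga",
    "n_jobs": -1,
    "random_state": 42,
}

model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("logreg", LogisticRegression(**params_lr)),
])

# --- 5-fold CV benchmark, metrics reported at a fixed 0.5 threshold ---
results = train_with_cv(
    model=model,
    model_name="LogisticRegression",
    X=X_skl,
    y=y,
    model_type="sklearn",
    threshold=0.5,
    n_splits=5,
    random_state=42,
    log_fold_metrics=True,
    cost_fn=10,    # cost of one false negative
    cost_fp=1,     # cost of one false positive
    fbeta_beta=3,  # F-beta with beta=3, which weights recall above precision
)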

results

{'num': 1642, 'cat': 14, 'bool': 0}

===== Training (CV benchmark): LogisticRegression =====



--- Fold 1/5 ---




   → AUC=0.7774 | Recall@0.50=0.7091 | F1@0.50=0.2819 | F3@0.50=0.5442 | Cost=21655
   → TN=28031 FP=11545 FN=1011 TP=2465 | fit=3553.55s | pred=4.09s

--- Fold 2/5 ---




   → AUC=0.7688 | Recall@0.50=0.6913 | F1@0.50=0.2765 | F3@0.50=0.5317 | Cost=22234
   → TN=28072 FP=11504 FN=1073 TP=2403 | fit=3487.08s | pred=4.34s

--- Fold 3/5 ---




   → AUC=0.7708 | Recall@0.50=0.6927 | F1@0.50=0.2788 | F3@0.50=0.5341 | Cost=22062
   → TN=28194 FP=11382 FN=1068 TP=2407 | fit=3580.49s | pred=4.28s

--- Fold 4/5 ---




   → AUC=0.7724 | Recall@0.50=0.6964 | F1@0.50=0.2793 | F3@0.50=0.5362 | Cost=21987
   → TN=28139 FP=11437 FN=1055 TP=2420 | fit=3467.43s | pred=4.12s

--- Fold 5/5 ---




   → AUC=0.7658 | Recall@0.50=0.6852 | F1@0.50=0.2793 | F3@0.50=0.5309 | Cost=22134
   → TN=28382 FP=11194 FN=1094 TP=2381 | fit=3543.53s | pred=4.12s
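
The per-fold figures above can be recomputed from the logged confusion counts. The sketch below is not the implementation of `train_with_cv`; `fold_metrics` is a hypothetical helper that applies the standard definitions of recall, precision, F1, and F-beta, plus the cost formula shown in the log (`FN*10+FP*1`), to the fold 1 counts.

```python
def fold_metrics(tn, fp, fn, tp, beta=3.0, cost_fn=10, cost_fp=1):
    """Recompute threshold-dependent metrics from confusion counts."""
    b2 = beta ** 2
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fn + fp),
        # F-beta expressed directly in counts; beta > 1 favours recall.
        "fbeta": (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp),
        # Business cost: each false negative costs cost_fn, each false positive cost_fp.
        "cost": cost_fn * fn + cost_fp * fp,
    }


# Confusion counts logged for fold 1.
m = fold_metrics(tn=28031, fp=11545, fn=1011, tp=2465)
print({k: round(v, 4) for k, v in m.items()})
# recall=0.7091, f1=0.2819, fbeta=0.5442, cost=21655, matching the fold 1 log line
```

With a false negative weighted ten times a false positive, the 11545 false positives still contribute slightly more to the fold 1 cost than the 1011 false negatives (11545 versus 10110).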

===== Final results (CV) =====
AUC                        : 0.7710 ± 0.0039
Recall@0.50                : 0.6949 ± 0.0080
Precision@0.50             : 0.1747 ± 0.0011
F1@0.50                    : 0.2792 ± 0.0017
F3@0.50                    : 0.5354 ± 0.0048
Business cost (FN*10+FP*1) : 22014.40 ± 197.34
TN/FP/FN/TP (mean)         : 28163.6/11412.4/1060.2/2415.2
⏱ Total time               : 17677.13s
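
The summary values can be reproduced from the five per-fold log lines. The sketch below assumes the `±` term is the population standard deviation across folds (the default of `numpy.std`, `ddof=0`); this is inferred from the numbers, not confirmed from the source of `train_with_cv`.

```python
from statistics import mean, pstdev

# Per-fold values copied from the log above.
fold_costs = [21655, 22234, 22062, 21987, 22134]
fold_aucs = [0.7774, 0.7688, 0.7708, 0.7724, 0.7658]

cost_mean, cost_std = mean(fold_costs), pstdev(fold_costs)
auc_mean, auc_std = mean(fold_aucs), pstdev(fold_aucs)

print(f"Business cost: {cost_mean:.2f} ± {cost_std:.2f}")  # 22014.40 ± 197.34
print(f"AUC: {auc_mean:.4f} ± {auc_std:.4f}")
```

The sample standard deviation (`ddof=1`) of the same costs would be about 220.6, so the logged 197.34 is consistent only with the population form.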


{'model': 'LogisticRegression',
 'auc_mean': 0.7710289813012711,
 'auc_std': 0.003856254267214205,
 'recall_mean_fixed_threshold': 0.6949409807022046,
 'recall_std_fixed_threshold': 0.007970366283567464,
 'precision_mean_fixed_threshold': 0.17466655790624414,
 'precision_std_fixed_threshold': 0.001067651216076283,
 'f1_mean_fixed_threshold': 0.2791622894125739,
 'f1_std_fixed_threshold': 0.001735017059887004,
 'fbeta_3_mean_fixed_threshold': 0.5354360201828434,
 'fbeta_3_std_fixed_threshold': 0.004772516636686096,
 'business_cost_mean_fixed_threshold': 22014.4,
 'business_cost_std_fixed_threshold': 197.3388963179839,
 'threshold': 0.5,
 'time_sec': 17677.128698587418}
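
The low precision (about 0.17) combined with a recall near 0.69 reflects the `class_weight="balanced"` setting in `params_lr`. scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reimplements that formula with the standard library; `balanced_class_weights` is a hypothetical helper, and the class counts are those of the fold 1 validation split (TN+FP negatives, FN+TP positives), used here only as a proxy for the class ratio seen during training.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Fold 1 validation split: 28031 + 11545 negatives, 1011 + 2465 positives.
labels = [0] * (28031 + 11545) + [1] * (1011 + 2465)
class_weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in class_weights.items()})
```

With roughly eleven non-defaults for every default, each default weighs about eleven times more in the loss, which pushes predicted probabilities up and explains the large number of false positives at the 0.5 threshold.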