# Feature reduction – LightGBM

This notebook aims to reduce the number of features used by the LightGBM model,
based on the global importance analysis (gain), in order to:

- reduce model complexity
- speed up training and inference
- improve stability
- make business interpretation easier

Performance is evaluated with 5-fold cross-validation and compared against the full model.

## Imports, paths, and MLflow

In [1]:
import os
import sys
from pathlib import Path


CWD = Path.cwd()
PROJECT_ROOT = CWD.parent.parent
DB_PATH = (PROJECT_ROOT / "mlflow.db").resolve()
ARTIFACT_ROOT = (PROJECT_ROOT / "artifacts").resolve()
ARTIFACT_ROOT.mkdir(parents=True, exist_ok=True)

FEATURE_REDUCTION_DIR = PROJECT_ROOT / "reports" / "feature_reduction"
FEATURE_REDUCTION_DIR.mkdir(parents=True, exist_ok=True)

os.environ["MLFLOW_TRACKING_URI"] = f"sqlite:///{DB_PATH.as_posix()}"
os.environ["MLFLOW_ARTIFACT_URI"] = ARTIFACT_ROOT.as_uri()


sys.path.append(str(PROJECT_ROOT))

import mlflow  


mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

print("CWD =", CWD)
print("Tracking URI =", mlflow.get_tracking_uri())
print("Artifacts root (env) =", os.environ["MLFLOW_ARTIFACT_URI"])


CWD = c:\Users\yoann\Documents\open classrooms\projet 8\livrables\pret a dépenser\notebooks\03_modeling
Tracking URI = sqlite:///C:/Users/yoann/Documents/open classrooms/projet 8/livrables/pret a dépenser/mlflow.db
Artifacts root (env) = file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts


In [2]:
import pandas as pd
from lightgbm import LGBMClassifier
import numpy as np
from src.modeling.train import train_with_cv
from src.modeling.prepare_for_model import prepare_application_for_model
from src.tracking import mlflow_tracking

EXPERIMENT_NAME = "home_credit_reduction_perimetre"
exp_id = mlflow_tracking.get_or_create_experiment(EXPERIMENT_NAME, ARTIFACT_ROOT)
mlflow.set_experiment(EXPERIMENT_NAME)
# To browse the runs locally: mlflow ui --backend-store-uri sqlite:///mlflow.db

<Experiment: artifact_location='file:///C:/Users/yoann/Documents/open%20classrooms/projet%208/livrables/pret%20a%20d%C3%A9penser/artifacts', creation_time=1771233857159, experiment_id='2', last_update_time=1771233857159, lifecycle_stage='active', name='home_credit_reduction_perimetre', tags={}>

## Loading the data

In [3]:

DATA_PATH = PROJECT_ROOT / "data" / "processed" / "train_split.csv"
df = pd.read_csv(DATA_PATH)

X_lgb, y = prepare_application_for_model(df, model_type="boosting")
print("X_lgb:", X_lgb.shape, "| y:", y.shape)

X_lgb: (215257, 1656) | y: (215257,)


## Loading the previous feature importance

In [4]:
FI_DIR = PROJECT_ROOT / "reports" / "feature_importance"
fi_path = FI_DIR / "lightgbm_feature_importance_full.csv"
fi = pd.read_csv(fi_path)
print("fi loaded:", fi.shape)

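# Robustness: pick the importance column, preferring gain when available
importance_cols = [c for c in fi.columns if c.startswith("importance_")]
if not importance_cols:
    raise ValueError(f"No importance_* column found in {fi_path}")
imp_col = "importance_gain" if "importance_gain" in fi.columns else importance_cols[0]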
print("Importance column used:", imp_col)

fi loaded: (1656, 3)
Importance column used: importance_gain


## Utility functions

### Top-N feature selection

In [5]:
from src.modeling.feature_selection import select_top_features
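
The source of `select_top_features` is not shown in this notebook. As a rough illustration of what a top-N selection by importance could look like, here is a minimal sketch. The function name `select_top_features_sketch` and the `feature` column name are assumptions made for this example, not the project's actual implementation.

```python
import pandas as pd


def select_top_features_sketch(X, fi, top_n, importance_col="importance_gain", feature_col="feature"):
    """Keep the top_n columns of X ranked by decreasing importance.

    Hypothetical stand-in for src.modeling.feature_selection.select_top_features;
    `feature_col` is an assumed column name in the importance table.
    """
    ranked = fi.sort_values(importance_col, ascending=False)[feature_col]
    # Skip features listed in the importance table but absent from X.
    kept = [f for f in ranked if f in X.columns][:top_n]
    return X[kept]


# Toy data, made up for the demonstration.
X_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
fi_demo = pd.DataFrame({"feature": ["a", "b", "c"], "importance_gain": [10.0, 50.0, 30.0]})
X_top_demo = select_top_features_sketch(X_demo, fi_demo, top_n=2)
print(list(X_top_demo.columns))  # ['b', 'c']
```

Returning the columns in decreasing order of importance is a deliberate choice here: a later correlation filter can then keep the more important member of each correlated pair simply by keeping the earlier column.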

### Dropping correlated features

In [6]:
from src.modeling.feature_selection import drop_correlated_features
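
The implementation of `drop_correlated_features` is also outside this notebook; the benchmark loop only shows that it returns three values. The sketch below is one common way to write such a filter, under the assumptions that correlation is the absolute Pearson coefficient, that only numeric and boolean columns are examined, and that the third return value is the correlation matrix. The name `drop_correlated_features_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_correlated_features_sketch(X, threshold=0.9):
    """Drop every column whose |Pearson r| with an earlier column exceeds threshold.

    Hypothetical stand-in for src.modeling.feature_selection.drop_correlated_features.
    Returns (X_reduced, dropped_names, corr_matrix); the third element is an assumption.
    """
    numeric = X.select_dtypes(include=["number", "bool"]).astype(float)
    corr = numeric.corr().abs()
    # Keep only the strict upper triangle so that each pair is examined once
    # and each column is compared only with the columns that precede it.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop, corr


# Toy data, made up for the demonstration: x_copy is an exact multiple of x.
X_corr_demo = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "x_copy": [2, 4, 6, 8],
    "noise": [1, -1, 1, -1],
})
X_reduced_demo, dropped_demo, _ = drop_correlated_features_sketch(X_corr_demo, threshold=0.9)
print(dropped_demo)  # ['x_copy']
```

With columns ordered by decreasing importance, this rule keeps the more important feature of each correlated pair.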

## Parameters

In [7]:
params_lgb = {
    "objective": "binary",
    "n_estimators": 150,
    "learning_rate": 0.05,
    "num_leaves": 32,
    "class_weight": "balanced",
    "random_state": 42,
    "n_jobs": -1,
}

FEATURE_SIZES = [25, 50, 75, 100, 125, 150]
CORR_THRESHOLDS = [None, 0.90, 0.85, 0.80]  # None = no correlation filter

THRESH_FIXED = 0.5
COST_FN = 10
COST_FP = 1
FBETA_BETA = 3

results_reduction = []


features_path = FEATURE_REDUCTION_DIR
features_path.mkdir(parents=True, exist_ok=True)
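
To make `COST_FN`, `COST_FP`, and `FBETA_BETA` concrete, the sketch below computes the business cost and the F-beta score directly from confusion-matrix counts. The helper names are hypothetical. The formulas reproduce the fold 1 values printed in the training log of this notebook (Cost=21459, F3@0.50=0.5438), which suggests, but does not prove, that `train_with_cv` uses the same definitions.

```python
def business_cost(fn, fp, cost_fn=10, cost_fp=1):
    """Weighted error count: a false negative costs cost_fn, a false positive cost_fp."""
    return cost_fn * fn + cost_fp * fp


def fbeta_from_counts(tp, fp, fn, beta=3):
    """F-beta score from raw counts; beta > 1 weights recall more than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts from fold 1 of LightGBM_top25_nocorr, as printed in the training log.
tp, fp, fn = 2434, 11039, 1042

fold1_cost = business_cost(fn, fp)
fold1_f3 = fbeta_from_counts(tp, fp, fn, beta=3)

print(fold1_cost)          # 21459
print(round(fold1_f3, 4))  # 0.5438
```

With beta = 3, a false negative weighs nine times as much as a false positive in the F-beta denominator, which is in the same spirit as the 10:1 cost ratio.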

### Feature-reduction benchmark

In [8]:
# The size grid, correlation thresholds, and cost constants are already defined
# in cell In [7]. Only the accumulator is reset here, so that re-running the
# loop does not append duplicate rows.
results_reduction = []

In [9]:

for corr_th in CORR_THRESHOLDS:
    print("\n==============================")
    print("NO CORRELATION FILTER" if corr_th is None else f"CORR_THRESHOLD = {corr_th}")
    print("==============================")

    for top_n in FEATURE_SIZES:
        label = "nocorr" if corr_th is None else f"corr{str(corr_th).replace('.','')}"
        run_name = f"LightGBM_top{top_n}_{label}"
        print(f"\n===== {run_name} =====")

        # 1) Top-N
        X_top = select_top_features(X_lgb, fi, top_n, importance_col=imp_col)
        print("Top-N shape :", X_top.shape)

        # 2) Correlation filter (optional, numeric/bool columns only)
        if corr_th is None:
            X_corr = X_top.copy()
            to_drop = []
        else:
            X_corr, to_drop, _ = drop_correlated_features(X_top, threshold=corr_th)

        print(f"After corr  : {X_corr.shape} | dropped={len(to_drop)}")

        # 3) Save the kept / dropped feature lists locally
        keep_file = features_path / f"kept_features_top{top_n}_{label}.txt"
        keep_file.write_text("\n".join(X_corr.columns.tolist()), encoding="utf-8")

        if corr_th is not None and len(to_drop) > 0:
            drop_file = features_path / f"dropped_features_top{top_n}_{label}.txt"
            drop_file.write_text("\n".join(to_drop), encoding="utf-8")

        # 4) CV run (MLflow logging is handled inside train_with_cv)
        model_lgb = LGBMClassifier(**params_lgb)
        res = train_with_cv(
            model=model_lgb,
            model_name=run_name,
            X=X_corr,
            y=y,
            model_type="boosting",
            threshold=THRESH_FIXED,
            n_splits=5,
            random_state=42,
            log_fold_metrics=True,
            cost_fn=COST_FN,
            cost_fp=COST_FP,
            fbeta_beta=FBETA_BETA,
        )

        # 5) Metadata (for the final table)
        res["model"] = run_name  # make sure the full run name is stored
        res["top_n"] = int(top_n)
        res["corr_threshold"] = np.nan if corr_th is None else float(corr_th)
        res["n_features_after_corr"] = int(X_corr.shape[1])
        res["dropped_corr"] = int(len(to_drop))
        results_reduction.append(res)


NO CORRELATION FILTER

===== LightGBM_top25_nocorr =====
Top-N shape : (215257, 25)
After corr  : (215257, 25) | dropped=0

===== Entraînement (benchmark CV) : LightGBM_top25_nocorr =====

--- Fold 1/5 ---
[LightGBM] [Info] Number of positive: 13901, number of negative: 158304
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.011594 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5146
[LightGBM] [Info] Number of data points in the train set: 172205, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
   → AUC=0.7778 | Recall@0.50=0.7002 | F1@0.50=0.2872 | F3@0.50=0.5438 | Cost=21459
   → TN=28537 FP=11039 FN=1042 TP=2434 | fit=2.37s | pred=0.17s

--- Fold 2/5 ---
[LightGBM] [Info] Number of positive: 13901, number of negative: 158304
[LightGBM] [Info] Auto-choosing col-wise multi-threading, t

## LightGBM baseline for comparison

In [10]:
runs = mlflow.search_runs(
    experiment_names=["home_credit_benchmarking"],
    filter_string="tags.phase = 'benchmark_baseline' and tags.model_name = 'LightGBM'",
    order_by=["attributes.start_time DESC"],
)

if runs.empty:
    raise ValueError("No MLflow run found for phase=benchmark_baseline and model_name=LightGBM")

r = runs.iloc[0]

baseline = {
    "run_id": r["run_id"],
    "model": "LightGBM_full",

    "auc_mean": float(r.get("metrics.auc_mean", np.nan)),
    "auc_std": float(r.get("metrics.auc_std", np.nan)),

    "recall_mean_fixed_threshold": float(r.get("metrics.recall_mean_fixed_threshold", np.nan)),
    "recall_std_fixed_threshold": float(r.get("metrics.recall_std_fixed_threshold", np.nan)),

    "precision_mean_fixed_threshold": float(r.get("metrics.precision_mean_fixed_threshold", np.nan)),
    "precision_std_fixed_threshold": float(r.get("metrics.precision_std_fixed_threshold", np.nan)),

    "f1_mean_fixed_threshold": float(r.get("metrics.f1_mean_fixed_threshold", np.nan)),
    "f1_std_fixed_threshold": float(r.get("metrics.f1_std_fixed_threshold", np.nan)),

    "fbeta_3_mean_fixed_threshold": float(r.get("metrics.fbeta_3_mean_fixed_threshold", np.nan)),
    "fbeta_3_std_fixed_threshold": float(r.get("metrics.fbeta_3_std_fixed_threshold", np.nan)),

    "business_cost_mean_fixed_threshold": float(r.get("metrics.business_cost_mean_fixed_threshold", np.nan)),
    "business_cost_std_fixed_threshold": float(r.get("metrics.business_cost_std_fixed_threshold", np.nan)),

    "threshold": float(r.get("tags.threshold_fixed", 0.5)),
    "time_sec": float(r.get("metrics.train_time_sec", np.nan)),

    "top_n": int(X_lgb.shape[1]),
    "corr_threshold": np.nan,
    "n_features_after_corr": int(X_lgb.shape[1]),
    "dropped_corr": 0,
}

results_reduction.append(baseline)
print("Baseline added:", baseline["run_id"])

Baseline added: 835f4e494920416797ac8122cf3003ec


## Final comparison table

In [11]:

df_red = pd.DataFrame(results_reduction)

df_red = df_red.rename(columns={
    "auc_mean": "auc",
    "auc_std": "auc_std",

    "recall_mean_fixed_threshold": "recall",
    "recall_std_fixed_threshold": "recall_std",

    "precision_mean_fixed_threshold": "precision",
    "precision_std_fixed_threshold": "precision_std",

    "f1_mean_fixed_threshold": "f1",
    "f1_std_fixed_threshold": "f1_std",

    "fbeta_3_mean_fixed_threshold": "f3",
    "fbeta_3_std_fixed_threshold": "f3_std",

    "business_cost_mean_fixed_threshold": "business_cost",
    "business_cost_std_fixed_threshold": "business_cost_std",
})

# Optional safeguard (disabled): coerce metric columns to numeric dtypes.
# num_cols = ["auc","auc_std","recall","recall_std","precision","precision_std","f1","f1_std","f3","f3_std",
#             "business_cost","business_cost_std","time_sec","top_n","n_features_after_corr","dropped_corr"]
# for c in num_cols:
#     if c in df_red.columns:
#         df_red[c] = pd.to_numeric(df_red[c], errors="coerce")

# Sort (lower business_cost is better)
df_red = df_red.sort_values(
    by=["business_cost", "recall", "f3", "auc", "time_sec"],
    ascending=[True, False, False, False, True],
).reset_index(drop=True)

final_cols = [
    "model", "top_n", "corr_threshold", "n_features_after_corr", "dropped_corr",
    "business_cost", "business_cost_std",
    "recall", "recall_std",
    "precision", "precision_std",
    "f1", "f1_std",
    "f3", "f3_std",
    "auc", "auc_std",
    "time_sec",
]
final_cols = [c for c in final_cols if c in df_red.columns]
df_final = df_red[final_cols].copy()
df_final = df_final.drop_duplicates(subset=["model"]).reset_index(drop=True)
display(df_final)

out_csv = FEATURE_REDUCTION_DIR / "feature_reduction_results.csv"
df_final.to_csv(out_csv, index=False)
print("CSV:", out_csv)

| | model | top_n | corr_threshold | n_features_after_corr | dropped_corr | business_cost | business_cost_std | recall | recall_std | precision | precision_std | f1 | f1_std | f3 | f3_std | auc | auc_std | time_sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBM_top125_nocorr | 125 | NaN | 125 | 0 | 21301.2 | 508.261901 | 0.68746 | 0.014614 | 0.186234 | 0.003294 | 0.293072 | 0.005371 | 0.54167 | 0.010879 | 0.782206 | 0.005061 | 48.133787 |
| 1 | LightGBM_full | 1656 | NaN | 1656 | 0 | 21311.4 | 456.538542 | 0.653852 | 0.012666 | 0.196677 | 0.003447 | 0.302392 | 0.005379 | 0.530527 | 0.00992 | 0.782745 | 0.005089 | 522.616138 |
| 2 | LightGBM_top100_nocorr | 100 | NaN | 100 | 0 | 21371.2 | 506.202094 | 0.685043 | 0.014399 | 0.185906 | 0.003357 | 0.292446 | 0.005419 | 0.540042 | 0.010789 | 0.781778 | 0.005229 | 32.312829 |
| 3 | LightGBM_top150_nocorr | 150 | NaN | 150 | 0 | 21387.0 | 431.447332 | 0.684468 | 0.012121 | 0.185844 | 0.002922 | 0.292317 | 0.004667 | 0.539668 | 0.009133 | 0.781873 | 0.004765 | 48.96761 |
| 4 | LightGBM_top150_corr085 | 150 | 0.85 | 107 | 43 | 21394.4 | 468.647671 | 0.685907 | 0.013789 | 0.185324 | 0.002956 | 0.291803 | 0.004874 | 0.540031 | 0.010125 | 0.780311 | 0.004837 | 35.949725 |
| 5 | LightGBM_top75_nocorr | 75 | NaN | 75 | 0 | 21422.8 | 485.753188 | 0.686194 | 0.01398 | 0.184836 | 0.003123 | 0.291225 | 0.005108 | 0.539778 | 0.010393 | 0.781134 | 0.005404 | 20.660311 |
| 6 | LightGBM_top125_corr085 | 125 | 0.85 | 91 | 34 | 21426.6 | 463.434397 | 0.686425 | 0.012863 | 0.184723 | 0.003109 | 0.291106 | 0.004986 | 0.53981 | 0.009747 | 0.780263 | 0.004775 | 37.43313 |
| 7 | LightGBM_top125_corr08 | 125 | 0.8 | 82 | 43 | 21450.4 | 379.76498 | 0.685849 | 0.01076 | 0.184544 | 0.00252 | 0.290831 | 0.00405 | 0.539336 | 0.008042 | 0.779741 | 0.004999 | 23.562577 |
| 8 | LightGBM_top150_corr09 | 150 | 0.9 | 114 | 36 | 21462.0 | 454.315749 | 0.684583 | 0.01323 | 0.184724 | 0.002902 | 0.29094 | 0.00475 | 0.538784 | 0.009756 | 0.780223 | 0.005117 | 40.269654 |
| 9 | LightGBM_top75_corr08 | 75 | 0.8 | 54 | 21 | 21476.0 | 436.334734 | 0.686655 | 0.012496 | 0.183956 | 0.002864 | 0.290172 | 0.004619 | 0.539279 | 0.009271 | 0.778421 | 0.005188 | 20.272368 |


CSV: c:\Users\yoann\Documents\open classrooms\projet 8\livrables\pret a dépenser\reports\feature_reduction\feature_reduction_results.csv


In [12]:
from datetime import datetime

with mlflow.start_run(run_name=f"LightGBM_feature_reduction_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}"):
    mlflow.set_tag("phase", "feature_reduction_summary")
    mlflow.set_tag("model_name", "LightGBM")
    mlflow.set_tag("dataset", "train_split")
    mlflow.set_tag("threshold_mode", "fixed")
    mlflow.set_tag("threshold_fixed", str(float(THRESH_FIXED)))
    mlflow.set_tag("cost_fn", str(int(COST_FN)))
    mlflow.set_tag("cost_fp", str(int(COST_FP)))
    mlflow.set_tag("fbeta_beta", str(float(FBETA_BETA)))

    mlflow.log_artifact(str(out_csv))

    best = df_final.iloc[0]

    # tags best
    mlflow.set_tag("best.model", str(best.get("model", "")))
    mlflow.set_tag("best.top_n", str(int(best.get("top_n", -1))))
    if pd.notna(best.get("corr_threshold", np.nan)):
        mlflow.set_tag("best.corr_threshold", str(float(best["corr_threshold"])))
    else:
        mlflow.set_tag("best.corr_threshold", "none")
    mlflow.set_tag("best.n_features_after_corr", str(int(best.get("n_features_after_corr", -1))))
    mlflow.set_tag("best.dropped_corr", str(int(best.get("dropped_corr", 0))))

    # metrics best
    for k in ["business_cost","business_cost_std","recall","recall_std","precision","precision_std",
              "f1","f1_std","f3","f3_std","auc","auc_std","time_sec"]:
        if k in best and pd.notna(best[k]):
            mlflow.log_metric(f"best_{k}", float(best[k]))

print("MLflow summary run created")

# --- Automatic conclusion (printed) ---
best = df_final.iloc[0]
print("\n=== Automatic conclusion (best trade-off) ===")
print("Model:", best["model"])
print("Number of features:", int(best["n_features_after_corr"]))
print("Business cost:", float(best["business_cost"]))
print("Recall:", float(best["recall"]))
print("F3:", float(best["f3"]))
print("AUC:", float(best["auc"]))
print("Time (s):", float(best["time_sec"]))

MLflow summary run created

=== Automatic conclusion (best trade-off) ===
Model: LightGBM_top125_nocorr
Number of features: 125
Business cost: 21301.2
Recall: 0.687460241243139
F3: 0.5416703548589556
AUC: 0.7822058984653175
Time (s): 48.13378715515137


## Conclusion – Feature-set reduction (LightGBM)

The reduced models were compared by prioritizing **business cost** (10 × FN + 1 × FP), computed at a **fixed threshold of 0.5** and averaged over 5-fold cross-validation.

The best trade-off is obtained with **LightGBM_top125_nocorr** (125 features):

- **Business cost**: 21,301 (the lowest; the full model is at 21,311, a gap far smaller than the reported standard deviation of about 500)
- **Recall**: 0.687 (versus 0.654 for the full model)
- **F3**: 0.542 (versus 0.531 for the full model)
- **AUC**: 0.782 (on par with the full model at 0.783)
- **Training time**: ~48 s (versus ~523 s for the full model)

Reducing the feature set therefore preserves overall performance, improves recall and F3, and cuts compute time roughly tenfold. Precision and F1 are slightly lower than for the full model (0.186 versus 0.197 and 0.293 versus 0.302).

**Correlation filtering** shortens training further, but it does not improve business cost over the best model without filtering.
The model selected for the next steps is therefore **LightGBM_top125_nocorr**, which serves as the basis for business-threshold optimization and interpretability (SHAP).
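
As a quick sanity check on this comparison, the sketch below puts the business-cost gap between the selected model and the full model next to the reported standard deviation, and computes the speed-up. The numbers are copied from the comparison table; the variable names are illustrative only.

```python
# Values copied from the comparison table (rows 0 and 1).
selected = {"business_cost": 21301.2, "business_cost_std": 508.261901, "time_sec": 48.133787}
full = {"business_cost": 21311.4, "business_cost_std": 456.538542, "time_sec": 522.616138}

# Absolute cost gap, then the same gap expressed in units of the larger std.
cost_gap = full["business_cost"] - selected["business_cost"]
gap_in_std = cost_gap / max(selected["business_cost_std"], full["business_cost_std"])

# Ratio of the time_sec values reported for the two runs.
speedup = full["time_sec"] / selected["time_sec"]

print(f"cost gap: {cost_gap:.1f}")        # cost gap: 10.2
print(f"gap / std: {gap_in_std:.3f}")     # gap / std: 0.020
print(f"speed-up: {speedup:.1f}x")        # speed-up: 10.9x
```

The cost gap is about 2% of one standard deviation, so the two models are best read as equivalent on business cost; the case for the reduced model rests mainly on the speed-up and the smaller feature set.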

In [13]:
import json
from datetime import datetime

lock = {
    "created_at": datetime.now().isoformat(timespec="seconds"),
    "dataset": "train_split.csv",
    "best_model": "LightGBM_top125_nocorr",
    "feature_file": "kept_features_top125_nocorr.txt",
    "cv": 5,
    "random_state": 42,
    "threshold_fixed": 0.5,
    "cost_fn": 10,
    "cost_fp": 1,
    "fbeta_beta": 3,
}

lock_path = FEATURE_REDUCTION_DIR / "dataset_lock.json"
lock_path.write_text(json.dumps(lock, ensure_ascii=False, indent=2), encoding="utf-8")
print("Lock saved:", lock_path)

Lock saved: c:\Users\yoann\Documents\open classrooms\projet 8\livrables\pret a dépenser\reports\feature_reduction\dataset_lock.json
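
A downstream notebook could read this lock file back to recover the selected feature list. The sketch below shows one way to do that, assuming the lock file and the `kept_features_*.txt` file sit in the same directory, as they do here. The helper `load_locked_features` is hypothetical, and the demo runs on a temporary directory with made-up feature names.

```python
import json
import tempfile
from pathlib import Path


def load_locked_features(reduction_dir):
    """Return (lock dict, list of feature names) from a feature-reduction directory."""
    reduction_dir = Path(reduction_dir)
    lock = json.loads((reduction_dir / "dataset_lock.json").read_text(encoding="utf-8"))
    lines = (reduction_dir / lock["feature_file"]).read_text(encoding="utf-8").splitlines()
    # Ignore blank lines so a trailing newline does not create an empty feature name.
    return lock, [name for name in lines if name.strip()]


# Demo on a temporary directory; file contents are invented for the example.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "kept_features_demo.txt").write_text("feat_a\nfeat_b", encoding="utf-8")
    (tmp_dir / "dataset_lock.json").write_text(
        json.dumps({"feature_file": "kept_features_demo.txt", "threshold_fixed": 0.5}),
        encoding="utf-8",
    )
    lock_demo, features_demo = load_locked_features(tmp_dir)

print(features_demo)  # ['feat_a', 'feat_b']
```

In this project the call would presumably be `load_locked_features(FEATURE_REDUCTION_DIR)`, followed by `X_lgb[features]` to rebuild the reduced matrix.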
