# Data preparation for the API + MLflow packaging

## Objective

This step prepares a production-ready dataset to feed the scoring API (Gradio) and the SQLite database.

## Loading the simulated production data

We use `test_final.csv` (the Kaggle dataset without the TARGET variable) as a proxy for real production data.

This file represents client applications for which we want to produce a default score through the API.

In [1]:
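import sys
from pathlib import Path

import pandas as pd

# The project root is taken to be two levels above the notebook's working directory.
CWD = Path.cwd()
PROJECT_ROOT = CWD.parent.parent
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"

# Make the project's own modules importable from the notebook.
sys.path.append(str(PROJECT_ROOT))

df = pd.read_csv(DATA_PROCESSED / "test_final.csv")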
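
Before going further, it can help to confirm that the loaded file really looks like production data: no `TARGET` column and one row per `SK_ID_CURR`. The helper below is a minimal sketch of such a check; the function name `check_production_frame` and the toy DataFrame are illustrative assumptions, not part of the project.

```python
import pandas as pd


def check_production_frame(frame: pd.DataFrame, id_col: str = "SK_ID_CURR") -> list:
    """Return a list of human-readable problems (empty if the frame looks fine)."""
    problems = []
    if "TARGET" in frame.columns:
        problems.append("TARGET column present: this looks like training data")
    if id_col not in frame.columns:
        problems.append(f"{id_col} column is missing")
    elif frame[id_col].duplicated().any():
        problems.append(f"{id_col} contains duplicated identifiers")
    return problems


# Toy data standing in for test_final.csv (values are made up).
toy = pd.DataFrame({"SK_ID_CURR": [100001, 100005, 100005], "AMT_CREDIT": [1.0, 2.0, 3.0]})
problems = check_production_frame(toy)
print(problems)
```

In the notebook itself, the equivalent call would be `check_production_frame(df)`.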


## Building the API-ready dataset

We build a dataset `X_api.csv` containing:
- `SK_ID_CURR` (client identifier),
- the 125 selected features.

This file will then be used to populate a local SQLite database in order to simulate production data storage.

The **SK_ID_CURR** column is required to query a specific client through the API.

In [2]:

FEATURE_REDUCTION_DIR = PROJECT_ROOT / "reports" / "feature_reduction"
FEATURE_SET_NAME = "top125_nocorr"
kept_file = FEATURE_REDUCTION_DIR / f"kept_features_{FEATURE_SET_NAME}.txt"

# One feature name per line; blank lines are ignored.
kept_features = [
    line.strip()
    for line in kept_file.read_text(encoding="utf-8").splitlines()
    if line.strip()
]
if not kept_features:
    raise ValueError(f"kept_features is empty -> check the contents of: {kept_file}")

# Check for missing columns before selecting, so that a feature absent from
# test_final.csv raises an error instead of being silently dropped.
cols = ["SK_ID_CURR"] + kept_features
missing = [c for c in cols if c not in df.columns]
if missing:
    raise ValueError(f"Columns missing from test_final.csv: {missing[:10]}")

## Exporting the `X_api.csv` file

The file is exported to `data/processed/`.

It provides a standardized source to:
- initialize the SQLite database,
- simulate API calls,
- generate monitoring data (prediction logs, scores, latency, drift).

In [3]:


X_api = df[cols].copy()
out_path = DATA_PROCESSED / "X_api.csv"
X_api.to_csv(out_path, index=False)

print("OK:", out_path, X_api.shape)

OK: c:\Users\yoann\Documents\open classrooms\projet 8\livrables\pret a dépenser\data\processed\X_api.csv (48744, 126)
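
As an illustration of the SQLite initialization mentioned above, the sketch below loads a small DataFrame into an in-memory database and looks up one client by `SK_ID_CURR`. The table name `clients`, the helper names `init_db` and `get_client`, and the toy values are assumptions made for this example; the project's actual schema is not shown in this notebook.

```python
import sqlite3

import pandas as pd


def init_db(frame: pd.DataFrame, conn: sqlite3.Connection, table: str = "clients") -> None:
    """Write the frame to SQLite and index it by SK_ID_CURR for fast lookups."""
    frame.to_sql(table, conn, if_exists="replace", index=False)
    conn.execute(f"CREATE UNIQUE INDEX IF NOT EXISTS idx_{table}_id ON {table} (SK_ID_CURR)")
    conn.commit()


def get_client(conn: sqlite3.Connection, sk_id: int, table: str = "clients") -> pd.DataFrame:
    """Return the row(s) for one client; the identifier is passed as a bound parameter."""
    query = f"SELECT * FROM {table} WHERE SK_ID_CURR = ?"
    return pd.read_sql_query(query, conn, params=(sk_id,))


# Toy stand-in for X_api (made-up values); missing values are stored as NULL.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [100001, 100005],
        "AMT_CREDIT": [568800.0, 222768.0],
        "OWN_CAR_AGE": [None, 4.0],
    }
)
conn = sqlite3.connect(":memory:")
init_db(toy, conn)
client = get_client(conn, 100005)
print(client)
```

With the real data, `toy` would be replaced by `X_api` and `":memory:"` by a file path. The unique index on `SK_ID_CURR` is a design choice that both speeds up lookups and rejects duplicated identifiers.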


In [4]:
import json

row = X_api.iloc[3].to_dict()

# Replace NaN with None (=> null in JSON)
row = {k: (None if pd.isna(v) else v) for k, v in row.items()}

# Export as formatted JSON
json_text = json.dumps(row, indent=2)

print(json_text)

{
  "SK_ID_CURR": 100028,
  "EXT_SOURCE_3": 0.6127042441012546,
  "EXT_SOURCE_2": 0.5096770801723647,
  "ORGANIZATION_TYPE": "Business Entity Type 3",
  "EXT_SOURCE_1": 0.5257339776824489,
  "BUREAU_BUREAU_DEBT_RATIO_MAX": 0.838975,
  "DAYS_EMPLOYED": -1866.0,
  "AMT_CREDIT": 1575000.0,
  "OCCUPATION_TYPE": "Sales staff",
  "PREV_RATIO_REFUSED": 0.0,
  "PREV_INST_AMT_PAYMENT_MIN_SUM": 12704.67,
  "AMT_ANNUITY": 49018.5,
  "PREV_INST_AMT_PAYMENT_MIN_MEAN": 4234.89,
  "CODE_GENDER": "F",
  "NAME_EDUCATION_TYPE": "Secondary / secondary special",
  "AMT_GOODS_PRICE": 1575000.0,
  "DAYS_BIRTH": -13976,
  "OWN_CAR_AGE": null,
  "BUREAU_BUREAU_DEBT_RATIO_STD": 0.2777859517559874,
  "PREV_PREV_CREDIT_TO_APPLICATION_MEAN": 0.9674785654181164,
  "PREV_POS_INSTALMENT_FUTURE_MEAN_MEAN": 8.3125,
  "PREV_POS_INSTALMENT_FUTURE_MEAN_STD": 4.684582425360877,
  "BUREAU_BUREAU_DEBT_RATIO_MEAN": 0.1222668319328298,
  "PREV_INST_RATIO_EARLY_MEAN": 0.7698412698412698,
  "PREV_INST_RATIO_LATE_MEAN": 0.047619

In [5]:
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

X_api.describe().T

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 48744.0 | 277796.7 | 103169.5 | 100001.0 | 188557.75 | 277549.0 | 367555.5 | 456250.0 |
| EXT_SOURCE_3 | 40076.0 | 0.5001056 | 0.189498 | 0.000527 | 0.363945 | 0.5190973 | 0.6528966 | 0.8825303 |
| EXT_SOURCE_2 | 48736.0 | 0.5180211 | 0.1812781 | 8e-06 | 0.408066 | 0.5587579 | 0.658497 | 0.8549997 |
| EXT_SOURCE_1 | 28212.0 | 0.5011798 | 0.2051423 | 0.013458 | 0.343695 | 0.5067713 | 0.6659559 | 0.9391445 |
| BUREAU_BUREAU_DEBT_RATIO_MAX | 41096.0 | 3.039704 | 167.8111 | 0.0 | 0.0 | 0.754785 | 0.9476186 | 29999.0 |
| DAYS_EMPLOYED | 39470.0 | -2476.739 | 2307.964 | -17463.0 | -3328.75 | -1765.0 | -861.0 | -1.0 |
| AMT_CREDIT | 48744.0 | 516740.4 | 365397.0 | 45000.0 | 260640.0 | 450000.0 | 675000.0 | 2245500.0 |
| PREV_RATIO_REFUSED | 47800.0 | 0.1145769 | 0.1798921 | 0.0 | 0.0 | 0.0 | 0.2 | 1.0 |
| PREV_INST_AMT_PAYMENT_MIN_SUM | 47800.0 | 45995.59 | 109780.1 | 0.0 | 9203.52375 | 20976.19 | 44871.99 | 3139280.0 |
| AMT_ANNUITY | 48720.0 | 29426.24 | 16016.37 | 2295.0 | 17973.0 | 26199.0 | 37390.5 | 180576.0 |
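
The `count` column above also shows how much of each feature is missing: `EXT_SOURCE_1`, for example, has 28,212 non-null values out of 48,744 rows, so roughly 42% of its values are absent. Below is a minimal sketch of that calculation on toy data; the helper name `missing_share` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Share of missing values per numeric column, derived from describe()'s count."""
    desc = frame.describe().T
    return (1 - desc["count"] / len(frame)).sort_values(ascending=False)


# Toy data (made-up values): one column with gaps, one complete.
toy = pd.DataFrame(
    {"EXT_SOURCE_1": [0.5, np.nan, np.nan, 0.7], "AMT_CREDIT": [1.0, 2.0, 3.0, 4.0]}
)
shares = missing_share(toy)
print(shares)
```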


Checking the bounds of the numeric features

In [6]:
desc = X_api.describe().T  # (rows=features)

# simple bounds taken from describe (observed min/max)
bounds = desc[["min", "max", "mean", "std"]].copy()
bounds

| feature | min | max | mean | std |
|---|---|---|---|---|
| SK_ID_CURR | 100001.0 | 456250.0 | 277796.7 | 103169.5 |
| EXT_SOURCE_3 | 0.000527 | 0.8825303 | 0.5001056 | 0.189498 |
| EXT_SOURCE_2 | 8e-06 | 0.8549997 | 0.5180211 | 0.1812781 |
| EXT_SOURCE_1 | 0.013458 | 0.9391445 | 0.5011798 | 0.2051423 |
| BUREAU_BUREAU_DEBT_RATIO_MAX | 0.0 | 29999.0 | 3.039704 | 167.8111 |
| DAYS_EMPLOYED | -17463.0 | -1.0 | -2476.739 | 2307.964 |
| AMT_CREDIT | 45000.0 | 2245500.0 | 516740.4 | 365397.0 |
| PREV_RATIO_REFUSED | 0.0 | 1.0 | 0.1145769 | 0.1798921 |
| PREV_INST_AMT_PAYMENT_MIN_SUM | 0.0 | 3139280.0 | 45995.59 | 109780.1 |
| AMT_ANNUITY | 2295.0 | 180576.0 | 29426.24 | 16016.37 |
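
One possible use of these bounds is to flag incoming API payloads whose numeric values fall outside the range observed in `X_api`. The sketch below assumes a bounds table with `min` and `max` columns indexed by feature name, like the one above. The function name `out_of_range` and the example payload are illustrative assumptions; the two bounds rows are copied from the table.

```python
import pandas as pd


def out_of_range(payload: dict, bounds: pd.DataFrame) -> dict:
    """Return {feature: value} for numeric values outside [min, max]."""
    flagged = {}
    for name, value in payload.items():
        if name not in bounds.index or value is None:
            continue  # feature without numeric bounds, or missing value
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # non-numeric value: nothing to compare
        lo, hi = bounds.loc[name, "min"], bounds.loc[name, "max"]
        if value < lo or value > hi:
            flagged[name] = value
    return flagged


# Two rows of observed bounds, in the same layout as describe().T[["min", "max"]].
toy_bounds = pd.DataFrame(
    {"min": [45000.0, -17463.0], "max": [2245500.0, -1.0]},
    index=["AMT_CREDIT", "DAYS_EMPLOYED"],
)
# Hypothetical payload: AMT_CREDIT is deliberately set above the observed maximum.
payload = {
    "AMT_CREDIT": 5_000_000.0,
    "DAYS_EMPLOYED": -1866.0,
    "CODE_GENDER": "F",
    "OWN_CAR_AGE": None,
}
flagged = out_of_range(payload, toy_bounds)
print(flagged)
```

Observed minima and maxima are only a rough sanity check: a value outside them is unusual relative to this dataset, not necessarily invalid.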


## Retrieving the model and artifacts

In [9]:
import shutil

RUN_ID = "a4dc7831df1b42a4a363af56bc86a775"

SRC_RUN_DIR = PROJECT_ROOT / "artifacts" / RUN_ID / "artifacts"
DEST_DIR = PROJECT_ROOT / "app" / "assets"

DEST_DIR.mkdir(parents=True, exist_ok=True)

# Copy the model folder
shutil.copytree(
    SRC_RUN_DIR / "model",
    DEST_DIR / "model",
    dirs_exist_ok=True,
)

# Copy the api_artifacts folder
shutil.copytree(
    SRC_RUN_DIR / "api_artifacts",
    DEST_DIR / "api_artifacts",
    dirs_exist_ok=True,
)

print("Copy completed to app/assets/")

Copy completed to app/assets/
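
After the copy, a quick check that `app/assets/model` is a usable MLflow model directory can catch an incomplete copy early. An MLflow model directory contains an `MLmodel` metadata file at its root, and the sketch below only tests for that file. The helper name `is_mlflow_model_dir` is an assumption, and the temporary directory merely stands in for `DEST_DIR / "model"`.

```python
import tempfile
from pathlib import Path


def is_mlflow_model_dir(model_dir: Path) -> bool:
    """True if the directory exists and contains the MLmodel metadata file."""
    return model_dir.is_dir() and (model_dir / "MLmodel").is_file()


# Demonstration on a temporary directory standing in for app/assets/model.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "model"
    fake_model.mkdir()
    before = is_mlflow_model_dir(fake_model)  # MLmodel not written yet
    (fake_model / "MLmodel").write_text("flavors: {}\n", encoding="utf-8")
    after = is_mlflow_model_dir(fake_model)  # placeholder MLmodel now present

print(before, after)
```

If the check passes on the real directory, the model can typically be loaded with `mlflow.pyfunc.load_model(str(DEST_DIR / "model"))`, provided the environment has the dependencies recorded with the model.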
