# Capstone – Option 4: Predicting Hospital Readmission Risk (Classification)

Goal: use a public dataset (Diabetes 130-US hospitals) to predict whether a patient will be readmitted within 30 days, and produce a short report.

**This notebook covers the 3 requirements:**
- **Data Processing**: cleaning + EDA + visualizations
- **Machine Learning**: classification + model comparison + validation
- **Generative AI**: generating an executive summary (with an optional RAG extension)

In [None]:
# 1) Imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, RocCurveDisplay
)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 2) Load the dataset
Two options:
1) **Local**: download `diabetic_data.csv` from UCI/Kaggle and place the file in the same folder as this notebook.
2) **OpenML** (if your environment has internet access): `fetch_openml`.


In [None]:
# Option A (recommended): local file
DATA_PATH = 'diabetic_data.csv'  # <-- adjust if needed
df = pd.read_csv(DATA_PATH)

print(df.shape)
df.head()

In [None]:
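# 3) Data Processing – quick data-quality audit
df.info()  # prints its report directly and returns None, so no display() wrapper
# Note: '?' placeholders are not counted by isna() until they are replaced in step 4
display(df.isna().sum().sort_values(ascending=False).head(20))
display(df.describe(include='all').T.head(20))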


## 4) Define the target
The dataset contains a `readmitted` column (typical values: `NO`, `>30`, `<30`).
- **Binary target** (simple): `1` if `<30`, otherwise `0`.

We also do a minimal cleanup of missing values, which are encoded as `?`.
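
To make the two steps above concrete, here is a minimal sketch on a made-up four-row table (the values are invented; only the column names `race` and `readmitted` mirror the real dataset). It shows that `?` becomes a countable missing value and that both `NO` and `>30` map to class `0`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "?"],
    "readmitted": ["NO", ">30", "<30", "<30"],
})

# '?' placeholders become real missing values that pandas can count.
toy = toy.replace("?", np.nan)

# Only '<30' is the positive class; 'NO' and '>30' both map to 0.
toy["target_readmit_30"] = (toy["readmitted"] == "<30").astype(int)

n_missing_race = int(toy["race"].isna().sum())
target_values = toy["target_readmit_30"].tolist()
print(n_missing_race, target_values)
```

Grouping `>30` with `NO` is a modeling choice, not a property of the data; a three-class target would be an equally valid variant.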


In [None]:
# Normalize missing values encoded as '?'
df = df.replace('?', np.nan)

# Binary target
TARGET_COL = 'readmitted'
df = df.dropna(subset=[TARGET_COL])
df['target_readmit_30'] = (df[TARGET_COL] == '<30').astype(int)

print(df['target_readmit_30'].value_counts(normalize=True).rename('ratio'))


## 5) Feature selection (MVP)
For a short capstone, we start with a reasonable subset (you can widen it later).

In [None]:
# Subset of commonly used columns (adjust to your version of the dataset)
candidate_cols = [
    'time_in_hospital','num_lab_procedures','num_procedures','num_medications',
    'number_outpatient','number_emergency','number_inpatient',
    'diag_1','diag_2','diag_3',
    'race','gender','age',
    'admission_type_id','discharge_disposition_id','admission_source_id',
    'insulin','change','diabetesMed'
]
available = [c for c in candidate_cols if c in df.columns]
missing = [c for c in candidate_cols if c not in df.columns]
print('Columns used:', len(available))
print('Missing columns (OK):', missing)

X = df[available].copy()
y = df['target_readmit_30'].copy()

X.head()

## 6) Visualizations (Data Processing)
We look at the distribution of the target and 2–3 key variables.
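
Besides histograms, a quick way to inspect a categorical variable is to compare the positive rate across its levels. The sketch below uses a made-up six-row table (invented values; `age` and `target_readmit_30` follow the notebook's naming) and reports the group size next to the rate, since rates from small groups are noisy.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": ["[50-60)", "[50-60)", "[70-80)", "[70-80)", "[70-80)", "[70-80)"],
    "target_readmit_30": [0, 0, 1, 0, 0, 1],
})

# Positive rate and group size per age bucket, highest rate first.
rate_by_age = (
    toy.groupby("age")["target_readmit_30"]
       .agg(rate="mean", n="size")
       .sort_values("rate", ascending=False)
)
print(rate_by_age)
```

On the real data, the same pattern could be applied to `X.join(y)` and plotted with `rate_by_age["rate"].plot(kind="bar")`.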


In [None]:
# Distribution of the target
fig, ax = plt.subplots()
y.value_counts().sort_index().plot(kind='bar', ax=ax)
ax.set_title('Target distribution (readmission <30 days)')
ax.set_xlabel('Class (0=no, 1=yes)')
ax.set_ylabel('Number of cases')
plt.show()

# Example: time_in_hospital vs target (if available)
if 'time_in_hospital' in X.columns:
    fig, ax = plt.subplots()
    X.loc[y==0, 'time_in_hospital'].dropna().plot(kind='hist', bins=20, ax=ax, alpha=0.7, label='0')
    X.loc[y==1, 'time_in_hospital'].dropna().plot(kind='hist', bins=20, ax=ax, alpha=0.7, label='1')
    ax.set_title('time_in_hospital by target class')
    ax.set_xlabel('Days')
    ax.legend()
    plt.show()


## 7) Train/test split (stratified)
Stratifying preserves the class proportions in both splits, which matters when the classes are imbalanced.
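
A minimal sketch of what stratification does, on made-up labels with 10% positives (the arrays `X_toy` and `y_toy` are invented for illustration): with `stratify=y_toy`, the train and test splits keep the same 10% rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 10 positives out of 100 (made-up data).
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Without `stratify`, a small test set could end up with noticeably more or fewer positives than the training set, which makes test metrics harder to compare with cross-validation scores.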

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
print('Train:', X_train.shape, ' Test:', X_test.shape)
print('y_train ratio:', y_train.mean().round(3), ' y_test ratio:', y_test.mean().round(3))


## 8) Preprocessing pipeline + models
We use `ColumnTransformer` + `Pipeline` so that imputation, scaling, and encoding are fitted on the training data only (which avoids data leakage) and the workflow stays reproducible.
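
The leakage point can be checked directly. In this sketch (random toy data, invented for illustration), the scaler inside the pipeline learns its mean from the training rows only, so held-out rows do not influence the preprocessing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: one numeric feature, label = sign of the feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 1))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Simple hold-out: first 80 rows for training, last 20 held out.
X_tr, y_tr = X_toy[:80], y_toy[:80]

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_tr, y_tr)

# The fitted scaler only saw the training rows.
scaler_mean = pipe.named_steps["scaler"].mean_[0]
train_mean = X_tr[:, 0].mean()
full_mean = X_toy[:, 0].mean()
print(scaler_mean, train_mean, full_mean)
```

The same holds inside `cross_validate` and `GridSearchCV`: because the whole pipeline is refitted on each training fold, the validation fold never contributes to the imputer, scaler, or encoder.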


In [None]:
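# Detect column types
# Note: the *_id columns are integer codes for categories, so this dtype-based
# detection treats them as numeric; cast them to str first to one-hot encode them.
numeric_features = X.select_dtypes(include=['int64','float64']).columns.tolist()
categorical_features = [c for c in X.columns if c not in numeric_features]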

print('Num:', numeric_features)
print('Cat:', categorical_features)

# Preprocessing
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Candidate models
candidates = {
    'LogReg': LogisticRegression(max_iter=2000, random_state=RANDOM_STATE),
    'RandomForest': RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, n_jobs=-1),
    'GradBoost': GradientBoostingClassifier(random_state=RANDOM_STATE)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = {'acc':'accuracy','f1':'f1','roc_auc':'roc_auc'}

results = []
for name, model in candidates.items():
    pipe = Pipeline([('preprocess', preprocessor), ('model', model)])
    scores = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
    results.append({
        'model': name,
        'cv_acc_mean': scores['test_acc'].mean(),
        'cv_f1_mean': scores['test_f1'].mean(),
        'cv_roc_auc_mean': scores['test_roc_auc'].mean(),
    })

results_df = pd.DataFrame(results).sort_values('cv_f1_mean', ascending=False)
results_df

## 9) Hyperparameter tuning (1 model)
We tune the best model according to CV F1 (you can change the criterion).
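
Why F1 rather than accuracy as the default criterion? A short worked example on made-up labels with 10% positives: a model that always predicts `0` reaches 90% accuracy but an F1 of zero, because it never finds a positive case.

```python
# Toy labels with 10% positives (made-up data) and a model that always predicts 0.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Guard against division by zero when no positive is predicted or present.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, f1)
```

If every candidate ends up with a CV F1 close to zero, options such as `class_weight='balanced'` (accepted by `LogisticRegression` and `RandomForestClassifier`) or a lower decision threshold may be worth trying; whether they help on this dataset is something to verify, not assume.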


In [None]:
best_name = results_df.iloc[0]['model']
print('Best (by CV F1):', best_name)

if best_name == 'RandomForest':
    base = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1)
    param_grid = {
        'model__n_estimators': [200, 500],
        'model__max_depth': [None, 10, 20],
        'model__min_samples_split': [2, 5]
    }
elif best_name == 'LogReg':
    base = LogisticRegression(max_iter=5000, random_state=RANDOM_STATE)
    param_grid = {
        'model__C': [0.1, 1.0, 10.0],
        'model__penalty': ['l2'],
        'model__solver': ['lbfgs']
    }
else:
    base = GradientBoostingClassifier(random_state=RANDOM_STATE)
    param_grid = {
        'model__n_estimators': [100, 300],
        'model__learning_rate': [0.05, 0.1],
        'model__max_depth': [2, 3]
    }

pipe = Pipeline([('preprocess', preprocessor), ('model', base)])
gs = GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv=cv, n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print('Best params:', gs.best_params_)
print('Best CV F1:', gs.best_score_)


## 10) Final evaluation on the test set

In [None]:
best_pipe = gs.best_estimator_
y_pred = best_pipe.predict(X_test)
y_proba = None
if hasattr(best_pipe.named_steps['model'], 'predict_proba'):
    y_proba = best_pipe.predict_proba(X_test)[:,1]

print(classification_report(y_test, y_pred, digits=3))

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

if y_proba is not None:
    print('ROC-AUC:', round(roc_auc_score(y_test, y_proba), 3))
    RocCurveDisplay.from_predictions(y_test, y_proba)
    plt.show()


## 11) Interpretability (simple)
- For RandomForest/GB: feature importances (approximate)
- For LogReg: coefficients


In [None]:
model = best_pipe.named_steps['model']

# Get feature names after one-hot encoding (approximate)
ohe = best_pipe.named_steps['preprocess'].named_transformers_['cat'].named_steps['onehot'] if categorical_features else None
cat_names = []
if ohe is not None and len(categorical_features)>0:
    cat_names = list(ohe.get_feature_names_out(categorical_features))
feature_names = numeric_features + cat_names

if hasattr(model, 'feature_importances_'):
    importances = model.feature_importances_
    imp = pd.DataFrame({'feature': feature_names, 'importance': importances}).sort_values('importance', ascending=False).head(20)
    display(imp)
elif hasattr(model, 'coef_'):
    coefs = model.coef_.ravel()
    coef_df = pd.DataFrame({'feature': feature_names, 'coef': coefs}).sort_values('coef', key=lambda s: np.abs(s), ascending=False).head(20)
    display(coef_df)
else:
    print('Interpretability: not available for this model.')


## 12) Generative AI – automatic executive summary
Simple idea (and easy to justify): generate a summary from the metrics + top features.

⚠️ Good practices: provide **context**, ask for **structured output**, and forbid inventing information that is not in the input.
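
Asking for structured output is only useful if you check that you got it. Below is a minimal sketch of such a check; `check_summary_format` is a hypothetical helper (not part of any library) that assumes the four numbered sections and the 180-word limit requested by the prompt in the next cell.

```python
import re

def check_summary_format(text: str, max_words: int = 180) -> list:
    """Return a list of problems found in an LLM summary (empty list = OK)."""
    problems = []
    sections = ["Problem", "Results", "Top drivers", "Limitations"]
    for i, section in enumerate(sections, start=1):
        # Expect headings such as "1) Problem" at the start of a line.
        pattern = rf"^\s*{i}\)\s*{re.escape(section)}"
        if not re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE):
            problems.append(f"missing section: {i}) {section}")
    n_words = len(text.split())
    if n_words > max_words:
        problems.append(f"too long: {n_words} words (max {max_words})")
    return problems

# Usage on a hand-written example (not real LLM output).
example = "1) Problem: x\n2) Results\n- a\n3) Top drivers\n- b\n4) Limitations\n- c"
print(check_summary_format(example))
```

This only validates the form of the answer. Checking that the summary does not invent facts still requires comparing it with the metrics and drivers you passed in.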


In [None]:
def build_prompt(metrics: dict, top_features: pd.DataFrame) -> str:
    top_txt = top_features.to_csv(index=False)
    return f"""You are a healthcare data analyst.
Write an executive summary (max 180 words) of this readmission risk model.

Constraints:
- Use ONLY the information provided below.
- If something is missing, explicitly say it is unknown.
- Output format:
  1) Problem (1 sentence)
  2) Results (bullets)
  3) Top drivers (bullets)
  4) Limitations (bullets)

Metrics:
{metrics}

Top features:
{top_txt}
""" 

# Example of content to pass to the LLM
metrics_payload = {
    'cv_best_f1': float(gs.best_score_),
    'test_report': classification_report(y_test, y_pred, output_dict=True)
}

# Pick the drivers table depending on what you have (imp or coef_df)
drivers = None
if 'imp' in globals():
    drivers = imp
elif 'coef_df' in globals():
    drivers = coef_df
else:
    drivers = pd.DataFrame({'feature':[], 'importance_or_coef':[]})

prompt = build_prompt(metrics_payload, drivers.head(10))
print(prompt[:800])

# ---- LLM call (optional) ----
# 1) put your key in an OPENAI_API_KEY environment variable
# 2) use the OpenAI API or another LLM
# We deliberately leave a STUB here so that your run is not blocked.


## 13) Model Card (transparency document)
You can write a `MODEL_CARD.md` file to your repo (or paste it into your report).

In [None]:
from datetime import date

model_card_md = f"""# Model Card – Readmission Risk Classifier

## Model details
- Date: {date.today()}
- Task: binary classification (readmitted in <30 days)
- Model: {type(best_pipe.named_steps['model']).__name__}

## Intended use
- Support early identification of higher-risk readmissions.
- **Not** for clinical decision-making without human review.

## Data
- Dataset: Diabetes 130-US hospitals (1999–2008), public.
- Unit: each row = inpatient encounter.

## Metrics
- Best CV F1: {gs.best_score_:.3f}
- Test-set report: see notebook output.

## Ethical / bias considerations
- Potential bias by demographics (race, gender, age buckets).
- Evaluate subgroup performance before deployment.

## Limitations
- Labels are historical; practice patterns may have changed.
- Missing data and coding conventions can affect results.
- Feature importance ≠ causality.
"""

print(model_card_md[:800])

# Optional: save to your repo
# with open('MODEL_CARD.md','w',encoding='utf-8') as f:
#     f.write(model_card_md)
