# 🧪 Review of your project: bank churn

> Automatic feedback with concrete suggestions for improvement.

## ✅ Step coverage

- **Loading & EDA**: 5/5 (100%)
- **Data quality**: 3/3 (100%)
- **Visualizations**: 3/5 (60%)
- **Business analyses**: 1/3 (33%)
- **Modeling (optional)**: 0/6 (0%)
- **Executive report**: 0/1 (0%)


## 💪 Strengths
- Data loading works (`pd.read_csv`).
- Basic exploration is present (head/shape/info/describe).
- Essential visualizations are present (countplot/hist/box/bar/heatmap).

## 🧩 Areas to strengthen
### Visualizations
- Numeric vs. categorical comparison (boxplot/violin)
- Comparisons (Top N barplot)
### Business analyses
- Churn rate by country (groupby)
- Churn rate by gender (groupby)
### Modeling (optional)
- Train/test split (stratify=y)
- Logistic regression baseline
- Random Forest (comparison)
- Feature standardization/normalization
- Class imbalance handling (class_weight='balanced')
- Metrics (precision/recall/F1/ROC-AUC/confusion matrix)
### Executive report
- Final executive report (10 lines)


## 🧹 Cleaning: ready-to-use snippets

In [None]:

# Identifier columns to ignore for analysis/modeling
drop_cols = [c for c in ['RowNumber','CustomerId','Surname'] if c in df.columns]
df = df.drop(columns=drop_cols, errors='ignore')

# Missing values & duplicates
# Note: duplicates are counted after the identifiers are dropped, so distinct
# customers with identical feature values are counted as duplicates too.
na = df.isnull().sum().sort_values(ascending=False)
print(na[na>0])
print("Duplicates:", df.duplicated().sum())
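
As a complement to the snippet above, here is a minimal sketch that wraps the same checks in a function and also removes exact duplicate rows. The function name `clean_churn_frame` and the toy DataFrame are hypothetical and only serve the illustration; the identifier column names are the ones referenced in this review. It removes duplicates before dropping the identifiers, on the assumption that two customers with identical features are still two customers.

```python
import pandas as pd

ID_COLS = ["RowNumber", "CustomerId", "Surname"]  # identifier columns named in this review


def clean_churn_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Report missing values, drop exact duplicate rows, then drop identifier columns."""
    na = df.isnull().sum()
    print("Missing values per column:")
    print(na[na > 0].sort_values(ascending=False))

    # Duplicates are detected on full rows, identifiers included.
    n_dup = int(df.duplicated().sum())
    print("Exact duplicate rows dropped:", n_dup)

    out = df.drop_duplicates()
    out = out.drop(columns=[c for c in ID_COLS if c in out.columns])
    return out.reset_index(drop=True)


# Toy data (hypothetical), only to show the call: the second row repeats the first.
toy = pd.DataFrame({
    "RowNumber": [1, 1, 3, 4],
    "CustomerId": [101, 101, 103, 104],
    "Surname": ["A", "A", "C", "D"],
    "Age": [40, 40, 35, None],
    "Exited": [1, 1, 0, 0],
})
cleaned = clean_churn_frame(toy)
```

How to treat the remaining missing values (drop, impute) depends on the column, so the sketch only reports them.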


## 📌 Churn KPIs: overall rate and by segment

In [None]:

# Overall rate
if 'Exited' in df.columns:
    churn_rate = df['Exited'].mean()
    print("Overall churn rate:", round(churn_rate*100, 2), "%")

# By country
if {'Geography','Exited'}.issubset(df.columns):
    churn_by_country = df.groupby('Geography')['Exited'].mean().sort_values(ascending=False)
    display((churn_by_country*100).round(2).rename('% churn'))

# By gender
if {'Gender','Exited'}.issubset(df.columns):
    churn_by_gender = df.groupby('Gender')['Exited'].mean().sort_values(ascending=False)
    display((churn_by_gender*100).round(2).rename('% churn'))
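
The same groupby pattern extends to numeric variables once they are binned, which is useful because the report template lists age among the segments to examine. Below is a minimal sketch assuming an `Age` column; the bin edges, the labels, and the toy data are illustrative assumptions, not values from your dataset. Showing the segment size next to the rate helps avoid over-reading small groups.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the project's DataFrame.
toy = pd.DataFrame({
    "Age":    [22, 28, 35, 41, 47, 52, 58, 63],
    "Exited": [0, 0, 0, 1, 0, 1, 1, 0],
})

# Bin edges are an assumption; adapt them to the business definition of age bands.
bins = [18, 30, 45, 60, 100]
labels = ["18-29", "30-44", "45-59", "60+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Churn rate and number of customers per age band.
churn_by_age = (
    toy.groupby("AgeBand", observed=True)["Exited"]
       .agg(churn_rate="mean", n_customers="size")
)
churn_by_age["churn_rate"] = (churn_by_age["churn_rate"] * 100).round(2)
print(churn_by_age)
```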


## 📊 Readable visualizations (sorted and titled)

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

# Barplot of churn rate by country (sorted)
if {'Geography','Exited'}.issubset(df.columns):
    churn_by_country = df.groupby('Geography')['Exited'].mean().sort_values(ascending=False)
    ax = sns.barplot(x=churn_by_country.index, y=churn_by_country.values)
    ax.set_title('Churn rate by country')
    ax.set_xlabel('Country'); ax.set_ylabel('Churn rate')
    plt.xticks(rotation=30); plt.tight_layout(); plt.show()

# Age histogram (churn vs. non-churn), each group normalized separately
if {'Age','Exited'}.issubset(df.columns):
    sns.histplot(data=df, x='Age', hue='Exited', bins=30, stat='density', common_norm=False)
    plt.title('Age distribution: churn vs. non-churn'); plt.tight_layout(); plt.show()
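
The checklist in this review also flags two missing plots: a numeric vs. categorical comparison (boxplot/violin) and a Top N barplot. The sketch below shows one possible version of each. The toy data, the choice of `Age` and `Geography`, and the value of `top_n` are assumptions made for illustration only.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data (hypothetical), standing in for the project's DataFrame.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "Germany", "France", "Spain", "Germany", "France"],
    "Age":       [34, 51, 29, 47, 38, 42, 58, 31],
    "Exited":    [0, 1, 0, 1, 1, 0, 1, 0],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Numeric vs. categorical: age distribution for churners vs. non-churners.
sns.boxplot(data=toy, x="Exited", y="Age", ax=axes[0])
axes[0].set_title("Age by churn status")
axes[0].set_xlabel("Exited (0 = stayed, 1 = churned)")

# Top N barplot: the N categories with the highest churn rate.
top_n = 2  # assumption: pick N to suit the number of categories
top = (
    toy.groupby("Geography")["Exited"].mean()
       .sort_values(ascending=False)
       .head(top_n)
)
sns.barplot(x=top.index, y=top.values, ax=axes[1])
axes[1].set_title(f"Top {top_n} countries by churn rate")
axes[1].set_xlabel("Country")
axes[1].set_ylabel("Churn rate")

plt.tight_layout()
plt.show()
```

A Top N cut is most useful for high-cardinality columns; with only a handful of categories, the full sorted barplot is usually enough.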


## 🤖 Baseline model (optional): logistic regression vs. Random Forest

In [1]:

from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

target = 'Exited'
if target in df.columns:
    # Numeric / categorical feature columns (target excluded from both)
    num_cols = [c for c in df.select_dtypes(include='number').columns if c != target]
    cat_cols = [c for c in df.select_dtypes(exclude='number').columns if c != target]

    X = df[num_cols + cat_cols].copy()
    y = df[target].copy()

    # Scale numeric features and one-hot encode categorical ones: string columns
    # such as Geography or Gender cannot be passed to the models unencoded.
    preproc = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ], remainder='drop')

    # Stratified split keeps the churn proportion identical in train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    # Each pipeline gets its own copy of the preprocessor
    logit = Pipeline([('prep', clone(preproc)),
                      ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))])

    rf = Pipeline([('prep', clone(preproc)),
                   ('clf', RandomForestClassifier(n_estimators=300, random_state=42, class_weight='balanced'))])

    for name, model in [('LogReg', logit), ('RandomForest', rf)]:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:,1]
        print("\n=== ", name, " ===")
        print(classification_report(y_test, y_pred, digits=3))
        print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))
else:
    print("👉 Add or rename the target column to 'Exited' to enable this block.")


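Output: `NameError: name 'df' is not defined`. The cell was run before the dataset was loaded, so `df` did not exist in the kernel. Run the loading cell (`pd.read_csv`) first, then re-run this cell.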
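
The modeling checklist lists the confusion matrix among the expected metrics. Below is a minimal self-contained sketch on synthetic data, so that it runs without the project's `df`. The synthetic columns and the link between `Age` and `Exited` are invented for illustration and say nothing about your dataset. It one-hot encodes the categorical column, fits a class-weighted logistic regression, and prints the confusion matrix on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (hypothetical): churn probability rises with age.
rng = np.random.default_rng(42)
n = 400
age = rng.integers(18, 80, size=n)
geo = rng.choice(["France", "Germany", "Spain"], size=n)
p_exit = 1 / (1 + np.exp(-(age - 50) / 8))
toy = pd.DataFrame({"Age": age, "Geography": geo, "Exited": rng.binomial(1, p_exit)})

X, y = toy[["Age", "Geography"]], toy["Exited"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Geography"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
model.fit(X_train, y_train)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
print(pd.DataFrame(cm, index=["true 0", "true 1"], columns=["pred 0", "pred 1"]))
```

For churn, the bottom-left cell (churners predicted as staying) is typically the costly error, which is why recall on class 1 deserves attention alongside ROC-AUC.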


## 📑 Executive report: template

**Context**: The bank wants to reduce churn by identifying at-risk profiles.  
**Key results**:  
- Overall churn rate: …%  
- Highest-risk segments: … (country, age, products, activity)  
- Influential factors: … (Age, NumOfProducts, IsActiveMember, Geography)  
**Recommendations**:  
- Targeted campaigns on …  
- Cross-sell to …  
- Re-engagement of inactive customers …  
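
One way to keep the "key results" lines in sync with the analysis is to generate them from the computed KPIs rather than typing them by hand. This is a minimal sketch, assuming the `Exited` and `Geography` columns used throughout this review; the function name `summarize_churn` and the toy data are hypothetical.

```python
import pandas as pd


def summarize_churn(df: pd.DataFrame) -> str:
    """Build the first 'key results' lines of the report from the churn KPIs."""
    overall = df["Exited"].mean() * 100
    by_country = (
        df.groupby("Geography")["Exited"].mean().sort_values(ascending=False) * 100
    )
    top_country, top_rate = by_country.index[0], by_country.iloc[0]
    return (
        f"- Overall churn rate: {overall:.1f}%\n"
        f"- Highest-churn country: {top_country} ({top_rate:.1f}%)"
    )


# Toy data (hypothetical), only to show the call.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Germany", "Spain"],
    "Exited": [0, 1, 1, 0],
})
summary = summarize_churn(toy)
print(summary)
```

The recommendations still need to be written by hand, since they depend on business context that the numbers alone do not provide.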
