# 💠 FinSight Protocol: Advanced Financial Fraud Detection
**Author:** Antigravity (Google Deepmind Team)

---

## 🎯 Objective
Develop a robust machine learning model that detects accounting anomalies and fraud attempts in an imbalanced financial dataset.

**Methodology:**
1.  **Forensic simulation:** Generate realistic data combining financial indicators (liquidity, leverage, margin) and behavioral indicators (Benford's Law).
2.  **Handling class imbalance:** Use **SMOTE** (Synthetic Minority Over-sampling Technique) to offset the rarity of fraud cases.
3.  **Modeling:** Train a **Random Forest** classifier.
4.  **Evaluation:** Compare performance with and without SMOTE.
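
The notebook simulates the `benford_deviation` feature directly and does not say how it would be measured on real ledgers. As a minimal sketch, assuming the feature is the mean absolute gap between observed first-digit frequencies and Benford's expected frequencies $P(d) = \log_{10}(1 + 1/d)$, it could be computed as follows. The function names and this exact definition are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the leading non-zero digit of a non-zero number."""
    # Scientific notation puts the leading significant digit first,
    # e.g. 0.042 -> '4.200000000000000e-02'
    return int(f"{abs(value):.15e}"[0])


def benford_deviation(amounts):
    """Mean absolute deviation between observed and Benford first-digit frequencies.

    Assumed definition for illustration; the notebook only simulates this feature.
    """
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    expected = benford_expected()
    total = len(digits)
    return sum(abs(counts.get(d, 0) / total - expected[d]) for d in range(1, 10)) / 9


# Powers of two are a classic sequence that follows Benford's Law closely
natural_amounts = [2 ** k for k in range(1, 200)]
# Amounts that all start with 9 are far from the expected distribution
suspicious_amounts = [900, 95, 9.1, 9400, 99]

print(f"natural:    {benford_deviation(natural_amounts):.4f}")
print(f"suspicious: {benford_deviation(suspicious_amounts):.4f}")
```

Under this definition a value near 0 means the first digits look natural, while larger values flag a digit distribution that may have been manipulated.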

In [None]:
# 1. INSTALLATION AND LIBRARY IMPORTS
# Install if needed: pip install numpy pandas seaborn matplotlib scikit-learn imbalanced-learn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA

# Plot styling
sns.set_theme(style="darkgrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 🛠️ Phase 1: Forensic Data Generation (Simulation)
We simulate a dataset of 5,000 companies, 5% of which are fraudulent.

In [None]:
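def generate_financial_data(n_samples=5000, fraud_ratio=0.05):
    """Simulate a labeled dataset of legitimate and fraudulent company profiles."""
    np.random.seed(42)  # fixed seed so the simulation is reproducible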
    n_frauds = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_frauds
    
    # --- Legitimate company profile ---
    legit_data = pd.DataFrame({
        'current_ratio': np.random.normal(1.5, 0.3, n_legit),
        'debt_to_equity': np.random.normal(0.5, 0.1, n_legit),
        'net_margin': np.random.normal(0.10, 0.02, n_legit),
        'benford_deviation': np.random.exponential(0.05, n_legit),
        'text_complexity': np.random.normal(10, 2, n_legit),
        'is_fraud': 0
    })
    
    # --- Fraudster profile (anomalies) ---
    fraud_data = pd.DataFrame({
        'current_ratio': np.random.normal(0.8, 0.4, n_frauds),
        'debt_to_equity': np.random.normal(1.2, 0.5, n_frauds),
        'net_margin': np.random.normal(0.15, 0.05, n_frauds),
        'benford_deviation': np.random.normal(0.3, 0.1, n_frauds),
        'text_complexity': np.random.normal(16, 3, n_frauds),
        'is_fraud': 1
    })
    
    df = pd.concat([legit_data, fraud_data], ignore_index=True)
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    return df

df = generate_financial_data()
print(f"Dataset Shape: {df.shape}")
print(f"Class distribution:\n{df['is_fraud'].value_counts(normalize=True)}")

## ⚖️ Phase 2: The Imbalance Problem & SMOTE

In [None]:
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Original training set - Frauds: {sum(y_train == 1)} / Total: {len(y_train)}")

# Apply SMOTE to the training set only, so the test set keeps the real class distribution
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Training set after SMOTE - Frauds: {sum(y_train_smote == 1)} / Total: {len(y_train_smote)}")
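
To make the resampling step less of a black box, here is a simplified sketch of the idea behind SMOTE: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the `imblearn` implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create synthetic minority points by interpolating toward nearest neighbours.

    Simplified illustration of the SMOTE idea, not imblearn's implementation.
    """
    X = np.asarray(X_minority, dtype=float)
    if len(X) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(X) - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances; a point must not be its own neighbour
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random minority sample, then one of its k nearest neighbours
    base = rng.integers(0, len(X), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolation factor in [0, 1): 0 keeps the base point, values near 1 approach the neighbour
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])


# Four minority points on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=50, k=2)
print(synthetic.shape)
```

Because every synthetic point lies between two existing fraud samples, the new points stay inside the region already occupied by the minority class. In the notebook, `smote.fit_resample` with its default settings generates fraud samples until both classes have the same count.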

In [None]:
# Visualize the effect of SMOTE with PCA
# (the projection is fitted on the original training set and reused for the resampled one)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_smote_pca = pca.transform(X_train_smote)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

axes[0].scatter(X_train_pca[y_train==0, 0], X_train_pca[y_train==0, 1], label='Legitimate', alpha=0.3, color='blue')
axes[0].scatter(X_train_pca[y_train==1, 0], X_train_pca[y_train==1, 1], label='Fraud', alpha=0.6, color='red')
axes[0].set_title("Before SMOTE")
axes[0].legend()

axes[1].scatter(X_smote_pca[y_train_smote==0, 0], X_smote_pca[y_train_smote==0, 1], label='Legitimate', alpha=0.3, color='blue')
axes[1].scatter(X_smote_pca[y_train_smote==1, 0], X_smote_pca[y_train_smote==1, 1], label='Fraud (Augmented)', alpha=0.6, color='red')
axes[1].set_title("After SMOTE")
axes[1].legend()

plt.show()

## 🧠 Phase 3: Training and Comparison (Random Forest)

In [None]:
# Model 1: Standard (trained on the original, imbalanced training set)
rf_standard = RandomForestClassifier(n_estimators=100, random_state=42)
rf_standard.fit(X_train, y_train)
y_pred_std = rf_standard.predict(X_test)

# Model 2: SMOTE (trained on the resampled training set)
rf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = rf_smote.predict(X_test)

def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.xticks([0.5, 1.5], ['Legitimate', 'Fraud'])
    plt.yticks([0.5, 1.5], ['Legitimate', 'Fraud'])
    plt.show()

print("--- WITHOUT SMOTE ---")
print(classification_report(y_test, y_pred_std))
plot_confusion_matrix(y_test, y_pred_std, "Without SMOTE")

print("--- WITH SMOTE ---")
print(classification_report(y_test, y_pred_smote))
plot_confusion_matrix(y_test, y_pred_smote, "With SMOTE")

## ✅ Conclusion
Using **SMOTE** rebalanced the model's sensitivity toward the minority class and increased fraud detection. The size of the gain is given by the recall on the fraud class (class 1) in the two classification reports above.
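
The detection claim is a statement about recall on the fraud class: the share of actual frauds that the model flags. The sketch below shows one way to extract that number, together with precision, from a pair of label vectors. The helper `fraud_metrics` is hypothetical and the toy labels are made up for illustration; they are not results from this notebook.

```python
import numpy as np


def fraud_metrics(y_true, y_pred):
    """Return confusion counts, recall, and precision for the positive (fraud = 1) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # frauds correctly flagged
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # frauds missed
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # legitimate cases flagged by mistake
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fn": fn, "fp": fp, "recall": recall, "precision": precision}


# Toy labels for illustration only (not output from the notebook)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy_result = fraud_metrics(toy_true, toy_pred)
print(toy_result)
```

In the notebook, comparing `fraud_metrics(y_test, y_pred_std)` with `fraud_metrics(y_test, y_pred_smote)` would show the change in recall alongside any cost in precision, since oversampling can also raise the number of false alarms.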