# DSO1: Implementation and Evaluation of the RLT Methodology
## Reinforcement Learning Trees on Multivariate Data

**Authors:** Dhia Romdhane, Yosri Awedi, Baha Saadoui, Nour Rajhi, Bouguerra Taha, Oumaima Nacef  
**Date:** December 2025  
**Course:** Machine Learning Project - DSO1  
**Methodology:** CRISP-DM + RLT (Zhu et al., 2015)

---

## 📚 About this Notebook (DSO1)

This notebook implements the **basic RLT methodology**:
1. **Variable Importance (VI)** - Estimating the global importance of each variable
2. **Variable Muting** - Removing weak variables
3. **RLT with Random Forest** - Basic embedded model
4. **Baseline vs RLT comparison** - Performance evaluation

### 🎯 DSO1 Scope
- ✅ **Naive model (Baseline)**: Simple regression/classification
- ✅ **RLT-RandomForest**: RLT implementation with RF as the embedded model
- ✅ **Full evaluation**: Metrics, visualizations, comparisons

### 🚀 DSO2 (Future Work)
**DSO2** will explore other embedded models:
- 🔜 XGBoost
- 🔜 LightGBM
- 🔜 Extra Trees
- 🔜 Gradient Boosting
- 🔜 Neural Networks

### 📊 Supported Datasets
This notebook works with **9 datasets**, plus the option to upload your own data.

---
## 📦 Setup & Configuration

In [None]:
# Core Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
from datetime import datetime
warnings.filterwarnings('ignore')

# ML Libraries (DSO1: simple models + Random Forest)
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, r2_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error, roc_auc_score
# f_classif and f_regression are provided by scikit-learn, not scipy.stats
from sklearn.feature_selection import f_classif, f_regression

# Configuration
RANDOM_STATE = 42
VI_THRESHOLD = 0.01
np.random.seed(RANDOM_STATE)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

print("="*80)
print("DSO1: Basic RLT Implementation".center(80))
print("="*80)
print("✓ Libraries imported successfully!")
print(f"📅 Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n💡 DSO1 focus: Baseline + RLT-RandomForest")
print("🔜 DSO2 will cover: XGBoost, LightGBM, Extra Trees, etc.")

---
## 📂 Dataset Selection

**Option 1:** Choose a preloaded dataset (1-9)  
**Option 2:** Upload your own CSV (0)
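
The notebook does not show how an uploaded file is processed after choosing option 0. One possible approach is sketched here as a minimal illustration: `load_custom_csv`, its `max_classes` cutoff, and the inline CSV text are all hypothetical names and values introduced for this example, and the problem-type guess is only a heuristic.

```python
import io
import pandas as pd


def load_custom_csv(source, target_col=None, max_classes=20):
    """Hypothetical helper: read a CSV and guess the target and problem type."""
    df = pd.read_csv(source)
    # Assumption: when no target is given, the last column is the target
    target_col = target_col or df.columns[-1]
    y = df[target_col]
    is_numeric = pd.api.types.is_numeric_dtype(y)
    # Heuristic (assumption): numeric targets with many distinct values are regression
    if is_numeric and y.nunique() > max_classes:
        problem_type = 'regression'
    else:
        problem_type = 'classification'
    return df, target_col, problem_type


# Tiny inline CSV standing in for an uploaded file
csv_text = "x1,x2,label\n1.0,2.0,yes\n3.0,4.0,no\n5.0,6.0,yes\n"
df_demo, target_demo, type_demo = load_custom_csv(io.StringIO(csv_text))
print(target_demo, type_demo)  # label classification
```

The returned triple has the same shape as the output of `load_dataset`, so the rest of the notebook could consume it unchanged.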

In [None]:
# Available datasets
AVAILABLE_DATASETS = {
    '1': {'file': 'BostonHousing.csv', 'target': 'medv', 'type': 'regression'},
    '2': {'file': 'winequality-red.csv', 'target': 'quality', 'type': 'classification'},
    '3': {'file': 'winequality-white.csv', 'target': 'quality', 'type': 'classification'},
    '4': {'file': 'sonar data.csv', 'target': 'Class', 'type': 'classification'},
    '5': {'file': 'parkinsons.data', 'target': 'status', 'type': 'classification'},
    '6': {'file': 'wdbc.data', 'target': None, 'type': 'classification'},
    '7': {'file': 'auto-mpg.data', 'target': 'mpg', 'type': 'regression'},
    '8': {'file': 'data_school.csv', 'target': None, 'type': 'classification'},
    '9': {'file': 'breast-cancer.csv', 'target': 'diagnosis', 'type': 'classification'}
}

print("📊 AVAILABLE DATASETS:")
print("="*80)
for key, info in AVAILABLE_DATASETS.items():
    print(f"{key}. {info['file']:<30} Type: {info['type']:<15} Target: {info['target'] or 'Auto'}")
print("\n0. Upload your own CSV")
print("="*80)

In [None]:
def load_dataset(choice='1'):
    """
    Load a dataset according to the given choice.

    Returns:
    --------
    df, target_col, problem_type
    """
    if choice == '0':
        print("📤 Upload mode: upload your CSV file")
        print("⚠️ After uploading, run the next cell to process the file")
        return None, None, None

    elif choice in AVAILABLE_DATASETS:
        dataset_info = AVAILABLE_DATASETS[choice]
        filepath = dataset_info['file']

        try:
            # Load the file (wdbc.data is read without a header row)
            if filepath.endswith('.data'):
                df = pd.read_csv(filepath, header=None if 'wdbc' in filepath else 0)
            else:
                df = pd.read_csv(filepath)

            # Determine the target column
            if dataset_info['target']:
                target_col = dataset_info['target']
            elif 'wdbc' in filepath:
                # Column 0 is the sample ID and column 1 the diagnosis: drop the ID
                target_col = df.columns[1]
                df = df.iloc[:, 1:]
            else:
                target_col = df.columns[-1]

            problem_type = dataset_info['type']

            print(f"✓ Dataset loaded: {filepath}")
            print(f"  Shape: {df.shape}")
            print(f"  Target: {target_col}")
            print(f"  Type: {problem_type}")

            return df, target_col, problem_type

        except Exception as e:
            print(f"❌ Error while loading {filepath}: {e}")
            return None, None, None
    else:
        print("❌ Invalid choice")
        return None, None, None

# CHANGE THIS NUMBER to try a different dataset (1-9)
DATASET_CHOICE = '1'

df, target_col, problem_type = load_dataset(DATASET_CHOICE)

if df is not None:
    print("\n📊 Dataset preview:")
    display(df.head())

---
## 🔍 Exploratory Analysis

In [None]:
if df is not None:
    print("📊 DATASET INFORMATION")
    print("="*80)
    print(f"Shape: {df.shape[0]} samples, {df.shape[1]} columns (target included)")
    print(f"Target: {target_col}")
    print(f"Problem type: {problem_type}")
    print(f"\nMissing values: {df.isnull().sum().sum()}")
    print(f"Duplicates: {df.duplicated().sum()}")

    print("\n📈 Target distribution:")
    if problem_type == 'classification':
        print(df[target_col].value_counts())
    else:
        print(df[target_col].describe())

In [None]:
# Visualizations
if df is not None:
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Target distribution
    if problem_type == 'classification':
        df[target_col].value_counts().plot(kind='bar', ax=axes[0], color='steelblue', alpha=0.7)
        axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
        axes[0].set_ylabel('Count')
    else:
        axes[0].hist(df[target_col], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
        axes[0].set_title('Target Distribution', fontsize=14, fontweight='bold')
        axes[0].set_ylabel('Frequency')
    axes[0].grid(alpha=0.3)

    # Correlation heatmap (the target plus the features most correlated with it)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 1:
        corr_matrix = df[numeric_cols].corr()
        if target_col in corr_matrix.columns:
            top_features = corr_matrix[target_col].abs().nlargest(min(10, len(corr_matrix))).index
            sns.heatmap(df[top_features].corr(), annot=True, fmt='.2f', cmap='coolwarm',
                       center=0, ax=axes[1], square=True)
            axes[1].set_title('Correlation Matrix (Top Features)', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.show()

---
## 🛠️ RLT STEP 1: Data Preprocessing
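
The preprocessing cell in this step encodes categorical features with `LabelEncoder`, which maps categories to integers in sorted order. That is usually harmless for tree models but imposes an artificial ordering on a linear baseline. As an illustration only (the toy `colors` series is an assumption and does not come from any of the notebook's datasets), this sketch contrasts label encoding with one-hot encoding via `pd.get_dummies`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (assumption, for illustration only)
colors = pd.Series(['red', 'green', 'blue', 'green'])

# Label encoding: classes are sorted alphabetically, then numbered
encoded = LabelEncoder().fit_transform(colors)
print(encoded.tolist())  # [2, 1, 0, 1]

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(colors, prefix='color')
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

One-hot encoding increases the number of columns, which may matter when the later muting step counts features.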

In [None]:
if df is not None:
    print("🔧 DATA PREPROCESSING")
    print("="*80)

    # Separate features and target
    X = df.drop(target_col, axis=1)
    y = df[target_col]

    # Encode categorical features
    categorical_features = X.select_dtypes(exclude=[np.number]).columns
    if len(categorical_features) > 0:
        print(f"⚠️ Encoding {len(categorical_features)} categorical features...")
        for col in categorical_features:
            le = LabelEncoder()
            X[col] = le.fit_transform(X[col].astype(str))

    # Encode the target for classification
    if problem_type == 'classification':
        if y.dtype == 'object' or not np.issubdtype(y.dtype, np.number):
            le_target = LabelEncoder()
            y = le_target.fit_transform(y)

    # Standardize the features
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

    print("\n✓ Preprocessing complete")
    print(f"  Features: {X_scaled.shape[1]}")
    print(f"  Samples: {len(y)}")

---
## 🧠 RLT STEP 2: Computing Variable Importance (VI)

**VI estimate used in this notebook (a simplified global heuristic inspired by RLT, Zhu et al., 2015):**
1. Random Forest feature importance (weight: 40%)
2. Statistical tests, F-statistic/correlation (weight: 60%)
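
The weighted blend described above can be sketched as a small function. This is a minimal illustration on synthetic data rather than the notebook's exact cell: `aggregate_vi`, its default arguments, and the `make_regression` inputs are assumptions introduced here, and the weights simply mirror the 40%/60% split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import f_regression


def aggregate_vi(X, y, rf_weight=0.4, stat_weight=0.6, random_state=42):
    """Blend normalized RF importances with normalized F-statistics."""
    rf = RandomForestRegressor(n_estimators=50, random_state=random_state)
    rf.fit(X, y)
    vi_rf = rf.feature_importances_ / rf.feature_importances_.sum()

    f_scores, _ = f_regression(X, y)
    f_scores = np.nan_to_num(f_scores)  # constant columns can yield NaN
    vi_stat = f_scores / f_scores.sum()

    # Both parts sum to 1, so the blend sums to rf_weight + stat_weight
    return rf_weight * vi_rf + stat_weight * vi_stat


# Synthetic data (assumption): 8 features, only 3 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
vi_demo = aggregate_vi(X_demo, y_demo)
print(np.round(vi_demo, 3))
```

Because each component is normalized before weighting, the aggregate scores sum to 1, which is what makes a fixed threshold such as `VI_THRESHOLD = 0.01` comparable across datasets with a similar number of features.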

In [None]:
if df is not None:
    print("🧠 COMPUTING VARIABLE IMPORTANCE (VI)")
    print("="*80)
    print("\nMethod: Random Forest (40%) + Statistical Tests (60%)")

    # Method 1: Random Forest VI
    if problem_type == 'classification':
        rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1)
        f_scores, _ = f_classif(X_scaled, y)
    else:
        rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1)
        f_scores, _ = f_regression(X_scaled, y)

    rf.fit(X_scaled, y)
    vi_rf = rf.feature_importances_

    # Method 2: statistical VI
    vi_stat = np.abs(f_scores)

    # Normalize so that each VI vector sums to 1
    vi_rf = vi_rf / vi_rf.sum()
    vi_stat = vi_stat / vi_stat.sum()

    # Aggregate (RLT DSO1: RF 40% + Stat 60%)
    VI_RF_WEIGHT = 0.4
    VI_STAT_WEIGHT = 0.6

    vi_aggregate = VI_RF_WEIGHT * vi_rf + VI_STAT_WEIGHT * vi_stat

    # VI DataFrame
    vi_df = pd.DataFrame({
        'Feature': X_scaled.columns,
        'VI_RandomForest': vi_rf,
        'VI_Statistical': vi_stat,
        'VI_Aggregate': vi_aggregate
    }).sort_values('VI_Aggregate', ascending=False)

    print("\n📊 Top 10 Features by Importance:")
    display(vi_df.head(10))

    print("\n✓ Variable Importance computed (DSO1: RF + Statistical)")

In [None]:
# VI visualization
if df is not None:
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Top 10 features
    top_10 = vi_df.head(10)
    axes[0].barh(range(len(top_10)), top_10['VI_Aggregate'], color='steelblue', alpha=0.8)
    axes[0].set_yticks(range(len(top_10)))
    axes[0].set_yticklabels(top_10['Feature'])
    axes[0].invert_yaxis()
    axes[0].set_xlabel('Variable Importance (Aggregate)', fontsize=12)
    axes[0].set_title('Top 10 Features - RLT Variable Importance', fontsize=14, fontweight='bold')
    axes[0].grid(axis='x', alpha=0.3)

    # RF vs Statistical comparison
    x = np.arange(len(top_10))
    width = 0.35
    axes[1].barh(x - width/2, top_10['VI_RandomForest'], width, label='Random Forest (40%)', alpha=0.8)
    axes[1].barh(x + width/2, top_10['VI_Statistical'], width, label='Statistical (60%)', alpha=0.8)
    axes[1].set_yticks(x)
    axes[1].set_yticklabels(top_10['Feature'])
    axes[1].invert_yaxis()
    axes[1].set_xlabel('Variable Importance', fontsize=12)
    axes[1].set_title('Comparison of VI Methods', fontsize=14, fontweight='bold')
    axes[1].legend()
    axes[1].grid(axis='x', alpha=0.3)

    plt.tight_layout()
    plt.show()

---
## 🔇 RLT STEP 3: Variable Muting (Removing Variables)
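
The muting rule used in this step (keep every feature whose aggregate VI reaches the threshold, but never fewer than five) can be isolated into a small function. The sketch here is a hedged restatement of that rule: `mute_features`, `min_keep`, and the toy VI table are hypothetical names and values chosen for illustration.

```python
import pandas as pd


def mute_features(vi_df, threshold=0.01, min_keep=5):
    """Split features into kept and muted lists based on aggregate VI."""
    ranked = vi_df.sort_values('VI_Aggregate', ascending=False)
    kept = ranked.loc[ranked['VI_Aggregate'] >= threshold, 'Feature'].tolist()
    if len(kept) < min_keep:
        # Fall back to the top-ranked features when too few pass the threshold
        kept = ranked['Feature'].head(min_keep).tolist()
    muted = [f for f in ranked['Feature'] if f not in kept]
    return kept, muted


# Toy VI table (assumption): seven features whose scores sum to 1
vi_demo_df = pd.DataFrame({
    'Feature': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'VI_Aggregate': [0.40, 0.30, 0.15, 0.10, 0.04, 0.006, 0.004],
})
kept, muted = mute_features(vi_demo_df)
print(kept)   # ['a', 'b', 'c', 'd', 'e']
print(muted)  # ['f', 'g']
```

Wrapping the rule in a function makes it easier to try other thresholds without rerunning the whole cell.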

In [None]:
if df is not None:
    print(f"🔇 APPLYING VARIABLE MUTING (threshold = {VI_THRESHOLD})")
    print("="*80)

    # Identify the features to keep
    high_vi_features = vi_df[vi_df['VI_Aggregate'] >= VI_THRESHOLD]['Feature'].tolist()
    low_vi_features = vi_df[vi_df['VI_Aggregate'] < VI_THRESHOLD]['Feature'].tolist()

    # Guarantee at least 5 features
    if len(high_vi_features) < 5:
        high_vi_features = vi_df.head(5)['Feature'].tolist()
        low_vi_features = vi_df.iloc[5:]['Feature'].tolist()
        print("⚠️ Fewer than 5 features above the threshold, keeping the top 5")

    # Build the muted dataset
    X_muted = X_scaled[high_vi_features]

    muted_count = len(low_vi_features)
    muted_pct = (muted_count / X_scaled.shape[1]) * 100

    print("\n📊 Muting Results:")
    print(f"  • Original features: {X_scaled.shape[1]}")
    print(f"  • Features kept: {len(high_vi_features)} ({100-muted_pct:.1f}%)")
    print(f"  • Features muted: {muted_count} ({muted_pct:.1f}%)")

    # List the muted features only when the list is short
    if muted_count > 0 and muted_count <= 10:
        print("\n🔇 Muted features (low VI):")
        for feat in low_vi_features[:10]:
            vi_value = vi_df[vi_df['Feature'] == feat]['VI_Aggregate'].values[0]
            print(f"    • {feat}: VI = {vi_value:.4f}")

    print(f"\n✓ Variable Muting complete - Reduction: {muted_pct:.1f}%")

---
## 🤖 RLT STEP 4: Model Training

### DSO1 - Basic Models:
1. **Baseline (Naive)** - Simple regression/classification on all features
2. **RLT-RandomForest** - Random Forest on the features kept after muting
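
One caveat about this comparison: in this notebook, scaling and the VI ranking are computed on the full dataset before cross-validation, so each validation fold has already influenced which features are kept. A hedged alternative, sketched here on synthetic data, is to move both steps inside a scikit-learn `Pipeline` so they are refit on each training fold. The sketch swaps the blended VI for plain Random Forest importances through `SelectFromModel`; that substitution, the `make_classification` data, and the step names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    # Muting stand-in: keep features whose RF importance is at least 0.01
    ('mute', SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=0.01,
    )),
    ('model', RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Scaling and selection are refit inside every training fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='accuracy')
print(f"Accuracy = {scores.mean():.4f} (±{scores.std():.4f})")
```

Scores obtained this way are typically a little less optimistic than scores computed after selecting features on all the data, which is the point of the exercise.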

### DSO2 (Future) - Other Embedded Models:
- XGBoost, LightGBM, Extra Trees, Gradient Boosting, etc.

In [None]:
if df is not None:
    print("="*80)
    print("DSO1: MODEL TRAINING".center(80))
    print("="*80)
    print("\n📌 DSO1 models:")
    print("  1. Baseline (Naive) - All features")
    print("  2. RLT-RandomForest - Features kept after muting")
    print("\n🔜 DSO2 will explore: XGBoost, LightGBM, Extra Trees, etc.")
    print("="*80)

    # Configuration depends on the problem type
    if problem_type == 'classification':
        baseline_model = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
        rlt_model = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
        scoring = 'accuracy'
        metric_name = 'Accuracy'
    else:
        baseline_model = LinearRegression()
        rlt_model = RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
        cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
        scoring = 'r2'
        metric_name = 'R²'

    # Cross-validate the baseline (all features)
    print("\n📊 BASELINE (Naive Model) - All features:")
    print("-" * 60)
    baseline_scores = cross_val_score(baseline_model, X_scaled, y, cv=cv, scoring=scoring, n_jobs=-1)
    baseline_mean = baseline_scores.mean()
    baseline_std = baseline_scores.std()
    print(f"  {metric_name} = {baseline_mean:.4f} (±{baseline_std:.4f})")
    print(f"  Number of features: {X_scaled.shape[1]}")

    # Cross-validate RLT (features kept after muting)
    print("\n📊 RLT-RANDOMFOREST - Features kept after muting:")
    print("-" * 60)
    rlt_scores = cross_val_score(rlt_model, X_muted, y, cv=cv, scoring=scoring, n_jobs=-1)
    rlt_mean = rlt_scores.mean()
    rlt_std = rlt_scores.std()
    print(f"  {metric_name} = {rlt_mean:.4f} (±{rlt_std:.4f})")
    print(f"  Number of features: {X_muted.shape[1]}")

    # Store the results
    results = {
        'Baseline': {'mean': baseline_mean, 'std': baseline_std, 'n_features': X_scaled.shape[1]},
        'RLT-RandomForest': {'mean': rlt_mean, 'std': rlt_std, 'n_features': X_muted.shape[1]}
    }

---
## 📊 RLT STEP 5: Evaluation and Comparison
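
The comparison cell in this step reports RLT's gain as a percentage of the baseline score, `(rlt_mean - baseline_mean) / baseline_mean * 100`. The helper sketched here shows the same arithmetic with two small safeguards that are assumptions of this example and not part of the notebook: it rejects a zero baseline, and it divides by the absolute value of the baseline so that the sign stays meaningful when R² is negative.

```python
def relative_improvement(baseline, candidate):
    """Percentage change of candidate relative to baseline."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    # abs() keeps "positive means better" even for a negative baseline
    return (candidate - baseline) / abs(baseline) * 100


# Accuracy going from 0.80 to 0.88 is a 10% relative gain
print(round(relative_improvement(0.80, 0.88), 2))  # 10.0
```

For positive baselines the helper gives the same value as the notebook's formula; the two only differ when the baseline score is negative.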

In [None]:
if df is not None:
    print("\n" + "="*80)
    print("FINAL COMPARISON: BASELINE vs RLT".center(80))
    print("="*80)

    # Relative improvement of RLT over the baseline
    improvement = ((rlt_mean - baseline_mean) / baseline_mean) * 100
    feature_reduction = muted_pct

    # Side-by-side display
    print("\n🏆 DSO1 RESULTS:")
    print("-" * 80)
    print(f"  Baseline (Naive):        {metric_name} = {baseline_mean:.4f} (±{baseline_std:.4f})")
    print(f"                           Features: {X_scaled.shape[1]}")
    print()
    print(f"  RLT-RandomForest:        {metric_name} = {rlt_mean:.4f} (±{rlt_std:.4f})")
    print(f"                           Features: {X_muted.shape[1]} (reduction: {feature_reduction:.1f}%)")
    print("-" * 80)

    print("\n💡 ANALYSIS:")
    print(f"  RLT improvement:         {improvement:+.2f}%")
    print(f"  Feature reduction:       {feature_reduction:.1f}%")

    winner = "RLT-RandomForest" if rlt_mean > baseline_mean else "Baseline"
    winner_icon = "🎯" if winner == "RLT-RandomForest" else "⚠️"
    print(f"\n{winner_icon} WINNER: {winner}")

    print("\n" + "="*80)
    print("💬 DSO1 CONCLUSION:")
    if rlt_mean > baseline_mean:
        print(f"  ✅ RLT improves performance by {improvement:.2f}%")
        print(f"  ✅ With {feature_reduction:.1f}% fewer features!")
    else:
        print(f"  ⚠️ Baseline is better by {abs(improvement):.2f}%")
        print("  💡 RLT may work better on high-dimensional datasets")
    print("\n🔜 DSO2 will test other embedded models (XGBoost, LightGBM, etc.)")
    print("="*80)

In [None]:
# Comparison visualization
if df is not None:
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Plot 1: performance comparison
    models = ['Baseline\n(Naive)', 'RLT-\nRandomForest']
    scores = [baseline_mean, rlt_mean]
    colors = ['steelblue', 'orange']

    bars = axes[0].bar(models, scores, color=colors, alpha=0.7, edgecolor='black', width=0.6)
    axes[0].set_ylabel(f'{metric_name} Score', fontsize=12, fontweight='bold')
    axes[0].set_title('DSO1: Baseline vs RLT-RandomForest', fontsize=14, fontweight='bold')
    axes[0].set_ylim([min(scores) * 0.95, max(scores) * 1.05])
    axes[0].grid(axis='y', alpha=0.3)

    # Add the values on top of the bars
    for bar, score in zip(bars, scores):
        height = bar.get_height()
        axes[0].text(bar.get_x() + bar.get_width()/2., height,
                    f'{score:.4f}', ha='center', va='bottom',
                    fontsize=12, fontweight='bold')

    # Plot 2: improvement and feature reduction
    metrics = ['Performance\nImprovement (%)', 'Feature\nReduction (%)']
    values = [improvement, feature_reduction]
    colors2 = ['green' if improvement > 0 else 'red', 'blue']

    bars = axes[1].bar(metrics, values, color=colors2, alpha=0.7, edgecolor='black', width=0.6)
    axes[1].axhline(y=0, color='black', linestyle='-', linewidth=1)
    axes[1].set_ylabel('Percentage (%)', fontsize=12, fontweight='bold')
    axes[1].set_title('RLT: Improvement vs Reduction', fontsize=14, fontweight='bold')
    axes[1].grid(axis='y', alpha=0.3)

    # Add the values; the label is checked by name because bar.get_x() returns a float
    for bar, value, label in zip(bars, values, metrics):
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height,
                    f'{value:+.1f}%' if 'Improvement' in label else f'{value:.1f}%',
                    ha='center', va='bottom' if height > 0 else 'top',
                    fontsize=12, fontweight='bold')

    plt.tight_layout()
    plt.show()

---
## 📈 Detailed Evaluation on the Test Set
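
For the regression case, the test-set cell in this section reports R² and RMSE. As a reminder of what those numbers mean, this sketch computes both by hand with NumPy and checks them against scikit-learn; the four-point `y_true` and `y_pred` arrays are made-up illustration values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (assumption, for illustration only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(f"R² = {r2_manual:.4f}, RMSE = {rmse_manual:.4f}")  # R² = 0.9486, RMSE = 0.6124
```

R² is unitless and compares the model to predicting the mean, while RMSE is expressed in the units of the target, so the two are worth reading together.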

In [None]:
if df is not None:
    print("📈 EVALUATION ON THE TEST SET")
    print("="*80)

    # Train/test split (stratified for classification)
    X_train_full, X_test_full, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=RANDOM_STATE,
        stratify=y if problem_type == 'classification' else None
    )

    X_train_muted = X_train_full[high_vi_features]
    X_test_muted = X_test_full[high_vi_features]

    # Train and predict
    baseline_model.fit(X_train_full, y_train)
    rlt_model.fit(X_train_muted, y_train)

    y_pred_baseline = baseline_model.predict(X_test_full)
    y_pred_rlt = rlt_model.predict(X_test_muted)

    # Metrics
    if problem_type == 'classification':
        baseline_score = accuracy_score(y_test, y_pred_baseline)
        rlt_score = accuracy_score(y_test, y_pred_rlt)

        print("\n🎯 Accuracy on the Test Set:")
        print(f"  Baseline:         {baseline_score:.4f}")
        print(f"  RLT-RandomForest: {rlt_score:.4f}")
        print(f"  Difference:       {(rlt_score - baseline_score):.4f}")

        print("\n📊 Classification Report (RLT):")
        print(classification_report(y_test, y_pred_rlt))
    else:
        baseline_r2 = r2_score(y_test, y_pred_baseline)
        rlt_r2 = r2_score(y_test, y_pred_rlt)

        baseline_rmse = np.sqrt(mean_squared_error(y_test, y_pred_baseline))
        rlt_rmse = np.sqrt(mean_squared_error(y_test, y_pred_rlt))

        print("\n🎯 Metrics on the Test Set:")
        print("\n  R² Score:")
        print(f"    Baseline:         {baseline_r2:.4f}")
        print(f"    RLT-RandomForest: {rlt_r2:.4f}")
        print("\n  RMSE:")
        print(f"    Baseline:         {baseline_rmse:.4f}")
        print(f"    RLT-RandomForest: {rlt_rmse:.4f}")

    # Relative improvement on the main test metric (accuracy or R²)
    test_baseline, test_rlt = (
        (baseline_score, rlt_score) if problem_type == 'classification'
        else (baseline_r2, rlt_r2)
    )
    test_improvement = (test_rlt - test_baseline) / test_baseline * 100

    print(f"\n💡 Improvement on Test: {test_improvement:+.2f}%")

---
## 💾 Saving the Results

In [None]:
if df is not None:
    # Build a summary table
    summary = pd.DataFrame({
        'Model': ['Baseline (Naive)', 'RLT-RandomForest'],
        'Score_Mean': [baseline_mean, rlt_mean],
        'Score_Std': [baseline_std, rlt_std],
        'N_Features': [X_scaled.shape[1], X_muted.shape[1]],
        'Dataset': [AVAILABLE_DATASETS[DATASET_CHOICE]['file']] * 2,
        'Problem_Type': [problem_type] * 2
    })

    output_file = f"DSO1_Results_{AVAILABLE_DATASETS[DATASET_CHOICE]['file'].replace('.', '_')}.csv"
    summary.to_csv(output_file, index=False)

    print(f"💾 Results saved: {output_file}")
    display(summary)

---
## 📝 DSO1 Conclusions

### 🎯 What we accomplished (DSO1):

1. **✅ Complete RLT implementation:**
   - Variable Importance (RF + Statistical)
   - Variable Muting
   - Baseline vs RLT comparison

2. **✅ DSO1 models:**
   - Naive baseline (Logistic/Linear Regression)
   - RLT-RandomForest

3. **✅ Rigorous evaluation:**
   - 5-fold cross-validation
   - Test set evaluation
   - Multiple metrics
   - Visualizations

### 🚀 For DSO2 (Future Work):

**DSO2** will explore other embedded models for RLT:

**Models to test:**
- 🔜 **XGBoost** - Optimized gradient boosting
- 🔜 **LightGBM** - Fast gradient boosting
- 🔜 **Extra Trees** - Random Forest variant
- 🔜 **Gradient Boosting** - Classic boosting
- 🔜 **CatBoost** - For categorical variables
- 🔜 **Neural Networks** - Deep learning approach

**Improvement ideas for DSO2:**
- Advanced feature engineering
- Hyperparameter tuning
- Model stacking
- Feature combinations (interactions)

### 📚 Recommendations:

**When to use RLT:**
- ✅ Datasets with > 20 features
- ✅ Presence of noisy variables
- ✅ Need for interpretability
- ✅ Speed constraints (fewer features)

**When to avoid RLT:**
- ⚠️ Datasets with < 10 features
- ⚠️ All features are important
- ⚠️ Very small samples (n < 100)

---

## 📖 References

1. **Zhu, R., Zeng, D., & Kosorok, M. R. (2015).** "Reinforcement Learning Trees." *Journal of the American Statistical Association*, 110(512), 1770-1784.

2. **Breiman, L. (2001).** "Random Forests." *Machine Learning*, 45(1), 5-32.

3. **CRISP-DM Methodology** - Cross-Industry Standard Process for Data Mining

---

**Authors:** Dhia Romdhane, Yosri Awedi, Baha Saadoui, Nour Rajhi, Bouguerra Taha, Oumaima Nacef  
**Course:** Machine Learning Project - DSO1  
**Date:** December 2025  
**Repository:** https://github.com/yosriawedi/ML-Project-RLT

---

## 🎉 DSO1 Complete!

This notebook demonstrated a **basic implementation and evaluation of the RLT methodology**.

**Next step:** DSO2 will explore other embedded models! 🚀