# Random Forest Analysis of the Physical Data

This notebook applies Random Forest classification to our physical dataset for attack detection. As in our other analyses, we evaluate performance on binary classification (attack vs. normal) and on multi-class classification (specific attack types).

In [None]:
# Libraries
import pandas as pd
import numpy as np
from pickleshare import PickleShareDB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    precision_score, recall_score, accuracy_score, f1_score,
    confusion_matrix, matthews_corrcoef, balanced_accuracy_score
)
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time
import tracemalloc
import os
np.random.seed(42)

## Loading the Data

In [2]:
# Load data from prep_data
data_dir = '../prep_data' 
db = PickleShareDB(os.path.join(data_dir, 'kity'))

# Load raw data
df_phy_1 = db['df_phy_1']
df_phy_2 = db['df_phy_2']
df_phy_3 = db['df_phy_3']
df_phy_4 = db['df_phy_4']
df_phy_norm = db['df_phy_norm']

# Load label mapping
label_mapping = db['label_mapping']

## Data Preparation

For Random Forest, we keep most of the numeric features but can drop redundant or uninformative columns:
- Drop the Time column, since it is not relevant for classification
- Keep the Tank and Flow sensor readings, since they are key physical indicators
- Keep the valve and pump states, since they represent the system's actions
- Drop all constant columns identified in the previous analysis
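
The preparation code in this notebook drops only `Time` and the label columns, so the constant columns may already have been removed upstream. If they have not, the last step in the list could be sketched roughly as follows. `drop_constant_columns` is a hypothetical helper and the toy frame uses invented column names; neither comes from the original notebook.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that hold a single distinct value.

    Hypothetical helper, not part of the original notebook.
    """
    # nunique(dropna=False) counts NaN as a value, so an all-NaN column
    # is also treated as constant.
    n_unique = df.nunique(dropna=False)
    constant_cols = n_unique[n_unique <= 1].index
    return df.drop(columns=constant_cols)


# Toy frame with invented column names, for illustration only.
toy = pd.DataFrame({
    "Tank_1": [10, 12, 11],
    "Valv_1": [1, 1, 1],      # constant, will be dropped
    "Pump_1": [0, 1, 0],
})
reduced = drop_constant_columns(toy)
print(list(reduced.columns))
```

Detecting constants on the combined data rather than per dataset would avoid dropping a column that is constant in one capture but varies in another.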

In [3]:
def prepare_data(df):
    """Drop the Time column and split the features from the two label columns"""
    # Remove Time column
    df_prepared = df.drop(columns=['Time'])
    
    # Split features and labels
    X = df_prepared.drop(columns=['Label', 'Label_n'])
    y_label = df_prepared['Label']
    y_label_n = df_prepared['Label_n']
    
    return X, y_label, y_label_n

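def preprocess_data(df):
    """Impute missing values, scale numeric columns, one-hot encode categoricals.

    Note: the imputers and the scaler are fitted on the full frame passed in,
    i.e. before any train/test split, and df is modified in place.
    """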
    # Separate numeric and categorical columns
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    categorical_cols = df.select_dtypes(include=['category', 'bool']).columns
    
    # Handle numeric features
    if len(numeric_cols) > 0:
        num_imputer = SimpleImputer(strategy='mean')
        scaler = StandardScaler()
        df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])
        df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    
    # Handle categorical features
    if len(categorical_cols) > 0:
        cat_imputer = SimpleImputer(strategy='most_frequent')
        df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
        df = pd.get_dummies(df, columns=categorical_cols)
    
    return df

# Prepare datasets
print("Preparing datasets...")
prepared_data = {}
for name, df in {
    'phy_1': df_phy_1,
    'phy_2': df_phy_2,
    'phy_3': df_phy_3,
    'phy_4': df_phy_4,
    'phy_norm': df_phy_norm
}.items():
    X, y_label, y_label_n = prepare_data(df)
    X_processed = preprocess_data(X)
    prepared_data[name] = {
        'X': X_processed,
        'y_label': y_label,
        'y_label_n': y_label_n
    }

# Combine datasets
X_all = pd.concat([data['X'] for data in prepared_data.values()])
y_label_all = pd.concat([data['y_label'] for data in prepared_data.values()])
y_label_n_all = pd.concat([data['y_label_n'] for data in prepared_data.values()])

print("Data preparation complete.")
print("X_all shape:", X_all.shape)

Preparing datasets...
Data preparation complete.
X_all shape: (10923, 44)
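
Because `preprocess_data` fits the imputer and scaler on each full dataset, rows that later land in the test set contribute to the statistics used for the transform. One possible alternative, sketched here on synthetic data and not taken from the original notebook, is to wrap the preprocessing and the classifier in a scikit-learn `Pipeline` so that everything is fitted on the training split only. The names `X_toy`, `y_toy` and `pipe` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_all / y_label_n_all (invented data).
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
X_toy.iloc[::10, 0] = np.nan          # inject some missing values
y_toy = (X_toy["b"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Imputer and scaler statistics are learned from X_tr only;
# predict() reuses them on X_te without refitting.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipe.fit(X_tr, y_tr)
y_hat = pipe.predict(X_te)
print("test rows:", len(y_hat))
```

Tree-based models are generally insensitive to monotonic rescaling of features, so the scaling step is kept here only to mirror the notebook's preprocessing.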


## Random Forest Implementation and Evaluation Functions

In [4]:
def train_evaluate_rf(X_train, X_test, y_train, y_test):
    """Train and evaluate Random Forest model with performance metrics"""
    print("NaN check in train_evaluate_rf:")
    print("X_train NaNs:", X_train.isna().sum().sum())
    print("X_test NaNs:", X_test.isna().sum().sum())
    
    # Handle any remaining NaN values if needed
    if X_train.isna().sum().sum() > 0 or X_test.isna().sum().sum() > 0:
        imputer = SimpleImputer(strategy='mean')
        X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
        X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
    
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Measure training time and memory
    tracemalloc.start()
    start_fit_time = time.time()
    
    rf.fit(X_train, y_train)
    
    fit_time = time.time() - start_fit_time
    current, peak = tracemalloc.get_traced_memory()
    fit_memory_usage = peak / (1024 * 1024)  # Convert to MB
    tracemalloc.stop()
    
    # Measure prediction time and memory
    tracemalloc.start()
    start_predict_time = time.time()
    
    y_pred = rf.predict(X_test)
    
    predict_time = time.time() - start_predict_time
    current, peak = tracemalloc.get_traced_memory()
    predict_memory_usage = peak / (1024 * 1024)  # Convert to MB
    tracemalloc.stop()
    
    # Calculate metrics
    conf_matrix = confusion_matrix(y_test, y_pred)
    TN, FP, FN, TP = conf_matrix.ravel()
    
    return {
        'confusion_matrix': conf_matrix,
        'TP': TP,
        'FP': FP,
        'TN': TN,
        'FN': FN,
        'precision': precision_score(y_test, y_pred, average='binary'),
        'recall': recall_score(y_test, y_pred, average='binary'),
        'tnr': TN / (TN + FP) if (TN + FP) != 0 else 0,
        'fpr': FP / (FP + TN) if (FP + TN) != 0 else 0,
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, average='binary'),
        'balanced_accuracy': balanced_accuracy_score(y_test, y_pred),
        'mcc': matthews_corrcoef(y_test, y_pred),
        'fit_time': fit_time,
        'predict_time': predict_time,
        'fit_memory_usage': fit_memory_usage,
        'predict_memory_usage': predict_memory_usage,
        'feature_importance': dict(zip(X_train.columns, rf.feature_importances_))
    }
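
The unpacking `TN, FP, FN, TP = conf_matrix.ravel()` relies on the matrix being 2×2 with the negative class first. The small sketch below, on invented labels, shows that ordering and how passing `labels=[0, 1]` keeps the shape stable even when a split contains a single class. It assumes the labels are encoded as 0 (normal) and 1 (attack).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes, in label order.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

tnr = tn / (tn + fp)   # true negative rate (specificity)
fpr = fp / (fp + tn)   # false positive rate, equal to 1 - tnr
print(tn, fp, fn, tp, tnr, fpr)

# With only one class present, labels=[0, 1] still yields a 2x2 matrix,
# so the four-way unpacking does not fail.
cm_single = confusion_matrix([0, 0], [0, 0], labels=[0, 1])
print(cm_single.shape)
```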

## Model Training and Evaluation - Binary Classification

In [5]:
# Split data for binary classification
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_label_n_all, test_size=0.2, random_state=42
)

# Train and evaluate model
binary_results = train_evaluate_rf(X_train, X_test, y_train, y_test)

# Add metadata
binary_results['data'] = 'PHY'
binary_results['model_type'] = 'Random Forest'
binary_results['attack_type'] = 'labeln'

NaN check in train_evaluate_rf:
X_train NaNs: 6800
X_test NaNs: 1703
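
These counts show missing values reaching the model even though `preprocess_data` imputes each dataset. One plausible explanation, which is an assumption rather than something this notebook verifies, is that the per-dataset frames do not share exactly the same columns, so `pd.concat` fills the gaps with NaN. The toy frames below use invented column names to illustrate the effect.

```python
import pandas as pd

# Two toy frames with partly different columns (invented names).
a = pd.DataFrame({"Tank_1": [1.0, 2.0], "Pump_1_True": [1, 0]})
b = pd.DataFrame({"Tank_1": [3.0], "Pump_2_True": [1]})

# concat aligns on the union of columns and fills the gaps with NaN.
combined = pd.concat([a, b], ignore_index=True)
print(combined.isna().sum().sum())

# Columns that are NaN in at least one row were absent from some frame.
partly_missing = combined.columns[combined.isna().any()].tolist()
print(partly_missing)
```

If the gaps do come from one-hot columns that are absent in some datasets, filling them with 0 may be more meaningful than the mean imputation applied in `train_evaluate_rf`.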


## Visualizing the Binary Classification Results

In [6]:
# Plot confusion matrix
fig = go.Figure(data=go.Heatmap(
    z=binary_results['confusion_matrix'],
    x=['Predicted Negative', 'Predicted Positive'],
    y=['Actual Negative', 'Actual Positive'],
    text=binary_results['confusion_matrix'],
    texttemplate="%{text}",
    textfont={"size": 16},
    colorscale='Blues'
))

fig.update_layout(
    title='Confusion Matrix - Binary Classification',
    height=400,
    width=500
)

fig.show()

In [7]:
# Plot feature importance
feature_importance = pd.DataFrame(
    list(binary_results['feature_importance'].items()),
    columns=['Feature', 'Importance']
).sort_values('Importance', ascending=False).head(10)

fig = go.Figure(data=[
    go.Bar(x=feature_importance['Feature'],
           y=feature_importance['Importance'])
])

fig.update_layout(
    title='Top 10 Most Important Features',
    xaxis_title='Feature',
    yaxis_title='Importance Score'
)

fig.show()

## Multi-class Classification

In [8]:
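# Encode labels using the class order stored in label_mapping.
# Setting classes_ directly skips fit(); this assumes each integer in
# label_mapping equals the position of its key in the mapping.
le = LabelEncoder()
le.classes_ = np.array(list(label_mapping.keys()))
y_label_all_encoded = le.transform(y_label_all)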

# Dictionary to store results for each attack type
multiclass_results = {}

# Train and evaluate a one-vs-rest binary model for each attack type
for attack_label, encoded_label in label_mapping.items():
    if attack_label != 'normal':  # Skip normal class as it's our reference
        print(f"\nProcessing attack type: {attack_label}")
        
        # Create binary labels for this attack type
        y_binary = (y_label_all_encoded == encoded_label).astype(int)
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X_all, y_binary, test_size=0.2, random_state=42
        )
        
        results = train_evaluate_rf(X_train, X_test, y_train, y_test)
        results['data'] = 'PHY'
        results['model_type'] = 'Random Forest'
        results['attack_type'] = attack_label
        multiclass_results[attack_label] = results


Processing attack type: DoS
NaN check in train_evaluate_rf:
X_train NaNs: 6800
X_test NaNs: 1703

Processing attack type: MITM
NaN check in train_evaluate_rf:
X_train NaNs: 6800
X_test NaNs: 1703

Processing attack type: physical fault
NaN check in train_evaluate_rf:
X_train NaNs: 6800
X_test NaNs: 1703

Processing attack type: scan
NaN check in train_evaluate_rf:
X_train NaNs: 6800
X_test NaNs: 1703



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



## Visualizing the Multi-class Classification Results

In [9]:
def plot_attack_metrics(metric_name):
    attacks = list(multiclass_results.keys())
    values = [results[metric_name] for results in multiclass_results.values()]
    
    fig = go.Figure(data=[go.Bar(x=attacks, y=values)])
    
    fig.update_layout(
        title=f'{metric_name} by Attack Type',
        yaxis_title='Score',
        xaxis_title='Attack Type'
    )
    return fig

# Plot key metrics
metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1']
for metric in metrics_to_plot:
    plot_attack_metrics(metric).show()

In [10]:
# Plot confusion matrices for each attack type
for attack_type, results in multiclass_results.items():
    fig = go.Figure(data=go.Heatmap(
        z=results['confusion_matrix'],
        x=['Predicted Negative', 'Predicted Positive'],
        y=['Actual Negative', 'Actual Positive'],
        text=results['confusion_matrix'],
        texttemplate="%{text}",
        textfont={"size": 16},
        colorscale='Blues'
    ))
    
    fig.update_layout(
        title=f'Confusion Matrix - {attack_type}',
        height=400,
        width=500
    )
    
    fig.show()

In [11]:
# Create summary table
summary_data = []

for attack_type, results in multiclass_results.items():
    summary_data.append({
        'Attack Type': attack_type,
        'Accuracy': results['accuracy'],
        'Precision': results['precision'],
        'Recall': results['recall'],
        'F1 Score': results['f1'],
        'Training Time (s)': results['fit_time'],
        'Prediction Time (s)': results['predict_time'],
        'Training Memory (MB)': results['fit_memory_usage'],
        'Prediction Memory (MB)': results['predict_memory_usage']
    })

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df.round(4)
display(summary_df)

| Attack Type | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s) | Training Memory (MB) | Prediction Memory (MB) |
|---|---|---|---|---|---|---|---|---|
| DoS | 0.9977 | 1.0 | 0.9254 | 0.9612 | 0.5198 | 0.0102 | 2.2601 | 0.4601 |
| MITM | 0.9977 | 0.9947 | 0.9791 | 0.9868 | 0.7058 | 0.0141 | 2.2591 | 0.4599 |
| physical fault | 0.995 | 0.9571 | 0.964 | 0.9606 | 0.6343 | 0.0166 | 2.256 | 0.4618 |
| scan | 0.9982 | 0.0 | 0.0 | 0.0 | 0.5109 | 0.0107 | 2.2556 | 0.4591 |
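
The `scan` row combines an accuracy of 0.9982 with precision and recall of 0, which indicates that the model predicted no positive sample for that class. With about 2,185 test rows (20% of 10,923), that accuracy corresponds to roughly four misclassified samples, so the class appears to be very rare. One option that might help, sketched below on synthetic labels and not part of the original notebook, is the `stratify` parameter of `train_test_split`, which preserves class proportions in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 20 positives out of 1000 (invented).
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy keeps the positive rate (2%) the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.sum(), y_te.sum())
```

Stratification only balances the split; it does not by itself make a rare class easier to learn, so options such as `class_weight='balanced'` on `RandomForestClassifier` may also be worth evaluating.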


## Feature Importance Analysis

In [12]:
# Analyze feature importance across different attack types
feature_importance_by_attack = {}

for attack_type, results in multiclass_results.items():
    # Get top 10 features for each attack type
    importance_df = pd.DataFrame(
        list(results['feature_importance'].items()),
        columns=['Feature', 'Importance']
    ).sort_values('Importance', ascending=False).head(10)
    
    feature_importance_by_attack[attack_type] = importance_df
    
    # Plot
    fig = go.Figure(data=[
        go.Bar(x=importance_df['Feature'],
               y=importance_df['Importance'])
    ])

    fig.update_layout(
        title=f'Top 10 Most Important Features for {attack_type}',
        xaxis_title='Feature',
        yaxis_title='Importance Score'
    )

    fig.show()

## Computational Performance Analysis

In [13]:
# Plot computational performance metrics
performance_fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=('Training and Prediction Time', 'Memory Usage')
)

# Time metrics
attacks = list(multiclass_results.keys())
train_times = [results['fit_time'] for results in multiclass_results.values()]
predict_times = [results['predict_time'] for results in multiclass_results.values()]

performance_fig.add_trace(
    go.Bar(name='Training Time', x=attacks, y=train_times),
    row=1, col=1
)
performance_fig.add_trace(
    go.Bar(name='Prediction Time', x=attacks, y=predict_times),
    row=1, col=1
)

# Memory metrics
train_memory = [results['fit_memory_usage'] for results in multiclass_results.values()]
predict_memory = [results['predict_memory_usage'] for results in multiclass_results.values()]

performance_fig.add_trace(
    go.Bar(name='Training Memory', x=attacks, y=train_memory),
    row=2, col=1
)
performance_fig.add_trace(
    go.Bar(name='Prediction Memory', x=attacks, y=predict_memory),
    row=2, col=1
)

performance_fig.update_layout(
    height=800,
    title_text="Computational Performance by Attack Type",
    barmode='group'
)

performance_fig.show()
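
The timing and memory figures plotted above come from paired `time` and `tracemalloc` calls repeated around `fit` and `predict`. As a sketch of how that pattern could be factored out, the hypothetical `measure` context manager below (not part of the original notebook) records wall-clock time and peak traced memory into a dictionary. A list comprehension stands in for the model call.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(stats: dict):
    """Record wall-clock seconds and peak traced memory (MB) into stats.

    Hypothetical helper, not part of the original notebook.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield stats
    finally:
        stats["seconds"] = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        stats["peak_mb"] = peak / (1024 * 1024)
        tracemalloc.stop()


fit_stats = {}
with measure(fit_stats):
    data = [i * i for i in range(100_000)]  # stand-in for rf.fit(...)
print(fit_stats)
```

Note that `tracemalloc` only sees allocations made through Python's memory allocator, so native allocations inside compiled extensions may not be counted and the reported megabytes are best read as a lower bound.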

## Saving the Results for Streamlit

In [14]:
# Save binary classification results
db['PHY_results_rf_labeln'] = binary_results

# Save multiclass results
for attack_label in multiclass_results.keys():
    db[f'PHY_results_rf_{attack_label}'] = multiclass_results[attack_label]

# Save summary statistics
db['rf_summary_stats'] = {
    'summary_df': summary_df,
    'binary_results': binary_results,
    'multiclass_results': multiclass_results,
    'feature_importance_by_attack': feature_importance_by_attack
}

## Analysis Summary

Key points of the Random Forest analysis:

1. Binary classification (attack vs. normal):
   - Overall performance metrics
   - Most important features for detection
   - Comparison with the other models

2. Multi-class classification (by attack type):
   - Performance variation across attack types
   - Feature importance patterns for different attacks
   - Attack-specific detection challenges

3. Computational efficiency:
   - Analysis of training and prediction times
   - Memory usage patterns
   - Scalability considerations

4. Feature importance analysis:
   - Key indicators for different attack types
   - Important features shared across attack types
   - Potential for optimization through feature selection