# KNN analysis of the physical data

This notebook implements KNN classification on our physical dataset, using:
1. The original preprocessed data
2. The PCA-reduced data from our previous analysis

We will compare performance and create visualizations for both approaches.

In [9]:
# Libraries
import pandas as pd
import numpy as np
from pickleshare import PickleShareDB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    precision_score, recall_score, accuracy_score, f1_score,
    confusion_matrix, matthews_corrcoef, balanced_accuracy_score
)
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time
import tracemalloc
import os

## Loading the data

In [10]:
# Load data from prep_data
data_dir = '../prep_data' 
db = PickleShareDB(os.path.join(data_dir, 'kity'))

# Load raw data
df_phy_1 = db['df_phy_1']
df_phy_2 = db['df_phy_2']
df_phy_3 = db['df_phy_3']
df_phy_4 = db['df_phy_4']
df_phy_norm = db['df_phy_norm']

# Load PCA results
pca_results = db['pca_results_phy']

# Load label mapping
label_mapping = db['label_mapping']

## Data preparation

KNN relies on distances between samples, so the data must be prepared carefully:
1. Drop the Time column, since temporal information is not directly useful for classification
2. Keep the Tank and Flow sensor readings, which are key physical indicators
3. Keep the valve and pump states, which represent the system's actions
4. Scale the numeric features
5. Encode the categorical variables
6. Remove the constant or redundant columns identified in the previous analysis
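
The imputation, scaling and encoding steps listed above can also be chained in a scikit-learn `Pipeline`, so that their statistics are learned from the training split only. The following is a minimal sketch under stated assumptions, not the notebook's method: the toy frame, its column names (`Tank_1`, `Flow_sensor_1`, `Valv_1`) and the integer valve codes are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the physical dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Tank_1": rng.normal(50, 10, n),
    "Flow_sensor_1": rng.normal(5, 1, n),
    "Valv_1": rng.integers(0, 3, n),      # hypothetical valve state codes
})
toy.loc[::25, "Tank_1"] = np.nan          # a few missing sensor readings
y = (toy["Flow_sensor_1"] > 5).astype(int)

numeric_cols = ["Tank_1", "Flow_sensor_1"]
categorical_cols = ["Valv_1"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)               # statistics come from X_train only
toy_accuracy = model.score(X_test, y_test)
```

With this layout, the test rows are transformed using means, scales and categories learned from the training rows, which is the usual way to keep test-set information out of the features.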

In [11]:
def prepare_data(df):
    """Drop the Time column and split the features from the two label columns"""
    # Remove Time column
    df_prepared = df.drop(columns=['Time'])
    
    # Split features and labels
    X = df_prepared.drop(columns=['Label', 'Label_n'])
    y_label = df_prepared['Label']
    y_label_n = df_prepared['Label_n']
    
    return X, y_label, y_label_n

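def preprocess_data(df):
    """Preprocess data with imputation, scaling, and encoding.

    Note: the imputer and scaler are fitted on the whole dataset, before the
    train/test split, so test-set statistics leak into the training features.
    """
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified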
    # Separate numeric and categorical columns
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    categorical_cols = df.select_dtypes(include=['category', 'bool']).columns
    
    # Handle numeric features
    if len(numeric_cols) > 0:
        # First impute missing values
        num_imputer = SimpleImputer(strategy='mean')
        df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])
        
        # Then scale
        scaler = StandardScaler()
        df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    
    # Handle categorical features
    if len(categorical_cols) > 0:
        # First impute missing values with most frequent value
        cat_imputer = SimpleImputer(strategy='most_frequent')
        df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
        
        # Then one-hot encode
        df = pd.get_dummies(df, columns=categorical_cols)
    
    return df

# Prepare datasets
print("Preparing datasets...")
prepared_data = {}
for name, df in {
    'phy_1': df_phy_1,
    'phy_2': df_phy_2,
    'phy_3': df_phy_3,
    'phy_4': df_phy_4,
    'phy_norm': df_phy_norm
}.items():
    X, y_label, y_label_n = prepare_data(df)
    X_processed = preprocess_data(X)
    prepared_data[name] = {
        'X': X_processed,
        'y_label': y_label,
        'y_label_n': y_label_n
    }

# Combine datasets
X_all = pd.concat([data['X'] for data in prepared_data.values()])
y_label_all = pd.concat([data['y_label'] for data in prepared_data.values()])
y_label_n_all = pd.concat([data['y_label_n'] for data in prepared_data.values()])

# Get PCA data and handle any NaN values
X_pca = pca_results['transformed_data'].drop(columns=['Label', 'Label_n', 'source'])
if X_pca.isna().any().any():
    imputer = SimpleImputer(strategy='mean')
    X_pca = pd.DataFrame(
        imputer.fit_transform(X_pca),
        columns=X_pca.columns,
        index=X_pca.index
    )

print("Data preparation complete.")
print("X_all shape:", X_all.shape)
print("X_pca shape:", X_pca.shape)

# Check for any remaining NaN values
print("\nChecking for NaN values:")
print("X_all NaN count:", X_all.isna().sum().sum())
print("X_pca NaN count:", X_pca.isna().sum().sum())

Preparing datasets...
Data preparation complete.
X_all shape: (10923, 44)
X_pca shape: (10923, 5)

Checking for NaN values:
X_all NaN count: 8503
X_pca NaN count: 0
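
`X_all` still contains NaN values even though each dataset was imputed on its own. One plausible explanation, not verified against the actual datasets, is that the one-hot encoded frames do not share the same columns, in which case `pd.concat` fills the missing columns with NaN. The sketch below reproduces that effect on two hypothetical frames and shows one possible way to align them first.

```python
import pandas as pd

# Two hypothetical one-hot encoded frames whose categories differ.
a = pd.get_dummies(pd.DataFrame({"Pump": ["on", "off"]}), columns=["Pump"])
b = pd.get_dummies(pd.DataFrame({"Pump": ["on", "fault"]}), columns=["Pump"])

# Columns present in only one frame become NaN for the other frame's rows.
combined = pd.concat([a, b], ignore_index=True)
nan_count = int(combined.isna().sum().sum())

# Aligning every frame to the union of columns, with 0 for absent
# indicator columns, avoids the NaNs.
all_cols = sorted(set(a.columns) | set(b.columns))
aligned = pd.concat(
    [frame.reindex(columns=all_cols, fill_value=0) for frame in (a, b)],
    ignore_index=True,
)
```

Filling absent indicator columns with 0 states that the category never occurs in that dataset, whereas mean imputation would assign a fractional value to a binary indicator.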


## KNN implementation and evaluation functions

In [12]:
def train_evaluate_knn(X_train, X_test, y_train, y_test, n_neighbors=5):
    """Train and evaluate KNN model with performance metrics and NaN checks"""
    print("NaN check in train_evaluate_knn:")
    print("X_train NaNs:", X_train.isna().sum().sum())
    print("X_test NaNs:", X_test.isna().sum().sum())
    
    # Handle any remaining NaN values if needed
    if X_train.isna().sum().sum() > 0 or X_test.isna().sum().sum() > 0:
        imputer = SimpleImputer(strategy='mean')
        X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
        X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
    
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    
    # Measure training time and memory
    start_fit_time = time.time()
    tracemalloc.start()
    
    knn.fit(X_train, y_train)
    
    fit_time = time.time() - start_fit_time
    current, peak = tracemalloc.get_traced_memory()
    fit_memory_usage = peak / (1024 * 1024)  # Convert to MB
    tracemalloc.stop()
    
    # Measure prediction time and memory
    start_predict_time = time.time()
    tracemalloc.start()
    
    y_pred = knn.predict(X_test)
    
    predict_time = time.time() - start_predict_time
    current, peak = tracemalloc.get_traced_memory()
    predict_memory_usage = peak / (1024 * 1024)  # Convert to MB
    tracemalloc.stop()
    
    # Calculate metrics (labels=[0, 1] keeps the matrix 2x2 even if a class is absent)
    conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])
    TN, FP, FN, TP = conf_matrix.ravel()
    
    return {
        'confusion_matrix': conf_matrix,
        'TP': TP,
        'FP': FP,
        'TN': TN,
        'FN': FN,
        'precision': precision_score(y_test, y_pred, average='binary'),
        'recall': recall_score(y_test, y_pred, average='binary'),
        'tnr': TN / (TN + FP) if (TN + FP) != 0 else 0,
        'fpr': FP / (FP + TN) if (FP + TN) != 0 else 0,
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, average='binary'),
        'balanced_accuracy': balanced_accuracy_score(y_test, y_pred),
        'mcc': matthews_corrcoef(y_test, y_pred),
        'fit_time': fit_time,
        'predict_time': predict_time,
        'fit_memory_usage': fit_memory_usage,
        'predict_memory_usage': predict_memory_usage
    }
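
To make the returned metrics concrete, here is a small worked example that derives them by hand from the four confusion-matrix counts. The counts are hypothetical and chosen only for illustration; the formulas are the standard definitions of the metrics returned by `train_evaluate_knn`.

```python
import math

# Hypothetical confusion-matrix counts, for illustration only.
TN, FP, FN, TP = 90, 2, 3, 5

precision = TP / (TP + FP)
recall = TP / (TP + FN)                    # true positive rate
tnr = TN / (TN + FP)                       # true negative rate
fpr = FP / (FP + TN)                       # equals 1 - tnr
balanced_accuracy = (recall + tnr) / 2
f1 = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)
```

Because `fpr` is always `1 - tnr`, the two entries in the results dictionary carry the same information.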

## Model training and evaluation - Binary classification


In [13]:
# Split data for both original and PCA-transformed features
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_label_n_all, test_size=0.2, random_state=42
)

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca, y_label_n_all, test_size=0.2, random_state=42
)

# Train and evaluate models
results_original = train_evaluate_knn(X_train, X_test, y_train, y_test)
results_pca = train_evaluate_knn(X_train_pca, X_test_pca, y_train_pca, y_test_pca)

# Add metadata
results_original['data'] = 'PHY'
results_original['model_type'] = 'KNN'
results_original['attack_type'] = 'labeln'

results_pca['data'] = 'PHY'
results_pca['model_type'] = 'KNN PCA'
results_pca['attack_type'] = 'labeln'

NaN check in train_evaluate_knn:
X_train NaNs: 6800
X_test NaNs: 1703
NaN check in train_evaluate_knn:
X_train NaNs: 0
X_test NaNs: 0
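
The splits above are random but not stratified. When a class is rare, a stratified split keeps the class ratio the same in the training and test parts. A minimal sketch on hypothetical labels, using the `stratify` parameter of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives among 1000 samples.
y = np.array([1] * 20 + [0] * 980)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
positives_in_test = int(y_te.sum())   # 20% of the positives end up in the test part
```

Without `stratify`, the number of positives in the test part varies with the random seed, which makes per-class metrics less stable for rare classes.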


## Visualizing the binary classification results

In [14]:
def plot_confusion_matrix(conf_matrix, title):
    """Plot confusion matrix using plotly"""
    fig = go.Figure(data=go.Heatmap(
        z=conf_matrix,
        x=['Predicted Negative', 'Predicted Positive'],
        y=['Actual Negative', 'Actual Positive'],
        text=conf_matrix,
        texttemplate="%{text}",
        textfont={"size": 16},
        colorscale='Blues'
    ))
    
    fig.update_layout(
        title=title,
        height=400,
        width=500
    )
    
    return fig

# Create confusion matrix plots
fig1 = plot_confusion_matrix(
    results_original['confusion_matrix'],
    'Confusion Matrix - Original Features'
)
fig1.show()

fig2 = plot_confusion_matrix(
    results_pca['confusion_matrix'],
    'Confusion Matrix - PCA Features'
)
fig2.show()

In [15]:
# Compare performance metrics
metrics_to_compare = [
    'accuracy', 'precision', 'recall', 'f1',
    'balanced_accuracy', 'mcc'
]

comparison_data = {
    'Metric': metrics_to_compare,
    'Original': [results_original[m] for m in metrics_to_compare],
    'PCA': [results_pca[m] for m in metrics_to_compare]
}

fig = go.Figure(data=[
    go.Bar(name='Original', x=comparison_data['Metric'],
           y=comparison_data['Original']),
    go.Bar(name='PCA', x=comparison_data['Metric'],
           y=comparison_data['PCA'])
])

fig.update_layout(
    title='Performance Metrics Comparison',
    barmode='group',
    yaxis_title='Score'
)
fig.show()

In [16]:
# Compare computational performance
performance_metrics = ['fit_time', 'predict_time', 'fit_memory_usage', 'predict_memory_usage']

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Training Time', 'Prediction Time', 
                   'Training Memory Usage', 'Prediction Memory Usage')
)

fig.add_trace(
    go.Bar(
        x=['Original', 'PCA'],
        y=[results_original['fit_time'], results_pca['fit_time']],
        name='Training Time'
    ),
    row=1, col=1
)

fig.add_trace(
    go.Bar(
        x=['Original', 'PCA'],
        y=[results_original['predict_time'], results_pca['predict_time']],
        name='Prediction Time'
    ),
    row=1, col=2
)

fig.add_trace(
    go.Bar(
        x=['Original', 'PCA'],
        y=[results_original['fit_memory_usage'], results_pca['fit_memory_usage']],
        name='Training Memory'
    ),
    row=2, col=1
)

fig.add_trace(
    go.Bar(
        x=['Original', 'PCA'],
        y=[results_original['predict_memory_usage'], results_pca['predict_memory_usage']],
        name='Prediction Memory'
    ),
    row=2, col=2
)

fig.update_layout(
    height=800,
    title_text="Computational Performance Comparison",
    showlegend=False
)
fig.show()
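
In `train_evaluate_knn`, the timer starts before `tracemalloc.start()`, so the measured times include the cost of enabling tracing. One possible alternative is a small helper that starts tracing first and times only the call itself. The `measure` helper below is hypothetical and uses only the standard library; tracing still slows down allocations, so the timings remain indicative.

```python
import time
import tracemalloc

def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed seconds, peak traced memory in MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()            # always stop tracing, even on error
    return result, elapsed, peak / (1024 * 1024)

# Usage on a stand-in workload
result, elapsed, peak_mb = measure(lambda n: [i * i for i in range(n)], 100_000)
```

With a fitted model, the same helper could wrap `knn.fit` or `knn.predict` in place of the stand-in workload.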

## Multi-class classification


In [17]:
# Encode labels
le = LabelEncoder()
le.classes_ = np.array(list(label_mapping.keys()))
y_label_all_encoded = le.transform(y_label_all)

# Dictionary to store results for each attack type
multiclass_results = {}
multiclass_results_pca = {}

# Train and evaluate for each attack type
for attack_label, encoded_label in label_mapping.items():
    if attack_label != 'normal':  # Skip normal class as it's our reference
        print(f"\nProcessing attack type: {attack_label}")
        
        # Create binary labels for this attack type
        y_binary = (y_label_all_encoded == encoded_label).astype(int)
        
        # Original features
        X_train, X_test, y_train, y_test = train_test_split(
            X_all, y_binary, test_size=0.2, random_state=42
        )
        
        results = train_evaluate_knn(X_train, X_test, y_train, y_test)
        results['data'] = 'PHY'
        results['model_type'] = 'KNN'
        results['attack_type'] = attack_label
        multiclass_results[attack_label] = results
        
        # PCA features
        X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
            X_pca, y_binary, test_size=0.2, random_state=42
        )
        
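        # Use a separate name so the binary results_pca from the earlier cell is not overwritten
        attack_results_pca = train_evaluate_knn(X_train_pca, X_test_pca, y_train_pca, y_test_pca)
        attack_results_pca['data'] = 'PHY'
        attack_results_pca['model_type'] = 'KNN PCA'
        attack_results_pca['attack_type'] = attack_label
        multiclass_results_pca[attack_label] = attack_results_pca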


Processing attack type: DoS
NaN check in train_evaluate_knn:
X_train NaNs: 6800
X_test NaNs: 1703
NaN check in train_evaluate_knn:
X_train NaNs: 0
X_test NaNs: 0

Processing attack type: MITM
NaN check in train_evaluate_knn:
X_train NaNs: 6800
X_test NaNs: 1703
NaN check in train_evaluate_knn:
X_train NaNs: 0
X_test NaNs: 0

Processing attack type: physical fault
NaN check in train_evaluate_knn:
X_train NaNs: 6800
X_test NaNs: 1703
NaN check in train_evaluate_knn:
X_train NaNs: 0
X_test NaNs: 0

Processing attack type: scan
NaN check in train_evaluate_knn:
X_train NaNs: 6800
X_test NaNs: 1703
NaN check in train_evaluate_knn:
X_train NaNs: 0
X_test NaNs: 0



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



In [18]:
# Visualize multi-class results
def plot_attack_metrics(metric_name):
    attacks = list(multiclass_results.keys())
    original_values = [results[metric_name] for results in multiclass_results.values()]
    pca_values = [results[metric_name] for results in multiclass_results_pca.values()]
    
    fig = go.Figure(data=[
        go.Bar(name='Original', x=attacks, y=original_values),
        go.Bar(name='PCA', x=attacks, y=pca_values)
    ])
    
    fig.update_layout(
        title=f'{metric_name} by Attack Type',
        barmode='group',
        yaxis_title='Score'
    )
    return fig

# Plot key metrics
metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1']
for metric in metrics_to_plot:
    plot_attack_metrics(metric).show()

In [19]:
# Create summary table
summary_data = []

for attack_type in multiclass_results.keys():
    orig_results = multiclass_results[attack_type]
    # Named pca_res to avoid shadowing the pca_results loaded from the database
    pca_res = multiclass_results_pca[attack_type]
    
    summary_data.append({
        'Attack Type': attack_type,
        'Original Accuracy': orig_results['accuracy'],
        'PCA Accuracy': pca_res['accuracy'],
        'Original F1': orig_results['f1'],
        'PCA F1': pca_res['f1'],
        'Original Time (s)': orig_results['fit_time'] + orig_results['predict_time'],
        'PCA Time (s)': pca_res['fit_time'] + pca_res['predict_time']
    })

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df.round(4)
display(summary_df)

| | Attack Type | Original Accuracy | PCA Accuracy | Original F1 | PCA F1 | Original Time (s) | PCA Time (s) |
|---|---|---|---|---|---|---|---|
| 0 | DoS | 0.9977 | 0.9982 | 0.963 | 0.9697 | 0.1078 | 0.0998 |
| 1 | MITM | 0.9922 | 0.9881 | 0.9556 | 0.9326 | 0.1155 | 0.0982 |
| 2 | physical fault | 0.995 | 0.9973 | 0.9609 | 0.9789 | 0.1094 | 0.0991 |
| 3 | scan | 0.9982 | 0.9982 | 0.0 | 0.0 | 0.1105 | 0.0984 |
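
The scan row pairs an accuracy of 0.9982 with an F1 of 0, and the warnings above report that no positive samples were predicted. This pattern is typical of a very rare class: a model that never predicts it is still correct on almost every sample. The sketch below illustrates the effect with hypothetical counts (4 positives among 2185 test samples, all predicted as negative); the real class counts are not shown in the notebook output. It also uses the `zero_division` parameter mentioned in the warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Hypothetical test set: 4 attack samples among 2185, all predicted as normal.
y_true = np.zeros(2185, dtype=int)
y_true[:4] = 1
y_pred = np.zeros(2185, dtype=int)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0.0 without emitting the "ill-defined" warning.
prec = precision_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
```

For classes this rare, F1, balanced accuracy and MCC are more informative than accuracy, since all three penalize a model that never detects the attack.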


## Saving the results for Streamlit


In [20]:
# Save binary classification results
db['PHY_results_knn_labeln'] = results_original
db['PHY_results_knn_pca_labeln'] = results_pca

# Save multiclass results
for attack_label in multiclass_results.keys():
    db[f'PHY_results_knn_{attack_label}'] = multiclass_results[attack_label]
    db[f'PHY_results_knn_pca_{attack_label}'] = multiclass_results_pca[attack_label]

# Save summary statistics
db['knn_summary_stats'] = {
    'summary_df': summary_df,
    'binary_results': {
        'original': results_original,
        'pca': results_pca
    },
    'multiclass_results': {
        'original': multiclass_results,
        'pca': multiclass_results_pca
    }
}

## Analysis summary

Key points of the KNN analysis:

1. Binary classification (attack vs. normal):
   - Overall performance metrics
   - Comparison between the original features and the PCA-reduced features
   - Performance on imbalanced data

2. Multi-class classification (by attack type):
   - Performance variations across attack types
   - Identification of attacks that are easy or hard to detect
   - Impact of PCA on the different attack types

3. Computational efficiency:
   - Analysis of training and prediction times
   - Memory usage patterns
   - Scalability considerations

4. Comparison between the approaches:
   - Trade-offs between accuracy and computational resources
   - Advantages and drawbacks of PCA
   - Scenarios where each approach is preferable