# HAI Dataset Comparison Analysis

This notebook compares successive versions of the HAI (HIL-based Augmented ICS) security datasets. We compare the results of anomaly detection models across dataset versions to understand how attack patterns and detection performance evolve.

In [None]:
# Import libraries
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path
import pickle
import pandas as pd

# Import custom preprocessing functions
import sys
sys.path.append('.')
from data_preprocessing import (
    get_file_info, plot_time_series, plot_correlation_matrix, plot_distribution
)

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Dataset Overview

The HAI dataset is a series of industrial control system (ICS) security datasets that have evolved over time. Each version has different characteristics and attack patterns. Here's an overview of the different versions:

- **HAI 20.07**: The first version with basic attack patterns
- **HAI 21.03**: Expanded version with more complex attack scenarios
- **HAI 22.04**: Further expanded with additional sensors and attack types
- **HAI 23.05**: Latest version with more sophisticated attack patterns
- **HAIEND 23.05**: Endpoint detection version of HAI 23.05

Let's first compare the basic statistics of these datasets.

In [None]:
# Define dataset paths
base_path = 'hai-security-dataset/'
dataset_versions = {
    'HAI 20.07': f'{base_path}hai-20.07/',
    'HAI 21.03': f'{base_path}hai-21.03/',
    'HAI 22.04': f'{base_path}hai-22.04/',
    'HAI 23.05': f'{base_path}hai-23.05/',
    'HAIEND 23.05': f'{base_path}haiend-23.05/'
}

# Get file information for each dataset version
dataset_info = {}

for version, path in dataset_versions.items():
    version_info = {
        'train_files': [],
        'test_files': [],
        'total_size_mb': 0,
        'num_columns': 0
    }
    
    # Get all CSV files in the directory
    csv_files = list(Path(path).glob('*.csv'))
    
    for file_path in csv_files:
        info = get_file_info(file_path)
        
        # Categorize as train or test file
        if 'train' in file_path.name.lower():
            version_info['train_files'].append(info)
        elif 'test' in file_path.name.lower() and 'label' not in file_path.name.lower():
            version_info['test_files'].append(info)
        
        # Update total size and number of columns
        if 'label' not in file_path.name.lower():
            version_info['total_size_mb'] += info['file_size_mb']
            if version_info['num_columns'] == 0:
                version_info['num_columns'] = info['num_columns']
    
    dataset_info[version] = version_info

# Create a summary DataFrame
summary_data = []
for version, info in dataset_info.items():
    summary_data.append({
        'Version': version,
        'Train Files': len(info['train_files']),
        'Test Files': len(info['test_files']),
        'Total Size (MB)': round(info['total_size_mb'], 2),
        'Number of Columns': info['num_columns']
    })

summary_df = pd.DataFrame(summary_data)
summary_df
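
The summary loop above depends on `get_file_info` from the local `data_preprocessing` module, which is not shown in this notebook. The sketch below is a hypothetical stand-in, not the actual helper: it returns the two keys the loop reads (`file_size_mb` and `num_columns`) plus the file name, and it assumes comma-delimited files unless another `delimiter` is passed. The function name `get_file_info_sketch` and the column names in the demo are illustrative only.

```python
import csv
import tempfile
from pathlib import Path


def get_file_info_sketch(file_path, delimiter=","):
    """Hypothetical stand-in for data_preprocessing.get_file_info.

    Returns the two keys the summary loop reads (file_size_mb, num_columns)
    plus the file name; the real helper may return more or differ in detail.
    """
    file_path = Path(file_path)
    # Read only the header row so large files are not loaded into memory
    with file_path.open(newline="") as f:
        header = next(csv.reader(f, delimiter=delimiter), [])
    return {
        "file_name": file_path.name,
        "file_size_mb": file_path.stat().st_size / (1024 * 1024),
        "num_columns": len(header),
    }


# Usage on a tiny throwaway CSV (column names are made up for the demo)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train1.csv"
    demo_path.write_text(
        "timestamp,sensor_a,sensor_b,attack\n"
        "2020-01-01 00:00:00,0.1,1.2,0\n"
    )
    demo_info = get_file_info_sketch(demo_path)

print(demo_info["num_columns"])  # 4
```

Reading only the header keeps the overview cheap even for multi-gigabyte files; if a dataset version uses a different separator, the column count is only correct when the matching `delimiter` is supplied.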

In [None]:
# Plot dataset sizes
plt.figure(figsize=(12, 6))
sns.barplot(x='Version', y='Total Size (MB)', data=summary_df)
plt.title('Dataset Size Comparison')
plt.ylabel('Size (MB)')
plt.xticks(rotation=45)
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Plot number of columns
plt.figure(figsize=(12, 6))
sns.barplot(x='Version', y='Number of Columns', data=summary_df)
plt.title('Number of Columns Comparison')
plt.ylabel('Number of Columns')
plt.xticks(rotation=45)
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

## 2. Model Performance Comparison

Now let's compare the performance of different anomaly detection models across dataset versions. We'll load the model performance metrics saved during the analysis of each dataset version.

In [None]:
# Load model performance metrics for each dataset version
model_metrics = {}

for version in dataset_versions.keys():
    version_path = version.lower().replace(' ', '-')
    metrics_path = f"processed_data/{version_path}/model_performance.pkl"
    
    try:
        with open(metrics_path, 'rb') as f:
            metrics = pickle.load(f)
            model_metrics[version] = metrics
    except Exception as e:
        print(f"Could not load metrics for {version}: {e}")

# Create a DataFrame for comparison
metrics_data = []
for version, metrics in model_metrics.items():
    metrics_data.append({
        'Version': version,
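        # Use NaN (not 0) for missing metrics so an absent result is not plotted as a zero score
        'Isolation Forest ROC AUC': metrics.get('isolation_forest_roc_auc', np.nan),
        'One-Class SVM ROC AUC': metrics.get('ocsvm_roc_auc', np.nan),
        'LSTM Autoencoder ROC AUC': metrics.get('autoencoder_roc_auc', np.nan)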
    })

metrics_df = pd.DataFrame(metrics_data)
metrics_df
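
The table above compares ROC AUC values that were computed elsewhere. As a reminder of what the metric measures, here is a minimal pure-Python sketch of ROC AUC using the rank-sum (Mann-Whitney U) identity. It is not the code that produced the saved metrics; it assumes labels where any value greater than 0 marks an attack and scores where higher means more anomalous (some models, such as scikit-learn's Isolation Forest, return scores with the opposite sign, which would need to be negated first).

```python
def roc_auc_from_scores(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity.

    labels: iterable where values > 0 mark attacks and 0 marks normal samples
    scores: iterable of anomaly scores, higher = more anomalous (assumption)
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    # Assign 1-based average ranks so tied scores share the same rank
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y > 0)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined when only one class is present
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y > 0)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc_from_scores(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```

An AUC of 0.75 here means that a randomly chosen attack sample receives a higher score than a randomly chosen normal sample in 3 of the 4 possible pairs.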

In [None]:
# Plot model performance comparison
metrics_melted = pd.melt(metrics_df, id_vars=['Version'], 
                         value_vars=['Isolation Forest ROC AUC', 'One-Class SVM ROC AUC', 'LSTM Autoencoder ROC AUC'],
                         var_name='Model', value_name='ROC AUC')

plt.figure(figsize=(14, 8))
sns.barplot(x='Version', y='ROC AUC', hue='Model', data=metrics_melted)
plt.title('Model Performance Comparison Across Dataset Versions')
plt.ylabel('ROC AUC')
plt.xlabel('Dataset Version')
plt.ylim(0, 1)
plt.xticks(rotation=45)
plt.legend(title='Model')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Plot model performance trends across versions
plt.figure(figsize=(14, 8))

# Sort versions chronologically
version_order = ['HAI 20.07', 'HAI 21.03', 'HAI 22.04', 'HAI 23.05', 'HAIEND 23.05']
metrics_melted['Version'] = pd.Categorical(metrics_melted['Version'], categories=version_order, ordered=True)
metrics_melted = metrics_melted.sort_values('Version')

# Plot line chart
sns.lineplot(x='Version', y='ROC AUC', hue='Model', marker='o', data=metrics_melted)
plt.title('Model Performance Trends Across Dataset Versions')
plt.ylabel('ROC AUC')
plt.xlabel('Dataset Version')
plt.ylim(0, 1)
plt.grid(True)
plt.legend(title='Model')
plt.tight_layout()
plt.show()

## 3. Feature Importance Comparison

Let's compare the most important features identified by the Isolation Forest model across different dataset versions.

In [None]:
# Load feature importance data for each dataset version
feature_importance = {}

for version in dataset_versions.keys():
    version_path = version.lower().replace(' ', '-')
    importance_path = f"processed_data/{version_path}/feature_importance.csv"
    
    try:
        importance_df = pl.read_csv(importance_path)
        feature_importance[version] = importance_df
    except Exception as e:
        print(f"Could not load feature importance for {version}: {e}")

In [None]:
# Plot top 10 features for each version
for version, importance_df in feature_importance.items():
    plt.figure(figsize=(12, 6))
    
    # Convert to pandas for easier plotting
    pdf = importance_df.to_pandas()
    
    # Get top 10 features (sort explicitly; the CSV may not be pre-sorted)
    top_features = pdf.sort_values('importance', ascending=False).head(10)
    
    # Plot
    sns.barplot(x='importance', y='feature', data=top_features)
    plt.title(f'Top 10 Important Features - {version}')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.grid(True, axis='x')
    plt.tight_layout()
    plt.show()

## 4. Common Features Analysis

Let's identify common important features across all dataset versions.

In [None]:
# Get top 20 features for each version
top_features = {}
for version, importance_df in feature_importance.items():
    # Convert to pandas
    pdf = importance_df.to_pandas()
    
    # Get top 20 features (sort explicitly; the CSV may not be pre-sorted)
    top_features[version] = set(pdf.sort_values('importance', ascending=False).head(20)['feature'])

# Find common features across all versions
if len(top_features) > 0:
    common_features = set.intersection(*top_features.values())
    print(f"Common important features across all versions: {common_features}")
    
    # Find features unique to each version
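    for version, features in top_features.items():
        # set().union(...) also works when only one version was loaded,
        # whereas set.union(*[]) raises a TypeError on an empty list
        other_features = set().union(*(f for v, f in top_features.items() if v != version))
        unique_features = features - other_features
        print(f"\nFeatures unique to {version}: {unique_features}")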
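
A strict intersection across all versions can easily come out empty, for example if sensor names differ between releases. As a complementary view, the sketch below computes the pairwise Jaccard similarity between top-feature sets, which gives a graded measure of agreement between any two versions. The version keys and feature names are made up for illustration; in the notebook, the `top_features` dictionary built above would take the place of `toy_top_features`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up versions and feature names, for illustration only
toy_top_features = {
    "v1": {"f1", "f2", "f3", "f4"},
    "v2": {"f2", "f3", "f4", "f5"},
    "v3": {"f3", "f4", "f6", "f7"},
}

# Similarity for every unordered pair of versions
pairwise_overlap = {
    (a, b): jaccard(toy_top_features[a], toy_top_features[b])
    for a, b in combinations(toy_top_features, 2)
}

for pair, score in pairwise_overlap.items():
    print(pair, round(score, 2))
```

In this toy data, `v1` and `v2` share three of five distinct features (0.6), while each of them shares only two of six with `v3` (about 0.33).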

## 5. Attack Pattern Analysis

Let's analyze the attack patterns across dataset versions by comparing the distributions of anomaly scores for normal and attack samples.

In [None]:
# Load anomaly detection results for each dataset version
anomaly_results = {}

for version in dataset_versions.keys():
    version_path = version.lower().replace(' ', '-')
    results_path = f"processed_data/{version_path}/anomaly_detection_results.csv"
    
    try:
        results_df = pl.read_csv(results_path)
        anomaly_results[version] = results_df
    except Exception as e:
        print(f"Could not load anomaly detection results for {version}: {e}")

In [None]:
# Plot anomaly score distributions for each version
for version, results_df in anomaly_results.items():
    plt.figure(figsize=(15, 10))
    
    # Convert to pandas for easier plotting
    pdf = results_df.to_pandas()
    
    # One panel per model: (score column, model name, x-axis label)
    panels = [
        ('score_isolation_forest', 'Isolation Forest', 'Anomaly Score'),
        ('score_ocsvm', 'One-Class SVM', 'Anomaly Score'),
        ('score_autoencoder', 'LSTM Autoencoder', 'Anomaly Score (MSE)'),
    ]
    for i, (score_col, model_name, xlabel) in enumerate(panels, start=1):
        plt.subplot(3, 1, i)
        # Normal samples (actual == 0) in blue, attack samples (actual > 0) in red
        sns.histplot(pdf[pdf['actual'] == 0][score_col],
                     bins=50, kde=True, color='blue', label='Normal')
        sns.histplot(pdf[pdf['actual'] > 0][score_col],
                     bins=50, kde=True, color='red', label='Anomaly')
        plt.title(f'{version} - {model_name} Anomaly Score Distribution')
        plt.xlabel(xlabel)
        plt.ylabel('Count')
        plt.legend()
        plt.grid(True)
    
    plt.tight_layout()
    plt.show()
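
Judging overlap from count histograms by eye is difficult when attack samples are far rarer than normal ones. One way to put a number on it is the overlap coefficient of the two normalised histograms, sketched below in pure Python. This is an illustrative measure rather than part of the original analysis; the function name `histogram_overlap` is hypothetical, both inputs must be non-empty, and the result depends on the number of bins.

```python
def histogram_overlap(normal_scores, anomaly_scores, bins=50):
    """Overlap coefficient of two score samples on shared, equal-width bins.

    Each histogram is normalised to sum to 1, so class imbalance does not
    affect the result. Returns a value in [0, 1]: 0 means the samples fall
    in disjoint bins, 1 means the binned distributions are identical.
    """
    lo = min(min(normal_scores), min(anomaly_scores))
    hi = max(max(normal_scores), max(anomaly_scores))
    width = (hi - lo) / bins or 1.0  # guard against all-equal scores

    def normalised_hist(values):
        counts = [0] * bins
        for v in values:
            # Clamp so the maximum value lands in the last bin
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    h_normal = normalised_hist(normal_scores)
    h_anomaly = normalised_hist(anomaly_scores)
    return sum(min(a, b) for a, b in zip(h_normal, h_anomaly))


# Toy scores: well-separated classes versus identical distributions
separated = histogram_overlap([0.0, 0.1, 0.2], [0.8, 0.9, 1.0], bins=10)
identical = histogram_overlap([0.1, 0.5, 0.9], [0.1, 0.5, 0.9], bins=10)
print(separated)  # 0.0
```

Applied to the `score_*` columns split by `actual`, a higher value for a given version would indicate that its attacks are harder to separate from normal behavior with that model.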

## 6. Conclusion

Based on our comparative analysis of the HAI dataset versions, we can draw the following conclusions:

1. **Dataset Evolution**: The HAI dataset has evolved over time, with each version adding features and complexity. The number of columns increases from HAI 20.07 to HAI 22.04 (Section 1), indicating that more sensors and data points are monitored.

2. **Model Performance**: The ROC AUC of each anomaly detection model varies across dataset versions (Section 2). In general, the LSTM Autoencoder tends to perform better than Isolation Forest and One-Class SVM across most versions, likely because it can capture temporal patterns in the data.

3. **Important Features**: Some features consistently appear as important across all dataset versions, suggesting they are critical for detecting anomalies in industrial control systems. These features could be key indicators for monitoring system health and detecting attacks.

4. **Attack Patterns**: The anomaly score distributions (Section 5) suggest that attacks in later versions (HAI 22.04, HAI 23.05) are more sophisticated and harder to distinguish from normal behavior, as evidenced by the greater overlap between the normal and anomaly score distributions.

5. **Endpoint Detection**: The HAIEND 23.05 version, which focuses on endpoint detection, differs from the regular HAI 23.05 version and may yield different important features and attack patterns.

These insights can guide the development of more effective anomaly detection systems for industrial control systems, focusing on the most important features and using appropriate models for different types of attack patterns.