# HAI Security Dataset Analysis

This notebook analyzes the HIL-based Augmented ICS (HAI) Security Dataset, which contains data collected from a realistic industrial control system (ICS) testbed augmented with a hardware-in-the-loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from glob import glob
from datetime import datetime

# Set plot style
plt.style.use('ggplot')
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

## 1. Dataset Overview

The HAI security dataset includes multiple versions (HAI-20.07, HAI-21.03, HAI-22.04, HAI-23.05, HAIEnd-23.05), each containing normal and abnormal behaviors for ICS anomaly detection research. Let's first explore the available datasets.

In [None]:
# List all available dataset versions (sorted, because glob returns paths in arbitrary order)
version_dirs = glob('hai-security-dataset/hai-*') + glob('hai-security-dataset/haiend-*')
# Keep directories only, so stray archives or other files are not treated as versions
dataset_versions = sorted(os.path.basename(path) for path in version_dirs if os.path.isdir(path))
print(f"Available dataset versions: {dataset_versions}")

## 2. Exploring HAI-22.04 Dataset

Let's start by exploring the HAI-22.04 dataset, which includes training and testing datasets.

In [None]:
# List all files in HAI-22.04 (sorted so that later index-based selections are deterministic)
hai_22_04_files = sorted(os.path.basename(file) for file in glob('hai-security-dataset/hai-22.04/*.csv'))
print(f"Files in HAI-22.04: {hai_22_04_files}")

In [None]:
# Load one training dataset to explore its structure
train1_path = 'hai-security-dataset/hai-22.04/train1.csv'
train1_df = pd.read_csv(train1_path)

# Display basic information
print(f"Shape of train1.csv: {train1_df.shape}")
print("\nFirst 5 rows:")
train1_df.head()

In [None]:
# Check column names
print(f"Number of columns: {len(train1_df.columns)}")
print("\nColumn names:")
train1_df.columns.tolist()

In [None]:
# Check data types and missing values
train1_df.info()

In [None]:
# Convert timestamp to datetime and report the covered time range
if 'timestamp' in train1_df.columns:
    train1_df['timestamp'] = pd.to_datetime(train1_df['timestamp'])
    print(f"Start time: {train1_df['timestamp'].min()}")
    print(f"End time: {train1_df['timestamp'].max()}")
    print(f"Duration: {train1_df['timestamp'].max() - train1_df['timestamp'].min()}")
else:
    print("No 'timestamp' column found")
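
The time range alone does not show whether the samples are evenly spaced. As a minimal sketch, assuming the data is meant to be sampled once per second (an assumption to verify against the dataset documentation), the hypothetical helper `find_time_gaps` below lists every step that differs from the expected interval. The timestamps in `demo` are made up for illustration.

```python
import pandas as pd

def find_time_gaps(timestamps: pd.Series, expected: str = "1s") -> pd.DataFrame:
    """Return the rows whose step from the previous timestamp differs from `expected`."""
    ts = pd.to_datetime(timestamps).reset_index(drop=True)
    step = ts.diff()  # first entry is NaT because it has no predecessor
    irregular = step.notna() & (step != pd.Timedelta(expected))
    return pd.DataFrame({"timestamp": ts[irregular], "step": step[irregular]})

# Toy timestamps (not taken from the dataset): one 3-second jump at the end
demo = pd.Series(pd.to_datetime([
    "2021-07-11 10:00:00", "2021-07-11 10:00:01",
    "2021-07-11 10:00:02", "2021-07-11 10:00:05",
]))
gaps = find_time_gaps(demo)
print(gaps)
```

On the real data this could be called as `find_time_gaps(train1_df['timestamp'])`; an empty result would indicate regular sampling.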

## 3. Analyzing Attack Labels

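Let's check whether the dataset contains attack labels and analyze their distribution. The training files are expected to contain only normal operation, so any attack count reported for `train1.csv` should be zero; labeled attacks are expected in the test files.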

In [None]:
# Check if attack labels are present
attack_columns = [col for col in train1_df.columns if 'attack' in col.lower()]
print(f"Attack label columns: {attack_columns}")

if attack_columns:
    # Count attack instances
    for col in attack_columns:
        attack_count = train1_df[col].sum()
        attack_percentage = (attack_count / len(train1_df)) * 100
        print(f"{col}: {attack_count} attacks ({attack_percentage:.2f}% of data)")
        
    # Plot attack distribution over time for the first attack column
    plt.figure(figsize=(14, 6))
    plt.plot(train1_df['timestamp'], train1_df[attack_columns[0]])
    plt.title(f'{attack_columns[0]} Distribution Over Time')
    plt.xlabel('Time')
    plt.ylabel('Attack (1) / Normal (0)')
    plt.show()

## 4. Exploring Data Points

According to the dataset's technical documentation, the data points belong to several control loops (P1-PC, P1-LC, P1-FC, P1-TC, P2-SC, P3-LC). Column names carry a process prefix (`P1_` to `P4_`), so we first group the columns by that prefix and then explore some key data points.

In [None]:
# Group data points by process prefix (P1_ to P4_)
controller_prefixes = ['P1_', 'P2_', 'P3_', 'P4_']
for prefix in controller_prefixes:
    cols = [col for col in train1_df.columns if col.startswith(prefix)]
    print(f"\n{prefix} data points ({len(cols)}): {cols[:5]}...")

In [None]:
# Select key data points for visualization
key_points = [
    'P1_B2016',  # Pressure demand for thermal power output control
    'P1_PIT01',  # Heat-exchanger outlet pressure
    'P1_B3004',  # Water level setpoint (return water tank)
    'P1_LIT01',  # Water level of the return water tank
    'P1_B3005',  # Discharge flowrate setpoint (return water tank)
    'P1_FT03',   # Measured flowrate of the return water tank
    'P1_B4022',  # Temperature demand for thermal power output control
    'P1_TIT01'   # Heat-exchanger outlet temperature
]

# Check if these columns exist in the dataset
existing_key_points = [col for col in key_points if col in train1_df.columns]
print(f"Available key points: {existing_key_points}")

In [None]:
# Plot time series for key data points
if existing_key_points:
    # Sample data to reduce plotting time (every 100th point)
    sampled_df = train1_df.iloc[::100].copy()
    
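    # Plot each key point (squeeze=False keeps `axes` an array even when there is only one subplot)
    fig, axes = plt.subplots(len(existing_key_points), 1, figsize=(14, 4*len(existing_key_points)), squeeze=False)
    axes = axes.ravel()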
    
    for i, point in enumerate(existing_key_points):
        axes[i].plot(sampled_df['timestamp'], sampled_df[point])
        axes[i].set_title(f'{point} Time Series')
        axes[i].set_xlabel('Time')
        axes[i].set_ylabel('Value')
    
    plt.tight_layout()
    plt.show()

## 5. Correlation Analysis

Let's analyze correlations between different data points to understand their relationships.

In [None]:
# Select a subset of columns for correlation analysis
if existing_key_points:
    # Calculate correlation matrix
    corr_matrix = train1_df[existing_key_points].corr()
    
    # Plot correlation heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Correlation Matrix of Key Data Points')
    plt.show()
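
A heatmap becomes hard to read once more columns are included. As a sketch of one way to summarize it, the hypothetical helper `top_correlated_pairs` below keeps only the upper triangle of the correlation matrix (so each pair appears once and self-correlations are dropped) and ranks the pairs by absolute correlation. The column names and values in `demo` are invented.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Boolean mask selecting entries strictly above the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data: 'b' is an exact multiple of 'a', 'c' is only loosely related
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, 3, 2, 5, 4],
})
top = top_correlated_pairs(demo, n=3)
print(top)
```

With the notebook's data, `top_correlated_pairs(train1_df[existing_key_points])` would list the strongest relationships among the key data points.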

## 6. Comparing Multiple Datasets

Let's compare key statistics across different training datasets.

In [None]:
# List all training datasets
train_files = [f for f in hai_22_04_files if f.startswith('train')]
print(f"Training datasets: {train_files}")

# Function to get basic statistics for a dataset
def get_dataset_stats(file_name):
    file_path = f'hai-security-dataset/hai-22.04/{file_name}'
    df = pd.read_csv(file_path)
    
    # Convert timestamp if it exists
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'])
    
    # Get attack information if available
    attack_cols = [col for col in df.columns if 'attack' in col.lower()]
    attack_info = {}
    for col in attack_cols:
        attack_info[col] = df[col].sum()
    
    return {
        'file_name': file_name,
        'rows': len(df),
        'columns': len(df.columns),
        'start_time': df['timestamp'].min() if 'timestamp' in df.columns else None,
        'end_time': df['timestamp'].max() if 'timestamp' in df.columns else None,
        'duration': df['timestamp'].max() - df['timestamp'].min() if 'timestamp' in df.columns else None,
        'attack_info': attack_info
    }

# Get statistics for all training datasets
train_stats = []
for file in train_files:
    try:
        stats = get_dataset_stats(file)
        train_stats.append(stats)
        print(f"Processed {file}")
    except Exception as e:
        print(f"Error processing {file}: {e}")

# Display statistics in a table
stats_df = pd.DataFrame(train_stats)
stats_df

## 7. Analyzing Test Datasets

Let's also analyze the test datasets to understand their characteristics.

In [None]:
# List all test datasets
test_files = [f for f in hai_22_04_files if f.startswith('test')]
print(f"Test datasets: {test_files}")

# Get statistics for all test datasets
test_stats = []
for file in test_files:
    try:
        stats = get_dataset_stats(file)
        test_stats.append(stats)
        print(f"Processed {file}")
    except Exception as e:
        print(f"Error processing {file}: {e}")

# Display statistics in a table
test_stats_df = pd.DataFrame(test_stats)
test_stats_df

## 8. Visualizing Attack Scenarios

Let's visualize some attack scenarios in the test datasets to understand their patterns.

In [None]:
# Load a test dataset with attacks
test_file = test_files[0]  # Using the first test file
test_path = f'hai-security-dataset/hai-22.04/{test_file}'
test_df = pd.read_csv(test_path)

# Convert timestamp to datetime
if 'timestamp' in test_df.columns:
    test_df['timestamp'] = pd.to_datetime(test_df['timestamp'])

# Check for attack columns
attack_cols = [col for col in test_df.columns if 'attack' in col.lower()]
print(f"Attack columns in {test_file}: {attack_cols}")

if attack_cols:
    # Create a combined attack column if multiple attack columns exist
    if len(attack_cols) > 1:
        test_df['any_attack'] = test_df[attack_cols].max(axis=1)
        attack_col = 'any_attack'
    else:
        attack_col = attack_cols[0]
    
    # Find attack periods from the 0->1 and 1->0 transitions of the label
    # (vectorized, because iterating with iterrows() is very slow on large files)
    is_attack = (test_df[attack_col].to_numpy() == 1).astype(int)
    # Pad with zeros so periods touching the first or last row are also closed
    edges = np.diff(np.concatenate(([0], is_attack, [0])))
    attack_starts = np.flatnonzero(edges == 1).tolist()
    attack_ends = (np.flatnonzero(edges == -1) - 1).tolist()  # inclusive end positions
    
    print(f"Found {len(attack_starts)} attack periods")
    
    # Visualize the first few attack periods with key data points
    num_attacks_to_show = min(3, len(attack_starts))
    
    for i in range(num_attacks_to_show):
        start_idx = max(0, attack_starts[i] - 100)  # Include some pre-attack data
        end_idx = min(len(test_df)-1, attack_ends[i] + 100)  # Include some post-attack data
        
        attack_df = test_df.iloc[start_idx:end_idx].copy()
        
        # Plot key data points during this attack
        if existing_key_points:
            fig, axes = plt.subplots(len(existing_key_points)+1, 1, figsize=(14, 3*(len(existing_key_points)+1)))
            
            # Plot attack label
            axes[0].plot(attack_df['timestamp'], attack_df[attack_col], 'r-')
            axes[0].set_title(f'Attack Period {i+1}')
            axes[0].set_ylabel('Attack')
            
            # Plot each key point
            for j, point in enumerate(existing_key_points):
                if point in attack_df.columns:
                    axes[j+1].plot(attack_df['timestamp'], attack_df[point])
                    axes[j+1].set_title(f'{point} During Attack Period {i+1}')
                    axes[j+1].set_ylabel('Value')
            
            plt.tight_layout()
            plt.show()
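
Besides plotting individual periods, it can help to tabulate them. The sketch below, which is an illustration and not part of the dataset tooling, assigns a run id to each stretch of identical labels with a shift-and-cumsum idiom and then reports start, end, sample count, and duration per attack period. The helper name `summarize_attack_periods` and the toy labels are assumptions.

```python
import pandas as pd

def summarize_attack_periods(timestamps, labels) -> pd.DataFrame:
    """Tabulate contiguous runs of label == 1 with their start, end, and duration."""
    df = pd.DataFrame({
        "timestamp": pd.to_datetime(pd.Series(list(timestamps))),
        "attack": pd.Series(list(labels)).astype(int),
    })
    # A new run starts whenever the label differs from the previous row
    run_id = (df["attack"] != df["attack"].shift()).cumsum()
    attacked = df["attack"] == 1
    periods = (
        df.loc[attacked, "timestamp"]
        .groupby(run_id[attacked])
        .agg(start="min", end="max", samples="count")
        .reset_index(drop=True)
    )
    periods["duration"] = periods["end"] - periods["start"]
    return periods

# Toy data: ten one-second samples containing two attack periods
demo_ts = pd.date_range("2022-01-01 00:00:00", periods=10, freq=pd.Timedelta(seconds=1))
demo_labels = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
periods = summarize_attack_periods(demo_ts, demo_labels)
print(periods)
```

For the test file loaded above, the call would be `summarize_attack_periods(test_df['timestamp'], test_df[attack_col])`.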

## 9. Comparing Different HAI Dataset Versions

Let's compare the structure and characteristics of different HAI dataset versions.

In [None]:
# Function to get basic information about a dataset version
def get_version_info(version):
    version_path = f'hai-security-dataset/{version}'
    files = glob(f'{version_path}/*.csv')
    file_names = [os.path.basename(file) for file in files]
    
    # str.startswith accepts a tuple of prefixes
    train_files = [f for f in file_names if f.startswith(('train', 'hai-train', 'end-train'))]
    test_files = [f for f in file_names if f.startswith(('test', 'hai-test', 'end-test'))]
    
    # Sample one file to get column count (nrows=0 reads only the header, not the whole file)
    sample_file = files[0] if files else None
    num_columns = 0
    if sample_file:
        try:
            df = pd.read_csv(sample_file, nrows=0)
            num_columns = len(df.columns)
        except Exception as e:
            print(f"Error reading {sample_file}: {e}")
    
    return {
        'version': version,
        'total_files': len(files),
        'train_files': len(train_files),
        'test_files': len(test_files),
        'train_file_names': train_files,
        'test_file_names': test_files,
        'num_columns': num_columns
    }

# Get information for all dataset versions
version_info = []
for version in dataset_versions:
    try:
        info = get_version_info(version)
        version_info.append(info)
        print(f"Processed {version}")
    except Exception as e:
        print(f"Error processing {version}: {e}")

# Display information in a table
version_info_df = pd.DataFrame(version_info)
version_info_df[['version', 'total_files', 'train_files', 'test_files', 'num_columns']]
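
Column counts alone do not show which columns differ between versions. A minimal sketch, assuming only that each file has a header row: read just the header with `nrows=0` and compare the resulting name sets. The helpers `read_header` and `compare_columns` are hypothetical, and the two in-memory CSV snippets are invented stand-ins for real files.

```python
import io
import pandas as pd

def read_header(source) -> list:
    """Read only the header row of a CSV file or file-like object."""
    return pd.read_csv(source, nrows=0).columns.tolist()

def compare_columns(cols_a, cols_b) -> dict:
    """Split two column lists into shared names and names unique to each side."""
    a, b = set(cols_a), set(cols_b)
    return {"shared": sorted(a & b), "only_a": sorted(a - b), "only_b": sorted(b - a)}

# Invented headers for illustration; real files would be passed as paths
csv_a = io.StringIO("time,P1_B2016,attack\n2020-01-01 00:00:00,1.0,0\n")
csv_b = io.StringIO("timestamp,P1_B2016,Attack\n2022-01-01 00:00:00,1.0,0\n")
diff = compare_columns(read_header(csv_a), read_header(csv_b))
print(diff)
```

A comparison like this also shows whether checks such as `'timestamp' in df.columns` will match in every version.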

## 10. Statistical Analysis of Data Points

Let's perform statistical analysis on key data points to understand their distributions and characteristics.

In [None]:
# Load a training dataset
train_file = train_files[0]  # Using the first training file
train_path = f'hai-security-dataset/hai-22.04/{train_file}'
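train_df = pd.read_csv(train_path)

# Convert timestamp to datetime so the time-series plots in later cells get a proper time axis
if 'timestamp' in train_df.columns:
    train_df['timestamp'] = pd.to_datetime(train_df['timestamp'])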

# Select numerical columns (excluding timestamp and attack labels)
numerical_cols = train_df.select_dtypes(include=['float64', 'int64']).columns.tolist()
numerical_cols = [col for col in numerical_cols if not ('attack' in col.lower() or 'timestamp' in col.lower())]

# Calculate basic statistics
stats = train_df[numerical_cols].describe()
stats
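
The `describe()` table is hard to scan when there are many columns. One quick check, offered as a sketch and not as part of the original analysis, is to list columns that never change within a file: they add nothing to distribution plots, and their correlation with other columns is undefined. The helper name `constant_columns` and the toy column names are assumptions.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame: 'P_a' and 'P_c' never change, 'P_b' does
demo = pd.DataFrame({
    "P_a": [1.0, 1.0, 1.0],
    "P_b": [0.1, 0.2, 0.3],
    "P_c": [0, 0, 0],
})
constant = constant_columns(demo)
print(constant)  # ['P_a', 'P_c']
```

Applied to the notebook's data, this would be `constant_columns(train_df[numerical_cols])`.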

In [None]:
# Visualize distributions of key data points
if existing_key_points:
    # Create histograms for each key point (squeeze=False keeps `axes` an array even for one subplot)
    fig, axes = plt.subplots(len(existing_key_points), 1, figsize=(14, 4*len(existing_key_points)), squeeze=False)
    axes = axes.ravel()
    
    for i, point in enumerate(existing_key_points):
        if point in train_df.columns:
            sns.histplot(train_df[point], kde=True, ax=axes[i])
            axes[i].set_title(f'Distribution of {point}')
            axes[i].set_xlabel('Value')
            axes[i].set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()

## 11. Analyzing Control Loops

According to the dataset's technical documentation, the data comes from several control loops (P1-PC, P1-LC, P1-FC, P1-TC, P2-SC, P3-LC). Let's analyze the relationships between setpoints (SP), process variables (PV), and control variables (CV) in these loops; the code below covers the four P1 loops.

In [None]:
# Define control loops with their setpoints, process variables, and control variables
control_loops = {
    'P1-PC': {
        'SP': 'P1_B2016',  # Pressure demand
        'PV': 'P1_PIT01',  # Heat-exchanger outlet pressure
        'CV': ['P1_PCV01D', 'P1_PCV02D']  # Position command for pressure control valves
    },
    'P1-LC': {
        'SP': 'P1_B3004',  # Water level setpoint
        'PV': 'P1_LIT01',  # Water level of the return water tank
        'CV': 'P1_LCV01D'  # Position command for level control valve
    },
    'P1-FC': {
        'SP': 'P1_B3005',  # Discharge flowrate setpoint
        'PV': 'P1_FT03',   # Measured flowrate of the return water tank
        'CV': 'P1_FCV03D'  # Position command for flow control valve
    },
    'P1-TC': {
        'SP': 'P1_B4022',  # Temperature demand
        'PV': 'P1_TIT01',  # Heat-exchanger outlet temperature
        'CV': ['P1_FCV01D', 'P1_FCV02D']  # Position command for flow control valves
    }
}

# Analyze each control loop
for loop_name, loop_vars in control_loops.items():
    print(f"\nAnalyzing {loop_name} control loop")
    
    # Check if all variables exist in the dataset
    sp = loop_vars['SP']
    pv = loop_vars['PV']
    cv = loop_vars['CV'] if isinstance(loop_vars['CV'], list) else [loop_vars['CV']]
    
    missing_vars = [var for var in [sp, pv] + cv if var not in train_df.columns]
    if missing_vars:
        print(f"  Missing variables: {missing_vars}")
        continue
    
    # Plot setpoint and process variable
    plt.figure(figsize=(14, 6))
    
    # Sample data to reduce plotting time (every 100th point)
    sampled_df = train_df.iloc[::100].copy()
    
    plt.plot(sampled_df['timestamp'], sampled_df[sp], 'r-', label=f'Setpoint ({sp})')
    plt.plot(sampled_df['timestamp'], sampled_df[pv], 'b-', label=f'Process Variable ({pv})')
    
    plt.title(f'{loop_name} Control Loop: Setpoint vs. Process Variable')
    plt.xlabel('Time')
    plt.ylabel('Value')
    plt.legend()
    plt.show()
    
    # Plot control variables
    plt.figure(figsize=(14, 6))
    
    for control_var in cv:
        plt.plot(sampled_df['timestamp'], sampled_df[control_var], label=f'Control Variable ({control_var})')
    
    plt.title(f'{loop_name} Control Loop: Control Variables')
    plt.xlabel('Time')
    plt.ylabel('Value (%)')
    plt.legend()
    plt.show()
    
    # Calculate error (difference between setpoint and process variable)
    train_df[f'{loop_name}_error'] = train_df[sp] - train_df[pv]
    
    # Plot error histogram
    plt.figure(figsize=(10, 6))
    sns.histplot(train_df[f'{loop_name}_error'], kde=True)
    plt.title(f'{loop_name} Control Loop: Error Distribution')
    plt.xlabel('Error (SP - PV)')
    plt.ylabel('Frequency')
    plt.show()
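
The error histograms can be condensed into a few numbers per loop. The sketch below computes the mean error (bias), mean absolute error, and root-mean-square error of SP minus PV. The function name `tracking_error_stats` and the toy values are assumptions, and these three metrics are a common choice, not something prescribed by the dataset.

```python
import numpy as np
import pandas as pd

def tracking_error_stats(sp: pd.Series, pv: pd.Series) -> dict:
    """Summarize how closely a process variable follows its setpoint."""
    error = (sp - pv).dropna()
    return {
        "bias": float(error.mean()),                   # signed average of SP - PV
        "mae": float(error.abs().mean()),              # mean absolute error
        "rmse": float(np.sqrt((error ** 2).mean())),   # root-mean-square error
    }

# Toy values: constant setpoint, process variable fluctuating around it
demo_sp = pd.Series([10.0, 10.0, 10.0, 10.0])
demo_pv = pd.Series([9.0, 11.0, 10.0, 12.0])
stats_demo = tracking_error_stats(demo_sp, demo_pv)
print(stats_demo)
```

Inside the loop above, the equivalent call would be `tracking_error_stats(train_df[sp], train_df[pv])`, which makes the loops comparable at a glance as long as their units are kept in mind.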

## 12. Analyzing Attack Scenarios

Let's analyze the attack scenarios in the test datasets to understand their impact on different control loops.

In [None]:
# Load a test dataset
test_file = test_files[0]  # Using the first test file
test_path = f'hai-security-dataset/hai-22.04/{test_file}'
test_df = pd.read_csv(test_path)

# Convert timestamp to datetime
if 'timestamp' in test_df.columns:
    test_df['timestamp'] = pd.to_datetime(test_df['timestamp'])

# Check for attack columns
attack_cols = [col for col in test_df.columns if 'attack' in col.lower()]
print(f"Attack columns in {test_file}: {attack_cols}")

if attack_cols:
    # Analyze attacks for each control loop
    for loop_name, loop_vars in control_loops.items():
        # Check if all variables exist in the dataset
        sp = loop_vars['SP']
        pv = loop_vars['PV']
        cv = loop_vars['CV'] if isinstance(loop_vars['CV'], list) else [loop_vars['CV']]
        
        missing_vars = [var for var in [sp, pv] + cv if var not in test_df.columns]
        if missing_vars:
            print(f"  Missing variables for {loop_name}: {missing_vars}")
            continue
        
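        # Prefer an attack column specific to this loop's process (e.g. one containing 'P1') if present;
        # otherwise fall back to the first attack column
        process_prefix = loop_name.split('-')[0].lower()
        loop_attack_col = next((col for col in attack_cols if process_prefix in col.lower()), attack_cols[0])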
        
        # Find attack periods from the 0->1 and 1->0 transitions of the label
        # (vectorized, because iterating with iterrows() is very slow on large files)
        is_attack = (test_df[loop_attack_col].to_numpy() == 1).astype(int)
        # Pad with zeros so periods touching the first or last row are also closed
        edges = np.diff(np.concatenate(([0], is_attack, [0])))
        attack_starts = np.flatnonzero(edges == 1).tolist()
        attack_ends = (np.flatnonzero(edges == -1) - 1).tolist()  # inclusive end positions
        
        print(f"\nFound {len(attack_starts)} attack periods for {loop_name}")
        
        if not attack_starts:
            continue
        
        # Analyze the first attack period
        start_idx = max(0, attack_starts[0] - 100)  # Include some pre-attack data
        end_idx = min(len(test_df)-1, attack_ends[0] + 100)  # Include some post-attack data
        
        attack_df = test_df.iloc[start_idx:end_idx].copy()
        
        # Plot setpoint, process variable, and control variables during the attack
        fig, axes = plt.subplots(3, 1, figsize=(14, 12))
        
        # Plot attack label
        axes[0].plot(attack_df['timestamp'], attack_df[loop_attack_col], 'r-')
        axes[0].set_title(f'{loop_name} Attack Period')
        axes[0].set_ylabel('Attack')
        
        # Plot setpoint and process variable
        axes[1].plot(attack_df['timestamp'], attack_df[sp], 'r-', label=f'Setpoint ({sp})')
        axes[1].plot(attack_df['timestamp'], attack_df[pv], 'b-', label=f'Process Variable ({pv})')
        axes[1].set_title(f'{loop_name} Control Loop: Setpoint vs. Process Variable During Attack')
        axes[1].set_ylabel('Value')
        axes[1].legend()
        
        # Plot control variables
        for control_var in cv:
            axes[2].plot(attack_df['timestamp'], attack_df[control_var], label=f'Control Variable ({control_var})')
        
        axes[2].set_title(f'{loop_name} Control Loop: Control Variables During Attack')
        axes[2].set_xlabel('Time')
        axes[2].set_ylabel('Value (%)')
        axes[2].legend()
        
        plt.tight_layout()
        plt.show()
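
To put a number on the impact that the plots suggest, one option is to compare the absolute SP minus PV error between normal and attack samples. This is a rough sketch under the assumption of a single binary label column; the helper `error_by_label` and the column names `SP`, `PV`, and `Attack` in the toy frame are placeholders.

```python
import pandas as pd

def error_by_label(df: pd.DataFrame, sp: str, pv: str, label: str) -> pd.DataFrame:
    """Aggregate the absolute SP - PV error separately for each label value."""
    abs_err = (df[sp] - df[pv]).abs()
    return abs_err.groupby(df[label]).agg(["mean", "max", "count"])

# Toy data: the process variable drifts away from the setpoint while Attack == 1
demo = pd.DataFrame({
    "SP": [10.0] * 6,
    "PV": [10.0, 9.0, 10.0, 14.0, 15.0, 10.0],
    "Attack": [0, 0, 0, 1, 1, 0],
})
table = error_by_label(demo, "SP", "PV", "Attack")
print(table)
```

In the loop above, this corresponds to `error_by_label(test_df, sp, pv, loop_attack_col)`. A large difference between the two rows hints that the loop was affected, though attacks aimed at other loops can dilute the comparison.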

## 13. Conclusion

In this notebook, we've analyzed the HAI security dataset, focusing on:

1. Dataset structure and characteristics
2. Key data points and their relationships
3. Control loops and their behavior
4. Attack scenarios and their impact

The HAI dataset records industrial control system behavior under both normal and attack conditions, which makes it useful for developing and testing anomaly detection algorithms.