# HAI-21.03 Dataset Anomaly Detection - LSTM Autoencoder Model

This notebook trains an LSTM Autoencoder on the HAI-21.03 dataset and uses its reconstruction error to detect anomalies.

The HAI dataset contains time-series data from an industrial control system (ICS) testbed. The training files record normal operation only, while the test files contain labeled attacks.

## 1. Import Necessary Libraries

In [None]:
# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import os
import pickle
import time
import gc

# Import deep learning libraries
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential, load_model, save_model
from tensorflow.keras.layers import Input, LSTM, RepeatVector, Dense, Dropout, TimeDistributed
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

# Import preprocessing libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve, auc, roc_curve, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Set plot style
plt.style.use('ggplot')
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Configure TensorFlow to use memory growth
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    for device in physical_devices:
        tf.config.experimental.set_memory_growth(device, True)
        print(f"Memory growth enabled for {device}")
else:
    print("No GPU devices found, using CPU.")

## 2. Load Dataset

In [None]:
# Set data path
data_path = "../../hai-security-dataset/hai-21.03/"

# Load training data
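# os.listdir returns names in arbitrary order, so sort them to concatenate the files deterministically
train_files = sorted(f for f in os.listdir(data_path) if f.startswith('train'))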
train_dfs = []

for file in train_files:
    print(f"Loading training file: {file}")
    df = pd.read_csv(f"{data_path}{file}", sep=",")
    train_dfs.append(df)
    
# Combine training data
train_df = pd.concat(train_dfs, ignore_index=True)
print(f"Training data shape: {train_df.shape}")

In [None]:
# Load test data
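# os.listdir returns names in arbitrary order, so sort them to concatenate the files deterministically
test_files = sorted(f for f in os.listdir(data_path) if f.startswith('test'))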
test_dfs = []

for file in test_files:
    print(f"Loading test file: {file}")
    df = pd.read_csv(f"{data_path}{file}", sep=",")
    test_dfs.append(df)
    
# Combine test data
test_df = pd.concat(test_dfs, ignore_index=True)
print(f"Test data shape: {test_df.shape}")

## 3. Data Preprocessing

In [None]:
# Check basic information of the dataset
print("Column names in training dataset:")
print(train_df.columns.tolist())

# Check for missing values
print("\nMissing values in training dataset:")
print(train_df.isnull().sum().sum())

print("\nMissing values in test dataset:")
print(test_df.isnull().sum().sum())

In [None]:
# Convert timestamp to datetime
train_df['time'] = pd.to_datetime(train_df['time'])
test_df['time'] = pd.to_datetime(test_df['time'])

# Extract attack labels from test data
attack_columns = [col for col in test_df.columns if 'attack' in col.lower()]
print(f"Attack label columns: {attack_columns}")

# Create a combined attack label (if multiple attack columns exist)
if len(attack_columns) > 1:
    test_df['attack_combined'] = test_df[attack_columns].max(axis=1)
    y_test = test_df['attack_combined']
else:
    y_test = test_df[attack_columns[0]]

# Print attack distribution
print(f"\nAttack distribution in test data:\n{y_test.value_counts()}")

In [None]:
# Select features for training and testing
# Exclude timestamp and attack labels
feature_columns = [col for col in train_df.columns if col not in ['time'] + attack_columns]
print(f"Number of features: {len(feature_columns)}")

# Prepare training and testing data
X_train_raw = train_df[feature_columns].values
X_test_raw = test_df[feature_columns].values

# Scale the data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)
X_test_scaled = scaler.transform(X_test_raw)

print(f"Training data shape after preprocessing: {X_train_scaled.shape}")
print(f"Testing data shape after preprocessing: {X_test_scaled.shape}")
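
The scaler above is fitted on the training rows only, so test values that fall outside the training range are mapped outside [0, 1]. The sketch below reproduces that behavior with plain NumPy on toy data. The helper names `fit_min_max` and `apply_min_max` are hypothetical, and the optional clipping is an illustration of one way to bound the inputs, not something the notebook does.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-feature minimum and range, computed from training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # Constant columns: avoid division by zero
    return col_min, col_range

def apply_min_max(data, col_min, col_range, clip=False):
    """Scale data with training statistics; optionally clip the result to [0, 1]."""
    scaled = (data - col_min) / col_range
    return np.clip(scaled, 0.0, 1.0) if clip else scaled

# Toy data: feature 0 spans 0..10 in training, feature 1 is constant
train = np.array([[0.0, 5.0], [10.0, 5.0]])
test = np.array([[15.0, 5.0], [-5.0, 6.0]])

col_min, col_range = fit_min_max(train)
scaled_test = apply_min_max(test, col_min, col_range)
clipped_test = apply_min_max(test, col_min, col_range, clip=True)
print(scaled_test)
print(clipped_test)
```

Out-of-range values are not necessarily a problem for this approach: they tend to raise the reconstruction error, which is the signal the detector relies on.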

## 4. Create Sequences for LSTM (Memory-Efficient Version)

In [None]:
# Define sequence length
seq_length = 20  # Reduced from 30 to save memory

# Function to create sequences in batches to save memory
def create_sequences_batch(data, seq_length, batch_size=1000):
    n_samples = len(data) - seq_length + 1  # Number of full windows, including the final one
    n_batches = (n_samples + batch_size - 1) // batch_size  # Ceiling division
    
    sequences = []
    for i in range(n_batches):
        start_idx = i * batch_size
        end_idx = min(start_idx + batch_size, n_samples)
        
        batch_sequences = []
        for j in range(start_idx, end_idx):
            sequence = data[j:j+seq_length]
            batch_sequences.append(sequence)
        
        sequences.extend(batch_sequences)
        
        # Clear memory
        del batch_sequences
        gc.collect()
        
    return np.array(sequences)

# Take a contiguous block of the training data to reduce memory usage.
# The rows must stay in time order: picking rows at random would scramble the
# series, and the windows built from it would carry no temporal structure.
sample_size = min(100000, len(X_train_scaled))  # Limit to 100,000 rows
X_train_sample = X_train_scaled[:sample_size]
print(f"Using the first {sample_size} rows for training (reduced from {len(X_train_scaled)})")

# Create sequences for training data
print("Creating sequences for training data...")
X_train_seq = create_sequences_batch(X_train_sample, seq_length)
print(f"Training sequences shape: {X_train_seq.shape}")

# Split training data into training and validation sets
X_train_lstm, X_val_lstm = train_test_split(X_train_seq, test_size=0.2, random_state=42)
print(f"LSTM training data shape: {X_train_lstm.shape}")
print(f"LSTM validation data shape: {X_val_lstm.shape}")

# Clear memory
del X_train_sample, X_train_seq
gc.collect()
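
The nested loops in `create_sequences_batch` build every window one slice at a time. As an alternative sketch, NumPy's `sliding_window_view` (available in NumPy 1.20 and later) produces the same overlapping windows as a read-only view without copying the data. The helper name `make_windows` and the toy array are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(data, seq_length):
    """Return overlapping windows with shape (n_windows, seq_length, n_features)."""
    # sliding_window_view appends the window axis last:
    # (n_windows, n_features, seq_length)
    view = sliding_window_view(data, window_shape=seq_length, axis=0)
    # Move the window axis to the middle to get (n_windows, seq_length, n_features)
    return view.transpose(0, 2, 1)

toy = np.arange(12, dtype=float).reshape(6, 2)  # 6 timesteps, 2 features
windows = make_windows(toy, seq_length=3)
print(windows.shape)
```

Because the result is a view, memory is only allocated when the windows are copied, for example by `np.array(windows)` or by shuffling them into training and validation sets.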

## 5. Build LSTM Autoencoder Model (Simplified)

In [None]:
# Define model parameters
input_dim = X_train_lstm.shape[2]  # Number of features
timesteps = X_train_lstm.shape[1]  # Sequence length
latent_dim = 32  # LSTM units per layer (reduced from 64 to save memory)

# Build the LSTM Autoencoder model (simplified version)
def build_lstm_autoencoder(input_dim, timesteps, latent_dim):
    # Encoder: compress each window into a single latent vector
    inputs = Input(shape=(timesteps, input_dim))
    encoded = LSTM(latent_dim, return_sequences=False)(inputs)
    
    # Decoder: repeat the latent vector once per timestep and reconstruct the window
    decoded = RepeatVector(timesteps)(encoded)
    decoded = LSTM(latent_dim, return_sequences=True)(decoded)
    decoded = TimeDistributed(Dense(input_dim))(decoded)
    
    # Autoencoder model
    autoencoder = Model(inputs, decoded)
    autoencoder.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
    
    # Encoder model (for extracting the latent representation)
    encoder = Model(inputs, encoded)
    
    return autoencoder, encoder

# Create the model
autoencoder, encoder = build_lstm_autoencoder(input_dim, timesteps, latent_dim)

# Print model summary
autoencoder.summary()
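
To sanity-check the parameter counts that `autoencoder.summary()` prints, the arithmetic can be sketched by hand. A standard LSTM layer has four gates, each with input weights, recurrent weights, and a bias. The function names are hypothetical, and `n_features = 79` is a placeholder; substitute the feature count printed during preprocessing.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of one LSTM layer: 4 gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """Trainable parameters of a Dense layer: weights plus biases."""
    return input_dim * units + units

def autoencoder_param_count(n_features, units):
    """Encoder LSTM + decoder LSTM + per-timestep Dense output layer."""
    encoder = lstm_param_count(n_features, units)
    decoder = lstm_param_count(units, units)  # Decoder input is the repeated latent vector
    output = dense_param_count(units, n_features)  # Shared across timesteps by TimeDistributed
    return encoder + decoder + output

n_features = 79  # Placeholder value, not taken from the notebook output
total = autoencoder_param_count(n_features, units=32)
print(f"Estimated trainable parameters: {total}")
```

Note that the sequence length does not appear in the count: the LSTM and the `TimeDistributed` layer reuse the same weights at every timestep.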

## 6. Train the Model with Memory Optimization

In [None]:
# Define callbacks
model_dir = "./"
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, "lstm_autoencoder_model.h5")

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model_checkpoint = ModelCheckpoint(model_path, monitor='val_loss', save_best_only=True)

# Train the model with smaller batch size and fewer epochs
print("Training LSTM Autoencoder model...")
start_time = time.time()

history = autoencoder.fit(
    X_train_lstm, X_train_lstm,
    epochs=20,  # Reduced from 50 to 20
    batch_size=32,  # Reduced from 64 to 32
    validation_data=(X_val_lstm, X_val_lstm),
    callbacks=[early_stopping, model_checkpoint],
    verbose=1
)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Clear memory
del X_train_lstm, X_val_lstm
gc.collect()

In [None]:
# Plot training history
plt.figure(figsize=(12, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)
plt.show()
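
The `EarlyStopping` callback used above stops training once the validation loss has failed to improve for `patience` consecutive epochs. The following is a simplified pure-Python sketch of that rule, not the Keras implementation; the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # Epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # Stop here; best weights come from best_epoch
    return len(val_losses) - 1, best_epoch  # Ran out of epochs without triggering

# Illustrative losses: improvement stalls after epoch 2
losses = [1.0, 0.9, 0.8, 0.85, 0.86, 0.87, 0.88, 0.89, 0.7]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch, best_epoch)
```

In this example training stops before the later, lower loss is ever reached, which is the trade-off a small `patience` makes. With `restore_best_weights=True`, the model keeps the weights from `best_epoch` rather than from the epoch where training stopped.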

## 7. Save the Model and Scaler

In [None]:
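# The best model was already written to model_path by the ModelCheckpoint callback
# during training, so only the scaler needs to be saved here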
scaler_filename = os.path.join(model_dir, "scaler.pkl")
with open(scaler_filename, 'wb') as file:
    pickle.dump(scaler, file)
    
print(f"Model saved to {model_path}")
print(f"Scaler saved to {scaler_filename}")

## 8. Evaluate the Model (Memory-Efficient Version)

In [None]:
# Process test data in batches to save memory
def process_test_data_in_batches(X_test, y_test, seq_length, batch_size=1000):
    all_mse = []
    all_y = []
    
    n_sequences = len(X_test) - seq_length + 1  # Number of full windows, including the final one
    for i in range(0, n_sequences, batch_size):
        end_idx = min(i + batch_size, n_sequences)
        print(f"Processing batch {i//batch_size + 1}, samples {i} to {end_idx}")
        
        # Create sequences for this batch
        X_batch_seq = []
        y_batch = []
        
        for j in range(i, end_idx):
            X_batch_seq.append(X_test[j:j+seq_length])
            # Use the label of the last timestep in the sequence
            y_batch.append(y_test.iloc[j+seq_length-1])
        
        X_batch_seq = np.array(X_batch_seq)
        
        # Get reconstructions
        X_batch_pred = autoencoder.predict(X_batch_seq, batch_size=32, verbose=0)
        
        # Calculate reconstruction error (MSE) for each sequence
        batch_mse = np.mean(np.square(X_batch_seq - X_batch_pred), axis=(1, 2))
        
        all_mse.extend(batch_mse)
        all_y.extend(y_batch)
        
        # Clear memory
        del X_batch_seq, X_batch_pred, batch_mse
        gc.collect()
    
    return np.array(all_mse), np.array(all_y)

# Process test data
print("Processing test data in batches...")
start_time = time.time()

mse, y_test_seq = process_test_data_in_batches(X_test_scaled, y_test, seq_length)

processing_time = time.time() - start_time
print(f"Test data processing completed in {processing_time:.2f} seconds")
print(f"Processed {len(mse)} test sequences")
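
The evaluation above averages the squared error over both the time axis and the feature axis to get one score per sequence. Averaging over time only keeps a per-feature breakdown, which can hint at which signal drives a high score. This is a sketch on synthetic arrays; the variable names are hypothetical and no model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
windows = rng.random((4, 5, 3))   # 4 windows, 5 timesteps, 3 features
reconstructed = windows.copy()    # Pretend the reconstruction is perfect...
reconstructed[2, :, 1] += 0.5     # ...except for feature 1 in window 2

squared_error = np.square(windows - reconstructed)

# One score per window: mean over timesteps (axis 1) and features (axis 2)
per_window = squared_error.mean(axis=(1, 2))   # shape (4,)

# One score per window and feature: mean over timesteps only
per_feature = squared_error.mean(axis=1)       # shape (4, 3)

worst_window = int(np.argmax(per_window))
worst_feature = int(np.argmax(per_feature[worst_window]))
print(worst_window, worst_feature)
```

Because the per-window score is a mean over all features, a large deviation in a single signal is diluted by the number of features; the per-feature view avoids that.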

In [None]:
# Plot reconstruction error distribution
plt.figure(figsize=(12, 6))
plt.hist(mse, bins=50, alpha=0.7)
plt.title('Reconstruction Error Distribution')
plt.xlabel('Reconstruction Error (MSE)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [None]:
# Plot reconstruction error by class
plt.figure(figsize=(12, 6))
plt.hist(mse[y_test_seq == 0], bins=50, alpha=0.7, label='Normal', color='blue')
plt.hist(mse[y_test_seq == 1], bins=50, alpha=0.7, label='Anomaly', color='red')
plt.title('Reconstruction Error by Class')
plt.xlabel('Reconstruction Error (MSE)')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.show()

## 9. Threshold Optimization

In [None]:
# Try different thresholds for anomaly detection
thresholds = np.linspace(mse.min(), mse.max(), 100)
f1_scores = []
precision_scores = []
recall_scores = []

for threshold in thresholds:
    y_pred_threshold = np.where(mse >= threshold, 1, 0)
    f1_scores.append(f1_score(y_test_seq, y_pred_threshold))
    precision_scores.append(precision_score(y_test_seq, y_pred_threshold))
    recall_scores.append(recall_score(y_test_seq, y_pred_threshold))

# Find the threshold that maximizes F1 score
best_threshold_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_threshold_idx]
best_f1 = f1_scores[best_threshold_idx]
best_precision = precision_scores[best_threshold_idx]
best_recall = recall_scores[best_threshold_idx]

print(f"Best threshold: {best_threshold:.4f}")
print(f"Best F1 score: {best_f1:.4f}")
print(f"Precision at best threshold: {best_precision:.4f}")
print(f"Recall at best threshold: {best_recall:.4f}")
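
The search above picks the threshold using the test labels, so the resulting scores tend to be optimistic. A common alternative, sketched here as an assumption rather than as the notebook's method, is to set the threshold from the reconstruction errors of normal-only validation data, for example at a high percentile. The function name, the 99th percentile, and the synthetic errors are all illustrative choices.

```python
import numpy as np

def percentile_threshold(normal_errors, percentile=99.0):
    """Threshold below which the given percentage of normal errors fall."""
    return float(np.percentile(normal_errors, percentile))

# Synthetic stand-in for reconstruction errors on normal validation windows
val_errors = np.linspace(0.0, 1.0, 101)
threshold = percentile_threshold(val_errors, 99.0)

# Flag new windows whose error reaches the threshold
new_errors = np.array([0.2, 0.995, 1.7])
flags = (new_errors >= threshold).astype(int)
print(threshold, flags)
```

In this notebook the validation windows are deleted right after training, so their reconstruction errors would have to be computed before that cleanup step. The percentile directly sets the expected false-alarm rate on normal data (about 1% for the 99th percentile).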

In [None]:
# Plot F1, precision, and recall scores for different thresholds
plt.figure(figsize=(12, 8))
plt.plot(thresholds, f1_scores, 'b-', label='F1 Score')
plt.plot(thresholds, precision_scores, 'g-', label='Precision')
plt.plot(thresholds, recall_scores, 'r-', label='Recall')
plt.axvline(x=best_threshold, color='k', linestyle='--', label=f'Best Threshold: {best_threshold:.4f}')
plt.title('Performance Metrics vs. Threshold')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Apply the optimized threshold
y_pred_optimized = np.where(mse >= best_threshold, 1, 0)

# Evaluate with optimized threshold
print("Confusion Matrix with Optimized Threshold:")
cm_optimized = confusion_matrix(y_test_seq, y_pred_optimized)
print(cm_optimized)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_optimized, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Normal', 'Anomaly'],
            yticklabels=['Normal', 'Anomaly'])
plt.title('Confusion Matrix with Optimized Threshold')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Classification report
print("\nClassification Report with Optimized Threshold:")
print(classification_report(y_test_seq, y_pred_optimized))
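
The numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below recomputes precision, recall, and F1 for the anomaly class from a matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes). The counts are made up for illustration.

```python
def metrics_from_confusion(cm):
    """Precision, recall, and F1 for the positive class from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 95 normal windows, 5 anomalous windows
cm = [[90, 5],
      [2, 3]]
precision, recall, f1 = metrics_from_confusion(cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With heavily imbalanced classes, as in this toy matrix, accuracy stays high (93%) even though fewer than half of the raised alarms are correct, which is why the notebook optimizes F1 instead.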

## 10. ROC and Precision-Recall Curves

In [None]:
# ROC Curve
fpr, tpr, _ = roc_curve(y_test_seq, mse)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Precision-Recall Curve
precision_curve, recall_curve, _ = precision_recall_curve(y_test_seq, mse)
pr_auc = auc(recall_curve, precision_curve)

plt.figure(figsize=(10, 8))
plt.plot(recall_curve, precision_curve, color='blue', lw=2, label=f'PR curve (area = {pr_auc:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.show()
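
The ROC AUC has a direct interpretation: it is the probability that a randomly chosen anomalous window receives a higher score than a randomly chosen normal one, with ties counted as half. The sketch below computes it by comparing every positive score with every negative score. The function name is hypothetical, and the pairwise matrix makes this suitable only for small samples.

```python
import numpy as np

def roc_auc_by_pairs(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # diff[i, j] > 0 means positive i outranks negative j
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pair_auc = roc_auc_by_pairs(labels, scores)
print(pair_auc)
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. Unlike F1, this value does not depend on any particular threshold.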

## 11. Analyze Anomalies Over Time

In [None]:
# Create a DataFrame with timestamps, actual labels, and predictions
# Each sequence is labeled by its last timestep, so sequence k maps to
# the timestamp at row k + seq_length - 1 of the test data
start_idx = seq_length - 1
end_idx = start_idx + len(mse)
if end_idx <= len(test_df):
    timestamps = test_df['time'].iloc[start_idx:end_idx].values
else:
    # If we don't have enough timestamps, just use what we have
    timestamps = test_df['time'].iloc[start_idx:].values
    timestamps = np.pad(timestamps, (0, len(mse) - len(timestamps)), 'edge')

results_df = pd.DataFrame({
    'timestamp': timestamps[:len(mse)],
    'actual': y_test_seq,
    'predicted': y_pred_optimized,
    'reconstruction_error': mse
})

# Plot actual vs predicted anomalies over time
plt.figure(figsize=(16, 8))

# Sample data for better visualization if dataset is large
sample_size = min(10000, len(results_df))
sample_indices = np.linspace(0, len(results_df)-1, sample_size, dtype=int)
sample_df = results_df.iloc[sample_indices]

plt.plot(sample_df['timestamp'], sample_df['actual'], 'b-', alpha=0.5, label='Actual')
plt.plot(sample_df['timestamp'], sample_df['predicted'], 'r-', alpha=0.5, label='Predicted')
plt.title('Actual vs Predicted Anomalies Over Time')
plt.xlabel('Time')
plt.ylabel('Anomaly (1) / Normal (0)')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Plot reconstruction error over time with actual labels
plt.figure(figsize=(16, 8))

# Create a colormap based on actual labels
colors = np.where(sample_df['actual'] == 1, 'red', 'blue')

plt.scatter(sample_df['timestamp'], sample_df['reconstruction_error'], c=colors, alpha=0.5, s=10)
plt.title('Reconstruction Error Over Time')
plt.xlabel('Time')
plt.ylabel('Reconstruction Error')
plt.grid(True)

# Add a horizontal line at the threshold
plt.axhline(y=best_threshold, color='g', linestyle='--')

# Add a legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Actual Anomaly'),
    Line2D([0], [0], marker='o', color='w', markerfacecolor='blue', markersize=10, label='Normal'),
    Line2D([0], [0], color='g', linestyle='--', label='Threshold')
]
plt.legend(handles=legend_elements)

plt.show()
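
When reading plots like the ones above, it often helps to group consecutive flagged windows into intervals instead of inspecting individual points. The sketch below finds contiguous runs of 1s in a binary prediction array; the function name `anomaly_intervals` is hypothetical and the input is toy data.

```python
import numpy as np

def anomaly_intervals(flags):
    """Return half-open (start, end) index pairs for each run of consecutive 1s."""
    flags = np.asarray(flags, dtype=int)
    # Pad with zeros so runs touching either end still produce a start and an end
    padded = np.concatenate(([0], flags, [0]))
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(changes == -1)    # 1 -> 0 transitions
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

predicted = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
intervals = anomaly_intervals(predicted)
print(intervals)
```

Each pair indexes into the prediction array, so the matching timestamps can be looked up from the same rows of a results table.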

## 12. Save the Optimized Threshold

In [None]:
# Save the optimized threshold
threshold_filename = os.path.join(model_dir, "optimized_threshold.pkl")
with open(threshold_filename, 'wb') as file:
    pickle.dump(best_threshold, file)
    
print(f"Optimized threshold saved to {threshold_filename}")
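
For later use, the saved threshold has to be read back and compared against new reconstruction errors. The following is a minimal round-trip sketch using a temporary directory; the helper names and the numeric values are illustrative, and the scaler could be restored the same way. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path with pickle."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)

def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Use a temporary directory so the sketch does not touch real artifacts
artifacts_dir = tempfile.mkdtemp()
threshold_path = os.path.join(artifacts_dir, "optimized_threshold.pkl")

save_pickle(0.0123, threshold_path)            # Illustrative threshold value
restored_threshold = load_pickle(threshold_path)

new_error = 0.05                               # Illustrative reconstruction error
is_anomaly = new_error >= restored_threshold
print(restored_threshold, is_anomaly)
```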

## 13. Conclusion

In this notebook, we have:

1. Loaded and preprocessed the HAI-21.03 dataset
2. Created sequences for LSTM processing using a memory-efficient approach
3. Built and trained a simplified LSTM Autoencoder model for anomaly detection
4. Evaluated the model's performance using reconstruction error
5. Optimized the anomaly detection threshold to maximize F1 score
6. Saved the model, scaler, and optimized threshold for future use

The LSTM Autoencoder detects anomalies in time series data from industrial control systems by learning the normal patterns in the data and flagging sequences with a high reconstruction error as potential anomalies. How well this works on HAI-21.03 is quantified by the F1, ROC, and precision-recall results above.

To make this notebook more memory-efficient, we implemented several optimizations:
1. Used a smaller sequence length (20 instead of 30)
2. Simplified the model architecture (fewer layers and units)
3. Used a smaller batch size for training (32 instead of 64)
4. Processed data in batches to reduce memory usage
5. Used garbage collection to free memory after processing each batch
6. Sampled the training data to reduce the overall dataset size
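
To see why these reductions matter, the size of the sequence array can be estimated up front. Overlapping windows repeat every row roughly `seq_length` times, so the array is about `seq_length` times larger than the scaled data. The sketch assumes 8-byte float64 values and uses a placeholder feature count of 79; the function name is hypothetical.

```python
def sequence_memory_bytes(n_rows, seq_length, n_features, bytes_per_value=8):
    """Bytes needed to hold every overlapping window as a dense array."""
    n_windows = max(n_rows - seq_length + 1, 0)
    return n_windows * seq_length * n_features * bytes_per_value

n_features = 79  # Placeholder value, not taken from the notebook output

as_float64 = sequence_memory_bytes(100_000, 20, n_features)
as_float32 = sequence_memory_bytes(100_000, 20, n_features, bytes_per_value=4)
longer_windows = sequence_memory_bytes(100_000, 30, n_features)

print(f"seq_length=20, float64: {as_float64 / 1e9:.2f} GB")
print(f"seq_length=20, float32: {as_float32 / 1e9:.2f} GB")
print(f"seq_length=30, float64: {longer_windows / 1e9:.2f} GB")
```

Under these assumptions, casting the scaled data to `float32` before building the sequences would halve the footprint without changing the window length.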

These optimizations allow the model to run on systems with limited GPU memory, trading some model capacity and training data for a smaller memory footprint.