# FLEAD: Offline Federated Learning Simulation with the Edge-IIoTset Dataset

This notebook provides an offline simulation of the FLEAD project's core machine learning pipeline. It uses the `Edge-IIoTset` dataset to demonstrate the end-to-end process of training a federated Long Short-Term Memory (LSTM) model for anomaly detection.

### Process Overview

The simulation executes the following key steps:

*   **Data Loading & Preprocessing:** Loads the `Edge-IIoTset` CSV and transforms the time series data into sequential windows suitable for an LSTM model.

*   **Federated Partitioning:** Distributes the data windows among multiple simulated clients, supporting both IID and non-IID (time-skewed) distributions.

*   **Federated Training:** Implements the Federated Averaging (FedAvg) algorithm to iteratively:
    1.  Train local LSTM models on each client's data subset.
    2.  Aggregate model updates to produce an improved global model.

*   **Classification Modes:** Supports two distinct, configurable modes:
    1.  **Binary:** (Benign vs. Attack) using the `Attack_label` column.
    2.  **Multi-Class:** (Specific Attack Type) using the `Attack_type` column.

*   **Evaluation & Visualization:** Tracks the global model's performance (accuracy and loss) after each round and visualizes the learning progress.
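
The aggregation step of Federated Averaging listed above can be illustrated in plain NumPy. This is a minimal sketch, not the notebook's own loop: the function name `fedavg` and its arguments are hypothetical, and it weights each client by its sample count, which reduces to a plain mean when all clients hold the same amount of data.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average per-tensor weights across clients, weighted by sample count.

    client_weights: one entry per client, each a list of np.ndarray in the
                    order returned by a Keras model's get_weights().
    client_sizes:   number of training samples held by each client.
    """
    sizes = np.asarray(client_sizes, dtype=np.float64)
    coeffs = sizes / sizes.sum()  # each client's share of the total data
    averaged = []
    for tensors in zip(*client_weights):  # the same tensor from every client
        stacked = np.stack(tensors, axis=0)  # (n_clients, *tensor_shape)
        # Contract the client axis against the coefficients.
        mixed = np.tensordot(coeffs, stacked, axes=1)
        averaged.append(mixed.astype(stacked.dtype))
    return averaged

# Toy example: two clients, one 2-element weight tensor each.
toy = fedavg(
    client_weights=[[np.array([1.0, 1.0])], [np.array([3.0, 3.0])]],
    client_sizes=[1, 3],
)
print(toy[0])  # [2.5 2.5]
```

The second client holds three times as much data, so its weights contribute 75% of the result.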

### Note on Scope

This notebook covers only the data science and federated learning components and is intended as a learning aid. Distributed data streaming technologies such as Apache Kafka and Apache Flink are not implemented in this simulation.

In [21]:
# 1) Install & set up Kaggle API
!pip install -q kaggle
from google.colab import files

# Prompt for API token upload
print("➡️ Please upload your kaggle.json (from Kaggle > Account > Create New API Token)")
files.upload()

# Move the token to the correct directory and set permissions
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

print("✅ Kaggle API configured.")

➡️ Please upload your kaggle.json (from Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
✅ Kaggle API configured.


In [22]:
# 2) Download, unzip, and select the correct dataset
!kaggle datasets download -d sibasispradhan/edge-iiotset-dataset -p /content/edge_iiot -q
!unzip -q -o /content/edge_iiot/edge-iiotset-dataset.zip -d /content/edge_iiot
print("✅ Dataset downloaded and unzipped.")

import pathlib

data_dir = pathlib.Path("/content/edge_iiot")

# Following the dataset creators' research paper, we use their main pre-processed files:
#   'DNN-EdgeIIoT-dataset.csv' is intended for deep learning models (our goal).
#   'ML-EdgeIIoT-dataset.csv' is the designated backup.
# Other undocumented files (such as live_data_training.csv) are deliberately ignored.
preferred_order = ["DNN-EdgeIIoT-dataset.csv", "ML-EdgeIIoT-dataset.csv"]
CSV_PATH = None

# Find the best available dataset from our corrected list.
for name in preferred_order:
    # Use rglob to find the file even if it's nested in a subfolder.
    found_files = list(data_dir.rglob(name))
    if found_files:
        CSV_PATH = str(found_files[0])
        break # Stop as soon as the best option is found.

# --- Final Confirmation ---
print("-" * 50)
if CSV_PATH:
    selected_file = pathlib.Path(CSV_PATH).name
    # Check whether the selected file is the top-priority one.
    if selected_file == preferred_order[0]:
        print(f"✅ Successfully selected the primary dataset: {selected_file}")
        print("(Backup dataset 'ML-EdgeIIoT-dataset.csv' was found but not needed).")
    else:
        # This means the primary was not found, and we are using the backup.
        print(f"⚠️ Warning: Primary dataset '{preferred_order[0]}' not found.")
        print(f"➡️ Using the designated backup dataset: {selected_file}")
else:
    # We can't proceed without a valid dataset, so fail here rather than in a later cell.
    raise FileNotFoundError("❌ ERROR: Neither the DNN nor the ML dataset file could be found.")

Dataset URL: https://www.kaggle.com/datasets/sibasispradhan/edge-iiotset-dataset
License(s): MIT
✅ Dataset downloaded and unzipped.
--------------------------------------------------
✅ Successfully selected the primary dataset: DNN-EdgeIIoT-dataset.csv
(Backup dataset 'ML-EdgeIIoT-dataset.csv' was found but not needed).


In [23]:
# === Core Libraries ===
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# === Scikit-learn for Preprocessing & Evaluation ===
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# === TensorFlow for Deep Learning ===
# We only need the main tensorflow import. Keras is accessible via tf.keras.
import tensorflow as tf

# === Reproducibility ===
# Set a seed for random number generators in all libraries to ensure
# that results can be reproduced on subsequent runs.
SEED = 42
os.environ['TF_DETERMINISTIC_OPS'] = '1' # Optional: Aims for more deterministic TF operations
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# === Environment Check ===
print(f"✅ TensorFlow version: {tf.__version__}")

# Check for GPU availability for hardware acceleration
if tf.config.list_physical_devices('GPU'):
    print("✅ GPU is available.")
else:
    print("⚠️ No GPU detected. Training will run on the CPU.")

✅ TensorFlow version: 2.19.0
✅ GPU is available.


In [24]:
# === Configuration ===

# --- Dataset Path ---
# CSV_PATH is set automatically by the previous cell.
# Do not redefine it here, so that the programmatically located file is always used.
# To inspect the path, run: print(CSV_PATH)

# --- Time Series Parameters ---
# These are essential for preparing the data for the LSTM model.
SEQUENCE_LENGTH  = 20    # The number of time steps in each sample (window size).
TIME_COLUMN_NAME = None  # Name of the timestamp column (e.g., 'frame.time'), or None to keep the original row order.

# --- Federated Learning Setup ---
NUM_CLIENTS   = 4     # Number of simulated edge devices (clients).
ROUNDS        = 5     # Number of global communication rounds.
LOCAL_EPOCHS  = 2     # Local training epochs per client before aggregation.
BATCH_SIZE    = 256   # Mini-batch size for each client's local training.
IID_SPLIT     = True  # True = IID (randomly shuffled data); False = non-IID (clients get sequential time chunks).

# --- Task & Model Architecture Setup ---
# Set MULTICLASS to True to use 'Attack_type', False to use 'Attack_label'.
MULTICLASS    = False # For testing purposes; will be changed later.
N_FEATURES    = None  # Will be set automatically after loading data.
NUM_CLASSES   = None  # Will be set automatically based on the label column.

# Neural network hyperparameters
HIDDEN_UNITS  = [128, 64, 32] # Units for LSTM layer 1, LSTM layer 2, and the Dense layer.
DROPOUT       = 0.2         # Dropout rate for regularization.
LR            = 1e-3        # Learning rate for the Adam optimizer.

# --- Confirmation ---
print("✅ Configuration loaded successfully.")
print(f"Federated Learning: {NUM_CLIENTS} clients, {ROUNDS} rounds.")
print(f"Task: {'Multi-class' if MULTICLASS else 'Binary'} classification.")
print(f"Time Series Window Size: {SEQUENCE_LENGTH} steps.")

✅ Configuration loaded successfully.
Federated Learning: 4 clients, 5 rounds.
Task: Binary classification.
Time Series Window Size: 20 steps.


In [25]:
# === Data Loading and Preparation ===

# 1. Load the dataset using the path set by our file-finder logic
assert os.path.exists(CSV_PATH), f"❌ File not found: {CSV_PATH}"
df = pd.read_csv(CSV_PATH, low_memory=False)

# 2. Handle Labels based on the MULTICLASS configuration switch
if MULTICLASS:
    LABEL_COL_NAME = 'Attack_type'
    print(f"Running in MULTI-CLASS mode. Using label column: '{LABEL_COL_NAME}'")
    encoder = LabelEncoder()
    df['label_encoded'] = encoder.fit_transform(df[LABEL_COL_NAME])
else:
    LABEL_COL_NAME = 'Attack_label'
    print(f"Running in BINARY mode. Using label column: '{LABEL_COL_NAME}'")
    df['label_encoded'] = df[LABEL_COL_NAME]

# 3. Sort by Time (if a time column is specified and exists)
if TIME_COLUMN_NAME and TIME_COLUMN_NAME in df.columns:
    print(f"🕒 Converting and sorting by time column: {TIME_COLUMN_NAME}")

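    # --- Timestamp parsing ---
    # Step A: Force the column to a numeric type; any non-numeric text becomes NaN.
    # This assumes the column holds Unix timestamps in seconds. String-formatted
    # timestamps would all be coerced to NaN here and dropped in Step C.
    numeric_timestamps = pd.to_numeric(df[TIME_COLUMN_NAME], errors='coerce')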

    # Step B: Now, convert the clean numbers (Unix timestamps) to datetime objects.
    df[TIME_COLUMN_NAME] = pd.to_datetime(numeric_timestamps, unit='s', errors='coerce')

    # Step C: Drop any rows that failed conversion and then sort the rest.
    df.dropna(subset=[TIME_COLUMN_NAME], inplace=True)
    df.sort_values(TIME_COLUMN_NAME, inplace=True)
else:
    print("⚠️ No time column specified or found. Using original row order.")

# 4. Separate Features (X) and Labels (y)
potential_label_cols = ['Attack_label', 'Attack_type', 'label_encoded']
feature_cols = [col for col in df.columns if df[col].dtype in ['float64', 'int64'] and col not in potential_label_cols]
X = df[feature_cols].astype(np.float32)
y = df['label_encoded'].values

# 5. Finalize and Update Dynamic Configuration
N_FEATURES = X.shape[1]
NUM_CLASSES = len(np.unique(y))

# --- Summary ---
print("-" * 50)
print(f"✅ Data preparation complete.")
print(f"Features (X) shape: {X.shape}")
print(f"Labels (y) shape: {y.shape}")
print(f"Number of features detected: {N_FEATURES}")
print(f"Number of classes detected: {NUM_CLASSES}")
if MULTICLASS:
    print(f"Class mapping: {list(encoder.classes_)}")
print("-" * 50)
print("Data Head:")
display(df.head(3))

Running in BINARY mode. Using label column: 'Attack_label'
⚠️ No time column specified or found. Using original row order.
--------------------------------------------------
✅ Data preparation complete.
Features (X) shape: (2219201, 42)
Labels (y) shape: (2219201,)
Number of features detected: 42
Number of classes detected: 2
--------------------------------------------------
Data Head:


Unnamed: 0,frame.time,ip.src_host,ip.dst_host,arp.dst.proto_ipv4,arp.opcode,arp.hw.size,arp.src.proto_ipv4,icmp.checksum,icmp.seq_le,icmp.transmit_timestamp,...,mqtt.protoname,mqtt.topic,mqtt.topic_len,mqtt.ver,mbtcp.len,mbtcp.trans_id,mbtcp.unit_id,Attack_label,Attack_type,label_encoded
0,2021 11:44:10.081753000,192.168.0.128,192.168.0.101,0,0.0,0.0,0,0.0,0.0,0.0,...,0,0,0.0,0.0,0.0,0.0,0.0,0,Normal,0
1,2021 11:44:10.162218000,192.168.0.101,192.168.0.128,0,0.0,0.0,0,0.0,0.0,0.0,...,MQTT,0,0.0,4.0,0.0,0.0,0.0,0,Normal,0
2,2021 11:44:10.162271000,192.168.0.128,192.168.0.101,0,0.0,0.0,0,0.0,0.0,0.0,...,0,0,0.0,0.0,0.0,0.0,0.0,0,Normal,0


In [26]:
# === Final Data Prep Part 1: Split, Subset, Scale, and Window ===

# 1. Split the full data into initial Training and Testing sets (chronological)
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
    X, y, test_size=0.2, shuffle=False, random_state=SEED
)

# 2. Take a smaller subset of BOTH sets to avoid running out of memory
DEV_SET_FRACTION = 0.1 # Use 10% of the data for this simulation
train_subset_size = int(len(X_train_full) * DEV_SET_FRACTION)
test_subset_size = int(len(X_test_full) * DEV_SET_FRACTION)

X_train_dev = X_train_full[:train_subset_size]
y_train_dev = y_train_full[:train_subset_size]
X_test_dev = X_test_full[:test_subset_size]
y_test_dev = y_test_full[:test_subset_size]

print(f"Using a {DEV_SET_FRACTION*100}% subset for this simulation.")
print(f"  - Dev Training set size: {len(X_train_dev)} rows.")
print(f"  - Dev Testing set size:  {len(X_test_dev)} rows.\n")

# 3. Scale the feature data (fitting ONLY on the training subset)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_dev)
X_test_scaled = scaler.transform(X_test_dev)
print("✅ Subsets created and scaled successfully.")

# 4. Create time series sequences (windows) from the smaller subsets
def create_sequences(features, labels, seq_length):
    """Creates overlapping sequences from time series data."""
    X_seq, y_seq = [], []
    for i in range(len(features) - seq_length + 1):
        X_seq.append(features[i:(i + seq_length)])
        y_seq.append(labels[i + seq_length - 1])
    return np.array(X_seq), np.array(y_seq)

X_train_seq, y_train_seq = create_sequences(X_train_scaled, y_train_dev, SEQUENCE_LENGTH)
X_test_seq, y_test_seq = create_sequences(X_test_scaled, y_test_dev, SEQUENCE_LENGTH)

# --- Summary ---
print("-" * 50)
print("✅ Time series sequences created from subsets.")
print(f"Final training features shape: {X_train_seq.shape}")
print(f"Final testing features shape:  {X_test_seq.shape}")

INPUT_SHAPE = (X_train_seq.shape[1], X_train_seq.shape[2])
print(f"\n➡️ Model input shape will be: {INPUT_SHAPE}")

Using a 10.0% subset for this simulation.
  - Dev Training set size: 177536 rows.
  - Dev Testing set size:  44384 rows.

✅ Subsets created and scaled successfully.
--------------------------------------------------
✅ Time series sequences created from subsets.
Final training features shape: (177517, 20, 42)
Final testing features shape:  (44365, 20, 42)

➡️ Model input shape will be: (20, 42)


In [27]:
# === Final Data Prep Part 1: Split, Stratified Subset, Scale, and Window ===

# 1. Split the full data into initial Training and Testing sets (chronological)
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
    X, y, test_size=0.2, shuffle=False, random_state=SEED
)

# 2. --- STRATIFIED subset to ensure both classes are present ---
DEV_SET_FRACTION = 0.02 # Use 2% of the data for this simulation

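# Use train_test_split again to draw a smaller, class-representative subset.
# 'stratify=y_train_full' preserves the percentage of each class.
# Caveat: shuffling discards the original row order, so the windows built in
# step 4 no longer consist of consecutive rows from the capture.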
_, X_train_dev, _, y_train_dev = train_test_split(
    X_train_full, y_train_full,
    test_size=DEV_SET_FRACTION,
    shuffle=True, # Shuffle is needed for a random representative sample
    stratify=y_train_full,
    random_state=SEED
)

# We still use a simple chronological subset for the test data
test_subset_size = int(len(X_test_full) * DEV_SET_FRACTION)
X_test_dev = X_test_full[:test_subset_size]
y_test_dev = y_test_full[:test_subset_size]

print(f"Using a stratified {DEV_SET_FRACTION*100}% subset for training.")
print(f"  - Dev Training set size: {len(X_train_dev)} rows.")
print(f"  - Dev Testing set size:  {len(X_test_dev)} rows.\n")

# 3. Scale the feature data (fitting ONLY on the training subset)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_dev)
X_test_scaled = scaler.transform(X_test_dev)
print("✅ Subsets created and scaled successfully.")

# 4. Create time series sequences (windows)
def create_sequences(features, labels, seq_length):
    X_seq, y_seq = [], []
    for i in range(len(features) - seq_length + 1):
        X_seq.append(features[i:(i + seq_length)])
        y_seq.append(labels[i + seq_length - 1])
    return np.array(X_seq), np.array(y_seq)

X_train_seq, y_train_seq = create_sequences(X_train_scaled, y_train_dev, SEQUENCE_LENGTH)
X_test_seq, y_test_seq = create_sequences(X_test_scaled, y_test_dev, SEQUENCE_LENGTH)

# --- Summary ---
print("-" * 50)
print("✅ Time series sequences created from subsets.")
print(f"Final training features shape: {X_train_seq.shape}")
print(f"Final testing features shape:  {X_test_seq.shape}")
INPUT_SHAPE = (X_train_seq.shape[1], X_train_seq.shape[2])
print(f"\n➡️ Model input shape will be: {INPUT_SHAPE}")

Using a stratified 2.0% subset for training.
  - Dev Training set size: 35508 rows.
  - Dev Testing set size:  8876 rows.

✅ Subsets created and scaled successfully.
--------------------------------------------------
✅ Time series sequences created from subsets.
Final training features shape: (35489, 20, 42)
Final testing features shape:  (8857, 20, 42)

➡️ Model input shape will be: (20, 42)
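
The windowing loop in `create_sequences` can also be written without a Python loop. The following is a minimal sketch, assuming NumPy 1.20 or newer for `sliding_window_view`; the name `create_sequences_vectorized` and the toy arrays are hypothetical and only illustrate that each window is labelled by its last row.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def create_sequences_vectorized(features, labels, seq_length):
    """Build overlapping windows as a strided view instead of a loop."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    # Windows over axis 0 come back as (n_windows, n_features, seq_length) ...
    windows = sliding_window_view(features, seq_length, axis=0)
    # ... so move the time axis in front of the feature axis.
    X_seq = windows.transpose(0, 2, 1)
    # Each window takes the label of its final row.
    y_seq = labels[seq_length - 1:]
    return X_seq, y_seq

demo_X = np.arange(12, dtype=np.float32).reshape(6, 2)  # 6 rows, 2 features
demo_y = np.array([0, 0, 1, 0, 1, 1])
X_win, y_win = create_sequences_vectorized(demo_X, demo_y, seq_length=3)
print(X_win.shape, y_win.shape)  # (4, 3, 2) (4,)
```

The result is a read-only view onto the original array, so it avoids materializing every window up front; wrap it in `np.ascontiguousarray` if a writable copy is needed.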


In [28]:
# === Final Data Prep Part 2: Create Federated Clients ===

def create_federated_clients(X, y, num_clients, iid=True):
    """Partitions the training data among a number of simulated clients."""
    client_data = {}
    all_indices = np.arange(len(X))

    if iid:
        np.random.shuffle(all_indices)
        print("Creating an IID data split (data is shuffled)...")
    else:
        print("Creating a non-IID data split (clients get sequential time chunks)...")

    client_indices = np.array_split(all_indices, num_clients)

    for i in range(num_clients):
        client_X = X[client_indices[i]]
        client_y = y[client_indices[i]]
        client_data[f'client_{i+1}'] = (client_X, client_y)

    return client_data

# Create the client data using the stratified, subsetted training sequences
federated_train_data = create_federated_clients(
    X_train_seq, y_train_seq, NUM_CLIENTS, iid=IID_SPLIT
)

# --- Summary of the Federated Data Split ---
print("-" * 50)
print(f"✅ Successfully created {len(federated_train_data)} federated clients.")
for i in range(NUM_CLIENTS):
    client_id = f'client_{i+1}'
    client_X, client_y = federated_train_data[client_id]

    # Per-client label counts show how the classes are distributed.
    label_counts = np.bincount(client_y, minlength=NUM_CLASSES)

    print(f"  - {client_id}: X shape = {client_X.shape}, Label distribution = {label_counts}")

Creating an IID data split (data is shuffled)...
--------------------------------------------------
✅ Successfully created 4 federated clients.
  - client_1: X shape = (8873, 20, 42), Label distribution = [8095  778]
  - client_2: X shape = (8872, 20, 42), Label distribution = [8086  786]
  - client_3: X shape = (8872, 20, 42), Label distribution = [8067  805]
  - client_4: X shape = (8872, 20, 42), Label distribution = [8048  824]
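
To see how the two partitioning modes differ, the index logic can be tried on a toy range. This is a small sketch under assumptions: `partition_indices` is a hypothetical helper that mirrors the shuffle-then-`np.array_split` approach above, but uses a local `np.random.default_rng` generator rather than the global NumPy seed.

```python
import numpy as np

def partition_indices(n_samples, num_clients, iid=True, seed=42):
    """Return one index array per client."""
    idx = np.arange(n_samples)
    if iid:
        # IID: shuffle first so every client sees a random mix of samples.
        rng = np.random.default_rng(seed)
        rng.shuffle(idx)
    # Non-IID: indices stay in order, so each client gets one contiguous chunk.
    return np.array_split(idx, num_clients)

iid_parts = partition_indices(10, 3, iid=True)
seq_parts = partition_indices(10, 3, iid=False)
print([p.tolist() for p in seq_parts])  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

`np.array_split` gives the leftover samples to the first clients, which is why `client_1` above holds 8873 windows and the others 8872.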


In [29]:
# === Define the LSTM Model Architecture ===

def create_lstm_model(input_shape, num_classes):
    """
    Creates, compiles, and returns a Keras LSTM model using the Functional API.
    """
    # Guard against a cryptic ValueError by failing with a clear message
    # if the data prep cell has not been run.
    assert num_classes is not None, "NUM_CLASSES is None. Please re-run the 'Final Data Prep' cell before this one."
    assert input_shape is not None, "INPUT_SHAPE is None. Please re-run the 'Final Data Prep' cell before this one."

    # Define the input layer
    inputs = tf.keras.layers.Input(shape=input_shape)

    # --- Hidden Layers ---
    x = tf.keras.layers.LSTM(HIDDEN_UNITS[0], return_sequences=True)(inputs)
    x = tf.keras.layers.Dropout(DROPOUT)(x)

    x = tf.keras.layers.LSTM(HIDDEN_UNITS[1])(x)
    x = tf.keras.layers.Dropout(DROPOUT)(x)

    x = tf.keras.layers.Dense(HIDDEN_UNITS[2])(x)
    x = tf.keras.layers.Dropout(DROPOUT)(x)

    # --- Dynamic Output Layer ---
    if num_classes == 2:
        outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
        loss_function = 'binary_crossentropy'
    else:
        outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
        loss_function = 'sparse_categorical_crossentropy'

    # --- Create and Compile the Model ---
    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
    model.compile(optimizer=optimizer,
                  loss=loss_function,
                  metrics=['accuracy'])

    return model

# Create an instance of the global model to initialize it and see its structure
global_model = create_lstm_model(INPUT_SHAPE, NUM_CLASSES)

# --- Summary of the Model ---
print("✅ LSTM model created successfully.")
global_model.summary()

✅ LSTM model created successfully.
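
The summary table is not reproduced in this export, but the expected trainable parameter counts can be worked out by hand. The sketch below assumes Keras defaults (biases enabled, standard LSTM gates) and the binary configuration used here (42 features, `HIDDEN_UNITS = [128, 64, 32]`, one sigmoid output); the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-unit pair plus one bias per unit.
    return input_dim * units + units

n_features = 42
hidden = [128, 64, 32]

counts = {
    "lstm_1": lstm_params(n_features, hidden[0]),
    "lstm_2": lstm_params(hidden[0], hidden[1]),
    "dense": dense_params(hidden[1], hidden[2]),
    "output": dense_params(hidden[2], 1),
}
total = sum(counts.values())  # Dropout layers add no parameters
print(counts)
print(total)  # 139073
```

Comparing these figures with the output of `global_model.summary()` is a quick way to confirm that the input shape and layer sizes are what the configuration cell intended.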


In [35]:
# === Implement the Federated Averaging (FedAvg) Loop ===

# Create a history dictionary to store the performance of the global model after each round
history = {'loss': [], 'accuracy': []}

# This is the main federated learning loop
for r in range(ROUNDS):
    print(f"\n--- Round {r+1}/{ROUNDS} ---")

    # 1. LOCAL TRAINING on each client
    local_model_weights = []
    for client_id, (client_X, client_y) in federated_train_data.items():
        local_model = create_lstm_model(INPUT_SHAPE, NUM_CLASSES)
        local_model.set_weights(global_model.get_weights())

        print(f"  Training {client_id}...")
        local_model.fit(client_X, client_y,
                        epochs=LOCAL_EPOCHS,
                        batch_size=BATCH_SIZE,
                        verbose=0)

        local_model_weights.append(local_model.get_weights())

    # 2. GLOBAL AGGREGATION (unweighted FedAvg; clients hold near-equal amounts of data)

    # Initialize a new list to hold the averaged weights
    new_global_weights = []

    # Number of weight tensors in the model (kernels, recurrent kernels, biases)
    num_tensors = len(global_model.get_weights())

    # Loop through each weight tensor of the model
    for i in range(num_tensors):
        # Stack this tensor from all clients: shape (num_clients, *tensor_shape)
        stacked = np.stack([client_weights[i] for client_weights in local_model_weights], axis=0)

        # Element-wise average across clients
        avg_tensor = np.mean(stacked, axis=0)

        # Add the averaged tensor to our new list
        new_global_weights.append(avg_tensor)

    # Set the global model's weights to the averaged tensors
    global_model.set_weights(new_global_weights)

    # 3. GLOBAL EVALUATION
    loss, acc = global_model.evaluate(X_test_seq, y_test_seq, verbose=0)

    print(f"  GLOBAL MODEL EVALUATION: Loss={loss:.4f}, Accuracy={acc:.4f}")

    history['loss'].append(loss)
    history['accuracy'].append(acc)

print("\n✅ Federated training complete.")


--- Round 1/5 ---
  Training client_1...
  Training client_2...
  Training client_3...
  Training client_4...
  GLOBAL MODEL EVALUATION: Loss=1.3843, Accuracy=0.0000

--- Round 2/5 ---
  Training client_1...
  Training client_2...
  Training client_3...
  Training client_4...
  GLOBAL MODEL EVALUATION: Loss=1.0857, Accuracy=0.3579

--- Round 3/5 ---
  Training client_1...
  Training client_2...
  Training client_3...
  Training client_4...
  GLOBAL MODEL EVALUATION: Loss=0.5129, Accuracy=0.6583

--- Round 4/5 ---
  Training client_1...
  Training client_2...
  Training client_3...
  Training client_4...
  GLOBAL MODEL EVALUATION: Loss=0.1690, Accuracy=0.9904

--- Round 5/5 ---
  Training client_1...
  Training client_2...
  Training client_3...
  Training client_4...
  GLOBAL MODEL EVALUATION: Loss=0.1076, Accuracy=0.9922

✅ Federated training complete.
