# BGP Normal Traffic Generation using GAN Algorithms

## Overview
This notebook builds a pipeline for generating synthetic normal BGP traffic with three GAN architectures:
- **TimeGAN**: Embedding, recovery, supervisor, generator, and discriminator networks for time-series (Yoon et al., NeurIPS 2019)
- **DoppelGANger**: GAN for networked time-series data (Lin et al., IMC 2020), implemented here in simplified form
- **LSTM-GAN**: LSTM-based generator and discriminator

## Workflow
1. Load and preprocess BGP traffic data
2. Filter to normal traffic only
3. Feature selection and normalization
4. Sequence building for time-series models
5. Train multiple GAN architectures
6. Evaluate and compare models
7. Generate synthetic normal traffic

---

## Part 1: Environment Setup and Imports

In [20]:
# Install required packages (uncomment if needed)
# !pip install torch torchvision pandas numpy scikit-learn matplotlib seaborn tqdm statsmodels

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
import warnings
from statsmodels.tsa.stattools import acf
from scipy import stats
import json
import os
from datetime import datetime

warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


## Part 2: Configuration

In [22]:
# ============================================================================
# CONFIGURATION - Modify these parameters as needed
# ============================================================================

CONFIG = {
    # Data paths
    'data_path': '/home/smotaali/BGP_Traffic_Generation/results/rrc04_20251116_exctracted_1s_FIXED.csv',
    'output_dir': '/home/smotaali/BGP_Traffic_Generation/results/gan_outputs_opus/',
    
    # Feature configuration
    'selected_features': [
        'announcements', 'withdrawals', 'nlri_ann', 'dups',
        'origin_0', 'origin_2', 'origin_changes',
        'imp_wd', 'imp_wd_spath', 'imp_wd_dpath',
        'as_path_max', 'unique_as_path_max',
        'edit_distance_avg', 'edit_distance_max',
        'edit_distance_dict_0', 'edit_distance_dict_1', 'edit_distance_dict_2',
        'edit_distance_dict_3', 'edit_distance_dict_4', 'edit_distance_dict_5',
        'edit_distance_dict_6',
        'edit_distance_unique_dict_0', 'edit_distance_unique_dict_1',
        'number_rare_ases', 'rare_ases_avg',
        'nadas', 'flaps'
    ],
    
    # Columns to drop
    'drop_columns': ['label', 'window_start', 'window_end'],
    
    # Sequence parameters
    'sequence_length': 30,  # T: window length in seconds (adjustable: 10-60)
    'stride': 1,  # Sliding window stride
    
    # Train/test split
    'test_size': 0.2,
    'validation_size': 0.1,  # From training set
    
    # Training parameters
    'batch_size': 64,
    'epochs': 200,
    'learning_rate': 0.0002,
    'beta1': 0.5,  # Adam optimizer beta1
    'beta2': 0.999,  # Adam optimizer beta2
    
    # Model architecture
    'latent_dim': 32,
    'hidden_dim': 128,
    'num_layers': 3,
    
    # Evaluation
    'n_synthetic_samples': 1000,  # Number of synthetic sequences to generate
}

# Create output directory
os.makedirs(CONFIG['output_dir'], exist_ok=True)
print(f"Configuration loaded. Output directory: {CONFIG['output_dir']}")

Configuration loaded. Output directory: /home/smotaali/BGP_Traffic_Generation/results/gan_outputs_opus/


## Part 3: Data Loading and Preprocessing

In [23]:
# Load the data
print("Loading data...")
df = pd.read_csv(CONFIG['data_path'])
print(f"Original data shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nLabel distribution:\n{df['label'].value_counts()}")

Loading data...
Original data shape: (86387, 30)

Columns: ['announcements', 'withdrawals', 'nlri_ann', 'dups', 'origin_0', 'origin_2', 'origin_changes', 'imp_wd', 'imp_wd_spath', 'imp_wd_dpath', 'as_path_max', 'unique_as_path_max', 'edit_distance_avg', 'edit_distance_max', 'edit_distance_dict_0', 'edit_distance_dict_1', 'edit_distance_dict_2', 'edit_distance_dict_3', 'edit_distance_dict_4', 'edit_distance_dict_5', 'edit_distance_dict_6', 'edit_distance_unique_dict_0', 'edit_distance_unique_dict_1', 'number_rare_ases', 'rare_ases_avg', 'nadas', 'flaps', 'label', 'window_start', 'window_end']

Label distribution:
label
normal    86387
Name: count, dtype: int64


In [24]:
# Filter to normal traffic only
print("\n" + "="*60)
print("Filtering to NORMAL traffic only...")
print("="*60)

df_normal = df[df['label'] == 'normal'].copy()
print(f"Normal traffic rows: {len(df_normal)} ({100*len(df_normal)/len(df):.2f}% of total)")


Filtering to NORMAL traffic only...
Normal traffic rows: 86387 (100.00% of total)


In [25]:
# Select only the specified features
print("\n" + "="*60)
print("Feature Selection")
print("="*60)

# Check which features exist in the data
available_features = [f for f in CONFIG['selected_features'] if f in df_normal.columns]
missing_features = [f for f in CONFIG['selected_features'] if f not in df_normal.columns]

if missing_features:
    print(f"Warning: Missing features: {missing_features}")

print(f"Using {len(available_features)} features: {available_features}")

# Select features
df_features = df_normal[available_features].copy()

# Handle any missing values
print(f"\nMissing values before cleaning: {df_features.isnull().sum().sum()}")
df_features = df_features.fillna(0)  # Fill NaN with 0 for BGP features
print(f"Data shape after feature selection: {df_features.shape}")


Feature Selection
Using 27 features: ['announcements', 'withdrawals', 'nlri_ann', 'dups', 'origin_0', 'origin_2', 'origin_changes', 'imp_wd', 'imp_wd_spath', 'imp_wd_dpath', 'as_path_max', 'unique_as_path_max', 'edit_distance_avg', 'edit_distance_max', 'edit_distance_dict_0', 'edit_distance_dict_1', 'edit_distance_dict_2', 'edit_distance_dict_3', 'edit_distance_dict_4', 'edit_distance_dict_5', 'edit_distance_dict_6', 'edit_distance_unique_dict_0', 'edit_distance_unique_dict_1', 'number_rare_ases', 'rare_ases_avg', 'nadas', 'flaps']

Missing values before cleaning: 0
Data shape after feature selection: (86387, 27)


In [26]:
# Display statistics of selected features
print("\n" + "="*60)
print("Feature Statistics (Normal Traffic)")
print("="*60)
df_features.describe().T


Feature Statistics (Normal Traffic)


| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| announcements | 86387.0 | 182.400141 | 313.207475 | 0.0 | 24.0 | 61.0 | 179.0 | 12661.0 |
| withdrawals | 86387.0 | 15.683124 | 60.897714 | 0.0 | 1.0 | 3.0 | 9.0 | 2454.0 |
| nlri_ann | 86387.0 | 146.069629 | 272.502888 | 0.0 | 16.0 | 43.0 | 128.0 | 12632.0 |
| dups | 86387.0 | 7.523181 | 26.891729 | 0.0 | 0.0 | 1.0 | 4.0 | 947.0 |
| origin_0 | 86387.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| origin_2 | 86387.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| origin_changes | 86387.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| imp_wd | 86387.0 | 12.660632 | 40.066748 | 0.0 | 1.0 | 3.0 | 8.0 | 1435.0 |
| imp_wd_spath | 86387.0 | 2.739868 | 19.699433 | 0.0 | 0.0 | 0.0 | 0.0 | 601.0 |
| imp_wd_dpath | 86387.0 | 9.920764 | 32.776439 | 0.0 | 1.0 | 3.0 | 7.0 | 1435.0 |


## Part 4: Normalization and Sequence Building

In [27]:
# Normalize the data using StandardScaler
print("="*60)
print("Normalizing data with StandardScaler...")
print("="*60)

scaler = StandardScaler()
data_normalized = scaler.fit_transform(df_features.values)

print(f"Normalized data shape: {data_normalized.shape}")
print(f"Normalized data range: [{data_normalized.min():.4f}, {data_normalized.max():.4f}]")
print(f"Normalized data mean: {data_normalized.mean():.6f}")
print(f"Normalized data std: {data_normalized.std():.6f}")

Normalizing data with StandardScaler...
Normalized data shape: (86387, 27)
Normalized data range: [-2.2273, 89.4763]
Normalized data mean: 0.000000
Normalized data std: 0.902671
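
The overall standard deviation is 0.902671 instead of 1, even though `StandardScaler` gives every non-constant column unit variance. A likely explanation is constant features: the feature statistics show `origin_0`, `origin_2`, and `origin_changes` with zero standard deviation, and `StandardScaler` maps a constant column to all zeros. The sketch below works through the arithmetic under that assumption. `overall_std` is an illustrative helper, and the count of five constant columns is inferred from the printed value, not read from the data.

```python
import math

def overall_std(n_features: int, n_constant: int) -> float:
    """Std of the flattened matrix after per-column standardization.

    Assumes non-constant columns end up with mean 0 and variance 1,
    and constant columns are mapped to all zeros (variance 0).
    """
    # Every column has mean 0, so the flattened variance is simply
    # the average of the per-column variances.
    return math.sqrt((n_features - n_constant) / n_features)

# Inferred, not measured: five constant columns out of 27 reproduce the printed value.
print(f"{overall_std(27, 5):.6f}")  # 0.902671
```

If this reading is right, two more constant features sit among the columns not shown in the truncated statistics table; a check such as `df_features.std() == 0` would confirm which ones.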


In [28]:
def create_sequences(data, seq_length, stride=1):
    """
    Create sliding window sequences from time-series data.
    
    Args:
        data: numpy array of shape (num_timesteps, num_features)
        seq_length: length of each sequence (T)
        stride: step size between sequences
        
    Returns:
        sequences: numpy array of shape (num_samples, seq_length, num_features)
    """
    sequences = []
    for i in range(0, len(data) - seq_length + 1, stride):
        seq = data[i:i + seq_length]
        sequences.append(seq)
    return np.array(sequences)

# Create sequences
print("\n" + "="*60)
print(f"Building sequences (T={CONFIG['sequence_length']}, stride={CONFIG['stride']})...")
print("="*60)

sequences = create_sequences(data_normalized, CONFIG['sequence_length'], CONFIG['stride'])
print(f"Sequences shape: {sequences.shape}")
print(f"  - Number of samples: {sequences.shape[0]}")
print(f"  - Sequence length (T): {sequences.shape[1]}")
print(f"  - Number of features: {sequences.shape[2]}")


Building sequences (T=30, stride=1)...


Sequences shape: (86358, 30, 27)
  - Number of samples: 86358
  - Sequence length (T): 30
  - Number of features: 27
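
The sample count follows from the window arithmetic in `create_sequences`: a series of N rows yields floor((N - T) / stride) + 1 windows. A small check of that formula, where `n_windows` is a helper name introduced here for illustration:

```python
def n_windows(n_rows: int, seq_length: int, stride: int = 1) -> int:
    """Number of windows produced by range(0, n_rows - seq_length + 1, stride)."""
    if n_rows < seq_length:
        return 0
    return (n_rows - seq_length) // stride + 1

print(n_windows(86387, 30, 1))   # 86358, matching the shape printed above
print(n_windows(86387, 30, 30))  # 2879 non-overlapping windows
```

Raising `stride` toward `sequence_length` reduces the overlap between neighboring windows at the cost of fewer training samples.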


In [29]:
# Train/Test/Validation Split
print("\n" + "="*60)
print("Train/Test/Validation Split")
print("="*60)

# First split: train+val vs test
X_train_val, X_test = train_test_split(
    sequences, 
    test_size=CONFIG['test_size'], 
    random_state=SEED,
    shuffle=True
)

# Second split: train vs val
X_train, X_val = train_test_split(
    X_train_val, 
    test_size=CONFIG['validation_size'], 
    random_state=SEED,
    shuffle=True
)

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test.shape}")


Train/Test/Validation Split
Training set: (62177, 30, 27)
Validation set: (6909, 30, 27)
Test set: (17272, 30, 27)
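
Because the windows are built with `stride=1`, neighboring windows share up to T - 1 rows, so a shuffled split places near-identical windows in the training, validation, and test sets. If held-out evaluation should only use rows never seen in training, one option is a chronological split with a gap between partitions. The following is a sketch of that alternative, not the split used in this notebook; `chronological_split` and its rounding choices are assumptions.

```python
def chronological_split(n_windows, seq_len, stride=1, test_size=0.2, val_size=0.1):
    """Return (train, val, test) ranges of window indices in time order.

    Window i covers rows [i * stride, i * stride + seq_len - 1]. Dropping
    `gap` windows between partitions guarantees that no row is shared.
    """
    gap = (seq_len - 1) // stride
    usable = n_windows - 2 * gap
    if usable <= 0:
        raise ValueError("not enough windows for two gaps")
    n_test = int(round(usable * test_size))
    n_val = int(round((usable - n_test) * val_size))
    n_train = usable - n_test - n_val
    train = range(0, n_train)
    val = range(n_train + gap, n_train + gap + n_val)
    test = range(n_train + 2 * gap + n_val, n_windows)
    return train, val, test

train, val, test = chronological_split(86358, 30)
print(len(train), len(val), len(test))
```

The returned ranges index the first axis of `sequences`, for example `sequences[train.start:train.stop]`.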


In [30]:
# Convert to PyTorch tensors and create DataLoaders
train_tensor = torch.FloatTensor(X_train)
val_tensor = torch.FloatTensor(X_val)
test_tensor = torch.FloatTensor(X_test)

train_dataset = TensorDataset(train_tensor)
val_dataset = TensorDataset(val_tensor)
test_dataset = TensorDataset(test_tensor)

train_loader = DataLoader(train_dataset, batch_size=CONFIG['batch_size'], shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=CONFIG['batch_size'], shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=CONFIG['batch_size'], shuffle=False)

print(f"Train batches: {len(train_loader)}")
print(f"Validation batches: {len(val_loader)}")
print(f"Test batches: {len(test_loader)}")

Train batches: 971
Validation batches: 108
Test batches: 270


## Part 5: GAN Model Definitions

### 5.1 TimeGAN Architecture

In [31]:
# ============================================================================
# TimeGAN Components
# Based on: "Time-series Generative Adversarial Networks" (Yoon et al., NeurIPS 2019)
# ============================================================================

class TimeGAN_Embedder(nn.Module):
    """Embedding network: maps original feature space to latent space."""
    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, hidden_dim)
        self.activation = nn.Sigmoid()
        
    def forward(self, x):
        h, _ = self.rnn(x)
        h = self.fc(h)
        return self.activation(h)


class TimeGAN_Recovery(nn.Module):
    """Recovery network: maps from latent space back to original feature space."""
    def __init__(self, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, h):
        h_hat, _ = self.rnn(h)
        x_hat = self.fc(h_hat)
        return x_hat


class TimeGAN_Generator(nn.Module):
    """Generator: generates synthetic latent representations from noise."""
    def __init__(self, latent_dim, hidden_dim, num_layers):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, hidden_dim)
        self.activation = nn.Sigmoid()
        
    def forward(self, z):
        e_hat, _ = self.rnn(z)
        e_hat = self.fc(e_hat)
        return self.activation(e_hat)


class TimeGAN_Supervisor(nn.Module):
    """Supervisor: captures temporal dynamics in latent space."""
    def __init__(self, hidden_dim, num_layers):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim, num_layers - 1, batch_first=True)
        self.fc = nn.Linear(hidden_dim, hidden_dim)
        self.activation = nn.Sigmoid()
        
    def forward(self, h):
        s, _ = self.rnn(h)
        s = self.fc(s)
        return self.activation(s)


class TimeGAN_Discriminator(nn.Module):
    """Discriminator: distinguishes real from synthetic sequences."""
    def __init__(self, hidden_dim, num_layers):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)
        
    def forward(self, h):
        d_out, _ = self.rnn(h)
        y_hat = self.fc(d_out)
        return y_hat


class TimeGAN:
    """Complete TimeGAN model."""
    def __init__(self, input_dim, hidden_dim, latent_dim, num_layers, device):
        self.device = device
        self.hidden_dim = hidden_dim
        self.latent_dim = latent_dim
        
        # Initialize networks
        self.embedder = TimeGAN_Embedder(input_dim, hidden_dim, num_layers).to(device)
        self.recovery = TimeGAN_Recovery(hidden_dim, input_dim, num_layers).to(device)
        self.generator = TimeGAN_Generator(latent_dim, hidden_dim, num_layers).to(device)
        self.supervisor = TimeGAN_Supervisor(hidden_dim, num_layers).to(device)
        self.discriminator = TimeGAN_Discriminator(hidden_dim, num_layers).to(device)
        
        # Loss functions
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCEWithLogitsLoss()
        
    def _get_optimizers(self, lr):
        """Create optimizers for each training phase."""
        # Embedding optimizer
        e_params = list(self.embedder.parameters()) + list(self.recovery.parameters())
        self.opt_e = optim.Adam(e_params, lr=lr)
        
        # Supervised optimizer
        self.opt_s = optim.Adam(self.supervisor.parameters(), lr=lr)
        
        # Generator optimizer
        g_params = list(self.generator.parameters()) + list(self.supervisor.parameters())
        self.opt_g = optim.Adam(g_params, lr=lr)
        
        # Discriminator optimizer
        self.opt_d = optim.Adam(self.discriminator.parameters(), lr=lr)
        
    def train(self, train_loader, epochs, lr=0.001, gamma=1.0):
        """Train TimeGAN in four phases."""
        self._get_optimizers(lr)
        history = {'e_loss': [], 's_loss': [], 'g_loss': [], 'd_loss': []}
        
        # Phase 1: Embedding network training
        print("Phase 1: Training Embedding Network...")
        for epoch in tqdm(range(epochs // 4)):
            e_losses = []
            for batch in train_loader:
                x = batch[0].to(self.device)
                
                self.opt_e.zero_grad()
                h = self.embedder(x)
                x_tilde = self.recovery(h)
                e_loss = self.mse_loss(x, x_tilde)
                e_loss.backward()
                self.opt_e.step()
                e_losses.append(e_loss.item())
            history['e_loss'].append(np.mean(e_losses))
        
        # Phase 2: Supervised network training
        print("Phase 2: Training Supervised Network...")
        for epoch in tqdm(range(epochs // 4)):
            s_losses = []
            for batch in train_loader:
                x = batch[0].to(self.device)
                
                self.opt_s.zero_grad()
                h = self.embedder(x)
                h_hat_supervise = self.supervisor(h)
                s_loss = self.mse_loss(h[:, 1:, :], h_hat_supervise[:, :-1, :])
                s_loss.backward()
                self.opt_s.step()
                s_losses.append(s_loss.item())
            history['s_loss'].append(np.mean(s_losses))
        
        # Phase 3 & 4: Joint training
        print("Phase 3 & 4: Joint Training...")
        for epoch in tqdm(range(epochs // 2)):
            g_losses, d_losses = [], []
            for batch in train_loader:
                x = batch[0].to(self.device)
                batch_size, seq_len, _ = x.shape
                
                # Random noise
                z = torch.randn(batch_size, seq_len, self.latent_dim).to(self.device)
                
                # Train Generator (2 steps per discriminator step)
                for _ in range(2):
                    self.opt_g.zero_grad()
                    
                    # Real embeddings
                    h = self.embedder(x)
                    h_hat_supervise = self.supervisor(h)
                    
                    # Synthetic embeddings
                    e_hat = self.generator(z)
                    h_hat = self.supervisor(e_hat)
                    
                    # Synthetic data recovery
                    x_hat = self.recovery(h_hat)
                    
                    # Discriminator output for fake
                    y_fake = self.discriminator(h_hat)
                    y_fake_e = self.discriminator(e_hat)
                    
                    # Generator losses
                    g_loss_u = self.bce_loss(y_fake, torch.ones_like(y_fake))
                    g_loss_u_e = self.bce_loss(y_fake_e, torch.ones_like(y_fake_e))
                    g_loss_s = self.mse_loss(h[:, 1:, :], h_hat_supervise[:, :-1, :])
                    
                    # Moment matching
                    g_loss_v1 = torch.mean(torch.abs(torch.sqrt(x_hat.var(dim=0, unbiased=False) + 1e-6) - 
                                                     torch.sqrt(x.var(dim=0, unbiased=False) + 1e-6)))
                    g_loss_v2 = torch.mean(torch.abs(x_hat.mean(dim=0) - x.mean(dim=0)))
                    
                    g_loss = g_loss_u + g_loss_u_e + gamma * g_loss_s + 100 * (g_loss_v1 + g_loss_v2)
                    g_loss.backward()
                    self.opt_g.step()
                
                # Train Discriminator
                self.opt_d.zero_grad()
                
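                # Only the discriminator is updated in this step, so its inputs
                # are computed without gradient tracking.
                with torch.no_grad():
                    h = self.embedder(x)
                    e_hat = self.generator(z)
                    h_hat = self.supervisor(e_hat)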
                
                y_real = self.discriminator(h)
                y_fake = self.discriminator(h_hat)
                y_fake_e = self.discriminator(e_hat)
                
                d_loss_real = self.bce_loss(y_real, torch.ones_like(y_real))
                d_loss_fake = self.bce_loss(y_fake, torch.zeros_like(y_fake))
                d_loss_fake_e = self.bce_loss(y_fake_e, torch.zeros_like(y_fake_e))
                
                d_loss = d_loss_real + d_loss_fake + d_loss_fake_e
                
                if d_loss > 0.15:  # Only update if discriminator is not too strong
                    d_loss.backward()
                    self.opt_d.step()
                
                g_losses.append(g_loss.item())
                d_losses.append(d_loss.item())
            
            history['g_loss'].append(np.mean(g_losses))
            history['d_loss'].append(np.mean(d_losses))
        
        return history
    
    def generate(self, n_samples, seq_len):
        """Generate synthetic sequences."""
        self.generator.eval()
        self.supervisor.eval()
        self.recovery.eval()
        
        with torch.no_grad():
            z = torch.randn(n_samples, seq_len, self.latent_dim).to(self.device)
            e_hat = self.generator(z)
            h_hat = self.supervisor(e_hat)
            x_hat = self.recovery(h_hat)
        
        return x_hat.cpu().numpy()

print("TimeGAN model defined.")

TimeGAN model defined.
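
`TimeGAN.train` divides its `epochs` argument across the phases: `epochs // 4` for the embedding network, `epochs // 4` for the supervisor, and `epochs // 2` for joint training, with two generator updates per discriminator step. The helper below (an illustrative name, not part of the notebook) tallies the resulting epoch and update counts, assuming the configured 200 epochs and the 971 training batches reported above are what gets passed to `train`.

```python
def timegan_budget(epochs: int, batches_per_epoch: int, g_steps: int = 2) -> dict:
    """Epochs and optimizer updates implied by the phase split in TimeGAN.train."""
    embed = epochs // 4
    supervise = epochs // 4
    joint = epochs // 2
    return {
        'embedding_epochs': embed,
        'supervisor_epochs': supervise,
        'joint_epochs': joint,
        # Integer division can leave epochs unused when epochs % 4 != 0.
        'unused_epochs': epochs - (embed + supervise + joint),
        'generator_updates': joint * batches_per_epoch * g_steps,
        # Upper bound: the discriminator update is skipped when d_loss <= 0.15.
        'max_discriminator_updates': joint * batches_per_epoch,
    }

print(timegan_budget(200, 971))
```

With these inputs the joint phase runs for 100 epochs, so only half of the epoch budget involves adversarial training.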


### 5.2 LSTM-GAN Architecture

In [32]:
# ============================================================================
# LSTM-GAN Components
# ============================================================================

class LSTMGAN_Generator(nn.Module):
    """LSTM-based Generator for time-series."""
    def __init__(self, latent_dim, hidden_dim, output_dim, num_layers, seq_len):
        super().__init__()
        self.seq_len = seq_len
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        
        self.lstm = nn.LSTM(latent_dim, hidden_dim, num_layers, batch_first=True, dropout=0.2)
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, output_dim),
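            # Tanh bounds the output to (-1, 1). The StandardScaler-normalized data
            # spans [-2.2273, 89.4763], so values outside (-1, 1) cannot be generated.
            nn.Tanh()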
        )
        
    def forward(self, z):
        # z: (batch, seq_len, latent_dim)
        lstm_out, _ = self.lstm(z)
        output = self.fc(lstm_out)
        return output


class LSTMGAN_Discriminator(nn.Module):
    """LSTM-based Discriminator for time-series."""
    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True, dropout=0.2)
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim // 2, 1)
        )
        
    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        lstm_out, _ = self.lstm(x)
        # Use the last timestep for classification
        output = self.fc(lstm_out[:, -1, :])
        return output


class LSTMGAN:
    """Complete LSTM-GAN model."""
    def __init__(self, input_dim, hidden_dim, latent_dim, num_layers, seq_len, device):
        self.device = device
        self.latent_dim = latent_dim
        self.seq_len = seq_len
        
        self.generator = LSTMGAN_Generator(
            latent_dim, hidden_dim, input_dim, num_layers, seq_len
        ).to(device)
        
        self.discriminator = LSTMGAN_Discriminator(
            input_dim, hidden_dim, num_layers
        ).to(device)
        
        self.bce_loss = nn.BCEWithLogitsLoss()
        
    def train(self, train_loader, epochs, lr=0.0002, beta1=0.5, beta2=0.999):
        """Train LSTM-GAN."""
        opt_g = optim.Adam(self.generator.parameters(), lr=lr, betas=(beta1, beta2))
        opt_d = optim.Adam(self.discriminator.parameters(), lr=lr, betas=(beta1, beta2))
        
        history = {'g_loss': [], 'd_loss': []}
        
        for epoch in tqdm(range(epochs)):
            g_losses, d_losses = [], []
            
            for batch in train_loader:
                real_data = batch[0].to(self.device)
                batch_size = real_data.size(0)
                
                # Labels
                real_labels = torch.ones(batch_size, 1).to(self.device)
                fake_labels = torch.zeros(batch_size, 1).to(self.device)
                
                # ---------------------
                # Train Discriminator
                # ---------------------
                opt_d.zero_grad()
                
                # Real samples
                real_output = self.discriminator(real_data)
                d_loss_real = self.bce_loss(real_output, real_labels)
                
                # Fake samples
                z = torch.randn(batch_size, self.seq_len, self.latent_dim).to(self.device)
                fake_data = self.generator(z)
                fake_output = self.discriminator(fake_data.detach())
                d_loss_fake = self.bce_loss(fake_output, fake_labels)
                
                d_loss = d_loss_real + d_loss_fake
                d_loss.backward()
                opt_d.step()
                
                # ---------------------
                # Train Generator
                # ---------------------
                opt_g.zero_grad()
                
                z = torch.randn(batch_size, self.seq_len, self.latent_dim).to(self.device)
                fake_data = self.generator(z)
                fake_output = self.discriminator(fake_data)
                g_loss = self.bce_loss(fake_output, real_labels)
                
                g_loss.backward()
                opt_g.step()
                
                g_losses.append(g_loss.item())
                d_losses.append(d_loss.item())
            
            history['g_loss'].append(np.mean(g_losses))
            history['d_loss'].append(np.mean(d_losses))
            
            if (epoch + 1) % 50 == 0:
                print(f"Epoch [{epoch+1}/{epochs}] D_loss: {history['d_loss'][-1]:.4f}, G_loss: {history['g_loss'][-1]:.4f}")
        
        return history
    
    def generate(self, n_samples):
        """Generate synthetic sequences."""
        self.generator.eval()
        with torch.no_grad():
            z = torch.randn(n_samples, self.seq_len, self.latent_dim).to(self.device)
            fake_data = self.generator(z)
        return fake_data.cpu().numpy()

print("LSTM-GAN model defined.")

LSTM-GAN model defined.
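
The generator ends in `nn.Tanh()`, so each generated value lies strictly between -1 and 1, while the `StandardScaler` output reported earlier spans [-2.2273, 89.4763]. One way to reconcile the two, offered here as an assumption and not as the notebook's method, is to scale each feature to [-1, 1] before training and invert the scaling after generation. The pure-Python sketch below shows the idea for a single feature column; `fit_minmax`, `to_unit`, and `from_unit` are hypothetical helper names.

```python
def fit_minmax(values):
    """Return (lo, span) for one feature column; span falls back to 1.0 if constant."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return lo, (span if span > 0 else 1.0)

def to_unit(values, lo, span):
    """Map values from [lo, lo + span] to [-1, 1]."""
    return [2.0 * (v - lo) / span - 1.0 for v in values]

def from_unit(scaled, lo, span):
    """Invert to_unit, e.g. on generator output."""
    return [(s + 1.0) / 2.0 * span + lo for s in scaled]

# Toy column built from the quartiles of 'announcements' in the statistics table.
column = [0.0, 24.0, 61.0, 179.0, 12661.0]
lo, span = fit_minmax(column)
scaled = to_unit(column, lo, span)
restored = from_unit(scaled, lo, span)
print(min(scaled), max(scaled))  # -1.0 1.0
```

In the notebook itself, scikit-learn's `MinMaxScaler(feature_range=(-1, 1))`, which is already imported, performs the same mapping per column. With heavy-tailed counts most scaled values cluster near -1, so a log transform before scaling may be worth testing.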


### 5.3 DoppelGANger Architecture

In [33]:
# ============================================================================
# DoppelGANger Components
# Based on: "Using GANs for Sharing Networked Time Series Data" (Lin et al., IMC 2020)
# Simplified implementation focused on temporal features
# ============================================================================

class DoppelGANger_AttrGenerator(nn.Module):
    """Attribute generator for metadata/static features."""
    def __init__(self, latent_dim, hidden_dim, attr_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, attr_dim),
            nn.Tanh()
        )
        
    def forward(self, z):
        return self.net(z)


class DoppelGANger_FeatureGenerator(nn.Module):
    """Feature generator for time-series with attention mechanism."""
    def __init__(self, latent_dim, hidden_dim, feature_dim, num_layers, seq_len):
        super().__init__()
        self.seq_len = seq_len
        self.hidden_dim = hidden_dim
        
        # Initial hidden state generator
        self.init_hidden = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim * num_layers),
            nn.Tanh()
        )
        
        # LSTM for temporal generation
        self.lstm = nn.LSTM(latent_dim, hidden_dim, num_layers, batch_first=True)
        
        # Self-attention for capturing long-range dependencies
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        
        # Output projection
        self.output = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, feature_dim),
            nn.Tanh()
        )
        
        self.num_layers = num_layers
        
    def forward(self, z_seq, z_attr=None):
        batch_size = z_seq.size(0)
        
        # Generate initial hidden state
        if z_attr is not None:
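            h0 = self.init_hidden(z_attr)  # (batch, num_layers * hidden_dim)
            # Split per sample first, then move the layer axis to the front.
            # A direct view to (num_layers, batch, hidden_dim) would mix samples.
            h0 = h0.view(batch_size, self.num_layers, self.hidden_dim).transpose(0, 1).contiguous()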
            c0 = torch.zeros_like(h0)
            lstm_out, _ = self.lstm(z_seq, (h0, c0))
        else:
            lstm_out, _ = self.lstm(z_seq)
        
        # Apply self-attention
        attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
        
        # Residual connection
        combined = lstm_out + attn_out
        
        # Output projection
        output = self.output(combined)
        return output


class DoppelGANger_Discriminator(nn.Module):
    """Discriminator: bidirectional LSTM encoder with attention pooling and a real/fake classifier."""
    def __init__(self, feature_dim, hidden_dim, num_layers):
        super().__init__()
        
        # Feature encoder
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers, batch_first=True, bidirectional=True)
        
        # Attention pooling
        self.attention = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        
        # Real/fake classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1)
        )
        
    def forward(self, x):
        # x: (batch, seq_len, feature_dim)
        lstm_out, _ = self.lstm(x)  # (batch, seq_len, hidden_dim * 2)
        
        # Attention-weighted pooling
        attn_weights = torch.softmax(self.attention(lstm_out), dim=1)
        context = torch.sum(attn_weights * lstm_out, dim=1)  # (batch, hidden_dim * 2)
        
        # Classification
        validity = self.classifier(context)
        return validity


class DoppelGANger:
    """Complete DoppelGANger model."""
    def __init__(self, feature_dim, hidden_dim, latent_dim, num_layers, seq_len, device):
        self.device = device
        self.latent_dim = latent_dim
        self.seq_len = seq_len
        
        self.feature_gen = DoppelGANger_FeatureGenerator(
            latent_dim, hidden_dim, feature_dim, num_layers, seq_len
        ).to(device)
        
        self.attr_gen = DoppelGANger_AttrGenerator(
            latent_dim, hidden_dim, latent_dim  # Attr dim = latent dim for conditioning
        ).to(device)
        
        self.discriminator = DoppelGANger_Discriminator(
            feature_dim, hidden_dim, num_layers
        ).to(device)
        
        self.bce_loss = nn.BCEWithLogitsLoss()
        self.mse_loss = nn.MSELoss()
        
    def train(self, train_loader, epochs, lr=0.0002, beta1=0.5, beta2=0.999, n_critic=5):
        """Train DoppelGANger with a BCE adversarial loss and n_critic discriminator updates per generator update."""
        g_params = list(self.feature_gen.parameters()) + list(self.attr_gen.parameters())
        opt_g = optim.Adam(g_params, lr=lr, betas=(beta1, beta2))
        opt_d = optim.Adam(self.discriminator.parameters(), lr=lr, betas=(beta1, beta2))
        
        history = {'g_loss': [], 'd_loss': []}
        
        for epoch in tqdm(range(epochs)):
            g_losses, d_losses = [], []
            
            for batch in train_loader:
                real_data = batch[0].to(self.device)
                batch_size = real_data.size(0)
                
                real_labels = torch.ones(batch_size, 1).to(self.device)
                fake_labels = torch.zeros(batch_size, 1).to(self.device)
                
                # ---------------------
                # Train Discriminator
                # ---------------------
                for _ in range(n_critic):
                    opt_d.zero_grad()
                    
                    # Real samples
                    real_output = self.discriminator(real_data)
                    d_loss_real = self.bce_loss(real_output, real_labels)
                    
                    # Fake samples
                    z_attr = torch.randn(batch_size, self.latent_dim).to(self.device)
                    z_seq = torch.randn(batch_size, self.seq_len, self.latent_dim).to(self.device)
                    
                    attr_fake = self.attr_gen(z_attr)
                    fake_data = self.feature_gen(z_seq, attr_fake)
                    
                    fake_output = self.discriminator(fake_data.detach())
                    d_loss_fake = self.bce_loss(fake_output, fake_labels)
                    
                    d_loss = d_loss_real + d_loss_fake
                    d_loss.backward()
                    opt_d.step()
                
                # ---------------------
                # Train Generator
                # ---------------------
                opt_g.zero_grad()
                
                z_attr = torch.randn(batch_size, self.latent_dim).to(self.device)
                z_seq = torch.randn(batch_size, self.seq_len, self.latent_dim).to(self.device)
                
                attr_fake = self.attr_gen(z_attr)
                fake_data = self.feature_gen(z_seq, attr_fake)
                
                fake_output = self.discriminator(fake_data)
                g_loss = self.bce_loss(fake_output, real_labels)
                
                # Add feature matching loss
                real_mean = real_data.mean(dim=[0, 1])
                fake_mean = fake_data.mean(dim=[0, 1])
                g_loss += 10 * self.mse_loss(fake_mean, real_mean)
                
                g_loss.backward()
                opt_g.step()
                
                g_losses.append(g_loss.item())
                d_losses.append(d_loss.item())
            
            history['g_loss'].append(np.mean(g_losses))
            history['d_loss'].append(np.mean(d_losses))
            
            if (epoch + 1) % 50 == 0:
                print(f"Epoch [{epoch+1}/{epochs}] D_loss: {history['d_loss'][-1]:.4f}, G_loss: {history['g_loss'][-1]:.4f}")
        
        return history
    
    def generate(self, n_samples):
        """Generate synthetic sequences."""
        self.feature_gen.eval()
        self.attr_gen.eval()
        
        with torch.no_grad():
            z_attr = torch.randn(n_samples, self.latent_dim).to(self.device)
            z_seq = torch.randn(n_samples, self.seq_len, self.latent_dim).to(self.device)
            
            attr_fake = self.attr_gen(z_attr)
            fake_data = self.feature_gen(z_seq, attr_fake)
        
        return fake_data.cpu().numpy()

print("DoppelGANger model defined.")

DoppelGANger model defined.
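
The feature-matching term in the generator step compares only per-feature means, pooled over the batch and time axes. The following is a minimal NumPy sketch of that term, assuming the same weight of 10 used in `train`. `feature_matching_penalty` is a hypothetical helper name for illustration and is not part of the notebook.

```python
import numpy as np

def feature_matching_penalty(real, fake, weight=10.0):
    """Weighted MSE between per-feature means of two (batch, seq_len, features) arrays."""
    real_mean = real.mean(axis=(0, 1))   # one mean per feature
    fake_mean = fake.mean(axis=(0, 1))
    return weight * np.mean((fake_mean - real_mean) ** 2)

real = np.zeros((4, 30, 3))

# Synthetic batch whose features are all offset by 0.5
fake_offset = np.full((4, 30, 3), 0.5)
penalty = feature_matching_penalty(real, fake_offset)

# Synthetic batch with the right mean (0) but a very different spread (+1 / -1)
alternating = np.where(np.arange(30) % 2 == 0, 1.0, -1.0)
fake_same_mean = np.ones((4, 30, 3)) * alternating[None, :, None]
penalty_same_mean = feature_matching_penalty(real, fake_same_mean)

print(penalty, penalty_same_mean)
```

Because only first moments are compared, a generator that matches the means but not the variances incurs no penalty from this term. The adversarial loss has to account for everything else.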


## Part 6: Evaluation Metrics

In [34]:
# ============================================================================
# Evaluation Metrics for Time-Series GANs
# ============================================================================

class TimeSeriesEvaluator:
    """Comprehensive evaluation metrics for synthetic time-series."""
    
    def __init__(self, feature_names):
        self.feature_names = feature_names
        
    def compute_all_metrics(self, real_data, synthetic_data):
        """
        Compute all evaluation metrics.
        
        Args:
            real_data: numpy array (n_samples, seq_len, n_features)
            synthetic_data: numpy array (n_samples, seq_len, n_features)
            
        Returns:
            Dictionary of metrics
        """
        metrics = {}
        
        # Distribution metrics
        metrics['distribution'] = self._compute_distribution_metrics(real_data, synthetic_data)
        
        # Temporal metrics
        metrics['temporal'] = self._compute_temporal_metrics(real_data, synthetic_data)
        
        # Correlation metrics
        metrics['correlation'] = self._compute_correlation_metrics(real_data, synthetic_data)
        
        # Compute overall score
        metrics['overall_score'] = self._compute_overall_score(metrics)
        
        return metrics
    
    def _compute_distribution_metrics(self, real_data, synthetic_data):
        """Compare marginal distributions using various metrics."""
        metrics = {}
        
        # Flatten temporal dimension for marginal comparison
        real_flat = real_data.reshape(-1, real_data.shape[-1])
        syn_flat = synthetic_data.reshape(-1, synthetic_data.shape[-1])
        
        # KS statistic for each feature
        ks_stats = []
        wasserstein_dists = []
        
        for i in range(real_flat.shape[1]):
            ks_stat, _ = stats.ks_2samp(real_flat[:, i], syn_flat[:, i])
            ks_stats.append(ks_stat)
            
            # Wasserstein distance
            wd = stats.wasserstein_distance(real_flat[:, i], syn_flat[:, i])
            wasserstein_dists.append(wd)
        
        metrics['ks_stats'] = dict(zip(self.feature_names, ks_stats))
        metrics['ks_mean'] = np.mean(ks_stats)
        metrics['wasserstein_dists'] = dict(zip(self.feature_names, wasserstein_dists))
        metrics['wasserstein_mean'] = np.mean(wasserstein_dists)
        
        # Mean and std comparison
        real_mean = real_flat.mean(axis=0)
        syn_mean = syn_flat.mean(axis=0)
        real_std = real_flat.std(axis=0)
        syn_std = syn_flat.std(axis=0)
        
        metrics['mean_error'] = np.mean(np.abs(real_mean - syn_mean))
        metrics['std_error'] = np.mean(np.abs(real_std - syn_std))
        
        return metrics
    
    def _compute_temporal_metrics(self, real_data, synthetic_data):
        """Compare temporal structure."""
        metrics = {}
        
        # Autocorrelation comparison
        acf_errors = []
        max_lag = min(20, real_data.shape[1] - 1)
        
        for i in range(real_data.shape[-1]):
            # Average ACF across samples
            real_acfs = []
            syn_acfs = []
            
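            for j in range(min(100, real_data.shape[0])):
                try:
                    r_acf = acf(real_data[j, :, i], nlags=max_lag, fft=True)
                except Exception:
                    continue  # skip sequences where the ACF cannot be computed
                # A zero-variance window has an undefined ACF (NaN); keep finite results only
                if np.all(np.isfinite(r_acf)):
                    real_acfs.append(r_acf)
                    
            for j in range(min(100, synthetic_data.shape[0])):
                try:
                    s_acf = acf(synthetic_data[j, :, i], nlags=max_lag, fft=True)
                except Exception:
                    continue  # skip sequences where the ACF cannot be computed
                if np.all(np.isfinite(s_acf)):
                    syn_acfs.append(s_acf)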
            
            if real_acfs and syn_acfs:
                real_acf_mean = np.mean(real_acfs, axis=0)
                syn_acf_mean = np.mean(syn_acfs, axis=0)
                acf_error = np.mean(np.abs(real_acf_mean - syn_acf_mean))
                acf_errors.append(acf_error)
        
        metrics['acf_error'] = np.mean(acf_errors) if acf_errors else float('inf')
        
        # Burstiness comparison
        real_burst = self._compute_burstiness(real_data)
        syn_burst = self._compute_burstiness(synthetic_data)
        metrics['burstiness_error'] = np.mean(np.abs(real_burst - syn_burst))
        
        return metrics
    
    def _compute_burstiness(self, data):
        """Compute burstiness measure for each feature."""
        # Burstiness = (std - mean) / (std + mean)
        flat = data.reshape(-1, data.shape[-1])
        mean = flat.mean(axis=0)
        std = flat.std(axis=0)
        burstiness = (std - mean) / (std + mean + 1e-10)
        return burstiness
    
    def _compute_correlation_metrics(self, real_data, synthetic_data):
        """Compare feature correlations."""
        metrics = {}
        
        # Flatten for correlation matrix
        real_flat = real_data.reshape(-1, real_data.shape[-1])
        syn_flat = synthetic_data.reshape(-1, synthetic_data.shape[-1])
        
        # Feature correlation matrices
        real_corr = np.corrcoef(real_flat.T)
        syn_corr = np.corrcoef(syn_flat.T)
        
        # Handle NaN correlations
        real_corr = np.nan_to_num(real_corr)
        syn_corr = np.nan_to_num(syn_corr)
        
        # Frobenius norm of difference
        metrics['corr_matrix_error'] = np.linalg.norm(real_corr - syn_corr, 'fro')
        metrics['corr_matrix_error_normalized'] = metrics['corr_matrix_error'] / np.sqrt(real_corr.size)
        
        # Pairwise (lag-0) feature correlation error over the upper triangle
        cross_corr_errors = []
        for i in range(real_data.shape[-1]):
            for j in range(i + 1, real_data.shape[-1]):
                real_xcorr = np.corrcoef(real_flat[:, i], real_flat[:, j])[0, 1]
                syn_xcorr = np.corrcoef(syn_flat[:, i], syn_flat[:, j])[0, 1]
                if not np.isnan(real_xcorr) and not np.isnan(syn_xcorr):
                    cross_corr_errors.append(abs(real_xcorr - syn_xcorr))
        
        metrics['cross_corr_error'] = np.mean(cross_corr_errors) if cross_corr_errors else float('inf')
        
        return metrics
    
    def _compute_overall_score(self, metrics):
        """Compute overall quality score (lower is better)."""
        # Weighted combination of metrics
        score = (
            0.3 * metrics['distribution']['ks_mean'] +
            0.2 * metrics['distribution']['wasserstein_mean'] +
            0.2 * metrics['temporal']['acf_error'] +
            0.15 * metrics['temporal']['burstiness_error'] +
            0.15 * metrics['correlation']['corr_matrix_error_normalized']
        )
        return score

print("Evaluation metrics defined.")

Evaluation metrics defined.
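
The KS statistic reported by `_compute_distribution_metrics` is the largest vertical gap between the two empirical CDFs. The sketch below spells out that definition in plain NumPy, for intuition only. The notebook itself relies on `scipy.stats.ks_2samp`, and `ks_statistic` is a hypothetical name.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at observed points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

ks_same = ks_statistic([1, 2, 3], [1, 2, 3])            # identical samples
ks_disjoint = ks_statistic([0, 1, 2], [10, 11, 12])     # no overlap at all
ks_half = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])      # half of the mass shifted

print(ks_same, ks_disjoint, ks_half)
```

The statistic is bounded between 0 and 1 regardless of feature scale. The Wasserstein distance in the same metric group is not bounded, so it depends on how the features were scaled.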


## Part 7: Visualization Functions

In [35]:
# ============================================================================
# Visualization Functions
# ============================================================================

def plot_training_history(histories, model_names, save_path=None):
    """Plot training histories for all models."""
    fig, axes = plt.subplots(len(model_names), 2, figsize=(14, 4*len(model_names)))
    
    for idx, (name, history) in enumerate(zip(model_names, histories)):
        ax_g = axes[idx, 0] if len(model_names) > 1 else axes[0]
        ax_d = axes[idx, 1] if len(model_names) > 1 else axes[1]
        
        ax_g.plot(history['g_loss'], label='Generator Loss', color='blue')
        ax_g.set_title(f'{name} - Generator Loss')
        ax_g.set_xlabel('Epoch')
        ax_g.set_ylabel('Loss')
        ax_g.legend()
        ax_g.grid(True, alpha=0.3)
        
        ax_d.plot(history['d_loss'], label='Discriminator Loss', color='red')
        ax_d.set_title(f'{name} - Discriminator Loss')
        ax_d.set_xlabel('Epoch')
        ax_d.set_ylabel('Loss')
        ax_d.legend()
        ax_d.grid(True, alpha=0.3)
    
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()


def plot_distribution_comparison(real_data, synthetic_data, feature_names, model_name, n_features=6, save_path=None):
    """Plot distribution comparison histograms."""
    real_flat = real_data.reshape(-1, real_data.shape[-1])
    syn_flat = synthetic_data.reshape(-1, synthetic_data.shape[-1])
    
    n_cols = 3
    n_rows = (n_features + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4*n_rows))
    axes = axes.flatten()
    
    for i in range(min(n_features, len(feature_names))):
        ax = axes[i]
        ax.hist(real_flat[:, i], bins=50, alpha=0.5, label='Real', density=True, color='blue')
        ax.hist(syn_flat[:, i], bins=50, alpha=0.5, label='Synthetic', density=True, color='red')
        ax.set_title(f'{feature_names[i]}')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    # Hide unused subplots (including when fewer features than n_features are available)
    for i in range(min(n_features, len(feature_names)), len(axes)):
        axes[i].set_visible(False)
    
    plt.suptitle(f'{model_name} - Distribution Comparison', fontsize=14)
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()


def plot_correlation_comparison(real_data, synthetic_data, feature_names, model_name, save_path=None):
    """Plot correlation matrix comparison."""
    real_flat = real_data.reshape(-1, real_data.shape[-1])
    syn_flat = synthetic_data.reshape(-1, synthetic_data.shape[-1])
    
    real_corr = np.corrcoef(real_flat.T)
    syn_corr = np.corrcoef(syn_flat.T)
    
    # Handle NaN
    real_corr = np.nan_to_num(real_corr)
    syn_corr = np.nan_to_num(syn_corr)
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Truncate feature names for display
    short_names = [f[:10] for f in feature_names]
    
    # Real correlation
    sns.heatmap(real_corr, ax=axes[0], cmap='coolwarm', center=0, 
                xticklabels=short_names, yticklabels=short_names, vmin=-1, vmax=1)
    axes[0].set_title('Real Data Correlation')
    axes[0].tick_params(axis='x', rotation=90, labelsize=6)
    axes[0].tick_params(axis='y', rotation=0, labelsize=6)
    
    # Synthetic correlation
    sns.heatmap(syn_corr, ax=axes[1], cmap='coolwarm', center=0,
                xticklabels=short_names, yticklabels=short_names, vmin=-1, vmax=1)
    axes[1].set_title(f'{model_name} Synthetic Correlation')
    axes[1].tick_params(axis='x', rotation=90, labelsize=6)
    axes[1].tick_params(axis='y', rotation=0, labelsize=6)
    
    # Difference
    diff = real_corr - syn_corr
    sns.heatmap(diff, ax=axes[2], cmap='RdBu', center=0,
                xticklabels=short_names, yticklabels=short_names, vmin=-1, vmax=1)
    axes[2].set_title('Difference (Real - Synthetic)')
    axes[2].tick_params(axis='x', rotation=90, labelsize=6)
    axes[2].tick_params(axis='y', rotation=0, labelsize=6)
    
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()


def plot_temporal_comparison(real_data, synthetic_data, feature_names, model_name, n_features=4, save_path=None):
    """Plot temporal patterns comparison."""
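    # squeeze=False keeps axes two-dimensional, so axes[i, 0] also works when n_features == 1
    fig, axes = plt.subplots(n_features, 2, figsize=(14, 3*n_features), squeeze=False)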
    
    for i in range(min(n_features, len(feature_names))):
        # Sample sequences
        real_sample = real_data[np.random.randint(0, real_data.shape[0]), :, i]
        syn_sample = synthetic_data[np.random.randint(0, synthetic_data.shape[0]), :, i]
        
        # Time series plot
        axes[i, 0].plot(real_sample, label='Real', color='blue', alpha=0.7)
        axes[i, 0].plot(syn_sample, label='Synthetic', color='red', alpha=0.7)
        axes[i, 0].set_title(f'{feature_names[i]} - Sample Sequence')
        axes[i, 0].legend()
        axes[i, 0].grid(True, alpha=0.3)
        
        # Autocorrelation plot
        max_lag = min(20, real_data.shape[1] - 1)
        try:
            real_acfs = [acf(real_data[j, :, i], nlags=max_lag, fft=True) 
                         for j in range(min(50, real_data.shape[0]))]
            syn_acfs = [acf(synthetic_data[j, :, i], nlags=max_lag, fft=True) 
                        for j in range(min(50, synthetic_data.shape[0]))]
            
            real_acf_mean = np.mean(real_acfs, axis=0)
            syn_acf_mean = np.mean(syn_acfs, axis=0)
            
            axes[i, 1].plot(real_acf_mean, label='Real', color='blue')
            axes[i, 1].plot(syn_acf_mean, label='Synthetic', color='red')
            axes[i, 1].set_title(f'{feature_names[i]} - Autocorrelation')
            axes[i, 1].legend()
            axes[i, 1].grid(True, alpha=0.3)
        except Exception:
            axes[i, 1].text(0.5, 0.5, 'ACF computation failed', ha='center', va='center')
    
    plt.suptitle(f'{model_name} - Temporal Comparison', fontsize=14)
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()


def plot_model_comparison(all_metrics, model_names, save_path=None):
    """Plot comparison of all models."""
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Extract metrics
    ks_means = [m['distribution']['ks_mean'] for m in all_metrics]
    wasserstein_means = [m['distribution']['wasserstein_mean'] for m in all_metrics]
    acf_errors = [m['temporal']['acf_error'] for m in all_metrics]
    corr_errors = [m['correlation']['corr_matrix_error_normalized'] for m in all_metrics]
    overall_scores = [m['overall_score'] for m in all_metrics]
    
    x = np.arange(len(model_names))
    
    # KS Statistics
    axes[0, 0].bar(x, ks_means, color=['blue', 'green', 'red'][:len(model_names)])
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels(model_names)
    axes[0, 0].set_title('KS Statistic (lower is better)')
    axes[0, 0].set_ylabel('Mean KS Statistic')
    
    # Wasserstein Distance
    axes[0, 1].bar(x, wasserstein_means, color=['blue', 'green', 'red'][:len(model_names)])
    axes[0, 1].set_xticks(x)
    axes[0, 1].set_xticklabels(model_names)
    axes[0, 1].set_title('Wasserstein Distance (lower is better)')
    axes[0, 1].set_ylabel('Mean Wasserstein Distance')
    
    # ACF Error
    axes[1, 0].bar(x, acf_errors, color=['blue', 'green', 'red'][:len(model_names)])
    axes[1, 0].set_xticks(x)
    axes[1, 0].set_xticklabels(model_names)
    axes[1, 0].set_title('Autocorrelation Error (lower is better)')
    axes[1, 0].set_ylabel('Mean ACF Error')
    
    # Overall Score
    colors = ['blue', 'green', 'red'][:len(model_names)]
    bars = axes[1, 1].bar(x, overall_scores, color=colors)
    axes[1, 1].set_xticks(x)
    axes[1, 1].set_xticklabels(model_names)
    axes[1, 1].set_title('Overall Score (lower is better)')
    axes[1, 1].set_ylabel('Score')
    
    # Highlight best model
    best_idx = np.argmin(overall_scores)
    bars[best_idx].set_edgecolor('gold')
    bars[best_idx].set_linewidth(3)
    
    plt.suptitle('Model Comparison Summary', fontsize=14)
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()
    
    return model_names[best_idx]

print("Visualization functions defined.")

Visualization functions defined.
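
The autocorrelation panels average one ACF per sequence. The sketch below shows that averaging with a hand-written estimator. It is intended to mirror the default (non-adjusted) estimator of `statsmodels`' `acf`, but that correspondence is an assumption here and not something the notebook states. `sample_acf` and `mean_acf` are hypothetical names.

```python
import numpy as np

def sample_acf(x, nlags):
    """Biased sample autocorrelation for lags 0..nlags; NaN for zero-variance input."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:
        # A constant window has no variance, so the ratio is undefined
        return np.full(nlags + 1, np.nan)
    return np.array([np.dot(x[: x.size - k], x[k:]) / denom for k in range(nlags + 1)])

def mean_acf(sequences, nlags):
    """Average the per-sequence ACFs, ignoring sequences whose ACF is undefined."""
    acfs = [sample_acf(seq, nlags) for seq in sequences]
    acfs = [a for a in acfs if np.all(np.isfinite(a))]
    return np.mean(acfs, axis=0) if acfs else None

alternating = np.where(np.arange(30) % 2 == 0, 1.0, -1.0)  # strong negative lag-1 correlation
constant = np.zeros(30)                                    # e.g. a quiet window of all zeros

acf_alt = sample_acf(alternating, 2)
acf_const = sample_acf(constant, 2)
acf_avg = mean_acf([alternating, constant], 2)
print(acf_alt, acf_const, acf_avg)
```

Dropping undefined ACFs before averaging matters for sparse counters: one constant window would otherwise turn the whole averaged curve into NaN.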


## Part 8: Model Training

In [36]:
# ============================================================================
# Initialize Models
# ============================================================================

num_features = len(available_features)
seq_len = CONFIG['sequence_length']
hidden_dim = CONFIG['hidden_dim']
latent_dim = CONFIG['latent_dim']
num_layers = CONFIG['num_layers']

print("="*60)
print("Model Configuration")
print("="*60)
print(f"Number of features: {num_features}")
print(f"Sequence length: {seq_len}")
print(f"Hidden dimension: {hidden_dim}")
print(f"Latent dimension: {latent_dim}")
print(f"Number of layers: {num_layers}")
print(f"Device: {device}")

Model Configuration
Number of features: 27
Sequence length: 30
Hidden dimension: 128
Latent dimension: 32
Number of layers: 3
Device: cpu


In [37]:
# Initialize TimeGAN
print("\n" + "="*60)
print("Initializing TimeGAN...")
print("="*60)

timegan = TimeGAN(
    input_dim=num_features,
    hidden_dim=hidden_dim,
    latent_dim=latent_dim,
    num_layers=num_layers,
    device=device
)

print("TimeGAN initialized.")


Initializing TimeGAN...
TimeGAN initialized.


In [None]:
# Train TimeGAN
print("\n" + "="*60)
print("Training TimeGAN...")
print("="*60)

timegan_history = timegan.train(
    train_loader=train_loader,
    epochs=CONFIG['epochs'],
    lr=CONFIG['learning_rate']
)

print("TimeGAN training complete.")


Training TimeGAN...
Phase 1: Training Embedding Network...


100%|██████████| 50/50 [2:35:01<00:00, 186.03s/it]  


Phase 2: Training Supervised Network...


100%|██████████| 50/50 [1:57:38<00:00, 141.17s/it]


Phase 3 & 4: Joint Training...


 41%|████      | 41/100 [16:05:08<18:37:12, 1136.15s/it]
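
The last progress bar shows the joint-training phase stopping at epoch 41 of 100 with a projected 18:37:12 still to go. The remaining-time figure is the number of outstanding iterations multiplied by the current seconds per iteration, which the short sketch below reproduces. The helper names are hypothetical.

```python
def remaining_seconds(done, total, sec_per_it):
    """Projected time left, assuming the current per-iteration rate holds."""
    return (total - done) * sec_per_it

def fmt_hms(seconds):
    """Format a duration in seconds as H:MM:SS, truncating fractions of a second."""
    s = int(seconds)
    return f"{s // 3600}:{(s % 3600) // 60:02d}:{s % 60:02d}"

# Values read from the progress bar above: 41/100 epochs at 1136.15 s per epoch
eta = fmt_hms(remaining_seconds(41, 100, 1136.15))
print(eta)
```

At that rate the joint phase alone would need roughly 31.5 hours on CPU, which is worth weighing against a smaller `CONFIG['epochs']` or a GPU run.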

In [None]:
# Initialize LSTM-GAN
print("\n" + "="*60)
print("Initializing LSTM-GAN...")
print("="*60)

lstmgan = LSTMGAN(
    input_dim=num_features,
    hidden_dim=hidden_dim,
    latent_dim=latent_dim,
    num_layers=num_layers,
    seq_len=seq_len,
    device=device
)

print("LSTM-GAN initialized.")

In [None]:
# Train LSTM-GAN
print("\n" + "="*60)
print("Training LSTM-GAN...")
print("="*60)

lstmgan_history = lstmgan.train(
    train_loader=train_loader,
    epochs=CONFIG['epochs'],
    lr=CONFIG['learning_rate'],
    beta1=CONFIG['beta1'],
    beta2=CONFIG['beta2']
)

print("LSTM-GAN training complete.")

In [None]:
# Initialize DoppelGANger
print("\n" + "="*60)
print("Initializing DoppelGANger...")
print("="*60)

doppelganger = DoppelGANger(
    feature_dim=num_features,
    hidden_dim=hidden_dim,
    latent_dim=latent_dim,
    num_layers=num_layers,
    seq_len=seq_len,
    device=device
)

print("DoppelGANger initialized.")

In [None]:
# Train DoppelGANger
print("\n" + "="*60)
print("Training DoppelGANger...")
print("="*60)

doppelganger_history = doppelganger.train(
    train_loader=train_loader,
    epochs=CONFIG['epochs'],
    lr=CONFIG['learning_rate'],
    beta1=CONFIG['beta1'],
    beta2=CONFIG['beta2']
)

print("DoppelGANger training complete.")

In [None]:
# Plot training histories
print("\n" + "="*60)
print("Plotting Training Histories")
print("="*60)

histories = [timegan_history, lstmgan_history, doppelganger_history]
model_names = ['TimeGAN', 'LSTM-GAN', 'DoppelGANger']

plot_training_history(
    histories=histories,
    model_names=model_names,
    save_path=os.path.join(CONFIG['output_dir'], 'training_history.png')
)

## Part 9: Generate Synthetic Data and Evaluate

In [None]:
# Generate synthetic data from each model
print("="*60)
print("Generating Synthetic Data")
print("="*60)

n_synthetic = CONFIG['n_synthetic_samples']

# TimeGAN
print(f"Generating {n_synthetic} samples from TimeGAN...")
timegan_synthetic = timegan.generate(n_synthetic, seq_len)
print(f"TimeGAN synthetic shape: {timegan_synthetic.shape}")

# LSTM-GAN
print(f"Generating {n_synthetic} samples from LSTM-GAN...")
lstmgan_synthetic = lstmgan.generate(n_synthetic)
print(f"LSTM-GAN synthetic shape: {lstmgan_synthetic.shape}")

# DoppelGANger
print(f"Generating {n_synthetic} samples from DoppelGANger...")
doppelganger_synthetic = doppelganger.generate(n_synthetic)
print(f"DoppelGANger synthetic shape: {doppelganger_synthetic.shape}")

In [None]:
# Initialize evaluator
evaluator = TimeSeriesEvaluator(available_features)

# Evaluate each model
print("\n" + "="*60)
print("Evaluating Models")
print("="*60)

# Use test set for evaluation
real_test_data = X_test

print("\nEvaluating TimeGAN...")
timegan_metrics = evaluator.compute_all_metrics(real_test_data, timegan_synthetic)

print("\nEvaluating LSTM-GAN...")
lstmgan_metrics = evaluator.compute_all_metrics(real_test_data, lstmgan_synthetic)

print("\nEvaluating DoppelGANger...")
doppelganger_metrics = evaluator.compute_all_metrics(real_test_data, doppelganger_synthetic)

In [None]:
# Print detailed metrics
def print_metrics(model_name, metrics):
    print(f"\n{'='*60}")
    print(f"{model_name} Evaluation Results")
    print(f"{'='*60}")
    
    print("\n--- Distribution Metrics ---")
    print(f"  Mean KS Statistic: {metrics['distribution']['ks_mean']:.4f}")
    print(f"  Mean Wasserstein Distance: {metrics['distribution']['wasserstein_mean']:.4f}")
    print(f"  Mean Error: {metrics['distribution']['mean_error']:.4f}")
    print(f"  Std Error: {metrics['distribution']['std_error']:.4f}")
    
    print("\n--- Temporal Metrics ---")
    print(f"  ACF Error: {metrics['temporal']['acf_error']:.4f}")
    print(f"  Burstiness Error: {metrics['temporal']['burstiness_error']:.4f}")
    
    print("\n--- Correlation Metrics ---")
    print(f"  Correlation Matrix Error: {metrics['correlation']['corr_matrix_error']:.4f}")
    print(f"  Normalized Corr Error: {metrics['correlation']['corr_matrix_error_normalized']:.4f}")
    print(f"  Cross-Correlation Error: {metrics['correlation']['cross_corr_error']:.4f}")
    
    print(f"\n>>> OVERALL SCORE: {metrics['overall_score']:.4f} (lower is better)")

print_metrics('TimeGAN', timegan_metrics)
print_metrics('LSTM-GAN', lstmgan_metrics)
print_metrics('DoppelGANger', doppelganger_metrics)

In [None]:
# Distribution comparison plots for each model
print("\n" + "="*60)
print("Plotting Distribution Comparisons")
print("="*60)

plot_distribution_comparison(
    real_test_data, timegan_synthetic, available_features, 'TimeGAN',
    save_path=os.path.join(CONFIG['output_dir'], 'timegan_distributions.png')
)

plot_distribution_comparison(
    real_test_data, lstmgan_synthetic, available_features, 'LSTM-GAN',
    save_path=os.path.join(CONFIG['output_dir'], 'lstmgan_distributions.png')
)

plot_distribution_comparison(
    real_test_data, doppelganger_synthetic, available_features, 'DoppelGANger',
    save_path=os.path.join(CONFIG['output_dir'], 'doppelganger_distributions.png')
)

In [None]:
# Correlation comparison plots
print("\n" + "="*60)
print("Plotting Correlation Comparisons")
print("="*60)

plot_correlation_comparison(
    real_test_data, timegan_synthetic, available_features, 'TimeGAN',
    save_path=os.path.join(CONFIG['output_dir'], 'timegan_correlation.png')
)

plot_correlation_comparison(
    real_test_data, lstmgan_synthetic, available_features, 'LSTM-GAN',
    save_path=os.path.join(CONFIG['output_dir'], 'lstmgan_correlation.png')
)

plot_correlation_comparison(
    real_test_data, doppelganger_synthetic, available_features, 'DoppelGANger',
    save_path=os.path.join(CONFIG['output_dir'], 'doppelganger_correlation.png')
)

In [None]:
# Temporal comparison plots
print("\n" + "="*60)
print("Plotting Temporal Comparisons")
print("="*60)

plot_temporal_comparison(
    real_test_data, timegan_synthetic, available_features, 'TimeGAN',
    save_path=os.path.join(CONFIG['output_dir'], 'timegan_temporal.png')
)

plot_temporal_comparison(
    real_test_data, lstmgan_synthetic, available_features, 'LSTM-GAN',
    save_path=os.path.join(CONFIG['output_dir'], 'lstmgan_temporal.png')
)

plot_temporal_comparison(
    real_test_data, doppelganger_synthetic, available_features, 'DoppelGANger',
    save_path=os.path.join(CONFIG['output_dir'], 'doppelganger_temporal.png')
)

## Part 10: Model Selection and Best Model

In [None]:
# Compare all models
print("="*60)
print("Model Comparison and Selection")
print("="*60)

all_metrics = [timegan_metrics, lstmgan_metrics, doppelganger_metrics]
model_names = ['TimeGAN', 'LSTM-GAN', 'DoppelGANger']

best_model_name = plot_model_comparison(
    all_metrics=all_metrics,
    model_names=model_names,
    save_path=os.path.join(CONFIG['output_dir'], 'model_comparison.png')
)

print(f"\n{'='*60}")
print(f"BEST MODEL: {best_model_name}")
print(f"{'='*60}")
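
Model selection comes down to a weighted sum of five error terms followed by an arg-min. The sketch below restates that using the weights from `_compute_overall_score`. The metric values are made up for illustration, and `overall_score` and `pick_best` are hypothetical helpers. It also shows one caveat: `np.argmin` returns the position of a NaN score if one is present, whereas `np.nanargmin` skips it.

```python
import numpy as np

# Weights as used in _compute_overall_score (they sum to 1)
WEIGHTS = {
    ("distribution", "ks_mean"): 0.30,
    ("distribution", "wasserstein_mean"): 0.20,
    ("temporal", "acf_error"): 0.20,
    ("temporal", "burstiness_error"): 0.15,
    ("correlation", "corr_matrix_error_normalized"): 0.15,
}

def overall_score(metrics):
    """Weighted sum of error terms; lower is better."""
    return sum(w * metrics[group][key] for (group, key), w in WEIGHTS.items())

def pick_best(names, scores):
    """Return the name with the lowest finite score, ignoring NaN entries."""
    scores = np.asarray(scores, dtype=float)
    if np.all(np.isnan(scores)):
        raise ValueError("no model has a defined score")
    return names[int(np.nanargmin(scores))]

# Illustrative metrics only: every error term set to 1.0
unit_metrics = {
    "distribution": {"ks_mean": 1.0, "wasserstein_mean": 1.0},
    "temporal": {"acf_error": 1.0, "burstiness_error": 1.0},
    "correlation": {"corr_matrix_error_normalized": 1.0},
}
unit_score = overall_score(unit_metrics)

names = ["TimeGAN", "LSTM-GAN", "DoppelGANger"]
scores = [0.5, float("nan"), 0.2]          # made-up scores with one undefined entry
naive_idx = int(np.argmin(scores))         # lands on the NaN entry
best = pick_best(names, scores)
print(unit_score, names[naive_idx], best)
```

Since the Wasserstein term is unbounded while the KS term lies in [0, 1], the effective influence of each weight depends on feature scaling.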

In [None]:
# Summary table
summary_data = {
    'Model': model_names,
    'KS Statistic': [m['distribution']['ks_mean'] for m in all_metrics],
    'Wasserstein': [m['distribution']['wasserstein_mean'] for m in all_metrics],
    'ACF Error': [m['temporal']['acf_error'] for m in all_metrics],
    'Burstiness Error': [m['temporal']['burstiness_error'] for m in all_metrics],
    'Correlation Error': [m['correlation']['corr_matrix_error_normalized'] for m in all_metrics],
    'Overall Score': [m['overall_score'] for m in all_metrics]
}

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df.round(4)
summary_df.to_csv(os.path.join(CONFIG['output_dir'], 'model_comparison_summary.csv'), index=False)

print("\nModel Comparison Summary:")
print(summary_df.to_string(index=False))

## Part 11: Generate Final Synthetic Normal Traffic

In [None]:
# Select best model
best_model_map = {
    'TimeGAN': timegan,
    'LSTM-GAN': lstmgan,
    'DoppelGANger': doppelganger
}
best_synthetic_map = {
    'TimeGAN': timegan_synthetic,
    'LSTM-GAN': lstmgan_synthetic,
    'DoppelGANger': doppelganger_synthetic
}

best_model = best_model_map[best_model_name]
best_synthetic = best_synthetic_map[best_model_name]

print(f"Using {best_model_name} for final synthetic traffic generation.")

In [None]:
# Generate larger set of synthetic normal traffic
print("\n" + "="*60)
print("Generating Final Synthetic Normal Traffic")
print("="*60)

n_final_samples = 5000  # Adjust as needed

if best_model_name == 'TimeGAN':
    final_synthetic = best_model.generate(n_final_samples, seq_len)
else:
    final_synthetic = best_model.generate(n_final_samples)

print(f"Generated synthetic data shape: {final_synthetic.shape}")

In [None]:
# Inverse transform to original scale
print("\nInverse transforming to original scale...")

# Reshape for inverse transform
final_synthetic_flat = final_synthetic.reshape(-1, num_features)
final_synthetic_original = scaler.inverse_transform(final_synthetic_flat)
final_synthetic_original = final_synthetic_original.reshape(final_synthetic.shape)

print(f"Final synthetic data shape (original scale): {final_synthetic_original.shape}")
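
The inverse transform works because reshaping to `(-1, num_features)` stacks timesteps as rows while leaving the feature axis untouched, so a per-feature scaler applies column by column. The sketch below checks that round trip with hand-rolled standardization standing in for the fitted `scaler`. Which scaler class the notebook actually fits is not visible in this section, so standardization is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(loc=100.0, scale=15.0, size=(8, 30, 4))  # (sequences, timesteps, features)
n_features = original.shape[-1]

# Flatten: row index = sequence * seq_len + timestep, columns = features
flat = original.reshape(-1, n_features)
mean = flat.mean(axis=0)
std = flat.std(axis=0)

# Forward transform, stored in the 3-D layout the models consume
scaled = ((flat - mean) / std).reshape(original.shape)

# Inverse transform: flatten, undo the scaling per feature, restore the 3-D shape
restored = (scaled.reshape(-1, n_features) * std + mean).reshape(scaled.shape)

print(flat.shape, np.allclose(restored, original))
```

The same reasoning implies that the scaler must be the one fitted on the training features in the same column order, otherwise each column would be rescaled with another feature's statistics.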

In [None]:
# Save synthetic data to CSV
print("\n" + "="*60)
print("Saving Synthetic Normal Traffic")
print("="*60)

# Flatten sequences for saving (each row = one timestep)
n_samples, seq_len_out, n_features = final_synthetic_original.shape

# Reshaping in C order keeps rows grouped by sequence, then by timestep
synthetic_df = pd.DataFrame(
    final_synthetic_original.reshape(-1, n_features),
    columns=available_features
)
synthetic_df.insert(0, 'sequence_id', np.repeat(np.arange(n_samples), seq_len_out))
synthetic_df.insert(1, 'timestep', np.tile(np.arange(seq_len_out), n_samples))
synthetic_df.insert(2, 'label', 'normal')
synthetic_output_path = os.path.join(CONFIG['output_dir'], 'synthetic_normal_traffic.csv')
synthetic_df.to_csv(synthetic_output_path, index=False)

print(f"Saved synthetic normal traffic to: {synthetic_output_path}")
print(f"Total records: {len(synthetic_df)}")
print(f"Unique sequences: {synthetic_df['sequence_id'].nunique()}")
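
Downstream code that consumes the saved long-format table usually needs the `(sequences, timesteps, features)` layout back. The sketch below shows one way to rebuild it from the `sequence_id` and `timestep` columns, assuming every sequence has the same length. `long_to_sequences` is a hypothetical helper, and it operates on plain arrays so that it does not depend on how the CSV is loaded.

```python
import numpy as np

def long_to_sequences(seq_ids, timesteps, values):
    """Rebuild a (n_sequences, seq_len, n_features) array from long-format rows."""
    seq_ids = np.asarray(seq_ids)
    timesteps = np.asarray(timesteps)
    values = np.asarray(values)
    # lexsort uses the last key as primary: sort by sequence, then by timestep
    order = np.lexsort((timesteps, seq_ids))
    n_seq = np.unique(seq_ids).size
    seq_len = values.shape[0] // n_seq
    if n_seq * seq_len != values.shape[0]:
        raise ValueError("sequences do not all have the same length")
    return values[order].reshape(n_seq, seq_len, values.shape[1])

# Build a small cube, flatten it to long format, shuffle the rows, and rebuild it
cube = np.arange(3 * 4 * 2, dtype=float).reshape(3, 4, 2)
ids = np.repeat(np.arange(3), 4)
steps = np.tile(np.arange(4), 3)
rows = cube.reshape(-1, 2)

perm = np.random.default_rng(1).permutation(rows.shape[0])
rebuilt = long_to_sequences(ids[perm], steps[perm], rows[perm])
print(np.array_equal(rebuilt, cube))
```

Sorting before reshaping makes the reconstruction independent of row order, which helps if the CSV is filtered or concatenated with other files later.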

In [None]:
# Save model checkpoints
print("\n" + "="*60)
print("Saving Model Checkpoints")
print("="*60)

# Save TimeGAN components
timegan_path = os.path.join(CONFIG['output_dir'], 'timegan_checkpoint.pt')
torch.save({
    'embedder': timegan.embedder.state_dict(),
    'recovery': timegan.recovery.state_dict(),
    'generator': timegan.generator.state_dict(),
    'supervisor': timegan.supervisor.state_dict(),
    'discriminator': timegan.discriminator.state_dict(),
}, timegan_path)
print(f"Saved TimeGAN checkpoint to: {timegan_path}")

# Save LSTM-GAN
lstmgan_path = os.path.join(CONFIG['output_dir'], 'lstmgan_checkpoint.pt')
torch.save({
    'generator': lstmgan.generator.state_dict(),
    'discriminator': lstmgan.discriminator.state_dict(),
}, lstmgan_path)
print(f"Saved LSTM-GAN checkpoint to: {lstmgan_path}")

# Save DoppelGANger
doppelganger_path = os.path.join(CONFIG['output_dir'], 'doppelganger_checkpoint.pt')
torch.save({
    'feature_gen': doppelganger.feature_gen.state_dict(),
    'attr_gen': doppelganger.attr_gen.state_dict(),
    'discriminator': doppelganger.discriminator.state_dict(),
}, doppelganger_path)
print(f"Saved DoppelGANger checkpoint to: {doppelganger_path}")

In [None]:
# Save scaler for future use
import pickle

scaler_path = os.path.join(CONFIG['output_dir'], 'scaler.pkl')
with open(scaler_path, 'wb') as f:
    pickle.dump(scaler, f)
print(f"Saved scaler to: {scaler_path}")

In [None]:
# Save configuration and results
results_summary = {
    'config': CONFIG,
    'best_model': best_model_name,
    'metrics': {
        'TimeGAN': {
            'overall_score': timegan_metrics['overall_score'],
            'ks_mean': timegan_metrics['distribution']['ks_mean'],
            'acf_error': timegan_metrics['temporal']['acf_error']
        },
        'LSTM-GAN': {
            'overall_score': lstmgan_metrics['overall_score'],
            'ks_mean': lstmgan_metrics['distribution']['ks_mean'],
            'acf_error': lstmgan_metrics['temporal']['acf_error']
        },
        'DoppelGANger': {
            'overall_score': doppelganger_metrics['overall_score'],
            'ks_mean': doppelganger_metrics['distribution']['ks_mean'],
            'acf_error': doppelganger_metrics['temporal']['acf_error']
        }
    },
    'data_info': {
        'original_samples': len(df),
        'normal_samples': len(df_normal),
        'features': available_features,
        'sequence_length': seq_len,
        'train_sequences': len(X_train),
        'val_sequences': len(X_val),
        'test_sequences': len(X_test),
        'synthetic_sequences': n_final_samples
    },
    'timestamp': datetime.now().isoformat()
}

results_path = os.path.join(CONFIG['output_dir'], 'training_results.json')
with open(results_path, 'w') as f:
    json.dump(results_summary, f, indent=2, default=str)
print(f"Saved results summary to: {results_path}")
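
The `default=str` argument is what lets `json.dump` accept values it cannot serialize natively: any such object is replaced by its `str()` form. The sketch below is a small standard-library illustration with made-up values, not the notebook's actual `CONFIG`.

```python
import json
from datetime import datetime
from pathlib import PurePosixPath

payload = {
    "when": datetime(2025, 1, 1, 12, 0, 0),      # not JSON-serializable by default
    "path": PurePosixPath("results") / "run",    # also not serializable by default
    "score": 0.25,                               # native float, written as a number
}

encoded = json.dumps(payload, default=str)
decoded = json.loads(encoded)
print(decoded)
```

Anything converted this way comes back as a plain string on load, so values that should stay numeric are better converted to built-in `int` or `float` before saving.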

## Part 12: Summary and Next Steps

In [None]:
# Final Summary
print("\n" + "="*80)
print("BGP NORMAL TRAFFIC GAN GENERATION - COMPLETE")
print("="*80)

print(f"""
SUMMARY
-------
1. Data Processing:
   - Original samples: {len(df)}
   - Normal traffic samples: {len(df_normal)}
   - Features used: {len(available_features)}
   - Sequence length: {seq_len} timesteps

2. Train/Validation/Test Split:
   - Training sequences: {len(X_train)}
   - Validation sequences: {len(X_val)}
   - Test sequences: {len(X_test)}

3. Models Trained:
   - TimeGAN (Overall Score: {timegan_metrics['overall_score']:.4f})
   - LSTM-GAN (Overall Score: {lstmgan_metrics['overall_score']:.4f})
   - DoppelGANger (Overall Score: {doppelganger_metrics['overall_score']:.4f})

4. Best Model: {best_model_name}

5. Generated Synthetic Normal Traffic:
   - Sequences: {n_final_samples}
   - Total records: {n_final_samples * seq_len}

OUTPUT FILES (in {CONFIG['output_dir']}):
------------------------------------------
- synthetic_normal_traffic.csv : Generated synthetic data
- timegan_checkpoint.pt        : TimeGAN model weights
- lstmgan_checkpoint.pt        : LSTM-GAN model weights
- doppelganger_checkpoint.pt   : DoppelGANger model weights
- scaler.pkl                   : StandardScaler for transformation
- training_results.json        : Training configuration and metrics
- model_comparison_summary.csv : Model comparison table
- Various .png visualization files

NEXT STEPS:
-----------
1. Use synthetic normal traffic for:
   - Training anomaly detectors
   - Data augmentation
   - Normal-only baseline comparisons

2. Extend to anomaly phase:
   - Add conditional labels (attack types)
   - Train conditional GAN (TimeGAN-c or LSTM-cGAN)
   - Generate synthetic attack traffic
""")

In [None]:
# List all output files
print("\nGenerated files:")
for f in os.listdir(CONFIG['output_dir']):
    filepath = os.path.join(CONFIG['output_dir'], f)
    size = os.path.getsize(filepath) / 1024  # KB
    print(f"  {f}: {size:.1f} KB")