# BGP Hybrid SMOTE-GAN Synthetic Data Generation

## Approach: Best of Both Worlds

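This notebook implements a hybrid approach that assigns each feature to the generator best suited to it:
- **SMOTE-KMeans**: Fast, excellent correlation preservation (89.5%)
- **Empirical KDE**: Heavy-tailed features that GANs struggle with
- **Conditional Sampling**: Domain-constrained feature generation
- **Mixture Models**: Sparse event-driven features

**DoppelGANger** (complex temporal patterns, novel generation) is not run in this notebook. It serves as the baseline for comparison and as an optional first stage (see Section 11).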

### Rationale
- SMOTE-KMeans scored 34.0/100 overall, with the best correlation preservation (89.5%)
- DoppelGANger scored 34.9/100 but is slower to train
- Both struggle with the same features: `unique_as_path_max`, `edit_distance_*`, `flaps`
- The hybrid approach targets these weaknesses with specialized generators

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from scipy import stats
from scipy.stats import ks_2samp, gaussian_kde
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

## 1. Configuration

In [None]:
# =============================================================================
# FEATURE ASSIGNMENT STRATEGY
# =============================================================================

FEATURE_ASSIGNMENT = {
    # SMOTE-KMeans: Features whose mutual correlations must be preserved
    # These are "static" features that don't need complex temporal modeling
    'smote_kmeans': [
        'announcements',        # Base volume - SMOTE handles well
        'withdrawals',          # Base volume
        'nlri_ann',             # Correlated with announcements
        'dups',                 # Simple count
        'origin_0',             # Categorical-like
        'origin_2',             # Categorical-like
        'origin_changes',       # Low complexity
        'as_path_max',          # SMOTE preserves distribution
        'imp_wd_spath',         # Correlated feature
        'imp_wd_dpath',         # Correlated feature
    ],
    
    # Empirical Sampling: Heavy-tailed features - sample from KDE
    'empirical_kde': [
        'unique_as_path_max',   # Worst feature - use empirical
        'edit_distance_max',    # Heavy-tailed
        'edit_distance_avg',    # Continuous
        'rare_ases_avg',        # Heavy-tailed, Zipf-like
    ],
    
    # Conditional Generation: Derived from other features
    'conditional': [
        'edit_distance_dict_0', 
        'edit_distance_dict_1',
        'edit_distance_dict_2', 
        'edit_distance_dict_3',
        'edit_distance_dict_4',
        'edit_distance_dict_5',
        'edit_distance_dict_6',
        'edit_distance_unique_dict_0',
        'edit_distance_unique_dict_1',
    ],
    
    # Zero-Inflated Mixture: Sparse event-driven features
    'mixture_model': [
        'flaps',                # Sparse, event-driven
        'nadas',                # Sparse, event-driven  
        'imp_wd',               # Can be sparse
        'number_rare_ases',     # Integer count, sparse
    ]
}

# All 27 features
ALL_FEATURES = [
    'announcements', 'withdrawals', 'nlri_ann', 'dups',
    'origin_0', 'origin_2', 'origin_changes',
    'imp_wd', 'imp_wd_spath', 'imp_wd_dpath',
    'as_path_max', 'unique_as_path_max',
    'edit_distance_avg', 'edit_distance_max',
    'edit_distance_dict_0', 'edit_distance_dict_1', 'edit_distance_dict_2',
    'edit_distance_dict_3', 'edit_distance_dict_4', 'edit_distance_dict_5',
    'edit_distance_dict_6',
    'edit_distance_unique_dict_0', 'edit_distance_unique_dict_1',
    'number_rare_ases', 'rare_ases_avg',
    'nadas', 'flaps'
]

# Integer features (must be rounded)
INTEGER_FEATURES = [
    'announcements', 'withdrawals', 'nlri_ann', 'dups',
    'origin_0', 'origin_2', 'origin_changes',
    'imp_wd', 'imp_wd_spath', 'imp_wd_dpath',
    'as_path_max', 'unique_as_path_max',
    'edit_distance_max',
    'edit_distance_dict_0', 'edit_distance_dict_1', 'edit_distance_dict_2',
    'edit_distance_dict_3', 'edit_distance_dict_4', 'edit_distance_dict_5',
    'edit_distance_dict_6',
    'edit_distance_unique_dict_0', 'edit_distance_unique_dict_1',
    'number_rare_ases', 'nadas', 'flaps'
]

# Generation settings
N_SYNTHETIC = 20000  # Number of samples to generate
N_CLUSTERS = 15      # For KMeans-SMOTE
RANDOM_STATE = 42

print("Configuration loaded!")
print(f"\nFeature assignment:")
for strategy, features in FEATURE_ASSIGNMENT.items():
    print(f"  {strategy}: {len(features)} features")
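
Because every later step looks features up by strategy, it helps to confirm that each feature is assigned to exactly one generator. The sketch below is a minimal check on toy inputs; `check_assignment` is a hypothetical helper, not part of this notebook.

```python
from collections import Counter

def check_assignment(assignment, all_features):
    """Return (missing, duplicated, unknown) feature lists for a strategy mapping."""
    counts = Counter(f for feats in assignment.values() for f in feats)
    missing = [f for f in all_features if f not in counts]
    duplicated = sorted(f for f, n in counts.items() if n > 1)
    unknown = sorted(f for f in counts if f not in all_features)
    return missing, duplicated, unknown

# Toy inputs for illustration only (not the notebook's real configuration)
toy_assignment = {
    'smote_kmeans': ['announcements', 'withdrawals'],
    'mixture_model': ['flaps', 'withdrawals'],   # 'withdrawals' is assigned twice
}
toy_features = ['announcements', 'withdrawals', 'flaps', 'nadas']

missing, duplicated, unknown = check_assignment(toy_assignment, toy_features)
print("missing:", missing)
print("duplicated:", duplicated)
print("unknown:", unknown)
```

With the configuration above, `check_assignment(FEATURE_ASSIGNMENT, ALL_FEATURES)` is expected to return three empty lists.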

## 2. Load Real Data

In [None]:
# Load your real BGP data
# Adjust path as needed
DATA_PATH = '../data/likely_normal_traffic.csv'  # or your data path

try:
    df_real = pd.read_csv(DATA_PATH)
    print(f"Loaded {len(df_real)} samples")
except FileNotFoundError:
    print(f"File not found at {DATA_PATH}")
    print("Please update DATA_PATH to point to your real BGP data")
    # Create dummy data for demonstration
    print("\nCreating dummy data for demonstration...")
    np.random.seed(42)
    df_real = pd.DataFrame({
        col: np.random.exponential(scale=10, size=10000) 
        for col in ALL_FEATURES
    })

# Filter to only the features we need
available_features = [f for f in ALL_FEATURES if f in df_real.columns]
X_real = df_real[available_features].copy()

print(f"\nUsing {len(available_features)} features")
print(f"Real data shape: {X_real.shape}")

## 3. Generator Functions

In [None]:
# =============================================================================
# SMOTE-KMeans Generator (for correlated features)
# =============================================================================

def generate_smote_kmeans(X, features, n_samples, n_clusters=15, random_state=42):
    """
    Generate samples using KMeans clustering + SMOTE.
    Best for features requiring correlation preservation.
    
    Parameters:
    -----------
    X : DataFrame - Real data
    features : list - Features to generate
    n_samples : int - Number of samples to generate
    n_clusters : int - Number of KMeans clusters
    
    Returns:
    --------
    DataFrame with generated features
    """
    np.random.seed(random_state)
    
    # Filter to available features
    available = [f for f in features if f in X.columns]
    if not available:
        return pd.DataFrame()
    
    X_subset = X[available].values
    
    # Apply log1p transform for stability
    X_log = np.log1p(np.clip(X_subset, 0, None))
    
    # Scale for clustering
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_log)
    
    # KMeans clustering
    n_clusters_actual = max(1, min(n_clusters, len(X_subset) // 10))  # at least one cluster
    kmeans = KMeans(n_clusters=n_clusters_actual, random_state=random_state, n_init=10)
    cluster_labels = kmeans.fit_predict(X_scaled)
    
    # Allocate samples per cluster proportionally
    cluster_sizes = np.bincount(cluster_labels)
    samples_per_cluster = (cluster_sizes / cluster_sizes.sum() * n_samples).astype(int)
    samples_per_cluster = np.maximum(samples_per_cluster, 1)
    
    synthetic_all = []
    
    for cluster_id in range(n_clusters_actual):
        cluster_mask = cluster_labels == cluster_id
        X_cluster = X_log[cluster_mask]
        
        if len(X_cluster) < 3:
            continue
        
        n_to_generate = samples_per_cluster[cluster_id]
        minority_size = max(2, int(len(X_cluster) * 0.1))
        safe_k = min(3, minority_size - 1)
        
        if safe_k < 1:
            continue
        
        # Create artificial minority class for SMOTE
        minority_idx = np.random.choice(len(X_cluster), minority_size, replace=False)
        y_cluster = np.zeros(len(X_cluster))
        y_cluster[minority_idx] = 1
        
        try:
            smote = SMOTE(
                sampling_strategy={1: n_to_generate + minority_size},
                k_neighbors=safe_k,
                random_state=random_state
            )
            X_res, y_res = smote.fit_resample(X_cluster, y_cluster)
            synthetic = X_res[y_res == 1][minority_size:]
            synthetic_all.append(synthetic)
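        except Exception as e:
            # Report the failure instead of silently dropping the cluster
            print(f"    SMOTE failed for cluster {cluster_id}: {e}")
            continue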
    
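    if synthetic_all:
        result = np.vstack(synthetic_all)
        # Inverse log1p transform
        result = np.clip(np.expm1(result), 0, None)
        # Integer truncation and skipped clusters can leave fewer than n_samples
        # rows, which would misalign this block with the other generators.
        # Draw exactly n_samples rows (with replacement only if short); this
        # also shuffles away the cluster ordering.
        short = len(result) < n_samples
        idx = np.random.choice(len(result), n_samples, replace=short)
        return pd.DataFrame(result[idx], columns=available)
    else:
        return pd.DataFrame(columns=available)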

print("SMOTE-KMeans generator defined!")
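
The proportional allocation in `generate_smote_kmeans` truncates each cluster's share to an integer, so the per-cluster counts need not add up to `n_samples`. The sketch below shows one possible alternative, a largest-remainder split that sums exactly. `allocate_samples` is a hypothetical helper, and the cluster sizes are made up.

```python
import numpy as np

def allocate_samples(cluster_sizes, n_samples):
    """Split n_samples across clusters proportionally, summing exactly to n_samples."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    exact = sizes / sizes.sum() * n_samples
    counts = np.floor(exact).astype(int)
    # Hand the leftover samples to the clusters with the largest fractional parts
    leftover = int(n_samples - counts.sum())
    order = np.argsort(-(exact - counts), kind='stable')
    counts[order[:leftover]] += 1
    return counts

# Made-up cluster sizes for illustration
counts = allocate_samples([50, 30, 20, 3], n_samples=1000)
print(counts, counts.sum())
```

Unlike the `np.maximum(..., 1)` floor used above, this split can assign zero samples to a very small cluster; which behaviour is preferable depends on how much weight rare clusters should get.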

In [None]:
# =============================================================================
# Empirical KDE Generator (for heavy-tailed features)
# =============================================================================

def generate_empirical_kde(X, features, n_samples, bandwidth_factor=0.5, random_state=42):
    """
    Generate samples using Kernel Density Estimation on real data.
    Best for heavy-tailed features that GANs struggle with.
    
    Bandwidth is Scott's rule scaled by bandwidth_factor, applied in log1p space.
    """
    np.random.seed(random_state)
    
    available = [f for f in features if f in X.columns]
    if not available:
        return pd.DataFrame()
    
    result = {}
    
    for feature in available:
        real_values = X[feature].values
        real_values = real_values[~np.isnan(real_values)]
        
        if len(real_values) < 10:
            result[feature] = np.zeros(n_samples)
            continue
        
        # Log transform for heavy-tailed
        log_values = np.log1p(np.clip(real_values, 0, None))
        
        try:
            # Fit KDE with Scott's rule bandwidth * factor
            kde = gaussian_kde(log_values, bw_method='scott')
            kde.set_bandwidth(kde.factor * bandwidth_factor)
            
            # Sample from KDE
            synthetic_log = kde.resample(n_samples).flatten()
            
            # Inverse transform
            synthetic = np.expm1(synthetic_log)
            synthetic = np.clip(synthetic, 0, np.percentile(real_values, 99.9))
            
            result[feature] = synthetic
            
        except Exception as e:
            # Fallback to bootstrap sampling
            result[feature] = np.random.choice(real_values, n_samples, replace=True)
    
    return pd.DataFrame(result)

print("Empirical KDE generator defined!")
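
To see what `bandwidth_factor` does, the sketch below fits the same log-space KDE to a toy heavy-tailed sample and reports the KS statistic for several factors. The lognormal data, the helper name `sample_log_kde`, and the chosen factors are assumptions for illustration, not values taken from the BGP data.

```python
import numpy as np
from scipy.stats import gaussian_kde, ks_2samp

rng = np.random.default_rng(0)
# Toy heavy-tailed "feature" (lognormal); stands in for a real BGP column
real = rng.lognormal(mean=1.0, sigma=1.2, size=5000)
log_real = np.log1p(real)

def sample_log_kde(log_values, n, bandwidth_factor, seed):
    """Sample from a Gaussian KDE fitted in log1p space, then map back."""
    kde = gaussian_kde(log_values, bw_method='scott')
    kde.set_bandwidth(kde.factor * bandwidth_factor)   # shrink or widen Scott's bandwidth
    draws = kde.resample(n, seed=seed).flatten()
    return np.clip(np.expm1(draws), 0, None)

ks_by_factor = {}
for factor in (0.25, 0.5, 1.0, 2.0):
    synthetic = sample_log_kde(log_real, 5000, factor, seed=1)
    ks_by_factor[factor] = ks_2samp(real, synthetic).statistic
    print(f"bandwidth_factor={factor}: KS={ks_by_factor[factor]:.4f}")
```

Smaller factors keep samples closer to the observed values, and larger ones smooth more aggressively. The KS statistic alone does not show how close the samples are to copies of the real data, so it is worth inspecting the histograms as well.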

In [None]:
# =============================================================================
# Zero-Inflated Mixture Generator (for sparse features)
# =============================================================================

def generate_mixture_model(X, features, n_samples, random_state=42):
    """
    Generate sparse features using zero-inflated mixture model.
    
    Model: P(x) = p_zero * I(x=0) + (1-p_zero) * f(x|x>0)
    
    Best for features like 'flaps', 'nadas' that are often zero.
    """
    np.random.seed(random_state)
    
    available = [f for f in features if f in X.columns]
    if not available:
        return pd.DataFrame()
    
    result = {}
    
    for feature in available:
        real_values = X[feature].values
        real_values = real_values[~np.isnan(real_values)]
        
        # Calculate zero probability
        p_zero = (real_values == 0).mean()
        
        # Get non-zero values
        non_zero = real_values[real_values > 0]
        
        synthetic = np.zeros(n_samples)
        
        # Determine which samples are non-zero
        non_zero_mask = np.random.random(n_samples) > p_zero
        n_non_zero = non_zero_mask.sum()
        
        if n_non_zero > 0 and len(non_zero) > 0:
            # Sample from non-zero distribution with small noise
            sampled = np.random.choice(non_zero, n_non_zero, replace=True)
            # Add small perturbation
            noise = np.random.normal(0, non_zero.std() * 0.1, n_non_zero)
            synthetic[non_zero_mask] = np.maximum(1, sampled + noise)
        
        result[feature] = synthetic
    
    return pd.DataFrame(result)

print("Mixture model generator defined!")
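
The zero-inflated model can be checked in isolation on toy data. The sketch below keeps only the two-part structure (a zero with probability `p_zero`, otherwise a bootstrap draw from the non-zero values) and omits the Gaussian perturbation used in `generate_mixture_model`. The toy distribution and the helper name `sample_zero_inflated` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sparse count feature: about 80% zeros, otherwise a positive count
real = np.where(rng.random(10_000) < 0.8, 0, rng.poisson(3, 10_000) + 1)

def sample_zero_inflated(values, n, rng):
    """Draw n values: zero with probability p_zero, else a bootstrap of the non-zeros."""
    p_zero = (values == 0).mean()
    non_zero = values[values > 0]
    out = np.zeros(n, dtype=values.dtype)
    is_non_zero = rng.random(n) >= p_zero
    if non_zero.size and is_non_zero.any():
        out[is_non_zero] = rng.choice(non_zero, size=int(is_non_zero.sum()), replace=True)
    return out

synthetic = sample_zero_inflated(real, 10_000, rng)
print(f"zero fraction  real={np.mean(real == 0):.3f}  synthetic={np.mean(synthetic == 0):.3f}")
```

Because the non-zero part is a pure bootstrap here, the synthetic values are always values seen in the real data; the noise term in the notebook's version trades that exactness for some novelty.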

In [None]:
# =============================================================================
# Conditional Generator (for edit_distance_dict features)
# =============================================================================

def generate_conditional(X, features, n_samples, synthetic_base, random_state=42):
    """
    Generate edit_distance_dict features conditioned on:
    - announcements (volume)
    - edit_distance_max (upper bound)
    
    The dict values should follow a distribution where:
    - Lower edit distances (0-2) are more common
    - Higher edit distances (3-6) are rare
    - Sum relates to announcement volume
    """
    np.random.seed(random_state)
    
    available = [f for f in features if f in X.columns]
    if not available:
        return pd.DataFrame()
    
    # Learn the conditional distribution from real data
    # P(edit_distance_dict_i | announcements, edit_distance_max)
    
    result = {}
    
    # Get conditioning variables from synthetic_base if available
    if 'announcements' in synthetic_base.columns:
        syn_announcements = synthetic_base['announcements'].values
    else:
        syn_announcements = np.random.choice(X['announcements'].values, n_samples)
    
    if 'edit_distance_max' in synthetic_base.columns:
        syn_ed_max = synthetic_base['edit_distance_max'].values.astype(int)
    else:
        syn_ed_max = np.random.choice(X['edit_distance_max'].values, n_samples).astype(int)
    
    # Learn typical scale from real data
    ed_dict_cols = [f'edit_distance_dict_{i}' for i in range(7) if f'edit_distance_dict_{i}' in X.columns]
    
    if ed_dict_cols:
        # Scale by announcements (more announcements = more edit events).
        # The factor does not depend on the column, so compute it once.
        scale_factor = np.log1p(syn_announcements) / np.log1p(X['announcements'].mean())
        
        for col in ed_dict_cols:
            # Read the bucket index from the column name so it stays correct
            # even when some edit_distance_dict_* columns are missing
            bucket = int(col.rsplit('_', 1)[1])
            base_value = X[col].mean() * scale_factor
            
            # Add noise
            noise = np.random.normal(0, X[col].std() * 0.3, n_samples)
            synthetic_col = np.maximum(0, base_value + noise)
            
            # Zero out buckets beyond edit_distance_max
            synthetic_col[syn_ed_max < bucket] = 0
            
            result[col] = synthetic_col
    
    # Handle edit_distance_unique_dict features similarly
    ed_unique_cols = [f'edit_distance_unique_dict_{i}' for i in range(2) 
                     if f'edit_distance_unique_dict_{i}' in X.columns]
    
    for col in ed_unique_cols:
        real_values = X[col].values
        scale_factor = np.log1p(syn_announcements) / np.log1p(X['announcements'].mean())
        base_value = real_values.mean() * scale_factor
        noise = np.random.normal(0, real_values.std() * 0.3, n_samples)
        result[col] = np.maximum(0, base_value + noise)
    
    return pd.DataFrame(result)

print("Conditional generator defined!")
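
The rule "zero out buckets beyond `edit_distance_max`" can be sketched on its own. The example assumes the bucket index is encoded at the end of each column name; `mask_beyond_max` and the toy frame are hypothetical.

```python
import pandas as pd

def mask_beyond_max(df, max_col='edit_distance_max', prefix='edit_distance_dict_'):
    """Zero every bucket whose index exceeds the row's maximum edit distance."""
    out = df.copy()
    ed_max = out[max_col].astype(int)
    for col in out.columns:
        if col.startswith(prefix):
            bucket = int(col[len(prefix):])   # index comes from the name, not the position
            out.loc[ed_max < bucket, col] = 0
    return out

# Toy frame: bucket 0 and bucket 2 are deliberately absent
toy = pd.DataFrame({
    'edit_distance_max':    [0, 2, 3],
    'edit_distance_dict_1': [5, 4, 1],
    'edit_distance_dict_3': [2, 2, 2],
})
masked = mask_beyond_max(toy)
print(masked)
```

Taking the index from the column name rather than from the column's position keeps the mask correct when the input lacks some buckets.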

## 4. Correlation Alignment

In [None]:
# =============================================================================
# Correlation Alignment via Cholesky Decomposition
# =============================================================================

def align_correlations(synthetic, real, features_to_align=None):
    """
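    Adjust synthetic data to match real correlation structure.
    
    Uses Cholesky decomposition to impose correlation structure:
    1. Standardize and decorrelate synthetic data
    2. Re-correlate with real data's correlation matrix
    3. Rescale each feature to the real data's mean and standard deviation
    
    Note: the transform is linear, so it changes each feature's marginal
    distribution (exact zeros and integer values are not preserved).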
    
    Parameters:
    -----------
    synthetic : DataFrame - Generated data
    real : DataFrame - Real data
    features_to_align : list - Features to align (default: all common)
    
    Returns:
    --------
    DataFrame with aligned correlations
    """
    if features_to_align is None:
        features_to_align = [c for c in synthetic.columns if c in real.columns]
    
    if len(features_to_align) < 2:
        return synthetic
    
    try:
        # Get correlation matrices
        real_subset = real[features_to_align]
        syn_subset = synthetic[features_to_align].copy()
        
        real_corr = real_subset.corr().values
        syn_corr = syn_subset.corr().values
        
        # Add small regularization for numerical stability
        eps = 1e-6
        real_corr = real_corr + eps * np.eye(len(features_to_align))
        syn_corr = syn_corr + eps * np.eye(len(features_to_align))
        
        # Cholesky decomposition
        L_real = np.linalg.cholesky(real_corr)
        L_syn = np.linalg.cholesky(syn_corr)
        
        # Standardize synthetic data
        syn_mean = syn_subset.mean().values
        syn_std = syn_subset.std().values
        syn_std[syn_std == 0] = 1  # Avoid division by zero
        
        syn_standardized = (syn_subset.values - syn_mean) / syn_std
        
        # Decorrelate then re-correlate
        syn_decorr = syn_standardized @ np.linalg.inv(L_syn.T)
        syn_recorr = syn_decorr @ L_real.T
        
        # Rescale to match real data's scale
        real_mean = real_subset.mean().values
        real_std = real_subset.std().values
        
        aligned = syn_recorr * real_std + real_mean
        
        # Update synthetic DataFrame
        result = synthetic.copy()
        for i, col in enumerate(features_to_align):
            result[col] = aligned[:, i]
        
        return result
        
    except Exception as e:
        print(f"Correlation alignment failed: {e}")
        return synthetic

print("Correlation alignment function defined!")
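
The decorrelate and re-correlate steps can be verified on a two-feature toy example. The sketch below uses made-up Gaussian data and a made-up target correlation of 0.8; it mirrors the matrix algebra in `align_correlations` but is not a drop-in replacement for it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target ("real") correlation and a differently correlated "synthetic" sample
target_corr = np.array([[1.0, 0.8],
                        [0.8, 1.0]])
syn = rng.multivariate_normal([0, 0], [[1.0, -0.3], [-0.3, 1.0]], size=20_000)

# Standardize, whiten with the synthetic Cholesky factor, re-colour with the target one
z = (syn - syn.mean(axis=0)) / syn.std(axis=0)
L_syn = np.linalg.cholesky(np.corrcoef(z, rowvar=False))
L_target = np.linalg.cholesky(target_corr)
white = np.linalg.solve(L_syn, z.T).T      # same result as z @ inv(L_syn.T)
aligned = white @ L_target.T

print(np.round(np.corrcoef(white, rowvar=False), 3))    # close to the identity
print(np.round(np.corrcoef(aligned, rowvar=False), 3))  # close to target_corr
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical preference. The mapping mixes columns linearly, which is why sparse or integer-valued features lose their exact shape after alignment.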

## 5. Post-Processing & Constraints

In [None]:
# =============================================================================
# Post-Processing: Enforce BGP Domain Constraints
# =============================================================================

def enforce_bgp_constraints(synthetic, real):
    """
    Apply BGP domain-specific constraints to synthetic data.
    
    Constraints:
    1. All features non-negative
    2. Integer features are integers
    3. origin_0 + origin_2 <= announcements
    4. edit_distance_max >= max(dict indices with non-zero values)
    5. Values within realistic bounds (based on real data percentiles)
    """
    result = synthetic.copy()
    
    # 1. Non-negative
    result = result.clip(lower=0)
    
    # 5. Realistic bounds: cap at 110% of the real 99.5th percentile.
    #    Done before rounding and the relational constraints (3, 4) so that
    #    those are enforced on the final values and integers stay integers.
    for col in result.columns:
        if col in real.columns:
            upper_bound = np.nanpercentile(real[col], 99.5) * 1.1
            result[col] = result[col].clip(upper=upper_bound)
    
    # 2. Integer features
    for col in INTEGER_FEATURES:
        if col in result.columns:
            result[col] = np.round(result[col]).astype(int)
    
    # 3. Origin constraint
    if all(c in result.columns for c in ['origin_0', 'origin_2', 'announcements']):
        origin_sum = result['origin_0'] + result['origin_2']
        excess = origin_sum > result['announcements']
        if excess.any():
            scale = result.loc[excess, 'announcements'] / origin_sum[excess]
            result.loc[excess, 'origin_0'] = (result.loc[excess, 'origin_0'] * scale).astype(int)
            result.loc[excess, 'origin_2'] = (result.loc[excess, 'origin_2'] * scale).astype(int)
    
    # 4. Edit distance max constraint (vectorized; bucket index read from the column name)
    ed_dict_cols = [f'edit_distance_dict_{i}' for i in range(7) if f'edit_distance_dict_{i}' in result.columns]
    if 'edit_distance_max' in result.columns and ed_dict_cols:
        ed_max = result['edit_distance_max'].astype(int)
        for col in ed_dict_cols:
            bucket = int(col.rsplit('_', 1)[1])
            result.loc[ed_max < bucket, col] = 0
    
    return result

print("BGP constraints function defined!")
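
After enforcing constraints it is useful to count how many rows still violate them. The sketch below covers constraints 1, 3 and 4 from the docstring; `count_violations` and the toy frame are hypothetical and only illustrate the checks.

```python
import pandas as pd

def count_violations(df):
    """Count rows breaking each constraint; checks whose columns are absent are skipped."""
    report = {'negative_values': int((df < 0).any(axis=1).sum())}
    if {'origin_0', 'origin_2', 'announcements'} <= set(df.columns):
        report['origin_sum_exceeds_announcements'] = int(
            ((df['origin_0'] + df['origin_2']) > df['announcements']).sum())
    if 'edit_distance_max' in df.columns:
        prefix = 'edit_distance_dict_'
        bad = pd.Series(False, index=df.index)
        for col in df.columns:
            if col.startswith(prefix):
                bucket = int(col[len(prefix):])
                # A non-zero bucket above the row's maximum edit distance is inconsistent
                bad |= (df[col] > 0) & (df['edit_distance_max'] < bucket)
        report['bucket_beyond_edit_distance_max'] = int(bad.sum())
    return report

toy = pd.DataFrame({
    'announcements':        [10, 4, 7],
    'origin_0':             [6, 3, 2],
    'origin_2':             [4, 3, 1],    # row 1: 3 + 3 > 4
    'edit_distance_max':    [1, 2, 0],
    'edit_distance_dict_0': [9, 2, 7],
    'edit_distance_dict_2': [1, 2, 0],    # row 0: bucket 2 is used but the max is 1
})
report = count_violations(toy)
print(report)
```

Running the same check on the output of `enforce_bgp_constraints` should report zero for every entry; a non-zero count points at an ordering problem between the constraint steps.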

## 6. Main Hybrid Generation Pipeline

In [None]:
# =============================================================================
# HYBRID GENERATION PIPELINE
# =============================================================================

def generate_hybrid(X_real, n_samples, feature_assignment, random_state=42):
    """
    Main hybrid generation pipeline combining multiple strategies.
    
    Pipeline:
    1. SMOTE-KMeans for correlated features
    2. Empirical KDE for heavy-tailed features
    3. Mixture model for sparse features
    4. Conditional generation for dependent features
    5. Correlation alignment
    6. BGP constraint enforcement
    """
    print("="*70)
    print("HYBRID SMOTE-GAN GENERATION PIPELINE")
    print("="*70)
    
    synthetic_parts = {}
    
    # Step 1: SMOTE-KMeans for base features
    print("\n[1/6] Generating SMOTE-KMeans features...")
    smote_features = feature_assignment.get('smote_kmeans', [])
    if smote_features:
        synthetic_parts['smote'] = generate_smote_kmeans(
            X_real, smote_features, n_samples, 
            n_clusters=N_CLUSTERS, random_state=random_state
        )
        print(f"    Generated {len(synthetic_parts['smote'].columns)} features via SMOTE-KMeans")
    
    # Step 2: Empirical KDE for heavy-tailed features
    print("\n[2/6] Generating Empirical KDE features...")
    kde_features = feature_assignment.get('empirical_kde', [])
    if kde_features:
        synthetic_parts['kde'] = generate_empirical_kde(
            X_real, kde_features, n_samples, random_state=random_state
        )
        print(f"    Generated {len(synthetic_parts['kde'].columns)} features via KDE")
    
    # Step 3: Mixture model for sparse features
    print("\n[3/6] Generating Mixture Model features...")
    mixture_features = feature_assignment.get('mixture_model', [])
    if mixture_features:
        synthetic_parts['mixture'] = generate_mixture_model(
            X_real, mixture_features, n_samples, random_state=random_state
        )
        print(f"    Generated {len(synthetic_parts['mixture'].columns)} features via Mixture")
    
    # Combine parts so far for conditioning
    synthetic_base = pd.concat(synthetic_parts.values(), axis=1)
    
    # Step 4: Conditional generation
    print("\n[4/6] Generating Conditional features...")
    conditional_features = feature_assignment.get('conditional', [])
    if conditional_features:
        synthetic_parts['conditional'] = generate_conditional(
            X_real, conditional_features, n_samples, 
            synthetic_base, random_state=random_state
        )
        print(f"    Generated {len(synthetic_parts['conditional'].columns)} features conditionally")
    
    # Combine all parts
    print("\n[5/6] Combining and aligning correlations...")
    synthetic_combined = pd.concat(synthetic_parts.values(), axis=1)
    
    # Ensure we have all expected columns
    for col in X_real.columns:
        if col not in synthetic_combined.columns:
            # Fallback: bootstrap sample from real data
            synthetic_combined[col] = np.random.choice(
                X_real[col].values, n_samples, replace=True
            )
    
    # Reorder columns to match real data
    synthetic_combined = synthetic_combined[[c for c in X_real.columns if c in synthetic_combined.columns]]
    
    # Correlation alignment
    # Focus on the most important correlated feature groups
    corr_groups = [
        ['announcements', 'withdrawals', 'nlri_ann', 'origin_0', 'origin_2'],
        ['as_path_max', 'unique_as_path_max', 'edit_distance_max', 'edit_distance_avg'],
        ['imp_wd', 'imp_wd_spath', 'imp_wd_dpath'],
    ]
    
    n_aligned = 0
    for group in corr_groups:
        available_group = [c for c in group if c in synthetic_combined.columns and c in X_real.columns]
        if len(available_group) >= 2:
            synthetic_combined = align_correlations(synthetic_combined, X_real, available_group)
            n_aligned += len(available_group)
    
    print(f"    Attempted correlation alignment for {n_aligned} features")
    
    # Step 6: Enforce BGP constraints
    print("\n[6/6] Enforcing BGP domain constraints...")
    synthetic_final = enforce_bgp_constraints(synthetic_combined, X_real)
    print(f"    Constraints enforced on {len(synthetic_final.columns)} features")
    
    print("\n" + "="*70)
    print(f"GENERATION COMPLETE: {len(synthetic_final)} samples, {len(synthetic_final.columns)} features")
    print("="*70)
    
    return synthetic_final

print("Hybrid generation pipeline defined!")

## 7. Run Generation

In [None]:
# Generate synthetic data using hybrid approach
synthetic_hybrid = generate_hybrid(
    X_real, 
    N_SYNTHETIC, 
    FEATURE_ASSIGNMENT,
    random_state=RANDOM_STATE
)

print(f"\nGenerated shape: {synthetic_hybrid.shape}")
print(f"\nFirst few rows:")
synthetic_hybrid.head()

## 8. Evaluation

In [None]:
# =============================================================================
# EVALUATION FUNCTIONS
# =============================================================================

def evaluate_quality(real, synthetic, feature_importance=None):
    """
    Comprehensive quality evaluation.
    
    Returns dict with:
    - Per-feature KS statistics
    - Cohen's d effect sizes
    - Per-feature Wasserstein distances (range-normalized)
    - Correlation matrix similarity
    - Component scores and overall weighted score
    """
    common_cols = [c for c in real.columns if c in synthetic.columns]
    
    results = {
        'ks_stats': {},
        'cohens_d': {},
        'wasserstein': {}
    }
    
    for col in common_cols:
        real_vals = real[col].dropna().values
        syn_vals = synthetic[col].dropna().values
        
        # KS statistic
        ks_stat, _ = ks_2samp(real_vals, syn_vals)
        results['ks_stats'][col] = ks_stat
        
        # Cohen's d
        pooled_std = np.sqrt((real_vals.std()**2 + syn_vals.std()**2) / 2)
        if pooled_std > 0:
            cohens_d = (real_vals.mean() - syn_vals.mean()) / pooled_std
        else:
            cohens_d = 0
        results['cohens_d'][col] = cohens_d
        
        # Wasserstein distance, with both samples scaled by the real data's range
        # so that differences in location and scale are not normalized away
        lo = real_vals.min()
        span = real_vals.max() - lo + 1e-10
        wasserstein = stats.wasserstein_distance((real_vals - lo) / span, (syn_vals - lo) / span)
        results['wasserstein'][col] = wasserstein
    
    # Correlation similarity
    real_corr = real[common_cols].corr()
    syn_corr = synthetic[common_cols].corr()
    corr_similarity = np.corrcoef(real_corr.values.flatten(), syn_corr.values.flatten())[0, 1]
    results['correlation_similarity'] = corr_similarity
    
    # Summary statistics
    results['mean_ks'] = np.mean(list(results['ks_stats'].values()))
    results['mean_cohens_d'] = np.mean(np.abs(list(results['cohens_d'].values())))
    results['mean_wasserstein'] = np.mean(list(results['wasserstein'].values()))
    
    # Calculate scores (0-100 scale)
    # KS score: 0 means identical, we want lower
    ks_score = max(0, 100 * (1 - results['mean_ks'] * 2))  # KS < 0.5 gives positive score
    
    # Effect size score: Cohen's d < 0.2 is negligible
    effect_score = max(0, 100 * (1 - results['mean_cohens_d'] / 2))
    
    # Correlation score
    corr_score = max(0, results['correlation_similarity'] * 100)
    
    # Wasserstein score
    wass_score = max(0, 100 * (1 - results['mean_wasserstein'] * 2))
    
    # Overall weighted score
    results['component_scores'] = {
        'distribution_ks': ks_score,
        'effect_size': effect_score,
        'correlation': corr_score,
        'wasserstein': wass_score
    }
    
    weights = {'distribution_ks': 0.25, 'effect_size': 0.25, 'correlation': 0.25, 'wasserstein': 0.25}
    results['overall_score'] = sum(
        results['component_scores'][k] * weights[k] 
        for k in weights
    )
    
    return results

print("Evaluation functions defined!")
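
The correlation similarity above is computed on the full flattened matrices, so the diagonal (always 1 in both) contributes to the agreement. The sketch below compares that with a variant restricted to the strict upper triangle. The two 3x3 matrices are invented for illustration and `offdiag_corr_similarity` is a hypothetical helper.

```python
import numpy as np

def offdiag_corr_similarity(corr_a, corr_b):
    """Pearson correlation between the strict upper triangles of two correlation matrices."""
    a, b = np.asarray(corr_a), np.asarray(corr_b)
    iu = np.triu_indices_from(a, k=1)
    return np.corrcoef(a[iu], b[iu])[0, 1]

# Invented matrices whose off-diagonal patterns disagree
real_corr = np.array([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.3],
                      [0.1, 0.3, 1.0]])
syn_corr = np.array([[1.0, 0.2, 0.4],
                     [0.2, 1.0, 0.3],
                     [0.4, 0.3, 1.0]])

full = np.corrcoef(real_corr.flatten(), syn_corr.flatten())[0, 1]
offdiag = offdiag_corr_similarity(real_corr, syn_corr)
print(f"full-matrix similarity:  {full:.3f}")
print(f"off-diagonal similarity: {offdiag:.3f}")
```

In this toy case the full-matrix figure is clearly positive even though the off-diagonal entries are negatively related. Switching metrics would make new scores incomparable with the previously reported 89.5%, so the off-diagonal variant is probably best reported alongside the existing number rather than in place of it.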

In [None]:
# Run evaluation
eval_results = evaluate_quality(X_real, synthetic_hybrid)

print("="*70)
print("HYBRID APPROACH EVALUATION RESULTS")
print("="*70)

print(f"\nOVERALL SCORE: {eval_results['overall_score']:.1f}/100")

print(f"\nComponent Scores:")
for component, score in eval_results['component_scores'].items():
    print(f"  {component}: {score:.1f}")

print(f"\nSummary Statistics:")
print(f"  Mean KS Statistic: {eval_results['mean_ks']:.4f}")
print(f"  Mean |Cohen's d|: {eval_results['mean_cohens_d']:.4f}")
print(f"  Correlation Similarity: {eval_results['correlation_similarity']:.4f}")
print(f"  Mean Wasserstein: {eval_results['mean_wasserstein']:.4f}")

In [None]:
# Worst features analysis
print("\n" + "="*70)
print("TOP-10 WORST FEATURES (by KS statistic)")
print("="*70)

ks_sorted = sorted(eval_results['ks_stats'].items(), key=lambda x: x[1], reverse=True)

for i, (feature, ks) in enumerate(ks_sorted[:10], 1):
    cohens_d = eval_results['cohens_d'][feature]
    wass = eval_results['wasserstein'][feature]
    print(f"{i:2}. {feature:30} KS={ks:.4f}  d={cohens_d:+.3f}  W={wass:.4f}")

In [None]:
# Visualization: Distribution comparison for worst features
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
axes = axes.flatten()

worst_10 = [f for f, _ in ks_sorted[:10]]

for idx, feature in enumerate(worst_10):
    ax = axes[idx]
    
    real_vals = X_real[feature].values
    syn_vals = synthetic_hybrid[feature].values
    
    # Use log scale for heavy-tailed
    if real_vals.max() > 100:
        real_vals = np.log1p(real_vals)
        syn_vals = np.log1p(syn_vals)
        ax.set_xlabel('log1p(value)')
    
    ax.hist(real_vals, bins=50, alpha=0.5, label='Real', density=True)
    ax.hist(syn_vals, bins=50, alpha=0.5, label='Synthetic', density=True)
    ax.set_title(f'{feature}\nKS={eval_results["ks_stats"][feature]:.3f}')
    ax.legend(fontsize=8)

plt.tight_layout()
plt.suptitle('Distribution Comparison: Top-10 Worst Features', y=1.02, fontsize=14)
plt.savefig('hybrid_worst_features.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Correlation matrix comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

common_cols = [c for c in X_real.columns if c in synthetic_hybrid.columns]

real_corr = X_real[common_cols].corr()
syn_corr = synthetic_hybrid[common_cols].corr()
diff_corr = real_corr - syn_corr

sns.heatmap(real_corr, ax=axes[0], cmap='coolwarm', center=0, 
            xticklabels=False, yticklabels=False)
axes[0].set_title('Real Data Correlations')

sns.heatmap(syn_corr, ax=axes[1], cmap='coolwarm', center=0,
            xticklabels=False, yticklabels=False)
axes[1].set_title('Synthetic Data Correlations')

sns.heatmap(diff_corr, ax=axes[2], cmap='RdBu', center=0,
            xticklabels=False, yticklabels=False, vmin=-0.5, vmax=0.5)
axes[2].set_title(f'Difference (Similarity: {eval_results["correlation_similarity"]:.3f})')

plt.tight_layout()
plt.savefig('hybrid_correlation_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

## 9. Save Results

In [None]:
# Save synthetic data
output_path = '../data/synthetic_hybrid_normal.csv'
synthetic_hybrid.to_csv(output_path, index=False)
print(f"Saved synthetic data to {output_path}")

# Save evaluation results
import json

eval_output = {
    'overall_score': eval_results['overall_score'],
    'component_scores': eval_results['component_scores'],
    'mean_ks': eval_results['mean_ks'],
    'mean_cohens_d': eval_results['mean_cohens_d'],
    'correlation_similarity': eval_results['correlation_similarity'],
    'per_feature_ks': eval_results['ks_stats'],
    'generation_config': {
        'n_samples': N_SYNTHETIC,
        'n_clusters': N_CLUSTERS,
        'feature_assignment': {k: v for k, v in FEATURE_ASSIGNMENT.items()}
    }
}

with open('../data/hybrid_evaluation_results.json', 'w') as f:
    json.dump(eval_output, f, indent=2)

print("Evaluation results saved!")

## 10. Comparison Summary

In [None]:
# Compare with previous results
print("="*70)
print("COMPARISON WITH PREVIOUS APPROACHES")
print("="*70)

previous_results = {
    'Scapy (Packet-level)': 19.8,
    'TimeGAN (Default)': 29.8,
    'SMOTE-KMeans': 34.0,
    'DoppelGANger (Enhanced)': 34.9,
    'HYBRID (This approach)': eval_results['overall_score']
}

print(f"\n{'Method':<30} {'Score':>10}")
print("-"*42)
for method, score in sorted(previous_results.items(), key=lambda x: x[1], reverse=True):
    marker = ' <-- NEW' if 'HYBRID' in method else ''
    print(f"{method:<30} {score:>10.1f}{marker}")

best_previous = max(v for k, v in previous_results.items() if 'HYBRID' not in k)
improvement = eval_results['overall_score'] - best_previous
print(f"\nImprovement over best previous: {improvement:+.1f} points")

## 11. Next Steps & Tuning

### If results are not satisfactory, try:

1. **Adjust feature assignment**: Move problematic features between strategies
2. **Tune KDE bandwidth**: Lower `bandwidth_factor` (default 0.5; try 0.3-0.5) for tighter distributions
3. **Increase clusters**: More clusters in SMOTE-KMeans for heterogeneous data
4. **Add DoppelGANger**: For temporal sequences, generate with DoppelGANger first, then post-process

### Alternative: DoppelGANger + Post-Processing Pipeline

```python
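# Sketch only: `doppelganger_generate` is a placeholder for your own trained
# DoppelGANger sampler (it is not defined in this notebook). It is assumed to
# return a DataFrame with the same columns as X_real and n_samples rows.
n_samples = N_SYNTHETIC
worst_features = [f for f, _ in ks_sorted[:5]]  # e.g. worst by KS from Section 8

# 1. Generate base with DoppelGANger
synthetic_gan = doppelganger_generate(X_real, n_samples)

# 2. Replace worst features with KDE samples
for feature in worst_features:
    synthetic_gan[feature] = generate_empirical_kde(X_real, [feature], n_samples)[feature].values

# 3. Re-align correlations, then re-apply the domain constraints
synthetic_final = align_correlations(synthetic_gan, X_real)
synthetic_final = enforce_bgp_constraints(synthetic_final, X_real)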
```