# Privacy-Preserving Health Data Analysis

This consolidated notebook runs the full health-data analysis pipeline with privacy-preserving techniques.

## What this notebook does:
1. **Setup and Drive Mount** - Initial configuration and mounting of Google Drive
2. **Preprocessing** - Processing of the Sleep-EDF and WESAD datasets
3. **Baseline Training** - LSTM models without privacy techniques
4. **Differential Privacy** - Models trained with differential privacy
5. **Federated Learning** - Models trained with federated learning
6. **Final Analysis** - Comparison and visualization of results

**⚠️ IMPORTANT**: This notebook is compatible with Google Colab and Deepnote (with Google Drive integration).

## Project Structure:
- **Sleep-EDF**: Sleep stage classification (5 classes)
- **WESAD**: Stress detection (2 classes)
- **Techniques**: Baseline, DP (ε=0.1, 1.0, 5.0, 10.0), FL (3, 5, 10 clients)
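
The structure above implies a fixed set of training runs. As a rough sketch (the names `build_experiment_grid`, `DATASETS`, `DP_EPSILONS`, and `FL_CLIENT_COUNTS` are illustrative and not part of the project code), the runs can be enumerated like this:

```python
from itertools import product

# Values taken from the project structure listed above.
DATASETS = ["sleep_edf", "wesad"]
DP_EPSILONS = [0.1, 1.0, 5.0, 10.0]
FL_CLIENT_COUNTS = [3, 5, 10]


def build_experiment_grid():
    """Return one dict per training run implied by the project structure."""
    settings = [("baseline", None)]
    settings += [("dp", eps) for eps in DP_EPSILONS]
    settings += [("fl", n) for n in FL_CLIENT_COUNTS]
    return [
        {"dataset": dataset, "technique": technique, "parameter": parameter}
        for dataset, (technique, parameter) in product(DATASETS, settings)
    ]


experiment_grid = build_experiment_grid()
print(f"{len(experiment_grid)} training runs in total")
```

Each dataset gets one baseline run, four DP runs, and three FL runs, which is useful to keep in mind when estimating total runtime.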


# ============================================================================
# PART 1: SETUP AND DRIVE MOUNT
# ============================================================================

## Initial Setup and Google Drive Mount

This section configures the environment and mounts Google Drive for data access.
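
Because the Drive mount relies on `google.colab`, which is only available on Colab, it can help to choose the data root based on the runtime. The following is a minimal sketch under assumptions: `resolve_data_base`, `data_subdirs`, and the `./mhealth-data` fallback are hypothetical names, and the check relies on Colab normally having `google.colab` already loaded in the kernel.

```python
import sys
from pathlib import Path


def resolve_data_base(fallback="./mhealth-data"):
    """Pick a data root: the Drive mount on Colab, a local folder elsewhere."""
    if "google.colab" in sys.modules:
        # On Colab the module is normally preloaded; Drive mounts under /content/drive.
        return Path("/content/drive/MyDrive/mhealth-data")
    # Assumption: outside Colab, data lives in a local (or synced) folder.
    return Path(fallback)


def data_subdirs(base):
    """Build the raw/processed/models/results paths under the chosen root."""
    return {name: Path(base) / name for name in ("raw", "processed", "models", "results")}


paths = data_subdirs(resolve_data_base())
for name, path in paths.items():
    print(f"{name}: {path}")
```

On Deepnote or a local machine, the fallback folder would need to point at wherever the Drive integration exposes the files.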


In [None]:
# ============================================================================
# STEP 1: Clone Repository and Install Dependencies
# ============================================================================

import os
import subprocess
import sys

print("="*70)
print("INITIAL SETUP - PRIVACY-PRESERVING HEALTH DATA ANALYSIS")
print("="*70)

# Check if we're already in the project directory
if os.path.exists('src') and os.path.exists('requirements.txt'):
    print("✅ Already in project directory!")
else:
    print("📥 Cloning repository...")
    
    # Clone the repository
    try:
        result = subprocess.run([
            'git', 'clone', 
            'https://github.com/vasco-fernandes21/mhealth-data-privacy.git'
        ], capture_output=True, text=True, check=True)
        
        print("✅ Repository cloned successfully!")
        
        # Change to project directory
        os.chdir('mhealth-data-privacy')
        print("📁 Changed to project directory")
        
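    except subprocess.CalledProcessError as e:
        print(f"❌ Error cloning repository: {e}")
        print(e.stderr)  # git's own error message, captured by capture_output=True
        print("Please clone manually or check the repository URL")
        raise  # Stop here so the install steps below do not run in the wrong directory
    except FileNotFoundError:
        print("❌ Git not available. Please clone manually:")
        print("git clone https://github.com/vasco-fernandes21/mhealth-data-privacy.git")
        print("cd mhealth-data-privacy")
        raise  # Stop here so the install steps below do not run in the wrong directory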

print(f"Current directory: {os.getcwd()}")

# Install dependencies
print("\n📦 Installing dependencies...")
!pip install -r requirements.txt

# Install the project as an editable package
print("\n📦 Installing project package...")
!pip install -e .

print("\n✅ Dependencies installed successfully!")


In [None]:
# ============================================================================
# STEP 2: Mount Google Drive
# ============================================================================

print("="*70)
print("MOUNTING GOOGLE DRIVE")
print("="*70)

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

print("✅ Google Drive mounted successfully!")

# Define data paths in Google Drive
DRIVE_BASE = '/content/drive/MyDrive/mhealth-data'
RAW_DATA_PATH = f'{DRIVE_BASE}/raw'
PROCESSED_DATA_PATH = f'{DRIVE_BASE}/processed'
MODELS_PATH = f'{DRIVE_BASE}/models'
RESULTS_PATH = f'{DRIVE_BASE}/results'

# Create directories if they don't exist
os.makedirs(RAW_DATA_PATH, exist_ok=True)
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)
os.makedirs(MODELS_PATH, exist_ok=True)
os.makedirs(RESULTS_PATH, exist_ok=True)

# Create subdirectories for datasets
for dataset in ['sleep-edf', 'wesad']:
    os.makedirs(f'{RAW_DATA_PATH}/{dataset}', exist_ok=True)
    os.makedirs(f'{PROCESSED_DATA_PATH}/{dataset}', exist_ok=True)
    os.makedirs(f'{MODELS_PATH}/{dataset}', exist_ok=True)
    os.makedirs(f'{RESULTS_PATH}/{dataset}', exist_ok=True)

print(f"\nData paths configured:")
print(f"  Drive base: {DRIVE_BASE}")
print(f"  Raw data: {RAW_DATA_PATH}")
print(f"  Processed data: {PROCESSED_DATA_PATH}")
print(f"  Models: {MODELS_PATH}")
print(f"  Results: {RESULTS_PATH}")

print(f"\n✅ Directory structure created successfully!")

# Make variables globally available
globals()['DRIVE_BASE'] = DRIVE_BASE
globals()['RAW_DATA_PATH'] = RAW_DATA_PATH
globals()['PROCESSED_DATA_PATH'] = PROCESSED_DATA_PATH
globals()['MODELS_PATH'] = MODELS_PATH
globals()['RESULTS_PATH'] = RESULTS_PATH


In [None]:
# ============================================================================
# STEP 3: Configure Environment
# ============================================================================

print("="*70)
print("CONFIGURING ENVIRONMENT")
print("="*70)

import tensorflow as tf
import numpy as np

# Check GPU availability
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"✅ GPU available: {gpus[0].name}")
    print("   Using GPU for faster training")
    
    # Configure GPU memory growth
    try:
        tf.config.experimental.set_memory_growth(gpus[0], True)
        print("✅ GPU memory growth configured")
    except RuntimeError as e:
        print(f"⚠️  GPU memory growth configuration failed: {e}")
else:
    print("🖥️  CPU-only environment detected")
    print("   Training will be slower but fully functional")
    print("   Expected times: Baseline ~30-60min, DP ~45-90min, FL ~20-40min")
    
    # Configure CPU threading: a value of 0 lets TensorFlow pick an appropriate thread count
    tf.config.threading.set_inter_op_parallelism_threads(0)  # 0 = chosen automatically
    tf.config.threading.set_intra_op_parallelism_threads(0)  # 0 = chosen automatically
    print("✅ CPU threading left to TensorFlow's automatic settings")

print(f"\nTensorFlow version: {tf.__version__}")
print(f"CPU cores available: {os.cpu_count()}")

# Verify imports
try:
    from src.preprocessing import sleep_edf, wesad
    from src.models import lstm_baseline
    from src.privacy import dp_training, fl_training
    from src.evaluation import metrics, visualization
    print("✅ All modules imported successfully!")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("   Make sure you ran the installation step above")

print("\n" + "="*70)
print("SETUP COMPLETE!")
print("="*70)
print("\nYou can now proceed to data preprocessing.")
print("\nNext steps:")
print("1. Upload raw data to Google Drive")
print("2. Run the preprocessing section")
print("3. Run the training sections")
print("4. Run the analysis section")


# ============================================================================
# PART 2: DATA PREPROCESSING
# ============================================================================

## Processing the Sleep-EDF and WESAD Datasets

This section runs the complete preprocessing of both datasets.
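
Both preprocessing calls take `test_size=0.15`, `val_size=0.15`, and `random_state=42`. How the project performs the split internally is not shown here, so the following is only a sketch of one plausible reading: a stratified split in which both fractions refer to the full dataset. The function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.15, val_size=0.15, random_state=42):
    """Split sample indices into train/val/test, preserving class proportions.

    Assumption: both fractions are taken relative to the full dataset.
    """
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        n_val = round(len(indices) * val_size)
        test += indices[:n_test]
        val += indices[n_test:n_test + n_val]
        train += indices[n_test + n_val:]
    return train, val, test


# Toy labels: 100 samples of class 0 and 40 samples of class 1.
labels = [0] * 100 + [1] * 40
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting per class keeps the label distribution similar across the three subsets, which is what the label-distribution printout in the verification step lets you check.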


In [None]:
# ============================================================================
# STEP 4: Sleep-EDF Dataset Preprocessing
# ============================================================================

print("="*70)
print("SLEEP-EDF PREPROCESSING")
print("="*70)

# Import our preprocessing module
from src.preprocessing.sleep_edf import preprocess_sleep_edf, load_processed_sleep_edf

# Define paths
SLEEP_RAW_PATH = f'{RAW_DATA_PATH}/sleep-edf'
SLEEP_PROCESSED_PATH = f'{PROCESSED_DATA_PATH}/sleep-edf'

print(f"Raw data path: {SLEEP_RAW_PATH}")
print(f"Processed data path: {SLEEP_PROCESSED_PATH}")

# Check if raw data exists
if not os.path.exists(SLEEP_RAW_PATH):
    print(f"❌ Raw data directory not found: {SLEEP_RAW_PATH}")
    print("Please download Sleep-EDF dataset and place it in the raw directory.")
    print("See data/README.md for download instructions.")
else:
    print("✅ Raw data directory found")
    
    # List available files
    files = os.listdir(SLEEP_RAW_PATH)
    edf_files = [f for f in files if f.endswith('.edf') and not f.endswith('.hyp.edf')]
    hyp_files = [f for f in files if f.endswith('.hyp.edf')]
    
    print(f"Found {len(edf_files)} recording files and {len(hyp_files)} hypnogram files")
    if edf_files:
        print("Sample files:", edf_files[:3])

# Run preprocessing if data is available
if os.path.exists(SLEEP_RAW_PATH) and len(edf_files) > 0:
    print("\nStarting Sleep-EDF preprocessing...")
    
    # Run preprocessing
    preprocessing_info = preprocess_sleep_edf(
        data_dir=SLEEP_RAW_PATH,
        output_dir=SLEEP_PROCESSED_PATH,
        test_size=0.15,
        val_size=0.15,
        random_state=42
    )
    
    print("\n" + "="*70)
    print("SLEEP-EDF PREPROCESSING COMPLETE!")
    print("="*70)
    print(f"Preprocessing info: {preprocessing_info}")
    
    # Verify processed data
    if os.path.exists(SLEEP_PROCESSED_PATH):
        print("\nVerifying processed data...")
        
        # Load processed data to verify
        X_train_sleep, X_val_sleep, X_test_sleep, y_train_sleep, y_val_sleep, y_test_sleep, scaler_sleep, label_encoder_sleep, sleep_info = load_processed_sleep_edf(SLEEP_PROCESSED_PATH)
        
        print(f"\nData shapes:")
        print(f"  Train: {X_train_sleep.shape}")
        print(f"  Val:   {X_val_sleep.shape}")
        print(f"  Test:  {X_test_sleep.shape}")
        
        print(f"\nLabel distribution:")
        print(f"  Train: {np.bincount(y_train_sleep)}")
        print(f"  Val:   {np.bincount(y_val_sleep)}")
        print(f"  Test:  {np.bincount(y_test_sleep)}")
        
        print(f"\nClass names: {sleep_info['class_names']}")
        print(f"Features per sample: {sleep_info['n_features']}")
        
        print("\n✅ Sleep-EDF preprocessing completed successfully!")
        print("Data is ready for training models.")
        
        # Make data globally available
        globals()['X_train_sleep'] = X_train_sleep
        globals()['X_val_sleep'] = X_val_sleep
        globals()['X_test_sleep'] = X_test_sleep
        globals()['y_train_sleep'] = y_train_sleep
        globals()['y_val_sleep'] = y_val_sleep
        globals()['y_test_sleep'] = y_test_sleep
        globals()['sleep_info'] = sleep_info
        
else:
    print("❌ Cannot proceed without raw data files")
    print("Please download the Sleep-EDF dataset first.")
    # Set to None for later checks
    X_train_sleep = None


In [None]:
# ============================================================================
# STEP 5: WESAD Dataset Preprocessing
# ============================================================================

print("="*70)
print("WESAD PREPROCESSING")
print("="*70)

# Import our preprocessing module
from src.preprocessing.wesad import preprocess_wesad, load_processed_wesad

# Define paths
WESAD_RAW_PATH = f'{RAW_DATA_PATH}/wesad'
WESAD_PROCESSED_PATH = f'{PROCESSED_DATA_PATH}/wesad'

print(f"Raw data path: {WESAD_RAW_PATH}")
print(f"Processed data path: {WESAD_PROCESSED_PATH}")

# Check if raw data exists
if not os.path.exists(WESAD_RAW_PATH):
    print(f"❌ Raw data directory not found: {WESAD_RAW_PATH}")
    print("Please download WESAD dataset and place it in the raw directory.")
    print("See data/README.md for download instructions.")
else:
    print("✅ Raw data directory found")
    
    # List available files
    files = os.listdir(WESAD_RAW_PATH)
    pkl_files = [f for f in files if f.endswith('.pkl')]
    
    print(f"Found {len(pkl_files)} pickle files")
    if pkl_files:
        print("Sample files:", pkl_files[:3])

# Run preprocessing if data is available
if os.path.exists(WESAD_RAW_PATH) and len(pkl_files) > 0:
    print("\nStarting WESAD preprocessing...")
    
    # Run preprocessing
    preprocessing_info = preprocess_wesad(
        data_dir=WESAD_RAW_PATH,
        output_dir=WESAD_PROCESSED_PATH,
        test_size=0.15,
        val_size=0.15,
        random_state=42
    )
    
    print("\n" + "="*70)
    print("WESAD PREPROCESSING COMPLETE!")
    print("="*70)
    print(f"Preprocessing info: {preprocessing_info}")
    
    # Verify processed data
    if os.path.exists(WESAD_PROCESSED_PATH):
        print("\nVerifying processed data...")
        
        # Load processed data to verify
        X_train_wesad, X_val_wesad, X_test_wesad, y_train_wesad, y_val_wesad, y_test_wesad, scaler_wesad, label_encoder_wesad, wesad_info = load_processed_wesad(WESAD_PROCESSED_PATH)
        
        print(f"\nData shapes:")
        print(f"  Train: {X_train_wesad.shape}")
        print(f"  Val:   {X_val_wesad.shape}")
        print(f"  Test:  {X_test_wesad.shape}")
        
        print(f"\nLabel distribution:")
        print(f"  Train: {np.bincount(y_train_wesad)}")
        print(f"  Val:   {np.bincount(y_val_wesad)}")
        print(f"  Test:  {np.bincount(y_test_wesad)}")
        
        print(f"\nClass names: {wesad_info['class_names']}")
        print(f"Features per sample: {wesad_info['n_features']}")
        print(f"Original labels: {wesad_info['original_labels']}")
        print(f"Filtered labels: {wesad_info['filtered_labels']}")
        
        print("\n✅ WESAD preprocessing completed successfully!")
        print("Data is ready for training models.")
        
        # Make data globally available
        globals()['X_train_wesad'] = X_train_wesad
        globals()['X_val_wesad'] = X_val_wesad
        globals()['X_test_wesad'] = X_test_wesad
        globals()['y_train_wesad'] = y_train_wesad
        globals()['y_val_wesad'] = y_val_wesad
        globals()['y_test_wesad'] = y_test_wesad
        globals()['wesad_info'] = wesad_info
        
else:
    print("❌ Cannot proceed without raw data files")
    print("Please download the WESAD dataset first.")
    # Set to None for later checks
    X_train_wesad = None


# ============================================================================
# PART 3: BASELINE MODEL TRAINING
# ============================================================================

## Baseline LSTM Models (without privacy techniques)

This section trains baseline LSTM models for both datasets.
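
Each trained model is summarized by its test accuracy and F1 score. The exact averaging used by the project's `evaluate_model` is not visible in this notebook, so the sketch below assumes macro-averaged F1 (the unweighted mean of per-class F1). The helper name `accuracy_and_macro_f1` is hypothetical.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Compute accuracy and the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    f1_scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)

    return {
        "accuracy": correct / len(y_true),
        "f1_score": sum(f1_scores) / len(f1_scores),
    }


example = accuracy_and_macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(example)
```

Macro averaging treats every class equally, which matters for imbalanced labels such as sleep stages; a weighted average would instead favor the majority classes.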


In [None]:
# ============================================================================
# STEP 6: Train Sleep-EDF Baseline Model
# ============================================================================

print("="*70)
print("LSTM BASELINE TRAINING")
print("="*70)

# Import modules
from src.models.lstm_baseline import train_baseline, evaluate_model, save_model, get_default_config
from src.evaluation.visualization import plot_training_history

# Check if Sleep-EDF data is available
if 'X_train_sleep' in globals() and X_train_sleep is not None:
    print("\n" + "="*70)
    print("TRAINING SLEEP-EDF BASELINE MODEL")
    print("="*70)
    
    # Get default configuration
    config = get_default_config()
    config.update({
        'dataset': 'sleep_edf',
        'model_type': 'baseline',
        'privacy_technique': 'None'
    })
    
    print(f"Configuration: {config}")
    
    # Train model
    model_sleep, history_sleep = train_baseline(
        X_train_sleep, y_train_sleep,
        X_val_sleep, y_val_sleep,
        config
    )
    
    # Evaluate model
    results_sleep = evaluate_model(model_sleep, X_test_sleep, y_test_sleep, config['window_size'])
    results_sleep.update({
        'dataset': 'sleep_edf',
        'model_type': 'baseline',
        'privacy_technique': 'None',
        'privacy_parameter': 'N/A'
    })
    
    # Save model and results
    save_model(
        model_sleep, history_sleep, results_sleep,
        MODELS_PATH, 'sleep_edf_baseline'
    )
    
    # Save training history plot
    plot_training_history(
        history_sleep.history,
        save_path=f'{RESULTS_PATH}/training_history_sleep_edf_baseline.png',
        title='Sleep-EDF Baseline Training History'
    )
    
    print(f"\n✅ Sleep-EDF baseline model trained successfully!")
    print(f"Test Accuracy: {results_sleep['accuracy']:.4f}")
    print(f"Test F1-Score: {results_sleep['f1_score']:.4f}")
    
    # Make results globally available
    globals()['results_sleep'] = results_sleep
    
else:
    print("❌ Skipping Sleep-EDF training - data not available")
    print("Please run the preprocessing section first.")


In [None]:
# ============================================================================
# STEP 7: Train WESAD Baseline Model
# ============================================================================

# Check if WESAD data is available
if 'X_train_wesad' in globals() and X_train_wesad is not None:
    print("\n" + "="*70)
    print("TRAINING WESAD BASELINE MODEL")
    print("="*70)
    
    # Get default configuration
    config = get_default_config()
    config.update({
        'dataset': 'wesad',
        'model_type': 'baseline',
        'privacy_technique': 'None'
    })
    
    print(f"Configuration: {config}")
    
    # Train model
    model_wesad, history_wesad = train_baseline(
        X_train_wesad, y_train_wesad,
        X_val_wesad, y_val_wesad,
        config
    )
    
    # Evaluate model
    results_wesad = evaluate_model(model_wesad, X_test_wesad, y_test_wesad, config['window_size'])
    results_wesad.update({
        'dataset': 'wesad',
        'model_type': 'baseline',
        'privacy_technique': 'None',
        'privacy_parameter': 'N/A'
    })
    
    # Save model and results
    save_model(
        model_wesad, history_wesad, results_wesad,
        MODELS_PATH, 'wesad_baseline'
    )
    
    # Save training history plot
    plot_training_history(
        history_wesad.history,
        save_path=f'{RESULTS_PATH}/training_history_wesad_baseline.png',
        title='WESAD Baseline Training History'
    )
    
    print(f"\n✅ WESAD baseline model trained successfully!")
    print(f"Test Accuracy: {results_wesad['accuracy']:.4f}")
    print(f"Test F1-Score: {results_wesad['f1_score']:.4f}")
    
    # Make results globally available
    globals()['results_wesad'] = results_wesad
    
else:
    print("❌ Skipping WESAD training - data not available")
    print("Please run the preprocessing section first.")

# Summary of baseline training
print("\n" + "="*70)
print("BASELINE TRAINING COMPLETE!")
print("="*70)

# Collect all results
all_baseline_results = {}

if 'results_sleep' in globals():
    all_baseline_results['Sleep-EDF Baseline'] = results_sleep
    print(f"Sleep-EDF Baseline - Accuracy: {results_sleep['accuracy']:.4f}, F1: {results_sleep['f1_score']:.4f}")

if 'results_wesad' in globals():
    all_baseline_results['WESAD Baseline'] = results_wesad
    print(f"WESAD Baseline - Accuracy: {results_wesad['accuracy']:.4f}, F1: {results_wesad['f1_score']:.4f}")

# Save combined results
if all_baseline_results:
    from src.evaluation.metrics import save_evaluation_results
    save_evaluation_results(all_baseline_results, RESULTS_PATH, 'baseline_results.json')

print(f"\nModels saved to: {MODELS_PATH}")
print(f"Results saved to: {RESULTS_PATH}")

print("\n✅ Baseline models are ready for comparison with privacy-preserving techniques!")


# ============================================================================
# PART 4: DIFFERENTIAL PRIVACY TRAINING
# ============================================================================

## LSTM Models with Differential Privacy

This section trains LSTM models with Differential Privacy for different values of epsilon.
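
A smaller epsilon means a stronger privacy guarantee and, in general, more noise during training. The internals of `train_with_dp` are not shown in this notebook; assuming it follows the usual DP-SGD recipe, the core of one update step looks roughly like the sketch below. The function names are hypothetical, and the mapping from a target epsilon to a noise multiplier (done by a privacy accountant) is deliberately left out.

```python
import math
import random


def clip_gradient(gradient, clip_norm):
    """Scale a gradient vector so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation is proportional to the clipping bound.
    noise_std = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [value / len(per_example_grads) for value in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-example gradients
print(dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can influence the update, and the noise hides what remains of that influence. This is why low-epsilon runs are expected to lose accuracy relative to the baseline.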


In [None]:
# ============================================================================
# STEP 8: Train Sleep-EDF DP Models
# ============================================================================

print("="*70)
print("DIFFERENTIAL PRIVACY TRAINING")
print("="*70)

# Import modules
from src.privacy.dp_training import train_with_dp, evaluate_dp_model, save_dp_model, get_dp_configs
from src.evaluation.visualization import plot_tradeoff_curve

# Define epsilon values to test
epsilon_values = [0.1, 1.0, 5.0, 10.0]
print(f"\nEpsilon values to test: {epsilon_values}")

# Train Sleep-EDF DP models
if 'X_train_sleep' in globals() and X_train_sleep is not None:
    print("\n" + "="*70)
    print("TRAINING SLEEP-EDF DP MODELS")
    print("="*70)
    
    sleep_dp_results = {}
    
    for epsilon in epsilon_values:
        print(f"\n--- Training with epsilon = {epsilon} ---")
        
        # Get DP configuration
        dp_configs = get_dp_configs([epsilon])
        config = dp_configs[0]
        config.update({
            'dataset': 'sleep_edf',
            'model_type': 'dp',
            'privacy_technique': 'DP',
            'privacy_parameter': f'ε={epsilon}'
        })
        
        # Train DP model
        model_dp, history_dp, privacy_info = train_with_dp(
            X_train_sleep, y_train_sleep,
            X_val_sleep, y_val_sleep,
            config
        )
        
        # Evaluate model
        results_dp = evaluate_dp_model(model_dp, X_test_sleep, y_test_sleep, config['window_size'])
        results_dp.update({
            'dataset': 'sleep_edf',
            'model_type': 'dp',
            'privacy_technique': 'DP',
            'privacy_parameter': f'ε={epsilon}',
            'epsilon': epsilon,
            'epsilon_actual': privacy_info['epsilon_actual']
        })
        
        # Save model and results
        model_name = f'sleep_edf_dp_epsilon_{epsilon}'
        save_dp_model(
            model_dp, history_dp, results_dp, privacy_info,
            MODELS_PATH, model_name
        )
        
        # Store results
        sleep_dp_results[f'Sleep-EDF DP (ε={epsilon})'] = results_dp
        
        print(f"✅ Sleep-EDF DP model (ε={epsilon}) trained successfully!")
        print(f"Test Accuracy: {results_dp['accuracy']:.4f}")
        print(f"Actual Epsilon: {privacy_info['epsilon_actual']:.4f}")
    
    print(f"\n✅ All Sleep-EDF DP models trained!")
    
    # Make results globally available
    globals()['sleep_dp_results'] = sleep_dp_results
    
else:
    print("❌ Skipping Sleep-EDF DP training - data not available")
    sleep_dp_results = {}


In [None]:
# ============================================================================
# STEP 9: Train WESAD DP Models
# ============================================================================

# Train WESAD DP models
if 'X_train_wesad' in globals() and X_train_wesad is not None:
    print("\n" + "="*70)
    print("TRAINING WESAD DP MODELS")
    print("="*70)
    
    wesad_dp_results = {}
    
    for epsilon in epsilon_values:
        print(f"\n--- Training with epsilon = {epsilon} ---")
        
        # Get DP configuration
        dp_configs = get_dp_configs([epsilon])
        config = dp_configs[0]
        config.update({
            'dataset': 'wesad',
            'model_type': 'dp',
            'privacy_technique': 'DP',
            'privacy_parameter': f'ε={epsilon}'
        })
        
        # Train DP model
        model_dp, history_dp, privacy_info = train_with_dp(
            X_train_wesad, y_train_wesad,
            X_val_wesad, y_val_wesad,
            config
        )
        
        # Evaluate model
        results_dp = evaluate_dp_model(model_dp, X_test_wesad, y_test_wesad, config['window_size'])
        results_dp.update({
            'dataset': 'wesad',
            'model_type': 'dp',
            'privacy_technique': 'DP',
            'privacy_parameter': f'ε={epsilon}',
            'epsilon': epsilon,
            'epsilon_actual': privacy_info['epsilon_actual']
        })
        
        # Save model and results
        model_name = f'wesad_dp_epsilon_{epsilon}'
        save_dp_model(
            model_dp, history_dp, results_dp, privacy_info,
            MODELS_PATH, model_name
        )
        
        # Store results
        wesad_dp_results[f'WESAD DP (ε={epsilon})'] = results_dp
        
        print(f"✅ WESAD DP model (ε={epsilon}) trained successfully!")
        print(f"Test Accuracy: {results_dp['accuracy']:.4f}")
        print(f"Actual Epsilon: {privacy_info['epsilon_actual']:.4f}")
    
    print(f"\n✅ All WESAD DP models trained!")
    
    # Make results globally available
    globals()['wesad_dp_results'] = wesad_dp_results
    
else:
    print("❌ Skipping WESAD DP training - data not available")
    wesad_dp_results = {}

# Create DP trade-off visualizations
print("\n" + "="*70)
print("CREATING DP TRADE-OFF VISUALIZATIONS")
print("="*70)

# Combine all DP results
all_dp_results = {}
all_dp_results.update(sleep_dp_results)
all_dp_results.update(wesad_dp_results)

if all_dp_results:
    # Create trade-off curve for Sleep-EDF
    if sleep_dp_results:
        plot_tradeoff_curve(
            sleep_dp_results,
            metric='accuracy',
            privacy_param='epsilon',
            save_path=f'{RESULTS_PATH}/dp_tradeoff_sleep_edf.png',
            title='Sleep-EDF: Privacy vs. Performance Trade-off'
        )
    
    # Create trade-off curve for WESAD
    if wesad_dp_results:
        plot_tradeoff_curve(
            wesad_dp_results,
            metric='accuracy',
            privacy_param='epsilon',
            save_path=f'{RESULTS_PATH}/dp_tradeoff_wesad.png',
            title='WESAD: Privacy vs. Performance Trade-off'
        )
    
    # Save all DP results
    from src.evaluation.metrics import save_evaluation_results
    save_evaluation_results(all_dp_results, RESULTS_PATH, 'dp_results.json')
    
    print("✅ DP visualizations created and results saved!")
    
else:
    print("❌ No DP results to visualize")

# Summary of DP training
print("\n" + "="*70)
print("DP TRAINING COMPLETE!")
print("="*70)

# Print summary
if sleep_dp_results:
    print("\nSleep-EDF DP Results:")
    for model_name, results in sleep_dp_results.items():
        print(f"  {model_name}: Accuracy={results['accuracy']:.4f}, F1={results['f1_score']:.4f}, ε={results['epsilon_actual']:.4f}")

if wesad_dp_results:
    print("\nWESAD DP Results:")
    for model_name, results in wesad_dp_results.items():
        print(f"  {model_name}: Accuracy={results['accuracy']:.4f}, F1={results['f1_score']:.4f}, ε={results['epsilon_actual']:.4f}")

print(f"\nModels saved to: {MODELS_PATH}")
print(f"Results saved to: {RESULTS_PATH}")

print("\n✅ DP models are ready for comparison with baseline and FL approaches!")


# ============================================================================
# PART 5: FEDERATED LEARNING TRAINING
# ============================================================================

## LSTM Models with Federated Learning

This section trains LSTM models with Federated Learning for different numbers of clients.
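
The aggregation rule used by `train_with_fl` is not visible in this notebook. Assuming a FedAvg-style scheme, each round averages the clients' model weights, weighted by how many samples each client holds. The sketch below illustrates that idea on flat weight vectors; `federated_average` and `partition_indices` are hypothetical helpers, and the contiguous, near-equal partitioning is an assumption.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighting each by its sample count."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def partition_indices(n_samples, n_clients):
    """Split sample indices into n_clients contiguous, near-equal shards."""
    base, extra = divmod(n_samples, n_clients)
    shards, start = [], 0
    for client in range(n_clients):
        size = base + (1 if client < extra else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards


shards = partition_indices(n_samples=10, n_clients=3)
global_weights = federated_average(
    client_weights=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    client_sizes=[len(s) for s in shards],
)
print(shards, global_weights)
```

With more clients, each shard holds less data, so local updates become noisier; comparing 3, 5, and 10 clients probes how sensitive the model is to that fragmentation.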


In [None]:
# ============================================================================
# STEP 10: Train Sleep-EDF FL Models
# ============================================================================

print("="*70)
print("FEDERATED LEARNING TRAINING")
print("="*70)

# Import modules
from src.privacy.fl_training import train_with_fl, evaluate_fl_model, save_fl_model, get_fl_configs
from src.evaluation.visualization import plot_fl_convergence, plot_comparison_bars

# Define client numbers to test
n_clients_list = [3, 5, 10]
print(f"\nClient numbers to test: {n_clients_list}")

# Train Sleep-EDF FL models
if 'X_train_sleep' in globals() and X_train_sleep is not None:
    print("\n" + "="*70)
    print("TRAINING SLEEP-EDF FL MODELS")
    print("="*70)
    
    sleep_fl_results = {}
    
    for n_clients in n_clients_list:
        print(f"\n--- Training with {n_clients} clients ---")
        
        # Get FL configuration
        fl_configs = get_fl_configs([n_clients])
        config = fl_configs[0]
        config.update({
            'dataset': 'sleep_edf',
            'model_type': 'fl',
            'privacy_technique': 'FL',
            'privacy_parameter': f'{n_clients} clients'
        })
        
        # Train FL model
        model_fl, history_fl, fl_info = train_with_fl(
            X_train_sleep, y_train_sleep,
            X_val_sleep, y_val_sleep,
            config
        )
        
        # Evaluate model
        results_fl = evaluate_fl_model(model_fl, X_test_sleep, y_test_sleep, config['window_size'])
        results_fl.update({
            'dataset': 'sleep_edf',
            'model_type': 'fl',
            'privacy_technique': 'FL',
            'privacy_parameter': f'{n_clients} clients',
            'n_clients': n_clients,
            'communication_cost': fl_info['communication_cost']
        })
        
        # Save model and results
        model_name = f'sleep_edf_fl_{n_clients}_clients'
        save_fl_model(
            model_fl, history_fl, results_fl, fl_info,
            MODELS_PATH, model_name
        )
        
        # Store results
        sleep_fl_results[f'Sleep-EDF FL ({n_clients} clients)'] = results_fl
        
        print(f"✅ Sleep-EDF FL model ({n_clients} clients) trained successfully!")
        print(f"Test Accuracy: {results_fl['accuracy']:.4f}")
        print(f"Communication Cost: {fl_info['communication_cost']} rounds")
    
    print(f"\n✅ All Sleep-EDF FL models trained!")
    
    # Make results globally available
    globals()['sleep_fl_results'] = sleep_fl_results
    
else:
    print("❌ Skipping Sleep-EDF FL training - data not available")
    sleep_fl_results = {}


In [None]:
# ============================================================================
# STEP 11: Train WESAD FL Models
# ============================================================================

# Train WESAD FL models
if 'X_train_wesad' in globals() and X_train_wesad is not None:
    print("\n" + "="*70)
    print("TRAINING WESAD FL MODELS")
    print("="*70)
    
    wesad_fl_results = {}
    
    for n_clients in n_clients_list:
        print(f"\n--- Training with {n_clients} clients ---")
        
        # Get FL configuration
        fl_configs = get_fl_configs([n_clients])
        config = fl_configs[0]
        config.update({
            'dataset': 'wesad',
            'model_type': 'fl',
            'privacy_technique': 'FL',
            'privacy_parameter': f'{n_clients} clients'
        })
        
        # Train FL model
        model_fl, history_fl, fl_info = train_with_fl(
            X_train_wesad, y_train_wesad,
            X_val_wesad, y_val_wesad,
            config
        )
        
        # Evaluate model
        results_fl = evaluate_fl_model(model_fl, X_test_wesad, y_test_wesad, config['window_size'])
        results_fl.update({
            'dataset': 'wesad',
            'model_type': 'fl',
            'privacy_technique': 'FL',
            'privacy_parameter': f'{n_clients} clients',
            'n_clients': n_clients,
            'communication_cost': fl_info['communication_cost']
        })
        
        # Save model and results
        model_name = f'wesad_fl_{n_clients}_clients'
        save_fl_model(
            model_fl, history_fl, results_fl, fl_info,
            MODELS_PATH, model_name
        )
        
        # Store results
        wesad_fl_results[f'WESAD FL ({n_clients} clients)'] = results_fl
        
        print(f"✅ WESAD FL model ({n_clients} clients) trained successfully!")
        print(f"Test Accuracy: {results_fl['accuracy']:.4f}")
        print(f"Communication Cost: {fl_info['communication_cost']} rounds")
    
    print(f"\n✅ All WESAD FL models trained!")
    
    # Make results globally available
    globals()['wesad_fl_results'] = wesad_fl_results
    
else:
    print("❌ Skipping WESAD FL training - data not available")
    wesad_fl_results = {}

# Create FL visualizations
print("\n" + "="*70)
print("CREATING FL VISUALIZATIONS")
print("="*70)

# Combine all FL results
all_fl_results = {}
all_fl_results.update(sleep_fl_results)
all_fl_results.update(wesad_fl_results)

if all_fl_results:
    # Create comparison bars for Sleep-EDF
    if sleep_fl_results:
        plot_comparison_bars(
            sleep_fl_results,
            metrics=['accuracy', 'f1_score'],
            save_path=f'{RESULTS_PATH}/fl_comparison_sleep_edf.png',
            title='Sleep-EDF: FL Performance Comparison'
        )
    
    # Create comparison bars for WESAD
    if wesad_fl_results:
        plot_comparison_bars(
            wesad_fl_results,
            metrics=['accuracy', 'f1_score'],
            save_path=f'{RESULTS_PATH}/fl_comparison_wesad.png',
            title='WESAD: FL Performance Comparison'
        )
    
    # Save all FL results
    from src.evaluation.metrics import save_evaluation_results
    save_evaluation_results(all_fl_results, RESULTS_PATH, 'fl_results.json')
    
    print("✅ FL visualizations created and results saved!")
    
else:
    print("❌ No FL results to visualize")

# Summary of FL training
print("\n" + "="*70)
print("FL TRAINING COMPLETE!")
print("="*70)

# Print summary
if sleep_fl_results:
    print("\nSleep-EDF FL Results:")
    for model_name, results in sleep_fl_results.items():
        print(f"  {model_name}: Accuracy={results['accuracy']:.4f}, F1={results['f1_score']:.4f}, Cost={results['communication_cost']}")

if wesad_fl_results:
    print("\nWESAD FL Results:")
    for model_name, results in wesad_fl_results.items():
        print(f"  {model_name}: Accuracy={results['accuracy']:.4f}, F1={results['f1_score']:.4f}, Cost={results['communication_cost']}")

print(f"\nModels saved to: {MODELS_PATH}")
print(f"Results saved to: {RESULTS_PATH}")

print("\n✅ FL models are ready for final analysis and comparison!")


# ============================================================================
# PART 6: FINAL ANALYSIS AND COMPARISON
# ============================================================================

## Comprehensive Results Analysis

This section performs the final analysis, comparing all approaches: Baseline, Differential Privacy, and Federated Learning.
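
As a minimal sketch of the comparison this section performs, the helper below computes the signed change of a model's metrics relative to its dataset's baseline. The name `delta_vs_baseline` and the sample values are illustrative assumptions, not project code or measured results; the flat result layout mirrors the dictionaries built in the training cells.

```python
def delta_vs_baseline(baseline, model, metrics=("accuracy", "f1_score")):
    """Return the signed change (model - baseline) per metric.

    A negative value means the privacy technique degraded that metric.
    """
    return {m: model[m] - baseline[m] for m in metrics}


# Placeholder values for illustration only (not real results)
example_baseline = {"accuracy": 0.80, "f1_score": 0.75}
example_dp_model = {"accuracy": 0.70, "f1_score": 0.60}

deltas = delta_vs_baseline(example_baseline, example_dp_model)
for name, value in deltas.items():
    # The '+' format flag prints an explicit sign for gains and losses
    print(f"{name}: {value:+.4f}")
```

Reporting a signed delta keeps the output readable when a private model happens to outperform the baseline.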


In [None]:
# ============================================================================
# STEP 12: Load All Results and Create Comprehensive Analysis
# ============================================================================

print("="*70)
print("COMPREHENSIVE ANALYSIS")
print("="*70)

# Import modules
from src.evaluation.metrics import load_evaluation_results, compare_models, create_results_summary
from src.evaluation.visualization import (
    plot_tradeoff_curve, plot_comparison_bars, plot_privacy_analysis,
    create_summary_dashboard
)
import json
import os

import numpy as np  # used for np.mean in the final report cell
import pandas as pd

print("Loading all results...")

# Load baseline results
baseline_results = {}
if os.path.exists(f'{RESULTS_PATH}/baseline_results.json'):
    baseline_results = load_evaluation_results(f'{RESULTS_PATH}/baseline_results.json')
    print(f"✅ Baseline results loaded: {len(baseline_results)} models")

# Load DP results
dp_results = {}
if os.path.exists(f'{RESULTS_PATH}/dp_results.json'):
    dp_results = load_evaluation_results(f'{RESULTS_PATH}/dp_results.json')
    print(f"✅ DP results loaded: {len(dp_results)} models")

# Load FL results
fl_results = {}
if os.path.exists(f'{RESULTS_PATH}/fl_results.json'):
    fl_results = load_evaluation_results(f'{RESULTS_PATH}/fl_results.json')
    print(f"✅ FL results loaded: {len(fl_results)} models")

# Combine all results
all_results = {}
all_results.update(baseline_results)
all_results.update(dp_results)
all_results.update(fl_results)

print(f"\nTotal models analyzed: {len(all_results)}")
techniques = {results.get('privacy_technique', 'Unknown') for results in all_results.values()}
print(f"Techniques: {techniques}")

# Create comprehensive summary
if all_results:
    print("\n" + "="*70)
    print("CREATING RESULTS SUMMARY")
    print("="*70)
    
    # Create comprehensive summary
    summary_df = create_results_summary(all_results, f'{RESULTS_PATH}/comprehensive_results_summary.csv')
    
    print("Results Summary Table:")
    print(summary_df.to_string(index=False))
    
    # Save detailed summary
    summary_df.to_csv(f'{RESULTS_PATH}/detailed_results_summary.csv', index=False)
    print(f"\n✅ Detailed summary saved to: {RESULTS_PATH}/detailed_results_summary.csv")
    
    # Create comparison by technique
    print("\n" + "-"*50)
    print("COMPARISON BY TECHNIQUE")
    print("-"*50)
    
    technique_comparison = summary_df.groupby('Privacy_Technique').agg({
        'Accuracy': ['mean', 'std'],
        'F1-Score': ['mean', 'std'],
        'Model': 'count'
    }).round(4)
    
    print(technique_comparison)
    
else:
    print("❌ No results found. Please run training sections first.")
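
The per-technique aggregation above relies on the column names produced by `create_results_summary`. As a hedged alternative that works directly on the result dictionaries, the sketch below groups models by `privacy_technique` using only the standard library. It assumes a metric may sit either at the top level of a result (as in the training cells) or under a nested `metrics` key (as in the statistical analysis cell); `get_metric` and `summarize_by_technique` are hypothetical helpers, and the sample values are placeholders.

```python
from collections import defaultdict
from statistics import mean, stdev


def get_metric(results, name):
    """Read a metric from a nested ('metrics') or a flat result dict."""
    nested = results.get("metrics", {})
    return nested[name] if name in nested else results[name]


def summarize_by_technique(all_results, metric="accuracy"):
    """Return mean, sample std, and count of a metric per privacy technique."""
    groups = defaultdict(list)
    for results in all_results.values():
        technique = results.get("privacy_technique", "Unknown")
        groups[technique].append(get_metric(results, metric))
    return {
        technique: {
            "mean": mean(values),
            # Sample std is undefined for a single model, so report NaN
            "std": stdev(values) if len(values) > 1 else float("nan"),
            "count": len(values),
        }
        for technique, values in groups.items()
    }


# Placeholder values for illustration only (not real results)
sample_results = {
    "model_a": {"privacy_technique": "DP", "metrics": {"accuracy": 0.6}},
    "model_b": {"privacy_technique": "DP", "accuracy": 0.8},
    "model_c": {"privacy_technique": "FL", "accuracy": 0.7},
}
technique_summary = summarize_by_technique(sample_results)
print(technique_summary)
```

`statistics.stdev` computes the sample standard deviation, which matches the default behaviour of pandas `std`.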


In [None]:
# ============================================================================
# STEP 13: Create Comprehensive Visualizations
# ============================================================================

if all_results:
    print("\n" + "="*70)
    print("CREATING COMPREHENSIVE VISUALIZATIONS")
    print("="*70)
    
    # 1. Privacy vs. Performance Trade-off Analysis
    print("Creating privacy analysis...")
    plot_privacy_analysis(
        all_results,
        save_path=f'{RESULTS_PATH}/privacy_analysis_comprehensive.png',
        title='Comprehensive Privacy Analysis'
    )
    
    # 2. Model Comparison Bars
    print("Creating model comparison...")
    plot_comparison_bars(
        all_results,
        metrics=['accuracy', 'f1_score'],
        save_path=f'{RESULTS_PATH}/model_comparison_comprehensive.png',
        title='Comprehensive Model Comparison'
    )
    
    # 3. Trade-off Curves for DP
    dp_models = {k: v for k, v in all_results.items() if v.get('privacy_technique') == 'DP'}
    if dp_models:
        print("Creating DP trade-off curves...")
        plot_tradeoff_curve(
            dp_models,
            metric='accuracy',
            privacy_param='epsilon',
            save_path=f'{RESULTS_PATH}/dp_tradeoff_comprehensive.png',
            title='Differential Privacy: Privacy vs. Performance Trade-off'
        )
    
    # 4. Summary Dashboard
    print("Creating summary dashboard...")
    create_summary_dashboard(
        all_results,
        save_path=f'{RESULTS_PATH}/summary_dashboard.png'
    )
    
    print("✅ All visualizations created successfully!")
    
else:
    print("❌ No results to visualize")


In [None]:
# ============================================================================
# STEP 14: Statistical Analysis and Final Report
# ============================================================================

if all_results:
    print("\n" + "="*70)
    print("STATISTICAL ANALYSIS")
    print("="*70)
    
    # Extract baseline results for comparison
    baseline_models = {k: v for k, v in all_results.items() if v.get('privacy_technique') == 'None'}
    dp_models = {k: v for k, v in all_results.items() if v.get('privacy_technique') == 'DP'}
    fl_models = {k: v for k, v in all_results.items() if v.get('privacy_technique') == 'FL'}
    
    print("Statistical Analysis Results:")
    print("-" * 50)
    
    # Analyze each dataset
    for dataset in ['sleep_edf', 'wesad']:
        print(f"\n{dataset.upper()} Dataset Analysis:")
        
        # Find baseline for this dataset
        dataset_baseline = None
        for model_name, results in baseline_models.items():
            if results.get('dataset') == dataset:
                dataset_baseline = results
                break
        
        if dataset_baseline:
            baseline_acc = dataset_baseline['metrics']['accuracy']
            baseline_f1 = dataset_baseline['metrics']['f1_score']
            
            print(f"  Baseline: Accuracy={baseline_acc:.4f}, F1={baseline_f1:.4f}")
            
            # Compare DP models
            dataset_dp = {k: v for k, v in dp_models.items() if v.get('dataset') == dataset}
            if dataset_dp:
                print("  DP Models:")
                for model_name, results in dataset_dp.items():
                    acc = results['metrics']['accuracy']
                    f1 = results['metrics']['f1_score']
                    epsilon = results.get('epsilon', 'N/A')
                    # Signed change vs. baseline (negative = degradation)
                    acc_delta = acc - baseline_acc
                    f1_delta = f1 - baseline_f1
                    print(f"    {model_name}: Acc={acc:.4f} ({acc_delta:+.4f}), F1={f1:.4f} ({f1_delta:+.4f}), ε={epsilon}")
            
            # Compare FL models
            dataset_fl = {k: v for k, v in fl_models.items() if v.get('dataset') == dataset}
            if dataset_fl:
                print("  FL Models:")
                for model_name, results in dataset_fl.items():
                    acc = results['metrics']['accuracy']
                    f1 = results['metrics']['f1_score']
                    n_clients = results.get('n_clients', 'N/A')
                    # Signed change vs. baseline (negative = degradation)
                    acc_delta = acc - baseline_acc
                    f1_delta = f1 - baseline_f1
                    print(f"    {model_name}: Acc={acc:.4f} ({acc_delta:+.4f}), F1={f1:.4f} ({f1_delta:+.4f}), Clients={n_clients}")
    
    print("\n✅ Statistical analysis completed!")
    
    # Generate final report
    print("\n" + "="*70)
    print("GENERATING FINAL REPORT")
    print("="*70)
    
    # Create final report
    report = {
        'project': 'Privacy-Preserving Health Data Analysis',
        'datasets': ['Sleep-EDF', 'WESAD'],
        'techniques': ['Baseline', 'Differential Privacy', 'Federated Learning'],
        'total_models': len(all_results),
        'summary': summary_df.to_dict('records') if 'summary_df' in locals() else [],
        'key_findings': []
    }
    
    # Add key findings
    # Find best performing models
    best_accuracy = max(all_results.items(), key=lambda x: x[1]['metrics']['accuracy'])
    best_f1 = max(all_results.items(), key=lambda x: x[1]['metrics']['f1_score'])
    
    report['key_findings'].extend([
        f"Best Accuracy: {best_accuracy[0]} ({best_accuracy[1]['metrics']['accuracy']:.4f})",
        f"Best F1-Score: {best_f1[0]} ({best_f1[1]['metrics']['f1_score']:.4f})"
    ])
    
    # Analyze privacy trade-offs
    if dp_models:
        dp_accuracies = [results['metrics']['accuracy'] for results in dp_models.values()]
        avg_dp_accuracy = np.mean(dp_accuracies)
        report['key_findings'].append(f"Average DP Accuracy: {avg_dp_accuracy:.4f}")
    
    if fl_models:
        fl_accuracies = [results['metrics']['accuracy'] for results in fl_models.values()]
        avg_fl_accuracy = np.mean(fl_accuracies)
        report['key_findings'].append(f"Average FL Accuracy: {avg_fl_accuracy:.4f}")
    
    # Save final report
    with open(f'{RESULTS_PATH}/final_report.json', 'w') as f:
        json.dump(report, f, indent=2)
    
    print("Final Report Generated:")
    print("-" * 30)
    print(f"Project: {report['project']}")
    print(f"Datasets: {', '.join(report['datasets'])}")
    print(f"Techniques: {', '.join(report['techniques'])}")
    print(f"Total Models: {report['total_models']}")
    print("\nKey Findings:")
    for finding in report['key_findings']:
        print(f"  • {finding}")
    
    print(f"\n✅ Final report saved to: {RESULTS_PATH}/final_report.json")
    
else:
    print("❌ No results for statistical analysis")
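
Sleep-EDF is a 5-class task and WESAD a 2-class task, so a single best model chosen across both datasets may mostly reflect task difficulty rather than the privacy technique. The sketch below picks the best model per dataset instead. It assumes each result carries a `dataset` key, as set in the training cells, and that metrics may be flat or nested under `metrics`; `best_per_dataset` is a hypothetical helper and the values are placeholders.

```python
def best_per_dataset(all_results, metric="accuracy"):
    """Return {dataset: (model_name, value)} for the top model on each dataset."""
    best = {}
    for model_name, results in all_results.items():
        dataset = results.get("dataset", "unknown")
        # Fall back to the flat layout when there is no nested 'metrics' dict
        value = results.get("metrics", results)[metric]
        if dataset not in best or value > best[dataset][1]:
            best[dataset] = (model_name, value)
    return best


# Placeholder values for illustration only (not real results)
example_results = {
    "sleep baseline": {"dataset": "sleep_edf", "metrics": {"accuracy": 0.80}},
    "sleep fl": {"dataset": "sleep_edf", "accuracy": 0.75},
    "wesad baseline": {"dataset": "wesad", "metrics": {"accuracy": 0.90}},
}
best_models = best_per_dataset(example_results)
for dataset, (model_name, value) in best_models.items():
    print(f"{dataset}: {model_name} ({value:.4f})")
```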


In [None]:
# ============================================================================
# STEP 15: Project Completion Summary
# ============================================================================

print("\n" + "="*70)
print("PROJECT COMPLETION SUMMARY")
print("="*70)

print("🎉 Privacy-Preserving Health Data Analysis Project Complete!")
print("\nWhat was accomplished:")
print("✅ Preprocessed Sleep-EDF and WESAD datasets")
print("✅ Trained baseline LSTM models")
print("✅ Implemented Differential Privacy with multiple epsilon values")
print("✅ Implemented Federated Learning with multiple client configurations")
print("✅ Comprehensive evaluation and comparison")
print("✅ Statistical analysis of privacy-performance trade-offs")
print("✅ Generated visualizations and final report")

print(f"\nResults saved in: {RESULTS_PATH}")
print("Files generated:")
print("  • comprehensive_results_summary.csv")
print("  • detailed_results_summary.csv")
print("  • privacy_analysis_comprehensive.png")
print("  • model_comparison_comprehensive.png")
print("  • dp_tradeoff_comprehensive.png")
print("  • summary_dashboard.png")
print("  • final_report.json")

print("\nNext steps for your thesis:")
print("1. Review the results and visualizations")
print("2. Write the methodology section using the implemented code")
print("3. Analyze the trade-offs between privacy and performance")
print("4. Discuss implications for mobile health applications")
print("5. Create conclusions and recommendations")

print("\n🚀 Your project is ready for thesis writing!")
print("All code is modular and well-documented for reproducibility.")

print("\n" + "="*70)
print("NOTEBOOK EXECUTION COMPLETE!")
print("="*70)
print("\nThis consolidated notebook covers the entire pipeline (check each cell's output to confirm it ran):")
print("1. ✅ Setup and Drive Mount")
print("2. ✅ Data Preprocessing (Sleep-EDF & WESAD)")
print("3. ✅ Baseline Model Training")
print("4. ✅ Differential Privacy Training")
print("5. ✅ Federated Learning Training")
print("6. ✅ Comprehensive Analysis and Visualization")
print("\nAll results are saved to your Google Drive for future reference.")
