# Sleep-EDF PyTorch Training - Google Colab

This notebook trains an optimized PyTorch CNN+LSTM model for sleep stage classification using the Sleep-EDF dataset.

## Features:
- Optimized CNN+LSTM model (target: >=87% accuracy)
- Efficient data loading for large datasets
- Clean logs during training (no emojis)
- Early stopping and learning rate scheduling
- Memory and speed optimizations
- Automatic hardware acceleration (CUDA > MPS > CPU)
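
The CUDA > MPS > CPU preference can be sketched as below. This is only an illustration of the ordering, not the repository's actual detection code; `pick_device` and `detect_device` are hypothetical helper names, and the fallback to CPU when PyTorch is missing is an assumption made so the snippet runs anywhere.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name, in the order CUDA > MPS > CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute, so look it up defensively
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


print(f"Using device: {detect_device()}")
```

On a Colab GPU runtime this would normally select `cuda`; MPS applies only to Apple Silicon machines, so it matters when the same scripts are run locally.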

## Requirements:
- Google Colab with GPU enabled
- Data in Google Drive: `MyDrive/mhealth-data/data/processed/sleep-edf/`


## 1. Initial Setup


In [None]:
# Clone the repository (if not already present)
import os
if not os.path.exists('/content/mhealth-data-privacy'):
    !git clone https://github.com/vasco-fernandes21/mhealth-data-privacy.git

# Install dependencies from requirements.txt (the file only exists after the clone)
!pip install -r /content/mhealth-data-privacy/requirements.txt

# Make the repository importable
import sys
sys.path.append('/content/mhealth-data-privacy')

print("Repository cloned and dependencies installed")


In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Check if data exists
import os
data_path = '/content/drive/MyDrive/mhealth-data/data/processed/sleep-edf'

if os.path.exists(data_path):
    print(f"Data found at: {data_path}")
    
    # List files
    files = os.listdir(data_path)
    print(f"Available files: {files}")
    
    # Check file sizes
    for file in ['X_train.npy', 'y_train.npy', 'X_val.npy', 'y_val.npy', 'X_test.npy', 'y_test.npy']:
        if file in files:
            size = os.path.getsize(os.path.join(data_path, file))
            print(f"  {file}: {size / (1024*1024):.1f} MB")
        else:
            print(f"  {file} not found")
else:
    print(f"Data not found at: {data_path}")
    print("Make sure the data is in the correct path on Google Drive")


## 2. Model Training

Choose your training option:


### Option A: Baseline LSTM Training


In [None]:
# Run training using the repository script
import os, shutil

repo_data_dir = '/content/mhealth-data-privacy/data/processed/sleep-edf'
drive_data_dir = '/content/drive/MyDrive/mhealth-data/data/processed/sleep-edf'
models_dir = '/content/mhealth-data-privacy/models/sleep-edf/baseline'
results_dir = '/content/mhealth-data-privacy/results/sleep-edf/baseline'

# Ensure output directories
os.makedirs(models_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

# Create symbolic link from Drive data to expected path
os.makedirs('/content/mhealth-data-privacy/data/processed', exist_ok=True)
if os.path.islink(repo_data_dir) or os.path.exists(repo_data_dir):
    try:
        if os.path.islink(repo_data_dir):
            os.unlink(repo_data_dir)
        else:
            shutil.rmtree(repo_data_dir)
    except Exception as e:
        print(f"Warning when removing old destination: {e}")

# Create symlink
!ln -sf "$drive_data_dir" "$repo_data_dir"
print(f"Data referenced via symlink: {repo_data_dir} -> {drive_data_dir}")

print(f"Starting CNN+LSTM training with data from: {drive_data_dir}")
print("=" * 80)

# Run the baseline training script
!python /content/mhealth-data-privacy/src/train/sleep-edf/train_baseline.py

print("Training completed!")


### Option B: Differential Privacy Training


In [None]:
# Run differential privacy training
import os, shutil

repo_data_dir = '/content/mhealth-data-privacy/data/processed/sleep-edf'
drive_data_dir = '/content/drive/MyDrive/mhealth-data/data/processed/sleep-edf'
models_dir = '/content/mhealth-data-privacy/models/sleep-edf/dp'
results_dir = '/content/mhealth-data-privacy/results/sleep-edf/dp'

# Ensure output directories
os.makedirs(models_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

# Create symbolic link from Drive data to expected path
os.makedirs('/content/mhealth-data-privacy/data/processed', exist_ok=True)
if os.path.islink(repo_data_dir) or os.path.exists(repo_data_dir):
    try:
        if os.path.islink(repo_data_dir):
            os.unlink(repo_data_dir)
        else:
            shutil.rmtree(repo_data_dir)
    except Exception as e:
        print(f"Warning when removing old destination: {e}")

# Create symlink
!ln -sf "$drive_data_dir" "$repo_data_dir"
print(f"Data referenced via symlink: {repo_data_dir} -> {drive_data_dir}")

print(f"Starting DP training with data from: {drive_data_dir}")
print("Using DPLSTM for Opacus compatibility")
print("=" * 80)

# Run the DP training script
!python /content/mhealth-data-privacy/src/train/sleep-edf/differential_privacy/train_dp.py

print("Differential privacy training completed!")


## 3. Results Analysis


In [None]:
# Load and analyze results
import os
import json
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Choose which results to load
training_type = 'baseline'  # Change to 'dp' for Differential Privacy results

if training_type == 'baseline':
    results_path = '/content/mhealth-data-privacy/models/sleep-edf/baseline/results_sleep_edf_optimized.json'
else:
    results_path = '/content/mhealth-data-privacy/models/sleep-edf/dp/results_sleep_edf_dp.json'

if os.path.exists(results_path):
    with open(results_path, 'r') as f:
        results = json.load(f)

    print(f"FINAL RESULTS ({training_type.upper()}):")
    print("=" * 50)
    print(f"Accuracy:  {results['accuracy']:.4f}")
    print(f"Precision: {results['precision']:.4f}")
    print(f"Recall:    {results['recall']:.4f}")
    print(f"F1-Score:  {results['f1_score']:.4f}")
    
    # If DP, show privacy budget
    if training_type == 'dp' and 'dp_params' in results:
        print(f"\nPrivacy Budget:")
        print(f"  Epsilon: {results['dp_params']['final_epsilon']:.2f}")
        print(f"  Delta: {results['dp_params']['delta']:.0e}")
        print(f"  Noise Multiplier: {results['dp_params']['noise_multiplier']:.2f}")

    # Confusion matrix (rows: actual class, columns: predicted class)
    print("\nCONFUSION MATRIX:")
    cm = np.array(results['confusion_matrix'])
    class_names = results['class_names']

    # Header row: corner label, then one right-aligned column per predicted class
    print(f"{'Actual':8s}", end="")
    for name in class_names:
        print(f"{name:>8s}", end="")
    print()

    # One row per actual class
    for i, row in enumerate(cm):
        print(f"{class_names[i]:8s}", end="")
        for val in row:
            print(f"{val:8d}", end="")
        print()

    # Plot confusion matrix
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.title(f'Confusion Matrix - Sleep-EDF {training_type.upper()}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

else:
    print(f"Results not found at: {results_path}")
    print("Run the training cell first")


## 4. Multiple Runs Experiments

This section repeats training several times:
- **Baseline**: 5 runs with different random seeds, to measure run-to-run variance
- **Differential Privacy**: 5 runs with a fixed seed and varying noise multipliers, to trace the privacy-utility trade-off
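
The run cells in this section pass their settings to the training scripts through the environment variables `TRAIN_SEED`, `MODEL_DIR`, `RESULTS_DIR` and `NOISE_MULTIPLIER`. A minimal sketch of how a script could read them follows. The function name `read_run_config` and every default value are hypothetical, and the notebook does not confirm that the repository scripts parse the variables this way.

```python
import os


def read_run_config(env=None):
    """Read per-run settings from environment variables.

    All defaults below are illustrative assumptions, not the repository's values.
    """
    env = os.environ if env is None else env
    return {
        "seed": int(env.get("TRAIN_SEED", "42")),
        "model_dir": env.get("MODEL_DIR", "models/sleep-edf/baseline"),
        "results_dir": env.get("RESULTS_DIR", "results/sleep-edf/baseline"),
        "noise_multiplier": float(env.get("NOISE_MULTIPLIER", "1.0")),
    }


# Example with an explicit mapping instead of the real process environment
config = read_run_config({"TRAIN_SEED": "123", "NOISE_MULTIPLIER": "0.7"})
print(config)
```

Values written to `os.environ` stay set for the rest of the kernel session. If a single-run cell is executed after this section, clearing them first (for example `os.environ.pop('MODEL_DIR', None)`) avoids writing into the directory of the last run.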


### 4.1 Baseline Multiple Runs (5 runs)


In [None]:
import os, shutil
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Configuration
seeds = [42, 123, 456, 789, 999]
num_runs = len(seeds)  # One run per seed
results_baseline = []

# Setup directories
repo_data_dir = '/content/mhealth-data-privacy/data/processed/sleep-edf'
drive_data_dir = '/content/drive/MyDrive/mhealth-data/data/processed/sleep-edf'
base_models_dir = '/content/mhealth-data-privacy/models/sleep-edf'
base_results_dir = '/content/mhealth-data-privacy/results/sleep-edf'

# Create symlink once
os.makedirs('/content/mhealth-data-privacy/data/processed', exist_ok=True)
if os.path.islink(repo_data_dir) or os.path.exists(repo_data_dir):
    try:
        if os.path.islink(repo_data_dir):
            os.unlink(repo_data_dir)
        else:
            shutil.rmtree(repo_data_dir)
    except Exception as e:
        print(f"Warning: {e}")

!ln -sf "$drive_data_dir" "$repo_data_dir"
print(f"Data linked: {repo_data_dir} -> {drive_data_dir}\n")

print(f"Starting {num_runs} baseline runs...")
print("=" * 50)

for i, seed in enumerate(seeds):
    print(f"\n{'='*60}")
    print(f"RUN {i+1}/{num_runs} - SEED {seed}")
    print("="*60)
    
    # Create run-specific directories
    run_dir = f'baseline_run{i+1}'
    models_dir = f'{base_models_dir}/{run_dir}'
    results_dir = f'{base_results_dir}/{run_dir}'
    os.makedirs(models_dir, exist_ok=True)
    os.makedirs(results_dir, exist_ok=True)
    
    # Set environment variables
    os.environ['TRAIN_SEED'] = str(seed)
    os.environ['MODEL_DIR'] = models_dir
    os.environ['RESULTS_DIR'] = results_dir
    
    # Run training
    !python /content/mhealth-data-privacy/src/train/sleep-edf/train_baseline.py
    
    # Load results
    results_path = f'{models_dir}/results_sleep_edf_optimized.json'
    if os.path.exists(results_path):
        with open(results_path, 'r') as f:
            run_results = json.load(f)
        
        results_baseline.append({
            'run': i+1,
            'seed': seed,
            'accuracy': run_results['accuracy'],
            'f1_score': run_results['f1_score'],
            'precision': run_results['precision'],
            'recall': run_results['recall'],
            'confusion_matrix': run_results['confusion_matrix'],
            'timestamp': datetime.now().isoformat()
        })
        
        print(f"\nRun {i+1} Results: Accuracy={run_results['accuracy']:.4f}, F1={run_results['f1_score']:.4f}")
    else:
        print(f"Warning: Results not found for run {i+1}")


In [None]:
# Calculate statistics and plot results
if results_baseline:
    # Extract metrics
    accuracies = [r['accuracy'] for r in results_baseline]
    f1_scores = [r['f1_score'] for r in results_baseline]
    precisions = [r['precision'] for r in results_baseline]
    recalls = [r['recall'] for r in results_baseline]
    
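    # Calculate mean and sample standard deviation
    # (ddof=1 because the runs are a small sample; fall back to 0 for a single run)
    ddof = 1 if len(results_baseline) > 1 else 0
    stats = {
        'accuracy': {'mean': np.mean(accuracies), 'std': np.std(accuracies, ddof=ddof)},
        'f1_score': {'mean': np.mean(f1_scores), 'std': np.std(f1_scores, ddof=ddof)},
        'precision': {'mean': np.mean(precisions), 'std': np.std(precisions, ddof=ddof)},
        'recall': {'mean': np.mean(recalls), 'std': np.std(recalls, ddof=ddof)}
    }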
    
    # Plot accuracy and F1 for each run
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Accuracy plot
    runs = list(range(1, len(results_baseline) + 1))
    axes[0].plot(runs, accuracies, marker='o', linewidth=2, markersize=8, label='Run Accuracy')
    axes[0].axhline(stats['accuracy']['mean'], color='red', linestyle='--', label=f"Mean: {stats['accuracy']['mean']:.4f}")
    axes[0].fill_between(runs, 
                          stats['accuracy']['mean'] - stats['accuracy']['std'],
                          stats['accuracy']['mean'] + stats['accuracy']['std'],
                          alpha=0.2, color='red', label=f"±1 std: {stats['accuracy']['std']:.4f}")
    axes[0].set_xlabel('Run')
    axes[0].set_ylabel('Accuracy')
    axes[0].set_title('Accuracy per Run - Sleep-EDF Baseline')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_xticks(runs)
    
    # F1-Score plot
    axes[1].plot(runs, f1_scores, marker='s', linewidth=2, markersize=8, label='Run F1-Score', color='green')
    axes[1].axhline(stats['f1_score']['mean'], color='darkgreen', linestyle='--', label=f"Mean: {stats['f1_score']['mean']:.4f}")
    axes[1].fill_between(runs,
                          stats['f1_score']['mean'] - stats['f1_score']['std'],
                          stats['f1_score']['mean'] + stats['f1_score']['std'],
                          alpha=0.2, color='green', label=f"±1 std: {stats['f1_score']['std']:.4f}")
    axes[1].set_xlabel('Run')
    axes[1].set_ylabel('F1-Score')
    axes[1].set_title('F1-Score per Run - Sleep-EDF Baseline')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xticks(runs)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary table
    print("\n" + "="*70)
    print("BASELINE MULTIPLE RUNS SUMMARY")
    print("="*70)
    print(f"Number of runs: {len(results_baseline)}\n")
    print(f"Metric        Mean ± Std")
    print("-" * 35)
    for metric, values in stats.items():
        print(f"{metric:12s}  {values['mean']:.4f} ± {values['std']:.4f}")
    print("="*70)
    
    # Save summary
    summary_path = f'{base_results_dir}/baseline_5runs.json'
    with open(summary_path, 'w') as f:
        json.dump({'runs': results_baseline, 'statistics': stats}, f, indent=2)
    print(f"\nSummary saved to: {summary_path}")


### 4.2 Differential Privacy Multiple Runs (5 runs with different noise_multipliers)
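
For context on what the noise multiplier controls, the sketch below shows the standard DP-SGD aggregation step: each per-sample gradient is clipped to a maximum L2 norm, the clipped gradients are summed, and Gaussian noise with standard deviation `noise_multiplier * max_norm` is added before averaging. It is a plain-Python illustration of the general recipe, not the code in `train_dp.py` (which relies on Opacus); the function names and example values are hypothetical.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def dp_sgd_noisy_mean(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm  # Noise std grows with the multiplier
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(42)
grads = [[3.0, 4.0], [0.3, 0.4]]  # Two toy per-sample gradients
for nm in [0.5, 0.9, 1.3]:
    print(nm, dp_sgd_noisy_mean(grads, max_norm=1.0, noise_multiplier=nm, rng=rng))
```

A larger noise multiplier perturbs every update more strongly, which yields a smaller ε for the same number of training steps and typically lowers accuracy. That is the trade-off the runs below measure.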


In [None]:
# DP Multiple Runs Configuration
# Requires the section 4.1 setup cell to have run (imports, base_models_dir, base_results_dir, data symlink)
noise_multipliers = [0.5, 0.7, 0.9, 1.1, 1.3]
results_dp = []

print(f"Starting {len(noise_multipliers)} DP runs with different noise multipliers...")
print("=" * 60)

for i, noise_mult in enumerate(noise_multipliers):
    print(f"\n{'='*60}")
    print(f"DP RUN {i+1}/{len(noise_multipliers)} - NOISE_MULTIPLIER {noise_mult}")
    print("="*60)
    
    # Create run-specific directories
    run_dir = f'dp_run{i+1}_noise{noise_mult}'
    models_dir = f'{base_models_dir}/{run_dir}'
    results_dir = f'{base_results_dir}/{run_dir}'
    os.makedirs(models_dir, exist_ok=True)
    os.makedirs(results_dir, exist_ok=True)
    
    # Set environment variables
    os.environ['TRAIN_SEED'] = '42'  # Same seed for all DP runs
    os.environ['MODEL_DIR'] = models_dir
    os.environ['RESULTS_DIR'] = results_dir
    os.environ['NOISE_MULTIPLIER'] = str(noise_mult)
    
    # Run DP training
    !python /content/mhealth-data-privacy/src/train/sleep-edf/differential_privacy/train_dp.py
    
    # Load results
    results_path = f'{models_dir}/results_sleep_edf_dp.json'
    if os.path.exists(results_path):
        with open(results_path, 'r') as f:
            run_results = json.load(f)
        
        # Extract epsilon
        epsilon = run_results.get('dp_params', {}).get('final_epsilon', 0.0)
        if epsilon == 0.0 and 'epsilon' in run_results:
            epsilon = run_results['epsilon']
        
        results_dp.append({
            'run': i+1,
            'noise_multiplier': noise_mult,
            'epsilon': epsilon,
            'accuracy': run_results['accuracy'],
            'f1_score': run_results['f1_score'],
            'precision': run_results['precision'],
            'recall': run_results['recall'],
            'confusion_matrix': run_results['confusion_matrix'],
            'timestamp': datetime.now().isoformat()
        })
        
        print(f"\nDP Run {i+1} Results: ε={epsilon:.2f}, Accuracy={run_results['accuracy']:.4f}, F1={run_results['f1_score']:.4f}")
    else:
        print(f"Warning: Results not found for DP run {i+1}")


In [None]:
# Calculate statistics and plot privacy-utility trade-off
if results_dp:
    # Extract metrics
    epsilons = [r['epsilon'] for r in results_dp]
    dp_accuracies = [r['accuracy'] for r in results_dp]
    dp_f1_scores = [r['f1_score'] for r in results_dp]
    dp_precisions = [r['precision'] for r in results_dp]
    dp_recalls = [r['recall'] for r in results_dp]
    noise_mults = [r['noise_multiplier'] for r in results_dp]
    
    # Statistics
    dp_stats = {
        'epsilon_range': {'min': min(epsilons), 'max': max(epsilons)},
        'accuracy': {'min': min(dp_accuracies), 'max': max(dp_accuracies)},
        'f1_score': {'min': min(dp_f1_scores), 'max': max(dp_f1_scores)},
        'precision': {'min': min(dp_precisions), 'max': max(dp_precisions)},
        'recall': {'min': min(dp_recalls), 'max': max(dp_recalls)}
    }
    
    # Plot privacy-utility trade-off
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Epsilon vs Accuracy
    axes[0].plot(epsilons, dp_accuracies, marker='o', linewidth=2, markersize=8, label='DP Accuracy')
    if results_baseline:
        axes[0].axhline(stats['accuracy']['mean'], color='red', linestyle='--', 
                       label=f"Baseline Mean: {stats['accuracy']['mean']:.4f}")
    for i, (eps, acc, nm) in enumerate(zip(epsilons, dp_accuracies, noise_mults)):
        axes[0].annotate(f'σ={nm}', (eps, acc), textcoords="offset points", 
                        xytext=(0,10), ha='center', fontsize=8)
    axes[0].set_xlabel('Privacy Budget (ε)')
    axes[0].set_ylabel('Accuracy')
    axes[0].set_title('Privacy-Utility Trade-off: Accuracy - Sleep-EDF')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Epsilon vs F1-Score
    axes[1].plot(epsilons, dp_f1_scores, marker='s', linewidth=2, markersize=8, label='DP F1-Score', color='green')
    if results_baseline:
        axes[1].axhline(stats['f1_score']['mean'], color='darkgreen', linestyle='--',
                       label=f"Baseline Mean: {stats['f1_score']['mean']:.4f}")
    for i, (eps, f1, nm) in enumerate(zip(epsilons, dp_f1_scores, noise_mults)):
        axes[1].annotate(f'σ={nm}', (eps, f1), textcoords="offset points",
                        xytext=(0,10), ha='center', fontsize=8)
    axes[1].set_xlabel('Privacy Budget (ε)')
    axes[1].set_ylabel('F1-Score')
    axes[1].set_title('Privacy-Utility Trade-off: F1-Score - Sleep-EDF')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary table
    print("\n" + "="*70)
    print("DIFFERENTIAL PRIVACY MULTIPLE RUNS SUMMARY")
    print("="*70)
    print(f"Number of runs: {len(results_dp)}\n")
    print(f"Run  Noise σ   ε      Accuracy  F1-Score  Precision  Recall")
    print("-" * 70)
    for r in results_dp:
        print(f"{r['run']:3d}  {r['noise_multiplier']:5.1f}  {r['epsilon']:6.2f}  {r['accuracy']:.4f}    {r['f1_score']:.4f}    {r['precision']:.4f}     {r['recall']:.4f}")
    print("\nMetric Ranges:")
    print("-" * 35)
    for metric, values in dp_stats.items():
        # Width 14 fits the longest key ('epsilon_range') so the columns line up
        if 'range' in metric:
            print(f"{metric:14s}  [{values['min']:.2f}, {values['max']:.2f}]")
        else:
            print(f"{metric:14s}  [{values['min']:.4f}, {values['max']:.4f}]")
    print("="*70)
    
    # Save summary
    summary_path = f'{base_results_dir}/dp_5runs.json'
    with open(summary_path, 'w') as f:
        json.dump({'runs': results_dp, 'statistics': dp_stats}, f, indent=2)
    print(f"\nSummary saved to: {summary_path}")


## Tips and Troubleshooting

### Common Issues:

1. **Data not found:**
   - Check that the data is at `MyDrive/mhealth-data/data/processed/sleep-edf/` in Google Drive (the folder name is case-sensitive)
   - Ensure all six `.npy` files are present (`X_` and `y_` for train, val and test)

2. **GPU not available:**
   - Go to Runtime → Change runtime type → Hardware accelerator → GPU
   - Model will work on CPU but will be slower
   - Automatic hardware detection: CUDA > MPS > CPU

3. **Insufficient memory:**
   - Sleep-EDF dataset is large (58MB for training)
   - CNN+LSTM model has ~311K parameters
   - Consider smaller batch_size if needed

4. **Long training time:**
   - First epoch always takes longer (initial loading)
   - Logs show progress every epoch
   - Early stopping after 8 epochs without improvement
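
The early-stopping rule mentioned in item 4 can be sketched as follows. This is a generic patience counter, assuming validation loss is the monitored quantity; the class name and the `min_delta` parameter are hypothetical, and the repository's implementation may track a different metric.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 8, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy loss curve: improves for 3 epochs, then plateaus
losses = [0.9, 0.7, 0.65] + [0.66] * 10
stopper = EarlyStopping(patience=8)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(f"Stopped at epoch {stopped_at}")  # Stopped at epoch 11
```

With a patience of 8, the last improvement at epoch 3 leads to a stop at epoch 11, so a stalled run costs at most eight extra epochs.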

### Model Architecture:
- **CNN**: 3 convolutional blocks for feature extraction
- **LSTM**: Bidirectional with 2 layers (64 hidden units)
- **Dense**: 3 fully connected layers for classification
- **Total**: ~311K parameters
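
To see how much of the parameter budget the recurrent part can account for, the arithmetic below counts the weights of a bidirectional 2-layer LSTM with 64 hidden units, following the parameter layout of PyTorch's `nn.LSTM` (four gates, two bias vectors per layer and direction). The input size of 128 features per time step is an assumption for illustration; the notebook does not state the CNN output width, so the result is not expected to reproduce the ~311K total.

```python
def lstm_param_count(input_size: int, hidden_size: int,
                     num_layers: int, bidirectional: bool) -> int:
    """Count trainable parameters of a stacked LSTM (PyTorch nn.LSTM layout)."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        per_direction = 4 * (layer_input * hidden_size
                             + hidden_size * hidden_size
                             + 2 * hidden_size)
        total += directions * per_direction
        # Deeper layers receive the concatenated outputs of all directions
        layer_input = hidden_size * directions
    return total


# Assumed input size of 128; hidden size and depth are taken from the list above
print(lstm_param_count(128, 64, num_layers=2, bidirectional=True))  # 198656
```

Under that assumption the LSTM alone holds roughly 199K parameters, which leaves the remainder of the budget for the convolutional blocks and dense layers.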

### Resources Used:
- **GPU**: Recommended for fast training
- **RAM**: ~4GB for dataset + model
- **Disk**: ~200MB for code + results

### Next Steps:
1. Compare with simple LSTM baseline (87% accuracy)
2. Implement privacy techniques (DP-SGD)
3. Test with other physiological datasets

---

**Notebook created for SIDM - MHealth Data Privacy project**
