# WESAD PyTorch Training - Google Colab

This notebook trains PyTorch models for stress detection using the WESAD (Wearable Stress and Affect Detection) dataset.

## Features:
- Baseline LSTM model for binary stress classification
- Differential Privacy (DP-SGD) training option
- Efficient data loading for physiological signals
- Clean logs during training (no emojis)
- Early stopping and learning rate scheduling
- Memory and speed optimizations
- Automatic hardware acceleration (CUDA > MPS > CPU)
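
The hardware priority in the last bullet can be sketched as follows. This is a minimal sketch, not the repository's code: `pick_device_name` and `detect_device_name` are hypothetical names, and falling back to `"cpu"` when PyTorch cannot be found is an assumption made so the snippet runs anywhere.

```python
import importlib.util


def pick_device_name(cuda_ok: bool, mps_ok: bool) -> str:
    """Apply the CUDA > MPS > CPU priority to two availability flags."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device_name() -> str:
    """Query PyTorch for available backends; fall back to CPU without it."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # assumption: no PyTorch means no accelerator
    import torch

    # torch.backends.mps is only present in PyTorch builds that know about MPS
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device_name(torch.cuda.is_available(), mps_ok)


print(f"Using device: {detect_device_name()}")
```

On a Colab GPU runtime this would be expected to select `cuda`; the MPS branch only applies on Apple silicon machines.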

## Requirements:
- Google Colab with GPU enabled
- Data in Google Drive: `MyDrive/mhealth-data/data/processed/wesad/`

## Training Options:
1. **Baseline**: Standard LSTM model (~80% accuracy)
2. **Differential Privacy**: DP-SGD with privacy guarantees
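
For intuition, the core of one DP-SGD step is to clip each example's gradient to a fixed L2 norm and add Gaussian noise to the sum before averaging. The sketch below shows that step in plain Python. It is illustrative only and is not how `train_dp.py` or Opacus implement it; the names `clip_to_norm`, `dp_sgd_average`, `max_grad_norm`, and `noise_multiplier` are chosen for illustration.

```python
import math
import random


def clip_to_norm(grad, max_grad_norm):
    """Scale a gradient vector down so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_average(per_sample_grads, max_grad_norm, noise_multiplier, rng=None):
    """Clip per-sample gradients, sum them, add Gaussian noise, then average."""
    rng = rng or random.Random()
    clipped = [clip_to_norm(g, max_grad_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_grad_norm  # noise std is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-sample gradients
print(dp_sgd_average(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=random.Random(0)))
```

A larger `noise_multiplier` gives a stronger privacy guarantee (smaller ε) at the cost of noisier updates, which is the trade-off the two training options are meant to expose.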


## 1. Initial Setup


In [None]:
# Install necessary dependencies
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install numpy pandas scikit-learn matplotlib seaborn
!pip install mne pyedflib opacus

# Clone the repository
!git clone https://github.com/vasco-fernandes21/mhealth-data-privacy.git
import sys
sys.path.append('/content/mhealth-data-privacy')

print("Dependencies installed and repository cloned")


In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Check if data exists
import os
data_path = '/content/drive/MyDrive/mhealth-data/data/processed/wesad'

if os.path.exists(data_path):
    print(f"Data found at: {data_path}")

    # List files
    files = os.listdir(data_path)
    print(f"Available files: {files}")

    # Check file sizes
    for file in ['X_train.npy', 'y_train.npy', 'X_val.npy', 'y_val.npy', 'X_test.npy', 'y_test.npy']:
        if file in files:
            size = os.path.getsize(os.path.join(data_path, file))
            print(f"  {file}: {size / (1024*1024):.1f} MB")
        else:
            print(f"  {file} not found")
else:
    print(f"Data not found at: {data_path}")
    print("Make sure the data is in the correct path on Google Drive")


## 2. Model Training

Choose your training option:


### Option A: Baseline LSTM Training


In [None]:
# Run baseline training using the repository script
import os, shutil

repo_data_dir = '/content/mhealth-data-privacy/data/processed/wesad'
drive_data_dir = '/content/drive/MyDrive/mhealth-data/data/processed/wesad'
models_dir = '/content/mhealth-data-privacy/models/wesad/baseline'
results_dir = '/content/mhealth-data-privacy/results/wesad/baseline'

# Ensure output directories
os.makedirs(models_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

# Create symbolic link from Drive data to expected path
os.makedirs('/content/mhealth-data-privacy/data/processed', exist_ok=True)
if os.path.islink(repo_data_dir) or os.path.exists(repo_data_dir):
    try:
        if os.path.islink(repo_data_dir):
            os.unlink(repo_data_dir)
        else:
            shutil.rmtree(repo_data_dir)
    except Exception as e:
        print(f"Warning when removing old destination: {e}")

# Create symlink
!ln -sf "$drive_data_dir" "$repo_data_dir"
print(f"Data referenced via symlink: {repo_data_dir} -> {drive_data_dir}")

print(f"Starting baseline LSTM training with data from: {drive_data_dir}")
print("=" * 80)

# Run the baseline training script
!python /content/mhealth-data-privacy/src/train/wesad/train_baseline.py

print("Baseline training completed!")
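
Both training options repeat the same link setup. As an optional alternative, it could be factored into a helper that uses `os.symlink` instead of the `ln` shell command. This is a sketch under assumptions: the function name `link_data_dir` is hypothetical, and the demonstration uses temporary directories rather than the real Drive paths.

```python
import os
import shutil
import tempfile


def link_data_dir(source_dir: str, link_path: str) -> None:
    """Point link_path at source_dir, replacing any stale link or directory."""
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    if os.path.islink(link_path):
        os.unlink(link_path)  # also handles links whose target is gone
    elif os.path.isdir(link_path):
        shutil.rmtree(link_path)
    elif os.path.exists(link_path):
        os.remove(link_path)
    os.symlink(source_dir, link_path, target_is_directory=True)


# Demonstration with throwaway directories
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "drive_data")
    os.makedirs(source)
    link = os.path.join(tmp, "repo", "data", "processed", "wesad")
    link_data_dir(source, link)
    link_data_dir(source, link)  # second call replaces the existing link
    print(os.path.islink(link), os.path.realpath(link) == os.path.realpath(source))
```

In the notebook, the call would take `drive_data_dir` and `repo_data_dir` as its two arguments.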


### Option B: Differential Privacy Training


In [None]:
# Run differential privacy training
import os, shutil

repo_data_dir = '/content/mhealth-data-privacy/data/processed/wesad'
drive_data_dir = '/content/drive/MyDrive/mhealth-data/data/processed/wesad'
models_dir = '/content/mhealth-data-privacy/models/wesad/dp'
results_dir = '/content/mhealth-data-privacy/results/wesad/dp'

# Ensure output directories
os.makedirs(models_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

# Create symbolic link from Drive data to expected path
os.makedirs('/content/mhealth-data-privacy/data/processed', exist_ok=True)
if os.path.islink(repo_data_dir) or os.path.exists(repo_data_dir):
    try:
        if os.path.islink(repo_data_dir):
            os.unlink(repo_data_dir)
        else:
            shutil.rmtree(repo_data_dir)
    except Exception as e:
        print(f"Warning when removing old destination: {e}")

# Create symlink
!ln -sf "$drive_data_dir" "$repo_data_dir"
print(f"Data referenced via symlink: {repo_data_dir} -> {drive_data_dir}")

print(f"Starting differential privacy training with data from: {drive_data_dir}")
print("=" * 80)
print("Using DPLSTM for Opacus compatibility")

# Run the differential privacy training script
!python /content/mhealth-data-privacy/src/train/wesad/differential_privacy/train_dp.py

print("Differential privacy training completed!")


In [None]:
# Load and analyze results
import json
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Check which results are available
baseline_results_path = '/content/mhealth-data-privacy/models/wesad/baseline/results_wesad_baseline.json'
dp_results_path = '/content/mhealth-data-privacy/models/wesad/dp/results_wesad_dp.json'

results_to_show = []

if os.path.exists(baseline_results_path):
    with open(baseline_results_path, 'r') as f:
        baseline_results = json.load(f)
    results_to_show.append(('Baseline LSTM', baseline_results))

if os.path.exists(dp_results_path):
    with open(dp_results_path, 'r') as f:
        dp_results = json.load(f)
    results_to_show.append(('Differential Privacy', dp_results))

if results_to_show:
    print("FINAL RESULTS:")
    print("=" * 60)
    
    for model_name, results in results_to_show:
        print(f"\n{model_name}:")
        print(f"   Accuracy:  {results['accuracy']:.4f}")
        print(f"   Precision: {results['precision']:.4f}")
        print(f"   Recall:    {results['recall']:.4f}")
        print(f"   F1-Score:  {results['f1_score']:.4f}")
        
        if 'epsilon' in results:
            print(f"   Privacy (ε): {results['epsilon']:.4f}")
        
        if 'training_time' in results:
            print(f"   Training Time: {results['training_time']:.1f}s")

    # Plot confusion matrices
    fig, axes = plt.subplots(1, len(results_to_show), figsize=(6*len(results_to_show), 5))
    if len(results_to_show) == 1:
        axes = [axes]
    
    for i, (model_name, results) in enumerate(results_to_show):
        cm = np.array(results['confusion_matrix'])
        class_names = results['class_names']
        
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                    xticklabels=class_names, yticklabels=class_names, ax=axes[i])
        axes[i].set_title(f'Confusion Matrix - {model_name}')
        axes[i].set_xlabel('Predicted')
        axes[i].set_ylabel('Actual')
    
    plt.tight_layout()
    plt.show()

else:
    print("No results found")
    print("Run the training cells first")
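
The reported precision, recall, and F1 can be cross-checked against the confusion matrix. The sketch below assumes a 2x2 matrix with rows as actual classes and columns as predicted classes (matching the axis labels in the plot above) and assumes index 1 is the stress class. The function name `binary_metrics` and the example counts are made up for illustration.

```python
def binary_metrics(cm):
    """Derive metrics from [[TN, FP], [FN, TP]] (rows: actual, columns: predicted)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tn + tp) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


example_cm = [[80, 20], [10, 40]]  # illustrative counts, not real results
for name, value in binary_metrics(example_cm).items():
    print(f"{name}: {value:.4f}")
```

If the training script averages metrics across classes (for example with weighted averaging), the stored values will differ from these positive-class figures.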


## 3. Results Analysis


In [None]:
# Run baseline training 5 times with different seeds
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import os
import shutil

# Configuration
num_runs = 5
seeds = [42, 123, 456, 789, 999]
results_baseline = []

# Setup directories
repo_data_dir = '/content/mhealth-data-privacy/data/processed/wesad'
drive_data_dir = '/content/drive/MyDrive/mhealth-data/data/processed/wesad'
base_models_dir = '/content/mhealth-data-privacy/models/wesad'
base_results_dir = '/content/mhealth-data-privacy/results/wesad'

# Create symlink once
os.makedirs('/content/mhealth-data-privacy/data/processed', exist_ok=True)
if os.path.islink(repo_data_dir) or os.path.exists(repo_data_dir):
    try:
        if os.path.islink(repo_data_dir):
            os.unlink(repo_data_dir)
        else:
            shutil.rmtree(repo_data_dir)
    except Exception as e:
        print(f"Warning: {e}")

!ln -sf "$drive_data_dir" "$repo_data_dir"
print(f"Data linked: {repo_data_dir} -> {drive_data_dir}\n")

print(f"Starting {num_runs} baseline runs...")
print("=" * 50)

for i, seed in enumerate(seeds[:num_runs]):
    print(f"\n{'='*60}")
    print(f"RUN {i+1}/{num_runs} - SEED {seed}")
    print("="*60)
    
    # Create run-specific directories
    run_dir = f'baseline_run{i+1}'
    models_dir = f'{base_models_dir}/{run_dir}'
    results_dir = f'{base_results_dir}/{run_dir}'
    os.makedirs(models_dir, exist_ok=True)
    os.makedirs(results_dir, exist_ok=True)
    
    # Pass the seed and output directories via environment variables
    # (assumes train_baseline.py reads TRAIN_SEED, MODEL_DIR, and RESULTS_DIR)
    os.environ['TRAIN_SEED'] = str(seed)
    os.environ['MODEL_DIR'] = models_dir
    os.environ['RESULTS_DIR'] = results_dir
    
    # Run training
    !python /content/mhealth-data-privacy/src/train/wesad/train_baseline.py
    
    # Load results
    results_path = f'{models_dir}/results_wesad_baseline.json'
    if os.path.exists(results_path):
        with open(results_path, 'r') as f:
            run_results = json.load(f)
        
        results_baseline.append({
            'run': i+1,
            'seed': seed,
            'accuracy': run_results['accuracy'],
            'f1_score': run_results['f1_score'],
            'precision': run_results['precision'],
            'recall': run_results['recall'],
            'confusion_matrix': run_results['confusion_matrix'],
            'timestamp': datetime.now().isoformat()
        })
        
        print(f"\nRun {i+1} Results: Accuracy={run_results['accuracy']:.4f}, F1={run_results['f1_score']:.4f}")
    else:
        print(f"Warning: Results not found for run {i+1}")

# Calculate statistics
if results_baseline:
    accuracies = [r['accuracy'] for r in results_baseline]
    f1_scores = [r['f1_score'] for r in results_baseline]
    precisions = [r['precision'] for r in results_baseline]
    recalls = [r['recall'] for r in results_baseline]
    
    stats = {
        'mean_accuracy': float(np.mean(accuracies)),
        'std_accuracy': float(np.std(accuracies)),
        'mean_f1': float(np.mean(f1_scores)),
        'std_f1': float(np.std(f1_scores)),
        'mean_precision': float(np.mean(precisions)),
        'std_precision': float(np.std(precisions)),
        'mean_recall': float(np.mean(recalls)),
        'std_recall': float(np.std(recalls)),
        'runs': results_baseline
    }
    
    print(f"\n{'='*50}")
    print(f"BASELINE STATISTICS ({len(results_baseline)} RUNS)")
    print("="*50)
    print(f"Accuracy:  {stats['mean_accuracy']:.4f} ± {stats['std_accuracy']:.4f}")
    print(f"Precision: {stats['mean_precision']:.4f} ± {stats['std_precision']:.4f}")
    print(f"Recall:    {stats['mean_recall']:.4f} ± {stats['std_recall']:.4f}")
    print(f"F1-Score:  {stats['mean_f1']:.4f} ± {stats['std_f1']:.4f}")
    
    # Save results
    os.makedirs(f'{base_results_dir}/baseline', exist_ok=True)
    with open(f'{base_results_dir}/baseline/baseline_5runs.json', 'w') as f:
        json.dump(stats, f, indent=2)
    
    print(f"\nResults saved to: {base_results_dir}/baseline/baseline_5runs.json")
else:
    print("No results collected!")
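
With only five runs, the choice of standard deviation estimator matters: `np.std` defaults to the population formula (`ddof=0`), while the sample formula (`ddof=1`) gives a slightly larger spread. A small standard-library sketch of the difference, using made-up accuracy values:

```python
import statistics

accuracies = [0.80, 0.82, 0.79, 0.81, 0.83]  # made-up values for illustration

mean = statistics.mean(accuracies)
population_std = statistics.pstdev(accuracies)  # same formula as np.std(x)
sample_std = statistics.stdev(accuracies)       # same formula as np.std(x, ddof=1)

print(f"mean={mean:.4f}  population std={population_std:.4f}  sample std={sample_std:.4f}")
```

Whichever estimator is used, stating it next to the reported `±` values makes the results easier to compare with other work.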


## Tips and Troubleshooting

### Common Issues:

1. **Data not found:**
   - Check that the path `MyDrive/mhealth-data/data/processed/wesad/` is correct
   - Ensure all `.npy` files are present

2. **GPU not available:**
   - Go to Runtime → Change runtime type → Hardware accelerator → GPU
   - The model also runs on CPU, but training is slower
   - Automatic hardware detection: CUDA > MPS > CPU

3. **Insufficient memory:**
   - The WESAD training data is small (~10MB), smaller than Sleep-EDF
   - LSTM model has ~200K parameters
   - Consider smaller batch_size if needed

4. **Long training time:**
   - First epoch always takes longer (initial loading)
   - Logs show progress every 3 batches for DP training
   - Early stopping after 10 epochs without improvement

5. **Opacus LSTM compatibility:**
   - Fixed: DP training uses `DPLSTM` instead of `nn.LSTM`
   - `DPLSTM` is a drop-in replacement that is compatible with Opacus
   - **Why DPLSTM?** Opacus computes per-sample gradients through hooks on the layer types it supports. `nn.LSTM` runs as a single fused module that those hooks cannot instrument, so Opacus provides `DPLSTM`, which is rebuilt from supported layers
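
The early stopping rule mentioned in item 4 (stop after 10 epochs without improvement) can be sketched as a small patience counter. This is an illustrative sketch, not the repository's implementation: the class name `EarlyStopping`, the `min_delta` parameter, and the choice of validation loss as the monitored value are assumptions.

```python
class EarlyStopping:
    """Signal a stop once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


stopper = EarlyStopping(patience=3)
losses = [0.9, 0.7, 0.71, 0.72, 0.73, 0.6]  # toy validation losses
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        print(f"Stopping at epoch {epoch}, best loss {stopper.best_loss}")
        break
```

The toy loop uses `patience=3` to keep the example short; the notebook's setting corresponds to `patience=10`.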

### Model Architecture:
- **LSTM**: Bidirectional with 2 layers (64 hidden units)
- **Dense**: 2 fully connected layers for classification
- **Total**: ~200K parameters
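
The parameter total can be sanity-checked by hand. PyTorch's `nn.LSTM` stores, per layer and direction, four gates that each have an input weight matrix, a hidden weight matrix, and two bias vectors. The sketch below applies that formula. The input width (`n_features = 14`) and the dense layer sizes are placeholders, since they are not stated here, so the printed total will differ from the real model's.

```python
def lstm_param_count(input_size, hidden_size, num_layers, bidirectional=True):
    """Count nn.LSTM parameters: 4 gates x (W_ih + W_hh + b_ih + b_hh) per direction."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        per_direction = 4 * (
            layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size
        )
        total += directions * per_direction
        layer_input = hidden_size * directions  # next layer sees concatenated outputs
    return total


def dense_param_count(in_features, out_features):
    """Count nn.Linear parameters: weight matrix plus bias vector."""
    return in_features * out_features + out_features


n_features = 14  # placeholder: the real input width is not given here
lstm = lstm_param_count(n_features, hidden_size=64, num_layers=2)
dense = dense_param_count(128, 64) + dense_param_count(64, 2)  # placeholder sizes
print(f"LSTM: {lstm:,}  Dense: {dense:,}  Total: {lstm + dense:,}")
```

Substituting the real input width and dense sizes from the training script gives a figure that can be compared directly with the total quoted above.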

### Resources Used:
- **GPU**: Recommended for fast training
- **RAM**: ~2GB for dataset + model
- **Disk**: ~100MB for code + results

### Next Steps:
1. Compare baseline vs differential privacy performance
2. Analyze privacy-utility trade-offs
3. Test with other physiological datasets

---

**Notebook created for SIDM - MHealth Data Privacy project**
