# Sleep-EDF PyTorch Training - Google Colab

This notebook trains an optimized PyTorch CNN+LSTM model for sleep stage classification using the Sleep-EDF dataset.

## Features:
- Optimized CNN+LSTM model (target: >=87% accuracy)
- Efficient data loading for large datasets
- Clean logs during training (no emojis)
- Early stopping and learning rate scheduling
- Memory and speed optimizations
- Automatic hardware acceleration (CUDA > MPS > CPU)
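
The hardware fallback order in the last bullet (CUDA > MPS > CPU) can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper name `select_device` is hypothetical, and only standard PyTorch availability checks are used.

```python
import torch


def select_device() -> torch.device:
    """Pick a backend in the order CUDA > MPS > CPU (hypothetical helper)."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # The MPS backend (Apple Silicon) is absent from older PyTorch builds,
    # so look the attribute up defensively before querying it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = select_device()
print(f"Using device: {device}")
```

On a Colab GPU runtime this should resolve to `cuda`; on a CPU-only runtime it falls back to `cpu`.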

## Requirements:
- Google Colab with GPU enabled
- Data in Google Drive: `MyDrive/mhealth-data/data/processed/sleep-edf/`


## 1. Initial Setup


In [None]:
# Install necessary dependencies
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install numpy pandas scikit-learn matplotlib seaborn
!pip install pyedflib mne

# Clone the repository
!git clone https://github.com/vasco-fernandes21/mhealth-data-privacy.git
import sys
sys.path.append('/content/mhealth-data-privacy')

print("Dependencies installed and repository cloned")


Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting pyedflib
  Downloading pyedflib-0.1.42-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting mne
  Downloading mne-1.10.2-py3-none-any.whl.metadata (21 kB)
Downloading pyedflib-0.1.42-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Downloading mne-1.10.2-py3-none-any.whl (7.4 MB)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Check if data exists
import os
data_path = '/content/drive/MyDrive/mhealth-data/data/processed/sleep-edf'

if os.path.exists(data_path):
    print(f"Data found at: {data_path}")

    # List files
    files = os.listdir(data_path)
    print(f"Available files: {files}")

    # Check file sizes
    for file in ['X_train.npy', 'y_train.npy', 'X_val.npy', 'y_val.npy', 'X_test.npy', 'y_test.npy']:
        if file in files:
            size = os.path.getsize(os.path.join(data_path, file))
            print(f"  {file}: {size / (1024*1024):.1f} MB")
        else:
            print(f"  {file} not found")
else:
    print(f"Data not found at: {data_path}")
    print("Make sure the data is in the correct path on Google Drive")
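
Beyond file sizes, it can help to confirm array shapes before training without pulling the full arrays into RAM. The sketch below assumes the `X_<split>.npy` / `y_<split>.npy` naming used in the cell above; the helper name `describe_split` is hypothetical, and the actual array shapes are not stated in this notebook.

```python
import os

import numpy as np


def describe_split(data_dir: str, split: str) -> dict:
    """Report shape and dtype of X_<split>.npy and y_<split>.npy (hypothetical helper)."""
    info = {}
    for prefix in ("X", "y"):
        path = os.path.join(data_dir, f"{prefix}_{split}.npy")
        # mmap_mode="r" maps the file read-only, so the array contents
        # are not loaded into memory just to inspect the header.
        arr = np.load(path, mmap_mode="r")
        info[prefix] = {"shape": arr.shape, "dtype": str(arr.dtype)}
    return info
```

In this notebook it could be called as `describe_split(data_path, "train")` once Drive is mounted.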


## 2. Training


In [None]:
# Run training using the repository script
import os, shutil

repo_data_dir = '/content/mhealth-data-privacy/data/processed/sleep-edf'
drive_data_dir = '/content/drive/MyDrive/mhealth-data/data/processed/sleep-edf'
models_dir = '/content/mhealth-data-privacy/models/sleep-edf/baseline'
results_dir = '/content/mhealth-data-privacy/results/sleep-edf/baseline'

# Ensure output directories
os.makedirs(models_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

# Create symbolic link from Drive data to expected path
os.makedirs('/content/mhealth-data-privacy/data/processed', exist_ok=True)
if os.path.islink(repo_data_dir) or os.path.exists(repo_data_dir):
    try:
        if os.path.islink(repo_data_dir):
            os.unlink(repo_data_dir)
        else:
            shutil.rmtree(repo_data_dir)
    except Exception as e:
        print(f"Warning when removing old destination: {e}")

# Create symlink
!ln -sf "$drive_data_dir" "$repo_data_dir"
print(f"Data referenced via symlink: {repo_data_dir} -> {drive_data_dir}")

print(f"Starting CNN+LSTM training with data from: {drive_data_dir}")
print("=" * 80)

# Run the baseline training script
!python /content/mhealth-data-privacy/src/train/sleep-edf/train_baseline.py

print("Training completed!")

## 3. Results Analysis


In [None]:
# Load and analyze results
import json
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load results
results_path = '/content/mhealth-data-privacy/models/sleep-edf/baseline/results_sleep_edf_optimized.json'

if os.path.exists(results_path):
    with open(results_path, 'r') as f:
        results = json.load(f)

    print("FINAL RESULTS:")
    print("=" * 50)
    print(f"Accuracy:  {results['accuracy']:.4f}")
    print(f"Precision: {results['precision']:.4f}")
    print(f"Recall:    {results['recall']:.4f}")
    print(f"F1-Score:  {results['f1_score']:.4f}")

    # Confusion matrix
    print("\nCONFUSION MATRIX:")
    cm = np.array(results['confusion_matrix'])
    class_names = results['class_names']

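    # Header row: predicted classes across the top, actual classes down the side
    print(f"{'Actual':8s}", end="")
    for name in class_names:
        print(f"{name:8s}", end="")
    print()

    # One row per actual class, one column per predicted class
    for i, row in enumerate(cm):
        print(f"{class_names[i]:8s}", end="")
        for val in row:
            print(f"{val:8d}", end="")
        print()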

    # Plot confusion matrix
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.title('Confusion Matrix - Sleep-EDF PyTorch')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

else:
    print(f"Results not found at: {results_path}")
    print("Run the training cell first")
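
The aggregate scores can hide weak classes, so a per-class breakdown is often useful. The sketch below derives precision, recall, and F1 per class directly from a confusion matrix, assuming rows are actual classes and columns are predicted classes (the orientation used in the heatmap above). The function name `per_class_metrics` is hypothetical.

```python
import numpy as np


def per_class_metrics(cm) -> dict:
    """Per-class precision, recall and F1 from a confusion matrix.

    Assumes rows = actual classes, columns = predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly classified samples per class
    predicted = cm.sum(axis=0)       # column sums: samples predicted as each class
    actual = cm.sum(axis=1)          # row sums: samples that truly belong to each class
    # Classes with no predictions or no samples get 0 instead of a division error.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

With the variables from the cell above, `per_class_metrics(cm)["recall"]` gives one recall value per entry in `class_names`.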


## Tips and Troubleshooting

### Common Issues:

1. **Data not found:**
   - Check that the path `MyDrive/mhealth-data/data/processed/sleep-edf/` is correct
   - Ensure all six `.npy` files (`X_*` and `y_*` for train, val, and test) are present

2. **GPU not available:**
   - Go to Runtime → Change runtime type → Hardware accelerator → GPU
   - The model also runs on CPU, but training is slower
   - Hardware is detected automatically in the order CUDA > MPS > CPU

3. **Insufficient memory:**
   - The Sleep-EDF training data takes about 58 MB on disk
   - The CNN+LSTM model has ~311K parameters
   - Reduce `batch_size` if memory runs out

4. **Long training time:**
   - The first epoch takes longer because of initial data loading
   - Logs show progress every epoch
   - Early stopping triggers after 8 epochs without improvement
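
The early-stopping rule in item 4 can be sketched as below. This is an illustration only: the class name `EarlyStopping`, the `min_delta` parameter, and the choice of validation accuracy as the monitored metric are assumptions, and the accuracy values are made up. Only the patience of 8 epochs comes from this notebook.

```python
class EarlyStopping:
    """Signal a stop once the metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 8, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's metric (higher is better); return True to stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=8)
history = [0.70, 0.78, 0.81] + [0.80] * 10  # made-up accuracies for illustration
for epoch, val_acc in enumerate(history, start=1):
    if stopper.step(val_acc):
        print(f"Stopping at epoch {epoch}; best validation accuracy {stopper.best:.2f}")
        break
```

In a real training loop, the best model weights would typically be saved whenever `step` records an improvement.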

### Model Architecture:
- **CNN**: 3 convolutional blocks for feature extraction
- **LSTM**: Bidirectional with 2 layers (64 hidden units)
- **Dense**: 3 fully connected layers for classification
- **Total**: ~311K parameters
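
A rough sketch of this layout is shown below. Only the overall structure comes from the list above (3 convolutional blocks, a 2-layer bidirectional LSTM with 64 hidden units, 3 fully connected layers). The class name `CNNLSTMSketch`, the channel widths, kernel sizes, input shape, and the 5-class output are all assumptions, so the parameter count will not necessarily match the ~311K of the repository model.

```python
import torch
import torch.nn as nn


class CNNLSTMSketch(nn.Module):
    """Illustrative CNN + BiLSTM + dense classifier; layer sizes are assumed."""

    def __init__(self, in_channels: int = 1, n_classes: int = 5, hidden: int = 64):
        super().__init__()

        def block(c_in: int, c_out: int) -> nn.Sequential:
            # One convolutional block; each block halves the time axis.
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
                nn.BatchNorm1d(c_out),
                nn.ReLU(),
                nn.MaxPool1d(2),
            )

        self.cnn = nn.Sequential(block(in_channels, 32), block(32, 64), block(64, 128))
        self.lstm = nn.LSTM(128, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        feats = self.cnn(x)               # (batch, 128, time // 8)
        feats = feats.transpose(1, 2)     # (batch, time // 8, 128)
        out, _ = self.lstm(feats)         # (batch, time // 8, 2 * hidden)
        # Simplification: classify from the output at the last time step.
        return self.head(out[:, -1, :])   # logits: (batch, n_classes)


model = CNNLSTMSketch()
logits = model(torch.randn(2, 1, 240))    # dummy batch of 2 single-channel signals
n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)
```

Counting parameters with `sum(p.numel() for p in model.parameters())` is a quick way to compare any variant against the ~311K figure quoted above.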

### Resources Used:
- **GPU**: Recommended for fast training
- **RAM**: ~4 GB for dataset + model
- **Disk**: ~200 MB for code + results

### Next Steps:
1. Compare with the simple LSTM baseline (87% accuracy)
2. Implement privacy techniques (DP-SGD)
3. Test with other physiological datasets

---

**Notebook created for SIDM - MHealth Data Privacy project**
