# hypopredict Package: New Structure Usage Guide

This notebook walks through the refactored `hypopredict` package end to end, from loading data to building a PyTorch-ready dataset.

**Contents:**
1. Setup and Imports
2. Data Loading Examples
3. Person Class Usage
4. Data Chunking
5. Feature Extraction
6. Configuration Management (params.py)
7. Cross-Validation Setup
8. Complete Training Pipeline Example
9. PyTorch Dataset Creation
10. Error Handling and Debugging Tips
11. GCS/Cloud Training Considerations

## 1. Setup and Imports

First, let's import all necessary modules and configure our environment.

In [None]:
# Standard library imports
import os
import warnings
from pathlib import Path

# Third-party imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv

# hypopredict package imports
import hypopredict.compressor as comp
import hypopredict.train_test_split as tts
import hypopredict.feature_extraction as fe
from hypopredict.person import Person
from hypopredict.cv import CV_splitter
from hypopredict import params

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=RuntimeWarning)

print("✓ All imports successful!")

In [None]:
# Load environment variables
load_dotenv()

# Check environment configuration
glucose_path = os.getenv('GLUCOSE_PATH')
ecg_path = os.getenv('ECG_PATH')

print("Environment Configuration:")
print(f"GLUCOSE_PATH: {glucose_path}")
print(f"ECG_PATH: {ecg_path}")
print()

if glucose_path and Path(glucose_path).exists():
    print(f"✓ Glucose path exists")
else:
    print("⚠ Warning: Glucose path not configured or doesn't exist")
    
if ecg_path and Path(ecg_path).exists():
    print(f"✓ ECG path exists")
else:
    print("⚠ Warning: ECG path not configured or doesn't exist")
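
The two path checks above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch; `check_env_path` and the `DEMO_DATA_PATH` variable are hypothetical names used for illustration and are not part of hypopredict.

```python
import os
import tempfile
from pathlib import Path


def check_env_path(var_name):
    """Return a Path for an environment variable, or None if unusable.

    Hypothetical helper for illustration; not part of hypopredict.
    """
    value = os.getenv(var_name)
    if not value:
        print(f"⚠ {var_name} is not set")
        return None
    path = Path(value)
    if not path.exists():
        print(f"⚠ {var_name} points to a missing path: {path}")
        return None
    print(f"✓ {var_name} exists: {path}")
    return path


# Demo: a temporary directory stands in for a real data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    os.environ["DEMO_DATA_PATH"] = tmp_dir
    found = check_env_path("DEMO_DATA_PATH")

# After removing the variable, the helper reports it as unset
os.environ.pop("DEMO_DATA_PATH")
missing = check_env_path("DEMO_DATA_PATH")
```

Returning `None` instead of raising keeps the notebook runnable when data is absent, which matches the guarded `if ecg_path:` style used in later cells.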

## 2. Data Loading Examples

The hypopredict package supports loading data from multiple sources:
- Google Drive (for quick sharing)
- Local filesystem (for production)
- GCS buckets (for cloud training)

### 2.1 Load Glucose Data from Google Drive

In [None]:
# Load glucose data for person 1 from Google Drive
# This is useful for quick prototyping and sharing

print("Loading glucose data from Google Drive...")
try:
    glucose_df = comp.gdrive_to_pandas(comp.GLUCOSE_ID_LINKS[0])
    print(f"✓ Loaded glucose data: {glucose_df.shape}")
    print(f"\nFirst few rows:")
    display(glucose_df.head())
except Exception as e:
    print(f"⚠ Could not load from Google Drive: {e}")
    print("This may require internet access or Google Drive permissions")

### 2.2 Identify Hypoglycemic (HG) Events

In [None]:
# Identify HG events: glucose < 3.9 mmol/L for at least 15 minutes

if 'glucose_df' in locals():
    print("Identifying HG events...")
    hg_events = comp.identify_hg_events(
        glucose_df,
        threshold=3.9,      # mmol/L
        min_duration=15,    # minutes
        cgm_only=True       # Use only CGM readings
    )
    
    print(f"✓ HG events identified: {hg_events.shape}")
    print(f"\nHG event statistics:")
    print(f"  Total time points: {len(hg_events)}")
    print(f"  HG time points: {hg_events['is_hg'].sum()}")
    print(f"  HG proportion: {hg_events['is_hg'].mean():.2%}")
    print(f"  Number of HG onsets: {hg_events['onset'].sum()}")
    
    display(hg_events[hg_events['onset'] == 1].head())
else:
    print("⚠ Glucose data not loaded. Skipping HG event identification.")
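
To make the rule "below 3.9 mmol/L for at least 15 minutes" concrete, here is a small pandas sketch on synthetic readings. It is not the implementation of `comp.identify_hg_events`; in particular, measuring duration from the first to the last reading of a run is an assumption, and the glucose values are made up.

```python
import pandas as pd

# Synthetic CGM trace sampled every 5 minutes (values in mmol/L)
index = pd.date_range("2024-01-01 00:00", periods=12, freq="5min")
glucose = pd.Series(
    [5.2, 4.1, 3.8, 3.6, 3.5, 3.7, 4.2, 5.0, 3.8, 4.4, 5.1, 5.3],
    index=index,
)

threshold = 3.9                          # mmol/L
min_duration = pd.Timedelta(minutes=15)

below = glucose < threshold
# A new run starts whenever the below/above state flips
run_id = (below != below.shift()).cumsum()

is_hg = pd.Series(False, index=glucose.index)
onset = pd.Series(False, index=glucose.index)
for _, run in below.groupby(run_id):
    # Assumption: duration spans the first to the last reading in the run
    duration = run.index[-1] - run.index[0]
    if run.iloc[0] and duration >= min_duration:
        is_hg.loc[run.index] = True
        onset.loc[run.index[0]] = True

events = pd.DataFrame({"glucose": glucose, "is_hg": is_hg, "onset": onset})
print(events[events["is_hg"]])
```

In this trace the four consecutive low readings from 00:10 to 00:25 form one event, while the isolated low reading at 00:40 is too short to count.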

## 3. Person Class Usage

The `Person` class wraps one patient's glucose and ECG data behind a single object.

### 3.1 Initialize and Load Patient Data

In [None]:
# Initialize a Person object
person_id = 1

if ecg_path:
    person = Person(ID=person_id, ecg_dir=ecg_path)
    print(f"✓ Person object created: ID={person.ID}")
    print(f"  ECG directory: {person.ecg_dir}")
else:
    print("⚠ ECG path not configured. Cannot create Person object.")
    person = None

### 3.2 Load Glucose Data and Identify HG Events

In [None]:
# Load glucose data using Person class
# This method handles data loading and HG event identification

if person and glucose_path:
    try:
        print("Loading HG data...")
        person.load_HG_data(
            glucose_src='local',    # or 'gdrive'
            min_duration=15,        # minutes
            threshold=3.9           # mmol/L
        )
        print(f"✓ HG data loaded successfully")
        print(f"  HG events shape: {person.hg_events.shape}")
        print(f"  HG proportion: {person.hg_events['is_hg'].mean():.2%}")
    except FileNotFoundError as e:
        print(f"⚠ Glucose data file not found: {e}")
    except Exception as e:
        print(f"⚠ Error loading glucose data: {e}")
elif person:
    print("⚠ Glucose path not configured. Using Google Drive instead.")
    try:
        person.load_HG_data(glucose_src='gdrive')
        print(f"✓ HG data loaded from Google Drive")
    except Exception as e:
        print(f"⚠ Error loading from Google Drive: {e}")

### 3.3 Load ECG Data for a Specific Day

In [None]:
# Load ECG data for day 4
day = 4

if person and ecg_path:
    try:
        print(f"Loading ECG data for day {day}...")
        person.load_ECG_day(day=day, warning=True)
        
        print(f"✓ ECG data loaded successfully")
        print(f"  Shape: {person.ecg[day].shape}")
        print(f"  Columns: {list(person.ecg[day].columns)}")
        print(f"  Time range: {person.ecg[day].index[0]} to {person.ecg[day].index[-1]}")
        print(f"  Duration: {person.ecg[day].index[-1] - person.ecg[day].index[0]}")
        
        display(person.ecg[day].head())
    except FileNotFoundError as e:
        print(f"⚠ ECG data file not found: {e}")
    except Exception as e:
        print(f"⚠ Error loading ECG data: {e}")
else:
    print("⚠ Person object not initialized or ECG path not configured")

## 4. Data Chunking

Split continuous time-series data into overlapping chunks for model training.

### 4.1 Define Chunk Parameters

In [None]:
# Define chunking parameters
chunk_size = pd.Timedelta(minutes=5)  # 5-minute chunks
step_size = pd.Timedelta(minutes=1)   # 1-minute step (4 minutes overlap)

print(f"Chunk configuration:")
print(f"  Chunk size: {chunk_size}")
print(f"  Step size: {step_size}")
print(f"  Overlap: {chunk_size - step_size} ({(1 - step_size/chunk_size)*100:.0f}%)")
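
The chunk and step sizes above describe a sliding window. The sketch below shows one way such windows could be cut from a time-indexed DataFrame. `sliding_chunks` is a hypothetical function, and `tts.chunkify_df` may handle window edges and trailing partial windows differently.

```python
import numpy as np
import pandas as pd


def sliding_chunks(df, chunk_size, step_size):
    """Split a time-indexed DataFrame into overlapping windows.

    Illustrative only; tts.chunkify_df may treat edges differently.
    """
    chunks = []
    start = df.index[0]
    last = df.index[-1]
    # Stop once a full window would run past the last timestamp
    while start + chunk_size <= last:
        end = start + chunk_size
        # Half-open window [start, end): the end timestamp belongs to a later window
        chunks.append(df[(df.index >= start) & (df.index < end)])
        start += step_size
    return chunks


# 10 minutes of synthetic data at 1 Hz
index = pd.date_range("2024-01-01", periods=600, freq=pd.Timedelta(seconds=1))
demo = pd.DataFrame({"signal": np.arange(600)}, index=index)

demo_chunks = sliding_chunks(
    demo,
    chunk_size=pd.Timedelta(minutes=5),
    step_size=pd.Timedelta(minutes=1),
)
print(len(demo_chunks), demo_chunks[0].shape)
```

With a 5-minute window and a 1-minute step, consecutive chunks share 4 minutes of data, so neighbouring chunks are strongly correlated. That matters later when splitting data for validation.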

### 4.2 Chunk a Single DataFrame

In [None]:
# Chunk the ECG data we loaded earlier

if person and day in person.ecg:
    print(f"Chunking ECG data for day {day}...")
    chunks = tts.chunkify_df(
        df=person.ecg[day],
        chunk_size=chunk_size,
        step_size=step_size
    )
    
    print(f"✓ Data chunked successfully")
    print(f"  Number of chunks: {len(chunks)}")
    print(f"  First chunk shape: {chunks[0].shape}")
    print(f"  First chunk time range: {chunks[0].index[0]} to {chunks[0].index[-1]}")
    print(f"  Last chunk shape: {chunks[-1].shape}")
else:
    print("⚠ ECG data not loaded. Skipping chunking.")
    chunks = []

### 4.3 Chunk a Single Person-Day

In [None]:
# Chunk using person_day notation (person_id * 10 + day)
person_day = 14  # Person 1, Day 4

if ecg_path:
    try:
        print(f"Chunking person-day {person_day}...")
        pd_result, chunks_pd = tts.chunkify_day(
            person_day=person_day,
            chunk_size=chunk_size,
            step_size=step_size,
            ecg_dir=ecg_path
        )
        
        print(f"✓ Person-day {pd_result} chunked successfully")
        print(f"  Number of chunks: {len(chunks_pd)}")
    except Exception as e:
        print(f"⚠ Error chunking person-day: {e}")
        chunks_pd = []
else:
    print("⚠ ECG path not configured. Skipping person-day chunking.")
    chunks_pd = []
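
The person-day notation packs two numbers into one integer. A short sketch of the encoding and its inverse follows; the helper names are hypothetical, and the single-digit restriction on `day` is implied by the `person_id * 10 + day` formula rather than stated by the package.

```python
def encode_person_day(person_id, day):
    """Combine a person ID and a single-digit day into one integer."""
    if not 0 <= day <= 9:
        raise ValueError("day must be a single digit for this encoding")
    return person_id * 10 + day


def decode_person_day(person_day):
    """Split a person-day integer back into (person_id, day)."""
    return divmod(person_day, 10)


person_day_demo = encode_person_day(1, 4)
person_id_demo, day_demo = decode_person_day(person_day_demo)
print(person_day_demo, person_id_demo, day_demo)  # prints: 14 1 4
```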

### 4.4 Chunk Multiple Person-Days

In [None]:
# Chunk multiple person-days at once
# Note: This requires actual data files to exist

if ecg_path:
    # Use a small subset for demonstration
    demo_days = params.DEMO_DAYS  # Days with higher HG proportion
    print(f"Chunking {len(demo_days)} demo days: {demo_days}")
    
    try:
        chunks_all = tts.chunkify(
            person_days=demo_days,
            chunk_size=chunk_size,
            step_size=step_size,
            ecg_dir=ecg_path
        )
        
        print(f"✓ All days chunked successfully")
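        # Use distinct loop names so pandas (pd) and the earlier chunks list are not overwritten
        for day_key, day_chunks in chunks_all.items():
            print(f"  Day {day_key}: {len(day_chunks)} chunks")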
    except Exception as e:
        print(f"⚠ Error chunking multiple days: {e}")
        print("This is expected if data files don't exist locally.")
        chunks_all = {}
else:
    print("⚠ ECG path not configured. Skipping multiple-day chunking.")
    chunks_all = {}

## 5. Feature Extraction

Extract features from chunks for machine learning models.

### 5.1 Extract Statistical Features

In [None]:
# Extract statistical features (mean, std, min, max, quantiles, skew, kurtosis)

if chunks:
    print("Extracting statistical features...")
    stat_features = fe.extract_features(chunks[:10])  # Use first 10 chunks for demo
    
    print(f"✓ Statistical features extracted")
    print(f"  Feature matrix shape: {stat_features.shape}")
    print(f"  Number of features: {stat_features.shape[1]}")
    print(f"\nFeature columns (first 10):")
    print(list(stat_features.columns[:10]))
    
    display(stat_features.head())
else:
    print("⚠ No chunks available. Skipping feature extraction.")
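
For intuition, the statistics named above can be computed per chunk with plain pandas. This is a sketch under assumptions: `summarize_chunk` and the `<column>_<stat>` naming are hypothetical, and the exact feature set and column names produced by `fe.extract_features` may differ.

```python
import numpy as np
import pandas as pd


def summarize_chunk(chunk):
    """Return one dict of summary statistics covering every numeric column."""
    stats = {}
    for col in chunk.select_dtypes(include="number").columns:
        values = chunk[col]
        stats[f"{col}_mean"] = values.mean()
        stats[f"{col}_std"] = values.std()
        stats[f"{col}_min"] = values.min()
        stats[f"{col}_max"] = values.max()
        stats[f"{col}_q25"] = values.quantile(0.25)
        stats[f"{col}_q75"] = values.quantile(0.75)
        stats[f"{col}_skew"] = values.skew()
        stats[f"{col}_kurtosis"] = values.kurt()
    return stats


# Three synthetic chunks standing in for real ECG windows
rng = np.random.default_rng(0)
demo_chunks = [
    pd.DataFrame({"EcgWaveform": rng.normal(size=100)}) for _ in range(3)
]

# One row per chunk, one column per (signal, statistic) pair
demo_features = pd.DataFrame([summarize_chunk(c) for c in demo_chunks])
print(demo_features.shape)
```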

### 5.2 Extract ECG Features

In [None]:
# Extract ECG-specific features (R-peaks, heart rate, RR intervals)

if chunks:
    print("Extracting ECG features...")
    ecg_features = fe.extract_ecg_features(
        chunks[:5],  # Use first 5 chunks for demo (ECG processing is slower)
        ecg_column='EcgWaveform',
        sampling_rate=250,
        verbose=True
    )
    
    print(f"\n✓ ECG features extracted")
    print(f"  Feature matrix shape: {ecg_features.shape}")
    print(f"  Feature columns: {list(ecg_features.columns)}")
    
    display(ecg_features)
else:
    print("⚠ No chunks available. Skipping ECG feature extraction.")
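
Once R-peaks are located, RR intervals and heart rate follow from simple arithmetic on the sample indices. The sketch below uses made-up peak positions and does not reproduce the peak detection inside `fe.extract_ecg_features`.

```python
import numpy as np

sampling_rate = 250  # Hz, the value passed to the extraction call above

# Hypothetical R-peak positions (sample indices) within one chunk
r_peaks = np.array([100, 300, 505, 700, 910])

# RR interval: time between consecutive R-peaks
rr_seconds = np.diff(r_peaks) / sampling_rate
rr_ms = rr_seconds * 1000.0

# Instantaneous heart rate in beats per minute
heart_rate_bpm = 60.0 / rr_seconds

print("RR intervals (ms):", rr_ms)
print("Mean heart rate (bpm):", heart_rate_bpm.mean())
```

A chunk with fewer than two detected peaks yields no RR intervals at all, which is one reason some chunks produce NaN features.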

### 5.3 Extract HRV Features

In [None]:
# Extract Heart Rate Variability (HRV) features
# Note: HRV extraction may produce warnings for chunks with insufficient R-peaks

if chunks:
    print("Extracting HRV features...")
    try:
        hrv_features_list = fe.extract_hrv_features(
            chunks[:3],  # Use first 3 chunks for demo (HRV processing is slow)
            ecg_column='EcgWaveform',
            sampling_rate=250
        )
        
        # Combine HRV features into a single DataFrame
        hrv_features = pd.concat(hrv_features_list, ignore_index=True)
        
        print(f"✓ HRV features extracted")
        print(f"  Feature matrix shape: {hrv_features.shape}")
        print(f"  Number of HRV features: {hrv_features.shape[1]}")
        
        display(hrv_features.head())
    except Exception as e:
        print(f"⚠ Error extracting HRV features: {e}")
else:
    print("⚠ No chunks available. Skipping HRV feature extraction.")
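
Two widely used time-domain HRV metrics, SDNN and RMSSD, can be computed directly from RR intervals. The sketch below is illustrative: `hrv_time_domain` is a hypothetical helper, the minimum of three intervals is an arbitrary choice, and `fe.extract_hrv_features` may return a larger and different set of metrics.

```python
import numpy as np


def hrv_time_domain(rr_ms):
    """Compute SDNN and RMSSD from RR intervals given in milliseconds."""
    rr_ms = np.asarray(rr_ms, dtype=float)
    if rr_ms.size < 3:
        # Assumption: with too few beats, return NaN rather than a misleading number
        return {"sdnn": np.nan, "rmssd": np.nan}
    successive_diffs = np.diff(rr_ms)
    return {
        # SDNN: sample standard deviation of all RR intervals
        "sdnn": rr_ms.std(ddof=1),
        # RMSSD: root mean square of successive RR differences
        "rmssd": np.sqrt(np.mean(successive_diffs ** 2)),
    }


metrics = hrv_time_domain([800, 820, 780, 840])
short = hrv_time_domain([800, 820])
print(metrics)
print(short)
```

Returning NaN for short inputs mirrors the situation described in the note above, where chunks with insufficient R-peaks cannot support a meaningful HRV estimate.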

### 5.4 Extract Combined Features (Recommended)

In [None]:
# Extract all features at once (more efficient)

if chunks:
    print("Extracting combined features...")
    combined_features = fe.extract_combined_features_sequential(
        chunks[:5],  # Use first 5 chunks for demo
        ecg_column='EcgWaveform',
        sampling_rate=250,
        verbose=True
    )
    
    print(f"\n✓ Combined features extracted")
    print(f"  Feature matrix shape: {combined_features.shape}")
    print(f"  Number of features: {combined_features.shape[1]}")
    print(f"  NaN values: {combined_features.isna().sum().sum()}")
    
    # Handle NaN values
    combined_features_clean = combined_features.fillna(0)
    print(f"  After filling NaN with 0: {combined_features_clean.isna().sum().sum()} NaN values")
    
    display(combined_features_clean.head())
else:
    print("⚠ No chunks available. Skipping combined feature extraction.")

## 6. Configuration Management (params.py)

The `params` module provides centralized dataset configuration.

In [None]:
# Explore dataset parameters

print("Dataset Configuration:")
print(f"\nTotal days available: {len(params.ALL_DAYS)}")
print(f"Training days: {len(params.TRAIN_DAYS)}")
print(f"Test days: {len(params.TEST_DAYS)}")
print(f"Demo days: {len(params.DEMO_DAYS)}")
print(f"Invalid days: {len(params.INVALID_DAYS)}")
print(f"\nDays with HG events: {len(params.HG_DAYS)}")
print(f"Days without HG events: {len(params.ZERO_DAYS)}")

print(f"\nTraining days: {params.TRAIN_DAYS}")
print(f"Test days: {params.TEST_DAYS}")
print(f"Demo days: {params.DEMO_DAYS}")
print(f"Invalid days (skip these): {params.INVALID_DAYS}")

## 7. Cross-Validation Setup

Use the `CV_splitter` class to divide the training days into cross-validation folds and to check that each fold contains HG events.

In [None]:
# Initialize cross-validation splitter

if ecg_path and glucose_path:
    cv_splitter = CV_splitter(
        ecg_dir=ecg_path,
        glucose_src='local',
        n_splits=5,
        random_state=17
    )
    
    print("✓ CV splitter initialized")
    print(f"  Number of splits: {cv_splitter.n_splits}")
    print(f"  Random state: {cv_splitter.random_state}")
else:
    print("⚠ Paths not configured. Cannot initialize CV splitter.")
    cv_splitter = None

In [None]:
# Generate cross-validation splits

if cv_splitter:
    print("Generating CV splits...")
    splits = cv_splitter.get_splits(params.TRAIN_DAYS)
    
    print(f"✓ Splits generated")
    print(f"  Number of splits: {len(splits)}")
    for i, split in enumerate(splits):
        print(f"  Fold {i+1}: {len(split)} days - {split}")
else:
    print("⚠ CV splitter not initialized. Skipping split generation.")

In [None]:
# Validate splits (check for HG events)
# Note: This requires loading glucose data for each day

if cv_splitter and 'splits' in locals():
    print("Validating splits...")
    try:
        valid, hg_proportions = cv_splitter.validate(splits, verbose=True, warning=False)
        
        print(f"\n✓ Validation complete")
        print(f"\nSplit HG proportions:")
        for i, prop in enumerate(hg_proportions):
            print(f"  Fold {i+1}: {prop:.2%} HG")
    except Exception as e:
        print(f"⚠ Error validating splits: {e}")
        print("This is expected if glucose data files don't exist locally.")
else:
    print("⚠ Splits not generated. Skipping validation.")
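
Because chunks from the same day overlap, splitting at the day level keeps near-duplicate windows out of both the training and validation sides of a fold. The sketch below shows a seeded day-level split using only the standard library. It is not the `CV_splitter` algorithm, and the day list is hypothetical.

```python
import random


def split_days_into_folds(days, n_splits, random_state):
    """Shuffle day identifiers and deal them into n_splits folds."""
    shuffled = list(days)
    # A dedicated Random instance makes the shuffle reproducible
    random.Random(random_state).shuffle(shuffled)
    # Round-robin assignment keeps fold sizes within one day of each other
    return [shuffled[i::n_splits] for i in range(n_splits)]


# Hypothetical person-day identifiers
demo_days = [11, 12, 13, 14, 21, 22, 23, 31, 32, 33, 41]
folds = split_days_into_folds(demo_days, n_splits=5, random_state=17)

for i, fold in enumerate(folds):
    print(f"Fold {i + 1}: {fold}")
```

A purely random split like this can leave a fold without any HG events, which is the kind of problem a validation step such as the one above is meant to catch.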

## 8. Complete Training Pipeline Example

Putting it all together: chunk data, extract features, and prepare for training.

In [None]:
# Complete pipeline function

def prepare_training_data(person_days, ecg_dir, chunk_size, step_size, verbose=True):
    """
    Complete pipeline: chunk data and extract features.
    
    Args:
        person_days: List of person-day identifiers
        ecg_dir: Path to ECG data directory
        chunk_size: Size of each chunk
        step_size: Step size for chunking
        verbose: Print progress
    
    Returns:
        DataFrame with features for all chunks
    """
    all_features = []
    
    for person_day in person_days:
        try:
            if verbose:
                print(f"Processing day {person_day}...")
            
            # Chunk the day
            pd_result, chunks = tts.chunkify_day(
                person_day=person_day,
                chunk_size=chunk_size,
                step_size=step_size,
                ecg_dir=ecg_dir
            )
            
            if verbose:
                print(f"  Chunked into {len(chunks)} chunks")
            
            # Extract features
            features = fe.extract_combined_features_sequential(
                chunks,
                ecg_column='EcgWaveform',
                sampling_rate=250,
                verbose=False
            )
            
            # Add metadata
            features['person_day'] = person_day
            features['person_id'] = person_day // 10
            features['day'] = person_day % 10
            
            all_features.append(features)
            
            if verbose:
                print(f"  Extracted {features.shape[1]-3} features")
            
        except Exception as e:
            if verbose:
                print(f"  ⚠ Error processing day {person_day}: {e}")
    
    # Combine all features
    if all_features:
        combined = pd.concat(all_features, ignore_index=True)
        # Handle NaN values
        combined = combined.fillna(0)
        return combined
    else:
        return pd.DataFrame()

print("✓ Pipeline function defined")

In [None]:
# Run the pipeline on demo days

if ecg_path:
    print("Running complete pipeline on demo days...\n")
    try:
        training_features = prepare_training_data(
            person_days=params.DEMO_DAYS,
            ecg_dir=ecg_path,
            chunk_size=pd.Timedelta(minutes=5),
            step_size=pd.Timedelta(minutes=1),
            verbose=True
        )
        
        print(f"\n✓ Pipeline complete")
        print(f"  Final feature matrix shape: {training_features.shape}")
        print(f"  Total chunks: {len(training_features)}")
        print(f"  Feature columns: {training_features.shape[1]}")
        
        display(training_features.head())
    except Exception as e:
        print(f"⚠ Pipeline error: {e}")
        print("This is expected if data files don't exist locally.")
else:
    print("⚠ ECG path not configured. Skipping pipeline.")

## 9. PyTorch Dataset Creation

Create PyTorch datasets for deep learning models.

In [None]:
# Define PyTorch Dataset class

import torch
from torch.utils.data import Dataset, DataLoader

class HypoglycemiaDataset(Dataset):
    """
    PyTorch Dataset for hypoglycemia prediction.
    
    Args:
        features: DataFrame or array of features
        labels: Array of labels
        transform: Optional transform to apply
    """
    
    def __init__(self, features, labels, transform=None):
        # Convert to numpy if DataFrame
        if isinstance(features, pd.DataFrame):
            self.features = features.values
        else:
            self.features = features
            
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        # Get features and labels
        features = self.features[idx]
        label = self.labels[idx]
        
        # Convert to tensors
        features_tensor = torch.tensor(features, dtype=torch.float32)
        label_tensor = torch.tensor(label, dtype=torch.float32)
        
        # Apply transform if provided
        if self.transform:
            features_tensor = self.transform(features_tensor)
        
        return features_tensor, label_tensor

print("✓ PyTorch Dataset class defined")

In [None]:
# Create a sample dataset

if 'training_features' in locals() and not training_features.empty:
    # Create dummy labels for demonstration
    # In real use, these would come from glucose data
    dummy_labels = np.random.randint(0, 2, size=len(training_features))
    
    # Select only numeric features (exclude metadata)
    feature_columns = [col for col in training_features.columns 
                      if col not in ['person_day', 'person_id', 'day']]
    features_numeric = training_features[feature_columns]
    
    # Create dataset
    dataset = HypoglycemiaDataset(features_numeric, dummy_labels)
    
    print(f"✓ Dataset created")
    print(f"  Number of samples: {len(dataset)}")
    print(f"  Feature dimension: {features_numeric.shape[1]}")
    
    # Create DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    print(f"  Batch size: 32")
    print(f"  Number of batches: {len(dataloader)}")
    
    # Test the dataloader
    for batch_features, batch_labels in dataloader:
        print(f"\nFirst batch:")
        print(f"  Features shape: {batch_features.shape}")
        print(f"  Labels shape: {batch_labels.shape}")
        break
else:
    print("⚠ Training features not available. Skipping dataset creation.")
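
The dummy labels above would normally be replaced by labels derived from glucose data. One possible rule, sketched here on synthetic values, marks a chunk as positive when any CGM reading inside its time span is flagged as HG. The rule itself, the chunk spans, and the flag series are all assumptions for illustration; a prediction task might instead label chunks by HG events in a future horizon.

```python
import pandas as pd

# Synthetic HG flags on a 5-minute CGM grid (stand-in for an is_hg column)
cgm_index = pd.date_range("2024-01-01 00:00", periods=6, freq="5min")
is_hg = pd.Series([0, 0, 1, 1, 0, 0], index=cgm_index)

# Hypothetical (start, end) time spans of three chunks
chunk_spans = [
    (pd.Timestamp("2024-01-01 00:00"), pd.Timestamp("2024-01-01 00:05")),
    (pd.Timestamp("2024-01-01 00:08"), pd.Timestamp("2024-01-01 00:13")),
    (pd.Timestamp("2024-01-01 00:20"), pd.Timestamp("2024-01-01 00:25")),
]

# Label a chunk 1 if any CGM reading within [start, end] is flagged as HG
labels = [int(is_hg.loc[start:end].any()) for start, end in chunk_spans]
print(labels)  # prints: [0, 1, 0]
```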

## 10. Error Handling and Debugging Tips

In [None]:
# Common debugging checks

print("=" * 60)
print("DEBUGGING CHECKLIST")
print("=" * 60)

# Check environment variables
print("\n1. Environment Variables:")
print(f"   GLUCOSE_PATH: {os.getenv('GLUCOSE_PATH')}")
print(f"   ECG_PATH: {os.getenv('ECG_PATH')}")

# Check paths exist
print("\n2. Path Validation:")
if os.getenv('GLUCOSE_PATH'):
    print(f"   Glucose path exists: {Path(os.getenv('GLUCOSE_PATH')).exists()}")
if os.getenv('ECG_PATH'):
    print(f"   ECG path exists: {Path(os.getenv('ECG_PATH')).exists()}")

# Check module imports
import sys
print("\n3. Module Imports:")
for name in ("compressor", "train_test_split", "feature_extraction", "person", "cv", "params"):
    module_name = f"hypopredict.{name}"
    # A module appears in sys.modules only after a successful import
    status = "✓" if module_name in sys.modules else "✗"
    print(f"   {status} {module_name}")

# Check data availability
print("\n4. Data Status:")
print(f"   Person object: {'✓ Created' if 'person' in locals() and person else '✗ Not created'}")
print(f"   Chunks available: {'✓ Yes' if 'chunks' in locals() and chunks else '✗ No'}")
print(f"   Features extracted: {'✓ Yes' if 'training_features' in locals() and not training_features.empty else '✗ No'}")

print("\n" + "=" * 60)

### Common Issues and Solutions

In [None]:
# Display troubleshooting guide

troubleshooting = """
COMMON ISSUES AND SOLUTIONS:

1. ImportError: No module named 'hypopredict'
   Solution: Install package in editable mode: pip install -e .

2. Environment variables not loading
   Solution: Ensure .env file exists and run: load_dotenv()

3. FileNotFoundError when loading data
   Solution: 
   - Check .env paths are correct and absolute
   - Verify data files exist in specified directories
   - Try loading from Google Drive instead: glucose_src='gdrive'

4. NeuroKit2 RuntimeWarnings during feature extraction
   Solution: This is normal for chunks with insufficient R-peaks
   - Suppress warnings: warnings.filterwarnings('ignore', category=RuntimeWarning)
   - Handle NaN values in features: features.fillna(0)

5. Memory issues with large datasets
   Solution:
   - Process days one at a time
   - Save features to disk incrementally
   - Use smaller chunk sizes or fewer days

6. Jupyter kernel crashes
   Solution:
   - Restart kernel: Kernel → Restart Kernel
   - Install ipykernel: pip install ipykernel
   - Select correct kernel: Python (hypopredict)

7. Package changes not reflected
   Solution:
   - Reinstall package: pip uninstall hypopredict && pip install -e .
   - Or use auto-reload in Jupyter:
     %load_ext autoreload
     %autoreload 2
"""

print(troubleshooting)
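
Item 4 above suppresses `RuntimeWarning` globally. A narrower alternative is to silence warnings only around the noisy call with `warnings.catch_warnings()`, so unrelated warnings elsewhere in the notebook stay visible. The sketch below uses a hypothetical `noisy_mean` function as a stand-in for feature extraction.

```python
import warnings

import numpy as np


def noisy_mean(values):
    """Stand-in for feature code: np.mean warns on an empty input and returns NaN."""
    return np.mean(values)


# Suppress RuntimeWarning only inside this block; filters are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    result = noisy_mean(np.array([]))

print(result)  # NaN, produced without a visible warning
```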

## 11. GCS/Cloud Training Considerations

Tips for training on cloud platforms. The examples below target Google Cloud Storage (GCS); the same pattern should carry over to other providers.

In [None]:
# Cloud training configuration example

cloud_training_guide = """
CLOUD TRAINING GUIDE:

1. Upload Data to GCS Bucket:
   gsutil cp -r /local/glucose/data gs://your-bucket/hypopredict-data/glucose/
   gsutil cp -r /local/ecg/data gs://your-bucket/hypopredict-data/ecg/

2. Set Environment Variables in Training Script:
   import os
   os.environ['GLUCOSE_PATH'] = '/gcs/your-bucket/hypopredict-data/glucose/'
   os.environ['ECG_PATH'] = '/gcs/your-bucket/hypopredict-data/ecg/'

3. Use the Same Code:
   # Everything works the same way!
   from hypopredict.person import Person
   person = Person(ID=1, ecg_dir=os.getenv('ECG_PATH'))
   person.load_HG_data(glucose_src='local')

4. Alternative: Use Google Drive Links (for smaller datasets):
   import hypopredict.compressor as comp
   glucose_df = comp.gdrive_to_pandas(comp.GLUCOSE_ID_LINKS[0])

5. Save Models to GCS:
   import joblib
   joblib.dump(model, '/gcs/your-bucket/models/model.pkl')

6. Track Experiments with MLflow:
   import mlflow
   mlflow.set_tracking_uri('http://your-mlflow-server:5000')  # tracking server URI; gs:// suits artifact storage, not tracking
   mlflow.log_params({...})
   mlflow.log_metrics({...})
"""

print(cloud_training_guide)

## Summary

This notebook demonstrated:

1. ✓ Package setup and imports
2. ✓ Data loading from multiple sources (Google Drive, local, GCS)
3. ✓ Person class usage for patient data management
4. ✓ Data chunking for time-series processing
5. ✓ Feature extraction (statistical, ECG, HRV)
6. ✓ Configuration management with `params`
7. ✓ Cross-validation setup
8. ✓ Complete training pipeline
9. ✓ PyTorch dataset creation
10. ✓ Error handling and debugging
11. ✓ Cloud training considerations

**Next Steps:**
- Review `MIGRATION.md` for detailed migration guide
- Check `QUICK_REFERENCE.md` for quick code snippets
- Read `TESTING_LOCALLY.md` for local setup instructions
- Start building your models with the prepared features!