# Phase 7.1: Deep Learning with PyTorch

This notebook builds neural networks for diabetes prediction from scratch, explaining each concept along the way.

## Learning Objectives
1. Understand PyTorch tensors and data loading
2. Build a neural network architecture
3. Implement a training loop with backpropagation
4. Apply regularization (dropout, batch normalization)
5. Evaluate and compare with our LightGBM baseline
6. **Build a regression model** for HbA1c prediction
7. **Use learning rate scheduling** to improve training
8. **Apply Optuna** for hyperparameter optimization

## Part 1: Setup and Data Loading

### 1.1 Imports

PyTorch is organized into several key modules:
- `torch`: Core tensor operations (like NumPy)
- `torch.nn`: Neural network layers and loss functions
- `torch.optim`: Optimizers (SGD, Adam, etc.)
- `torch.utils.data`: Data loading utilities

In [None]:
# Standard imports
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, f1_score, roc_auc_score

# PyTorch imports
import torch
import torch.nn as nn  # Neural network modules
import torch.optim as optim  # Optimizers
from torch.utils.data import DataLoader, TensorDataset  # Data utilities

# Project setup
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Check device (CPU vs GPU)
# MPS = Metal Performance Shaders (Apple Silicon GPU)
if torch.backends.mps.is_available():
    device = torch.device('mps')
    print("Using Apple Silicon GPU (MPS)")
elif torch.cuda.is_available():
    device = torch.device('cuda')
    print("Using NVIDIA GPU (CUDA)")
else:
    device = torch.device('cpu')
    print("Using CPU")

print(f"PyTorch version: {torch.__version__}")

### 1.2 Understanding Tensors

**Tensors** are PyTorch's fundamental data structure - like NumPy arrays but with:
- GPU acceleration support
- Automatic differentiation (for backpropagation)

Let's see a quick example:

In [None]:
# Creating tensors
# From Python list
x = torch.tensor([1, 2, 3])
print(f"Tensor from list: {x}")
print(f"Shape: {x.shape}, Dtype: {x.dtype}")

# From NumPy (common in ML)
np_array = np.array([[1.0, 2.0], [3.0, 4.0]])
tensor_from_np = torch.from_numpy(np_array)
print(f"\nTensor from NumPy:\n{tensor_from_np}")

# Key difference: tensors track gradients for backpropagation
x_with_grad = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(f"\nTensor with gradient tracking: {x_with_grad}")
print(f"requires_grad: {x_with_grad.requires_grad}")

### 1.3 Load Our Data

We'll use the **full imputation** dataset (no NaN) since neural networks can't handle missing values natively.

In [None]:
# Load data
DATA_DIR = project_root / 'data' / 'processed'

# Using 'with_labs_full' - complete data with lab values
X = pd.read_parquet(DATA_DIR / 'X_with_labs_full.parquet')
y = pd.read_parquet(DATA_DIR / 'y_with_labs_minimal.parquet')['DIABETES_STATUS']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts().sort_index())

### 1.4 Data Splitting and Scaling

**Why scale for neural networks?**

Neural networks use gradient descent to learn. If features have very different scales:
- Feature A: 0-1 (e.g., gender)
- Feature B: 0-10000 (e.g., calories)

The gradients for Feature B will dominate, making training unstable. StandardScaler transforms all features to mean=0, std=1.
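
As a rough illustration of what standardization does, here is a minimal plain-Python sketch (no scikit-learn needed). The calorie values are hypothetical toy numbers, and `standardize` is an illustrative helper, not part of this project. The statistics are learned from the training values only and then reused for the test values, which is the same fit/transform split used in the cell below.

```python
import statistics

# Hypothetical toy values for one large-scale feature (e.g. calories).
train_calories = [1800.0, 2200.0, 2500.0, 3000.0]
test_calories = [2000.0, 4000.0]

# "Fit": learn mean and standard deviation from the training split only.
mean = statistics.fmean(train_calories)
std = statistics.pstdev(train_calories)  # population std (ddof=0), as StandardScaler uses


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train_calories, mean, std)
test_scaled = standardize(test_calories, mean, std)  # reuses the training statistics

print(f"train mean={statistics.fmean(train_scaled):.3f}, "
      f"std={statistics.pstdev(train_scaled):.3f}")
print("test scaled:", [round(v, 3) for v in test_scaled])
```

The scaled test values need not have mean 0 or std 1. That is expected, because the test split never contributes to the statistics.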

In [None]:
from sklearn.model_selection import train_test_split

# Split: 70% train, 15% val, 15% test (same as Phase 7)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp  # 0.176 * 0.85 ≈ 0.15
)

print(f"Train: {len(X_train)} samples")
print(f"Val:   {len(X_val)} samples")
print(f"Test:  {len(X_test)} samples")

# Scale features
# IMPORTANT: Fit scaler on training data only, then transform all sets
# This prevents "data leakage" - test data shouldn't influence training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_val_scaled = scaler.transform(X_val)          # transform only
X_test_scaled = scaler.transform(X_test)        # transform only

print(f"\nAfter scaling:")
print(f"Train mean: {X_train_scaled.mean():.6f}, std: {X_train_scaled.std():.6f}")

### 1.5 Convert to PyTorch Tensors and Create DataLoaders

**DataLoader** is PyTorch's way of:
1. Batching data (processing chunks at a time)
2. Shuffling (randomizing order each epoch)
3. Efficient memory management

**Why batching?**
- Using all 8000+ samples at once needs more memory per step, which becomes a real constraint as datasets grow
- Smaller batches = noisier gradients = can help escape local minima
- Typical batch sizes: 32, 64, 128, 256
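
The batch arithmetic can be sketched in a few lines of plain Python. The sample count below is a hypothetical number chosen so that the final batch contains a single sample. `batches_per_epoch` is an illustrative helper that mirrors how a loader counts batches with and without dropping the incomplete final batch.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last):
    """Number of batches produced in one pass over the data."""
    if drop_last:
        return n_samples // batch_size        # incomplete final batch is discarded
    return math.ceil(n_samples / batch_size)  # incomplete final batch is kept


n_samples = 8001  # hypothetical training-set size
batch_size = 64

kept = batches_per_epoch(n_samples, batch_size, drop_last=False)
dropped = batches_per_epoch(n_samples, batch_size, drop_last=True)
leftover = n_samples - dropped * batch_size  # samples in the incomplete batch

print(f"keep last batch: {kept} batches")
print(f"drop last batch: {dropped} batches, {leftover} sample(s) skipped this epoch")
```

Because the training data is reshuffled every epoch, the skipped samples change from epoch to epoch, so no sample is excluded permanently.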

In [None]:
# Convert to PyTorch tensors
# Note: PyTorch expects float32 for features, long (int64) for class labels
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)

X_val_tensor = torch.tensor(X_val_scaled, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.long)

X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

print(f"Train tensor shape: {X_train_tensor.shape}")
print(f"Train tensor dtype: {X_train_tensor.dtype}")

# Create TensorDatasets (pairs features with labels)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create DataLoaders
BATCH_SIZE = 64

# drop_last=True for training: drops incomplete final batch
# This prevents BatchNorm errors when last batch has only 1 sample
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"\nNumber of batches per epoch: {len(train_loader)}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Total samples: {len(train_loader) * BATCH_SIZE} (approx)")

In [None]:
# Let's see what one batch looks like
sample_batch_X, sample_batch_y = next(iter(train_loader))
print(f"Batch features shape: {sample_batch_X.shape}")  # [batch_size, n_features]
print(f"Batch labels shape: {sample_batch_y.shape}")    # [batch_size]
print(f"Batch labels: {sample_batch_y[:10]}")           # First 10 labels

## Part 2: Building the Neural Network

### 2.1 Neural Network Architecture Basics

A neural network is a series of **layers** that transform input data:

```
Input (96 features)
    ↓
Linear Layer (96 → 128) + ReLU activation
    ↓
Linear Layer (128 → 64) + ReLU activation
    ↓
Linear Layer (64 → 3)   # 3 output classes
    ↓
Output (3 class logits; softmax turns them into probabilities)
```

**Key components:**
- **Linear layer**: y = Wx + b (matrix multiplication + bias)
- **Activation function**: Adds non-linearity (ReLU, tanh, sigmoid)
- **Why non-linearity?** Without it, stacking linear layers = one big linear layer
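
The last point can be checked numerically. The sketch below uses small hypothetical weight matrices and plain Python lists (the helper names are illustrative) to show that two stacked linear layers equal a single linear layer with `W = W2·W1` and `b = W2·b1 + b2`, and that inserting a ReLU between them breaks this equivalence.

```python
def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def relu(v):
    return [max(0.0, t) for t in v]


# Hypothetical weights: layer 1 maps 2 -> 2, layer 2 maps 2 -> 1.
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, 1.0]
W2, b2 = [[2.0, 1.0]], [-1.0]

# The single linear layer that replaces both.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

x = [1.0, 2.0]
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)
collapsed = add(matvec(W, x), b)
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print("stacked linear layers:", stacked)
print("one collapsed layer:  ", collapsed)
print("with ReLU in between: ", with_relu)
```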

In [None]:
# Visualize activation functions
x = torch.linspace(-5, 5, 100)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# ReLU: max(0, x) - most common, simple and effective
axes[0].plot(x.numpy(), torch.relu(x).numpy())
axes[0].set_title('ReLU: max(0, x)')
axes[0].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[0].axvline(x=0, color='k', linestyle='-', linewidth=0.5)
axes[0].grid(True, alpha=0.3)

# Sigmoid: 1/(1+e^-x) - squashes to (0,1), used for binary classification
axes[1].plot(x.numpy(), torch.sigmoid(x).numpy())
axes[1].set_title('Sigmoid: 1/(1+e^-x)')
axes[1].axhline(y=0.5, color='r', linestyle='--', linewidth=0.5)
axes[1].grid(True, alpha=0.3)

# Tanh: (e^x - e^-x)/(e^x + e^-x) - squashes to (-1,1)
axes[2].plot(x.numpy(), torch.tanh(x).numpy())
axes[2].set_title('Tanh: (e^x - e^-x)/(e^x + e^-x)')
axes[2].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 2.2 Define Our Network Architecture

In PyTorch, we define networks as classes that inherit from `nn.Module`:
- `__init__`: Define the layers
- `forward`: Define how data flows through layers

In [None]:
class DiabetesClassifier(nn.Module):
    """
    A simple feedforward neural network for diabetes classification.
    
    Architecture:
        Input (n_features)
        → Linear(128) → BatchNorm → ReLU → Dropout
        → Linear(64) → BatchNorm → ReLU → Dropout  
        → Linear(32) → BatchNorm → ReLU → Dropout
        → Linear(3) → Output (class logits)
    """
    
    def __init__(self, n_features, n_classes=3, dropout_rate=0.3):
        super().__init__()  # Initialize parent class
        
        # Layer 1: Input → 128 neurons
        self.layer1 = nn.Sequential(
            nn.Linear(n_features, 128),  # Linear transformation
            nn.BatchNorm1d(128),          # Normalize activations (stabilizes training)
            nn.ReLU(),                    # Activation function
            nn.Dropout(dropout_rate)      # Randomly zero out neurons (prevents overfitting)
        )
        
        # Layer 2: 128 → 64 neurons
        self.layer2 = nn.Sequential(
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )
        
        # Layer 3: 64 → 32 neurons
        self.layer3 = nn.Sequential(
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )
        
        # Output layer: 32 → 3 classes (no activation - CrossEntropyLoss handles softmax)
        self.output = nn.Linear(32, n_classes)
    
    def forward(self, x):
        """
        Forward pass: define how data flows through the network.
        
        Args:
            x: Input tensor of shape [batch_size, n_features]
        
        Returns:
            Tensor of shape [batch_size, n_classes] (logits, not probabilities)
        """
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.output(x)
        return x

# Create the model
n_features = X_train_scaled.shape[1]
model = DiabetesClassifier(n_features=n_features, n_classes=3, dropout_rate=0.3)

# Move model to GPU if available
model = model.to(device)

print(model)
print(f"\nModel is on: {next(model.parameters()).device}")

In [None]:
# Count parameters (weights and biases)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Let's break this down:
print("\nParameter breakdown:")
for name, param in model.named_parameters():
    print(f"  {name}: {param.shape} = {param.numel():,} params")

### 2.3 Understanding the Components

**BatchNorm (Batch Normalization):**
- Normalizes layer inputs to have mean≈0, std≈1
- Speeds up training and allows higher learning rates
- Acts as mild regularization
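
A minimal sketch of the training-mode computation for a single feature, in plain Python. The activation values are hypothetical and `batch_norm_1d` is an illustrative helper. It assumes the usual defaults of scale `gamma=1`, shift `beta=0` and `eps=1e-5`, and it leaves out the running statistics that are used in evaluation mode.

```python
import math


def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch using that batch's own statistics."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]


activations = [10.0, 12.0, 14.0, 20.0]  # hypothetical values of one neuron over a batch
normalized = batch_norm_1d(activations)

norm_mean = sum(normalized) / len(normalized)
norm_std = math.sqrt(sum(v ** 2 for v in normalized) / len(normalized))
print([round(v, 3) for v in normalized])
print(f"mean={norm_mean:.4f}, std={norm_std:.4f}")

# With a single sample the variance is 0, so the output carries no information.
print(batch_norm_1d([7.0]))
```

The single-sample case shows why a training batch of size 1 is a problem for batch statistics.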

**Dropout:**
- During training: randomly sets neurons to 0 with probability p
- Forces network to not rely on any single neuron
- Prevents overfitting (like ensemble of smaller networks)
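- During evaluation: all neurons active and passed through unchanged (PyTorch scales the surviving activations by 1/(1-p) during training instead, so no rescaling is needed at evaluation time)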
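
PyTorch's `nn.Dropout` uses the "inverted" formulation: surviving activations are scaled by 1/(1-p) during training, and evaluation mode is a pass-through. The following plain-Python sketch mimics that behavior. `inverted_dropout` is an illustrative helper, not a PyTorch function.

```python
import random


def inverted_dropout(values, p, training, rng):
    """Zero each value with probability p; scale survivors by 1/(1-p) when training."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
x = [1.0] * 10

train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False, rng=rng)

print("training:", train_out)  # a mix of 0.0 and 2.0
print("eval:    ", eval_out)   # unchanged input
```

Scaling the survivors keeps the expected value of each activation the same in both modes.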

In [None]:
# Demonstrate dropout
dropout = nn.Dropout(p=0.5)  # 50% dropout

test_input = torch.ones(1, 10)  # 10 values, all 1.0

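# Training mode: each value is zeroed with probability 0.5;
# surviving values are scaled by 1/(1-p) = 2.0 to keep the expected value unchanged
dropout.train()
print("Training mode (50% dropout):")
print(dropout(test_input))

# Eval mode: dropout is a no-op, so all values pass through unchanged
dropout.eval()
print("\nEval mode (no dropout):")
print(dropout(test_input))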

## Part 3: Training the Network

### 3.1 Loss Function and Optimizer

**Loss Function (CrossEntropyLoss):**
- Measures how wrong our predictions are
- For classification: combines LogSoftmax + NLLLoss
- Lower loss = better predictions
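
To make the LogSoftmax + NLLLoss combination concrete, here is a plain-Python sketch for a single sample, without class weights. The logits are hypothetical and the helper names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class."""
    return -log_softmax(logits)[target]


logits = [2.0, 0.5, -1.0]  # hypothetical raw scores for 3 classes
probs = [math.exp(v) for v in log_softmax(logits)]

loss_if_class0 = cross_entropy(logits, target=0)  # model favours the true class
loss_if_class2 = cross_entropy(logits, target=2)  # model favours a wrong class

print("probabilities:", [round(p, 3) for p in probs])
print(f"loss when true class is 0: {loss_if_class0:.3f}")
print(f"loss when true class is 2: {loss_if_class2:.3f}")
```

The loss is small when the highest logit belongs to the true class and grows as the true class receives less probability.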

**Optimizer (Adam):**
- Updates weights to minimize loss
- Adam = Adaptive Moment Estimation
- Combines momentum + per-parameter learning rates
- Usually works well with default settings
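
The "momentum + per-parameter learning rate" idea can be sketched for a single scalar parameter. This is a simplified illustration of the standard Adam update rule, not PyTorch's implementation. The toy objective `f(w) = (w - 3)^2` and the helper `adam_step` are assumptions made for the example.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Minimize the toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
    history.append(w)

print(f"after 1 step:     w = {history[0]:.4f}")
print(f"after 2000 steps: w = {w:.4f}")
```

The first step moves the parameter by roughly `lr` regardless of how large the gradient is, because the update divides by the gradient's running magnitude.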

In [None]:
# Loss function
# CrossEntropyLoss expects:
#   - Input: [batch_size, n_classes] (raw logits)
#   - Target: [batch_size] (class indices 0, 1, 2)

# Handle class imbalance with weights
# More weight on minority classes = larger penalty for getting them wrong
class_counts = y_train.value_counts().sort_index()
class_weights = 1.0 / class_counts.values
class_weights = class_weights / class_weights.sum() * len(class_weights)  # Normalize
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32).to(device)

print("Class distribution:")
print(class_counts)
print(f"\nClass weights: {class_weights_tensor}")

criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)

# Optimizer
# Learning rate: how big steps to take when updating weights
# Too high: overshoot optimal values
# Too low: slow convergence
LEARNING_RATE = 0.001

optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

### 3.2 The Training Loop

Neural network training follows this pattern for each batch:

1. **Zero gradients**: Clear gradients left over from the previous batch (PyTorch accumulates them by default)
2. **Forward pass**: Run data through network → get predictions
3. **Calculate loss**: Compare predictions to true labels
4. **Backward pass**: Calculate gradients (how each weight affects loss)
5. **Update weights**: Adjust weights to reduce loss

One pass through all batches = one **epoch**.
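
The same pattern can be written out by hand for a tiny model, which shows what the backward pass and the weight update actually compute. This is a plain-Python sketch on hypothetical toy data generated by `y = 2x + 1`. It uses the full dataset as one batch, and the gradients are derived analytically instead of by autograd.

```python
# Hypothetical toy data generated by y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0  # model: prediction = w * x + b
lr = 0.05
losses = []

for epoch in range(500):
    # Gradients are recomputed from scratch here, so nothing needs zeroing.
    # Forward pass: predictions for every sample.
    preds = [w * x + b for x, _ in data]

    # Loss: mean squared error.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
    losses.append(loss)

    # Backward pass: derivative of the loss with respect to w and b.
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / len(data)
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, data)) / len(data)

    # Update: step against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.8f}")
print(f"learned w = {w:.4f}, b = {b:.4f}")
```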

In [None]:
def train_one_epoch(model, train_loader, criterion, optimizer, device):
    """
    Train the model for one epoch.
    
    Returns:
        Average loss for the epoch
    """
    model.train()  # Set to training mode (enables dropout, updates BatchNorm)
    total_loss = 0
    
    for batch_X, batch_y in train_loader:
        # Move data to device (GPU if available)
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)
        
        # Step 1: Zero gradients from previous iteration
        optimizer.zero_grad()
        
        # Step 2: Forward pass
        outputs = model(batch_X)  # [batch_size, 3]
        
        # Step 3: Calculate loss
        loss = criterion(outputs, batch_y)
        
        # Step 4: Backward pass (compute gradients)
        loss.backward()
        
        # Step 5: Update weights
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(train_loader)


def evaluate(model, data_loader, criterion, device):
    """
    Evaluate the model on a dataset.
    
    Returns:
        avg_loss, accuracy, predictions, true_labels, class_probabilities
    """
    model.eval()  # Set to evaluation mode (disables dropout, BatchNorm uses its running statistics)
    total_loss = 0
    all_preds = []
    all_labels = []
    all_probs = []
    
    with torch.no_grad():  # Disable gradient computation (faster, less memory)
        for batch_X, batch_y in data_loader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            total_loss += loss.item()
            
            # Get predictions (class with highest logit)
            probs = torch.softmax(outputs, dim=1)
            preds = outputs.argmax(dim=1)
            
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch_y.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())
    
    avg_loss = total_loss / len(data_loader)
    accuracy = np.mean(np.array(all_preds) == np.array(all_labels))
    
    return avg_loss, accuracy, np.array(all_preds), np.array(all_labels), np.array(all_probs)

### 3.3 Training with Early Stopping

**Early stopping** prevents overfitting by:
1. Monitoring validation loss after each epoch
2. Saving the model when validation loss improves
3. Stopping if no improvement for N epochs ("patience")
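
The patience logic on its own can be sketched without any model. The validation losses below are hypothetical and `early_stopping_run` is an illustrative helper. It returns the epoch of the best loss and the epoch at which training stopped.

```python
def early_stopping_run(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    waited = 0  # epochs since the last improvement

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0  # the model would be saved here
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stop: no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1  # ran through all epochs


val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.59]  # hypothetical values

print(early_stopping_run(val_losses, patience=3))
print(early_stopping_run(val_losses, patience=4))
```

With `patience=3` the run stops before it reaches the later, lower loss of 0.59. With `patience=4` it gets there. Patience trades training time against the risk of stopping during a temporary plateau.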

In [None]:
# Training configuration
N_EPOCHS = 100
PATIENCE = 10  # Stop if no improvement for 10 epochs

# Track metrics
train_losses = []
val_losses = []
val_accuracies = []

# Early stopping variables
best_val_loss = float('inf')
best_model_state = None
patience_counter = 0

print("Starting training...")
print(f"{'Epoch':>6} {'Train Loss':>12} {'Val Loss':>12} {'Val Acc':>10} {'Status':>10}")
print("-" * 55)

for epoch in range(N_EPOCHS):
    # Train
    train_loss = train_one_epoch(model, train_loader, criterion, optimizer, device)
    train_losses.append(train_loss)
    
    # Evaluate
    val_loss, val_acc, _, _, _ = evaluate(model, val_loader, criterion, device)
    val_losses.append(val_loss)
    val_accuracies.append(val_acc)
    
    # Early stopping check
    if val_loss < best_val_loss:
        best_val_loss = val_loss
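        # Clone the tensors: dict.copy() is shallow, so the saved "best" weights
        # would otherwise keep changing as training continues
        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}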
        patience_counter = 0
        status = "✓ Best"
    else:
        patience_counter += 1
        status = f"({patience_counter}/{PATIENCE})"
    
    # Print progress every 5 epochs
    if (epoch + 1) % 5 == 0 or patience_counter == 0:
        print(f"{epoch+1:>6} {train_loss:>12.4f} {val_loss:>12.4f} {val_acc:>10.4f} {status:>10}")
    
    # Stop if no improvement
    if patience_counter >= PATIENCE:
        print(f"\nEarly stopping at epoch {epoch+1}")
        break

# Load best model
model.load_state_dict(best_model_state)
print(f"\nLoaded best model (val_loss: {best_val_loss:.4f})")

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss curves
axes[0].plot(train_losses, label='Train Loss')
axes[0].plot(val_losses, label='Val Loss')
axes[0].axvline(x=np.argmin(val_losses), color='r', linestyle='--', label='Best Model')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(val_accuracies, label='Val Accuracy', color='green')
axes[1].axvline(x=np.argmin(val_losses), color='r', linestyle='--', label='Best Model')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 4: Evaluation

### 4.1 Test Set Evaluation

In [None]:
# Evaluate on test set
test_loss, test_acc, y_pred, y_true, y_probs = evaluate(model, test_loader, criterion, device)

# Calculate metrics
f1_macro = f1_score(y_true, y_pred, average='macro')
roc_auc = roc_auc_score(y_true, y_probs, multi_class='ovr')

print("=" * 50)
print("TEST SET RESULTS")
print("=" * 50)
print(f"Accuracy:  {test_acc:.4f}")
print(f"F1 Macro:  {f1_macro:.4f}")
print(f"ROC AUC:   {roc_auc:.4f}")
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['No Diabetes', 'Prediabetes', 'Diabetes']))

In [None]:
# Confusion Matrix
from src.models.evaluate import plot_confusion_matrix

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

plot_confusion_matrix(y_true, y_pred, ax=axes[0], title='PyTorch Model - Confusion Matrix')
plot_confusion_matrix(y_true, y_pred, ax=axes[1], normalize=True, title='Normalized')

plt.tight_layout()
plt.show()

### 4.2 Compare with LightGBM Baseline

In [None]:
# Load Phase 7 results for comparison
import json

results_path = project_root / 'models' / 'advanced' / 'results_summary.json'
with open(results_path) as f:
    phase7_results = json.load(f)

print("Model Comparison (Test Set):")
print("=" * 60)
print(f"{'Model':<30} {'F1 Macro':>10} {'ROC AUC':>10} {'Accuracy':>10}")
print("-" * 60)

# LightGBM results
lgb_results = phase7_results['classification']['test']['LightGBM (with labs)']
print(f"{'LightGBM (with labs)':<30} {lgb_results['f1_macro']:>10.4f} {lgb_results['roc_auc_ovr']:>10.4f} {lgb_results['accuracy']:>10.4f}")

# MLP (sklearn) results
mlp_results = phase7_results['classification']['test']['MLP (with labs)']
print(f"{'MLP sklearn (with labs)':<30} {mlp_results['f1_macro']:>10.4f} {mlp_results['roc_auc_ovr']:>10.4f} {mlp_results['accuracy']:>10.4f}")

# PyTorch results
print(f"{'PyTorch NN (with labs)':<30} {f1_macro:>10.4f} {roc_auc:>10.4f} {test_acc:>10.4f}")
print("-" * 60)

## Part 5: Save the Model

In [None]:
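# Save model
import joblib

save_dir = project_root / 'models' / 'advanced' / 'classification'
save_dir.mkdir(parents=True, exist_ok=True)  # Make sure the target directory exists

# Save PyTorch model state
# Note: this checkpoint also pickles the sklearn scaler, so on PyTorch versions where
# torch.load defaults to weights_only=True it must be loaded with weights_only=False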
torch.save({
    'model_state_dict': model.state_dict(),
    'n_features': n_features,
    'n_classes': 3,
    'dropout_rate': 0.3,
    'scaler': scaler,
    'metrics': {
        'test_accuracy': test_acc,
        'test_f1_macro': f1_macro,
        'test_roc_auc': roc_auc
    }
}, save_dir / 'pytorch_with_labs.pt')

print(f"Model saved to {save_dir / 'pytorch_with_labs.pt'}")

---

## Part 6: Regression - Predicting HbA1c

Now let's tackle **regression**: predicting the actual HbA1c value (continuous) instead of diabetes class (categorical).

### Key Differences from Classification:
| Aspect | Classification | Regression |
|--------|---------------|------------|
| Output | 3 class logits (softmax gives probabilities) | 1 continuous value |
| Loss function | CrossEntropyLoss | MSELoss (Mean Squared Error) |
| Output activation | None (softmax in loss) | None (direct prediction) |
| Metrics | F1, Accuracy, ROC AUC | RMSE, MAE, R² |
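
The three regression metrics in the table can be computed by hand. The sketch below uses plain Python and hypothetical HbA1c values. `regression_metrics` is an illustrative helper, not part of this project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),           # penalizes large errors more strongly
        "mae": sum(abs(e) for e in errors) / n,  # average absolute error
        "r2": 1.0 - ss_res / ss_tot,             # 1 = perfect, 0 = no better than the mean
    }


y_true = [5.0, 5.5, 6.0, 7.5]  # hypothetical HbA1c values (%)
y_pred = [5.2, 5.4, 6.3, 7.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE and MAE are in the same unit as the target (HbA1c percentage points), while R² is unitless.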

### 6.1 Load Regression Target (HbA1c)

In [None]:
# Load HbA1c target for regression
# HbA1c is stored in study_population, not in processed y files
INTERIM_DIR = project_root / 'data' / 'interim'
study_pop = pd.read_parquet(INTERIM_DIR / 'study_population.parquet')

# Get HbA1c values aligned with our feature indices
y_reg = study_pop.loc[y.index, 'LBXGH']

# Check for missing values
print(f"HbA1c (LBXGH) statistics:")
print(f"  Shape: {y_reg.shape}")
print(f"  Missing: {y_reg.isna().sum()} ({y_reg.isna().mean()*100:.1f}%)")
print(f"  Range: {y_reg.min():.1f} - {y_reg.max():.1f}%")
print(f"  Mean: {y_reg.mean():.2f}%, Median: {y_reg.median():.2f}%")

# Distribution plot
fig, ax = plt.subplots(figsize=(8, 4))
y_reg.dropna().hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
ax.axvline(x=5.7, color='orange', linestyle='--', label='Prediabetes threshold (5.7%)')
ax.axvline(x=6.5, color='red', linestyle='--', label='Diabetes threshold (6.5%)')
ax.set_xlabel('HbA1c (%)')
ax.set_ylabel('Count')
ax.set_title('HbA1c Distribution')
ax.legend()
plt.tight_layout()
plt.show()

### 6.2 Prepare Regression Data

We need to:
1. Remove samples with missing HbA1c values
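2. Split data (same proportions and random seed as classification; the rows in each split can differ, because samples with missing HbA1c are dropped and the split is not stratified)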
3. Create new DataLoaders with float32 targets (not long/int64)

In [None]:
# Create mask for valid HbA1c values
valid_mask = ~y_reg.isna()
print(f"Samples with valid HbA1c: {valid_mask.sum()} / {len(valid_mask)}")

# Get aligned data
X_reg = X.loc[valid_mask]
y_reg_valid = y_reg.loc[valid_mask]

# Split regression data (same proportions and random_state as classification; no stratify, since the target is continuous)
X_reg_temp, X_reg_test, y_reg_temp, y_reg_test = train_test_split(
    X_reg, y_reg_valid, test_size=0.15, random_state=42
)
X_reg_train, X_reg_val, y_reg_train, y_reg_val = train_test_split(
    X_reg_temp, y_reg_temp, test_size=0.176, random_state=42
)

print(f"\nRegression splits:")
print(f"  Train: {len(X_reg_train)}")
print(f"  Val:   {len(X_reg_val)}")
print(f"  Test:  {len(X_reg_test)}")

# Scale features (fit new scaler on regression training data)
scaler_reg = StandardScaler()
X_reg_train_scaled = scaler_reg.fit_transform(X_reg_train)
X_reg_val_scaled = scaler_reg.transform(X_reg_val)
X_reg_test_scaled = scaler_reg.transform(X_reg_test)

# Convert to tensors - NOTE: y is float32 for regression, not long
X_reg_train_t = torch.tensor(X_reg_train_scaled, dtype=torch.float32)
y_reg_train_t = torch.tensor(y_reg_train.values, dtype=torch.float32).unsqueeze(1)  # [N, 1]

X_reg_val_t = torch.tensor(X_reg_val_scaled, dtype=torch.float32)
y_reg_val_t = torch.tensor(y_reg_val.values, dtype=torch.float32).unsqueeze(1)

X_reg_test_t = torch.tensor(X_reg_test_scaled, dtype=torch.float32)
y_reg_test_t = torch.tensor(y_reg_test.values, dtype=torch.float32).unsqueeze(1)

print(f"\nTensor shapes:")
print(f"  X_train: {X_reg_train_t.shape}, y_train: {y_reg_train_t.shape}")

# Create DataLoaders
reg_train_dataset = TensorDataset(X_reg_train_t, y_reg_train_t)
reg_val_dataset = TensorDataset(X_reg_val_t, y_reg_val_t)
reg_test_dataset = TensorDataset(X_reg_test_t, y_reg_test_t)

reg_train_loader = DataLoader(reg_train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
reg_val_loader = DataLoader(reg_val_dataset, batch_size=BATCH_SIZE, shuffle=False)
reg_test_loader = DataLoader(reg_test_dataset, batch_size=BATCH_SIZE, shuffle=False)

### 6.3 Define Regression Model

The architecture is nearly identical to classification, but:
- Output layer has **1 neuron** (not 3)
- No activation on output (we want raw predicted value)
- Loss function is **MSELoss** (not CrossEntropyLoss)

In [None]:
class DiabetesRegressor(nn.Module):
    """
    Neural network for HbA1c prediction (regression).
    
    Same architecture as classifier, but with 1 output instead of 3.
    """
    
    def __init__(self, n_features, dropout_rate=0.3):
        super().__init__()
        
        self.layer1 = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )
        
        self.layer2 = nn.Sequential(
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )
        
        self.layer3 = nn.Sequential(
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )
        
        # Single output for regression (no activation)
        self.output = nn.Linear(32, 1)
    
    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.output(x)
        return x

# Create model
reg_model = DiabetesRegressor(n_features=n_features, dropout_rate=0.3).to(device)

# Loss function: Mean Squared Error
# Measures average squared difference between predictions and targets
reg_criterion = nn.MSELoss()

# Optimizer
reg_optimizer = optim.Adam(reg_model.parameters(), lr=0.001)

print(reg_model)
print(f"\nTotal parameters: {sum(p.numel() for p in reg_model.parameters()):,}")

### 6.4 Train Regression Model

In [None]:
def train_one_epoch_reg(model, train_loader, criterion, optimizer, device):
    """Train regression model for one epoch."""
    model.train()
    total_loss = 0
    
    for batch_X, batch_y in train_loader:
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)
        
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(train_loader)


def evaluate_reg(model, data_loader, criterion, device):
    """Evaluate regression model."""
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for batch_X, batch_y in data_loader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            total_loss += loss.item()
            
            all_preds.extend(outputs.cpu().numpy().flatten())
            all_labels.extend(batch_y.cpu().numpy().flatten())
    
    avg_loss = total_loss / len(data_loader)
    
    # Calculate RMSE
    preds = np.array(all_preds)
    labels = np.array(all_labels)
    rmse = np.sqrt(np.mean((preds - labels) ** 2))
    
    return avg_loss, rmse, preds, labels

In [None]:
# Training loop for regression
reg_train_losses = []
reg_val_losses = []
reg_val_rmses = []

best_reg_val_loss = float('inf')
best_reg_model_state = None
reg_patience_counter = 0

print("Training Regression Model...")
print(f"{'Epoch':>6} {'Train Loss':>12} {'Val Loss':>12} {'Val RMSE':>10} {'Status':>10}")
print("-" * 55)

for epoch in range(N_EPOCHS):
    train_loss = train_one_epoch_reg(reg_model, reg_train_loader, reg_criterion, reg_optimizer, device)
    reg_train_losses.append(train_loss)
    
    val_loss, val_rmse, _, _ = evaluate_reg(reg_model, reg_val_loader, reg_criterion, device)
    reg_val_losses.append(val_loss)
    reg_val_rmses.append(val_rmse)
    
    if val_loss < best_reg_val_loss:
        best_reg_val_loss = val_loss
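        # Clone the tensors: dict.copy() is shallow and would track the live weights
        best_reg_model_state = {k: v.detach().clone() for k, v in reg_model.state_dict().items()}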
        reg_patience_counter = 0
        status = "✓ Best"
    else:
        reg_patience_counter += 1
        status = f"({reg_patience_counter}/{PATIENCE})"
    
    if (epoch + 1) % 5 == 0 or reg_patience_counter == 0:
        print(f"{epoch+1:>6} {train_loss:>12.4f} {val_loss:>12.4f} {val_rmse:>10.4f} {status:>10}")
    
    if reg_patience_counter >= PATIENCE:
        print(f"\nEarly stopping at epoch {epoch+1}")
        break

reg_model.load_state_dict(best_reg_model_state)
print(f"\nLoaded best model (val_loss: {best_reg_val_loss:.4f})")

### 6.5 Evaluate Regression Model

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Evaluate on test set
test_loss, test_rmse, y_reg_pred, y_reg_true = evaluate_reg(reg_model, reg_test_loader, reg_criterion, device)

# Calculate additional metrics
test_mae = mean_absolute_error(y_reg_true, y_reg_pred)
test_r2 = r2_score(y_reg_true, y_reg_pred)

print("=" * 50)
print("REGRESSION TEST SET RESULTS")
print("=" * 50)
print(f"RMSE:  {test_rmse:.4f}")
print(f"MAE:   {test_mae:.4f}")
print(f"R²:    {test_r2:.4f}")

# Compare with LightGBM
lgb_reg_results = phase7_results['regression']['test']['LightGBM (with labs)']
mlp_reg_results = phase7_results['regression']['test']['MLP (with labs)']

print("\n" + "=" * 60)
print("Model Comparison - Regression (Test Set):")
print("=" * 60)
print(f"{'Model':<30} {'RMSE':>10} {'MAE':>10} {'R²':>10}")
print("-" * 60)
print(f"{'LightGBM (with labs)':<30} {lgb_reg_results['rmse']:>10.4f} {lgb_reg_results['mae']:>10.4f} {lgb_reg_results['r2']:>10.4f}")
print(f"{'MLP sklearn (with labs)':<30} {mlp_reg_results['rmse']:>10.4f} {mlp_reg_results['mae']:>10.4f} {mlp_reg_results['r2']:>10.4f}")
print(f"{'PyTorch NN (with labs)':<30} {test_rmse:>10.4f} {test_mae:>10.4f} {test_r2:>10.4f}")
print("-" * 60)

In [None]:
# Visualize regression predictions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Predicted vs Actual
axes[0].scatter(y_reg_true, y_reg_pred, alpha=0.3, s=10)
axes[0].plot([4, 14], [4, 14], 'r--', label='Perfect prediction')
axes[0].set_xlabel('Actual HbA1c (%)')
axes[0].set_ylabel('Predicted HbA1c (%)')
axes[0].set_title(f'Predicted vs Actual (R² = {test_r2:.3f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residual plot
residuals = y_reg_pred - y_reg_true
axes[1].scatter(y_reg_pred, residuals, alpha=0.3, s=10)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Predicted HbA1c (%)')
axes[1].set_ylabel('Residual (Predicted - Actual)')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nResidual statistics:")
print(f"  Mean: {residuals.mean():.4f} (should be ~0)")
print(f"  Std:  {residuals.std():.4f}")

---

## Part 7: Learning Rate Scheduling

**Learning rate scheduling** adjusts the learning rate during training to improve convergence.

### Why Schedule Learning Rate?
- **High LR early**: Take big steps to quickly find good regions
- **Low LR later**: Take small steps to fine-tune and not overshoot

### Common Schedulers:
| Scheduler | Description |
|-----------|-------------|
| **StepLR** | Multiply LR by gamma every N epochs |
| **ReduceLROnPlateau** | Reduce LR when metric stops improving |
| **CosineAnnealingLR** | Smoothly decrease LR following cosine curve |
| **OneCycleLR** | Ramp up then down (often best for fast training) |

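We'll use **OneCycleLR**, PyTorch's implementation of the "1cycle" policy. It has been reported to speed up convergence and often matches or improves final performance, although the gain depends on the dataset and model.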
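Two of the schedulers in the table have simple closed forms, which can make the plots in the next cell easier to read. The sketch below is plain Python rather than PyTorch's implementation; the function names `step_lr` and `cosine_lr` are hypothetical, and cosine annealing is assumed to decay to `eta_min = 0`.

```python
import math

def step_lr(lr0, epoch, step_size, gamma):
    """StepLR closed form: multiply lr0 by gamma once every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

def cosine_lr(lr0, epoch, t_max, eta_min=0.0):
    """Cosine annealing closed form: decay from lr0 towards eta_min over t_max epochs."""
    return eta_min + 0.5 * (lr0 - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# Same settings as the visualization cell: lr=0.01, 50 epochs
step_schedule = [step_lr(0.01, e, step_size=15, gamma=0.1) for e in range(50)]
cosine_schedule = [cosine_lr(0.01, e, t_max=50) for e in range(50)]

print("StepLR at epochs 0, 15, 30:", step_schedule[0], step_schedule[15], step_schedule[30])
print("Cosine at epochs 0, 25:", cosine_schedule[0], cosine_schedule[25])
```

StepLR drops in discrete jumps, while cosine annealing changes slowly at the start and end and fastest in the middle.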

In [None]:
# Visualize different learning rate schedules
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR

# Create dummy model and optimizer for visualization
dummy_model = nn.Linear(10, 1)
epochs = 50
steps_per_epoch = len(train_loader)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# StepLR: multiply LR by 0.1 every 15 epochs
dummy_opt = optim.Adam(dummy_model.parameters(), lr=0.01)
scheduler = StepLR(dummy_opt, step_size=15, gamma=0.1)
lrs = []
for _ in range(epochs):
    lrs.append(dummy_opt.param_groups[0]['lr'])
    dummy_opt.step()  # no-op without gradients; avoids PyTorch's step-order warning
    scheduler.step()
axes[0].plot(lrs)
axes[0].set_title('StepLR (step=15, gamma=0.1)')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Learning Rate')
axes[0].grid(True, alpha=0.3)

# CosineAnnealingLR: smooth cosine decay
dummy_opt = optim.Adam(dummy_model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(dummy_opt, T_max=epochs)
lrs = []
for _ in range(epochs):
    lrs.append(dummy_opt.param_groups[0]['lr'])
    dummy_opt.step()  # no-op without gradients; avoids PyTorch's step-order warning
    scheduler.step()
axes[1].plot(lrs)
axes[1].set_title('CosineAnnealingLR')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Learning Rate')
axes[1].grid(True, alpha=0.3)

# OneCycleLR: warmup then decay
dummy_opt = optim.Adam(dummy_model.parameters(), lr=0.01)
scheduler = OneCycleLR(dummy_opt, max_lr=0.01, epochs=epochs, steps_per_epoch=steps_per_epoch)
lrs = []
for _ in range(epochs * steps_per_epoch):
    lrs.append(dummy_opt.param_groups[0]['lr'])
    dummy_opt.step()  # no-op without gradients; avoids PyTorch's step-order warning
    scheduler.step()
# Plot per epoch (average)
lrs_epoch = [np.mean(lrs[i*steps_per_epoch:(i+1)*steps_per_epoch]) for i in range(epochs)]
axes[2].plot(lrs_epoch)
axes[2].set_title('OneCycleLR (warmup + decay)')
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Learning Rate')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 7.1 Train Classification Model with OneCycleLR

Let's retrain our classification model with learning rate scheduling and compare results.
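Before training, it may help to see roughly how a one-cycle schedule is derived from `max_lr`. The sketch below is a simplified plain-Python approximation, not PyTorch's code: `one_cycle_lr` is a hypothetical helper, and it assumes the two-phase cosine variant with PyTorch's documented defaults (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`).

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Approximate two-phase one-cycle schedule with cosine interpolation."""
    initial_lr = max_lr / div_factor          # LR at step 0
    min_lr = initial_lr / final_div_factor    # LR at the final step
    warmup_steps = max(1, round(pct_start * total_steps))

    def cos_interp(start, end, pct):
        # pct=0 returns start, pct=1 returns end, following half a cosine wave
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= warmup_steps:
        return cos_interp(initial_lr, max_lr, step / warmup_steps)
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    pct = min(1.0, (step - warmup_steps) / decay_steps)
    return cos_interp(max_lr, min_lr, pct)

total = 1000
schedule = [one_cycle_lr(s, total, max_lr=0.01) for s in range(total)]
peak_step = max(range(total), key=lambda s: schedule[s])
print(f"start={schedule[0]:.6f}, peak={max(schedule):.6f} at step {peak_step}, "
      f"end={schedule[-1]:.2e}")
```

Note that the starting rate comes from `max_lr / div_factor`, not from the `lr` passed to the optimizer, and that the peak falls at `pct_start` of the run rather than at the midpoint.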

In [None]:
def train_with_scheduler(model, train_loader, val_loader, criterion, optimizer, scheduler, 
                         device, n_epochs, patience, task='classification'):
    """
    Train model with learning rate scheduling.
    
    Key difference: scheduler.step() called after each BATCH for OneCycleLR.
    """
    train_losses = []
    val_losses = []
    val_metrics = []
    lrs = []
    
    best_val_loss = float('inf')
    best_model_state = None
    patience_counter = 0
    
    for epoch in range(n_epochs):
        # Training
        model.train()
        epoch_loss = 0
        
        for batch_X, batch_y in train_loader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            
            # Step scheduler after each batch for OneCycleLR
            scheduler.step()
            
            epoch_loss += loss.item()
            lrs.append(optimizer.param_groups[0]['lr'])
        
        train_losses.append(epoch_loss / len(train_loader))
        
        # Validation
        model.eval()
        val_loss = 0
        all_preds = []
        all_labels = []
        
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                batch_X = batch_X.to(device)
                batch_y = batch_y.to(device)
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                val_loss += loss.item()
                
                if task == 'classification':
                    preds = outputs.argmax(dim=1)
                else:
                    preds = outputs.flatten()
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(batch_y.cpu().numpy() if task == 'classification' 
                                  else batch_y.cpu().numpy().flatten())
        
        val_losses.append(val_loss / len(val_loader))
        
        # Calculate metric
        if task == 'classification':
            metric = f1_score(all_labels, all_preds, average='macro')
        else:
            metric = np.sqrt(np.mean((np.array(all_preds) - np.array(all_labels)) ** 2))
        val_metrics.append(metric)
        
        # Early stopping
        if val_losses[-1] < best_val_loss:
            best_val_loss = val_losses[-1]
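            # Clone the tensors: dict.copy() is shallow and would keep tracking the live weights
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}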
            patience_counter = 0
            status = "✓ Best"
        else:
            patience_counter += 1
            status = f"({patience_counter}/{patience})"
        
        if (epoch + 1) % 5 == 0 or patience_counter == 0:
            metric_name = 'F1' if task == 'classification' else 'RMSE'
            print(f"{epoch+1:>4} | Train: {train_losses[-1]:.4f} | Val: {val_losses[-1]:.4f} | "
                  f"{metric_name}: {metric:.4f} | LR: {lrs[-1]:.6f} | {status}")
        
        if patience_counter >= patience:
            print(f"\nEarly stopping at epoch {epoch+1}")
            break
    
    model.load_state_dict(best_model_state)
    return train_losses, val_losses, val_metrics, lrs

In [None]:
# Create fresh model for fair comparison
model_with_sched = DiabetesClassifier(n_features=n_features, n_classes=3, dropout_rate=0.3).to(device)

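# Optimizer (OneCycleLR overrides this lr: training starts at max_lr / div_factor)
optimizer_sched = optim.Adam(model_with_sched.parameters(), lr=0.001)

# OneCycleLR scheduler
# max_lr: peak learning rate, reached after pct_start of all steps
# epochs * steps_per_epoch: total number of optimizer steps in the schedule
scheduler = OneCycleLR(
    optimizer_sched,
    max_lr=0.01,  # Peak LR; with the default div_factor=25 the starting LR is 0.0004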
    epochs=N_EPOCHS,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,  # 30% warmup
    anneal_strategy='cos'  # Cosine annealing
)

print("Training with OneCycleLR...")
print("=" * 70)

sched_train_losses, sched_val_losses, sched_val_f1s, sched_lrs = train_with_scheduler(
    model_with_sched, train_loader, val_loader, criterion, optimizer_sched, scheduler,
    device, N_EPOCHS, PATIENCE, task='classification'
)

In [None]:
# Compare training curves
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss comparison
axes[0].plot(val_losses, label='No Scheduler', alpha=0.7)
axes[0].plot(sched_val_losses, label='OneCycleLR', alpha=0.7)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Validation Loss')
axes[0].set_title('Validation Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

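# Metric comparison (caution: the baseline run logged accuracy while the OneCycleLR run
# logged macro F1, so these two curves are not directly comparable)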
axes[1].plot(val_accuracies, label='No Scheduler (Acc)', alpha=0.7)
axes[1].plot(sched_val_f1s, label='OneCycleLR (F1)', alpha=0.7)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Metric')
axes[1].set_title('Validation Metric Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Learning rate
axes[2].plot(sched_lrs)
axes[2].set_xlabel('Training Step')
axes[2].set_ylabel('Learning Rate')
axes[2].set_title('OneCycleLR Learning Rate')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Evaluate model with scheduler on test set
_, _, y_pred_sched, y_true_sched, y_probs_sched = evaluate(model_with_sched, test_loader, criterion, device)

f1_sched = f1_score(y_true_sched, y_pred_sched, average='macro')
roc_auc_sched = roc_auc_score(y_true_sched, y_probs_sched, multi_class='ovr')
acc_sched = np.mean(y_pred_sched == y_true_sched)

print("=" * 68)
print("Classification Comparison: With vs Without LR Scheduling")
print("=" * 68)
print(f"{'Model':<35} {'F1 Macro':>10} {'ROC AUC':>10} {'Accuracy':>10}")
print("-" * 68)
print(f"{'PyTorch (no scheduler)':<35} {f1_macro:>10.4f} {roc_auc:>10.4f} {test_acc:>10.4f}")
print(f"{'PyTorch (OneCycleLR)':<35} {f1_sched:>10.4f} {roc_auc_sched:>10.4f} {acc_sched:>10.4f}")
print(f"{'LightGBM (baseline)':<35} {lgb_results['f1_macro']:>10.4f} {lgb_results['roc_auc_ovr']:>10.4f} {lgb_results['accuracy']:>10.4f}")
print("-" * 68)

improvement = (f1_sched - f1_macro) / f1_macro * 100
print(f"\nLR Scheduling impact: {improvement:+.1f}% F1 change")

---

## Part 8: Hyperparameter Tuning with Optuna

**Optuna** is a hyperparameter optimization framework. Its default sampler, the Tree-structured Parzen Estimator (TPE), is a Bayesian-style method that uses the results of earlier trials to decide which hyperparameters to try next.

### Why Optuna?
- **Smarter than grid search**: Learns from previous trials to focus on promising regions
- **Early pruning**: Can stop unpromising trials early, saving time
- **Easy to use**: Define search space with simple `trial.suggest_*` calls
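To make the pruning idea concrete, here is a simplified sketch of a median rule for a metric we want to maximize. It is an illustration under stated assumptions, not Optuna's implementation: `should_prune` is a hypothetical helper, and the F1 curves are made-up numbers.

```python
from statistics import median

def should_prune(history, current_values, step, n_startup_trials=5, n_warmup_steps=10):
    """Simplified median rule for a metric we want to MAXIMIZE.

    history: per-trial lists of intermediate values from finished trials
    current_values: intermediate values reported so far by the running trial
    """
    if len(history) < n_startup_trials or step < n_warmup_steps:
        return False  # not enough evidence yet, keep training
    peers = [vals[step] for vals in history if len(vals) > step]
    if not peers:
        return False
    # Prune when the trial's best value so far is below the peers' median at this step
    return max(current_values[: step + 1]) < median(peers)

# Five made-up validation F1 curves from finished trials
finished = [[0.50 + 0.01 * e for e in range(20)] for _ in range(5)]
weak_trial = [0.40] * 12
strong_trial = [0.70] * 12

print(should_prune(finished, weak_trial, step=11))      # weak trial after warmup
print(should_prune(finished, strong_trial, step=11))    # strong trial after warmup
print(should_prune(finished, weak_trial[:6], step=5))   # weak trial during warmup
```

The startup and warmup thresholds protect early trials and early epochs, where validation scores are too noisy to judge.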

### Hyperparameters we'll tune:
| Parameter | Range | Description |
|-----------|-------|-------------|
| `n_layers` | 1-4 | Number of hidden layers |
| `hidden_size_l{i}` | 32-256 (step 32) | Neurons in hidden layer `i`, tuned per layer |
| `dropout_rate` | 0.1-0.5 | Dropout probability |
| `learning_rate` | 1e-4 to 1e-2 (log scale) | Adam learning rate |
| `batch_size` | 32, 64, 128 | Samples per batch |
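As a point of reference, the same search space can be sampled at random with the standard library alone. This is a sketch of a plain random-search baseline, not of what Optuna does internally; `sample_config` is a hypothetical helper. It also shows what sampling the learning rate on a log scale means: the exponent is drawn uniformly, so each decade gets equal weight.

```python
import random

def sample_config(rng):
    """Draw one random configuration from the search space in the table."""
    n_layers = rng.randint(1, 4)
    return {
        "n_layers": n_layers,
        # One size per hidden layer, in multiples of 32 from 32 to 256
        "hidden_sizes": [rng.randrange(32, 257, 32) for _ in range(n_layers)],
        "dropout_rate": rng.uniform(0.1, 0.5),
        # Log-uniform: sample the exponent uniformly, then exponentiate
        "learning_rate": 10 ** rng.uniform(-4, -2),
        "batch_size": rng.choice([32, 64, 128]),
    }

rng = random.Random(42)
configs = [sample_config(rng) for _ in range(1000)]
below_1e3 = sum(c["learning_rate"] < 1e-3 for c in configs) / len(configs)
print(f"Share of learning rates below 1e-3: {below_1e3:.2f}")
```

With log-uniform sampling roughly half of the draws land below 1e-3; with plain uniform sampling over [1e-4, 1e-2] only about 9% would.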

In [None]:
import optuna
from optuna.trial import TrialState

# Suppress Optuna logging for cleaner output
optuna.logging.set_verbosity(optuna.logging.WARNING)

class FlexibleNN(nn.Module):
    """
    Flexible neural network where architecture is determined by hyperparameters.
    """
    
    def __init__(self, n_features, n_classes, hidden_sizes, dropout_rate):
        super().__init__()
        
        layers = []
        in_size = n_features
        
        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(in_size, hidden_size),
                nn.BatchNorm1d(hidden_size),
                nn.ReLU(),
                nn.Dropout(dropout_rate)
            ])
            in_size = hidden_size
        
        layers.append(nn.Linear(in_size, n_classes))
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)


def create_objective(X_train_t, y_train_t, X_val_t, y_val_t, n_features, n_classes, 
                     class_weights_tensor, device, n_epochs=30):
    """
    Create Optuna objective function for hyperparameter tuning.
    """
    
    def objective(trial):
        # Suggest hyperparameters
        n_layers = trial.suggest_int('n_layers', 1, 4)
        hidden_sizes = []
        for i in range(n_layers):
            hidden_sizes.append(trial.suggest_int(f'hidden_size_l{i}', 32, 256, step=32))
        
        dropout_rate = trial.suggest_float('dropout_rate', 0.1, 0.5)
        lr = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
        batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
        
        # Create data loaders with suggested batch size
        train_dataset = TensorDataset(X_train_t, y_train_t)
        val_dataset = TensorDataset(X_val_t, y_val_t)
        
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
        
        # Create model
        model = FlexibleNN(n_features, n_classes, hidden_sizes, dropout_rate).to(device)
        
        criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)
        optimizer = optim.Adam(model.parameters(), lr=lr)
        
        # Training loop with early stopping
        best_val_f1 = 0
        patience_counter = 0
        
        for epoch in range(n_epochs):
            # Train
            model.train()
            for batch_X, batch_y in train_loader:
                batch_X = batch_X.to(device)
                batch_y = batch_y.to(device)
                
                optimizer.zero_grad()
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()
            
            # Validate
            model.eval()
            all_preds = []
            all_labels = []
            
            with torch.no_grad():
                for batch_X, batch_y in val_loader:
                    batch_X = batch_X.to(device)
                    batch_y = batch_y.to(device)
                    outputs = model(batch_X)
                    preds = outputs.argmax(dim=1)
                    all_preds.extend(preds.cpu().numpy())
                    all_labels.extend(batch_y.cpu().numpy())
            
            val_f1 = f1_score(all_labels, all_preds, average='macro')
            
            # Report intermediate value for pruning
            trial.report(val_f1, epoch)
            
            # Prune if needed
            if trial.should_prune():
                raise optuna.TrialPruned()
            
            # Track best
            if val_f1 > best_val_f1:
                best_val_f1 = val_f1
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= 5:  # Shorter patience for tuning
                    break
        
        return best_val_f1
    
    return objective

In [None]:
# Run Optuna hyperparameter search
N_TRIALS = 20  # Increase for better results (50-100 recommended)

print(f"Running Optuna hyperparameter search ({N_TRIALS} trials)...")
print("This may take a few minutes...\n")

# Create study
study = optuna.create_study(
    direction='maximize',  # Maximize F1 score
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=10)
)

# Create objective function
objective = create_objective(
    X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
    n_features, 3, class_weights_tensor, device, n_epochs=30
)

# Optimize
study.optimize(objective, n_trials=N_TRIALS, show_progress_bar=True)

print("\nOptimization complete!")
print(f"Best trial F1: {study.best_trial.value:.4f}")
print("\nBest hyperparameters:")
for key, value in study.best_trial.params.items():
    print(f"  {key}: {value}")

In [None]:
# Visualize optimization history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Optimization history
trials_df = study.trials_dataframe()
completed = trials_df[trials_df['state'] == 'COMPLETE']

axes[0].scatter(completed.index, completed['value'], alpha=0.6)
axes[0].axhline(y=study.best_trial.value, color='r', linestyle='--', label=f'Best: {study.best_trial.value:.4f}')
axes[0].set_xlabel('Trial')
axes[0].set_ylabel('Validation F1')
axes[0].set_title('Optuna Optimization History')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Parameter importance (if enough trials)
if len(completed) >= 10:
    try:
        importances = optuna.importance.get_param_importances(study)
        params = list(importances.keys())[:6]  # Top 6
        values = [importances[p] for p in params]
        axes[1].barh(params, values)
        axes[1].set_xlabel('Importance')
        axes[1].set_title('Hyperparameter Importance')
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        axes[1].text(0.5, 0.5, 'Importance analysis\nunavailable',
                     ha='center', va='center', transform=axes[1].transAxes)
else:
    axes[1].text(0.5, 0.5, 'Need more trials\nfor importance analysis', 
                 ha='center', va='center', transform=axes[1].transAxes)

plt.tight_layout()
plt.show()

### 8.1 Train Final Model with Best Hyperparameters

In [None]:
# Extract best hyperparameters
best_params = study.best_trial.params
print("Training final model with best hyperparameters:")
print("-" * 40)

# Build hidden sizes list
n_layers = best_params['n_layers']
hidden_sizes = [best_params[f'hidden_size_l{i}'] for i in range(n_layers)]
print(f"Architecture: {n_features} -> {' -> '.join(map(str, hidden_sizes))} -> 3")
print(f"Dropout: {best_params['dropout_rate']:.2f}")
print(f"Learning rate: {best_params['learning_rate']:.6f}")
print(f"Batch size: {best_params['batch_size']}")
print("-" * 40)

# Create best model
best_model = FlexibleNN(n_features, 3, hidden_sizes, best_params['dropout_rate']).to(device)

# Create data loaders with best batch size
best_batch_size = best_params['batch_size']
best_train_loader = DataLoader(train_dataset, batch_size=best_batch_size, shuffle=True, drop_last=True)
best_val_loader = DataLoader(val_dataset, batch_size=best_batch_size, shuffle=False)
best_test_loader = DataLoader(test_dataset, batch_size=best_batch_size, shuffle=False)

# Training setup
best_criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)
best_optimizer = optim.Adam(best_model.parameters(), lr=best_params['learning_rate'])

# Train with early stopping (longer training for final model)
best_train_losses = []
best_val_losses = []
best_val_f1s = []
best_model_state = None
best_f1 = 0
patience_counter = 0

print("\nTraining...")
for epoch in range(100):  # More epochs for final model
    # Train
    best_model.train()
    epoch_loss = 0
    for batch_X, batch_y in best_train_loader:
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)
        
        best_optimizer.zero_grad()
        outputs = best_model(batch_X)
        loss = best_criterion(outputs, batch_y)
        loss.backward()
        best_optimizer.step()
        epoch_loss += loss.item()
    
    best_train_losses.append(epoch_loss / len(best_train_loader))
    
    # Validate
    best_model.eval()
    val_loss = 0
    all_preds, all_labels = [], []
    
    with torch.no_grad():
        for batch_X, batch_y in best_val_loader:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            outputs = best_model(batch_X)
            val_loss += best_criterion(outputs, batch_y).item()
            all_preds.extend(outputs.argmax(dim=1).cpu().numpy())
            all_labels.extend(batch_y.cpu().numpy())
    
    best_val_losses.append(val_loss / len(best_val_loader))
    val_f1 = f1_score(all_labels, all_preds, average='macro')
    best_val_f1s.append(val_f1)
    
    if val_f1 > best_f1:
        best_f1 = val_f1
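        # Clone the tensors: dict.copy() is shallow and would keep tracking the live weights
        best_model_state = {k: v.detach().clone() for k, v in best_model.state_dict().items()}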
        patience_counter = 0
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: F1 = {val_f1:.4f} ✓")
    else:
        patience_counter += 1
        if patience_counter >= 15:
            print(f"Early stopping at epoch {epoch+1}")
            break

best_model.load_state_dict(best_model_state)
print(f"\nBest validation F1: {best_f1:.4f}")

In [None]:
# Final evaluation on test set
best_model.eval()
all_preds, all_labels, all_probs = [], [], []

with torch.no_grad():
    for batch_X, batch_y in best_test_loader:
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)
        outputs = best_model(batch_X)
        probs = torch.softmax(outputs, dim=1)
        all_preds.extend(outputs.argmax(dim=1).cpu().numpy())
        all_labels.extend(batch_y.cpu().numpy())
        all_probs.extend(probs.cpu().numpy())

y_pred_best = np.array(all_preds)
y_true_best = np.array(all_labels)
y_probs_best = np.array(all_probs)

# Calculate metrics
f1_best = f1_score(y_true_best, y_pred_best, average='macro')
roc_auc_best = roc_auc_score(y_true_best, y_probs_best, multi_class='ovr')
acc_best = np.mean(y_pred_best == y_true_best)

# Final comparison table
print("=" * 70)
print("FINAL MODEL COMPARISON (Test Set)")
print("=" * 70)
print(f"{'Model':<40} {'F1 Macro':>10} {'ROC AUC':>10} {'Accuracy':>10}")
print("-" * 70)
print(f"{'LightGBM (with labs)':<40} {lgb_results['f1_macro']:>10.4f} {lgb_results['roc_auc_ovr']:>10.4f} {lgb_results['accuracy']:>10.4f}")
print(f"{'MLP sklearn (with labs)':<40} {mlp_results['f1_macro']:>10.4f} {mlp_results['roc_auc_ovr']:>10.4f} {mlp_results['accuracy']:>10.4f}")
print("-" * 70)
print(f"{'PyTorch (manual architecture)':<40} {f1_macro:>10.4f} {roc_auc:>10.4f} {test_acc:>10.4f}")
print(f"{'PyTorch (with OneCycleLR)':<40} {f1_sched:>10.4f} {roc_auc_sched:>10.4f} {acc_sched:>10.4f}")
print(f"{'PyTorch (Optuna-tuned)':<40} {f1_best:>10.4f} {roc_auc_best:>10.4f} {acc_best:>10.4f}")
print("=" * 70)

print("\nClassification Report (Optuna-tuned model):")
print(classification_report(y_true_best, y_pred_best, target_names=['No Diabetes', 'Prediabetes', 'Diabetes']))

---

## Summary

### What We Built

| Part | Topic | Key Concepts |
|------|-------|--------------|
| 1-3 | Basic Classification | Tensors, DataLoaders, nn.Module, training loop |
| 4-5 | Evaluation & Saving | Metrics, confusion matrix, model serialization |
| 6 | Regression | MSELoss, continuous predictions, R² metric |
| 7 | LR Scheduling | OneCycleLR, warmup, annealing |
| 8 | Hyperparameter Tuning | Optuna, TPE sampler, pruning, search space |

### Key Takeaways

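1. **LightGBM still wins for tabular data** - This matches the common finding that gradient-boosted trees are hard to beat on small and medium tabular datasets
2. **Neural networks benefit from tuning** - The manually chosen architecture and the Optuna-tuned one can differ noticeably; compare their rows in the final table
3. **Learning rate scheduling can help** - OneCycleLR often speeds up convergence, but check the with/without comparison in Part 7 rather than assuming a gain
4. **Regression results are modest** - A low R² suggests these features explain only part of the variance in HbA1c; note that R² is not directly comparable to classification metrics such as F1
5. **Deep learning needs more data** - With ~8K samples, gradient boosting has the advantage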
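
Since takeaway 4 rests on R², here is a reminder of how the three regression metrics are computed. This is a minimal sketch: `regression_metrics` is a hypothetical helper and the HbA1c values are made-up numbers, not results from this notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # model's squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # error of always predicting the mean
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    return rmse, mae, 1 - ss_res / ss_tot

y_true = [5.0, 5.5, 6.0, 7.5, 9.0]  # made-up HbA1c values (%)
y_pred = [5.2, 5.4, 6.3, 7.0, 8.6]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")

# A model that always predicts the mean scores R^2 = 0
baseline = [sum(y_true) / len(y_true)] * len(y_true)
_, _, r2_baseline = regression_metrics(y_true, baseline)
```

R² compares the model's squared error with that of always predicting the mean, so it measures the share of variance explained rather than a count of correct predictions.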

### Techniques Learned

| Technique | Why It Helps |
|-----------|--------------|
| **BatchNorm** | Stabilizes training, allows higher LR |
| **Dropout** | Prevents overfitting |
| **Early stopping** | Prevents overfitting, saves time |
| **Class weights** | Handles imbalanced classes |
| **OneCycleLR** | Better convergence, often better final performance |
| **Optuna** | Efficient hyperparameter search |
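
As one example from the table, class weights are often derived from inverse class frequencies. The sketch below uses the common "balanced" heuristic, `n_samples / (n_classes * count)`. Whether `class_weights_tensor` in this notebook was built this way is an assumption; `balanced_class_weights` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# Made-up, imbalanced labels: 70% class 0, 20% class 1, 10% class 2
labels = [0] * 70 + [1] * 20 + [2] * 10
weights = balanced_class_weights(labels)
for cls, w in weights.items():
    print(f"class {cls}: weight {w:.3f}")
```

Rarer classes get larger weights, so their errors contribute more to the loss. The resulting values could be passed, as a tensor, to the `weight` argument of `nn.CrossEntropyLoss`.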

### When to Use Neural Networks vs Gradient Boosting

| Use Case | Recommendation |
|----------|----------------|
| Tabular data, <50K samples | Gradient Boosting (LightGBM, XGBoost) |
| Tabular data, >100K samples | Can try neural networks |
| Images, video | Neural networks (CNNs) |
| Text, sequences | Neural networks (Transformers, RNNs) |
| Audio | Neural networks |
| Multi-modal (text+images) | Neural networks |

### Next Steps (if continuing)

- **Ensemble methods**: Combine LightGBM + PyTorch predictions
- **Advanced architectures**: TabNet, FT-Transformer
- **More data**: Expand to 1999-2018 NHANES
- **Feature engineering**: Create more domain-specific features
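
For the ensemble idea in the first bullet, one simple starting point is soft voting: average the class probabilities of both models and take the argmax. This is a sketch under stated assumptions, not a tuned ensemble; `soft_vote` and `predict` are hypothetical helpers and the probabilities are made up.

```python
def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' class probabilities, row by row."""
    blended = []
    for row_a, row_b in zip(probs_a, probs_b):
        blended.append([weight_a * a + (1 - weight_a) * b for a, b in zip(row_a, row_b)])
    return blended

def predict(probs):
    """Pick the class index with the highest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

# Made-up probabilities for 2 samples x 3 classes
lgb_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
nn_probs = [[0.4, 0.4, 0.2], [0.1, 0.3, 0.6]]

blended = soft_vote(lgb_probs, nn_probs)
print(predict(lgb_probs), predict(nn_probs), predict(blended))
```

The blend weight (`weight_a`) is itself a hyperparameter and should be chosen on the validation set, not the test set.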