# Building a Multi-Layer Perceptron for Diabetes Risk Prediction

Welcome to your Deep Learning project! In this notebook, you’ll take on the role of a data scientist building a neural network that predicts diabetes risk, helping healthcare providers prioritize patients for diagnostic testing.

### What you'll build

By completing this notebook, you will demonstrate that you can:

- Design MLP architectures with appropriate depth, width, and activations
- Preprocess data with proper splitting, normalization, and batching
- Implement training loops with forward/backward passes and optimization
- Apply evaluation metrics suited for specialized applications
- Diagnose overfitting/underfitting using loss curves
- Improve models systematically through hyperparameter tuning and regularization
- Interpret performance in context with actionable recommendations

**Dataset**: [CDC Diabetes Health Indicators](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset) (balanced subset of 50,000 patients, 21 features)  
**Estimated time**: 5-6 hours  

> *Have more questions about the project?* Review the [README.md](../README.md) for full project context, setup instructions, and deliverables.

### Ready to get started?

This notebook is divided into 8 sections: 

```Setup → Data Loading → Data Preprocessing → Model Design → Model Training → Model Evaluation → Model Improvements → Conclusion```

Follow them in sequence to complete your project!

---
## Step 0: Set up the environment

Let's begin by importing the necessary libraries and setting up our environment for reproducibility.

In [None]:
# Standard library and scientific Python imports
import numpy as np
import pandas as pd
import numbers
import matplotlib.pyplot as plt
from typing import Dict, Tuple, List, Optional

# PyTorch imports
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset

# Scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

In [None]:
# Set random seeds for reproducibility
# This ensures that your results are consistent across runs
RANDOM_SEED = 42

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
    
print(f"Random seed set to {RANDOM_SEED}")

> **Why reproducibility matters**: Setting random seeds ensures that your results are consistent across different runs. This is critical for debugging and comparing different model configurations. 
> <br>In production systems, reproducibility helps with model versioning and validation.
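
To make the point above concrete, here is a minimal sketch showing that re-seeding reproduces the same "random" draws. The helper name `draw_samples` is made up for this illustration and is not used elsewhere in the notebook.

```python
import numpy as np
import torch


def draw_samples(seed: int):
    """Re-seed NumPy and PyTorch, then draw three random numbers from each."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    return np.random.rand(3), torch.rand(3)


np_a, torch_a = draw_samples(42)
np_b, torch_b = draw_samples(42)   # same seed -> same draws
np_c, torch_c = draw_samples(7)    # different seed -> different draws

print("Same seed, same NumPy draws:  ", np.array_equal(np_a, np_b))
print("Same seed, same PyTorch draws:", torch.equal(torch_a, torch_b))
print("Different seed, same draws:   ", torch.equal(torch_a, torch_c))
```

The same idea applies to weight initialization and data shuffling: with a fixed seed, two runs of an unchanged notebook should start from the same state.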

In [None]:
# Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # Set cuDNN to deterministic mode for reproducibility
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

> **Moving execution to device**: Transferring your model and tensors to the GPU can drastically speed up training.  
> <br>Use `.to(device)` for models and tensors to ensure everything runs on the same hardware.  
> <br>*Example:*  
> ```model = model.to(device)```   
> ```inputs = inputs.to(device)```  
> <br>**Note**: Operations that mix CPU and GPU tensors raise a `RuntimeError`, so move the model and every input batch to the same device.
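
A slightly fuller sketch of the device pattern described above follows. The names `toy_model` and `toy_inputs` are placeholders for this example only; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

toy_model = nn.Linear(21, 1).to(device)      # parameters now live on `device`
toy_inputs = torch.randn(4, 21).to(device)   # inputs must live on the same device

toy_outputs = toy_model(toy_inputs)
print(toy_outputs.shape)          # one output per input row
print(toy_outputs.device.type)    # same device type as `device`
```

Because the code only refers to `device`, the same cell runs unchanged on machines with or without a GPU.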

---
## Step 1: Load and explore the dataset

Understanding your data is the first critical step in any deep learning project. In this step, you will load the dataset and perform exploratory data analysis (EDA) to learn its structure, distributions, and quality.

The CDC diabetes dataset is provided as a CSV file in the [`data/`](../data/) directory. You'll use [pandas](https://pandas.pydata.org/) to load it into a DataFrame for analysis.

### 1.1 Load the dataset

The dataset contains health indicators from the CDC's Behavioral Risk Factor Surveillance System. Each row represents a patient with 21 health-related features and a binary diabetes diagnosis.

In [None]:
# Load the diabetes dataset
df = pd.read_csv('data/diabetes_data.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

In [None]:
# TODO 1: Display basic dataset information
# HINT: You need to inspect the DataFrame's structure to understand what columns exist, 
#       their data types, and whether any values are missing. Pandas DataFrames have 
#       methods that provide a concise summary of this information in a single call.
#       Think about what you'd need to know before preprocessing the data.
# REFERENCE: https://pandas.pydata.org/docs/reference/frame.html
# This will help you understand the structure of your data before preprocessing

# ------- Add your code here

### 1.2 Understanding the features

The dataset contains 21 health and lifestyle indicators. Let's examine what each feature represents and its clinical significance:

| Feature | Description | Type | Clinical Significance |
|---------|-------------|------|----------------------|
| **Diabetes_binary** | Diabetes diagnosis | Binary (0/1) | **Target variable** |
| HighBP | High blood pressure diagnosis | Binary | Strong risk factor |
| HighChol | High cholesterol diagnosis | Binary | Cardiovascular risk |
| CholCheck | Cholesterol check in past 5 years | Binary | Preventive care indicator |
| BMI | Body Mass Index | Continuous | Key obesity indicator |
| Smoker | Smoking history | Binary | Lifestyle risk factor |
| Stroke | History of stroke | Binary | Cardiovascular complication |
| HeartDiseaseorAttack | Coronary heart disease or MI | Binary | Cardiovascular complication |
| PhysActivity | Physical activity in past 30 days | Binary | Protective factor |
| Fruits | Fruit consumption (1+ per day) | Binary | Dietary indicator |
| Veggies | Vegetable consumption (1+ per day) | Binary | Dietary indicator |
| HvyAlcoholConsump | Heavy alcohol consumption | Binary | Lifestyle risk factor |
| AnyHealthcare | Any healthcare coverage | Binary | Access to care |
| NoDocbcCost | Could not see doctor due to cost | Binary | Healthcare barrier |
| GenHlth | General health (1-5 scale) | Ordinal | Self-reported health status |
| MentHlth | Mental health (days not good in past 30) | Continuous | Mental health indicator |
| PhysHlth | Physical health (days not good in past 30) | Continuous | Physical health indicator |
| DiffWalk | Difficulty walking or climbing stairs | Binary | Mobility indicator |
| Sex | Biological sex | Binary | Demographic factor |
| Age | Age category (1-13) | Ordinal | Strong demographic predictor |
| Education | Education level (1-6) | Ordinal | Socioeconomic indicator |
| Income | Income level (1-8) | Ordinal | Socioeconomic indicator |

For a complete overview of the dataset, refer to the [`data/data_dictionary.md`](data/data_dictionary.md) file.

> **Feature dictionaries in the real world**: In practice, you'll often create these yourself by interviewing domain experts. Always document: feature meaning, type, measurement method, and known limitations.
> <br> Think of the feature dictionary as your model's "instruction manual"; without it, even brilliant ML work becomes unusable in production.


### 1.3 Perform exploratory data analysis (EDA)

Understanding your data involves checking for missing values, reviewing feature distributions, understanding target variable distribution, and identifying patterns that will inform your modeling decisions.

In [None]:
# TODO 2: Check for missing values in the dataset
# HINT: Missing values can break model training. You need to check each column for null/NaN values.
#       Pandas has methods to detect null values, and you can combine these with aggregation 
#       functions to count them. Consider chaining methods together.
# REFERENCE: https://pandas.pydata.org/docs/user_guide/missing_data.html
# Missing values require special handling and can impact model performance

missing_values =  # ------- Add your code here
print("Missing values per column:")
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

> **Handling missing values** Even though this dataset has no missing values, it's crucial to check for them in any data workflow since they can disrupt model training, bias results, or break computations. Common strategies to handle missing values include:
>   - *Removal:* Drop rows or columns if the proportion of missing data is small.  
>   - *Imputation:* Fill with mean, median, mode, or domain-specific defaults.  
>   - *Model-based methods:* Predict missing values using other features.  
>
> Always make sure to also analyze *why* data is missing: it can reveal data collection issues or hidden patterns!  
>
> *Reference:* [Working with missing data in Pandas](https://pandas.pydata.org/docs/user_guide/missing_data.html)
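
The removal and imputation strategies listed above can be sketched on a tiny made-up DataFrame. The values below are invented for illustration; the project dataset itself does not need this treatment.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up)
toy = pd.DataFrame({
    'BMI': [22.0, np.nan, 31.0, 27.0],
    'Smoker': [0.0, 1.0, np.nan, 1.0],
})

# Removal: keep only rows without any missing value
dropped = toy.dropna()

# Imputation: median for the continuous column, mode for the binary one
imputed = toy.copy()
imputed['BMI'] = imputed['BMI'].fillna(imputed['BMI'].median())
imputed['Smoker'] = imputed['Smoker'].fillna(imputed['Smoker'].mode()[0])

print(f"Rows after removal: {len(dropped)}")
print(imputed)
```

Removal halves this toy table, which is why imputation is often preferred when missing values are spread across many rows.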


In [None]:
# TODO 3: Analyze and visualize the target variable distribution
# HINT: For classification tasks, you need to understand class balance. How many samples 
#       are in each class (0 = no diabetes, 1 = diabetes)? Pandas Series (single columns) 
#       have methods to count the frequency of each unique value. You can also calculate 
#       percentages by normalizing these counts. Use matplotlib to visualize your key findings.
# REFERENCE: https://pandas.pydata.org/docs/reference/series.html
# Understanding class balance is critical for choosing appropriate metrics

target_distribution =  # ------- Add your code here
target_percentage =  # ------- Add your code here

print("Target Variable Distribution:")
print(f"No Diabetes (0): {target_distribution[0]:,} ({target_percentage[0]:.2f}%)")
print(f"Diabetes (1): {target_distribution[1]:,} ({target_percentage[1]:.2f}%)")

# Visualize the distribution
fig, ax = plt.subplots(figsize=(8, 5))
target_distribution.plot(kind='bar', ax=ax, color=['#2ecc71', '#e74c3c'])
ax.set_xlabel('Diabetes Status', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title('Distribution of Target Variable', fontsize=14, fontweight='bold')
ax.set_xticklabels(['No Diabetes', 'Diabetes'], rotation=0)
plt.tight_layout()
plt.show()

> **Understanding the balanced dataset:** You may have noticed that this dataset has a 50-50 split between diabetic and non-diabetic patients. The original CDC data actually has ~14% diabetes prevalence, reflecting real population statistics.
>
> Why did we balance it?
> - *Simplifies learning:* Lets you focus on neural network fundamentals without juggling class imbalance complexity
> - *Valid strategy:* Downsampling the majority class is common in production when you have enough minority samples (this subset still contains thousands of diabetic cases!)
> - *Practical benefits:* Simpler code, faster training, often performs just as well
>
> Alternatives for handling class imbalance include upsampling the minority class and cost-sensitive learning.
>
> *Production note:* You'd typically train on balanced data but evaluate on the real distribution. Here we use balanced for both to keep things simple, but remember to test on realistic proportions before deployment!
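
As a rough illustration of the downsampling idea described above, the sketch below balances a tiny made-up DataFrame. It is not the procedure used to build the project dataset, and the column names are placeholders.

```python
import pandas as pd

# Toy imbalanced data: 8 negatives, 2 positives (made-up values)
toy = pd.DataFrame({
    'feature': range(10),
    'label': [0] * 8 + [1] * 2,
})

# Size of the smallest class
minority_size = int(toy.groupby('label').size().min())

# Randomly keep `minority_size` rows from every class
balanced = toy.groupby('label').sample(n=minority_size, random_state=42)

print(balanced['label'].tolist())
```

Every class ends up with as many rows as the smallest one, at the cost of discarding majority-class rows.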

In [None]:
# TODO 4: Examine basic statistical properties of the dataset
# HINT: Statistical summaries (mean, std, min, max, quartiles) help you understand feature 
#       distributions and identify potential outliers. DataFrames have built-in methods that 
#       compute these statistics across all numerical columns at once.
# REFERENCE: https://pandas.pydata.org/docs/reference/frame.html
# Pay attention to the ranges of different features since this informs normalization needs

# ------- Add your code here

> **Understanding statistical summaries** Statistical summaries provide a quick snapshot of your dataset’s structure and variability.  
>
> - *Central tendency:* Metrics like mean and median reveal typical values but can be distorted by skewed data or outliers.  
> - *Spread:* Standard deviation and quartiles highlight how dispersed the data is, where wide spreads may indicate inconsistent scales or high variability.  
> - *Range and extremes:* Min/max values help spot potential data entry errors or outliers needing review.  
> - *Feature type awareness:*  
>   - Continuous features benefit from scale checks and normalization.  
>   - Ordinal or categorical features should be interpreted by the distribution of categories, not numeric statistics.  
>
> *Why this matters:* Understanding the shape, spread, and nature of your data guides preprocessing decisions like scaling, encoding, and handling outliers for reliable modeling.


In [None]:
# Visualize distributions of key continuous and ordinal features
continuous_features = ['BMI', 'MentHlth', 'PhysHlth', 'Age', 'GenHlth']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

for idx, feature in enumerate(continuous_features):
    axes[idx].hist(df[feature], bins=30, edgecolor='black', alpha=0.7)
    axes[idx].set_xlabel(feature, fontsize=11)
    axes[idx].set_ylabel('Frequency', fontsize=11)
    axes[idx].set_title(f'Distribution of {feature}', fontsize=12)
    axes[idx].grid(True, alpha=0.3)

# Remove extra subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

> **Interpreting the distribution plots** These histograms reveal how each plotted feature is distributed (note that `Age` and `GenHlth` are ordinal scales, not truly continuous):  
> - *Shape:* Skewed or symmetric distributions indicate how values are spread.  
> - *Spread:* Wide distributions suggest high variability; scaling will be needed.  
> - *Peaks:* Multiple peaks can hint at subgroups or hidden patterns in the data.  
> - *Outliers:* Extreme values can distort model learning; worth investigating.  
> - *Continuity:* Features with few distinct values might behave more like categories.

In [None]:
# TODO 5: Analyze correlation between features and the target variable
# HINT: Correlation coefficients measure linear relationships between variables (-1 to +1).
#       You need to compute pairwise correlations between all features, then focus on 
#       correlations between each feature and your target variable. DataFrames can compute 
#       correlation matrices, and you can select specific columns from the result.
# REFERENCE: https://pandas.pydata.org/docs/reference/frame.html
# Strong correlations indicate features that may be predictive of diabetes

# Compute correlation with target
correlations =  # ------- Add your code here

print("Feature Correlations with Diabetes:")
print(correlations)

# Visualize top correlations
fig, ax = plt.subplots(figsize=(10, 8))
correlations[1:].plot(kind='barh', ax=ax)  # Exclude self-correlation
ax.set_xlabel('Correlation Coefficient', fontsize=12)
ax.set_title('Feature Correlations with Diabetes Status', fontsize=14, fontweight='bold')
ax.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

> **Understanding correlations** Correlations show how features move together, not cause and effect.  
> - *Positive* → variables increase together; *Negative* → one rises as the other falls.  
> - Values near ±1 imply tighter linear links; values near 0 mean a weak or no linear relation.  
> - Watch for *multicollinearity* and remember that *nonlinear patterns* won’t appear here.  
>  
> *IMPORTANT*: Use correlations to spot patterns, not to draw conclusions.
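
The warning about nonlinear patterns can be checked with a small synthetic example. The data below is generated for illustration and has nothing to do with the diabetes dataset.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero

linear = 2.0 * x + 1.0            # perfectly linear in x
quadratic = x ** 2                # fully determined by x, but not linearly

r_linear = np.corrcoef(x, linear)[0, 1]
r_quadratic = np.corrcoef(x, quadratic)[0, 1]

print(f"Linear relationship:    r = {r_linear:.3f}")
print(f"Quadratic relationship: r = {r_quadratic:.3f}")
```

The quadratic variable depends entirely on `x`, yet its correlation coefficient is essentially zero, so a low correlation with the target does not prove a feature is useless to a neural network.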

### 1.4 Collect key observations from EDA

Document 5-10 key observations from the exploratory data analysis. Consider: class balance, feature ranges, correlations, missing data, and data quality.

*TODO 6: Write your observations as bullet points here:*

<details open>
  <summary><h4>Checkpoint – Understanding the dataset</h4></summary>
  
  Before proceeding to preprocessing, ensure you understand:

  - [ ] The target variable distribution and its implications for evaluation
  - [ ] Which features show strong correlations with diabetes
  - [ ] The need for feature scaling due to different value ranges
  - [ ] The absence of missing values (simplifies preprocessing)

</details>

---
## Step 2: Preprocess the dataset

Proper data preprocessing is essential for neural network training. Raw data often needs to be split, normalized, and batched before it can be used effectively. This section transforms your dataset into a format optimized for PyTorch models.

### 2.1 Separate features and target

Supervised models learn a mapping from input features (X) to target labels (y), so the two must be stored separately. The features are what the model uses to make predictions, while the target is what you're trying to predict. 

In [None]:
# TODO 7: Separate features (X) and target variable (y)
# HINT: Identify the column representing the diabetes outcome, and store it separately.
#       The remaining columns will form your feature set.
#       DataFrames allow you to exclude columns or select specific ones using built-in methods.
# REFERENCE: https://pandas.pydata.org/docs/reference/frame.html
# Features are used for predicting the target variable.

X =  # ------- Add your code here
y =  # ------- Add your code here

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns)}")

### 2.2 Create train/validation/test splits

Following best practices, you'll split the data into three sets:

- Training set (60%): Used to train the model (learn patterns, update weights)
- Validation set (20%): Used during development to tune hyperparameters and monitor overfitting
- Test set (20%): Final evaluation on completely unseen data (simulates real-world deployment)

Keeping the validation set separate from the test set ensures that model tuning doesn’t “leak” information from your final evaluation, which leads to more trustworthy results.
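
Since most splitting utilities cut data into two parts at a time, a 60/20/20 split is usually done in two stages. The arithmetic below works out the fractions and expected sizes, assuming 50,000 rows; exact counts in your run may differ by a row or two because of rounding.

```python
# Assumed numbers: 50,000 rows and the 60/20/20 target proportions above
n_total = 50_000
train_frac, val_frac, test_frac = 0.60, 0.20, 0.20

# Stage 1: hold out everything that is not training data
holdout_frac = val_frac + test_frac               # 0.40 of the full dataset

# Stage 2: fraction of the holdout that becomes the test set
test_share_of_holdout = test_frac / holdout_frac  # 0.50 of the holdout

n_holdout = round(n_total * holdout_frac)
n_test = round(n_holdout * test_share_of_holdout)
n_val = n_holdout - n_test
n_train = n_total - n_holdout

print(n_train, n_val, n_test)  # 30000 10000 10000
```

The key detail is that the second-stage fraction is relative to the holdout, not to the full dataset.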

In [None]:
# TODO 8: Create train/validation/test splits with stratification by populating the empty variables below
# HINT: scikit-learn provides utilities for splitting data while preserving class proportions.
#       You'll need to split in two stages: first separate training from temp, then split 
#       temp into validation and test. Each split should maintain the same diabetes/no-diabetes 
#       ratio as the original dataset. Look for a parameter that handles stratification.
# REFERENCE: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
# While less critical with balanced data, stratification is still good practice and prevents any sampling bias.

X_train = y_train = X_val = y_val = X_test = y_test = None

# ------- Add your code here

In [None]:
# Verify split sizes and class distributions
print("Split Sizes:")
print(f"Training set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(df)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(df)*100:.1f}%)")

print("\nClass Distribution:")
print(f"Training - Diabetes prevalence: {y_train.mean()*100:.2f}%")
print(f"Validation - Diabetes prevalence: {y_val.mean()*100:.2f}%")
print(f"Test - Diabetes prevalence: {y_test.mean()*100:.2f}%")

<details open>
  <summary><h4>Checkpoint – Validate the data splits</h4></summary>
  
  Before continuing, make sure your dataset splits meet the expected proportions:

  - [ ] Training set contains approximately 60% of the data
  - [ ] Validation and test sets each contain approximately 20% of the data
  - [ ] Class proportions match the full dataset in all three splits

  If they aren’t, revisit your splitting logic before moving forward.

</details>

### 2.3 Normalize features

Neural networks perform better when input features are on similar scales. Standardization transforms features to have zero mean and unit variance, which helps gradients flow properly during training and speeds up convergence.

In [None]:
# TODO 9: Normalize features with a scaler, and populate the empty variables below with the scaled feature sets
# HINT: Follow these general steps:
#         1. Create a scaling object from a preprocessing library
#         2. Fit it **only** on the training features to learn scaling parameters
#         3. Apply the same transformation to validation and test features
#       This maintains data consistency and prevents information leakage from validation or test sets.
# REFERENCE: https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling
# Normalization helps gradient descent converge more efficiently and prevents bias toward features with larger scales.

X_train_scaled = X_val_scaled = X_test_scaled = None

# ------- Add your code here

> **Don't forget the golden rule!** Only fit the scaler on training data to prevent data leakage, then use that fitted scaler to transform all three splits.   
> If you fit the scaler on validation or test data, their statistics (mean, standard deviation) would influence the training process. This is called *data leakage*, and leads to overly optimistic performance estimates. 
> <br><br>Think of it this way: In real deployment, you only have access to training data when building your model. The test set represents future, unseen entries that you should not have visibility on!
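
The fit-on-train, transform-everywhere rule can be illustrated by hand with NumPy on synthetic data. This is a conceptual sketch with made-up numbers, not the scaler object the TODO asks for; it uses the population standard deviation, which is also the convention scikit-learn's `StandardScaler` follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up single feature: training and validation samples from the same source
train = rng.normal(loc=28.0, scale=6.0, size=(1000, 1))
val = rng.normal(loc=28.0, scale=6.0, size=(200, 1))

# "Fit": learn the scaling parameters from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "Transform": reuse those same parameters for every split
train_scaled = (train - mean) / std
val_scaled = (val - mean) / std

print(f"Train mean/std after scaling: {train_scaled.mean():.3f} / {train_scaled.std():.3f}")
print(f"Val   mean/std after scaling: {val_scaled.mean():.3f} / {val_scaled.std():.3f}")
```

The training split comes out with mean 0 and standard deviation 1, while the validation split is only close to those values. That small mismatch is expected and is exactly what unseen data will look like in deployment.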

### 2.4 Convert to PyTorch tensors

PyTorch requires data to be in tensor format and batched for efficient GPU processing. DataLoaders handle batching, shuffling, and parallel data loading automatically.

In [None]:
# TODO 10: Convert scaled NumPy arrays to PyTorch tensors by populating the missing variables below
# HINT: Features should be floating-point tensors (float32) for neural network computations.
#       The label dtype depends on the loss you plan to use: binary cross-entropy on a single
#       output neuron expects float labels, while multi-class cross-entropy on class indices
#       expects long (integer) labels.
#       PyTorch provides tensor creation functions for different data types.
# REFERENCE: https://pytorch.org/docs/stable/torch.html#tensors
# Why does the dtype matter? A loss function typically raises an error when the label dtype does not match what it expects.

X_train_tensor = X_val_tensor = X_test_tensor = None
y_train_tensor = y_val_tensor = y_test_tensor = None

# ------- Add your code here

print("Conversion to PyTorch tensors completed!")
print(f"\nTraining features: {X_train_tensor.shape}, dtype: {X_train_tensor.dtype}")
print(f"Training labels: {y_train_tensor.shape}, dtype: {y_train_tensor.dtype}")

> **Why convert to tensors?** Neural networks run on *tensors*, not NumPy arrays, because tensors can live on the GPU and support automatic differentiation.  
> Converting ensures your data can be used efficiently during training and lets PyTorch handle gradients and fast parallel math.  
>  
> *IMPORTANT*: Avoid switching back and forth between NumPy and tensors inside the training loop: converting a GPU tensor to NumPy forces a copy back to CPU memory and detaches it from the computation graph, which slows training.  
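
One practical detail when converting is the dtype. The sketch below uses a small made-up array to show that NumPy's default `float64` carries over unless you cast explicitly, while PyTorch layers use `float32` by default.

```python
import numpy as np
import torch

features_np = np.array([[0.5, -1.2], [1.1, 0.3]])   # NumPy defaults to float64

as_is = torch.from_numpy(features_np)                         # keeps float64
as_float32 = torch.tensor(features_np, dtype=torch.float32)   # copies and casts

print(as_is.dtype)
print(as_float32.dtype)
print(torch.nn.Linear(2, 1).weight.dtype)   # dtype of layer parameters
```

Feeding `float64` inputs into a `float32` layer raises a dtype mismatch error, so casting once at conversion time avoids surprises later.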

In [None]:
# TODO 11: Create DataLoaders for train, validation, and test sets
# HINT: PyTorch provides utilities in torch.utils.data to wrap tensors into datasets and create batching iterators.
#       You'll need to: (1) combine feature and label tensors into dataset objects, and 
#       (2) wrap datasets in loaders with appropriate batch sizes. Training data should be 
#       shuffled, but validation/test should not.
# REFERENCE: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
# Think about whether to apply different DataLoader parameters for the three splits.

train_loader = val_loader = test_loader = None

# ------- Add your code here

print("Number of batches:")
print(f"Training: {len(train_loader)} batches")
print(f"Validation: {len(val_loader)} batches")
print(f"Test: {len(test_loader)} batches")

In [None]:
# Verify DataLoader setup by examining one batch
for X_batch, y_batch in train_loader:
    print("Sample batch from training DataLoader:")
    print(f"Features shape: {X_batch.shape}")
    print(f"Labels shape: {y_batch.shape}")
    print(f"\nFeature sample (first 5 values): {X_batch[0, :5]}")
    print(f"Label sample (first 10): {y_batch[:10]}")
    break  # Only examine first batch

> **Batch processing efficiency**: DataLoaders enable efficient mini-batch gradient descent by automatically batching your data. This allows for more stable gradient estimates than single-sample updates (SGD) while being more memory-efficient than using the entire dataset at once (batch gradient descent). Shuffling the training data each epoch prevents the model from learning the order of examples.
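
To sanity-check the batch counts printed above, you can work out the expected numbers by hand. The sketch assumes a 60/20/20 split of 50,000 rows, a batch size of 64, and that incomplete final batches are kept (the `DataLoader` default).

```python
import math

# Assumed numbers: a 60/20/20 split of 50,000 rows and a batch size of 64
batch_size = 64
split_sizes = {'train': 30_000, 'val': 10_000, 'test': 10_000}

batch_counts = {}
for name, n_samples in split_sizes.items():
    n_batches = math.ceil(n_samples / batch_size)           # last batch may be smaller
    last_batch = n_samples - (n_batches - 1) * batch_size   # size of that final batch
    batch_counts[name] = n_batches
    print(f"{name}: {n_batches} batches, last batch has {last_batch} samples")
```

Under these assumptions you should see 469 training batches and 157 batches each for validation and test. A final batch smaller than 64 is normal and does not indicate a bug.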

<details open>
  <summary><h4>Checkpoint – Validate the DataLoader output</h4></summary>
  
  Before training, confirm that your batches are correctly structured and preprocessed:
  
  - [ ] **Feature batch shape:** `(64, 21)` → 64 samples × 21 features (assuming a batch size of 64)  
  - [ ] **Label batch shape:** `(64,)` → 64 labels  
  - [ ] **Feature values:** normalized 
  - [ ] **Labels:** binary values (**0 or 1**)  
  
  If any shapes or values look off, revisit your preprocessing or batching steps before proceeding.
</details>

---
## Step 3: Design the model architecture

Now it's time to design your neural network! You'll build a multi-layer perceptron (MLP): a feed-forward neural network that learns from health indicators to predict diabetes risk. The goal is to define an architecture that balances **expressiveness** (ability to learn complex patterns) and **efficiency** (training speed and generalization).


### 3.1 Design considerations

Before implementing the model, consider these architectural decisions:

**Input Layer**:
- Size: Must match the number of features in the dataset (21 for our diabetes data)

**Hidden Layers**:
- **Depth**: Deeper networks can capture more complex patterns but increase the risk of overfitting and slow down training
- **Width**: Choose enough neurons to represent useful feature interactions without overcomplicating the model  
- **Activation**: Non-linear activations (like ReLU) allow the model to learn complex relationships beyond simple linear boundaries  

**Output Layer**:
- Size: Single neuron for binary classification (outputs probability of diabetes)
- Activation: Sigmoid function squashes output to range [0, 1], representing probability

### 3.2 Implement the neural network

You'll create a Multi-Layer Perceptron (MLP) using PyTorch's `nn.Module` class. This approach lets you define each layer and the forward pass explicitly, giving you full control over your model’s behavior.

In [None]:
# TODO 12: Define your DiabetesClassifier neural network class
# HINT: In PyTorch, models are defined as classes that inherit from nn.Module.
#       Within the constructor, specify your linear layers and activations.
#       Then implement the forward() method to describe how data flows through them.
#       Remember: Add non-linear activations between layers to help the network learn complex patterns;
#       also think about the loss function you'll use as it will inform your model output.
# REFERENCE: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html
# Define your architecture thoughtfully based on design considerations.

class DiabetesClassifier(nn.Module):

    # ------- Add your code here

### 3.3 Instantiate and inspect the model

Once the architecture is defined, create an instance of your model and inspect its structure. Understanding your model's parameters and layers helps ensure it's built as intended and gives you insight into its complexity.

In [None]:
# TODO 13: Create an instance of your model, and set it to run on your device
# HINT: Initialize your class just like any Python object. You can use the default
#       hidden layer sizes or adjust them for experimentation.
#       After instantiation, set the model on the device. Then, printing the model will display its structure.
# REFERENCE: https://discuss.pytorch.org/t/how-to-print-a-model-after-load-it/9879/7

model = # ------- Add your code here

print("Model Architecture:")
print(model)
print("\n" + "="*50)

In [None]:
# Count total parameters in the model
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Display parameter breakdown by layer
print("\nParameter breakdown:")
for name, param in model.named_parameters():
    print(f"{name:20s} - Shape: {str(param.shape):20s} - Parameters: {param.numel():,}")

> **Understanding model inspection** Inspecting your model helps confirm that the architecture matches your design intent, both in structure and in parameter count.  
>  
> - *Check layer flow:* Verify inputs and outputs connect as expected, especially when stacking layers or combining modules.  
> - *Parameter awareness:* Knowing where parameters concentrate helps spot over- or under-parameterized designs early.  
> - *Debugging aid:* If training behaves unexpectedly, inspection can reveal mismatched layer sizes, missing activations, or frozen parameters.  
> - *Efficiency check:* Smaller models train faster but may underfit; larger ones capture complexity but risk overfitting.  
>  
> Regular inspection builds intuition about how architectural choices impact learning and performance.
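
You can cross-check the parameter counts printed above by hand: each fully connected layer has `inputs × outputs` weights plus one bias per output. The helper `count_mlp_parameters` and the 64/32 hidden sizes below are hypothetical, chosen only to illustrate the arithmetic.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector
    return total


# Hypothetical architecture: 21 inputs -> 64 -> 32 -> 1 output
print(count_mlp_parameters([21, 64, 32, 1]))  # 1408 + 2080 + 33 = 3521
```

If the total reported for your own model differs from the hand calculation for your layer sizes, a layer is probably missing or has the wrong dimensions.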


### 3.4 Test the forward pass

Before training, confirm that your model processes input data correctly. This step ensures that tensor shapes are compatible and that the output has the expected dimensions for binary classification.

In [None]:
# TODO 14: Test the forward pass with a sample batch
# HINT: Retrieve one batch from your DataLoader and pass it through the model, after moving it to device.
#       Check that the output tensor’s shape matches expectations (one probability per sample).
#       This step validates that your forward() method and layer dimensions are correct.
# REFERENCE: Review "forward propagation in PyTorch" and "DataLoader iteration examples".
# Expected output shape: (batch_size, 1)

# Get a sample batch
for X_batch, y_batch in train_loader:
    # Move data to device
    X_batch = X_batch.to(device)

    # Forward pass
    output =  # ------- Add your code here
    
    print("Forward Pass Test:")
    print(f"Input shape: {X_batch.shape}")
    print(f"Output shape: {output.shape}")
    print("\nSample output (first 10):")
    print(output[:10].squeeze().detach().cpu().numpy())
    print("\nActual labels (first 10):")
    print(y_batch[:10].numpy())
    
    break  # Only test with first batch

> **Interpreting the forward pass test** If the model runs without errors and outputs a tensor with the expected shape, your architecture and data pipeline are aligned.  
> Unexpected shapes, NaNs, or all-identical predictions can signal setup issues worth fixing before training.  
>  
> Passing this check means you’re ready to move on to loss computation and optimization.


<details open>
  <summary><h4>Checkpoint – Verify Model Architecture</h4></summary>
  
  Before training, ensure your model is correctly defined:
  
  - [ ] **Model instantiated** without errors  
  - [ ] **Architecture matches expectations**: Input size = # features, hidden layers = right complexity for task, output = supports binary classification  
  - [ ] **Model moved to correct device** (GPU if available)  
  
  If anything looks incorrect, revisit your model definition before proceeding to training.
</details>

---
## Step 4: Train the model

Training a neural network involves repeatedly cycling through the data, computing predictions, calculating loss, and updating weights through backpropagation. This section implements the complete training loop with proper validation monitoring.

### 4.1 Define loss function and optimizer

The loss function measures how wrong your model's predictions are, while the optimizer determines how to update weights to reduce this error.

For binary classification with probabilistic outputs (values between 0 and 1), which loss function is most appropriate? Which optimizer is commonly recommended as a strong default for neural networks?

In [None]:
# TODO 15: Define loss function and optimizer
# HINT: Choose a binary classification loss that measures the difference 
#       between predicted probabilities and true labels.
#       Then, initialize an optimizer that updates model weights efficiently.
#       Remember to pass model parameters to the optimizer and select a reasonable learning rate.
# REFERENCE: https://docs.pytorch.org/docs/stable/nn.html#loss-functions, https://docs.pytorch.org/docs/stable/optim.html
# Tip: Adaptive optimizers adjust learning rates automatically, which makes them reliable defaults.

criterion =   # ------- Add your code here
optimizer =   # ------- Add your code here

print("Training configuration:")
print(f"Loss function: {criterion}")
print(f"Optimizer: {optimizer.__class__.__name__}")
print(f"Learning rate: {optimizer.param_groups[0]['lr']}")
print(f"Number of parameter groups: {len(optimizer.param_groups)}")

> **Defining the training setup for binary classification:** For binary classification, you need a loss that compares predicted probabilities against binary targets.  
> When choosing an optimizer, consider:  
> - How *stable* you want learning to be (adaptive methods help when tuning is tricky).  
> - How much *control* you need over learning rates or momentum.  
>  
> The “best” choice often depends on your data size, feature scale, and how smoothly the model learns; experiment and observe training behavior.
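
To make "a loss that compares predicted probabilities against binary targets" concrete, here is a minimal sketch of binary cross-entropy worked out by hand on made-up numbers. The helper name `binary_cross_entropy` and the toy probabilities are illustrative assumptions, not part of the project code.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

labels = [1, 0, 1, 0]  # hypothetical targets
confident_right = binary_cross_entropy([0.9, 0.1, 0.9, 0.1], labels)
unsure = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], labels)
confident_wrong = binary_cross_entropy([0.1, 0.9, 0.1, 0.9], labels)

print(f"{confident_right:.3f} {unsure:.3f} {confident_wrong:.3f}")
```

Confident wrong predictions are penalized far more than uncertain ones. The "unsure" value equals ln 2 (about 0.693), which is a handy sanity check: a model that has learned nothing on balanced data should start near that loss.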


### 4.2 Implement the training loop

The training loop orchestrates the entire learning process: forward pass, loss computation, backward pass, and weight updates. Validation during training helps detect overfitting early.

In [None]:
# TODO 16: Complete the training function by implementing the complete training loop
# HINT: A complete training loop has two phases per epoch:
#       
#       TRAINING PHASE: For each epoch:
#       - Iterate through training batches
#       - Move batches to device
#       - For each batch: clear old gradients → forward pass → compute loss → 
#         backward pass → update weights
#       - Track average training loss and other relevant performance metrics
#       
#       VALIDATION PHASE:
#       - Set model to evaluation mode (disables dropout, batchnorm updates)
#       - Iterate through validation batches WITHOUT computing gradients
#       - Calculate average validation loss and other relevant performance metrics
#       
#       The PyTorch training pattern involves specific method calls on the model, optimizer, 
#       and loss. 
#
#       LOGGING:
#       - Use the `print_every` argument to conditionally print progress (epoch index, avg train/val loss).
#
# REFERENCE: https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html

def train_model(model: nn.Module,
                train_loader: DataLoader,
                val_loader: DataLoader,
                criterion: nn.Module,
                optimizer: optim.Optimizer,
                device: torch.device,
                num_epochs: int = 100,
                print_every: int = 10) -> Tuple[int, List[float], List[float]]:
    """
    Train a PyTorch model with validation monitoring.
    
    Args:
        model: PyTorch model to train
        train_loader: DataLoader for training data
        val_loader: DataLoader for validation data
        criterion: Loss function
        optimizer: Optimizer
        device: Device to train on (CPU or GPU)
        num_epochs: Number of training epochs
        print_every: Print progress every N epochs
    
    Returns:
        Tuple of (num_epochs, train_losses, val_losses) - the losses contain one value per epoch
    """

    # Lists to track metrics over epochs
    train_losses = []
    val_losses = []
    
    # ------- Add your code here

    print("\nTraining completed!")
    print(f"Final Train Loss: {train_losses[-1]:.4f}")
    print(f"Final Validation Loss: {val_losses[-1]:.4f}")

    return num_epochs, train_losses, val_losses

num_epochs, train_losses, val_losses = train_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    criterion=criterion,
    optimizer=optimizer,
    device=device
)

> **Beware of gradient accumulation!** By default, PyTorch accumulates gradients across batches. Without resetting them, gradients from previous iterations are added to those from the current batch, which can lead to incorrect weight updates. A reset step is needed before each backward pass to prevent this.
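
The accumulation behavior can be observed on a single scalar parameter. This is a small sketch with a hypothetical tensor `w`, separate from the training loop you are asked to write; it only demonstrates that `.grad` adds up across `backward()` calls until it is reset.

```python
import torch

# Hypothetical one-parameter "model": loss = 3 * w, so d(loss)/dw = 3
w = torch.tensor(2.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()          # gradient after one backward pass

(3.0 * w).backward()
accumulated = w.grad.item()    # second backward pass ADDS to the stored gradient

w.grad.zero_()                 # explicit reset
(3.0 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```

In a full training loop, the optimizer's `zero_grad()` method performs this reset for every parameter it manages.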

### 4.3 Visualize training progress

Loss curves are your primary debugging tool for neural networks. They reveal whether your model is learning properly, overfitting, or underfitting.

In [None]:
# Plot training and validation loss curves

plt.figure(figsize=(12, 6))

epochs_range = range(1, num_epochs + 1)
plt.plot(epochs_range, train_losses, label='Training Loss', linewidth=2, color='#3498db')
plt.plot(epochs_range, val_losses, label='Validation Loss', linewidth=2, color='#e74c3c')

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss (Binary Cross-Entropy)', fontsize=12)
plt.title('Training and Validation Loss Over Time', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print final statistics
print(f"\nLoss Statistics:")
print(f"Initial Train Loss: {train_losses[0]:.4f}")
print(f"Final Train Loss: {train_losses[-1]:.4f}")
print(f"Initial Val Loss: {val_losses[0]:.4f}")
print(f"Final Val Loss: {val_losses[-1]:.4f}")
print(f"\nLoss Reduction:")
print(f"Training: {((train_losses[0] - train_losses[-1]) / train_losses[0] * 100):.1f}% decrease")
print(f"Validation: {((val_losses[0] - val_losses[-1]) / val_losses[0] * 100):.1f}% decrease")

### 4.4 Interpret the loss curves

Understanding loss curves is critical for diagnosing model performance. Here are the patterns to look for:

| Pattern | Loss Behavior | Interpretation | Possible Fixes |
|----------|----------------|----------------|----------------|
| **Healthy Training** | • Both training & validation loss decrease and converge<br>• Small gap between them | Model is learning generalizable patterns | — |
| **Overfitting** | • Training loss keeps decreasing<br>• Validation loss plateaus or increases<br>• Large gap between curves | Model memorizes training data instead of generalizing | Add regularization (dropout, weight decay)<br>Reduce model complexity<br>Collect more data |
| **Underfitting** | • Both losses remain high<br>• Little or no improvement over epochs | Model lacks capacity or training | Increase model complexity<br>Train longer<br>Tune learning rate<br>Improve input features |

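The patterns in the table can be turned into a rough rule of thumb. The sketch below is only a heuristic: the function name `diagnose_loss_curves`, the gap tolerance of 0.10, and the 5% minimum improvement are assumptions chosen for illustration, and a visual inspection of the curves should always take precedence.

```python
def diagnose_loss_curves(train_losses, val_losses, gap_tol=0.10, min_drop=0.05):
    """Rough label for a pair of per-epoch loss curves (thresholds are assumptions)."""
    # Fraction by which training loss improved from first to last epoch
    relative_drop = (train_losses[0] - train_losses[-1]) / train_losses[0]
    # How far validation loss sits above training loss at the end
    final_gap = val_losses[-1] - train_losses[-1]
    # How far validation loss has climbed back up from its best value
    val_rebound = val_losses[-1] - min(val_losses)

    if relative_drop < min_drop:
        return "underfitting"
    if final_gap > gap_tol or val_rebound > gap_tol:
        return "overfitting"
    return "healthy"

# Hypothetical curves, three epochs each
healthy = diagnose_loss_curves([0.69, 0.55, 0.50], [0.69, 0.56, 0.52])
overfit = diagnose_loss_curves([0.69, 0.40, 0.20], [0.69, 0.55, 0.65])
underfit = diagnose_loss_curves([0.69, 0.685, 0.68], [0.69, 0.686, 0.682])
print(healthy, overfit, underfit)
```
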
Diagnose your model's performance based on loss curves. Answer these questions by analyzing the plot above, either through code or markdown:
 1. What pattern do you see? Does the plot show healthy training, overfitting, or underfitting?
 2. How do you know? _Describe the behavior of the training loss line vs. the validation loss line. (e.g., "Training loss is [decreasing/high] while validation loss is [decreasing/increasing/high]")._
 3. What does this imply? What is the model doing wrong (or right)? _(e.g., "The model is memorizing the training data but failing to generalize...")_

*------- TODO 17: Add your answer here:*

<details open>
  <summary><h4>Checkpoint – Assess Training Progress</h4></summary>
  
  Before moving to evaluation, verify that training completed successfully:
  
  - [ ] **Training completed** without errors across all epochs  
  - [ ] **Loss curves analyzed** and any overfitting or underfitting noted  
  - [ ] **Model ready** for comprehensive evaluation on test set  
  
  If training seems problematic (e.g., losses not decreasing), don't worry for now: the experiments in Step 6 focus on improving exactly that.
</details>

---
## Step 5: Evaluate the model

Training loss tells you how well the model fits training data, but comprehensive evaluation requires measuring performance on unseen test data using metrics relevant to your application. 

For medical screening, certain metrics matter more than others. For the baseline model, just check that performance is good enough (significantly better than random guessing).

### 5.1 Define the evaluation logic

We'll now create a parameterized function that evaluates any model and returns comprehensive metrics.

In [None]:
# TODO 18: Implement a function to evaluate your trained model on new data
# HINT: This function should test how well your model performs by:
#       1. Setting the model to evaluation mode (to disable dropout, etc.)
#       2. Looping through the evaluation DataLoader without tracking gradients
#       3. Collecting predictions and true labels for each batch
#       4. Converting these results to NumPy arrays for metric calculations
#       5. Computing relevant performance metrics for the use case
#       6. Returning all computed values in a single dictionary for easy analysis
#       This function helps summarize how well your trained model generalizes to unseen data.
# REFERENCE: https://scikit-learn.org/stable/modules/model_evaluation.html
# Remember: Our model outputs probabilities (0-1), threshold at 0.5 for binary classification

def evaluate_model(model: nn.Module, 
                   data_loader: DataLoader, 
                   device: torch.device,
                   threshold: float = 0.5) -> Dict[str, object]:
    """
    Evaluate a trained model on a dataset and return comprehensive metrics.
    
    Args:
        model: Trained PyTorch model
        data_loader: DataLoader containing the evaluation dataset
        device: Device to run evaluation on (CPU or GPU)
        threshold: Decision threshold for classification (default: 0.5)
    
    Returns:
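        Dictionary containing:
            - predictions: numpy array of predicted probabilities
            - true_labels: numpy array of true labels
            - pred_labels: numpy array of predicted binary labels
            - one top-level entry per scalar metric (e.g., 'accuracy', 'precision',
              'recall', 'f1', 'roc_auc') so that track_experiment() can pick them up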
    """
    all_predictions = []
    all_labels = []
    results = {}

    # ------- Add your code here
    
    return results

print("Evaluation function created successfully!")

> **Why accuracy isn't enough (even with balanced data):** While accuracy is now meaningful with our 50-50 split (unlike with imbalanced data where it'd be misleading), it still doesn't tell the whole story in medical applications.
> 
> Consider: on a 50-50 split, a model with 75% accuracy could have 90% recall but only about 69% precision (great at catching diabetic patients, but with many false alarms). Or it could have 90% precision but only about 56% recall (very reliable when it predicts diabetes, but missing many cases).
> 
> *Bottom line:* Choose metrics that provide a clear view of this trade-off; this is critical in healthcare where different errors have different costs!
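
The following worked example shows how two classifiers with identical accuracy can behave very differently. The confusion-matrix counts are hypothetical (100 positives and 100 negatives each), and `summarize` is an illustrative helper, not a function from this project.

```python
def summarize(tp, fp, fn, tn):
    """Compute standard metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical model A: flags many patients (few misses, many false alarms)
cautious = summarize(tp=90, fp=40, fn=10, tn=60)
# Hypothetical model B: flags few patients (few false alarms, many misses)
strict = summarize(tp=60, fp=10, fn=40, tn=90)

for name, m in (("cautious", cautious), ("strict", strict)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both models score 75% accuracy, yet the first misses 10 diabetic patients while the second misses 40. Accuracy alone cannot distinguish them; precision and recall can.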

### 5.2 Evaluate on test set

Now we'll use our evaluation function to assess the baseline model's performance on the held-out test set. This gives us an unbiased estimate of real-world performance.

In [None]:
# TODO 19: Evaluate the model on the test set using the evaluation function
# HINT: Call evaluate_model() with your trained model and test_loader.
#       Extract the metrics from the returned dictionary and display them.

# ------- Add your code here


> **Choosing the top metric:** Which is worse for your stakeholders: missing a diabetic patient or triggering an unnecessary test? This determines whether you prioritize recall or precision, and informs threshold selection (we'll use 0.5 as default, but you could adjust it).

### 5.3 Visualize the confusion matrix

The confusion matrix shows exactly where your model succeeds and fails, breaking down predictions into true positives, true negatives, false positives, and false negatives.
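
As a quick illustration of what the four cells count, the sketch below tallies them by hand for a handful of made-up labels. It uses the same layout scikit-learn uses for binary labels 0 and 1 (rows are true classes, columns are predicted classes); the data and variable names are assumptions for illustration only.

```python
# Hypothetical true labels and thresholded predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diabetic, flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diabetic, missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, false alarm
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared

# Rows: true class (0, 1); columns: predicted class (0, 1)
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```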

In [None]:
# TODO 20: Create and visualize the confusion matrix
# HINT: Use sklearn to calculate and visualize the confusion matrix.
#       This approach automatically labels axes (true vs. predicted) and provides a clean layout.
#       Optionally, you can adjust color maps or figure size for better readability.
# REFERENCE: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html

disp =   # ------- Add your code here
disp.plot(cmap='Blues', colorbar=True)
plt.title('Confusion Matrix - Test Set', fontsize=14, fontweight='bold')
plt.show()

> **Interpreting your confusion matrix:** The confusion matrix makes the trade-offs between recall and precision visible. By examining the balance between false positives and false negatives, you can decide which matters more for your real-world goal; for example, catching every diabetic case even if it means more false alarms.

### 5.4 Analyze ROC curve and threshold selection

The ROC (Receiver Operating Characteristic) curve helps you visualize the tradeoff between sensitivity (catching true cases) and specificity (avoiding false alarms) at different decision thresholds.

*Why this matters:*
- The default threshold is 0.5 (predict diabetes if probability > 0.5)
- You might want a different threshold based on your priorities:
  - Lower threshold (e.g., 0.3) → Higher recall, catch more diabetics, but more false alarms
  - Higher threshold (e.g., 0.7) → Higher precision, fewer false alarms, but miss more cases
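
The effect of moving the threshold can be checked on a small toy example. The probabilities and labels below are invented, and `precision_recall_at` is a hypothetical helper; the point is only to show the direction in which precision and recall move.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall when predicting positive for prob > threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical model outputs: five diabetic (1) and five non-diabetic (0) patients
probs = [0.95, 0.85, 0.65, 0.55, 0.35, 0.60, 0.45, 0.40, 0.25, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

results = {t: precision_recall_at(probs, labels, t) for t in (0.3, 0.5, 0.7)}
for t, (precision, recall) in results.items():
    print(f"threshold={t}: precision={precision:.3f}, recall={recall:.3f}")
```

On this toy data, recall falls from 1.0 to 0.4 as the threshold rises from 0.3 to 0.7, while precision climbs from 0.625 to 1.0. Real data is noisier, so precision does not always rise monotonically.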


In [None]:
# TODO 21: Generate and plot the ROC curve
# HINT: The ROC curve shows how your model’s ability to distinguish between classes 
#       changes as you vary the classification threshold: the closer the curve hugs the top-left corner, 
#       the better the model separates positives and negatives. The Area Under the Curve (AUC) quantifies 
#       this performance: a higher AUC means better separability. As a rule of thumb:
#            * 0.90–1.00 → Excellent discrimination
#            * 0.80–0.89 → Good discrimination
#            * 0.70–0.79 → Fair discrimination
#            * Below 0.70 → Poor discrimination
#       Use this plot and AUC value to reason about which threshold best aligns with your real-world trade-offs.
# REFERENCE: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

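# Calculate ROC curve and the area under it (AUC), which the plot below uses as `roc_auc`
fpr, tpr, thresholds =  # ------- Add your code here
roc_auc =  # ------- Add your code here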

# Plot ROC curve
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='#3498db', linewidth=2.5, label=f'Model ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=2, label='Random Classifier (AUC = 0.5)')

# Mark the current operating point (threshold = 0.5)
current_idx = np.argmin(np.abs(thresholds - 0.5))
plt.scatter(fpr[current_idx], tpr[current_idx], color='red', s=150, zorder=5, 
            label=f'Current Threshold (0.5)\nTPR={tpr[current_idx]:.3f}, FPR={fpr[current_idx]:.3f}')

plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate (Sensitivity/Recall)', fontsize=12, fontweight='bold')
plt.title('ROC Curve - Evaluating Threshold Tradeoffs', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nROC Curve Analysis:")
print("="*60)
print(f"AUC Score: {roc_auc:.4f}")
print(f"Interpretation:")
if roc_auc >= 0.90:
    print("  - Excellent discrimination ability")
elif roc_auc >= 0.80:
    print("  - Good discrimination ability")
elif roc_auc >= 0.70:
    print("  - Fair discrimination ability")
else:
    print("  - Poor discrimination ability")

> **Understanding the ROC curve**: Each point on the ROC curve represents a different probability threshold for classification. Points in the upper-left corner represent high sensitivity (catching most cases) with low false positive rates (few false alarms). The diagonal line represents random guessing. Depending on your prioritization, you might adjust the threshold as follows:
> - If missing positives is costly, lower the threshold to get the model to label more cases as "positive" → higher true positive rate, higher false positive rate.
> - If false alarms are costly, raise the threshold to get the model to be more strict in labeling cases as "positive" → lower true positive rate, lower false positive rate.
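
To see where ROC points and the AUC come from, the sketch below computes them by hand on invented scores. It relies on a standard interpretation of AUC: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count as half). The helper names and data are assumptions for illustration.

```python
def roc_point(probs, labels, threshold):
    """Return (FPR, TPR) when predicting positive for prob > threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p > threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p > threshold)
    return fp / neg, tp / pos

def pairwise_auc(probs, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos_scores = [p for p, y in zip(probs, labels) if y == 1]
    neg_scores = [p for p, y in zip(probs, labels) if y == 0]
    wins = 0.0
    for ps in pos_scores:
        for ns in neg_scores:
            if ps > ns:
                wins += 1.0
            elif ps == ns:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores for five positives and five negatives
probs = [0.95, 0.85, 0.65, 0.55, 0.35, 0.60, 0.45, 0.40, 0.25, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

points = {t: roc_point(probs, labels, t) for t in (0.3, 0.5, 0.7)}
auc = pairwise_auc(probs, labels)
print(points, auc)
```

Lowering the threshold moves the operating point up and to the right along the curve: more true positives, but also more false positives.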

### 5.5 Interpret your results

Beyond raw metrics, it's essential to interpret your model's performance from a healthcare perspective. Let's analyze what these results mean for clinical deployment and patient care.

Reflect on your model's performance from a healthcare perspective. Answer these questions based on your evaluation results:
 1. Which metric is most important for diabetes screening and why? _(Hint: Think about the cost of a False Negative vs. a False Positive)._
 2. Would you recommend using this model in practice? Under what conditions?
 3. What probability threshold would you choose instead of 0.5, and why?

HINT: This model is quite simple since you haven't yet performed hyperparameter tuning, regularization, or feature engineering. Feel free to move forward if you achieve...

*------- TODO 22: Add your answer here:*

<details open>
  <summary><h4>Checkpoint – What to Keep in Mind About Your Model Performance </h4></summary>
  
  Before moving forward, record:
  - **Current metrics**: Accuracy ___, Precision ___, Recall ___, F1 ___, ...
  - **Training behavior**: Overfitting (train << val loss)? Underfitting (both high)?
  - **Main issue**: Which needs fixing most—false positives, false negatives, or overall performance?
  
  These observations will guide your improvement strategy in Step 6!
</details>

---
## Step 6: Improve and tune the model

Based on your evaluation results, you will now systematically improve the model. This demonstrates the iterative nature of machine learning: evaluate, diagnose, improve, and repeat. 

You'll experiment with multiple techniques and track results to identify the most impactful improvements.

**Your Goal**: Increase model performance against baseline by >=5% on your top metric.

> **Feel free to change any training parameters: num_epochs, optimizer, ...**

### 6.0 Create experiment tracking system

Before running experiments, let's create a system to automatically track and compare results. This eliminates manual result entry and makes it easy to identify the best-performing configuration.

In [None]:
# Create experiment tracking dictionary
experiment_results = {}

def track_experiment(name: str, 
                     model: nn.Module,
                     train_losses: List[float],
                     val_losses: List[float],
                     test_results: Dict[str, object],
                     notes: str = "") -> None:
    """
    Track experiment results for later comparison.
    
    Args:
        name: Experiment name
        model: Trained model
        train_losses: List of training losses per epoch
        val_losses: List of validation losses per epoch
        test_results: Dictionary from evaluate_model()
        notes: Optional notes about the experiment
    """
    # Keep only numeric metrics so we don't try to tabulate arrays, dicts, etc.
    numeric_metrics = {
        k: float(v) for k, v in test_results.items()
        if isinstance(v, numbers.Number)
    }

    experiment_results[name] = {
        'final_train_loss': float(train_losses[-1]),
        'final_val_loss': float(val_losses[-1]),
        'min_val_loss': float(min(val_losses)),
        'loss_gap': float(abs(train_losses[-1] - val_losses[-1])),
        'metrics': numeric_metrics,           # store all metrics here
        'notes': notes,
        'train_losses': train_losses,         # keep full histories if desired
        'val_losses': val_losses
    }
    print(f"✓ Experiment '{name}' tracked successfully!")


def display_experiment_comparison(sort_by: Optional[str] = "f1",
                                  descending: bool = True) -> Optional[pd.DataFrame]:
    """
    Display a comparison table of all tracked experiments.

    Args:
        sort_by: Metric name to sort by (e.g., 'f1', 'roc_auc', 'accuracy').
                 If None or not present, will fall back to:
                   1) 'roc_auc' if present
                   2) any available metric (alphabetical)
        descending: Sort order for the chosen metric.

    Returns:
        DataFrame with experiment results.
    """
    if not experiment_results:
        print("No experiments tracked yet!")
        return None

    # Collect the union of all metric names across experiments
    all_metric_names = set()
    for res in experiment_results.values():
        all_metric_names.update(res.get('metrics', {}).keys())
    all_metric_names = sorted(all_metric_names)  # stable ordering

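    # Determine sort metric: requested metric, then 'roc_auc', then first alphabetical
    chosen_sort = None
    if sort_by in all_metric_names:
        chosen_sort = sort_by
    elif 'roc_auc' in all_metric_names:
        chosen_sort = 'roc_auc'
    elif all_metric_names:
        chosen_sort = all_metric_names[0]  # fallback to first available
    # If no metrics at all, we'll sort by final_val_loss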

    # Build table rows
    rows = []
    for name, res in experiment_results.items():
        row = {
            'Experiment': name,
            'Val Loss': f"{res['final_val_loss']:.4f}",
            'Loss Gap': f"{res['loss_gap']:.4f}",
        }
        # Add metrics (formatted)
        for m in all_metric_names:
            val = res['metrics'].get(m, None)
            row[m.upper() if m.islower() else m] = (f"{val:.4f}" if isinstance(val, numbers.Number) else "")
        # Also keep raw for sorting
        row['_sort_val'] = (
            res['metrics'].get(chosen_sort)
            if chosen_sort is not None else res['final_val_loss']
        )
        rows.append(row)

    df = pd.DataFrame(rows)

    # Sort
    if chosen_sort is not None:
        df = df.sort_values('_sort_val', ascending=not descending)
    else:
        # No metrics available: sort by Val Loss ascending
        df = df.sort_values('Val Loss', ascending=True, key=lambda s: s.astype(float))

    # Clean up helper column
    df = df.drop(columns=['_sort_val'])

    return df

# Track baseline experiment
track_experiment(
    name='Baseline',
    model=model,
    train_losses=train_losses,
    val_losses=val_losses,
    test_results=test_results,
    notes='Initial model with default hyperparameters'
)

print("\nExperiment tracking system initialized!")
print("Use track_experiment() after training each variation.")
print("Use display_experiment_comparison() to see all results.")

> **Running a systematic experiment:** For each experiment, follow this workflow:
> 1. **Define change**: What are you testing? (e.g., dropout=0.3, lr=0.0001)
> 2. **Create model**: Create new model class (if architecture changes) and instantiate with new configuration, e.g., `DiabetesClassifierWithDropout(dropout_prob=0.3).to(device)`
> 3. **Train**: Run training loop, track losses with `train_model()`
> 4. **Evaluate**: Calculate test metrics (accuracy, precision, recall, F1) with `evaluate_model()`
> 5. **Record**: Add results to experiment tracker dictionary with `track_experiment()`
> 6. **Visualize**: Print key experiment metrics for quick analysis
> 
> This systematic approach helps you understand what works and why!

### 6.1 Experiment 1: Add dropout regularization

Dropout randomly deactivates neurons during training, forcing the network to learn robust features that don't rely on specific neurons. This reduces overfitting and improves generalization.

In [None]:
# TODO 23: Create a model with dropout layers and train it
# HINT: To reduce overfitting, insert dropout layers between your hidden layer activations. 
#       Think about where dropout would make the most impact — typically after nonlinear activations.
#       Experiment with different dropout probabilities (e.g., around 0.3) to find a balance 
#       between regularization and model capacity.
#       Use your existing training function to maintain consistency in training and evaluation.
# REFERENCE: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html

print("Experiment 1: Training model with Dropout")
print("="*60)

# ------- Add your code here


> **How dropout works**: During training, dropout randomly sets a fraction of neuron activations to zero. This prevents the network from relying too heavily on any single neuron and encourages redundancy. During evaluation, dropout is automatically disabled, and all neurons contribute to predictions. This simple technique is remarkably effective at reducing overfitting.
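
The mechanism can be sketched in a few lines of plain Python. This is an illustrative re-implementation of "inverted" dropout, not the PyTorch layer itself: surviving activations are scaled by 1/(1-p) during training, which matches the scaling described in the `torch.nn.Dropout` documentation. The function name, the seed, and the input values are assumptions.

```python
import random

def dropout(activations, p=0.3, training=True, rng=None):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)      # evaluation mode: identity
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)           # keeps the expected activation unchanged
    return [0.0 if rng.random() < p else a * scale for a in activations]

rng = random.Random(0)                # fixed seed for a repeatable illustration
acts = [1.0] * 10                     # hypothetical hidden-layer activations

train_out = dropout(acts, p=0.3, training=True, rng=rng)
eval_out = dropout(acts, p=0.3, training=False)

print("train:", [round(v, 3) for v in train_out])
print("eval: ", eval_out)
```

In training mode each output is either 0 or about 1.429 (that is, 1/0.7); in evaluation mode the input passes through untouched. This is why forgetting to switch the model to evaluation mode produces noisy validation metrics.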

### 6.2 Experiment 2: Tune learning rate

The learning rate controls how large the weight updates are during training. Too high and training becomes unstable; too low and convergence is painfully slow.
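
Both failure modes show up even on the simplest possible problem. The sketch below runs plain gradient descent (not Adam, and not the notebook's model) on the one-dimensional function f(w) = w², whose minimum is at w = 0. The function name and the three learning rates are assumptions picked to make the contrast obvious.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; returns the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w -= lr * grad        # weight update scaled by the learning rate
    return w

w_slow = gradient_descent(lr=0.001)   # too low: barely moves from 1.0
w_good = gradient_descent(lr=0.1)     # reasonable: close to the minimum
w_diverged = gradient_descent(lr=1.1) # too high: overshoots further each step

print(f"{w_slow:.4f} {w_good:.4f} {w_diverged:.4f}")
```

Each update multiplies `w` by (1 - 2·lr), so a rate of 0.1 shrinks it by 20% per step, 0.001 shrinks it by only 0.2% per step, and 1.1 flips its sign while growing its magnitude.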

In [None]:
# TODO 24: Experiment with different learning rates
# HINT: Try at least 3 learning rates: one lower (0.0001), one higher (0.01), and baseline (0.001).
#       Use the same model architecture for fair comparison.
#       Track how quickly each converges and their final performance.
# REFERENCE: https://pytorch.org/docs/stable/optim.html

print("Experiment 2: Learning Rate Tuning")
print("="*60)

# ------- Add your code here


> **Choosing a learning rate:** The optimal learning rate depends on your model, dataset, and optimizer. A good starting point is often *1e-3*.
> - If training diverges, lower lr (×0.1).
> - If loss plateaus too early, raise lr (×2–10).
> 
> *Tip*: Adam is generally more forgiving, while SGD benefits from careful tuning.

### 6.3 Experiment 3: Adjust network architecture

Network architecture (depth and width, i.e., how many layers and how many units per layer) determines model capacity: its ability to learn complex patterns. Too simple and it underfits; too complex and it overfits (especially with limited data).

> **Consider running this experiment multiple times with different architectures**.

In [None]:
# TODO 25: Define a network architecture that best fits the training behavior and performance you observed
# HINT: Think about the two failure modes:
#       - Make the model too simple, and it underfits
#       - Make the model too complex, and it overfits (especially with limited data)
#       Compare how these choices influence parameter count, training behavior, and generalization.
# REFERENCE: Course module on network architecture design

print("Experiment 3: Training with tailored architecture")
print("="*60)

# ------- Add your code here


> **Model capacity tradeoffs**: Larger models (more parameters) can learn more complex patterns but are more prone to overfitting, especially with limited data. For tabular data like ours, simpler architectures often perform just as well or better than deep networks. The key is finding the right balance for your dataset size and complexity.
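
A quick way to compare capacity is to count trainable parameters. For a fully connected layer with `n_in` inputs and `n_out` outputs there are `n_in * n_out` weights plus `n_out` biases. The sketch below applies that formula to three hypothetical architectures on the 21 input features; the layer sizes are illustrative assumptions, not recommendations.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical architectures: 21 input features, 1 output unit
small = count_mlp_parameters([21, 16, 1])
medium = count_mlp_parameters([21, 64, 32, 1])
large = count_mlp_parameters([21, 256, 128, 64, 1])

print(small, medium, large)
```

The three networks have 369, 3,521, and 46,849 parameters respectively. Going from the small to the large design multiplies capacity by more than 100, which is worth weighing against the number of training samples available.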

### 6.4 Compare all experiments

Now let's synthesize findings from all improvement experiments to identify which techniques had the biggest impact on performance.

In [None]:
# Display comprehensive comparison of all experiments
print("Comprehensive Experiment Comparison")
print("="*70)
print("\nAll experiments sorted by F1-Score (best to worst):\n")

comparison_df = display_experiment_comparison()
print(comparison_df.to_string(index=False))

# Identify best configuration
best_experiment = comparison_df.iloc[0]['Experiment']
print(f"\n{'='*70}")
print(f"Best Configuration: {best_experiment}")
print(f"{'='*70}")

print(f"\nKey Insights:")
print("  - Review the 'Loss Gap' column to identify which techniques reduced overfitting")
print("  - Consider the tradeoff between Precision and Recall for your use case")

In [None]:
# Visualize loss curves for top 3 experiments
print("\nVisualizing Loss Curves for Top 3 Experiments")
print("="*60)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
top_3_experiments = comparison_df.head(3)['Experiment'].values

for idx, exp_name in enumerate(top_3_experiments):
    exp_data = experiment_results[exp_name]
    epochs = range(1, len(exp_data['train_losses']) + 1)
    
    axes[idx].plot(epochs, exp_data['train_losses'], label='Train Loss', linewidth=2, color='#3498db')
    axes[idx].plot(epochs, exp_data['val_losses'], label='Val Loss', linewidth=2, color='#e74c3c')
    axes[idx].set_xlabel('Epoch', fontsize=11)
    axes[idx].set_ylabel('Loss', fontsize=11)
    axes[idx].set_title(f'{exp_name}', fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

> **Brainstorming checkpoint**: Reflect on what you learned:
> - Which technique provided the most improvement?
> - Did any techniques hurt performance?
> - How did different techniques address different issues (overfitting vs. underfitting)?
> - What would you try next if you had more time?

---

### 6.5 Reflect on experiments and brainstorm additional improvements

You've now systematically tested three improvement techniques. It's time to demonstrate your understanding by reflecting on results and proposing what to try next. In real-world ML projects, you'll brainstorm many ideas but only implement the most promising ones due to time and resource constraints.

#### Part A: Reflect on your experimental results

**Analyze your experiment results by answering the following questions. Reference specific metrics, loss curves, and configuration details from your experiments.**

**Questions for reflection:**

1. **Which experiment performed best overall, and why?**

2. **What patterns did you observe across experiments?**
   - Did dropout reduce the train-validation gap? By how much?
   - Which learning rate provided the best balance of convergence speed and stability?
   - Did wider or deeper architectures help? Or did they overfit?

3. **What is your model's biggest remaining weakness?**
   - Is there significant overfitting? (Train-val gap > 0.10)
   - Are false negatives or false positives the bigger problem?
   - Has validation loss plateaued, suggesting you've hit a performance ceiling?

4. **What have you learned about this dataset and problem?**
   - Is the 21-feature set sufficient, or do you suspect important information is missing?
   - Our subset is balanced (50-50). How might results differ on the original, imbalanced data (86% vs 14%)?
   - Are there diminishing returns from additional model complexity?

*------- TODO 26: Write your answer here:*


#### Part B: Brainstorm additional improvements

Based on your reflection above, now propose *1-2 specific improvements* you would try next from the list. Choose from the techniques below and provide detailed justification for each.

1. **Class Weights** - Give more importance to one class during training (e.g., the positive class, to favor recall)
2. **Threshold Tuning** - Adjust the 0.5 decision boundary based on ROC analysis
3. **Early Stopping** - Stop training when validation loss stops improving
4. **Learning Rate Scheduling** - Gradually decrease learning rate during training
5. **Weight Decay (L2 Regularization)** - Add penalty for large weights to reduce overfitting
6. **Data Augmentation** - Oversample minority class or use SMOTE for synthetic examples
7. **Different Activation Functions** - Experiment with LeakyReLU, ELU, ReLU, etc.
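
As an example of how small some of these techniques are, here is a minimal sketch of the patience logic behind early stopping (item 3), applied to a made-up list of validation losses. The function name, the `patience` and `min_delta` parameters, and the loss values are assumptions; a real implementation would also save the best model weights.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a validation loss history."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch   # stop: no improvement for `patience` epochs

    return len(val_losses) - 1, best_epoch  # never triggered: ran to the end

# Hypothetical validation loss history
history = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.54]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)
```

Training would stop at epoch index 5 and keep the weights from epoch index 2. Note the trade-off: with this patience the later improvement at index 6 is never seen.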

> **Important note on feature engineering:** Feature engineering is excluded from this list as it falls outside the scope of neural network optimization techniques. In production, that would be a priority since creating interaction terms _(e.g., BMI × Age, HighBP × HeartDisease)_ or polynomial features could yield 5-10% performance gains, often more than architectural changes.

**For each technique you select (1-2 total), provide a detailed analysis by answering:**

1. **What specific problem from your reflection does your selection address?**
   - Connect directly to weaknesses you identified in Part A
   - Reference specific metrics from your experiment results
   - Example: "Our recall is 0.7345, just barely above target, and we have 387 false negatives"

2. **Why is this technique appropriate for this problem?**
   - Explain the mechanism: how does this technique work?
   - Why would it solve your specific problem?
   - Example: "Class weights force the model to pay more attention to minority class examples during training by increasing their loss contribution"

3. **What results do you expect?**
   - Be specific: which metrics should improve and by roughly how much?
   - What trade-offs might occur? (e.g., precision vs recall)
   - Example: "Recall should increase to 0.75-0.78 because the model will work harder to identify positive cases. Precision may drop slightly from 0.66 to 0.62 due to more false positives, but F1-score should still improve overall"

4. **How would you implement this? (Implementation complexity)**
   - Easy (1-2 line change), Medium (new component), or Hard (major refactor)
   - If Easy or Medium, show the code snippet or describe the modification
   - Example: "Easy - Change loss to: `nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))`"
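
To make the class-weight and precision/recall examples above concrete, here is a minimal sketch of two inexpensive changes: deriving `pos_weight` from label counts instead of hard-coding it, and sweeping the decision threshold. The helper names `pos_weight_from_labels` and `sweep_thresholds`, the synthetic probabilities, and the `min_recall=0.80` target are all hypothetical; substitute your own validation labels, predicted probabilities, and recall target. Tune the threshold on the validation set, not the test set.

```python
import numpy as np
import torch
import torch.nn as nn


def pos_weight_from_labels(y):
    """Ratio of negatives to positives, a common starting point for pos_weight."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    return n_neg / n_pos


def sweep_thresholds(y_true, y_prob, min_recall):
    """Return (threshold, precision, recall) with the highest precision among
    thresholds whose recall is at least min_recall, or None if none qualifies."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    best = None
    for t in np.linspace(0.05, 0.95, 19):
        y_pred = (y_prob >= t).astype(int)
        tp = int(((y_pred == 1) & (y_true == 1)).sum())
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (float(t), precision, recall)
    return best


# Class weights: three negatives per positive gives pos_weight = 3.0.
y_demo = np.array([0, 0, 0, 1])
criterion = nn.BCEWithLogitsLoss(
    pos_weight=torch.tensor([pos_weight_from_labels(y_demo)])
)

# Threshold tuning on synthetic probabilities (assumption: stand-in for
# sigmoid outputs of your model on the validation set).
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 1000)
p_val = np.clip(0.35 * y_val + rng.normal(0.35, 0.2, 1000), 0.0, 1.0)
result = sweep_thresholds(y_val, p_val, min_recall=0.80)
print("Chosen (threshold, precision, recall):", result)
```

On a balanced dataset the negative-to-positive ratio is close to 1, so a `pos_weight` above 1 is then a deliberate choice to favor recall rather than a correction for imbalance. Note also that `BCEWithLogitsLoss` expects raw logits, so the model's final layer should not apply a sigmoid.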

*------- TODO 27: Write your answer here:*


#### Part C: How about combining techniques?

Your experiments have shown that individual techniques can reduce overfitting, but the largest improvements often come from combining them.

Propose one combined experiment you would run next. Justify your choice and state your expected results.

*------- TODO 28: Write your answer here:*


<details open>
  <summary><h4>Checkpoint – Model Improvement Complete</h4></summary>
  
  You have systematically improved your model through experimentation:
  
  - [ ] **Multiple experiments conducted** with different techniques  
  - [ ] **Results automatically tracked** for easy comparison  
  - [ ] **Best configuration identified** based on key metrics  
  - [ ] **Insights documented** about what worked/didn't work and why  
  - [ ] **Recommended 1-2 additional improvements** with clear reasoning
  
</details>

---
## Conclusion & Next Steps

Congratulations on completing this hands-on deep learning project! You’ve successfully applied neural network fundamentals to a structured data classification task, demonstrating not only technical competence but also an understanding of data preprocessing, model evaluation, and performance optimization.

The techniques and workflows you’ve built here extend far beyond this dataset. The same methods apply to credit risk assessment, customer churn prediction, quality control, and other structured-data problems. To continue your deep learning journey, explore new datasets, experiment with model architectures, and iterate on performance and interpretability.

#### Want to make this project portfolio-ready?

To showcase this project effectively:

1. Clean up your notebook: Write clear markdown, add an executive summary, and improve the documentation.
2. Publish on GitHub: Include model results and key visuals.
3. Prepare a short pitch: A 2-minute overview of results, impact, and challenges.
4. Highlight strengths: Real-world relevance, handling imbalance, model improvement, metric choice, and deployment readiness.

> **Remember: Machine learning is iterative.** Every model can be improved, every dataset hides deeper insights, and every project sharpens your intuition as a data scientist. Keep experimenting, stay curious, and keep building!