# Exercise 7: Optimize the Performance of Your DataLoaders

Building a data pipeline is one thing; making it efficient is another. Time to explore how different preprocessing methods affect your input values, then experiment with shuffling, batching, and parallel loading to measure performance.

> **Overview**: Discover how preprocessing choices and DataLoader settings impact training efficiency. Compare StandardScaler vs. MinMaxScaler, experiment with batch sizes and shuffling, and measure the performance improvements from advanced optimizations.
> 
> **Scenario**: You've built a churn prediction model that works well in testing. Now it needs to process customer updates efficiently each night for production deployment. Your goal: Optimize the data loading to handle real-world scale.
> 
> **Goal**: Build intuition for how preprocessing and DataLoader settings affect both data quality and training speed, then make informed decisions about which settings to use for different scenarios.
> 
> **Tools**: Python, PyTorch, Pandas, scikit-learn
> 
> **Estimated Time**: 15-20 minutes

## Step 1: Setup

Let's import our libraries and set up the environment.

In [None]:
# Import core libraries
import torch
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from datasets import load_dataset
import time
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"Pandas version: {pd.__version__}")

## Step 2: Load and split the data

> Note: This step uses the same dataset and the same processing (up to step 5) as [demo 7](/cd1818-intro-to-deep-learning/7-preprocessing/demo7-turning-raw-data-into-trainable-tensors.ipynb).

We'll use the [aai510-group1/telco-customer-churn](https://huggingface.co/datasets/aai510-group1/telco-customer-churn) dataset from Hugging Face, which contains 4,000+ real customer records from a telecommunications company.

The full dataset has 49 features, but we'll focus on the 9 key predictors listed in the code below, which are most relevant for understanding customer churn.

> **The preprocessing workflow ahead:**
> 
> Here's a compact view of the full pipeline we'll build:
> 
> 1. **Load data** → Convert Hugging Face dataset to Pandas DataFrame
> 2. **Create subset** → Select only the 9 features we are interested in (plus the `Churn` target)
> 3. **Handle missing values** → Fill Internet Type nulls with "None"
> 4. **Encode categories** → Convert Contract and Internet Type to numbers
> 5. **Split data** → Separate into train (70%) and validation (30%) sets
> 
> Each step prepares the data for the next, turning raw customer records into training-ready inputs.

**IMPORTANT: Feel free to skip if you've gone through the demo**. Just know we're loading the telco-customer-churn dataset (4,000+ customers), handling missing values, encoding categories, and splitting into train/val sets. 

In [None]:
# 1. Load dataset
dataset = load_dataset('aai510-group1/telco-customer-churn', split='train')
df = pd.DataFrame(dataset)

# 2. Select features
feature_columns = [
    'Age', 'Dependents', 'Tenure in Months', 'Contract', 'Internet Type',
    'Monthly Charge', 'Total Charges', 'Satisfaction Score', 'Number of Referrals'
]
target_column = 'Churn'
df = df[feature_columns + [target_column]]

# 3. Handle missing values
df['Internet Type'] = df['Internet Type'].fillna('None')

# 4. Encode categorical features
df_encoded = pd.get_dummies(df, columns=['Contract', 'Internet Type'], drop_first=True)

# 5. Split data
X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

print(f"✓ Data loaded and preprocessed")
print(f"  Training set: {X_train.shape[0]} samples")
print(f"  Validation set: {X_val.shape[0]} samples")
print(f"  Features: {X_train.shape[1]}")

> **Baseline established**: With clean, preprocessed data in hand, you're ready to experiment with different scaling methods and DataLoader settings to optimize pipeline performance.
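
Before moving on, here is a small sketch of what the encoding in sub-step 4 does, run on a toy frame rather than the real dataset. The values are made up for illustration; only the column names mirror the exercise. It shows that `drop_first=True` turns a column with k categories into k - 1 indicator columns.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the exercise.
toy = pd.DataFrame({
    "Monthly Charge": [70.5, 20.0, 99.9, 45.0],
    "Contract": ["Month-to-Month", "One Year", "Two Year", "One Year"],
    "Internet Type": ["Fiber Optic", None, "DSL", "Cable"],
})

# Fill missing Internet Type with the string "None" so it becomes its own category.
toy["Internet Type"] = toy["Internet Type"].fillna("None")

# drop_first=True drops the first (sorted) level of each encoded column,
# so k categories become k - 1 indicator columns.
toy_encoded = pd.get_dummies(toy, columns=["Contract", "Internet Type"], drop_first=True)

print(list(toy_encoded.columns))
print(toy_encoded.shape)   # (4, 6): 1 numeric + 2 Contract + 3 Internet Type columns
```

The dropped levels (`Month-to-Month` and `Cable` here) are represented implicitly by a row of all zeros in the corresponding indicator columns.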

## Step 3: Experiment with different scalers


You'll create a reusable function for building data pipelines, then use it with two different scalers (StandardScaler and MinMaxScaler) to compare how they affect your input values.

This modular approach lets you easily experiment with different preprocessing methods while keeping the rest of your pipeline consistent.

### 3.1: Build a reusable scaling pipeline

Create a function that handles the complete transformation: scaling → tensors → TensorDataset → DataLoader. This function will work with any sklearn scaler, letting you easily swap between StandardScaler, MinMaxScaler, or others.

In [None]:
# Build the reusable scaling pipeline

def create_scaled_pipeline(X_train, X_val, y_train, y_val, scaler, batch_size=32):
    """
    Create a DataLoader pipeline with scaled features.
    
    Hint: This function works with any sklearn scaler (StandardScaler, MinMaxScaler, etc.).
    The scaler should already be initialized before calling this function.

    IMPORTANT: Keep shuffling to False for both sets here as we'll experiment with shuffling in a later step!
    
    Args:
        X_train, X_val: Feature matrices
        y_train, y_val: Target arrays
        scaler: An sklearn scaler object (e.g., StandardScaler(), MinMaxScaler())
        batch_size: Number of samples per batch
    
    Returns:
        train_loader: DataLoader for training set
        val_loader: DataLoader for validation set
    """
    # TODO: Fit the scaler on training data and transform both sets
    # Hint: Remember the golden rule of deep learning!
    # Reference: https://scikit-learn.org/stable/modules/preprocessing.html
    X_train_scaled =  # Add your code here
    X_val_scaled =  # Add your code here
    
    # TODO: Convert NumPy arrays to PyTorch tensors
    # Hint: What dtype is expected by default?
    # Reference: https://pytorch.org/docs/stable/tensors.html
    X_train_tensor =  # Add your code here
    X_val_tensor =  # Add your code here
    y_train_tensor =  # Add your code here
    y_val_tensor =  # Add your code here
    
    # TODO: Create TensorDatasets to pair features with targets
    # Hint: TensorDataset wraps tensors so features and targets stay synchronized.
    # Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset
    train_dataset =  # Add your code here
    val_dataset =  # Add your code here
    
    # TODO: Create DataLoaders
    # Hint: Use the batch_size parameter passed to the function and keep shuffle=False for both loaders
    # Reference: https://docs.pytorch.org/docs/stable/data.html
    train_loader =  # Add your code here
    val_loader =  # Add your code here
    
    return train_loader, val_loader
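
One pitfall worth checking once your function runs is the tensor dtype. The sketch below uses random stand-in data (not the exercise data) and assumes the scaler output is a float64 NumPy array, which is what scikit-learn scalers typically return for this kind of input. It only illustrates how dtypes propagate; it is not the reference solution.

```python
import numpy as np
import torch

# Stand-in for scaler output: a float64 NumPy array (random values, for illustration only).
X_scaled = np.random.default_rng(0).normal(size=(4, 3))
print(X_scaled.dtype)                      # float64

# torch.tensor keeps the NumPy dtype unless told otherwise.
as_is = torch.tensor(X_scaled)
as_float32 = torch.tensor(X_scaled, dtype=torch.float32)
print(as_is.dtype, as_float32.dtype)       # torch.float64 torch.float32

# torch.nn.Linear parameters are float32 by default, so float32 inputs match them.
layer = torch.nn.Linear(3, 1)
print(layer.weight.dtype)                  # torch.float32
out = layer(as_float32)
print(out.shape)                           # torch.Size([4, 1])
```

Passing the float64 tensor to the same layer would raise a dtype mismatch error, which is why the dtype matters for the benchmarking model used in Step 5.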

### 3.2: Create baseline pipeline with StandardScaler

Let's create a baseline DataLoader using StandardScaler and measure its performance. You'll implement the standardization step and create the DataLoader with basic settings.

This baseline will serve as your reference point for all optimizations in the following steps.

In [None]:
# 1. Standardize the features using StandardScaler

# TODO: Initialize StandardScaler
# Hint: Use StandardScaler from sklearn.preprocessing.
# Reference: https://scikit-learn.org/stable/api/sklearn.preprocessing.html
standard_scaler =  # Add your code here

# Create StandardScaler pipeline
baseline_train_loader, baseline_val_loader = create_scaled_pipeline(
    X_train, X_val, y_train, y_val, 
    scaler=standard_scaler,
    batch_size=32
)

print("✓ StandardScaler pipeline created!")
print(f"  Batch size: 32")
print(f"  Number of training batches: {len(baseline_train_loader)}")
print(f"  Number of validation batches: {len(baseline_val_loader)}")

In [None]:
# 2. Inspect baseline data values in a batch to understand what StandardScaler produces and why it matters for training.

# TODO: Get one batch from the baseline DataLoader
# Think about how to get the first item from an iterator.
# Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
batch_features, batch_targets =  # Add your code here

# Examine the value range for the batch
print("StandardScaler feature statistics:")
print("=" * 60)
print(f"Mean across batch: {batch_features.mean(dim=0)[:5]}")
print(f"Std across batch:  {batch_features.std(dim=0)[:5]}")
print(f"Min value in batch: {batch_features.min().item():.3f}")
print(f"Max value in batch: {batch_features.max().item():.3f}")
print(f"\nFirst example (first 5 features): {batch_features[0, :5]}")

> **Examining the output**: Look at the mean and std values across the batch. 
> - *What range do most values fall into?* Compare this to the original feature scales you saw in the demo (Age: 20-80, Total Charges: 100-8,000). 
> - *How does StandardScaler change the distribution?* Keep these patterns in mind when you compare to MinMaxScaler in Step 3.3.

### 3.3: Create alternative pipeline with MinMaxScaler

Now create a second pipeline using MinMaxScaler instead. Compare the value distributions to see how different scaling methods affect your data.

In [None]:
# 1. Normalize the features using MinMaxScaler

# TODO: Initialize MinMaxScaler
# Hint: Use MinMaxScaler from sklearn.preprocessing.
# Reference: https://scikit-learn.org/stable/api/sklearn.preprocessing.html
minmax_scaler =  # Add your code here

# Create MinMaxScaler pipeline
normalized_train_loader, normalized_val_loader = create_scaled_pipeline(
    X_train, X_val, y_train, y_val, 
    scaler=minmax_scaler,
    batch_size=32
)

print("✓ MinMaxScaler pipeline created!")
print(f"  Batch size: 32")
print(f"  Number of training batches: {len(normalized_train_loader)}")
print(f"  Number of validation batches: {len(normalized_val_loader)}")

> **Same number of batches**: Notice that regardless of scaler choice, you still get 93 training batches and 40 validation batches. The scaler only affects the VALUES in your tensors, not the structure of your DataLoader.
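
The batch count depends only on the number of samples and the batch size. A minimal sketch on a hypothetical 100-sample toy dataset (not the churn data) shows the arithmetic, including what the optional `drop_last` argument changes:

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy dataset: 100 samples, 3 features, all zeros.
n_samples, batch_size = 100, 32
toy_dataset = TensorDataset(torch.zeros(n_samples, 3), torch.zeros(n_samples))

loader = DataLoader(toy_dataset, batch_size=batch_size)
loader_drop = DataLoader(toy_dataset, batch_size=batch_size, drop_last=True)

# Default: the last, smaller batch is kept, so the count is ceil(n / batch_size).
print(len(loader), math.ceil(n_samples / batch_size))   # 4 4
# drop_last=True discards the incomplete batch, so the count is n // batch_size.
print(len(loader_drop), n_samples // batch_size)        # 3 3

last_batch_features = list(loader)[-1][0]
print(last_batch_features.shape[0])                     # 4 (100 - 3 * 32)
```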

In [None]:
# 2. Examine how MinMaxScaler's output differs from StandardScaler and consider which might be more suitable for this dataset.

# TODO: Get one batch from the MinMaxScaler DataLoader
# Think about how to get the first item from an iterator.
# Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
norm_batch_features, norm_batch_targets =  # Add your code here

print("Normalized (MinMaxScaler) feature statistics:")
print("=" * 60)
print(f"Mean across batch: {norm_batch_features.mean(dim=0)[:5]}")
print(f"Std across batch:  {norm_batch_features.std(dim=0)[:5]}")
print(f"Min value in batch: {norm_batch_features.min().item():.3f}")
print(f"Max value in batch: {norm_batch_features.max().item():.3f}")
print(f"\nFirst example (first 5 features): {norm_batch_features[0, :5]}")
print(f"Batch target distribution: {norm_batch_targets.sum().item()}/{len(norm_batch_targets)} churned")

print("\n" + "=" * 60)
print("Comparison:")
print("=" * 60)
print(f"StandardScaler range: [{batch_features.min().item():.3f}, {batch_features.max().item():.3f}]")
print(f"MinMaxScaler range:   [{norm_batch_features.min().item():.3f}, {norm_batch_features.max().item():.3f}]")

> **Compare the distributions**: Look at how the two scalers produce very different value ranges and distributions. Consider: 
> - Which range might help gradient descent learn more effectively? 
> - Does having both negative and positive values matter? 
> - How do the mean and std values differ between the two approaches?
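
To make the comparison concrete, the sketch below reproduces both transformations by hand on a single made-up feature column and checks them against scikit-learn. The input values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# One hypothetical feature column (think of ages); the values are made up.
x = np.array([[20.0], [35.0], [50.0], [80.0]])

# StandardScaler: subtract the mean, divide by the population std (ddof=0).
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

# MinMaxScaler: shift and rescale so the fitted min maps to 0 and the fitted max to 1.
m_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
m_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn), np.allclose(m_manual, m_sklearn))
print(z_sklearn.ravel().round(3))   # centered on 0, negative and positive values
print(m_sklearn.ravel().round(3))   # squeezed into [0, 1]
```

Note that `batch_features.std(dim=0)` in PyTorch applies Bessel's correction by default, and a single batch is only a sample of the training set, so the per-batch statistics printed in this step will be close to, but not exactly, 0 and 1.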

### TODO - Analysis Question 1

**Based on the value ranges you observed:**

Which scaling method (StandardScaler or MinMaxScaler) produces more stable numeric ranges for training? Why might one be preferred over the other for this dataset? (2-3 sentences)

_Write your answer here:_

## Step 4: Add shuffling and measure efficiency

Now that you understand preprocessing choices, it's time to optimize for training. You'll extend your pipeline function to handle shuffling properly (for training and/or validation?), then measure how DataLoader settings affect loading speed.

### 4.1: Extend the pipeline to handle shuffling

Modify your pipeline function to shuffle the right data set(s).

In [None]:
# TODO: Extend the pipeline function to handle shuffling correctly

def create_scaled_pipeline_extended(X_train, X_val, y_train, y_val, scaler, **dataloader_kwargs):
    """
    Create a DataLoader pipeline with scaled features and proper shuffling.

    We now use `dataloader_kwargs` to pass all DataLoader parameters (e.g., batch_size, drop_last, ...), giving full 
    flexibility without modifying the function signature.
    
    Args:
        X_train, X_val: Feature matrices
        y_train, y_val: Target arrays
        scaler: An sklearn scaler object (e.g., StandardScaler(), MinMaxScaler())
        dataloader_kwargs: Extra arguments passed directly to DataLoader (including batch size)
    
    Returns:
        train_loader: DataLoader for training set (shuffled)
        val_loader: DataLoader for validation set (not shuffled)
    """
    # TODO: Create the complete pipeline from (X_train, X_val) to (train_loader, val_loader)
    # Hint: Feel free to reuse the code from `create_scaled_pipeline`, and just hardcode the right shuffling values for train and validation sets
    # Reference: https://discuss.pytorch.org/t/how-does-shuffle-in-data-loader-work/49756

    # Add your code here
    
    return train_loader, val_loader

print("✓ Extended pipeline function with shuffling created!")
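
The `**dataloader_kwargs` pattern described in the docstring can be sketched in isolation. The helper name `make_loader` and the toy dataset below are hypothetical and exist only to show how extra keyword arguments are forwarded to `DataLoader`.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def make_loader(dataset, shuffle, **dataloader_kwargs):
    """Hypothetical helper: fix `shuffle`, forward everything else to DataLoader."""
    return DataLoader(dataset, shuffle=shuffle, **dataloader_kwargs)

# Toy dataset: 10 samples with one feature each.
toy_dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.zeros(10))

# batch_size and drop_last are not named in the signature; they travel through **kwargs.
loader = make_loader(toy_dataset, shuffle=False, batch_size=4, drop_last=True)
print(loader.batch_size, loader.drop_last, len(loader))   # 4 True 2
```

Two things to keep in mind with this design: if the caller omits `batch_size`, `DataLoader` falls back to its default of 1, and if the function hardcodes `shuffle` while the caller also passes `shuffle=...`, Python raises a `TypeError` for the duplicated keyword.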

### 4.2: Create shuffled pipeline

Use your updated function to create DataLoaders with shuffling. Compare the batch contents to see how shuffling changes which examples appear together.

In [None]:
# Create a DataLoader with correct shuffling enabled
# Observe how shuffling affects the order of examples

# TODO: Choose your preferred scaler
shuffled_scaler =  # Add your code

# Create DataLoaders with shuffling enabled
shuffled_train_loader, shuffled_val_loader = create_scaled_pipeline_extended(
    X_train, X_val, y_train, y_val,
    scaler=shuffled_scaler,
    batch_size=32
)
print("✓ Shuffled DataLoader created!")

# TODO: Compare first batch from baseline (no shuffle) vs. shuffled
# HINT: Take the first batch by iterating over each loader
# Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
baseline_batch, _ =  # Add your code
shuffled_batch, _ =  # Add your code

print("\nFirst example from baseline (no shuffle):")
print(baseline_batch[0, :5])
print("\nFirst example from shuffled loader:")
print(shuffled_batch[0, :5])
print("\nNotice how shuffling changes which examples appear first!")

> **Observing the shuffle effect**: The feature values should look different because shuffling randomizes the order of examples. Each time you iterate through the shuffled loader, you'll see a different order. This prevents the model from learning patterns based on the sequence of examples rather than the actual features.
> 
> Consider: What would happen if all churned customers appeared consecutively in your training data? How would that affect the model's learning?
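
The effect is easiest to see on a toy dataset whose only feature is the row index. The sketch below is illustrative only; the seeded `generator` argument is an optional `DataLoader` parameter used here to make the shuffle reproducible.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy dataset whose single feature is just the row index 0..7.
toy_dataset = TensorDataset(torch.arange(8).unsqueeze(1), torch.zeros(8))

def epoch_order(loader):
    """Return the row indices in the order the loader yields them."""
    return [int(v) for features, _ in loader for v in features.flatten()]

ordered = DataLoader(toy_dataset, batch_size=4, shuffle=False)
# A seeded generator makes the sequence of shuffles reproducible across runs.
shuffled = DataLoader(toy_dataset, batch_size=4, shuffle=True,
                      generator=torch.Generator().manual_seed(0))

first_pass = epoch_order(shuffled)
second_pass = epoch_order(shuffled)

print(epoch_order(ordered))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(first_pass)             # a permutation of 0..7
print(second_pass)            # reshuffled again, typically a different permutation
```

Every pass still visits each example exactly once; only the order (and therefore the composition of each batch) changes.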

### TODO - Analysis Question 2

**Explain your shuffling strategy in 2-3 sentences.**

Think about what happens if the model sees all "churned" customers first, then all "stayed" customers. How would this affect learning vs. evaluation?

_Write your answer here:_

## Step 5: Experiment with batch sizes

Now that you understand preprocessing and shuffling, let's explore how batch size affects training efficiency. You'll measure how different batch sizes impact iteration speed and understand the trade-offs between throughput and memory usage.

### 5.1: Check device availability

First, let's check whether we're running on GPU or CPU. This affects which optimizations will be most beneficial.

In [None]:
# 1. Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

> **Why device matters for DataLoader optimization**: GPU training benefits significantly from larger batch sizes (better parallelism) and pin_memory (faster CPU→GPU transfer).
> On CPU, these optimizations tend to have minimal impact: `pin_memory`, for example, only matters when batches are copied to a GPU. Your exact results will vary based on your hardware, but the principles remain the same!

### 5.2: Create benchmarking function

Let's build a function that accurately measures DataLoader performance with proper warmup and averaging. This function will simulate realistic training by including a dummy model forward pass.

In [None]:
# Define helper to test batch sizes
def test_batch_size(batch_size, loader, device=device, num_warmup=2, num_runs=3):
    """
    Time multiple iterations through a DataLoader and return average.
    Works with both CPU and GPU.
    
    Args:
        batch_size: Batch size (for reporting)
        loader: DataLoader to benchmark
        device: torch.device for GPU/CPU
        num_warmup: Number of warmup iterations (to initialize CUDA kernels)
        num_runs: Number of timed runs to average
    
    Returns:
        mean_time, std_time: Average and standard deviation of iteration times
    """
    times = []

    # Create a simple dummy model to simulate real training.
    # The 12 input features must match X_train.shape[1] (the column count after one-hot encoding).
    dummy_model = torch.nn.Sequential(
        torch.nn.Linear(12, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 1)
    ).to(device)

    # Warmup runs (important for CUDA initialization, harmless on CPU)
    for _ in range(num_warmup):
        for batch_features, batch_targets in loader:
            batch_features = batch_features.to(device)
            batch_targets = batch_targets.to(device)
            _ = dummy_model(batch_features)  # Simulate forward pass
            if device.type == 'cuda':
                torch.cuda.synchronize()  # Wait for GPU operations to complete
    
    # Timed runs
    for _ in range(num_runs):
        start = time.perf_counter()  # Monotonic, high-resolution clock suited to benchmarking
        for batch_features, batch_targets in loader:
            batch_features = batch_features.to(device)
            batch_targets = batch_targets.to(device)
            _ = dummy_model(batch_features)  # Simulate forward pass
            if device.type == 'cuda':
                torch.cuda.synchronize()  # Wait for GPU to finish
        iteration_time = time.perf_counter() - start
        times.append(iteration_time)
    
    return np.mean(times), np.std(times)

> **How the benchmark works:**
> 
> This function measures realistic DataLoader performance by simulating actual training:
> 
> 1. **Dummy model**: Creates a simple 2-layer neural network to simulate forward passes (without this, we'd only measure Python iteration overhead, not actual data processing)
> 2. **Warmup runs**: The first iterations are typically slower because of one-time costs such as CUDA initialization and memory allocation, so we run them separately to avoid skewing the measurements
> 3. **Multiple timed runs**: Averages 3 runs to reduce variance from system noise (background processes, OS scheduling)
> 4. **GPU synchronization**: `torch.cuda.synchronize()` ensures we wait for GPU operations to complete before stopping the timer (GPU operations are asynchronous!)
> 
> The result: accurate measurements of how fast your DataLoader can feed data to a training model, not just how fast Python can iterate.

### 5.3: Compare different batch sizes

Test how different batch sizes _(small, medium, large)_ affect iteration speed and the number of batches. Measure the trade-offs between throughput and memory usage.

In [None]:
# Create DataLoaders with different batch sizes and measure iteration time

# TODO: Choose your batch sizes
# HINT: Compare small, medium (32), and large batch sizes
batch_sizes = []  # Add your choices here

# Test different batch sizes
results = {}
print("Testing different batch sizes:")
print("=" * 60)

for bs in batch_sizes:
    # TODO: Create DataLoader with specified batch_size
    # HINT: 
    # - Should you use StandardScaler or MinMaxScaler?
    # - Should you use `create_scaled_pipeline` or `create_scaled_pipeline_extended`?
    scaler =  # Add your code here
    train_loader, val_loader =  # Add your code here

    # Time both loaders
    train_mean, train_std = test_batch_size(bs, train_loader, device)
    val_mean, val_std = test_batch_size(bs, val_loader, device)
    
    results[bs] = {
        'train_time': train_mean,
        'train_std': train_std,
        'val_time': val_mean,
        'val_std': val_std,
        'num_train_batches': len(train_loader),
        'num_val_batches': len(val_loader),
    }
    
    print(f"\nBatch size {bs}:")
    print(f"  Training batches: {results[bs]['num_train_batches']}")
    print(f"  Training time: {results[bs]['train_time']:.4f} ± {results[bs]['train_std']:.4f} seconds")
    print(f"  Time per batch: {results[bs]['train_time']/len(train_loader)*1000:.2f} ms/batch")
    print(f"  Validation batches: {results[bs]['num_val_batches']}")
    print(f"  Validation time: {results[bs]['val_time']:.4f} ± {results[bs]['val_std']:.4f} seconds")
    print(f"  Time per batch: {results[bs]['val_time']/len(val_loader)*1000:.2f} ms/batch")

> **Reading the metrics**: "Time per batch" typically increases with batch size (because each batch has more work), but "total time" typically decreases (because there are fewer iterations).
> 
> For production, we care about total time per epoch!


### TODO - Analysis Question 3

**Based on your batch size experiments, answer the following:**

1. **Which batch size gave you the best performance (fastest total time)?** Look at the total time values and identify which batch size was fastest.

2. **Why does larger batch size generally improve speed?** Think about the relationship between number of batches and Python iteration overhead.

3. **What limits how large you can make the batch size?** Consider both hardware constraints and training considerations.

_Write your answer here:_

> **Other DataLoader settings for efficiency**: DataLoaders have additional settings that can speed up training on larger datasets. Especially important are:
> - `num_workers`: How many CPU worker processes load data in parallel (0 = load in the main process, no parallelism).
> - `pin_memory`: When `True`, batches are placed in pinned (page-locked) host memory, which speeds up CPU→GPU copies. It only helps when training on a GPU.
> 
> These advanced optimizations are most beneficial when:
> - Loading data from disk (images, videos, large files)
> - Performing real-time data augmentation or preprocessing
> - Working with large datasets (100K+ samples)
> - GPU is waiting idle for data to arrive
> 
> On small, pre-loaded datasets like ours, the multiprocessing overhead may exceed the benefits. But for production models on large datasets, these settings can reduce training time substantially; the exact gain depends on your hardware and on how expensive data loading is.
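
As a rough sketch of how these settings could be wired up, the hypothetical helper below picks `DataLoader` arguments based on the device. The helper name, the toy dataset, and the default of `num_workers=0` are assumptions for illustration; suitable values depend on your machine and data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def suggested_loader_kwargs(device, num_workers=0):
    """Hypothetical helper: pick DataLoader settings based on the device."""
    kwargs = {
        "num_workers": num_workers,            # 0 loads data in the main process
        "pin_memory": device.type == "cuda",   # pinned memory only helps GPU transfers
    }
    if num_workers > 0:
        # Keep worker processes alive between epochs instead of restarting them.
        kwargs["persistent_workers"] = True
    return kwargs

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_dataset = TensorDataset(torch.randn(64, 12), torch.zeros(64))

loader = DataLoader(toy_dataset, batch_size=16, shuffle=True,
                    **suggested_loader_kwargs(device))

for features, targets in loader:
    # non_blocking=True can overlap the copy with compute when the source is pinned.
    features = features.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

print(len(loader))   # 4
```

To try parallel loading, pass a positive `num_workers` to the helper and time the result with `test_batch_size`; on a dataset this small, expect little or no improvement.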

## Step 6: Collect your thoughts

Now that you've experimented with different preprocessing methods and DataLoader settings, let's synthesize what you've learned by answering these questions based on your experiments.


### TODO - Analysis Question 4

**Which preprocessing method (StandardScaler vs. MinMaxScaler) would you choose for this dataset, and why?**

_Write your answer here:_


### TODO - Analysis Question 5

**What's the most important lesson you learned about DataLoader optimization?**

_Write your answer here:_

### TODO - Analysis Question 6

**If you had to train on a dataset 100x larger than this one, which three settings would you change first and why?**

_Write your answer here:_


## Conclusion

Congratulations! You've explored how preprocessing choices and DataLoader settings impact both data quality and training efficiency.

**What you've accomplished:**

- [x] **Compared preprocessing methods** - Explored how StandardScaler vs. MinMaxScaler affect value ranges and stability
- [x] **Experimented with shuffling** - Discovered why training data needs shuffling but validation doesn't
- [x] **Tested batch size trade-offs** - Measured how different batch sizes affect iteration speed and memory usage
- [x] **Discovered advanced optimizations** - Learned about num_workers and pin_memory for faster data loading

**Critical insights:**

- **Preprocessing matters for training stability**: StandardScaler's zero-centered distribution typically works better for gradient descent than MinMaxScaler's [0,1] range
- **Shuffling prevents order-based learning**: Without shuffling, models can learn spurious patterns based on data order rather than true features
- **Batch size balances speed and memory**: Larger batches are faster per epoch but use more memory; smaller batches update more frequently but take longer
- **Advanced optimizations scale with dataset size**: num_workers and pin_memory show minimal gains on small datasets but can dramatically speed up training on large-scale data

Efficient data handling isn't about finding one "best" setting; it's about understanding the trade-offs and choosing configurations that match your specific constraints (dataset size, hardware, training time budget). Preprocessing, shuffling, batching, and loader optimizations interact to shape training stability and efficiency, giving you the tools to design data pipelines that scale.

> **Next steps to explore**: Try training your model with different preprocessing choices, batch sizes, and optimizers, and test how your data pipeline holds up as you increase the dataset size.