![Thinkube AI Lab](../icons/tk_full_logo.svg)

# PyTorch GPU Training 🔥

This notebook walks through training a neural network on a GPU with PyTorch:
- Load and prepare datasets
- Define model architectures
- Run GPU-accelerated training
- Track experiments with MLflow
- Save and load checkpoints

## Setup and Imports

In [None]:
# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import mlflow
import os

# TODO: Set device (cuda if available)
# TODO: Set random seeds for reproducibility
# TODO: Display PyTorch and CUDA versions
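
One way the setup TODOs could be filled in is sketched below. The seed value `SEED = 42` is an arbitrary choice for illustration, not something this notebook prescribes.

```python
import random

import torch

# Pick the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Seed the Python and PyTorch random number generators.
SEED = 42  # arbitrary illustrative value
random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA version (build): {torch.version.cuda}")  # None on CPU-only builds
print(f"Device: {device}")
if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```

Seeding makes runs more repeatable, but some CUDA kernels are non-deterministic, so bit-identical results are not guaranteed.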

## Load and Prepare Dataset

Using CIFAR-10 for this example:

In [None]:
# Prepare CIFAR-10 dataset

# TODO: Define transforms (normalize, augmentation)
# TODO: Load training dataset
# TODO: Load test dataset
# TODO: Create DataLoaders with pin_memory=True
# TODO: Set num_workers for faster loading
# TODO: Display dataset statistics

## Define Model Architecture

Simple CNN for image classification:

In [None]:
# Define CNN model

# TODO: Create CNN class with conv layers
# TODO: Add batch normalization
# TODO: Add dropout for regularization
# TODO: Define forward pass
# TODO: Instantiate model
# TODO: Move model to GPU
# TODO: Display model summary
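
A small CNN along the lines of these TODOs is sketched below. The class name `SimpleCNN`, the layer widths, and the dropout rate are illustrative assumptions. Nothing here is tuned for CIFAR-10.

```python
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    """Two conv blocks with batch norm, then dropout and a linear classifier."""

    def __init__(self, num_classes=10, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(64 * 8 * 8, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)

# A plain-PyTorch summary: the module tree plus the trainable parameter count.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(model)
print(f"Trainable parameters: {num_params:,}")
```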

## Configure Training

Loss function and optimizer:

In [None]:
# Setup training components

# TODO: Define loss function (CrossEntropyLoss)
# TODO: Define optimizer (Adam or SGD)
# TODO: Define learning rate scheduler
# TODO: Set hyperparameters (epochs, lr, batch_size)
# TODO: Display training configuration
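
The training components could be wired up as in the sketch below. All hyperparameter values in `config` are placeholders chosen for illustration, and the linear `model` is a stand-in so the cell runs on its own.

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder hyperparameters; adjust for the actual experiment.
config = {"epochs": 20, "lr": 1e-3, "batch_size": 128, "weight_decay": 1e-4}

# Stand-in model so this cell is self-contained.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    model.parameters(), lr=config["lr"], weight_decay=config["weight_decay"]
)
# Cosine annealing is one scheduler option; StepLR or OneCycleLR are alternatives.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=config["epochs"])

print("Training configuration:")
for name, value in config.items():
    print(f"  {name}: {value}")
print(f"  loss: {type(criterion).__name__}")
print(f"  optimizer: {type(optimizer).__name__}")
print(f"  scheduler: {type(scheduler).__name__}")
```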

## MLflow Experiment Setup

Track experiments with MLflow:

In [None]:
# Setup MLflow tracking

# TODO: Set MLflow tracking URI from environment
# TODO: Set experiment name
# TODO: Start MLflow run
# TODO: Log hyperparameters
# TODO: Log model architecture details
# TODO: Display run ID

## Training Loop

Train the model on GPU:

In [None]:
# Training loop
from tqdm import tqdm

# TODO: Loop through epochs
# TODO: For each batch:
#       - Move data to GPU
#       - Forward pass
#       - Calculate loss
#       - Backward pass
#       - Optimizer step
#       - Track metrics
# TODO: Log metrics to MLflow each epoch
# TODO: Run validation after each epoch
# TODO: Update learning rate scheduler
# TODO: Display progress bar
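
The per-batch steps listed above can be sketched as a single function. `train_one_epoch` is a hypothetical helper, and the random tensors and linear model are stand-ins so the cell runs without CIFAR-10. MLflow logging, validation, and a `tqdm` progress bar are left out to keep the sketch self-contained; they would slot into the epoch loop.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over the loader and return (mean loss, accuracy)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        # Move each batch to the device just before it is used.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)  # clear gradients from the last step
        logits = model(images)                 # forward pass
        loss = criterion(logits, labels)       # calculate loss
        loss.backward()                        # backward pass
        optimizer.step()                       # update weights

        # Accumulate sample-weighted metrics.
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Synthetic stand-in data with CIFAR-10's shape.
data = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

history = []
for epoch in range(3):
    train_loss, train_acc = train_one_epoch(model, loader, criterion, optimizer, device)
    scheduler.step()  # update the learning rate once per epoch
    history.append(train_loss)
    print(f"epoch {epoch + 1}: loss={train_loss:.4f} acc={train_acc:.3f}")
```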

## Validation

Evaluate on test set:

In [None]:
# Validation function

# TODO: Set model to eval mode
# TODO: Disable gradient computation with torch.no_grad()
# TODO: Loop through test data
# TODO: Calculate accuracy and loss
# TODO: Log validation metrics to MLflow
# TODO: Return to train mode
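
A validation function covering these steps might look like the sketch below. `evaluate` is a hypothetical name, and the synthetic data and linear model are stand-ins. Logging the returned values with `mlflow.log_metric` is left to the caller.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()  # no gradients are needed for evaluation
def evaluate(model, loader, criterion, device):
    """Return (mean loss, accuracy) and restore the model's previous mode."""
    was_training = model.training
    model.eval()  # disable dropout and use running batch-norm statistics
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)
        total_loss += criterion(logits, labels).item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    model.train(was_training)  # return to train mode if that is where we started
    return total_loss / seen, correct / seen


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Synthetic stand-in for the test set.
test_data = TensorDataset(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,)))
test_loader = DataLoader(test_data, batch_size=64)

model = nn.Sequential(nn.Flatten(), nn.Dropout(0.3), nn.Linear(3 * 32 * 32, 10)).to(device)
val_loss, val_acc = evaluate(model, test_loader, nn.CrossEntropyLoss(), device)
print(f"val_loss={val_loss:.4f} val_acc={val_acc:.3f}")
```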

## Save Checkpoints

Save model weights and training state:

In [None]:
# Save model checkpoint

# TODO: Create checkpoint dictionary
#       - model state_dict
#       - optimizer state_dict
#       - epoch number
#       - best accuracy
# TODO: Save to file
# TODO: Log model artifact to MLflow
# TODO: Display save location
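
The checkpoint dictionary described above can be sketched as follows. `save_checkpoint`, the dictionary key names, the file name, and the `epoch` and `best_acc` values are all illustrative assumptions, and the tiny linear model is a stand-in.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch, best_acc):
    """Write model and optimizer state plus bookkeeping values to `path`."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "epoch": epoch,
            "best_acc": best_acc,
        },
        path,
    )
    return path


# Stand-in model and optimizer so the cell runs on its own.
model = nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder epoch and accuracy values; a temp directory keeps the demo tidy.
ckpt_path = save_checkpoint(
    Path(tempfile.mkdtemp()) / "checkpoints" / "epoch_005.pt",
    model, optimizer, epoch=5, best_acc=0.0,
)
print(f"Checkpoint saved to: {ckpt_path}")
```

With an active MLflow run, `mlflow.log_artifact(str(ckpt_path))` would attach the file to that run.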

## Load Checkpoint

Resume training from checkpoint:

In [None]:
# Load checkpoint

# TODO: Load checkpoint file
# TODO: Restore model state
# TODO: Restore optimizer state
# TODO: Get epoch number and metrics
# TODO: Display loaded checkpoint info
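
Restoring from a checkpoint might look like the sketch below. `load_checkpoint` and the dictionary keys are assumptions and must match whatever keys were used when saving. The cell writes a small checkpoint first so it can run on its own.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn


def load_checkpoint(path, model, optimizer=None, device="cpu"):
    """Restore model (and optionally optimizer) state; return (epoch, best_acc)."""
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine.
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"], checkpoint["best_acc"]


# Write a small checkpoint with placeholder bookkeeping values.
source = nn.Linear(8, 2)
source_opt = torch.optim.Adam(source.parameters(), lr=1e-3)
ckpt_path = Path(tempfile.mkdtemp()) / "demo.pt"
torch.save(
    {
        "model_state_dict": source.state_dict(),
        "optimizer_state_dict": source_opt.state_dict(),
        "epoch": 3,
        "best_acc": 0.0,
    },
    ckpt_path,
)

# Load it into a freshly initialized model and optimizer.
restored = nn.Linear(8, 2)
restored_opt = torch.optim.Adam(restored.parameters(), lr=1e-3)
epoch, best_acc = load_checkpoint(ckpt_path, restored, restored_opt)
print(f"Loaded checkpoint from epoch {epoch} (best_acc={best_acc})")
```

Training would then resume from `epoch + 1`.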

## Monitor GPU Utilization

Track GPU usage during training:

In [None]:
# GPU utilization monitoring

# TODO: Get GPU memory allocated
# TODO: Get GPU memory reserved
# TODO: Calculate memory utilization percentage
# TODO: Plot memory usage over time
# TODO: Log to MLflow
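
A memory snapshot along these lines is sketched below. `gpu_memory_snapshot` is a hypothetical helper, and its percentage is reserved memory over total device memory, which measures memory pressure rather than compute utilization. On a machine without CUDA it returns zeros.

```python
import torch


def gpu_memory_snapshot(device_index=0):
    """Return allocated, reserved, and total GPU memory in MiB, plus reserved %."""
    if not torch.cuda.is_available():
        return {"allocated_mb": 0.0, "reserved_mb": 0.0,
                "total_mb": 0.0, "reserved_pct": 0.0}
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated(device_index)  # bytes held by tensors
    reserved = torch.cuda.memory_reserved(device_index)    # bytes held by the caching allocator
    total = torch.cuda.get_device_properties(device_index).total_memory
    return {
        "allocated_mb": allocated / mib,
        "reserved_mb": reserved / mib,
        "total_mb": total / mib,
        "reserved_pct": 100.0 * reserved / total,
    }


snapshot = gpu_memory_snapshot()
for name, value in snapshot.items():
    print(f"{name}: {value:.1f}")
```

Calling this once per epoch and appending the result to a list gives a series to plot or to pass to `mlflow.log_metric`.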

## Inference

Use trained model for predictions:

In [None]:
# Inference on sample images

# TODO: Load sample images
# TODO: Preprocess images
# TODO: Run model in eval mode
# TODO: Get predictions
# TODO: Display results with confidence scores
# TODO: Visualize predictions
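
Inference with confidence scores could be sketched as below. `predict` is a hypothetical helper, the model is an untrained stand-in, and the random input tensors are assumed to be already preprocessed, so the printed predictions are meaningless beyond showing the data flow. The class names follow CIFAR-10's standard label order.

```python
import torch
import torch.nn as nn

CIFAR10_CLASSES = ("airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck")


@torch.no_grad()
def predict(model, images, device, top_k=3):
    """Return the top-k (class name, probability) pairs for each image."""
    model.eval()
    probs = torch.softmax(model(images.to(device)), dim=1)
    conf, idx = probs.topk(top_k, dim=1)
    return [
        [(CIFAR10_CLASSES[i], c) for i, c in zip(row_idx.tolist(), row_conf.tolist())]
        for row_idx, row_conf in zip(idx, conf)
    ]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Untrained stand-in model and random "images" with CIFAR-10's shape.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
samples = torch.randn(2, 3, 32, 32)

results = predict(model, samples, device)
for n, top in enumerate(results):
    summary = ", ".join(f"{name} ({score:.1%})" for name, score in top)
    print(f"image {n}: {summary}")
```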

## Best Practices

- ✅ Use DataLoader with pin_memory and multiple workers
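- ✅ Move each batch to the GPU inside the training loop instead of loading the whole dataset onto the device
- ✅ Use torch.no_grad() for validation and inference
- ✅ Clear gradients with optimizer.zero_grad() before each backward pass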
- ✅ Save checkpoints regularly
- ✅ Track experiments with MLflow
- ✅ Monitor GPU memory usage
- ✅ Use mixed precision for larger models (next notebook)

## Clean Up

In [None]:
# Clean up GPU memory

# TODO: Delete model and data
# TODO: Clear CUDA cache
# TODO: End MLflow run

## Next Steps

Continue with:
- **03-distributed-training.ipynb** - Multi-GPU training with DDP
- **04-transformers-training.ipynb** - Train transformer models
- **05-mlops-integration.ipynb** - Complete MLOps workflow