
## Getting Started Tips

1. **Start Small**: Begin with CIFAR-10 instead of ImageNet for faster iteration
2. **Test Each Phase**: Run verification functions after implementing each phase
3. **Debug Shapes**: Print tensor shapes frequently to catch dimension mismatches early
4. **Use Small Batches**: Start with small batch sizes to avoid memory issues

Work through each function stub systematically. The hints give you the conceptual understanding, but you'll need to research the specific PyTorch APIs and mathematical implementations. Come back with questions about specific functions when you get stuck!




# Phase 1: Environment Setup

**Detailed Hint:** You need to establish your development environment with the right deep learning framework. Think about what libraries you'll need for neural networks, computer vision operations, mathematical computations, and data handling. Also consider GPU support if available. The framework choice will determine your entire implementation approach - PyTorch tends to be more research-friendly and closer to how papers describe things.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch.utils.data as data
from torch.utils.data import DataLoader
import torch.nn
import torch.nn.functional as F
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau

import time
import os

import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import random

# TODO: Fill in the necessary imports
def setup_environment():
    """
    Set up all the necessary imports and check for GPU availability.
    Hint: You'll need torch, torchvision, numpy, and possibly matplotlib for visualization.
    """
    # Import statements go here
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    import torchvision.datasets as datasets
    import torch.utils.data as data
    from torch.utils.data import DataLoader
    import torch.nn
    import torch.nn.functional as F
    from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau

    import time



    import numpy as np
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    import random


    # Check if CUDA is available
    device = None  # TODO: Determine if you should use 'cuda' or 'cpu'

    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    print(f"Using device: {device}")
    return device

def set_random_seeds(seed=42):
    """
    Set random seeds for reproducibility across different libraries.
    Hint: Neural networks involve randomness in initialization, data shuffling, etc.
    You want consistent results across runs for debugging.
    """
    # TODO: Set seeds for torch, numpy, and random module

    torch.manual_seed(seed)                    # PyTorch CPU random numbers
    torch.cuda.manual_seed(seed)               # PyTorch GPU random numbers
    torch.cuda.manual_seed_all(seed)           # For multi-GPU setups
    np.random.seed(seed)                       # NumPy random numbers
    random.seed(seed)

    pass


setup_environment()
set_random_seeds()

Using device: cuda




# Phase 2: Data Preprocessing \& Augmentation

**Detailed Hint:** AlexNet's power comes partly from its data augmentation strategy. You need to think about how to transform images during training vs testing. During training, you want randomness (different crops, flips) to artificially expand your dataset. During testing, you want consistency and thoroughness (systematic crops). The original paper mentions specific image sizes: input images are 256×256, but the network expects 224×224 patches. Consider what happens to the "extra" 32 pixels on each side.

In [2]:
def create_basic_transforms():
    """
    Create the basic image transformations for AlexNet.
    Hint: Think about the paper's mention of 256×256 input images and 224×224 patches.
    What mathematical operations convert images to the right format for neural networks?
    """
    # TODO: Compose transformations for training

    train_transform = transforms.Compose([

         transforms.Resize(256),
         transforms.RandomResizedCrop(224),
         transforms.RandomHorizontalFlip(p=0.5),

         #as 3 transformations were listed in paper, we have done all three --> now convert to tensor and move ahead
         transforms.ToTensor(),


         #normalization is apparrently a standard practice --> ON THE OTHER HAND,
         # the numbers chosen are a standard practice for ImageNet

         transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

    ])



    # TODO: Compose transformations for validation/testing
    val_transform = transforms.Compose([
        transforms.Resize(256),

        #we choose centre crop bcoz it is deterministic which is what we want in validation set -->
        #or else each time our val acc will be different due to flips and randomk crops -->
        #hence we also skop the flips

        transforms.CenterCrop(224),

        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

    ])

    test_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
    ])

    return train_transform, val_transform, test_transform


In [3]:

def extract_ten_test_crops(image_tensor):
    """
    Extract the 10 crops used during AlexNet testing phase.
    Hint: The paper mentions "four corner patches and the center patch" plus their horizontal reflections.
    Think about where these 5 locations would be in a 256×256 image when extracting 224×224 patches.
    """

    """
    Input: image_tensor: PyTorch tensor of shape [C, H, W]
    --> where C (Channels) =3 (RGB)
    -->  H (Height) = W (Width) = 256

    Output: list of 10 image tensors of shape [C, 224, 224]
    """
    crops = []

    # Verifying image_tensor shape is correct

    assert image_tensor.dim() == 3, f"Expected 3D tensor [C,W,H], got {image_tensor.dim()} D"

    assert image_tensor.shape[-2:] == (256, 256), f"Expected 256x256 image, got with shape {image_tensor.shape[-2:]} "

    # TODO: Extract 4 corner crops (top-left, top-right, bottom-left, bottom-right)


# first is colon bcoz we are taking all channels (RGB)

    top_left = image_tensor[:, 0:224, 0:224]
    crops.append(top_left)

    # Top-right corner crop
    top_right = image_tensor[:, 0:224, 32:256]  # 256-224=32, so start at col 32
    crops.append(top_right)

    # Bottom-left corner crop
    bottom_left = image_tensor[:, 32:256, 0:224]  # Start at row 32
    crops.append(bottom_left)

    # Bottom-right corner crop
    bottom_right = image_tensor[:, 32:256, 32:256]  # Start at (32, 32)
    crops.append(bottom_right)



    # TODO: Extract center crop

    center = image_tensor[:, 16:240, 16:240]
    crops.append(center)





    # TODO: Create horizontal flips of all 5 crops


    #writing 5 bcoz if that is abset then at every iteration,
    # the appended crop will also be considered --> infinite loop
    for crop in crops[:5]:

# ---------->>>>  IMPORTANT -> torch.flip() with dims=[-1] flips along the last dimension (width)


      flipped_crop = torch.flip(crop, dims=[-1])
      crops.append(flipped_crop)


      #verif=ying if we have 10 crops or other count
    assert len(crops) == 10, f"Expected 10 crops, got {len(crops)}"

      # Verify all crops have correct shape
    for i, crop in enumerate(crops):
        assert crop.shape == (3, 224, 224), f"Crop {i} has wrong shape: {crop.shape}"

    return crops  # Should return list of 10 image tensors


In [4]:

def implement_pca_color_augmentation(dataset_path):
    """
    Optional advanced function: Implement PCA-based color augmentation.
    Hint: You need to collect RGB pixel values from your entire dataset,
    compute the covariance matrix, find eigenvectors/eigenvalues,
    then create a transform that adds random combinations of these principal components.
    """


    # This is advanced - skip if you want to focus on core architecture first

    # we willlc ome back later

    pass


# Phase 3: Dataset Loading

**Detailed Hint:** You need to create data loaders that can efficiently feed batches of images to your network during training. Think about memory management, shuffling strategies, and how to handle different dataset formats. The original AlexNet used ImageNet, but you might start with CIFAR-10 for faster experimentation. Consider what batch size makes sense for your hardware constraints.


In [5]:
def create_data_loaders(batch_size=128):
    """
    Create PyTorch DataLoaders for training and validation.
    Hint: You need to handle the directory structure of your dataset.
    Think about what arguments DataLoader needs for efficient training (shuffling, number of workers).
    """
    train_transform, val_transform, test_transform = create_basic_transforms()

    # TODO: Create dataset objects using torchvision.datasets
    train_dataset = datasets.CIFAR10(
        root = "./data",
        train = True,
        transform=train_transform,
        download=True
    )


    val_dataset = datasets.CIFAR10(
        root = "./data",
        train = False,
        transform=val_transform,
        download=True
    )



    # Print dataset information
    print(f"Training dataset size: {len(train_dataset)} images")
    print(f"Validation dataset size: {len(val_dataset)} images")
    print(f"Number of classes: {len(train_dataset.classes)}")
    print(f"Class names: {train_dataset.classes}")



    test_dataset = datasets.CIFAR10(
        root="./data",
        train=False,
        transform=test_transform,
        download=True
    )





    # TODO: Create DataLoader objects
    train_loader = DataLoader(
        dataset = train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=2,
        pin_memory=True,
        drop_last=False
    )



    val_loader = DataLoader(
        dataset = val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=2,
        pin_memory=True,
        drop_last=False
    )


    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=32,
        shuffle=False,
        num_workers=2
    )


    print(f"Training batches per epoch: {len(train_loader)}")
    print(f"Validation batches: {len(val_loader)}")

    return train_loader, val_loader, test_loader


In [6]:


def verify_data_loading(data_loader):
    """
    Test function to verify your data loading works correctly.
    Hint: Grab a batch, check the shapes, verify the data types and value ranges.
    Print out some statistics to ensure everything looks reasonable.
    """
    # TODO: Get one batch from the data loader

    data_iter = iter(data_loader)
    images, labels = next(data_iter)




    # Print shapes and data types
    print(f"Batch images shape: {images.shape}")  # Should be [batch_size, 3, 224, 224]
    print(f"Batch labels shape: {labels.shape}")  # Should be [batch_size]
    print(f"Images data type: {images.dtype}")    # Should be torch.float32
    print(f"Labels data type: {labels.dtype}")    # Should be torch.int64

    # Check value ranges
    print(f"Image pixel value range: [{images.min():.3f}, {images.max():.3f}]")
    print(f"Label range: [{labels.min()}, {labels.max()}]")



    # TODO: Print shapes, min/max values, data types
    # TODO: Maybe visualize a few images to verify augmentations work
    pass



In [7]:

train_loader, val_loader, test_loader = create_data_loaders()
verify_data_loading(train_loader)
verify_data_loading(val_loader)
verify_data_loading(test_loader)

100%|██████████| 170M/170M [00:03<00:00, 42.7MB/s]


Training dataset size: 50000 images
Validation dataset size: 10000 images
Number of classes: 10
Class names: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
Training batches per epoch: 391
Validation batches: 79
Batch images shape: torch.Size([128, 3, 224, 224])
Batch labels shape: torch.Size([128])
Images data type: torch.float32
Labels data type: torch.int64
Image pixel value range: [-2.118, 2.640]
Label range: [0, 9]
Batch images shape: torch.Size([128, 3, 224, 224])
Batch labels shape: torch.Size([128])
Images data type: torch.float32
Labels data type: torch.int64
Image pixel value range: [-2.118, 2.640]
Label range: [0, 9]
Batch images shape: torch.Size([32, 3, 256, 256])
Batch labels shape: torch.Size([32])
Images data type: torch.float32
Labels data type: torch.int64
Image pixel value range: [-2.118, 2.640]
Label range: [0, 9]



# Phase 4: Local Response Normalization (LRN)

**Detailed Hint:** This is AlexNet's "secret sauce" for competition between feature maps. You're implementing the formula from the paper where each activation gets normalized by its neighbors. Think about how to efficiently compute the sum of squares across nearby channels for each spatial location. PyTorch removed built-in LRN, so you need to implement it as a custom module. Consider the sliding window of channels and how to handle edge cases.


In [8]:
class LocalResponseNorm(nn.Module):
    """
    Implement Local Response Normalization as described in AlexNet paper.
    Hint: The formula involves looking at nearby channels and computing their squared sum.
    You need to implement the forward pass that applies the mathematical formula to each activation.
    """


    def __init__(self, size=5, alpha=1e-4, beta=0.75, k=2.0):
        """
        Initialize LRN parameters.

        Args:
            size (int): Number of nearby channels to consider (n in paper)
            alpha (float): Scaling parameter (α in paper)
            beta (float): Exponent parameter (β in paper)
            k (float): Additive constant to prevent division by zero
        """
        super(LocalResponseNorm, self).__init__()
        self.size = size
        self.alpha = alpha
        self.beta = beta
        self.k = k





    def forward(self, x):
        """
        Apply LRN to input tensor x.

        Hint: x has shape [batch, channels, height, width]

        For each position, you need to look at 'size' nearby channels,
        compute the sum of their squares, then apply the normalization formula.
        """

        batch_size, channels, height, width = x.size()

        # TODO: Implement the LRN formula

        # Step 1: Squaring the input

        x_squared = x.pow(2)


        #step 2:  Padding

        #Step 2.1: Calculate how much padding we need:

        padding = self.size // 2


        # step 2.2: add the padding using the FANCY SYNTAX

        x_squared_padded = F.pad(
            x_squared,
            (0, 0, 0, 0, padding, padding),
            mode='constant',
            value=0
        )
       # x_squared_padded shape: [batch, channels + 2*padding, height, width]


        # Step 3: Sliding Window

        # Step 3.1: Store the window states/conditions

        windows = x_squared_padded.unfold(1, self.size, 1)


        # Step 3.2: Calculate sum for each window condition

        sum_of_squares = windows.sum(dim=-1)

# hence we got the "summation" term of equation


        denominator = torch.pow(self.k + self.alpha * sum_of_squares, self.beta)
        denominator = torch.clamp(denominator, min=1e-8) # --> to avoid division by zero

        return x / denominator  # Return normalized tensor

# Phase 5: AlexNet Architecture

**Detailed Hint:** Now you're building the actual network described in the paper. Think about the sequence: convolutional layers extract features, pooling layers reduce spatial dimensions, fully connected layers make final classifications. Pay attention to the paper's specific numbers: kernel sizes, strides, number of filters, etc. The architecture has two main parts - feature extraction (convolutional) and classification (fully connected). Consider where dropout and LRN fit in the architecture.


In [9]:
class AlexNet(nn.Module):
    """
    Implement the full AlexNet architecture.
    Hint: The paper describes 5 convolutional layers followed by 3 fully connected layers.
    Pay attention to the specific parameters: kernel sizes, strides, padding, number of filters.
    """
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()

        self.features = self._make_feature_layers()
        self.classifier = self._make_classifier_layers(num_classes)

        # TODO: Initialize weights using the strategy mentioned in the paper
        self._initialize_weights()

    def _make_feature_layers(self):
        """
        Create the convolutional feature extraction layers.

        Hint: Look at the paper's Table 1 or Figure 2 for the exact layer specifications.

        Remember to include ReLU activations, LRN where specified, and MaxPooling layers.
        """
        layers = []

        # TODO: Add Conv2d, ReLU, LRN, MaxPool2d in the right sequence


  # ----------- Layer 1: Conv(11x11, 96 filters, stride 4) -> ReLU -> LRN -> MaxPool -----------


        # Large 11x11 kernel to capture big patterns, stride=4 to reduce size quickly


        layers.append(nn.Conv2d(
            in_channels=3,          # RGB input
            out_channels=96,        # 96 different pattern detectors
            kernel_size=11,         # 11x11 sliding window
            stride=4,               # Move 4 pixels at a time (reduces size)
            padding=2               # Add border to maintain reasonable size

        ))


        layers.append(nn.ReLU(inplace=True))  # Activation: keep positive values only

        # Add Local Response Normalization (from your Phase 4 implementation)
        layers.append(LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0))

        # MaxPooling: Take the maximum in each 3x3 region, stride=2
        layers.append(nn.MaxPool2d(kernel_size=3, stride=2))


        # ----------- Layer 2: Conv(5x5, 256 filters, stride 1) -> ReLU -> LRN -> MaxPool  -----------


        # Input: 96 channels -> Output: 256 feature maps
        # Smaller 5x5 kernel for more detailed patterns


        # We implement the FULL network on one device
        layers.append(nn.Conv2d(96, 256, kernel_size=5, padding=2))
        #                       ↑    ↑
        #                       96   256
        #                       │    └─ Total output channels
        #                       └─ Total input channels (96, not 48) --> bcoz 48 are split across 2 GPUs according to the paper.


        layers.append(nn.ReLU(inplace=True))

        layers.append(LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0))

        layers.append(nn.MaxPool2d(kernel_size=3, stride=2))



        # ----------- Layer 3: Conv(3x3, 384 filters, stride 1) -> ReLU -----------


        # Input: 256 channels -> Output: 384 feature maps
        # Even smaller 3x3 kernel for fine details


        layers.append(nn.Conv2d(256, 384, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))



        # ----------- Layer 4: Conv(3x3, 384 filters, stride 1) -> ReLU -----------

        # Input: 384 channels -> Output: 384 feature maps
        # Same size, just processing the features further

        layers.append(nn.Conv2d(384, 384, kernel_size=3, padding=1))

        layers.append(nn.ReLU(inplace=True))


       # ----------- Layer 5: Conv(3x3, 256 filters, stride 1) -> ReLU -> MaxPool -----------

        # Input: 384 channels -> Output: 256 feature maps
        # Final feature extraction layer
        layers.append(nn.Conv2d(384, 256, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        layers.append(nn.MaxPool2d(kernel_size=3, stride=2))

        return nn.Sequential(*layers)




    def _make_classifier_layers(self, num_classes):
        """
        Create the fully connected classification layers.
        Hint: The paper mentions 3 fully connected layers with specific dimensions.
        Don't forget dropout for regularization - where should it be applied?
        """
        layers = []

        # TODO: Add Linear layers with appropriate input/output dimensions

        # TODO: Add ReLU activations and Dropout where appropriate

        # DROPOUT: Randomly turn off 50% of neurons during training
        # This prevents overfitting - like studying with distractions to build robustness
        layers.append(nn.Dropout(p=0.5))



        # FULLY CONNECTED LAYER 1
        # Input: 256 * 6 * 6 = 9216 features (flattened from conv layers)
        # Output: 4096 neurons
        layers.append(nn.Linear(256 * 6 * 6, 4096))
        layers.append(nn.ReLU(inplace=True))


        # More dropout
        layers.append(nn.Dropout(p=0.5))


        # FULLY CONNECTED LAYER 2
        # Input: 4096 -> Output: 4096
        layers.append(nn.Linear(4096, 4096))
        layers.append(nn.ReLU(inplace=True))



        # Final layer should output 'num_classes' values (no activation - handled by loss function)

        # Input: 4096 -> Output: num_classes (10 for CIFAR-10, 1000 for ImageNet)
        # No activation here - the loss function (CrossEntropy) handles it
        layers.append(nn.Linear(4096, num_classes))


        return nn.Sequential(*layers)

    def _initialize_weights(self):
        """
        Initialize network weights as described in the paper.
        Hint: The paper mentions specific initialization strategies for different layer types.
        Conv layers and Linear layers might need different approaches.
        """

        # we loop over every layer in the network and set weights
        for module in self.modules():

          # if layer is a conv layer

            if isinstance(module, nn.Conv2d):
                # Convolutional layers: Gaussian distribution with std=0.01

                #weights
                nn.init.normal_(module.weight, mean=0, std=0.01)

                #bias
                if module.bias is not None:
                    nn.init.constant_(module.bias, 0)



           # if layer is fc layer

            elif isinstance(module, nn.Linear):
                # Fully connected layers: Gaussian distribution with std=0.01\

              #weights
                nn.init.normal_(module.weight, mean=0, std=0.01)

                #bias
                nn.init.constant_(module.bias, 1)  # Paper initializes FC biases to 1


    def forward(self, x):

      # self - is basically the alexnet network itself
      # x is the input image




        """
        Define the forward pass through the network.
        Hint: Data flows through features, then gets flattened, then through classifier.
        Pay attention to tensor shapes - when do you need to reshape?
        """
        # TODO: Pass through feature extraction layers


        #step 1: feature extraction layers
        #pass through all conv layers
        # Input shape: [batch_size, 3, 224, 224]
        # Output shape: [batch_size, 256, 6, 6]


        x = self.features(x)



        '''
        Original photo: [Cat sitting on a chair]

        Detective's report (256 observations):
        - Report 1: "I see vertical edges in regions..."
        - Report 2: "I see curved shapes in regions..."
        - Report 3: "I see furry textures in regions..."
        - Report 4: "I see pointy triangular shapes in regions..."
        - ...
        - Report 256: "I see whisker-like patterns in regions..."

        ----- tHEREFORE we get the shape change as ------


        # Before feature extraction:
        # x.shape = [batch_size, 3, 224, 224]
        #           [how many photos, RGB, height, width]
        #           [4, 3, 224, 224] = 4 color photos, each 224×224 pixels

        # After feature extraction:
        # x.shape = [batch_size, 256, 6, 6]
        #           [how many photos, observations, small regions, small regions]
        #           [4, 256, 6, 6] = 4 analysis reports, each with 256 observations about 6×6 grid
        '''











        # TODO: Flatten the tensor for fully connected layers
        # Step 2: flatten bcoz we need to convert from 4d tensor to 2d tensore

       # Think: Convert from "image with features" to "list of features"

        x = x.view(x.size(0), -1) # Keep batch size, flatten everything else

        # continuation from example: turning the report into a list

        # New shape: [batch_size, 256*6*6] = [batch_size, 9216]

        #we have done this bcoz FC layers need a list of numbers and not grids





        # TODO: Step 3: Pass through classifier layers

        x = self.classifier(x)

        # gives us the final class label

        return x



# Phase 6: Training Infrastructure

**Detailed Hint:** You need to set up the training loop components: loss function, optimizer, and learning rate scheduling. The paper mentions specific choices - cross-entropy loss, SGD with momentum, specific learning rates and weight decay values. Think about what each hyperparameter does and why the authors chose these values. Also consider how to track and display training progress.


In [10]:

def setup_training_components(model, learning_rate=0.01):
    """
    Set up loss function, optimizer, and learning rate scheduler.

    Hint: AlexNet paper specifies SGD with momentum, specific weight decay values.

    What loss function makes sense for multi-class classification?
    """
    # TODO: Define appropriate loss function
    criterion = nn.CrossEntropyLoss()

    # TODO: Define optimizer with paper's hyperparameters
    optimizer = optim.SGD(
        model.parameters(),           # Which weights to update
        lr=learning_rate,            # How big steps to take (learning rate)
        momentum=0.9,                # How much to remember previous updates
        weight_decay=0.0005          # Regularization to prevent overfitting
    )



    # TODO: Optional - create learning rate scheduler


    # LEARNING RATE SCHEDULER - When to change learning speed
    # AlexNet reduces LR by factor of 10 when validation error stops improving
    scheduler = ReduceLROnPlateau(
        optimizer,
        mode='min',          # Reduce LR when validation loss stops decreasing
        factor=0.1,          # Multiply LR by 0.1 (reduce by factor of 10)
        patience=1,         # Wait 10 epochs before reducing
        verbose=True         # Print when LR changes
    )

    return criterion, optimizer, scheduler


In [11]:

def train_one_epoch(model, train_loader, criterion, optimizer, device):
    """
    Execute one training epoch.
    Hint: This is your main training loop - iterate through batches,
    compute forward pass, calculate loss, backpropagate, update weights.
    Track metrics like loss and accuracy for monitoring.
    """
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    start_time= time.time()

    for batch_idx, (data, target) in enumerate(train_loader):

        # TODO: Move data to appropriate device

        data = data.to(device)
        target = target.to(device)


        # TODO: Zero gradients

        optimizer.zero_grad()



        # TODO: Forward pass

        output = model(data)

        """
        Example output from AlexNet:

        AlexNet's guess: [0.1, 0.8, 0.05, 0.02, 0.01, 0.01, 0.005, 0.005, 0.0, 0.0]
        Class meanings:  [plane, car, bird, cat, deer, dog, frog, horse, ship, truck]
        Translation: "I'm 80% confident this is a car, 10% confident it's a plane..."
        """


        # TODO: Compute loss

        loss = criterion(output, target)


# --------- STARTING BACKPROP---------

        # TODO: Backward pass

        loss.backward()

        """
        computes gradients WRT all model parameters
        - model parameteres here (i think) is the 'theta' which is basically all the weights and biases
        """



        # TODO: Optimizer step

        optimizer.step()




        # TODO: Update running statistics

        running_loss += loss.item()

        predicted = torch.argmax(output, dim=1)  # Get class with highest probability
        total += target.size(0)                  # Add batch size to total
        correct += (predicted == target).sum().item()  # Count correct predictions


        # Optional: Print progress every N batches

        if batch_idx % 100 == 0:
          current_acc = 100.0 * correct / total
          print(f"  Batch {batch_idx:4d}/{len(train_loader)}: "
                f"Loss: {loss.item():.4f}, "
                f"Accuracy: {current_acc:.2f}%")





    epoch_loss = running_loss / len(train_loader)
    epoch_acc = correct / total

    return epoch_loss, epoch_acc


In [12]:

def validate_model(model, val_loader, criterion, device):
    """
    Evaluate model on validation set.
    Hint: Similar to training but without gradient computation.
    """


    model.eval()

    """
    # Evaluation mode: model.eval()
    # - Dropout is DISABLED (all neurons active for best performance)
    # - BatchNorm uses stored statistics from training
    # - Network gives its "best shot" consistently
    """




    val_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad(): # as specified earlier -->  no need of calculating gradients

        for data, target in val_loader:

            """
            ALMOST SIMILAR JUS THAT -->
            1. we are doing this for monitoring the performance of model and not to make it better
            --> thats why we dont have a backprop process
            --> just simple calculating loss and accuracy

            2. we do all of this on DIFFERENT DATA - loaded usnig val_loader ---> NEVER SEEN BEFORE

            ]
            """


            """
            IMP ------ During validation, we want to see AlexNet's true current ability, not give it more practice!
            """

            # TODO: Move data to device

            data = data.to(device)
            target = target.to(device)


            # TODO: Forward pass

            output = model(data)

            # TODO: Compute loss

            # this loss is not calculated for learning --> we are just checking/monitoring using this loss

            loss = criterion(output, target)
            val_loss += loss.item()


            # TODO: Calculate accuracy

            predicted = torch.argmax(output, dim=1)
            total += target.size(0)
            correct += (predicted == target).sum().item()


    avg_loss = val_loss / len(val_loader)
    accuracy = correct / total
    return avg_loss, accuracy


# Phase 7: Main Training Loop

**Detailed Hint:** This ties everything together - your main training script that orchestrates the entire process. Think about how many epochs to train, when to save checkpoints, how to handle early stopping, and what information to log. Consider what you want to track during training and how to save the best model.


In [13]:
def train_alexnet(num_epochs=1, save_path="alexnet_checkpoint.pth", resume_training=True):
    """
    Main training function that coordinates everything.
    Hint: This should set up all components, then run training/validation loops.
    Consider saving checkpoints, tracking best performance, and logging progress.

    this will return model --> which we did not have earlier
    it will also return best_val_acc as before
    """
    # TODO: Set up device, data loaders, model, training components
    device = setup_environment()
    train_loader, val_loader, _ = create_data_loaders(batch_size = 256)
    model = AlexNet(num_classes=10).to(device)
    criterion, optimizer, scheduler = setup_training_components(model, learning_rate=0.01)


    # Initializing tracking variables for a LOT of things
    best_val_acc = 0.0
    best_val_loss = float('inf')
    epochs_without_improvement = 0
    start_epoch = 0 # required if we load the model which was already trained till "x" epochs



    """
    before training we need to check if we already have a trained/saved model

    Therefore we need to have logic to load that model
    AND to use its metadata for initializing our training journey
    """


    if resume_training and os.path.exists(save_path):
        print ("loading checkpoint")

        try:

            checkpoint = torch.load(save_path, map_location = device)


            # loading ----->>>>> MODEL WIGHTS

            model.load_state_dict (checkpoint['model_state_dict'])


            # loading ------>>>>> optimizer and scheduler states

            optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            scheduler.load_state_dict(checkpoint['scheduler_state_dict'])


            # Loading ------->>>>>>> training progress

            start_epoch = checkpoint.get('epoch', 0)

            best_val_acc = checkpoint.get('best_val_acc', 0.0)
            best_val_loss = checkpoint.get('best_val_loss', float('inf'))


            print ("checkpoint loaded")

        except Exception as e:

            print ("model was there apparently but unable to load")


    else:
        print ("training fresh init")



    """
    ---------------- model loading code ended ------------------------




    ---------------- training loop strats below ------------------------

    """


    for epoch in range(num_epochs):
        print(f"Epoch {epoch+1}/{num_epochs}")

        epoch_start_time = time.time()
        current_lr = optimizer.param_groups[0]['lr']


        # TODO: Train for one epoch

        train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)


        # TODO: Validate the model

        val_loss, val_acc = validate_model(model, val_loader, criterion, device)



        # TODO: Update learning rate if using scheduler

        scheduler.step(val_loss)

        # Check if learning rate was reduced
        new_lr = optimizer.param_groups[0]['lr']
        if new_lr != current_lr:
            print(f" ----- Learning rate reduced: {current_lr:.6f} → {new_lr:.6f}")


        epoch_time = time.time() - epoch_start_time



        # TODO: Save checkpoint if this is the best model so far

        is_best_flag = val_acc > best_val_acc

        if is_best_flag:
            best_val_acc = val_acc
            best_val_loss = val_loss
            epochs_without_improvement = 0


            # Creating a list of things that we need to save

            checkpoint = {
                'epoch': epoch + 1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'best_val_acc': best_val_acc,
                'best_val_loss': best_val_loss,
                'train_acc': train_acc,
                'train_loss': train_loss,
                'val_acc': val_acc,
                'val_loss': val_loss
            }


            torch.save(checkpoint, save_path)
            print ("new best model saved")



# if the model did not get saved - it means that model was not better - which means no improvement

        else:
            epochs_without_improvement += 1



        # TODO: Print/log progress
        print(f"\n EPOCH {epoch+1} SUMMARY:")
        print(f"   Time: {epoch_time:.1f}s")
        print(f"   Train → Loss: {train_loss:.4f}, Accuracy: {train_acc:.2f}%")
        print(f"   Val   → Loss: {val_loss:.4f}, Accuracy: {val_acc:.2f}%")
        print(f"   Best Val Acc: {best_val_acc:.2f}% (Epoch {epoch + 1 - epochs_without_improvement})")


    print("Training completed!")



    #Most IMP return the model
    return model, best_val_acc


In [15]:
model, best_val_acc = train_alexnet()

Using device: cuda
Training dataset size: 50000 images
Validation dataset size: 10000 images
Number of classes: 10
Class names: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
Training batches per epoch: 196
Validation batches: 40
training fresh init
Epoch 1/1




  Batch    0/196: Loss: 2.6561, Accuracy: 10.55%
  Batch  100/196: Loss: 2.3026, Accuracy: 10.28%
new best model saved

 EPOCH 1 SUMMARY:
   Time: 162.9s
   Train → Loss: 2.7702, Accuracy: 0.10%
   Val   → Loss: 2.3027, Accuracy: 0.10%
   Best Val Acc: 0.10% (Epoch 1)
Training completed!



# Phase 8: Testing with 10-Crop Strategy

**Detailed Hint:** Implement AlexNet's testing strategy where you extract 10 different crops from each test image and average their predictions. This is different from training where you use random crops. Think about how this averaging helps improve accuracy and robustness.


In [16]:
def load_trained_model(checkpoint_path="alexnet_checkpoint.pth", device=None):
    """
    Load a trained AlexNet model from checkpoint.

    Args:
        checkpoint_path: Path to the saved checkpoint
        device: Device to load model on (if None, auto-detect)

    Returns:
        model: Loaded AlexNet model ready for testing
    """
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    print(f"📥 Loading model from {checkpoint_path}...")

    try:
        # Create model architecture
        model = AlexNet(num_classes=10).to(device)

        # Load checkpoint
        checkpoint = torch.load(checkpoint_path, map_location=device)

        # Load model weights
        model.load_state_dict(checkpoint['model_state_dict'])

        # Set to evaluation mode
        model.eval()

        print(f"✅ Model loaded successfully!")
        if 'best_val_acc' in checkpoint:
            print(f"   Best validation accuracy: {checkpoint['best_val_acc']:.2f}%")
        if 'epoch' in checkpoint:
            print(f"   Trained for {checkpoint['epoch']} epochs")

        return model

    except FileNotFoundError:
        print(f"❌ Error: Checkpoint file '{checkpoint_path}' not found!")
        return None
    except Exception as e:
        print(f"❌ Error loading model: {e}")
        return None


In [20]:
def test_with_ten_crops(model = None, device = None, checkpoint_path = "alexnet_checkpoint.pth"):
    """
    Evaluate model using the 10-crop testing strategy from the paper.
    Hint: For each test image, extract 10 crops, get predictions for each,
    then average the softmax outputs before making final prediction.
    """


    # do the "SETUP" of things

    # 1. setting up device - bcoz it is needed everywhere

    device = setup_environment()


    # 2. load the model

    if model is None:
        model = load_trained_model(checkpoint_path, device)


    # 3. setup the test data using test loaded

    _, _, test_loader = create_data_loaders(batch_size = 256)




    # INITIALIZATIONS that are required

    model.eval()
    correct = 0
    total = 0



    with torch.no_grad():
        for batch_idx, (images, labels) in enumerate(test_loader):

          labels = labels.to(device)
          images = images.to(device)
          batch_size = images.size(0)


          for i in range (batch_size):

            img = images[i]
            # Shape = [3, 256, 256]

            true_label = labels[i].item()



            # TODO: Extract 10 crops from this image
            crops = extract_ten_test_crops(img)


            # TODO: Get prediction for each crop
            crop_predictions = []

            for crop in crops:


              crop_batch = crop.unsqueeze(0).to(device)  # Add batch dimension

              # TODO: Forward pass through model

              output = model(crop_batch)


              # TODO: Apply softmax to get probabilities

              probabilities = F.softmax(output, dim=1)  # Convert logits to probabilities

              # Store this crop's prediction
              crop_predictions.append(probabilities)


            # STEP 3: Average the 10 predictions
            stacked_predictions = torch.stack(crop_predictions)
            avg_prediction = torch.mean(stacked_predictions, dim=0)


            # STEP 4: Make final classification decision
            predicted_class = torch.argmax(avg_prediction, dim=1).item()




            # STEP 5: Check if prediction is correct
            if predicted_class == true_label:
                correct += 1

            total += 1

          if batch_idx % 20 == 0:
                current_acc = 100.0 * correct / total if total > 0 else 0.0
                print(f"  Processed {batch_idx + 1} batches, Current accuracy: {current_acc:.2f}%")



    final_accuracy = correct / total
    print(f"\n✅ 10-crop test accuracy: {final_accuracy:.4f} ({final_accuracy*100:.2f}%)")
    return final_accuracy

In [21]:
test_accuracy = test_with_ten_crops()

Using device: cuda
📥 Loading model from alexnet_checkpoint.pth...
✅ Model loaded successfully!
   Best validation accuracy: 0.10%
   Trained for 1 epochs
Training dataset size: 50000 images
Validation dataset size: 10000 images
Number of classes: 10
Class names: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
Training batches per epoch: 196
Validation batches: 40
  Processed 1 batches, Current accuracy: 3.12%
  Processed 21 batches, Current accuracy: 9.38%
  Processed 41 batches, Current accuracy: 9.76%
  Processed 61 batches, Current accuracy: 9.73%
  Processed 81 batches, Current accuracy: 10.46%
  Processed 101 batches, Current accuracy: 10.40%
  Processed 121 batches, Current accuracy: 10.61%
  Processed 141 batches, Current accuracy: 10.48%
  Processed 161 batches, Current accuracy: 10.15%
  Processed 181 batches, Current accuracy: 10.07%
  Processed 201 batches, Current accuracy: 10.00%
  Processed 221 batches, Current accuracy: 10.05%
 

In [22]:
print (test_accuracy)

0.1
