# Introduction

In this assignment you will practice putting together an image classification pipeline based on CNNs for the [CIFAR-10 and/or CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) datasets. The goals of this assignment are as follows:



*   Understand the components of a CNN model and a Vision Transformer (ViT) model.
*   Understand how to modify a standard CNN model towards a specific task.
*   Implement a basic neural network training pipeline in PyTorch.
*   Implement and train an AlexNet model.
*   Implement and train a ResNet model.
*   Implement and train a ViT model.
*   Understand the differences and tradeoffs between these models.

Please fill in all the **TODO** code blocks. Once you are ready to submit:

* Export the notebook `CSCI677_assignment_3.ipynb` as a PDF `[Your USC ID]_CSCI677_assignment_3.pdf`

Please make sure that the notebook has been run before exporting the PDF, and that your code and all cell outputs are visible in your submitted PDF. Regrading requests will not be accepted if your code or output is not visible in the original submission. Thank you!

In case you haven't installed PyTorch yet, run the following command to install torch and torchvision.

In [None]:
!pip install torch torchvision

In [3]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():  # get_device_name raises an error when no GPU is present
    print(torch.cuda.get_device_name(0))

2.6.0+cu126
True
NVIDIA GeForce RTX 2070 with Max-Q Design


In [4]:
import math
import torch
import torch.nn as nn

# **Data Preparation**

[CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) is a well-known dataset of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The utility function `cifar10()` returns the entire CIFAR-10 dataset as a set of four Torch tensors:
* `x_train` contains all training images (real numbers in the range [0, 1])
* `y_train` contains all training labels (integers in the range [0, 9])
* `x_test` contains all test images
* `y_test` contains all test labels

This function automatically downloads the CIFAR-10 dataset the first time you run it.

[CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) is just like the CIFAR-10 dataset, except it has 100 classes containing 600 images each. Below we provide wrapper classes for the CIFAR-10 and CIFAR-100 datasets. You can choose one or both of them for training your models. If you choose one of them, use the same one to train all your models.
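Both wrappers convert images to tensors in [0, 1] with `ToTensor()` and then standardize each channel with `Normalize(mean, std)`, which computes `(x - mean) / std` per channel. The following is a minimal sketch of that arithmetic in plain Python, using the CIFAR-10 statistics from the wrapper; the `normalize_pixel` helper is hypothetical and only for illustration.

```python
# Per-channel statistics used by the CIFAR-10 wrapper's Normalize transform.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.247, 0.243, 0.261)

def normalize_pixel(rgb, mean=CIFAR10_MEAN, std=CIFAR10_STD):
    """Apply (x - mean) / std to one RGB pixel whose values lie in [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))

# A pixel equal to the channel means maps to zero; black and white pixels
# map to roughly -2 and +2, so the network sees inputs centered around zero.
print(normalize_pixel(CIFAR10_MEAN))      # (0.0, 0.0, 0.0)
print(normalize_pixel((0.0, 0.0, 0.0)))
print(normalize_pixel((1.0, 1.0, 1.0)))
```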

In [5]:
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

class CIFAR10Dataset:
    def __init__(self, batch_size=128, root="data"):
        self.transform = transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))]
        )
        self.batch_size = batch_size

        self.training_data = datasets.CIFAR10(
            root=root,
            train=True,
            download=True,
            transform=self.transform
        )
        self.train_dataloader = DataLoader(self.training_data, batch_size=self.batch_size, shuffle=True)

        self.test_data = datasets.CIFAR10(
            root=root,
            train=False,
            download=False,
            transform=self.transform
        )
        self.test_dataloader = DataLoader(self.test_data, batch_size=self.batch_size, shuffle=False)

        self.classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


class CIFAR100Dataset:
    def __init__(self, batch_size=128, root="data"):
        self.transform = transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761))]  # CIFAR-100 normalization values
        )
        self.batch_size = batch_size

        self.training_data = datasets.CIFAR100(
            root=root,
            train=True,
            download=True,
            transform=self.transform
        )
        self.train_dataloader = DataLoader(self.training_data, batch_size=self.batch_size, shuffle=True)

        self.test_data = datasets.CIFAR100(
            root=root,
            train=False,
            download=False,
            transform=self.transform
        )
        self.test_dataloader = DataLoader(self.test_data, batch_size=self.batch_size, shuffle=False)

        self.classes = self.training_data.classes


In [6]:
# Function to count the number of trainable parameters in a model
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage
model = torch.nn.Linear(10, 2)  # Example model
print(f"Number of parameters: {count_parameters(model)}")

Number of parameters: 22


# AlexNet (20 pts)
AlexNet, introduced by Alex Krizhevsky et al. in 2012, marked a significant breakthrough in deep learning for computer vision. This deep convolutional neural network consists of five convolutional layers, some followed by max-pooling layers, and three fully connected layers. AlexNet was designed for large-scale image classification and won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

## Implement AlexNet (20 pts)
Classical AlexNet architecture is as follows:


![AlexNet Architecture](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*wgJ9iOjl_JzjOZ3e9jDFAw.png)


The original AlexNet was designed for high-resolution images (224x224x3) from the ImageNet dataset. However, the CIFAR-10 and CIFAR-100 datasets consist of lower-resolution images (32x32x3). To adapt AlexNet for these datasets, you need to modify it.

Requirements:
* **Input Adaptation**: Modify the network to accept 32x32x3 input dimensions, suitable for CIFAR-10 and CIFAR-100 images.
* **Architecture**: Implement a network with the following layers:

  (Convolutional Layer 1 -> ReLU -> Max Pooling 1) ->

  (Convolutional Layer 2 -> ReLU -> Max Pooling 2) ->

  (Convolutional Layer 3 -> ReLU -> Convolutional Layer 4 -> ReLU -> Convolutional Layer 5 -> ReLU -> Max Pooling 3) ->

  Flattening ->

  (Linear -> ReLU) ->

  (Linear -> ReLU) -> Linear.
* You can design your own convolution filters and max pooling layers.
* Your model must contain fewer than **40 million** parameters. We provide the `count_parameters()` function to count the number of parameters in a model.

**Hint**: you can use nn.Sequential() to simplify your implementation.
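Before writing any layers, it can help to check the spatial sizes and the parameter budget by hand. The sketch below does this for one hypothetical configuration (3x3 convolutions with padding 1, 2x2 max pooling, and channel widths 64, 192, 384, 256, 256); these choices are assumptions for illustration, not a required design.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical configuration: (in_channels, out_channels) of five 3x3 convs,
# with a 2x2 max pool after conv 1, conv 2, and conv 5.
convs = [(3, 64), (64, 192), (192, 384), (384, 256), (256, 256)]
pool_after = {0, 1, 4}

size, total = 32, 0
for idx, (c_in, c_out) in enumerate(convs):
    total += conv_params(c_in, c_out, k=3)  # padding=1, stride=1 keeps the size
    if idx in pool_after:
        size //= 2                          # 2x2 max pool with stride 2 halves it

flat = convs[-1][1] * size * size           # 256 * 4 * 4 = 4096 features
for n_in, n_out in [(flat, 4096), (4096, 4096), (4096, 10)]:
    total += linear_params(n_in, n_out)

print(size, flat, total)  # 4 4096 35855178
```

Almost all of the parameters sit in the two 4096-wide linear layers, so the width of the classifier is the main lever for staying under the budget. The result can be cross-checked against `count_parameters()` once the model is built.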

In [7]:
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        # TODO
        self.features = nn.Sequential(
            # Input 32x32, conv, then downsample to 16x16
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),   # Output: 64 x 32 x 32
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),                    # Output: 64 x 16 x 16

            # Input 16x16, conv, then downsample to 8x8
            nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),   # Output: 192 x 16 x 16
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),                    # Output: 192 x 8 x 8

            # Input 8x8, 3 convs, then downsample to 4x4
            nn.Conv2d(192, 384, kernel_size=3, stride=1, padding=1),  # Output: 384 x 8 x 8
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # Output: 256 x 8 x 8
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),  # Output: 256 x 8 x 8
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)                     # Output: 256 x 4 x 4
        )

        self.classifier = nn.Sequential(
            nn.Linear(256 * 4 * 4, 4096),
            nn.ReLU(inplace=True), # inplace = True to save memory
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes)
        )

    def forward(self, x):
        # TODO
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten the 4x4 conv output for the classifier
        x = self.classifier(x)
        return x


# ResNet (20 pts)
ResNet, short for Residual Network, was introduced in 2015 by Kaiming He et al. At its core, ResNet introduces the concept of residual blocks, which allows gradients to flow directly through the network's many layers. In comparison to earlier architectures like AlexNet, ResNet's approach demonstrates the transformative power of residual connections.

In this section, you will implement ResNet-18 for CIFAR-10/100.

## Implement Residual Block (10 pts)
The Residual Block is a crucial component in ResNet. It works by introducing a shortcut connection, also known as a skip connection, alongside a regular neural network layer. This shortcut connection enables the flow of information directly from one layer to another, bypassing some intermediate layers.

The key idea is to learn a residual function F(x) = H(x) - x, the difference between the desired mapping H(x) and the block's input x, so that the block outputs F(x) + x. If the identity is already close to optimal, the block only needs to push F(x) toward zero. The shortcut also gives gradients a direct path backward, which mitigates the vanishing gradient problem in very deep networks and makes them easier to train.

![Residual Block](https://miro.medium.com/v2/resize:fit:1140/format:webp/1*6WlIo8W1_Qc01hjWdZy-1Q.png)


The weight layer usually consists of a convolutional layer and a batch normalization layer. The batch normalization layer, often abbreviated as BatchNorm, normalizes the input of a neural network layer across a mini-batch of data during training. BatchNorm not only accelerates convergence but also acts as a form of regularization, reducing the risk of overfitting. In PyTorch, it is implemented by nn.BatchNorm2d().
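To make the normalization step concrete, here is a minimal sketch of what BatchNorm computes for a single channel in training mode, written in plain Python. It assumes the default `eps=1e-5`, uses the biased variance, and leaves out the learnable scale and shift; the `batch_norm_1ch` helper is hypothetical.

```python
import math

def batch_norm_1ch(values, eps=1e-5):
    """Normalize one channel's activations over a mini-batch (training mode).

    Uses the biased variance and omits the learnable scale and shift
    (equivalent to gamma = 1, beta = 0).
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

out = batch_norm_1ch([2.0, 4.0, 6.0, 8.0])
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```

In `nn.BatchNorm2d`, the same statistics are taken per channel over the batch and both spatial dimensions.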

You are asked to implement the residual block with the following requirements:
* The residual block takes an input of size n * n * `in_channels` and outputs m * m * `out_channels`, with m = floor((n-1) / `stride`) + 1
* The residual function consists of the following components:

  Conv -> BatchNorm -> ReLU -> Conv -> BatchNorm

  where Conv means 3x3 convolutional filters with padding 1. If `stride` != 1, set stride for the first Conv.
* The shortcut should be identity if `in_channels` == `out_channels` and `stride` == 1. Otherwise, it should be a convolutional layer with kernel_size=1 and stride=`stride`.
* After adding the residual function and the shortcut, apply another ReLU activation.
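The addition only works if the residual path and the shortcut produce the same spatial size. A short check with the standard convolution size formula, floor((n + 2 * padding - kernel_size) / stride) + 1, shows that the 3x3/padding-1 path and the 1x1/padding-0 shortcut agree for any stride; the `conv_out_size` helper is a hypothetical name used for this sketch.

```python
def conv_out_size(n, kernel_size, stride, padding):
    """Spatial output size of a convolution on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

for n in (32, 16, 8, 7):
    for stride in (1, 2):
        main = conv_out_size(n, kernel_size=3, stride=stride, padding=1)      # first Conv
        main = conv_out_size(main, kernel_size=3, stride=1, padding=1)        # second Conv
        shortcut = conv_out_size(n, kernel_size=1, stride=stride, padding=0)  # 1x1 Conv
        assert main == shortcut == (n - 1) // stride + 1
        print(f"n={n:2d} stride={stride}: m={main}")
```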

In [8]:
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        # TODO
        # Residual function: Conv -> BN -> ReLU -> Conv -> BN
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False) # set bias to False to avoid redundancy bc BatchNorm has its own
        self.bn1   = nn.BatchNorm2d(out_channels) # Input size = output size from conv1
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)

        # Shortcut connection: identity if input and output dimensions match; otherwise 1x1 conv with BN to match them
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        # TODO
        # Residual function: Conv -> BN -> ReLU -> Conv -> BN
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)

        # Add the shortcut connection
        out += self.shortcut(x)
        # Final ReLU activation for residual block
        out = self.relu(out)
        return out

## Implement ResNet (10 pts)
ResNet-18 is one of the shallower members of the ResNet family, which is known for training very deep networks successfully on image classification tasks. It consists of 18 weighted layers: one initial convolutional layer, a stack of residual blocks, and a final fully-connected layer. Here is a glimpse of its architecture:


![ResNet-18](https://www.researchgate.net/profile/Sajid-Iqbal-13/publication/336642248/figure/fig1/AS:839151377203201@1577080687133/Original-ResNet-18-Architecture.png)


In this part of the assignment, you are asked to implement a modified ResNet for CIFAR-10/100. Requirements:
* The model should take inputs of 32x32x3 and output a vector of dimension equal to the number of classes (10 for CIFAR-10 and 100 for CIFAR-100).
* The model should begin with a convolutional layer with kernel_size=3 and padding=1:

  Conv -> BatchNorm -> ReLU

  The output size should be 64x32x32.
* After the first layer, append 4 residual blocks such that the output size changes as follows:
  
  (Input size after previous step) 64x32x32 -> 64x32x32 -> 256x16x16 -> 256x8x8 -> 512x2x2
* The model should end with average pooling (kernel_size=2), flattening, and a fully-connected layer.
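To sanity-check a choice of strides against the target sizes, the residual-block size rule m = (n - 1) // stride + 1 can be traced through the four blocks. This is a small sketch; the stride tuples are hypothetical choices, not something the assignment prescribes.

```python
def block_out_size(n, stride):
    """Residual block size rule: m = (n - 1) // stride + 1."""
    return (n - 1) // stride + 1

def trace_sizes(strides, n=32):
    """Return the spatial size after each residual block."""
    sizes = []
    for s in strides:
        n = block_out_size(n, s)
        sizes.append(n)
    return sizes

print(trace_sizes((1, 2, 2, 2)))  # [32, 16, 8, 4]
print(trace_sizes((1, 2, 2, 4)))  # [32, 16, 8, 2]
```

With stride 2 in every downsampling block, the last block yields 4x4 and a `kernel_size=2` average pool then brings it to 2x2. Reaching 2x2 directly out of the last block would need stride 4 there. Either way, the input size of the final `nn.Linear` has to match the flattened result.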


In [9]:
class ResNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ResNet, self).__init__()
        # TODO
        self.initial = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True)
        )

        # Sizes with stride-2 blocks: 64x32x32 -> 64x32x32 -> 256x16x16 -> 256x8x8 -> 512x4x4, then avg-pooled to 512x2x2

        # Block 1: 64x32x32 -> 64x32x32 (no downsampling)
        self.layer1 = ResidualBlock(64, 64, stride=1)
        # Block 2: 64x32x32 -> 256x16x16 (downsample spatially and increase channels)
        self.layer2 = ResidualBlock(64, 256, stride=2)
        # Block 3: 256x16x16 -> 256x8x8 (downsample spatially)
        self.layer3 = ResidualBlock(256, 256, stride=2)
        # Block 4: 256x8x8 -> 512x4x4 (downsample spatially and increase channels)
        self.layer4 = ResidualBlock(256, 512, stride=2)

        # Average pooling to reduce from 4x4 to 2x2
        self.avgpool = nn.AvgPool2d(kernel_size=2)
        # Fully connected layer mapping to num_classes
        self.fc = nn.Linear(512 * 2 * 2, num_classes)

    def forward(self, x):
        # TODO
        x = self.initial(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)  # Flatten the 2x2 conv output for the linear classifier
        x = self.fc(x)
        return x

# Vision Transformer (20 pts)
The Vision Transformer (ViT), introduced in 2020 by Dosovitskiy et al., applies the Transformer architecture, originally designed for natural language processing, to visual data.

![ViT](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*q0tvs1aDxi_7Otm_Zgys1A.png)

In this section, you are tasked with implementing a Vision Transformer model for the CIFAR-10 or CIFAR-100 dataset.

## Implement Patch Embedding Block with Positional Encoding (10 pts)

In Vision Transformers (ViTs), the Patch Embedding Block converts input images into a sequence of patch embeddings, enabling the model to process image data using transformer architectures. Since transformers are not inherently aware of the spatial relationships between patches, positional encoding is added to provide this information.

Overview:
- **Input:** 3x32x32 images, **Arguments:** `patch_size` (make sure `32 % patch_size == 0`), `embed_dim`
- Divide the image into non-overlapping patches of size `3 x patch_size x patch_size`. You should end up getting `(32 // patch_size)**2` patches.
- Flatten the pixels in each patch (into a single dimension of size `3 x patch_size x patch_size`), apply Layer Normalization, project it into a higher-dimensional space (e.g., 256 dimensions) using a fully-connected layer, and then apply another Layer Normalization.
- Add positional encodings to the patch embeddings to retain spatial information.
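For the positional encodings, the sinusoidal scheme from Section 3.5 of "Attention Is All You Need" sets PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Below is a vectorized sketch of that formula; the function name `sinusoidal_positional_encoding` is hypothetical, and a simple double loop over positions and dimensions gives the same values.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, embed_dim):
    """Return a [seq_len, embed_dim] tensor of fixed sinusoidal encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
    even_idx = torch.arange(0, embed_dim, 2, dtype=torch.float32)       # the 2i values
    inv_freq = torch.exp(-math.log(10000.0) * even_idx / embed_dim)     # 1 / 10000^(2i/d)
    angles = position * inv_freq                                        # [seq_len, ceil(d/2)]

    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(angles)                       # even dimensions
    pe[:, 1::2] = torch.cos(angles[:, : embed_dim // 2])  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=64, embed_dim=512)
print(pe.shape)    # torch.Size([64, 512])
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1 alternate
```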

You are asked to implement the Patch Embedding Block as follows:
- Transform "b c (h x p) (w x p) -> b (h x w) (p x p x c)" where b is batch size, c is number of channels, h x p = 32 is the input image height, w x p = 32 is the input image width, and p is the patch size (e.g., 4).
- "b (h x w) (p x p x c)" -> LayerNorm -> fully-connected layer -> LayerNorm
- add positional encodings
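The rearrangement in the first step can be written with plain `reshape` and `permute` calls. The sketch below is one possible way to do it (the `patchify` helper is a hypothetical name), together with a check against `torch.nn.functional.unfold`, which extracts the same patches but orders the features inside each patch as (c, p, p).

```python
import torch
import torch.nn.functional as F

def patchify(x, p):
    """Rearrange b c (h p) (w p) -> b (h w) (p p c)."""
    b, c, H, W = x.shape
    assert H % p == 0 and W % p == 0, "patch size must divide the image size"
    h, w = H // p, W // p
    x = x.reshape(b, c, h, p, w, p)     # split height into (h, p) and width into (w, p)
    x = x.permute(0, 2, 4, 3, 5, 1)     # b, h, w, p (rows), p (cols), c
    return x.reshape(b, h * w, p * p * c)

x = torch.arange(2 * 3 * 32 * 32, dtype=torch.float32).reshape(2, 3, 32, 32)
patches = patchify(x, p=4)
print(patches.shape)  # torch.Size([2, 64, 48])

# Same patches via unfold, after reordering its (c, p, p) features to (p, p, c).
unf = F.unfold(x, kernel_size=4, stride=4).transpose(1, 2)   # [b, h*w, c*p*p]
unf = unf.reshape(2, 64, 3, 4, 4).permute(0, 1, 3, 4, 2).reshape(2, 64, 48)
print(torch.equal(patches, unf))  # True
```

Because the projection that follows is a learned linear layer, either feature ordering works as long as it is used consistently.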

In [10]:
def get_positional_encoding(seq_len, embed_dim):
    # refer to this paper https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
    # follow the original implementation proposed in Section 3.5
    pe = torch.zeros(seq_len, embed_dim)
    for pos in range(seq_len):
        for i in range(0, embed_dim, 2):
            # TODO
            div_term = 10000 ** (i / embed_dim)
            pe[pos, i] = math.sin(pos / div_term)
            if i + 1 < embed_dim:
                pe[pos, i + 1] = math.cos(pos / div_term)
    return pe

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size, embed_dim):
        super(PatchEmbedding, self).__init__()
        # TODO
        self.patch_size = patch_size # Size of each patch (must divide 32)
        self.embed_dim = embed_dim # Dimension of the output patch embedding

        # Calculate the number of patches. For a 32x32 image
        self.num_patches = (32 // patch_size) ** 2
        self.patch_dim = 3 * patch_size * patch_size  # since input has 3 channels

        # Two LayerNorms sandwiching the linear projection
        self.layernorm1 = nn.LayerNorm(self.patch_dim)
        self.proj = nn.Linear(self.patch_dim, embed_dim) # Project into higher-dimensional space
        self.layernorm2 = nn.LayerNorm(embed_dim)

        # Compute and register positional encodings (non-learnable)
        pos_encoding = get_positional_encoding(self.num_patches, embed_dim)
        self.register_buffer("pos_encoding", pos_encoding)  # shape: [num_patches, embed_dim]

    def forward(self, x):
        # TODO
        # b = batch size; c = num channels; p = patch size; h x p = 32 is the input height; w x p = 32 is the input width
        # b c (h x p) (w x p) -> b (h x w) (p x p x c)
        # Note: nn.Unfold orders each patch's features as (c, p, p), a fixed permutation of (p, p, c)
        B, C, H, W = x.shape
        # Unfold to extract non-overlapping patches
        unfold = nn.Unfold(kernel_size=self.patch_size, stride=self.patch_size)
        patches = unfold(x)  # shape: [B, patch_dim, num_patches]
        patches = patches.transpose(1, 2)  # shape: [B, num_patches, patch_dim]

        # Apply first LayerNorm, project, and apply second LayerNorm.
        patches = self.layernorm1(patches)
        embeddings = self.proj(patches)  # shape: [B, num_patches, embed_dim]
        embeddings = self.layernorm2(embeddings)

        # Add the positional encoding (broadcasted along the batch dimension)
        embeddings = embeddings + self.pos_encoding
        return embeddings

## Implement Vision Transformer (ViT) (10 pts)

The Vision Transformer (ViT) model comprises three components (refer to figure above):

1. **Patch Embedding:** Converts input images into a sequence of patch embeddings.

2. **Transformer Encoder:** Processes the sequence of patch embeddings to capture complex patterns and relationships.

3. **MLP Head:** Maps the output from the Transformer Encoder to class predictions.

**Implementation Requirements:**

- **Input and Output Dimensions:** The model should accept inputs of size 32x32x3 and output a vector with a dimension equal to the number of classes (10 for CIFAR-10 and 100 for CIFAR-100).

- **Patch Embedding:** Begin with the PatchEmbedding module you previously implemented.

- **Transformer Encoder:** Utilize `nn.TransformerEncoder()` to process the sequence of patch embeddings, capturing high-level representations.

- **MLP Head:** Conclude with a mean pooling operation over the sequence (patch) dimension (dimension 1), followed by a Multi-Layer Perceptron (MLP) head that maps the pooled embeddings to class predictions.
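The encoder and pooling steps can be tried on random tokens before wiring in the patch embedding. This is a minimal sketch with deliberately small, illustrative sizes; it assumes a PyTorch version where `nn.TransformerEncoderLayer` accepts `batch_first=True`, which lets the encoder take `[batch, seq_len, embed_dim]` tensors directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads, num_classes = 32, 4, 10  # small illustrative sizes

layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=num_heads, dim_feedforward=64, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

tokens = torch.randn(2, 64, embed_dim)  # [batch, num_patches, embed_dim]
encoded = encoder(tokens)               # same shape as the input
pooled = encoded.mean(dim=1)            # average over the patch dimension
logits = head(pooled)                   # [batch, num_classes]
print(encoded.shape, pooled.shape, logits.shape)
```

Without `batch_first=True`, the encoder expects `[seq_len, batch, embed_dim]`, so the input and output would each need a `transpose(0, 1)`.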


In [11]:
class VisionTransformer(nn.Module):
    def __init__(self, img_size=32, patch_size=4, in_channels=3, embed_dim=512, depth=6, num_heads=8, num_classes=10): # depth = # of transformer encoder layers; num_heads = num attention heads
        super(VisionTransformer, self).__init__()
        # TODO
        # Patch embedding block with positional encoding
        self.patch_embed = PatchEmbedding(patch_size, embed_dim)

        # Transformer encoder to capture high-level representations
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # MLP head: average pooling over patches (happens in fwd) then fully connected layer.
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes)
        )

    def forward(self, x):
        # TODO
        # Obtain patch embeddings: [B, num_patches, embed_dim]
        x = self.patch_embed(x)

        # Transformer encoder expects shape: [seq_len, B, embed_dim]
        x = x.transpose(0, 1)  # now shape: [num_patches, B, embed_dim]
        x = self.transformer_encoder(x)
        x = x.transpose(0, 1)  # back to shape: [B, num_patches, embed_dim]

        # Mean pooling over the patch dimension
        x = x.mean(dim=1)  # shape: [B, embed_dim]

        # MLP head for class predictions
        x = self.mlp_head(x)  # shape: [B, num_classes]
        return x

# Training Neural Networks (20 pts)
In this section, you will implement a `Trainer` class, use it to train the models that you defined previously, and evaluate them.

## Check CUDA and GPUs
The following code helps you check if CUDA is available and lists the available GPUs.

In [12]:
# Check if CUDA is available
if torch.cuda.is_available():
    # Get the number of available GPUs
    num_gpus = torch.cuda.device_count()
    print(f"Number of available GPUs: {num_gpus}")

    # Get the name of each GPU
    for i in range(num_gpus):
        gpu_name = torch.cuda.get_device_name(i)
        print(f"GPU {i}: {gpu_name}")

    # Set the current GPU device
    device = torch.cuda.current_device()
    print(f"Current GPU device: {device} - {torch.cuda.get_device_name(device)}")
else:
    print("CUDA is not available.")

Number of available GPUs: 1
GPU 0: NVIDIA GeForce RTX 2070 with Max-Q Design
Current GPU device: 0 - NVIDIA GeForce RTX 2070 with Max-Q Design


## Complete the Trainer Class (15 pts)
Fill in all the TODOs.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class Trainer:
    def __init__(self, dataset, net, optimizer, loss_function=nn.CrossEntropyLoss(),
                 device="cuda:0" if torch.cuda.is_available() else "cpu"):
        self.dataset = dataset
        self.net = net.to(device)
        self.lossFunction = loss_function
        self.optimizer = optimizer
        self.device = device

    def train_one_epoch(self):
        # TODO (5 pts): complete training loop
        self.net.train()
        total_loss = 0.0
        for x, y in self.dataset.train_dataloader:
            x, y = x.to(self.device), y.to(self.device)
            self.optimizer.zero_grad()
            outputs = self.net(x)
            loss = self.lossFunction(outputs, y)
            loss.backward()
            self.optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(self.dataset.train_dataloader)
        return avg_loss

    def compute_test_accuracy(self, path):
        # TODO (5 pts): compute classification accuracy based on test data
        self.net.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for x, y in self.dataset.test_dataloader:
                x, y = x.to(self.device), y.to(self.device)
                outputs = self.net(x)
                _, predicted = torch.max(outputs, dim=1)
                total += y.size(0)
                correct += (predicted == y).sum().item()
        accuracy = 100.0 * correct / total
        return accuracy

    def train(self, path, num_epochs=20):
        self.net.train()  # Set model to training mode
        best_accuracy = 0.0
        for epoch in range(num_epochs):
            # TODO (5 pts): print loss for every epoch, print test accuracy for every 5 epochs
            # Feel free to record the training process for analysis
            avg_loss = self.train_one_epoch()
            # Print training loss every epoch; print test accuracy every 5 epochs
            if (epoch + 1) % 5 == 0 or (epoch + 1) == num_epochs:
                test_acc = self.compute_test_accuracy(path)
                print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}, Test Accuracy: {test_acc:.2f}%")
                # Save the model if test accuracy improves
                if test_acc > best_accuracy:
                    best_accuracy = test_acc
                    torch.save(self.net.state_dict(), path)
                    print("Saved best model checkpoint.")
            else:
                print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}")

## Training (5 pts)
Follow these steps to train and evaluate your models (AlexNet, ResNet, and ViT):
* Create the model, the dataset, and the optimizer. We suggest using SGD with a learning rate of `1e-2`, but you are welcome to explore other options.
* Configure the trainer.
* Compute and print test accuracy before training.
* Train the model.
* Compute and print test accuracy after training.
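If a trainer saves a checkpoint only when accuracy improves, the weights in memory after the last epoch can differ from the best saved ones. Whether to report the last-epoch or the best-checkpoint accuracy is left open here; the sketch below only shows how a saved `state_dict` can be restored before a final evaluation, using a tiny stand-in model and a temporary file (both are assumptions for illustration).

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                  # tiny stand-in for AlexNet/ResNet/ViT
path = os.path.join(tempfile.mkdtemp(), "best.pth")

torch.save(model.state_dict(), path)     # checkpoint taken at the "best" epoch
saved_weight = model.weight.detach().clone()

with torch.no_grad():                    # simulate further training updates
    model.weight.add_(1.0)

# Restore the checkpointed weights before the final evaluation.
state = torch.load(path, map_location="cpu")
model.load_state_dict(state)
restored = torch.equal(model.weight, saved_weight)
print(restored)  # True
```

Note that picking a checkpoint by test accuracy makes the reported number optimistic; a held-out validation split is the usual remedy.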


In [21]:
# AlexNet train and evaluation
# TODO
dataset = CIFAR10Dataset(batch_size=128)
alexnet_model = AlexNet(num_classes=10)
print(f"Number of parameters: {count_parameters(alexnet_model)}")
optimizer_alex = optim.SGD(alexnet_model.parameters(), lr=1e-2, momentum=0.9)
trainer_alexnet = Trainer(dataset, alexnet_model, optimizer_alex)

initial_acc = trainer_alexnet.compute_test_accuracy("alexnet_best.pth")
print("AlexNet Initial Test Accuracy: {:.2f}%".format(initial_acc))
trainer_alexnet.train("alexnet_best.pth", num_epochs=20)
final_acc = trainer_alexnet.compute_test_accuracy("alexnet_best.pth")
print("AlexNet Final Test Accuracy: {:.2f}%".format(final_acc))

# ResNet train and evaluation
# TODO
dataset = CIFAR10Dataset(batch_size=128)
resnet_model = ResNet(num_classes=10)
print(f"Number of parameters: {count_parameters(resnet_model)}")
optimizer_resnet = optim.SGD(resnet_model.parameters(), lr=1e-2, momentum=0.9)
trainer_resnet = Trainer(dataset, resnet_model, optimizer_resnet)

initial_acc = trainer_resnet.compute_test_accuracy("resnet_best.pth")
print("ResNet Initial Test Accuracy: {:.2f}%".format(initial_acc))
trainer_resnet.train("resnet_best.pth", num_epochs=20)
final_acc = trainer_resnet.compute_test_accuracy("resnet_best.pth")
print("ResNet Final Test Accuracy: {:.2f}%".format(final_acc))

# ViT train and evaluation
# TODO
dataset = CIFAR10Dataset(batch_size=128)
vit_model = VisionTransformer(img_size=32, patch_size=4, in_channels=3, embed_dim=512, depth=6, num_heads=8, num_classes=10)
print(f"Number of parameters: {count_parameters(vit_model)}")
optimizer_vit = optim.SGD(vit_model.parameters(), lr=1e-2, momentum=0.9)
trainer_vit = Trainer(dataset, vit_model, optimizer_vit)

initial_acc = trainer_vit.compute_test_accuracy("vit_best.pth")
print("ViT Initial Test Accuracy: {:.2f}%".format(initial_acc))
trainer_vit.train("vit_best.pth", num_epochs=20)
final_acc = trainer_vit.compute_test_accuracy("vit_best.pth")
print("ViT Final Test Accuracy: {:.2f}%".format(final_acc))

Number of parameters: 35855178
AlexNet Initial Test Accuracy: 10.00%
Epoch 1/20 - Loss: 2.0498
Epoch 2/20 - Loss: 1.4887
Epoch 3/20 - Loss: 1.2185
Epoch 4/20 - Loss: 0.9925
Epoch 5/20 - Loss: 0.8320, Test Accuracy: 71.00%
Saved best model checkpoint.
Epoch 6/20 - Loss: 0.7032
Epoch 7/20 - Loss: 0.5953
Epoch 8/20 - Loss: 0.4998
Epoch 9/20 - Loss: 0.4090
Epoch 10/20 - Loss: 0.3213, Test Accuracy: 79.24%
Saved best model checkpoint.
Epoch 11/20 - Loss: 0.2427
Epoch 12/20 - Loss: 0.1700
Epoch 13/20 - Loss: 0.1213
Epoch 14/20 - Loss: 0.0808
Epoch 15/20 - Loss: 0.0650, Test Accuracy: 80.79%
Saved best model checkpoint.
Epoch 16/20 - Loss: 0.0464
Epoch 17/20 - Loss: 0.0388
Epoch 18/20 - Loss: 0.0301
Epoch 19/20 - Loss: 0.0358
Epoch 20/20 - Loss: 0.0225, Test Accuracy: 79.99%
AlexNet Final Test Accuracy: 79.99%
Number of parameters: 5771338
ResNet Initial Test Accuracy: 7.38%
Epoch 1/20 - Loss: 1.2467
Epoch 2/20 - Loss: 0.7317
Epoch 3/20 - Loss: 0.4917
Epoch 4/20 - Loss: 0.3145
Epoch 5/20 - Lo



Number of parameters: 18946666
ViT Initial Test Accuracy: 10.90%
Epoch 1/20 - Loss: 1.8558
Epoch 2/20 - Loss: 1.4660
Epoch 3/20 - Loss: 1.2994
Epoch 4/20 - Loss: 1.1695
Epoch 5/20 - Loss: 1.0827, Test Accuracy: 58.01%
Saved best model checkpoint.
Epoch 6/20 - Loss: 1.0209
Epoch 7/20 - Loss: 0.9421
Epoch 8/20 - Loss: 0.8788
Epoch 9/20 - Loss: 0.8265
Epoch 10/20 - Loss: 0.7617, Test Accuracy: 64.66%
Saved best model checkpoint.
Epoch 11/20 - Loss: 0.7089
Epoch 12/20 - Loss: 0.6598
Epoch 13/20 - Loss: 0.6002
Epoch 14/20 - Loss: 0.5535
Epoch 15/20 - Loss: 0.5092, Test Accuracy: 65.68%
Saved best model checkpoint.
Epoch 16/20 - Loss: 0.4667
Epoch 17/20 - Loss: 0.4236
Epoch 18/20 - Loss: 0.3884
Epoch 19/20 - Loss: 0.3525
Epoch 20/20 - Loss: 0.3193, Test Accuracy: 64.70%
ViT Final Test Accuracy: 64.70%


## Evaluation using Confusion Matrix (5 pts)
A confusion matrix is a fundamental tool for evaluating the performance of classification models. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class.

You are asked to evaluate your trained models by computing and printing the confusion matrix. You can either compute it yourself or use `sklearn.metrics.confusion_matrix()`.
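For the compute-it-yourself route, the row and column convention described above translates into a few lines of plain Python. This is a minimal sketch on made-up labels; the `confusion_matrix_from_labels` helper is a hypothetical name.

```python
def confusion_matrix_from_labels(y_true, y_pred, num_classes):
    """Rows index the actual class, columns index the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix_from_labels(y_true, y_pred, num_classes=3)
for row in cm:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 2]
```

The diagonal holds the correct predictions, so overall accuracy is the diagonal sum divided by the total count.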

In [None]:
# TODO
from sklearn.metrics import confusion_matrix
def compute_confusion_matrix(trainer):
    trainer.net.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for x, y in trainer.dataset.test_dataloader:
            x = x.to(trainer.device)
            outputs = trainer.net(x)
            _, predicted = torch.max(outputs, dim=1)
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(y.cpu().numpy())
    cm = confusion_matrix(all_labels, all_preds)
    print("Confusion Matrix:")
    print(cm)
    print("\n\n")

print("AlexNet Confusion Matrix:")
compute_confusion_matrix(trainer_alexnet)

print("ResNet Confusion Matrix:")
compute_confusion_matrix(trainer_resnet)

print("ViT Confusion Matrix:")
compute_confusion_matrix(trainer_vit)

AlexNet Confusion Matrix:
Confusion Matrix:
[[825  10  48  30  18   4   9   5  28  23]
 [ 11 889   9   7   3   1   7   1  22  50]
 [ 33   2 769  52  72  27  30  10   4   1]
 [ 15   2  69 715  58  69  41  20   5   6]
 [  8   0  57  50 827  12  22  21   3   0]
 [  5   3  68 249  41 598   9  22   1   4]
 [  2   2  38  60  32  11 847   3   3   2]
 [  6   1  40  51  64  27   3 801   1   6]
 [ 55  17  21  11   9   1   2   3 857  24]
 [ 18  54  11  14   5   2   7   5  13 871]]
ResNet Confusion Matrix:
Confusion Matrix:
[[847  13  31  13  13   2   7   9  42  23]
 [  9 914   1   4   3   1   4   1  14  49]
 [ 52   2 738  38  51  41  49  21   4   4]
 [ 13   3  56 679  42 123  45  20   8  11]
 [  9   3  49  46 811  15  29  31   6   1]
 [ 13   4  33 133  30 747   6  29   2   3]
 [  3   4  36  39  15  13 880   4   4   2]
 [ 12   1  16  22  36  36   6 864   1   6]
 [ 40  11   6   7   0   3   3   4 915  11]
 [ 18  44   3   9   1   2   3  10  12 898]]
ViT Confusion Matrix:
Confusion Matrix:
[[538  30  

## Observations (15 pts)
Write down your observations regarding the results you obtained throughout this assignment. Here are some suggestions:
* **Accuracy and Loss Curves**: Plot and compare the training and validation accuracy and loss curves for each model. This helps visualize how well each model is learning over time and whether they are overfitting or underfitting.
* **Top Misclassified Images**: Examine the classes that are most frequently misclassified by each model. This can provide insights into the types of images that are challenging for each model and may suggest areas for improvement.
* **Feature Visualization**: Visualize the feature maps or activations of intermediate layers in each CNN. This can help you understand what features or patterns each model is learning and whether they differ in terms of learned representations.
* **Robustness Testing**: Assess the robustness of each model by introducing noise, transformations, or adversarial examples to the test data. This can help identify which models are more resilient to perturbations.
* **Runtime and Resource Usage**: Compare the training time and resource usage (e.g., GPU memory) of each model.
* **Hyperparameter Tuning**: Analyze the impact of hyperparameters (learning rates, batch sizes, etc.) on training speed and convergence.
* **Model Size and Efficiency**: Analyze the trade-off between model size and accuracy for each model.
* **Ablation Studies**: Conduct ablation studies by removing or modifying specific components (e.g., dropout, batch normalization, etc.) of each model to understand their contributions to performance.

You don't need to follow them. Feel free to write down any observation you have, or to use tools like Tensorboard to support your observations. You are also welcome to give comments on the design of the assignment.
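As one way to act on the "most frequently misclassified classes" suggestion, a confusion matrix can be summarized numerically. The sketch below uses the AlexNet matrix printed earlier in this notebook and the CIFAR-10 class names from the dataset wrapper; the analysis choices (overall accuracy, per-class recall, largest off-diagonal entry) are just one reasonable option.

```python
classes = ("plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck")

# AlexNet confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = [
    [825, 10, 48, 30, 18, 4, 9, 5, 28, 23],
    [11, 889, 9, 7, 3, 1, 7, 1, 22, 50],
    [33, 2, 769, 52, 72, 27, 30, 10, 4, 1],
    [15, 2, 69, 715, 58, 69, 41, 20, 5, 6],
    [8, 0, 57, 50, 827, 12, 22, 21, 3, 0],
    [5, 3, 68, 249, 41, 598, 9, 22, 1, 4],
    [2, 2, 38, 60, 32, 11, 847, 3, 3, 2],
    [6, 1, 40, 51, 64, 27, 3, 801, 1, 6],
    [55, 17, 21, 11, 9, 1, 2, 3, 857, 24],
    [18, 54, 11, 14, 5, 2, 7, 5, 13, 871],
]

n = len(cm)
total = sum(sum(row) for row in cm)
accuracy = 100.0 * sum(cm[i][i] for i in range(n)) / total
recall = {classes[i]: cm[i][i] / sum(cm[i]) for i in range(n)}

# Largest off-diagonal entry: the most frequent single confusion.
count, actual, predicted = max(
    (cm[i][j], i, j) for i in range(n) for j in range(n) if i != j
)

print(f"accuracy: {accuracy:.2f}%")                                  # accuracy: 79.99%
print(f"lowest recall: {min(recall, key=recall.get)}")               # lowest recall: dog
print(f"{classes[actual]} -> {classes[predicted]}: {count} images")  # dog -> cat: 249 images
```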

## **TODO: write down your observations**

These results are fairly in line with my expectations. CNNs were more or less designed for this kind of classification task and dataset. The confusion matrices are cool to see, since I haven't used them before. They show that the CNNs, and ResNet in particular, have large counts on the diagonal, which reflects their good accuracy. When they struggle, it usually involves just one other class, which likely indicates some crossover or confusion between objects with similar visual features (e.g., birds/planes or dogs/cats). The ViT confusion matrix is a bit all over the place. The main things holding back the ViT are likely 1) not enough data to train on, and 2) only running 20 epochs. The CIFAR datasets only have 60,000 images each. For the ViT to perform well, millions of images would be desirable. There may be slight gains from hyperparameter tuning, but the real key for transformers here is the data. The local connectivity and weight sharing of the CNNs give them an edge on classification tasks with limited data.

The ViT was by far the most challenging to implement, particularly figuring out the math behind the Patch Embedding. Conceptually it made sense to me, but getting it to function as desired was definitely a challenge. It was helpful to have the paper cited for reference. It would have been cool to see the ViT perform well (for example, by having part of the assignment use a larger dataset), but I know that kind of setup can be impractical given the compute constraints of students. I may extend this work to visualize the patch embeddings for my own learning, since that was the section I struggled with most.

Thanks for putting together a solid assignment! 