# Assignment 4: Self-Attention for Vision

For this assignment, we're going to implement self-attention blocks in a convolutional neural network for CIFAR-10 Classification.

# Part I. Preparation

First, we load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

In [2]:
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
                T.ToTensor(),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ])

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./data/datasets', train=True, download=True,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./data/datasets', train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./data/datasets', train=False, download=True, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified


You have an option to **use GPU by setting the flag to True below**. It is not necessary to use GPU for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fallback to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment. 

In [3]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100

print('using device:', device)

using device: cuda


## Flatten Function

In [4]:
def flatten(x):
    N = x.shape[0] # read in N, C, H, W
    return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

def test_flatten():
    x = torch.arange(12).view(2, 1, 3, 2)
    print('Before flattening: ', x)
    print('After flattening: ', flatten(x))

test_flatten()

Before flattening:  tensor([[[[ 0,  1],
          [ 2,  3],
          [ 4,  5]]],


        [[[ 6,  7],
          [ 8,  9],
          [10, 11]]]])
After flattening:  tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]])


### Check Accuracy Function


In [5]:
import torch.nn.functional as F  # useful stateless functions
def check_accuracy(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
        return 100 * acc

### Training Loop

In [6]:
def train(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    acc_max = 0
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each  parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Epoch %d, Iteration %d, loss = %.4f' % (e, t, loss.item()))
                acc = check_accuracy(loader_val, model)
                if acc >= acc_max:
                    acc_max = acc
                print()
    print("Maximum accuracy attained: ", acc_max)

In [7]:
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

## Vanilla CNN; No Attention
We implement the vanilla architecture for you here. Do not modify the architecture. You will use the same architecture in the following parts. Do not modify the hyper-parameters.

In [10]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3
num_classes = 10

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, num_classes),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)


train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3174
Checking accuracy on validation set
Got 120 / 1000 correct (12.00)

Epoch 0, Iteration 100, loss = 1.8555
Checking accuracy on validation set
Got 417 / 1000 correct (41.70)

Epoch 0, Iteration 200, loss = 1.6521
Checking accuracy on validation set
Got 451 / 1000 correct (45.10)

Epoch 0, Iteration 300, loss = 1.8310
Checking accuracy on validation set
Got 477 / 1000 correct (47.70)

Epoch 0, Iteration 400, loss = 1.1676
Checking accuracy on validation set
Got 497 / 1000 correct (49.70)

Epoch 0, Iteration 500, loss = 1.4664
Checking accuracy on validation set
Got 531 / 1000 correct (53.10)

Epoch 0, Iteration 600, loss = 1.1012
Checking accuracy on validation set
Got 554 / 1000 correct (55.40)

Epoch 0, Iteration 700, loss = 1.5144
Checking accuracy on validation set
Got 553 / 1000 correct (55.30)

Epoch 1, Iteration 0, loss = 1.1416
Checking accuracy on validation set
Got 560 / 1000 correct (56.00)

Epoch 1, Iteration 100, loss = 1.0251
Checking acc

Epoch 9, Iteration 600, loss = 0.3445
Checking accuracy on validation set
Got 603 / 1000 correct (60.30)

Epoch 9, Iteration 700, loss = 0.3826
Checking accuracy on validation set
Got 586 / 1000 correct (58.60)

Maximum accuracy attained:  63.3


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.
You should be able to see atleast 55% accuracy

In [11]:
vanillaModel = model
check_accuracy(loader_test, vanillaModel)


Checking accuracy on test set
Got 5784 / 10000 correct (57.84)


57.84

## Part II Self-Attention

In the next section, you will implement an Attention layer which you will then use within a convnet architecture defined above for cifar 10 classification task.

A self-attention layer is formulated as following:

Input: $X$ of shape $(H\times W, C)$

Query, key, value linear transforms are $W_Q$, $W_K$, $W_V$, of shape $(C, C)$. We implement these linear transforms as 1x1 convolutional layers of the same dimensions.

$XW_Q$, $XW_K$, $XW_V$, represent the output volumes when input X is passed through the transforms.


Self-Attention is given by the formula: $Attention(X) = X + Softmax(\frac{XW_Q(XW_K)^\top}{\sqrt{C}})XW_V$

### Inline Question 1: Self-Attention is equivalent to which of the following: (5 points)
1. K-means clustering <br />
2. Non-local means <br />
3. Residual Block <br />
4. Gaussian Blurring <br />

Your Answer: Non-local means. Because self-attention is a mechanism where each element of a sequence attends to all other elements, computing a weighted sum based on a similarity measure (such as dot product). This is similar to the Non-local means algorithm, which enhances each pixel based on a weighted average of all other pixels, with weights determined by the similarity between local patches. Hence, Self-Attention is conceptually equivalent to Non-local means.

### Here you implement the Attention module, and run it in the next section (40 points)

In [12]:
# Initialize the attention module as a nn.Module subclass
class Attention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        
        # TODO: Implement the Key, Query and Value linear transforms as 1x1 convolutional layers
        # Hint: channel size remains constant throughout
        self.conv_query = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.conv_key = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.conv_value = nn.Conv2d(in_channels, in_channels, kernel_size=1)

    def forward(self, x):
        N, C, H, W = x.shape
        
        # TODO: Pass the input through conv_query, reshape the output volume to (N, C, H*W)
        q = self.conv_query(x)
        q = q.reshape(N, C, H*W)
        # TODO: Pass the input through conv_key, reshape the output volume to (N, C, H*W)
        k = self.conv_key(x)
        k = k.reshape(N, C, H*W)
        
        # TODO: Pass the input through conv_value, reshape the output volume to (N, C, H*W)
        v = self.conv_value(x)
        v = v.reshape(N, C, H*W)
        # TODO: Implement the above formula for attention using q, k, v, C
        # NOTE: The X in the formula is already added for you in the return line
        temp=torch.sqrt(torch.Tensor([C])).to('cuda')
        attention =F.softmax(q @ torch.transpose(k,1,2)/temp, dim=-1)
        attention=attention@v
        # Reshape the output to (N, C, H, W) before adding to the input volume
        attention = attention.reshape(N, C, H, W)
        return x + attention

## Single Attention Block: Early attention; After the first conv layer. (10 points)

In [21]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the first Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3,channel_1,3,padding=1,stride=1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    nn.Conv2d(channel_1,channel_2,3,padding=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32,10),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3116
Checking accuracy on validation set
Got 145 / 1000 correct (14.50)

Epoch 0, Iteration 100, loss = 1.7059
Checking accuracy on validation set
Got 378 / 1000 correct (37.80)

Epoch 0, Iteration 200, loss = 1.4474
Checking accuracy on validation set
Got 480 / 1000 correct (48.00)

Epoch 0, Iteration 300, loss = 1.4143
Checking accuracy on validation set
Got 506 / 1000 correct (50.60)

Epoch 0, Iteration 400, loss = 1.6302
Checking accuracy on validation set
Got 542 / 1000 correct (54.20)

Epoch 0, Iteration 500, loss = 1.3633
Checking accuracy on validation set
Got 561 / 1000 correct (56.10)

Epoch 0, Iteration 600, loss = 1.0971
Checking accuracy on validation set
Got 560 / 1000 correct (56.00)

Epoch 0, Iteration 700, loss = 1.1583
Checking accuracy on validation set
Got 566 / 1000 correct (56.60)

Epoch 1, Iteration 0, loss = 1.2048
Checking accuracy on validation set
Got 601 / 1000 correct (60.10)

Epoch 1, Iteration 100, loss = 0.8910
Checking acc

Epoch 9, Iteration 600, loss = 0.0750
Checking accuracy on validation set
Got 612 / 1000 correct (61.20)

Epoch 9, Iteration 700, loss = 0.0796
Checking accuracy on validation set
Got 609 / 1000 correct (60.90)

Maximum accuracy attained:  66.10000000000001


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.
You should see improvement of about 2-3% over the vanilla convnet model. * Use this part to tune your Attention module and then move on to the next parts. *

In [22]:
earlyAttention = model
check_accuracy(loader_test, earlyAttention)

Checking accuracy on test set
Got 6151 / 10000 correct (61.51)


61.51

## Single Attention Block: Late attention; After the second conv layer. (10 points)

In [28]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the Second Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3,channel_1,3,padding=1,stride=1),
    nn.ReLU(),
    
    
    nn.Conv2d(channel_1,channel_2,3,padding=1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32,10),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3026
Checking accuracy on validation set
Got 130 / 1000 correct (13.00)

Epoch 0, Iteration 100, loss = 1.7085
Checking accuracy on validation set
Got 360 / 1000 correct (36.00)

Epoch 0, Iteration 200, loss = 1.7219
Checking accuracy on validation set
Got 451 / 1000 correct (45.10)

Epoch 0, Iteration 300, loss = 1.3964
Checking accuracy on validation set
Got 499 / 1000 correct (49.90)

Epoch 0, Iteration 400, loss = 1.2129
Checking accuracy on validation set
Got 539 / 1000 correct (53.90)

Epoch 0, Iteration 500, loss = 1.1482
Checking accuracy on validation set
Got 531 / 1000 correct (53.10)

Epoch 0, Iteration 600, loss = 1.2951
Checking accuracy on validation set
Got 573 / 1000 correct (57.30)

Epoch 0, Iteration 700, loss = 1.2179
Checking accuracy on validation set
Got 549 / 1000 correct (54.90)

Epoch 1, Iteration 0, loss = 1.0460
Checking accuracy on validation set
Got 570 / 1000 correct (57.00)

Epoch 1, Iteration 100, loss = 1.1031
Checking acc

Epoch 9, Iteration 600, loss = 0.2182
Checking accuracy on validation set
Got 635 / 1000 correct (63.50)

Epoch 9, Iteration 700, loss = 0.2719
Checking accuracy on validation set
Got 633 / 1000 correct (63.30)

Maximum accuracy attained:  67.60000000000001


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [29]:
lateAttention = model
check_accuracy(loader_test, lateAttention)

Checking accuracy on test set
Got 6149 / 10000 correct (61.49)


61.49

### Inline Question 2: Provide one example each of usage of self-attention and attention in computer vision. Explain the difference between the two. (5 points)


Your Answer:
#### Example of Self-Attention in Computer Vision:
Vision Transformers (ViT)
#### Example of Attention in Computer Vision:
Attention Mechanism in Image Captioning

#### Difference Between Self-Attention and Attention:
Self-Attention:<br />
Self-attention computes the relationship between all pairs of elements in the input sequence, attending to all parts of the sequence simultaneously. In the context of computer vision, self-attention allows the model to consider all patches of the image simultaneously, capturing long-range dependencies and global context.<br />
<br />
Attention: 
<br />Attention generally refers to mechanisms that allow a model to focus on different parts of the input sequence (or input image) selectively when making predictions. It typically involves two different sequences: an encoder sequence (source) and a decoder sequence (target), where the attention mechanism aligns elements of the target sequence with relevant parts of the source sequence.
<br />

In summary, self-attention is a specific type of attention where the model attends to all parts of a single input simultaneously, while attention typically involves focusing on specific parts of the input relevant to generating a particular part of the output, often in an encoder-decoder context.

## Double Attention Blocks: After conv layers 1 and 2 (10 points)

In [34]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the Second Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3,channel_1,3,padding=1,stride=1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    
    nn.Conv2d(channel_1,channel_2,3,padding=1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32,10),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.2951
Checking accuracy on validation set
Got 112 / 1000 correct (11.20)

Epoch 0, Iteration 100, loss = 1.6876
Checking accuracy on validation set
Got 388 / 1000 correct (38.80)

Epoch 0, Iteration 200, loss = 1.5639
Checking accuracy on validation set
Got 461 / 1000 correct (46.10)

Epoch 0, Iteration 300, loss = 1.1586
Checking accuracy on validation set
Got 491 / 1000 correct (49.10)

Epoch 0, Iteration 400, loss = 1.5346
Checking accuracy on validation set
Got 514 / 1000 correct (51.40)

Epoch 0, Iteration 500, loss = 1.2276
Checking accuracy on validation set
Got 518 / 1000 correct (51.80)

Epoch 0, Iteration 600, loss = 1.0638
Checking accuracy on validation set
Got 528 / 1000 correct (52.80)

Epoch 0, Iteration 700, loss = 1.2492
Checking accuracy on validation set
Got 547 / 1000 correct (54.70)

Epoch 1, Iteration 0, loss = 1.1985
Checking accuracy on validation set
Got 561 / 1000 correct (56.10)

Epoch 1, Iteration 100, loss = 1.5013
Checking acc

Epoch 9, Iteration 600, loss = 0.1328
Checking accuracy on validation set
Got 617 / 1000 correct (61.70)

Epoch 9, Iteration 700, loss = 0.2752
Checking accuracy on validation set
Got 617 / 1000 correct (61.70)

Maximum accuracy attained:  65.7


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [35]:
vanillaModel = model
check_accuracy(loader_test, vanillaModel)

Checking accuracy on test set
Got 6248 / 10000 correct (62.48)


62.480000000000004

## Resnet with Attention 

Now we will experiment with applying attention within the Resnet10 architecture that we implemented in Homework 2. Please note that for a deeper model such as Resnet we do not expect significant improvements in performance with Attention

## Vanilla Resnet, No Attention

The architecture for Resnet is given below, please train it and evaluate it on the test set.

In [36]:
import torch
import torch.nn as nn

class ResNet(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNet, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)

        return x


        

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)
    


def ResNet10(num_classes = 100, batchnorm= False):

    return ResNet(block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [37]:
learning_rate = 1e-3

model = ResNet10()

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

vanillaResnet = model
check_accuracy(loader_test, vanillaResnet)

Epoch 0, Iteration 0, loss = 4.5391
Checking accuracy on validation set
Got 98 / 1000 correct (9.80)

Epoch 0, Iteration 100, loss = 1.4879
Checking accuracy on validation set
Got 406 / 1000 correct (40.60)

Epoch 0, Iteration 200, loss = 1.4594
Checking accuracy on validation set
Got 470 / 1000 correct (47.00)

Epoch 0, Iteration 300, loss = 1.2715
Checking accuracy on validation set
Got 444 / 1000 correct (44.40)

Epoch 0, Iteration 400, loss = 1.3083
Checking accuracy on validation set
Got 504 / 1000 correct (50.40)

Epoch 0, Iteration 500, loss = 1.2812
Checking accuracy on validation set
Got 460 / 1000 correct (46.00)

Epoch 0, Iteration 600, loss = 1.0965
Checking accuracy on validation set
Got 548 / 1000 correct (54.80)

Epoch 0, Iteration 700, loss = 0.8592
Checking accuracy on validation set
Got 587 / 1000 correct (58.70)

Epoch 1, Iteration 0, loss = 1.0434
Checking accuracy on validation set
Got 620 / 1000 correct (62.00)

Epoch 1, Iteration 100, loss = 0.8645
Checking accur

Epoch 9, Iteration 600, loss = 0.3564
Checking accuracy on validation set
Got 756 / 1000 correct (75.60)

Epoch 9, Iteration 700, loss = 0.3299
Checking accuracy on validation set
Got 760 / 1000 correct (76.00)

Maximum accuracy attained:  78.10000000000001
Checking accuracy on test set
Got 7341 / 10000 correct (73.41)


73.41

## Resnet with Attention (5 points)

In [39]:
class ResNetwithAttention(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNetwithAttention, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.attention = Attention(128)  # Add attention module here
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.attention(x)  # Apply attention here
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)

        return x


        

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)
    


def ResNet10withAttention(num_classes = 100, batchnorm= False):

    return ResNetwithAttention(block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)

In [40]:
## Resnet with Attention

learning_rate = 1e-3

# TODO: Use the above Attention module after the 2nd resnet block i.e. after self.layer2.

model = ResNet10withAttention()

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 4.6237
Checking accuracy on validation set
Got 114 / 1000 correct (11.40)

Epoch 0, Iteration 100, loss = 1.4172
Checking accuracy on validation set
Got 418 / 1000 correct (41.80)

Epoch 0, Iteration 200, loss = 1.6484
Checking accuracy on validation set
Got 436 / 1000 correct (43.60)

Epoch 0, Iteration 300, loss = 1.5618
Checking accuracy on validation set
Got 479 / 1000 correct (47.90)

Epoch 0, Iteration 400, loss = 1.0918
Checking accuracy on validation set
Got 549 / 1000 correct (54.90)

Epoch 0, Iteration 500, loss = 1.2874
Checking accuracy on validation set
Got 531 / 1000 correct (53.10)

Epoch 0, Iteration 600, loss = 1.1581
Checking accuracy on validation set
Got 548 / 1000 correct (54.80)

Epoch 0, Iteration 700, loss = 1.2388
Checking accuracy on validation set
Got 618 / 1000 correct (61.80)

Epoch 1, Iteration 0, loss = 1.2048
Checking accuracy on validation set
Got 555 / 1000 correct (55.50)

Epoch 1, Iteration 100, loss = 1.2558
Checking acc

Epoch 9, Iteration 600, loss = 0.3513
Checking accuracy on validation set
Got 770 / 1000 correct (77.00)

Epoch 9, Iteration 700, loss = 0.1209
Checking accuracy on validation set
Got 765 / 1000 correct (76.50)

Maximum accuracy attained:  78.5


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [41]:
AttentionResnet = model
check_accuracy(loader_test, AttentionResnet)

Checking accuracy on test set
Got 7647 / 10000 correct (76.47)


76.47

## Inline Question 3: Rank the above models based on their performance on test dataset (15 points)
( You are encouraged to run each of the experiments (training) at
least 3 times to get an average estimate )

Report the test accuracies alongside the model names. For example, 1. Vanilla CNN (57.45%, 57.99%).. etc

1. Vanilla CNN; No Attention (57.87%, 58.47%, 57.84%), Avg: 58.06%<br />
2. Vanilla CNN; Single Attention Block: Early attention (60.12%, 60.09%, 61.23%), Avg: 60.48%<br />
3. Vanilla CNN; Single Attention Block: Late attention(59.3%, 59.10%, 61.49%), Avg: 59.96%<br />
4. Vanilla CNN; Double Attention Blocks (62.25%, 62.74%, 62.48%), Avg: 62.49%<br />
5. ResNet10 (74.91%, 75.01%, 73.41%), Avg: 74.44%<br />
6. ResNet10withAttention (76.73%, 76.21%, 76.67%), Avg: 76.53%

<br />
Rank: ResNet10withAttention>ResNet10>Vanilla CNN; Double Attention Blocks>Vanilla CNN; Single Attention Block: Early attention>Vanilla CNN; Single Attention Block: Late attention>Vanilla CNN; No Attention

### Bonus Question (Ungraded): Can you give a possible explanation that supports the rankings?
Your Answer:
1. ResNet10 with Attention
ResNet Architecture: ResNet (Residual Network) is known for its skip connections, which help in training very deep networks by mitigating the vanishing gradient problem.
Attention Mechanism: Attention mechanisms improve model performance by allowing the network to focus on important parts of the input. Combining ResNet with attention mechanisms enhances its ability to learn and generalize from the data.
2. ResNet10
ResNet Architecture: Even without attention, the ResNet architecture is very powerful due to its deep layers and skip connections, which facilitate better feature extraction and gradient flow during training.
3. Vanilla CNN (Double Attention Blocks)
Double Attention Blocks: Adding two attention blocks to a standard CNN helps the network to focus on important features at multiple stages, significantly improving its accuracy compared to a vanilla CNN without attention.
4. Vanilla CNN (Single Attention Block: Early Attention)
Single Attention Block (Early): Placing an attention block early in the network helps to enhance the learning of initial features. This is beneficial, but not as much as having multiple attention blocks.
5. Vanilla CNN (Single Attention Block: Late Attention)
Single Attention Block (Late): Placing an attention block later in the network also aids in focusing on important features, but it might not be as effective as early attention because the initial features might not be as well-learned.
6. Vanilla CNN (No Attention)
No Attention Mechanism: A vanilla CNN without any attention mechanism relies solely on its convolutional layers to learn features. This makes it the least effective among the listed models as it lacks the ability to dynamically focus on the most relevant parts of the input.