## ResNet Paper Implementation from Scratch with PyTorch on CIFAR-10 Dataset

# [Link to my Youtube Video Explaining this whole Notebook](https://www.youtube.com/watch?v=P8U1VL93jzA&list=PLxqBkZuBynVRX6QExfPyzRGj5Ap_zmcAJ&index=6)

[![Imgur](https://imgur.com/NhVhb4u.png)](https://www.youtube.com/watch?v=P8U1VL93jzA&list=PLxqBkZuBynVRX6QExfPyzRGj5Ap_zmcAJ&index=6)

---

The below comments are taken from the original paper from implementing ResNet on CIFAR-10 Dataset. And I will follow this structure for this implementation of ResNet on CIFAR10.

"We conducted more studies on the CIFAR-10 dataset
which consists of 50k training images and 10k test-
ing images in 10 classes.

 The network inputs are 32×32 images, with
the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively,
with 2n layers for each feature map size. The numbers of
filters are {16, 32, 64} respectively.

The subsampling is performed by convolutions with a stride of 2. The network ends
with a global average pooling, a 10-way fully-connected
layer, and softmax.

There are totally 6n+2 stacked weighted layers.

We compare n = {3, 5, 7, 9}, leading to 20, 32, 44, and 56-layer networks.

When shortcut connections are used, they are connected
to the pairs of 3×3 layers (totally 3n shortcuts).

On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts."
---

For this example here in this file, I have used n=9. So my ResNet blocks are like [9, 9, 9]

And thats why the total layers are 56 (i.e. 9 * 6 + 2)

![Imgur](https://imgur.com/ifD8qbd.png)



In [1]:
import os
import shutil
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import transforms, datasets
from torchsummary import summary
from torch.utils.data import Dataset, DataLoader, random_split

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!nvidia-smi

Sat Mar  5 13:11:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
class LambdaLayer(nn.Module):
    
    def __init__(self, lambd):
        super(LambdaLayer, self).__init__()
        self.lambd = lambd
    
    def forward(self, x):
        return self.lambd(x)

class BasicConvBlock(nn.Module):
    
    ''' The BasicConvBlock takes an input with in_channels, applies some blocks of convolutional layers 
    to reduce it to out_channels and sum it up to the original input. 
    If their sizes mismatch, then the input goes into an identity. 
    
    Basically The BasicConvBlock will implement the regular basic Conv Block + 
    the shortcut block that does the dimension matching job (option A or B) when dimension changes between 2 blocks
    '''
    
    def __init__(self, in_channels, out_channels, stride=1, option='A'):
        super(BasicConvBlock, self).__init__()
        
        self.features = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)),
            ('bn1', nn.BatchNorm2d(out_channels)),
            ('act1', nn.ReLU()),
            ('conv2', nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)),
            ('bn2', nn.BatchNorm2d(out_channels))
        ]))

        self.shortcut = nn.Sequential()
        
        '''  When input and output spatial dimensions don't match, we have 2 options, with stride:
            - A) Use identity shortcuts with zero padding to increase channel dimension.    
            - B) Use 1x1 convolution to increase channel dimension (projection shortcut).
         '''
        if stride != 1 or in_channels != out_channels:
            if option == 'A':
                # Use identity shortcuts with zero padding to increase channel dimension.
                pad_to_add = out_channels//4
                ''' ::2 is doing the job of stride = 2
                F.pad apply padding to (W,H,C,N).
                
                The padding lengths are specified in reverse order of the dimensions,
                F.pad(x[:, :, ::2, ::2], (0,0, 0,0, pad,pad, 0,0))

                [width_beginning, width_end, height_beginning, height_end, channel_beginning, channel_end, batchLength_beginning, batchLength_end ]

                '''
                self.shortcut = LambdaLayer(lambda x:
                            F.pad(x[:, :, ::2, ::2], (0,0, 0,0, pad_to_add, pad_to_add, 0,0)))
            if option == 'B':
                self.shortcut = nn.Sequential(OrderedDict([
                    ('s_conv1', nn.Conv2d(in_channels, 2*out_channels, kernel_size=1, stride=stride, padding=0, bias=False)),
                    ('s_bn1', nn.BatchNorm2d(2*out_channels))
                ]))
        
    def forward(self, x):
        out = self.features(x)
        # sum it up with shortcut layer
        out += self.shortcut(x)
        out = F.relu(out)
        return out



### Explanations on using Option A and B in below code

```py

if stride != 1 or in_channels != out_channels:
            if option == 'A':
                pad = out_channels//4
                # ::2 replace the stride 2 + F.pad apply padding to (W,H,C,N).
                self.shortcut = LambdaLayer(lambda x:
                            F.pad(x[:, :, ::2, ::2], (0,0, 0,0, pad,pad, 0,0)))
            if option == 'B':
                self.shortcut = nn.Sequential(OrderedDict([
                    ('s_conv1', nn.Conv2d(in_channels, 2*out_channels, kernel_size=1, stride=stride, padding=0, bias=False)),
                    ('s_bn1', nn.BatchNorm2d(2*out_channels))
                ]))

```

As per the original Paper

#### We use identity shortcuts when input and output channel dimensions are the same.

#### Otherwise, When input and output spatial dimensions don't match, we have 2 options, with stride:

    - A) Use identity shortcuts with zero padding to increase channel dimension.

    - B) Use 1x1 convolution to increase channel dimension (projection shortcut).

-----------------------

### Understanding `F.pad` on a 4-D Tensor and the following line

### `F.pad(x[:, :, ::2, ::2], (0,0, 0,0, pad,pad, 0,0)))`

https://stackoverflow.com/a/61945903/1902852

The padding lengths are specified in reverse order of the dimensions, where every dimension has two values, one for the padding at the beginning and one for the padding at the end.

For an image with the dimensions `[channels, height, width]` the padding is given as:

`[width_beginning, width_end, height_beginning, height_end, channels_beginning, channels_end]`,

which can be reworded to

`[left, right, top, bottom]`

Therefore the code above pads the images to the right and bottom. The channels are left out, because they are not being padded, which also means that the same padding could be directly applied to the masks.

So the below line means

`F.pad(x[:, :, ::2, ::2], (0,0, 0,0, pad,pad, 0,0))`


`[width_beginning, width_end, height_beginning, height_end, channel_beginning, channel_end, batchLength_beginning, batchLength_end ]`

In [5]:

class ResNet(nn.Module):
    """
        ResNet-56 architecture for CIFAR-10 Dataset of shape 32*32*3
    """
    def __init__(self, block_type, num_blocks):
        super(ResNet, self).__init__()
        
        self.in_channels = 16
        
        self.conv0 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn0 = nn.BatchNorm2d(16)
        
        self.block1 = self.__build_layer(block_type, 16, num_blocks[0], starting_stride=1)
        
        self.block2 = self.__build_layer(block_type, 32, num_blocks[1], starting_stride=2)
        
        self.block3 = self.__build_layer(block_type, 64, num_blocks[2], starting_stride=2)
        
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.linear = nn.Linear(64, 10)
    
    def __build_layer(self, block_type, out_channels, num_blocks, starting_stride):
        
        strides_list_for_current_block = [starting_stride] + [1]*(num_blocks-1)
        ''' Above line will generate an array whose first element is starting_stride
        And it will have (num_blocks-1) more elements each of value 1
         '''
        # print('strides_list_for_current_block ', strides_list_for_current_block)
        
        layers = []
        
        for stride in strides_list_for_current_block:
            layers.append(block_type(self.in_channels, out_channels, stride))
            self.in_channels = out_channels
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        out = F.relu(self.bn0(self.conv0(x)))
        out = self.block1(out)
        out = self.block2(out)        
        out = self.block3(out)
        out = self.avgpool(out)
        out = torch.flatten(out, 1)
        out = self.linear(out)
        return out

### _build_layer() method

In ResNet Every layer downsamples the input at the start using stride equals to 2 i.e for 1st convolutional layer in 1st block of a layer.

If we look at the first operation of each layer, we see that the stride used at that first one is 2, instead of 1 like for the rest of them.

This is because, here in ResNet, reduction between layers is achieved by an increase on the stride, from 1 to 2, at the first convolution of each layer; instead of by a pooling operation, which we are used to see as down samplers.

Quoting from Paper

" For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2."

In [6]:
def ResNet56():
    return ResNet(block_type=BasicConvBlock, num_blocks=[9,9,9])

In [7]:
model = ResNet56()
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# device = 'cpu'
model.to(device)
summary(model, (3, 32, 32))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 16, 32, 32]             432
       BatchNorm2d-2           [-1, 16, 32, 32]              32
            Conv2d-3           [-1, 16, 32, 32]           2,304
       BatchNorm2d-4           [-1, 16, 32, 32]              32
              ReLU-5           [-1, 16, 32, 32]               0
            Conv2d-6           [-1, 16, 32, 32]           2,304
       BatchNorm2d-7           [-1, 16, 32, 32]              32
    BasicConvBlock-8           [-1, 16, 32, 32]               0
            Conv2d-9           [-1, 16, 32, 32]           2,304
      BatchNorm2d-10           [-1, 16, 32, 32]              32
             ReLU-11           [-1, 16, 32, 32]               0
           Conv2d-12           [-1, 16, 32, 32]           2,304
      BatchNorm2d-13           [-1, 16, 32, 32]              32
   BasicConvBlock-14           [-1, 16,

## Loading CIFAR-10 Dataset

In [8]:
def dataloader_cifar():
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize(mean=[0.5], std=[0.5])])
            
    # Input Data in Local Machine
    # train_dataset = datasets.CIFAR10('../input_data', train=True, download=True, transform=transform)
    # test_dataset = datasets.CIFAR10('../input_data', train=False, download=True, transform=transform)
    
    # Input Data in Google Drive
    train_dataset = datasets.CIFAR10('/content/drive/MyDrive/All_Datasets/CIFAR10', train=True, download=True, transform=transform)
    test_dataset = datasets.CIFAR10('/content/drive/MyDrive/All_Datasets/CIFAR10', train=False, download=True, transform=transform)

    # Split dataset into training set and validation set.
    train_dataset, val_dataset = random_split(train_dataset, (45000, 5000))
    
    print("Image shape of a random sample image : {}".format(train_dataset[0][0].numpy().shape), end = '\n\n')
    
    print("Training Set:   {} images".format(len(train_dataset)))
    print("Validation Set:   {} images".format(len(val_dataset)))
    print("Test Set:       {} images".format(len(test_dataset)))
    
    BATCH_SIZE = 32

    # Generate dataloader
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=10000, shuffle=True)
    
    return train_loader, val_loader, test_loader

In [9]:
train_loader, val_loader, test_loader = dataloader_cifar()

Files already downloaded and verified
Files already downloaded and verified
Image shape of a random sample image : (3, 32, 32)

Training Set:   45000 images
Validation Set:   5000 images
Test Set:       10000 images


## Start Actual Training

In [10]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [15]:
def train_model():
    EPOCHS = 15
    train_samples_num = 45000
    val_samples_num = 5000
    train_costs, val_costs = [], []
    
    #Training phase.    
    for epoch in range(EPOCHS):

        train_running_loss = 0
        correct_train = 0
        
        model.train().cuda()
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            """ for every mini-batch during the training phase, we typically want to explicitly set the gradients 
            to zero before starting to do backpropragation """
            optimizer.zero_grad()
            
            # Start the forward pass
            prediction = model(inputs)
                        
            loss = criterion(prediction, labels)
          
            # do backpropagation and update weights with step()
            loss.backward()         
            optimizer.step()
            
            # print('outputs on which to apply torch.max ', prediction)
            # find the maximum along the rows, use dim=1 to torch.max()
            _, predicted_outputs = torch.max(prediction.data, 1)
            
            # Update the running corrects 
            correct_train += (predicted_outputs == labels).float().sum().item()
            
            ''' Compute batch loss
            multiply each average batch loss with batch-length. 
            The batch-length is inputs.size(0) which gives the number total images in each batch. 
            Essentially I am un-averaging the previously calculated Loss '''
            train_running_loss += (loss.data.item() * inputs.shape[0])


        train_epoch_loss = train_running_loss / train_samples_num
        
        train_costs.append(train_epoch_loss)
        
        train_acc =  correct_train / train_samples_num

        # Now check trained weights on the validation set
        val_running_loss = 0
        correct_val = 0
      
        model.eval().cuda()
    
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)

                # Forward pass.
                prediction = model(inputs)

                # Compute the loss.
                loss = criterion(prediction, labels)

                # Compute validation accuracy.
                _, predicted_outputs = torch.max(prediction.data, 1)
                correct_val += (predicted_outputs == labels).float().sum().item()

            # Compute batch loss.
            val_running_loss += (loss.data.item() * inputs.shape[0])

            val_epoch_loss = val_running_loss / val_samples_num
            val_costs.append(val_epoch_loss)
            val_acc =  correct_val / val_samples_num
        
        info = "[Epoch {}/{}]: train-loss = {:0.6f} | train-acc = {:0.3f} | val-loss = {:0.6f} | val-acc = {:0.3f}"
        
        print(info.format(epoch+1, EPOCHS, train_epoch_loss, train_acc, val_epoch_loss, val_acc))
        
        torch.save(model.state_dict(), '/content/checkpoint_gpu_{}'.format(epoch + 1)) 
                                                                
    torch.save(model.state_dict(), '/content/resnet-56_weights_gpu')  
        
    return train_costs, val_costs

    

In [16]:
# !pwd
train_costs, val_costs = train_model()

[Epoch 1/15]: train-loss = 0.876177 | train-acc = 0.692 | val-loss = 0.001271 | val-acc = 0.734
[Epoch 2/15]: train-loss = 0.694989 | train-acc = 0.759 | val-loss = 0.002828 | val-acc = 0.769
[Epoch 3/15]: train-loss = 0.580110 | train-acc = 0.800 | val-loss = 0.000333 | val-acc = 0.778
[Epoch 4/15]: train-loss = 0.492105 | train-acc = 0.829 | val-loss = 0.001201 | val-acc = 0.780
[Epoch 5/15]: train-loss = 0.416747 | train-acc = 0.854 | val-loss = 0.001592 | val-acc = 0.810
[Epoch 6/15]: train-loss = 0.351747 | train-acc = 0.877 | val-loss = 0.000720 | val-acc = 0.784
[Epoch 7/15]: train-loss = 0.297035 | train-acc = 0.896 | val-loss = 0.001334 | val-acc = 0.794
[Epoch 8/15]: train-loss = 0.248202 | train-acc = 0.912 | val-loss = 0.000497 | val-acc = 0.823
[Epoch 9/15]: train-loss = 0.206233 | train-acc = 0.927 | val-loss = 0.001534 | val-acc = 0.814
[Epoch 10/15]: train-loss = 0.169824 | train-acc = 0.940 | val-loss = 0.000273 | val-acc = 0.815
[Epoch 11/15]: train-loss = 0.146284 | 

In [17]:
#Restore the model.
model = ResNet56()
model.load_state_dict(torch.load('/content/resnet-56_weights_gpu'))

<All keys matched successfully>

## Test the trained model on Test dataset

In [18]:
test_samples_num = 10000
correct = 0 

model.eval().cuda()

with  torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        # Make predictions.
        prediction = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / test_samples_num
print('Test accuracy: {}'.format(test_accuracy))

Test accuracy: 0.8344
