# Creating a Toy ResNet from scratch

ResNet, short for Residual Network, is a deep learning architecture that introduced the concept of residual learning. It was first presented by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their landmark 2015 paper "Deep Residual Learning for Image Recognition". ResNet made it practical to train very deep neural networks, which had previously been difficult to optimize because of issues like vanishing or exploding gradients.

In this notebook, I build a toy ResNet from scratch to better understand how it works.

## Importing Libraries

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F

## Loading Data and Transformation
First, I load the CIFAR-10 training and test sets from torchvision and apply a sequence of transformations to prepare the images for training.

### Why Choose (0.5, 0.5, 0.5) for Normalize?

1. Centered Around Zero:

When images are converted to tensors using transforms.ToTensor(), the pixel values are scaled from the range [0, 255] (typical for raw 8-bit images) to the range [0, 1].
transforms.Normalize then computes (x - mean) / std for each channel, so choosing mean = 0.5 and std = 0.5 maps the range [0, 1] onto [-1, 1].

This centers the pixel values around 0. Zero-centered inputs generally help neural networks converge faster and more stably during training.

2. Consistency for RGB Channels:

In simple examples, we assume that the pixel values for the Red, Green, and Blue (RGB) channels are distributed roughly the same. Therefore, using (0.5, 0.5, 0.5) for the mean and std applies the same normalization to all three channels equally.
This assumption works well for generic images, where the pixel values for each channel might follow a similar distribution.

3. Simplicity:

Choosing mean = 0.5 and std = 0.5 makes the normalization straightforward, especially in cases where you don’t have access to the dataset’s specific statistics (like dataset mean and standard deviation). It's a practical default for experiments or small-scale projects, especially when the focus is on understanding the network behavior rather than optimizing accuracy.
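
The arithmetic behind this choice can be checked by hand. The following is a minimal plain-Python sketch of the two steps for a single pixel value. The helper names `to_unit_range` and `normalize` are made up for illustration and are not torchvision functions.

```python
def to_unit_range(raw):
    """Scale an 8-bit pixel value to [0, 1], mirroring what ToTensor() does for uint8 images."""
    return raw / 255.0


def normalize(pixel, mean=0.5, std=0.5):
    """Apply (pixel - mean) / std, the per-channel formula used by Normalize."""
    return (pixel - mean) / std


# Darkest, mid-gray, and brightest raw values
for raw in (0, 128, 255):
    scaled = to_unit_range(raw)
    print(f"raw={raw:3d}  scaled={scaled:.4f}  normalized={normalize(scaled):+.4f}")
```

The two endpoints of [0, 1] land on -1 and +1, and mid-gray lands close to 0.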

In [2]:
# Apply transformations to the images
# 1. Convert images to PyTorch tensors
# 2. Normalize the pixel values to the range [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean = (0.5, 0.5, 0.5), std = (0.5, 0.5, 0.5))
])

# Load CIFAR-10 training and testing datasets
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Create DataLoader to load batches of training and test data
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
testloader = DataLoader(testset, batch_size=64, shuffle=False)


Files already downloaded and verified
Files already downloaded and verified


## Residual Blocks

Before ResNet, deep networks faced a critical challenge as they got deeper: gradients during backpropagation could become very small (or very large), making it hard to update the network weights properly. This phenomenon, called vanishing (or exploding) gradients, made deeper networks slow to train or caused them to perform worse.

ResNet solves this issue by introducing residual blocks that allow gradients to flow more easily through the network. In a typical deep network, layers learn the desired mapping $H(x)$ from the input $x$ to the output. ResNet, however, reformulates this as learning the residual of that mapping,

$$F(x) = H(x) - x$$

and then adds the input back to the output via a skip connection. In other words, it learns the difference between the desired output and the input, which simplifies the learning process.
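
One way to see why the skip connection helps gradients flow is to differentiate the block's output. This is a simplified sketch: it treats the sum before the final ReLU as the output $y$ and ignores batch-normalization details. The symbols $\mathcal{L}$ (the loss) and $I$ (the identity matrix) are introduced here for illustration.

$$
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x} = \frac{\partial F}{\partial x} + I,
\qquad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)
$$

Because of the $I$ term, part of the gradient reaches $x$ unchanged even when $\partial F / \partial x$ is very small.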

In [3]:
# Define a residual block, which contains two convolutional layers
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        
        # First convolution layer followed by batch normalization.
        # kernel_size sets the dimensions of the filter (kernel) that is applied to the input.
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)

        self.relu = nn.ReLU() 

        # Second convolution layer followed by batch normalization
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut: identity when shapes already match; otherwise a 1x1 convolution
        # (plus batch norm) that matches both the channel count and the stride
        self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            ) if stride != 1 or in_channels != out_channels else None

    def forward(self, x):
        # Keep the input for the skip (identity) connection
        identity = x

        # First convolution + batch norm + ReLU
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        # Second convolution + batch norm
        out = self.conv2(out)
        out = self.bn2(out)

        if self.shortcut is not None:
            identity = self.shortcut(x)

        # Add the original input (identity) to the output from the convolutions
        out += identity
        out = self.relu(out)

        return out


The shortcut connection bypasses the two convolutional layers and adds the block's input directly to their output.

### How the Shortcut Works

1. Identity Mapping:

The shortcut connection essentially performs an identity mapping, meaning it passes the input $x$ directly through to the output $y$.
This connection allows the block to represent an identity function if that is optimal (by driving $F(x)$ toward zero), effectively allowing it to "skip" the convolutions if needed.

2. Adding Outputs:

After computing the output of the convolutional layers $F(x)$, the original input $x$ is added to $F(x)$ to produce the output $y$ (followed, in this implementation, by a final ReLU).
The addition is element-wise, so $x$ and $F(x)$ must have exactly the same shape.

3. Handling Different Dimensions:

If the dimensions of $x$ and $F(x)$ do not match (e.g., due to changes in the number of channels or spatial dimensions), a convolution (or other transformation) can be applied to the input $x$ in the shortcut to ensure that the dimensions match before the addition.
This is typically done using a 1x1 convolution or a pooling layer. The ResidualBlock above uses a 1x1 convolution with the same stride as the first 3x3 convolution, followed by batch normalization.
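
To make the dimension matching concrete, the sketch below computes the spatial size that each path of the block produces. It uses the standard convolution output-size formula for dilation 1, and the helper names `conv_output_size` and `block_shapes` are hypothetical. The kernel sizes and paddings are the ones used in ResidualBlock, and the 32x32 input matches CIFAR-10 images.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def block_shapes(in_size, stride):
    """Return (main path size, shortcut size) for one residual block."""
    # Main path: 3x3 conv with the given stride, then 3x3 conv with stride 1
    main = conv_output_size(in_size, kernel_size=3, stride=stride, padding=1)
    main = conv_output_size(main, kernel_size=3, stride=1, padding=1)
    # Shortcut: 1x1 conv with the same stride and no padding
    shortcut = conv_output_size(in_size, kernel_size=1, stride=stride, padding=0)
    return main, shortcut


for stride in (1, 2):
    main, shortcut = block_shapes(32, stride)
    print(f"stride={stride}: main path {main}x{main}, shortcut {shortcut}x{shortcut}")
```

With either stride, the two paths end at the same spatial size, which is what makes the element-wise addition valid.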



In [4]:
# Define the full ResNet architecture
class ToyResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super().__init__()
        self.in_channels = 64  # Initial number of channels

        # Initial convolution layer (without residuals)
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        # Create a series of layers (stacks of residual blocks)
        # Each layer will have a different number of output channels and may downsample using stride 2
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1) 
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2) 
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2) 
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)

        # Fully connected layer to output 10 classes (for CIFAR-10)
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, block, out_channels, num_blocks, stride):
        # Create a layer with `num_blocks` residual blocks
        layers = []
        # The first block may change the number of channels or downsample (using stride)
        layers.append(block(self.in_channels, out_channels, stride))
        self.in_channels = out_channels  # Update input channels for the next block

        # Add additional residual blocks that do not change dimensions
        for _ in range(1, num_blocks):
            layers.append(block(out_channels, out_channels))

        return nn.Sequential(*layers)  # Return as a Sequential layer

    def forward(self, x):
        # Initial convolution, batch normalization, and ReLU
        out = torch.relu(self.bn1(self.conv1(x)))

        # Pass through the four layers (stacks of residual blocks)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)

        # Average pooling with a 4x4 window; for 32x32 inputs the feature map here is 4x4,
        # so this acts as global average pooling (spatial dimensions become 1x1)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)  # Flatten the tensor to feed into the fully connected layer

        # Output logits from fully connected layer
        out = self.fc(out)
        return out
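
The pooling window of 4 and the 512 input features of the final linear layer both depend on the input being 32x32. The sketch below traces the spatial size through the network in plain Python. The helper names `conv_output_size` and `trace_toy_resnet` are hypothetical, and the trace assumes the strides and channel counts defined in ToyResNet above.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Output size of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def trace_toy_resnet(input_size=32):
    """Return a list of (stage, channels, spatial size) tuples."""
    stages = []
    size = conv_output_size(input_size, kernel_size=3, stride=1, padding=1)
    stages.append(("conv1", 64, size))
    layers = [("layer1", 64, 1), ("layer2", 128, 2), ("layer3", 256, 2), ("layer4", 512, 2)]
    for name, channels, stride in layers:
        # Only the first 3x3 conv of each stack uses the stride;
        # the remaining 3x3, stride-1, padding-1 convs keep the size unchanged
        size = conv_output_size(size, kernel_size=3, stride=stride, padding=1)
        stages.append((name, channels, size))
    # F.avg_pool2d(out, 4): window 4, stride defaults to the window size
    size = conv_output_size(size, kernel_size=4, stride=4, padding=0)
    stages.append(("avg_pool", 512, size))
    return stages


for name, channels, size in trace_toy_resnet(32):
    print(f"{name:8s} -> {channels:3d} channels, {size}x{size}")
```

For a 32x32 input the pooled map is 1x1, so flattening gives exactly 512 features. For a different input size, such as 64x64, the pooled map would be larger than 1x1 and the flattened tensor would no longer fit nn.Linear(512, num_classes). One common way around this is nn.AdaptiveAvgPool2d(1), which is not used in this notebook.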


In [5]:
# Set up device (GPU if available), model, loss function, and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToyResNet(ResidualBlock, [2, 2, 2, 2]).to(device)  # Toy ResNet-18 with residual blocks

# Cross entropy loss for classification tasks
criterion = nn.CrossEntropyLoss()

# Adam optimizer for training
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [6]:
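# Function to train the model
def train_model(num_epochs):
    # Put the model in training mode so batch norm uses batch statistics,
    # even if model.eval() was called earlier
    model.train()
    for epoch in range(num_epochs):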
        running_loss = 0.0
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward pass and optimize
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        # Print epoch loss
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(trainloader):.4f}")


In [7]:
train_model(3)

Epoch [1/3], Loss: 1.2950
Epoch [2/3], Loss: 0.8078
Epoch [3/3], Loss: 0.5995


In [8]:
def accuracy_fn(y_true, y_pred):
    # Percentage of predictions in one batch that match the labels
    correct = torch.eq(y_true, y_pred).sum().item()
    acc = (correct / len(y_pred)) * 100
    return acc

In [11]:
# Switch to evaluation mode so batch norm uses its running statistics
model.eval()
with torch.inference_mode():
    test_loss = 0
    test_acc = 0
    for test_images, test_labels in testloader:
        test_images, test_labels = test_images.to(device), test_labels.to(device)

        test_outputs = model(test_images)

        test_loss += criterion(test_outputs, test_labels)
        # Predicted class = index of the largest logit
        test_outputs = test_outputs.argmax(dim=1)

        test_acc += accuracy_fn(test_labels, test_outputs)

# Both values are averages over batches, not over individual samples
print(f'Test Loss : {test_loss / len(testloader)}, Test Accuracy : {test_acc / len(testloader)}')




Test Loss : 0.714032769203186, Test Accuracy : 75.77627388535032
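
One caveat about the accuracy above: it is the mean of per-batch accuracies. With 10,000 test images and a batch size of 64, the last batch holds only 16 images but still counts as much as a full batch. The sketch below compares the two ways of averaging. The per-batch correct counts are hypothetical and chosen only to illustrate the effect; they are not results from this notebook.

```python
def mean_of_batch_accuracies(batches):
    """batches: list of (num_correct, batch_size) pairs. Returns a percentage."""
    return sum(100.0 * correct / size for correct, size in batches) / len(batches)


def overall_accuracy(batches):
    """Accuracy over all samples, so every image has equal weight."""
    total_correct = sum(correct for correct, _ in batches)
    total_samples = sum(size for _, size in batches)
    return 100.0 * total_correct / total_samples


# Hypothetical counts: 156 full batches at 48/64 correct, one final batch at 16/16
batches = [(48, 64)] * 156 + [(16, 16)]

print(f"Mean of batch accuracies: {mean_of_batch_accuracies(batches):.3f}")
print(f"Overall accuracy:         {overall_accuracy(batches):.3f}")
```

The gap is small when only one batch is short, but counting correct predictions and dividing by the dataset size avoids it entirely.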


In [13]:
test_outputs

tensor([7, 2, 8, 0, 8, 4, 7, 0, 3, 3, 3, 0, 4, 5, 1, 7])

In [14]:
test_labels

tensor([7, 5, 8, 0, 8, 2, 7, 0, 3, 5, 3, 8, 3, 5, 1, 7])