
## Deep Dive into Residuals

**The Vanishing Gradient Problem**

Deep neural networks, while powerful, often face a challenge during training called the **vanishing gradient problem**. In backpropagation, gradients (error signals) are propagated backward through the network to update the weights. As the network gets deeper, these gradients can become increasingly small, hindering the learning process in earlier layers. This makes it difficult to train very deep networks effectively.
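To make the shrinking effect concrete, here is a minimal sketch in plain Python. It assumes a toy chain of scalar sigmoid layers with a fixed weight of 1 (the helper name `input_gradient` and these settings are illustrative choices, not part of the notebook). Because the sigmoid derivative is at most 0.25, the chain-rule product shrinks quickly with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, x=0.5, w=1.0):
    """d(output)/d(input) for a chain of `depth` scalar sigmoid layers (toy setup)."""
    grad = 1.0
    a = x
    for _ in range(depth):
        a = sigmoid(w * a)
        # Chain rule: multiply by w * sigmoid'(z), where sigmoid'(z) = a * (1 - a)
        grad *= w * a * (1.0 - a)
    return grad

for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  gradient={input_gradient(depth):.3e}")
```

Real networks have weight matrices and other activations, so the exact rate differs, but the multiplicative structure is the same.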

**Residuals as a Solution**

Residual connections, also called skip connections, are designed to address this problem. They add a block's input directly to its output, which gives gradients a direct path from later layers back to earlier layers and makes training deeper networks more feasible.

**Mathematical Formulation**

Let's break down the residual block's operation more formally:

* **Input:**  `x` (the input to the block)
* **Desired Output:**  `H(x)` (the ideal mapping the block should learn)
* **Residual Function:** `F(x)` (the function the block actually learns)
* **Output:** `y = F(x) + x` (the output of the residual block)

The key idea is that instead of directly learning the complex mapping `H(x)`, the residual block focuses on learning the *residual* mapping `F(x)`, which is the difference between the desired output and the input:

```
F(x) = H(x) - x
```

Therefore, the desired output can be expressed as:

```
H(x) = F(x) + x
```
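The formulation above can be sketched as a small wrapper module. This is only an illustration under stated assumptions: the class name `Residual` and the use of `nn.Linear` as `F` are hypothetical choices, and the wrapped module must preserve the input shape so the addition is valid.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a shape-preserving module `fn` so the block computes y = fn(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn  # plays the role of the residual function F(x)

    def forward(self, x):
        return self.fn(x) + x  # y = F(x) + x

torch.manual_seed(0)
block = Residual(nn.Linear(8, 8))  # hypothetical choice of F
x = torch.randn(4, 8)
y = block(x)

# Subtracting the input from the output recovers what F contributed.
print(torch.allclose(y - x, block.fn(x), atol=1e-5))
```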

**Advantages of Residuals**

1. **Gradient Flow:** During backpropagation, the gradient of the loss with respect to the input `x` is:

```
∂Loss/∂x = ∂Loss/∂y * (∂F(x)/∂x + 1)
```

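* The `+1` term comes from the skip path. Even if `∂F(x)/∂x` becomes very small, `∂Loss/∂x` stays close to `∂Loss/∂y` instead of shrinking toward zero. (For tensors, the `1` stands for the identity matrix.)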
* This allows gradients to flow more easily to earlier layers, facilitating training of deep networks.

2. **Identity Mapping:**

* If the residual function `F(x)` learns to output zero (or close to it), then the output `y` becomes approximately equal to the input `x`. This essentially creates an identity mapping, allowing the signal to pass through the block unchanged.
* This is beneficial because it provides the network with the flexibility to skip layers if they don't contribute significantly to the learning process.

3. **Improved Optimization:**

* Residual connections can make the optimization landscape smoother, helping the optimizer converge faster and potentially find better solutions.
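The gradient-flow point above can be checked with a scalar toy model in plain Python. The sketch assumes each layer uses `F(a) = w * tanh(a)` with a small weight `w = 0.1`; the function name `chain_gradient` and these values are illustrative assumptions. It multiplies the local derivatives through 10 layers, once without and once with the skip path.

```python
import math

def chain_gradient(depth, w=0.1, x=1.0, residual=False):
    """d(output)/d(input) for `depth` stacked scalar layers with F(a) = w * tanh(a)."""
    grad, a = 1.0, x
    for _ in range(depth):
        f = w * math.tanh(a)                  # F(a)
        dF = w * (1.0 - math.tanh(a) ** 2)    # dF/da
        if residual:
            a = a + f            # y = F(a) + a
            grad *= dF + 1.0     # local derivative is dF/da + 1
        else:
            a = f                # y = F(a)
            grad *= dF           # local derivative is dF/da only
    return grad

plain = chain_gradient(10)
skip = chain_gradient(10, residual=True)
print(f"plain: {plain:.3e}   residual: {skip:.3e}")
```

In the plain chain every factor is at most 0.1, so the product collapses; with the skip path every factor is at least 1, so the gradient cannot vanish in this toy setting.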

**Assumptions**

* The convolutional path represents the residual function `F(x)`.
* The skip connection directly carries the input `x` to the end of the block.
* The element-wise addition combines `F(x)` and `x` to produce the block output `y = F(x) + x`, to which the code then applies a final ReLU.

**Beyond the Basics**

* **Variations:** There are different types of residual blocks, such as bottleneck blocks and pre-activation blocks, each with its own advantages.
* **Impact:** Residual connections have been instrumental in enabling the training of extremely deep neural networks, leading to breakthroughs in various domains, including image recognition and natural language processing.
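As a rough illustration of one variation, the sketch below shows a pre-activation style block. It assumes the commonly described ordering (batch normalization and ReLU before each convolution, and no activation after the addition) and equal input and output channels; the class name `PreActBlock` is hypothetical and this is not the block used later in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation sketch: (BN -> ReLU -> Conv) twice, then add the input."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + x  # the skip path stays a pure identity

x = torch.randn(2, 16, 8, 8)
y = PreActBlock(16)(x)
print(y.shape)
```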




### Detailed Explanation of the Code:

1. **Initialization (`__init__` method)**:
   - **`conv1` and `conv2`**: These are the main convolutional layers in the block. The first layer might change the spatial dimensions depending on the stride, while the second layer maintains the spatial dimensions.
   - **`bn1` and `bn2`**: Batch normalization layers are applied after each convolution to stabilize the learning process.
   - **Skip Connection**: The skip connection allows the input to bypass the convolutional layers. If the input and output shapes differ (the stride is not 1 or the channel counts differ), the skip connection adjusts the input using a 1x1 convolution followed by batch normalization.

2. **Forward Pass (`forward` method)**:
   - **Input Tensor**: The input tensor's shape is printed before any processing.
   - **Skip Connection**: The skip connection tensor's shape is printed. This tensor is used to add to the convolutional output.
   - **Convolutional Layers**: After each convolutional layer, the tensor's shape is printed to show how the data flows through the network.
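   - **Difference Calculation**: The element-wise difference between the convolutional-path output and the skip tensor (`out - identity`) is calculated and printed. The print label calls it \( C(x) = f(x) - x \), where \( f(x) \) denotes the convolutional-path output. This is a diagnostic only; the residual \( F(x) \) itself is the convolutional-path output `out`.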
   - **Skip Connection Addition**: The skip connection is added back to the convolutional output, and the shape is printed.
   - **Final ReLU**: The final activation function (ReLU) is applied, and the resulting shape is printed.

3. **Example Usage**:
   - **Residual Block**: An instance of the `ResidualBlock` is created with 64 input channels, 128 output channels, and a stride of 2.
   - **Input Tensor**: A random input tensor is generated with a batch size of 32, 64 channels, and spatial dimensions of 56x56.
   - **Forward Pass**: The input tensor is passed through the residual block, and the output shape is printed.
   - **Loss Calculation**: A simple mean loss is computed from the output tensor.
   - **Backward Pass**: The gradients are computed by backpropagation, and the gradient of the loss with respect to the input tensor is printed to show that the error signal reaches the block's input. The gradient with respect to the output tensor is printed as well.
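The shape changes described above follow from the standard convolution output-size formula (for dilation 1): `floor((size + 2 * padding - kernel) / stride) + 1`. The helper `conv_out_size` below is a hypothetical name used only for this worked example, applied to the 56x56 input and the layer settings from the example usage.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1

h = 56  # input height (and width) in the example

after_conv1 = conv_out_size(h, kernel=3, stride=2, padding=1)            # first 3x3 conv, stride 2
after_conv2 = conv_out_size(after_conv1, kernel=3, stride=1, padding=1)  # second 3x3 conv, stride 1
after_skip = conv_out_size(h, kernel=1, stride=2, padding=0)             # 1x1 conv on the skip path

print(after_conv1, after_conv2, after_skip)  # 28 28 28
```

Both paths end at 28x28, which is why the element-wise addition is valid.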

### Running the Code:
When you run this code, the print statements trace the tensor shape at each step of the forward pass, then show the loss and the first few gradient values at the block's input and output after backpropagation. Together they show how data flows through both paths of the block and that gradients reach the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        # First convolutional layer (downsamples spatially when stride > 1):
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        # Batch normalization after the first convolution
        self.bn1 = nn.BatchNorm2d(out_channels)

        # Second convolutional layer (keeps the spatial dimensions):
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        # Batch normalization after the second convolution
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut (skip connection): an empty Sequential passes x through unchanged.
        self.skip = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            # 1x1 convolution + batch norm to match the output shape
            self.skip = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        print("Input Tensor Shape: ", x.shape)

        # Apply the skip connection (identity, or a projection if shapes differ).
        identity = self.skip(x)
        print("Skip Connection (Identity) Shape: ", identity.shape)

        # First layer: Conv1 -> BatchNorm1 -> ReLU
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        print("After Conv1 + BN + ReLU Shape: ", out.shape)

        # Second layer: Conv2 -> BatchNorm2 (no ReLU yet)
        out = self.conv2(out)
        out = self.bn2(out)
        print("After Conv2 + BN Shape: ", out.shape)

        # Diagnostic only: difference between the convolutional path output and
        # the skip tensor. The residual F(x) itself is `out`.
        difference = out - identity
        print("Difference C(x) = f(x) - x Shape: ", difference.shape)
        print("Difference C(x) = f(x) - x Values (first 5 elements): ", difference.view(-1)[:5])

        # Add the skip connection to the convolutional path output: y = F(x) + x
        out += identity
        print("After Adding Skip Connection Shape: ", out.shape)

        # Final ReLU activation applied after adding the skip connection.
        out = F.relu(out)
        print("After Final ReLU Shape: ", out.shape)

        return out

# Example usage:

# Create an instance of the ResidualBlock (64 -> 128 channels, stride 2):
res_block = ResidualBlock(64, 128, stride=2)

# Create a random input tensor that tracks gradients:
input_tensor = torch.randn(32, 64, 56, 56, requires_grad=True)

# Forward pass through the residual block
output_tensor = res_block(input_tensor)

# Keep the gradient of this non-leaf tensor so it can be inspected after backward()
output_tensor.retain_grad()

print("Output Tensor Shape: ", output_tensor.shape)

# Define a simple loss function (mean of all output elements):
loss = output_tensor.mean()
print("Loss: ", loss.item())

# Perform a backward pass to compute gradients
loss.backward()

# Print the gradient of the loss with respect to the input tensor:
print("Gradient of Input Tensor (first 5 elements): ", input_tensor.grad.view(-1)[:5])

# Print the gradient of the loss with respect to the output tensor:
print("Gradient of Output Tensor (first 5 elements): ", output_tensor.grad.view(-1)[:5])
```


```
Input Tensor Shape:  torch.Size([32, 64, 56, 56])
Skip Connection (Identity) Shape:  torch.Size([32, 128, 28, 28])
After Conv1 + BN + ReLU Shape:  torch.Size([32, 128, 28, 28])
After Conv2 + BN Shape:  torch.Size([32, 128, 28, 28])
Difference C(x) = f(x) - x Shape:  torch.Size([32, 128, 28, 28])
Difference C(x) = f(x) - x Values (first 5 elements):  tensor([-1.1270, -1.0176,  0.5778, -0.3388, -1.6565], grad_fn=<SliceBackward0>)
After Adding Skip Connection Shape:  torch.Size([32, 128, 28, 28])
After Final ReLU Shape:  torch.Size([32, 128, 28, 28])
Output Tensor Shape:  torch.Size([32, 128, 28, 28])
Loss:  0.5638895034790039
Gradient of Input Tensor (first 5 elements):  tensor([ 3.6980e-09, -7.7582e-09,  8.1402e-08, -1.0858e-07,  2.1012e-07])
Gradient of Output Tensor (first 5 elements):  tensor([3.1140e-07, 3.1140e-07, 3.1140e-07, 3.1140e-07, 3.1140e-07])
```
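The identical values in the output-tensor gradient follow directly from the mean loss: with N output elements, each one contributes 1/N to the loss, so its gradient is 1/N. A quick arithmetic check in plain Python, using the output shape printed above (the variable names are just for this example):

```python
# Output shape from the run above: [32, 128, 28, 28]
num_elements = 32 * 128 * 28 * 28

# d(mean)/d(element) is 1/N for every element
grad_per_element = 1.0 / num_elements

print(num_elements)               # 3211264
print(f"{grad_per_element:.4e}")  # 3.1140e-07
```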




## Breakdown of the Residual Block's Operation:

1. **Input \( x \)**:
   - This is the input tensor that is passed into the residual block. In the context of deep learning, this could be a feature map from a previous layer.

2. **Desired Output \( H(x) \)**:
   - This is the ideal mapping that we want the residual block to learn. For example, if we want the network to extract a certain feature from the input, \( H(x) \) represents this feature extraction.

3. **Residual Function \( F(x) \)**:
   - Instead of directly learning \( H(x) \), the residual block is designed to learn the residual function \( F(x) = H(x) - x \). The idea is that learning the difference (or residual) between the input and the desired output is often easier than learning the entire function \( H(x) \) directly.

4. **Output \( y \)**:
   - The output of the residual block is \( y = F(x) + x \). This is the result of adding the residual function \( F(x) \) to the original input \( x \), which effectively gives us \( H(x) \).

### Explanation in the Context of the Code:

Let’s relate this to the code:

- **Input \( x \)**: The input to the block is the tensor `input_tensor`, which could represent feature maps with shape `[batch_size, channels, height, width]`.

- **Desired Output \( H(x) \)**: In theory, this is what we want the block to learn. However, the block doesn’t directly learn \( H(x) \); it learns \( F(x) \).

- **Residual Function \( F(x) \)**:
  - In the code, \( F(x) \) is computed through two convolutional layers with batch normalization and ReLU activation:
    ```python
    out = self.conv1(x)
    out = self.bn1(out)
    out = F.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    ```
  - This represents the residual function \( F(x) \).

- **Output \( y = F(x) + x \)**:
  - After computing \( F(x) \), the code adds the original input \( x \) (or `identity`, which might be a transformed version of \( x \)) to the output of the residual function:
    ```python
    out += identity
    out = F.relu(out)
    ```
  - This addition is where the skip connection happens, resulting in \( y = F(x) + x \). The final ReLU then introduces non-linearity, so the block returns \( \mathrm{ReLU}(F(x) + x) \).

### Implementation in the Code:

```python
def forward(self, x):
    # x is the input to the block
    print("Input Tensor Shape: ", x.shape)

    # Skip connection (identity mapping)
    identity = self.skip(x)
    print("Skip Connection (Identity) Shape: ", identity.shape)
    
    # Compute the residual function F(x)
    out = self.conv1(x)
    out = self.bn1(out)
    out = F.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    print("Residual Function F(x) Shape: ", out.shape)
    
    # Add the skip connection: y = F(x) + x
    out += identity
    print("After Adding Skip Connection: y = F(x) + x Shape: ", out.shape)
    
    # Final ReLU activation
    out = F.relu(out)
    print("Final Output After ReLU Shape: ", out.shape)
    
    return out
```

### Example Workflow:

1. **Input Tensor \( x \)**: Suppose the input tensor `input_tensor` has a shape of `[32, 64, 56, 56]`, representing a batch of 32 feature maps, each with 64 channels and spatial dimensions 56x56.

2. **Skip Connection \( x \)**: The skip connection (`identity`) is either the original input or a transformed version (e.g., passed through a 1x1 convolution if dimensions don’t match).

3. **Residual Function \( F(x) \)**: The two convolutional layers transform \( x \) into a new tensor `out`, representing the residual function \( F(x) \).

4. **Output \( y = F(x) + x \)**: The output tensor is computed by adding the residual function \( F(x) \) to the skip connection (`identity`). If the convolutions contribute little (\( F(x) \approx 0 \)), the block approximately passes `identity` through. The additive path also gives gradients a direct route to earlier layers, which mitigates the vanishing gradient problem when training deep networks.
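The two cases for the skip connection in step 2 can be sketched in isolation. The helper `make_shortcut` is a hypothetical name; it mirrors the condition used in the block's `__init__`, except that it returns `nn.Identity()` where the notebook uses an empty `nn.Sequential()` (both pass the input through unchanged).

```python
import torch
import torch.nn as nn

def make_shortcut(in_channels, out_channels, stride):
    """Return the skip path: identity if shapes match, else a 1x1 projection."""
    if stride != 1 or in_channels != out_channels:
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )
    return nn.Identity()  # assumption: stands in for the empty nn.Sequential()

x = torch.randn(2, 64, 56, 56)
same = make_shortcut(64, 64, stride=1)    # shapes match: x passes through unchanged
proj = make_shortcut(64, 128, stride=2)   # shapes differ: x is projected

print(same(x).shape)  # torch.Size([2, 64, 56, 56])
print(proj(x).shape)  # torch.Size([2, 128, 28, 28])
```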

### Summary:

- The residual block simplifies the learning task by having the network focus on learning the residual function \( F(x) = H(x) - x \) rather than directly learning \( H(x) \).
- The output of the residual block, \( y = F(x) + x \), combines the learned residual with the original input, which allows the network to easily preserve useful information from the input.
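The point about preserving the input can be illustrated with a small experiment. The sketch below assumes a simplified block with equal input and output channels (the class name `TinyResidualBlock` is hypothetical) and sets the scale of the last batch normalization layer to zero, purely so that \( F(x) = 0 \) for demonstration. The block then returns `ReLU(x)`, meaning the input passes through the skip path untouched up to the final activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyResidualBlock(nn.Module):
    """Simplified residual block with an identity skip path (illustration only)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        # Assumption for the demo: zero the last BN scale so that F(x) = 0
        nn.init.zeros_(self.bn2.weight)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))  # residual function F(x), zero here
        return F.relu(out + x)           # y = ReLU(F(x) + x)

x = torch.randn(2, 8, 16, 16)
block = TinyResidualBlock(8)
y = block(x)

# With F(x) = 0 the block reduces to ReLU(x)
print(torch.allclose(y, F.relu(x)))
```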