# Seminar 03 — Residual Blocks (Very Detailed)

A **residual block** is a neural network building block that learns a *residual function* — i.e., the difference between the desired output and the input — instead of learning the full transformation directly. Formally, if a block receives an input tensor **x**, the block learns a function **F(x)** and returns:

\[\text{output} = x + F(x)\]

This idea was popularized in **ResNet** (He et al., 2015). Residual connections are now a standard component in modern deep learning architectures (vision, NLP, diffusion models, etc.).

---

## 1) What is a residual block?

A residual block is a small sub-network with a **skip connection** (also called a shortcut or identity connection). Instead of outputting only `F(x)`, it outputs `x + F(x)` (or a projection `W_s x + F(x)` when dimensions differ).
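
As a minimal sketch of this definition, the helper below wraps an arbitrary function with an identity skip. The names `residual` and `F` are illustrative (not from any library), and NumPy arrays stand in for tensors.

```python
import numpy as np


def residual(F, x):
    """Return x + F(x); F is assumed to preserve the shape of x."""
    return x + F(x)


# A toy residual branch that applies a small correction to its input.
def F(x):
    return 0.1 * np.tanh(x)


x = np.array([1.0, -2.0, 3.0])
y = residual(F, x)
print(y - x)  # the part contributed by the residual branch
```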

### Intuition
* If learning the full mapping is hard, learning the *difference* from the identity can be easier.
* If the best solution is close to identity, the network can push `F(x)` toward **0**, so the block approximates the identity `x`.
* This eases optimization and reduces the risk of vanishing gradients in deep networks.

---

## 2) Why do we need residual blocks?

### 2.1) Optimization gets hard as depth grows
Deep networks can be difficult to train due to **vanishing/exploding gradients**. The gradient that backpropagates to early layers is a product of many per-layer factors, so as the number of layers increases it can shrink toward zero or blow up. Residual connections create *direct gradient paths* from deeper layers to earlier ones.
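
One way to see the direct gradient path: the Jacobian of `x + F(x)` is the identity plus the Jacobian of `F`, so the gradient always keeps a term that does not pass through the weights. The sketch below compares the end-to-end Jacobian of a plain stack and a residual stack. It assumes purely linear layers (no activations) with small random weights; `dim`, `depth`, and the weight scale are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 8, 50

# One small random weight matrix per layer (assumed scale, chosen for illustration).
weights = [0.1 * rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(depth)]

jac_plain = np.eye(dim)  # Jacobian of the stack h -> W h
jac_resid = np.eye(dim)  # Jacobian of the stack h -> h + W h
for W in weights:
    jac_plain = W @ jac_plain
    jac_resid = (np.eye(dim) + W) @ jac_resid

norm_plain = np.linalg.norm(jac_plain)
norm_resid = np.linalg.norm(jac_resid)
print(f"plain: {norm_plain:.3e}, residual: {norm_resid:.3e}")
```

With these settings the plain Jacobian collapses toward zero while the residual one stays at a usable scale. Larger weights would instead show the exploding case for the plain stack.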

### 2.2) Identity is a strong baseline
If a deeper network is not helpful, it should at least behave like a shallower one. With residual blocks, the network can easily learn the identity mapping by driving `F(x) → 0`. Without residual connections, an identity mapping might require carefully tuned weights.
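
A small sketch of this argument, assuming a two-layer residual branch whose last weight matrix is set to zero (the names `W1`, `W2`, and `F` are illustrative): with the skip, the block reproduces its input exactly; without it, the same layers output only zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # batch=4, features=8

# Residual branch F(x) = relu(x @ W1) @ W2 with the last layer zeroed out.
W1 = rng.normal(size=(8, 8))
W2 = np.zeros((8, 8))


def F(x):
    return np.maximum(0.0, x @ W1) @ W2


y_residual = x + F(x)  # skip connection present: behaves like the identity
y_plain = F(x)         # same layers without the skip: the input is lost
```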

### 2.3) Empirical gains
Residual blocks:
* Enable **much deeper networks**.
* Often improve **accuracy** and **generalization** compared with plain networks of the same depth.
* Make training more stable, especially with normalization layers.

---

## 3) How do we implement a residual block?

There are multiple standard designs. Below are the most common patterns and why you might choose them.

### 3.1) Basic residual block (same shape)
If input and output shapes are identical, use a simple **identity skip**:

\[\text{output} = x + F(x)\]

Example structure:
```
Conv -> ReLU -> Conv -> (Add x) -> ReLU
```
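
The structure above can be sketched in NumPy with a one-dimensional, single-channel convolution. This is a simplification for illustration: real blocks use multi-channel 2D convolutions and usually normalization, and the helper names `conv1d_same` and `basic_block` are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def conv1d_same(x, kernel):
    # np.convolve flips the kernel (true convolution); mode="same" keeps
    # the output as long as x when the kernel is shorter than the signal.
    return np.convolve(x, kernel, mode="same")


def basic_block(x, k1, k2):
    out = relu(conv1d_same(x, k1))  # Conv -> ReLU
    out = conv1d_same(out, k2)      # Conv
    return relu(out + x)            # (Add x) -> ReLU


rng = np.random.default_rng(0)
x = rng.normal(size=16)
k1 = 0.5 * rng.normal(size=3)
k2 = 0.5 * rng.normal(size=3)
y = basic_block(x, k1, k2)
print(y.shape)
```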

### 3.2) Projection residual block (different shape)
If the block changes the number of channels or spatial resolution, the skip path uses a **projection** (e.g., 1x1 convolution or linear layer) so shapes match:

\[\text{output} = W_s x + F(x)\]
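
To make the projection concrete, the sketch below treats a 1x1 convolution as a linear map over channels applied independently at every pixel. The shapes and the random `f_out` tensor (a stand-in for the output of `F(x)`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(8, 5, 5))       # (channels, height, width)
f_out = rng.normal(size=(16, 5, 5))  # stand-in for F(x), which has 16 channels

# 1x1 convolution: mix the 8 input channels into 16 output channels per pixel.
W_s = rng.normal(size=(16, 8)) / np.sqrt(8)
skip = np.einsum("oc,chw->ohw", W_s, x)

output = skip + f_out  # shapes now match, so the addition is valid
print(output.shape)
```

When the block also reduces spatial resolution, the projection is typically given a matching stride so that height and width agree as well.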

### 3.3) Pre-activation residual block
In ResNet v2 (the pre-activation variant, He et al., 2016), normalization and activation are applied **before** each weight layer:
```
BN -> ReLU -> Conv -> BN -> ReLU -> Conv -> (Add x)
```
Because nothing is applied after the addition, the skip path stays a pure identity, which can improve gradient flow and stability.
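
The sketch below contrasts the two orderings. It makes several simplifying assumptions: linear layers replace convolutions, and `normalize` is a plain per-feature standardization standing in for BN (no learned scale or shift, no running statistics). With a zeroed last layer, the pre-activation block returns its input unchanged, while the post-activation block still clips negative values through the final ReLU.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def normalize(x, eps=1e-5):
    # Stand-in for BN: standardize each feature over the batch.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def pre_act_block(x, W1, W2):
    out = relu(normalize(x)) @ W1    # BN -> ReLU -> layer
    out = relu(normalize(out)) @ W2  # BN -> ReLU -> layer
    return x + out                   # Add x, nothing applied afterwards


def post_act_block(x, W1, W2):
    out = relu(x @ W1) @ W2
    return relu(x + out)             # ReLU after the addition


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 8))
W_zero = np.zeros((8, 8))

y_pre = pre_act_block(x, W1, W_zero)
y_post = post_act_block(x, W1, W_zero)
```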

### 3.4) Bottleneck residual block
Used in deeper ResNets (e.g., ResNet-50/101). The idea is to reduce computation by running the expensive 3x3 convolution on fewer channels:
```
1x1 (reduce channels) -> 3x3 -> 1x1 (restore channels)
```
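
A rough weight count shows where the savings come from. The widths below (256 channels reduced to 64) are illustrative, and biases and normalization parameters are ignored.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weight count of a kernel x kernel convolution, ignoring biases."""
    return in_ch * out_ch * kernel * kernel


channels, reduced = 256, 64  # illustrative widths

# Two 3x3 convolutions at full width (basic-block style).
basic = 2 * conv_params(channels, channels, 3)

# 1x1 reduce -> 3x3 at reduced width -> 1x1 restore.
bottleneck = (
    conv_params(channels, reduced, 1)
    + conv_params(reduced, reduced, 3)
    + conv_params(reduced, channels, 1)
)

print(basic, bottleneck)  # 1179648 69632
```

Under these assumptions the bottleneck uses roughly 17 times fewer weights than two full-width 3x3 convolutions.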

### 3.5) Residual block variants
Residual connections appear in many architectures:
* **Transformer**: `x + Attention(x)` and `x + MLP(x)`.
* **U-Net**: residual blocks in encoder/decoder.
* **Diffusion models**: residual blocks with time embeddings.
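
The Transformer pattern can be sketched as two residual sub-layers in sequence. This is a stripped-down illustration, not a full Transformer layer: it assumes single-head attention and omits layer normalization, the output projection, and masking. The weight shapes and the hidden width of `4 * d` are illustrative choices.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def mlp(x, W1, W2):
    return np.maximum(0.0, x @ W1) @ W2


rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
W2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)

h = x + attention(x, Wq, Wk, Wv)  # x + Attention(x)
y = h + mlp(h, W1, W2)            # x + MLP(x), applied to the attention output
print(y.shape)
```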

---

## 4) How do we use residual blocks?

Residual blocks are reusable building blocks that you stack to assemble deeper networks. A typical model looks like:

```
Input -> ResidualBlock -> ResidualBlock -> ... -> Head -> Output
```

In practice:
* Use **identity skips** when input/output shapes match.
* Use **projection skips** when you change feature dimensions.
* Stack multiple blocks to increase depth without losing trainability.
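
The rules above can be sketched as a small planning helper that walks a list of feature widths and picks the skip type for each block. The function name `plan_skips` and the example widths are hypothetical.

```python
def plan_skips(widths: list[int]) -> list[str]:
    """For consecutive feature widths, decide which skip each block needs."""
    plan = []
    for in_features, out_features in zip(widths, widths[1:]):
        plan.append("identity" if in_features == out_features else "projection")
    return plan


# Hypothetical model: two blocks at width 8, one that widens to 16, one at width 16.
widths = [8, 8, 8, 16, 16]
plan = plan_skips(widths)
print(plan)  # ['identity', 'identity', 'projection', 'identity']
```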

---

## 5) Python 3.14 Example Code (Framework-agnostic + PyTorch-style)

Below is a fully self-contained example of a residual block using **NumPy**, followed by a PyTorch-style implementation. Both follow the same logic and illustrate how you would actually use a residual block in a model.


In [None]:
# Python 3.14 - Residual Block Example (NumPy)
# This is a minimal, educational example with simple linear layers.

from __future__ import annotations
from dataclasses import dataclass
from typing import Callable
import numpy as np

Array = np.ndarray


def relu(x: Array) -> Array:
    return np.maximum(0.0, x)


@dataclass
class Linear:
    weight: Array
    bias: Array

    @classmethod
    def init(cls, in_features: int, out_features: int, rng: np.random.Generator) -> "Linear":
        # He initialization for ReLU
        scale = np.sqrt(2.0 / in_features)
        weight = rng.normal(0.0, scale, size=(in_features, out_features))
        bias = np.zeros((out_features,), dtype=np.float64)
        return cls(weight=weight, bias=bias)

    def __call__(self, x: Array) -> Array:
        return x @ self.weight + self.bias


@dataclass
class ResidualBlock:
    # Two-layer residual block with optional projection for the skip path
    layer1: Linear
    layer2: Linear
    activation: Callable[[Array], Array] = relu
    projection: Linear | None = None

    def __call__(self, x: Array) -> Array:
        # Main path
        out = self.layer1(x)
        out = self.activation(out)
        out = self.layer2(out)

        # Skip path (identity or projection)
        skip = x if self.projection is None else self.projection(x)

        # Residual addition + final activation
        return self.activation(out + skip)


# Example usage
rng = np.random.default_rng(0)

# Case 1: Identity skip (same feature size)
block_same = ResidualBlock(
    layer1=Linear.init(8, 8, rng),
    layer2=Linear.init(8, 8, rng),
)

x = rng.normal(size=(4, 8))  # batch=4, features=8
out_same = block_same(x)
print("Identity-skip output shape:", out_same.shape)

# Case 2: Projection skip (different feature size)
block_proj = ResidualBlock(
    layer1=Linear.init(8, 16, rng),
    layer2=Linear.init(16, 16, rng),
    projection=Linear.init(8, 16, rng),
)

out_proj = block_proj(x)
print("Projection-skip output shape:", out_proj.shape)

# Stack blocks into a tiny model
blocks = [
    block_same,
    ResidualBlock(
        layer1=Linear.init(8, 8, rng),
        layer2=Linear.init(8, 8, rng),
    ),
]

h = x
for b in blocks:
    h = b(h)
print("Stacked output shape:", h.shape)


In [None]:
# Optional: PyTorch-style residual block (if torch is available)
# This mirrors real-world practice but keeps the example simple.

from __future__ import annotations

try:
    import torch
    import torch.nn as nn

    class ResidualBlockTorch(nn.Module):
        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            self.layer1 = nn.Linear(in_features, out_features)
            self.layer2 = nn.Linear(out_features, out_features)
            self.act = nn.ReLU()
            self.projection = None
            if in_features != out_features:
                self.projection = nn.Linear(in_features, out_features)

        def forward(self, x):
            out = self.layer1(x)
            out = self.act(out)
            out = self.layer2(out)
            skip = x if self.projection is None else self.projection(x)
            return self.act(out + skip)

    # Example usage
    x_t = torch.randn(4, 8)
    block_t = ResidualBlockTorch(8, 8)
    y_t = block_t(x_t)
    print("Torch output shape:", y_t.shape)
except ModuleNotFoundError:
    print("PyTorch not installed; skipping torch example.")
