2. Normalization Layer

Normalization on its own forces each token's activations to zero mean and unit variance. Without a learnable scale and shift, the network would lose the ability to choose its own scale or offset, which would reduce its representational freedom.

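If the network decides normalization is harmful at some layer, it can learn parameters that undo it:

γ → √variance
β → mean

Then: y ≈ x

So LayerNorm does not force a fixed output distribution; it offers normalization as a default that the learned parameters can rescale or offset. Note that γ and β are learned once per feature, while the mean and variance are computed per token, so y ≈ x holds exactly only for inputs whose statistics match the learned values.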

The scale and shift are ordinary parameters, learned through backpropagation along with the other weights. This lets the model emphasize some features over others, and it means the activations are not forced to have zero mean.
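As a small check of the γ/β argument above, the sketch below normalizes one token by hand and then applies a scale of √variance and a shift equal to the mean. The tensor values and variable names are illustrative assumptions, not part of the original notebook.

```python
import torch

torch.manual_seed(0)
eps = 1e-5

# One hypothetical token with 8 features, deliberately not zero-mean / unit-variance
x = torch.randn(1, 8) * 3.0 + 5.0

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + eps)

# Default parameters (scale = 1, shift = 0): the output stays normalized
y_default = 1.0 * norm_x + 0.0

# Scale = sqrt(variance), shift = mean: the normalization is undone
y_identity = torch.sqrt(var + eps) * norm_x + mean

print(y_default.mean().item(), y_default.var(unbiased=False).item())
print(torch.allclose(y_identity, x, atol=1e-4))
```

In the real layer the scale and shift are fixed per-feature vectors, so they can only match the statistics of typical inputs, not those of every individual token.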

In [26]:
import torch
import torch.nn as nn

class LayerNorm(nn.Module):

    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

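    def forward(self, x):
        # Statistics are computed over the last (embedding) dimension, per token
        mean = x.mean(dim=-1, keepdim=True)
        # unbiased=False divides by n rather than n - 1
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        # eps guards against division by zero when the variance is tiny
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        # Learnable per-feature scale and shift
        return self.scale * norm_x + self.shift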
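One way to sanity-check the formula used in the class above is to compare it with PyTorch's built-in `torch.nn.LayerNorm`, which also uses the biased variance and a default `eps` of 1e-5. The batch shape and seed below are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
emb_dim = 6
x = torch.randn(2, 4, emb_dim)  # (batch, tokens, emb_dim), illustrative shape

# Manual computation, mirroring the forward pass above with scale = 1 and shift = 0
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + 1e-5)

# Built-in layer; a freshly constructed one has weight = 1 and bias = 0
builtin = nn.LayerNorm(emb_dim, eps=1e-5)(x)

max_diff = (manual - builtin).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

If the two agree up to floating-point noise, the hand-written layer is doing the same computation as the library one.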

GELU

In [27]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # Tanh approximation of GELU
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
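The class above uses the tanh approximation of GELU. A rough way to see how close it is to the exact (erf-based) version is to compare both against `torch.nn.functional.gelu`. This sketch assumes a PyTorch version where `gelu` accepts the `approximate` keyword (1.12 or later); the input range is an arbitrary choice.

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-5.0, 5.0, steps=1001)

# Tanh approximation, written out as in the GELU class above
manual_tanh = 0.5 * x * (1 + torch.tanh(
    math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
))

builtin_tanh = F.gelu(x, approximate="tanh")  # same approximation, built in
builtin_exact = F.gelu(x)                     # exact, erf-based GELU

diff_to_builtin = (manual_tanh - builtin_tanh).abs().max().item()
diff_to_exact = (manual_tanh - builtin_exact).abs().max().item()
print(f"vs built-in tanh: {diff_to_builtin:.2e}, vs exact: {diff_to_exact:.2e}")
```

The first difference should be floating-point noise only; the second shows the small error introduced by the approximation itself.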

GELU vs ReLU

**Point 1 — Smooth gradients:**
ReLU has a sharp corner at zero, so a tiny change in weights can suddenly flip a neuron's gradient from 1 to 0. This can make optimization noisy and unstable in very deep networks. GELU is smooth everywhere, so small weight changes lead to small, predictable gradient changes, which makes training with gradient descent more stable.

**Point 2 — No hard shut-off:**
ReLU outputs zero for every negative input, and its gradient there is zero too, so a neuron stuck in that region stops contributing and stops learning. GELU instead gives small, non-zero outputs and gradients for moderately negative inputs (both fade toward zero as the input becomes very negative), so neurons keep participating and learning, which improves gradient flow and overall training stability.
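Both points can be observed directly with autograd. The sketch below evaluates the gradient of each activation at a few hand-picked inputs (the sample points and the helper name `grad_of` are assumptions for illustration), using the built-in `torch.relu` and `torch.nn.functional.gelu`.

```python
import torch
import torch.nn.functional as F

# Sample points on both sides of zero, including two very close to it
points = [-2.0, -0.5, -1e-3, 1e-3, 0.5, 2.0]

def grad_of(activation):
    """Return d(activation)/dx at each sample point."""
    x = torch.tensor(points, requires_grad=True)
    activation(x).sum().backward()
    return x.grad

relu_grad = grad_of(torch.relu)
gelu_grad = grad_of(F.gelu)

print("ReLU gradients:", relu_grad)
print("GELU gradients:", gelu_grad)
```

The ReLU gradient jumps between the two points nearest zero and is exactly zero for all negative inputs, while the GELU gradient changes only slightly across zero and stays non-zero at the negative sample points.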


**FeedForward**

Feed-forward layers expand the hidden dimension (here by 4x) and then project back down, so that the network can:

1. learn richer representations through feature separation: features are easier to distinguish in the larger space
2. capture more patterns, since there are more neurons to express features and to mix them (the neurons can compute more combinations of the input features)
3. introduce non-linearity, which lets the network represent complex, non-linear patterns

This expand-then-contract pattern is used widely: transformers, vision transformers, MoE layers, and CNNs.

The underlying principle remains the same: give the network room to compute complex transformations, then distill back to what matters.

In [30]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(config["emb_dim"], 4*config["emb_dim"]),
            GELU(),
            nn.Linear(4*config["emb_dim"], config["emb_dim"])
        )

    def forward(self, x):
        return self.layers(x)

GPT_CONFIG = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 1024,   # Context length
    "emb_dim": 768,           # Embedding dimension
    "num_heads": 12,          # Number of attention heads
    "number_of_layers": 12,   # Number of transformer blocks
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-Key-Value bias
}
ff_model = FeedForward(GPT_CONFIG)
print(f"\nModel has {sum(p.numel() for p in ff_model.parameters()):,} parameters")


Model has 4,722,432 parameters
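To see where this number comes from, the sketch below rebuilds the two linear layers with built-in modules, traces the shapes through the expand and contract steps, and counts weights plus biases by hand. The input shape is an arbitrary assumption; `nn.functional.gelu` stands in for the custom GELU class.

```python
import torch
import torch.nn as nn

emb_dim = 768
expand = nn.Linear(emb_dim, 4 * emb_dim)    # 768 -> 3072
contract = nn.Linear(4 * emb_dim, emb_dim)  # 3072 -> 768

x = torch.randn(2, 3, emb_dim)  # (batch, tokens, emb_dim), illustrative
hidden = nn.functional.gelu(expand(x))
out = contract(hidden)
print(x.shape, hidden.shape, out.shape)

# Each linear layer contributes in_features * out_features weights plus out_features biases
expected = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
actual = sum(p.numel() for layer in (expand, contract) for p in layer.parameters())
print(f"{expected:,} expected, {actual:,} counted")
```

The output has the same shape as the input, which is what allows the block to be wrapped in a skip connection later.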



![](../images/feed-forward-1.png)

![](../images/feed-forward-2.png)

**Skip Connections**

Deep neural networks have two problems:
1. Updates to the later layers are not very meaningful. After so many random weight multiplications and activations, the original input has been scrambled into near-noise, so the last layers receive almost no signal related to the input.
2. Updates to the early layers are not very meaningful either, because the gradients are scrambled by just as many multiplications on the way back.

We would like to:
1. give the inputs a path to the later layers, so that they are still meaningful there
2. give the loss gradients a path to the early layers, so that their updates are more meaningful

Residual blocks: a collection of layers where data goes both through the layers and around them, using skip connections.
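The gradient side of this can be illustrated with a toy experiment: build the same deep stack of small linear layers twice, once with skip connections and once without, and compare the gradient that reaches the first layer. Everything here (the `DeepStack` class, width, depth, seed, and loss) is a hypothetical setup chosen for illustration, not part of the original notebook.

```python
import torch
import torch.nn as nn

class DeepStack(nn.Module):
    """A stack of Linear + GELU layers, optionally wrapped in skip connections."""

    def __init__(self, width, depth, use_skip):
        super().__init__()
        self.use_skip = use_skip
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)]
        )

    def forward(self, x):
        for layer in self.layers:
            out = layer(x)
            # With a skip connection the input also travels around the layer
            x = x + out if self.use_skip else out
        return x

def first_layer_grad(use_skip, width=16, depth=20):
    torch.manual_seed(0)  # same initial weights and data for both variants
    model = DeepStack(width, depth, use_skip)
    x = torch.randn(4, width)
    target = torch.ones(4, width)
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    # Mean absolute gradient of the very first weight matrix
    return model.layers[0][0].weight.grad.abs().mean().item()

grad_plain = first_layer_grad(use_skip=False)
grad_skip = first_layer_grad(use_skip=True)
print(f"without skip: {grad_plain:.3e}, with skip: {grad_skip:.3e}")
```

In this setup the first-layer gradient without skip connections should come out far smaller than with them, which is the vanishing-gradient problem described in point 2.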



**Making a full transformer block**

![T](../images/transformer-block-1.png)

In [33]:
import torch.nn as nn
from modules import MultiHeadAttention

class TransformerBlock(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.attention = MultiHeadAttention(
            d_in = config["emb_dim"],
            d_out = config["emb_dim"],
            context_length = config["context_length"],
            num_heads = config["num_heads"],
            dropout = config["drop_rate"],
            qkv_bias = config["qkv_bias"],
        )
        self.feedforward = FeedForward(config)
        self.norm1 = LayerNorm(config["emb_dim"])
        self.norm2 = LayerNorm(config["emb_dim"])
        self.drop_shortcut = nn.Dropout(config["drop_rate"])


    def forward(self, x):
        # Attention sub-layer: pre-norm, attention, dropout, then add the shortcut
        skip_connection = x
        x = self.norm1(x)
        x = self.attention(x)
        x = self.drop_shortcut(x)
        x = x + skip_connection

        # Feed-forward sub-layer follows the same pattern
        skip_connection = x
        x = self.norm2(x)
        x = self.feedforward(x)
        x = self.drop_shortcut(x)
        x = x + skip_connection

        return x

GPT_CONFIG = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 1024,   # Context length
    "emb_dim": 768,           # Embedding dimension
    "num_heads": 12,          # Number of attention heads
    "number_of_layers": 12,   # Number of transformer blocks
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-Key-Value bias
}

transformer = TransformerBlock(GPT_CONFIG)
print(transformer)
# Parameter count of the attention sub-layer
print(f"\nModel has {sum(p.numel() for p in transformer.attention.parameters()):,} parameters")
# Parameter count of the feed-forward sub-layer
print(f"\nModel has {sum(p.numel() for p in transformer.feedforward.parameters()):,} parameters")

TransformerBlock(
  (attention): MultiHeadAttention(
    (W_query): Linear(in_features=768, out_features=768, bias=False)
    (W_key): Linear(in_features=768, out_features=768, bias=False)
    (W_value): Linear(in_features=768, out_features=768, bias=False)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=768, bias=True)
  )
  (feedforward): FeedForward(
    (layers): Sequential(
      (0): Linear(in_features=768, out_features=3072, bias=True)
      (1): GELU()
      (2): Linear(in_features=3072, out_features=768, bias=True)
    )
  )
  (norm1): LayerNorm()
  (norm2): LayerNorm()
  (drop_shortcut): Dropout(p=0.1, inplace=False)
)

Model has 2,360,064 parameters

Model has 4,722,432 parameters
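The two counts above (attention first, then feed-forward) can be reproduced by hand from the layer shapes in the printed module. The sketch below does the arithmetic in plain Python; the per-block total at the end also includes the two LayerNorm modules, and it assumes the attention module holds no trainable parameters beyond the four linear layers shown.

```python
emb_dim = 768

# Attention: three bias-free projections (qkv_bias=False) plus out_proj with a bias
attention_params = 3 * emb_dim * emb_dim + (emb_dim * emb_dim + emb_dim)

# Feed-forward: 768 -> 3072 -> 768, both linear layers with a bias
hidden_dim = 4 * emb_dim
feedforward_params = (emb_dim * hidden_dim + hidden_dim) + (hidden_dim * emb_dim + emb_dim)

# Two LayerNorm modules, each with a scale vector and a shift vector
norm_params = 2 * (emb_dim + emb_dim)

block_total = attention_params + feedforward_params + norm_params
print(f"attention:    {attention_params:,}")
print(f"feed-forward: {feedforward_params:,}")
print(f"layer norms:  {norm_params:,}")
print(f"block total:  {block_total:,}")
```

The feed-forward sub-layer holds roughly twice as many parameters as attention, because of the 4x expansion.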
