<a href="https://colab.research.google.com/github/shreyans-sureja/llm-101/blob/main/part12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Layer Normalization

A transformer block is a stack of the following components:

1. Layer Normalization
2. Masked multi-head attention
3. Dropout
4. Shortcut connection
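
As a rough illustration, the four components listed above could be wired together as in the following sketch. The class name `AttentionSubBlockSketch`, the sizes, and the ordering (normalization applied before attention) are assumptions made for this example, and PyTorch's built-in `nn.LayerNorm` and `nn.MultiheadAttention` stand in for custom implementations.

```python
import torch
import torch.nn as nn

class AttentionSubBlockSketch(nn.Module):
    """Hypothetical sketch: LayerNorm -> masked attention -> dropout -> shortcut."""

    def __init__(self, emb_dim=8, num_heads=2, drop_rate=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.drop = nn.Dropout(drop_rate)

    def forward(self, x):
        num_tokens = x.shape[1]
        # True above the diagonal marks future positions a token may not attend to
        causal_mask = torch.triu(
            torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm(x)                                  # 1. layer normalization
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask,
                                need_weights=False)       # 2. masked multi-head attention
        return x + self.drop(attn_out)                    # 3. dropout, 4. shortcut connection

torch.manual_seed(123)
block = AttentionSubBlockSketch()
block.eval()                       # disable dropout so the output is deterministic
x = torch.randn(2, 4, 8)           # [batch_size, num_tokens, emb_dim]
out = block(x)
print(out.shape)  # torch.Size([2, 4, 8])
```

The shortcut connection requires the output to have the same shape as the input, which is why the sketch keeps `emb_dim` unchanged throughout.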


### **Why layer normalization?**

- Training deep neural networks with many layers can be challenging due to vanishing/exploding gradients.
- This leads to unstable training dynamics.
- Layer normalization improves the stability and efficiency of neural network training.
- Main idea: adjust the outputs (activations) of a neural network layer so that they have a mean of zero and a variance of one.


In other words:

If a layer's outputs are too large or too small, the gradient magnitudes can also become too large or too small, which destabilizes training. Layer normalization keeps the gradients in a stable range.

As training proceeds, the distribution of the inputs to each layer changes (internal covariate shift), which delays convergence. Layer normalization reduces this effect.

In GPT-2 and modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module and before the final output layer.


#### Simple example

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(123)
batch_example = torch.randn(2,5)
print("batch_example: ", batch_example)
layer = nn.Sequential(nn.Linear(5,6), nn.ReLU())
out = layer(batch_example)
print("output: ", out)

batch_example:  tensor([[-0.1115,  0.1204, -0.3696, -0.2404, -1.1969],
        [ 0.2093, -0.9724, -0.7550,  0.3239, -0.1085]])
output:  tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)


The neural network layer we have coded consists of a Linear layer followed by a non-linear
activation function, ReLU (short for Rectified Linear Unit), which is a standard activation
function in neural networks.

If you are unfamiliar with ReLU, it simply thresholds negative
inputs to 0, ensuring that a layer outputs only non-negative values, which explains why the
resulting layer output does not contain any negative values.
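
The thresholding behavior can be checked on a few hand-picked numbers. This is a minimal sketch; the input values are arbitrary and chosen only for illustration.

```python
import torch

values = torch.tensor([-2.0, -0.5, 0.0, 1.5])
activated = torch.relu(values)  # negative entries become 0, the rest pass through

print(activated.tolist())  # [0.0, 0.0, 0.0, 1.5]
```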


Now we can apply layer normalization. First, we compute the mean and variance of each row of the layer output:

In [None]:
mean = out.mean(dim=-1, keepdim=True)
print("mean: ", mean)

var = out.var(dim=-1, keepdim=True)
print("var:", var)

mean:  tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
var: tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)


Using keepdim=True in operations like mean or variance calculation ensures that the
output tensor retains the same number of dimensions as the input tensor, even though the
operation reduces the tensor along the dimension specified via dim.

For instance, without
keepdim=True, the returned mean tensor would be a one-dimensional tensor with two elements, [0.1324,
0.2170], instead of a 2×1 matrix, [[0.1324], [0.2170]].
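
The shape difference, and why it matters for the subtraction step, can be seen in a small sketch. The tensor `x` below is a made-up example, not the layer output from above.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

mean_keep = x.mean(dim=-1, keepdim=True)  # reduced dimension kept with size 1
mean_drop = x.mean(dim=-1)                # reduced dimension removed

print(mean_keep.shape)  # torch.Size([2, 1])
print(mean_drop.shape)  # torch.Size([2])

# The [2, 1] shape broadcasts against [2, 3], so each row is centered by its own mean
centered = x - mean_keep
print(centered.tolist())  # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```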

For a 2D tensor (like a matrix), using dim=-1 for operations such as
mean or variance calculation is the same as using dim=1.

This is because -1 refers to the
tensor's last dimension, which corresponds to the columns in a 2D tensor.

Later, when
adding layer normalization to the GPT model, which produces 3D tensors with shape
[batch_size, num_tokens, embedding_size], we can still use dim=-1 for normalization
across the last dimension, avoiding a change from dim=1 to dim=2.
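
A short check of this on a 3D tensor is sketched below. The sizes are illustrative assumptions, not the GPT model's actual configuration.

```python
import torch

torch.manual_seed(123)
batch_size, num_tokens, embedding_size = 2, 4, 8  # illustrative sizes
x = torch.randn(batch_size, num_tokens, embedding_size)

mean_last = x.mean(dim=-1, keepdim=True)  # last dimension, whatever its index is
mean_dim2 = x.mean(dim=2, keepdim=True)   # the same dimension, addressed explicitly

print(mean_last.shape)  # torch.Size([2, 4, 1])
same = torch.equal(mean_last, mean_dim2)
print(same)  # True
```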


In [None]:
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)

print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)

Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[9.9341e-09],
        [1.9868e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
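
Values such as 9.9341e-09 in the output above are not exactly zero because floating-point arithmetic has finite precision. One way to make such output easier to read is to turn off scientific notation when printing, as in this sketch. The tensor here is freshly generated for illustration, so its values differ from the ones above.

```python
import torch

torch.manual_seed(123)
x = torch.randn(2, 6)
x_norm = (x - x.mean(dim=-1, keepdim=True)) / torch.sqrt(x.var(dim=-1, keepdim=True))
row_means = x_norm.mean(dim=-1, keepdim=True)

# Print small numbers as plain decimals instead of scientific notation
torch.set_printoptions(sci_mode=False)
print(row_means)

# The means are zero up to rounding error
is_zero = torch.allclose(row_means, torch.zeros_like(row_means), atol=1e-6)
print(is_zero)
```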


- This specific implementation of layer normalization operates on the last dimension of the
input tensor x, which represents the embedding dimension (emb_dim).

- The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization.

- The scale and shift are two trainable parameters (each with one value per embedding dimension) that the LLM automatically adjusts during training if doing so improves the model's performance on its training task.

- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing.
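
Written out for a single input vector $x$ with $n$ = emb_dim entries, the computation described in these points can be summarized as follows. The symbols $\gamma$ and $\beta$ are notation assumed here for the scale and shift parameters, and the variance is written with a division by $n$.

$$
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2, \qquad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
y_i = \gamma_i \, \hat{x}_i + \beta_i
$$

Since scale is initialized to ones and shift to zeros, $y_i = \hat{x}_i$ before any training has taken place.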

In [None]:
class LayerNorm(nn.Module):
  def __init__(self, emb_dim):
    super().__init__()
    self.eps = 1e-5  # small constant that prevents division by zero
    self.scale = nn.Parameter(torch.ones(emb_dim))   # trainable scale, initialized to 1
    self.shift = nn.Parameter(torch.zeros(emb_dim))  # trainable shift, initialized to 0

  def forward(self, x):
    # Normalize over the last (embedding) dimension
    mean = x.mean(dim=-1, keepdim=True)
    # unbiased=False divides by n (no Bessel's correction)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    norm_x = (x - mean) / torch.sqrt(var + self.eps)

    return self.scale * norm_x + self.shift

- In our variance calculation, we use unbiased=False, which means that we divide by the number of inputs n in the variance formula.

- This approach does not apply Bessel's correction, which uses n-1 instead of n in the denominator to adjust for bias in sample variance estimation.

- This decision results in a so-called biased estimate of the variance.

- For large language models (LLMs), where the embedding dimension n is large, the difference between using n and n-1 is practically negligible.

- We chose this approach to ensure compatibility with the GPT-2 model's normalization layers and because it reflects TensorFlow's default behavior, which was used to implement the original GPT-2 model.

In [None]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[-1.4901e-08],
        [ 2.3842e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)

The means are zero up to floating-point rounding (the exact near-zero digits can vary between platforms), and the variances are 1.
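
The effect of dividing by n versus n-1, discussed above, can be checked numerically. The sketch below compares the two variance estimates for a small and a large vector, and then compares a manual normalization that divides by n against PyTorch's built-in `torch.nn.functional.layer_norm`. The helper `variance_ratio` and the sizes 5 and 768 are illustrative choices made for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

def variance_ratio(n):
    """Ratio of the unbiased (n - 1) to the biased (n) variance estimate of n random values."""
    x = torch.randn(n)
    return (x.var(unbiased=True) / x.var(unbiased=False)).item()

ratio_small = variance_ratio(5)    # n / (n - 1) = 5 / 4
ratio_large = variance_ratio(768)  # 768 / 767, very close to 1
print(round(ratio_small, 4), round(ratio_large, 4))

# Manual normalization with the biased variance, compared against the built-in function
x = torch.randn(2, 5)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + 1e-5)
builtin = F.layer_norm(x, normalized_shape=(5,), eps=1e-5)

matches = torch.allclose(manual, builtin, atol=1e-5)
print(matches)
```

For n = 5 the two estimates differ by a factor of 5/4, while for n = 768 the factor is within about 0.2% of 1, which is why the choice hardly matters at typical embedding sizes.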


#### **Layer vs Batch Normalization**

**Layer** - Normalizes across the feature dimension of each sample, independently of the batch size.

**Batch** - Normalizes each feature across the batch dimension (and, for inputs such as images, the spatial dimensions).

The available hardware often dictates the batch size, so the statistics used by batch normalization depend on it, whereas layer normalization does not. This makes layer normalization more flexible and more stable for distributed training, especially in resource-constrained environments.
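
The difference in normalization axes can be illustrated with PyTorch's built-in `nn.LayerNorm` and `nn.BatchNorm1d` modules. This is a minimal sketch on a made-up 2D input, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
x = torch.randn(8, 4)  # 8 samples, 4 features (illustrative sizes)

layer_norm = nn.LayerNorm(4)    # normalizes each row (sample) over its features
batch_norm = nn.BatchNorm1d(4)  # in training mode, normalizes each column (feature) over the batch

out_ln = layer_norm(x)
out_bn = batch_norm(x)

row_means_ln = out_ln.mean(dim=1)  # close to 0 for every sample
col_means_bn = out_bn.mean(dim=0)  # close to 0 for every feature

# Layer normalization of one sample does not depend on the rest of the batch
single = layer_norm(x[:1])
independent = torch.allclose(single, out_ln[:1], atol=1e-6)
print(independent)
```

Because each sample is normalized on its own, the layer normalization result for a sample stays the same whether it is processed alone or as part of a larger batch.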