Chapter 4: Inside a Transformer
=====

## Understanding LayerNorm

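$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}}\,\gamma_i + \beta_i$$

Here $\mu$ and $\sigma^2$ are the mean and (biased) variance of $x$ over the embedding dimension, and $\epsilon$ is a small constant for numerical stability.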

* In the context of Transformers:
  * $\gamma$ is initialized to a vector of ones, $\mathbf{1}$
  * $\beta$ is initialized to a vector of zeros, $\mathbf{0}$
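
The formula and the initialization above can be sketched as a small module. This is a minimal illustration rather than PyTorch's actual implementation: the class name `SimpleLayerNorm` and its attribute names `gamma` and `beta` are hypothetical, and the default `eps=1e-5` is assumed to match the default of `nn.LayerNorm`.

```python
import torch
import torch.nn as nn


class SimpleLayerNorm(nn.Module):
    """Hypothetical from-scratch LayerNorm over the last dimension."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # gamma starts as a vector of ones, beta as a vector of zeros
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # statistics are computed per token, over the embedding dimension
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch, sequence, embedding)
simple_ln = SimpleLayerNorm(8)
reference_ln = nn.LayerNorm(8)

# at initialization the sketch should agree with the built-in layer
matches_reference = torch.allclose(simple_ln(x), reference_ln(x), atol=1e-4)
print("Matches nn.LayerNorm at initialization?", matches_reference)
```

Because $\gamma = \mathbf{1}$ and $\beta = \mathbf{0}$ at initialization, the layer starts out as a pure normalization; the affine part only takes effect once training updates these parameters.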

In [1]:
import torch
import torch.nn as nn
import numpy as np

# set the random seed for reproducibility
torch.manual_seed(0)

<torch._C.Generator at 0x12244cb10>

In [2]:
# setting the parameters
B = 5
seq_length = 10
embedding_dim = 8

# create a random tensor of shape (B, seq_length, embedding_dim)
X = torch.randn(B, seq_length, embedding_dim)
print(X.shape)

torch.Size([5, 10, 8])


In [3]:
# Build a layernorm
layer_norm = nn.LayerNorm(embedding_dim)

# print the shape of layer_norm parameters
print("LN's gamma:", layer_norm.weight.shape)
print("LN's beta:", layer_norm.bias.shape)
print("\nInitial values:")
print("gamma:", layer_norm.weight.detach().numpy())
print("beta:", layer_norm.bias.detach().numpy())

LN's gamma: torch.Size([8])
LN's beta: torch.Size([8])

Initial values:
gamma: [1. 1. 1. 1. 1. 1. 1. 1.]
beta: [0. 0. 0. 0. 0. 0. 0. 0.]


In [4]:
# apply layernorm to X
Y = layer_norm(X)
print(Y.shape)

torch.Size([5, 10, 8])


In [5]:
# check mean and variance after applying layernorm
mean = Y.mean(dim=-1, keepdim=True)
print("Mean shape:", mean.shape)
print("Mean close to zero?", np.allclose(mean.detach().numpy(), 0, atol=1e-3))

var = Y.var(dim=-1, keepdim=True, unbiased=False)
print("Variance shape:", var.shape)
print("Variance close to one?", np.allclose(var.detach().numpy(), 1, atol=1e-3))

Mean shape: torch.Size([5, 10, 1])
Mean close to zero? True
Variance shape: torch.Size([5, 10, 1])
Variance close to one? True
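
The zero mean and unit variance seen above hold only because $\gamma$ and $\beta$ are still at their initial values. As a rough sketch of what changes once they move, the example below overwrites them with arbitrary constants (2.0 and 0.5, chosen purely for illustration) on a separate layer. The names `x_demo` and `ln_demo` are hypothetical and independent of the cells above.

```python
import torch
import torch.nn as nn

# local generator so the global random state is left untouched
g = torch.Generator().manual_seed(0)
x_demo = torch.randn(5, 10, 8, generator=g)

ln_demo = nn.LayerNorm(8)
with torch.no_grad():
    ln_demo.weight.fill_(2.0)  # gamma = 2 for every feature (illustrative value)
    ln_demo.bias.fill_(0.5)    # beta = 0.5 for every feature (illustrative value)

y_demo = ln_demo(x_demo).detach()
demo_mean = y_demo.mean(dim=-1)
demo_var = y_demo.var(dim=-1, unbiased=False)

# with a constant gamma and beta, each token has mean beta and variance gamma^2
mean_is_beta = torch.allclose(demo_mean, torch.full_like(demo_mean, 0.5), atol=1e-3)
var_is_gamma_sq = torch.allclose(demo_var, torch.full_like(demo_var, 4.0), atol=1e-2)
print("Mean close to 0.5?", mean_is_beta)
print("Variance close to 4?", var_is_gamma_sq)
```

With a constant $\gamma$ and $\beta$ the output statistics shift in a simple way. In a trained model the two vectors generally differ per feature, so the per-token mean and variance need not take any particular value.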


### Hands-on computation

In [6]:
# compute mean and variance manually
manual_mean = X.mean(dim=-1, keepdim=True)
manual_var = X.var(dim=-1, keepdim=True, unbiased=False)
print("Manual mean shape:", manual_mean.shape)
print("Manual variance shape:", manual_var.shape)

# apply layernorm manually, reusing the layer's own epsilon (nn.LayerNorm defaults to 1e-5)
Y_manual = (X - manual_mean) / torch.sqrt(manual_var + layer_norm.eps)
Y_manual = layer_norm.weight * Y_manual + layer_norm.bias

Y_array = Y.detach().numpy()
Y_manual_array = Y_manual.detach().numpy()
# compare the results
print(
    "All close?",
    np.allclose(Y_array, Y_manual_array, atol=1e-3)
)

Manual mean shape: torch.Size([5, 10, 1])
Manual variance shape: torch.Size([5, 10, 1])
All close? True
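
The manual computation passes `unbiased=False` when taking the variance. The sketch below suggests why that choice matters: it compares both variance estimators against `torch.nn.functional.layer_norm`, used here without affine parameters. The helper `normalize` and the tensor `x_demo` are hypothetical names introduced only for this comparison.

```python
import torch
import torch.nn.functional as F

# local generator so the global random state is left untouched
g = torch.Generator().manual_seed(0)
x_demo = torch.randn(5, 10, 8, generator=g)
eps = 1e-5

# reference result without gamma/beta
reference = F.layer_norm(x_demo, normalized_shape=(8,), eps=eps)

mu = x_demo.mean(dim=-1, keepdim=True)


def normalize(var: torch.Tensor) -> torch.Tensor:
    """Normalize x_demo with the given per-token variance."""
    return (x_demo - mu) / torch.sqrt(var + eps)


# biased estimator divides by D, unbiased divides by D - 1
biased = normalize(x_demo.var(dim=-1, keepdim=True, unbiased=False))
unbiased = normalize(x_demo.var(dim=-1, keepdim=True, unbiased=True))

biased_matches = torch.allclose(reference, biased, atol=1e-4)
unbiased_matches = torch.allclose(reference, unbiased, atol=1e-4)
print("Biased variance matches?", biased_matches)
print("Unbiased variance matches?", unbiased_matches)
```

With an embedding dimension of 8, the unbiased estimator inflates the variance by a factor of 8/7, which shrinks the normalized values by roughly 6%. That gap is far larger than floating-point noise, so only the biased version should reproduce the reference.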
