In [1]:
import torch
import torch.nn as nn



In [2]:
outputs = torch.randn(3,6)
outputs

tensor([[ 1.5778,  1.6956, -1.9398, -0.6370,  0.3499,  0.9727],
        [-0.9691, -1.4902,  0.8478, -1.4404,  0.0469, -0.9731],
        [ 1.0857,  0.8106, -0.1296,  0.9529, -0.5721, -1.0813]])

In [3]:
mean = outputs.mean( dim=-1, keepdim=True )
mean

tensor([[ 0.3365],
        [-0.6630],
        [ 0.1777]])

In [4]:
sd = outputs.std( dim=-1, keepdim=True)
sd

tensor([[1.4087],
        [0.9236],
        [0.9019]])

In [5]:
normalized_outputs = (outputs - mean)/sd
normalized_outputs

tensor([[ 0.8811,  0.9648, -1.6159, -0.6911,  0.0095,  0.4516],
        [-0.3314, -0.8956,  1.6359, -0.8417,  0.7686, -0.3357],
        [ 1.0067,  0.7017, -0.3407,  0.8594, -0.8313, -1.3959]])

In [6]:
normalized_outputs.mean( dim=-1, keepdim=True)

tensor([[-9.9341e-09],
        [ 9.9341e-09],
        [ 0.0000e+00]])

In [7]:
normalized_outputs.std( dim=-1, keepdim=True)

tensor([[1.],
        [1.],
        [1.]])

In [8]:
torch.set_printoptions( sci_mode=False )

In [None]:
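class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # small constant that keeps the division below well-defined
        # Learnable per-feature scale (starts at 1) and shift (starts at 0)
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        # Statistics over the last (feature) dimension of each token
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # divide by n, not n - 1
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift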




This cell defines the layer normalization class. It normalizes each token's embedding vector so that its features have mean 0 and variance 1, whatever scale they came in at.

The first chunk (`__init__`) sets up the parameters:
1. emb_dim: the embedding size (features per token)
2. eps: a small constant (here 1e-5, i.e. 0.00001) added to the variance so the division is still defined when a vector's variance is zero
3. scale and shift: a learnable weight and bias that stretch and shift the normalized values
    - scale: multiplies the normalized vector by a learnable weight, one per dimension (stretching/compressing that dimension after normalization); it starts at 1
    - shift: adds a learnable bias so the model can recenter the normalized output wherever it wants, not just at zero; it starts at 0
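
A minimal sketch of why eps matters and what scale and shift do at initialization. The constant input row and the standalone `scale`/`shift` tensors are made up for illustration; they mirror the class above but are not part of it.

```python
import torch
import torch.nn as nn

# A constant row has variance 0, so dividing by sqrt(var) alone gives 0/0.
x = torch.full((1, 4), 3.0)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

without_eps = (x - mean) / torch.sqrt(var)         # 0 / 0 -> nan
with_eps = (x - mean) / torch.sqrt(var + 1e-5)     # 0 / sqrt(1e-5) -> 0

# scale and shift as in the class: ones and zeros at initialization, so they
# leave the normalized values unchanged until training updates them.
scale = nn.Parameter(torch.ones(4))
shift = nn.Parameter(torch.zeros(4))
out = scale * with_eps + shift

print(without_eps)
print(with_eps)
```

Because both are `nn.Parameter`s, the optimizer can move them away from 1 and 0 during training if a different spread or center helps the model.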

* x = the tensor entering the layer, shaped (batch_size, seq_len, emb_dim); the features are the emb_dim values along the last dimension *

Quick note: context_len is the max number of tokens the model can consider at once, vocab_size is the total number of unique tokens, and seq_len is the actual number of tokens in a specific input sequence.


Second chunk (`forward`):
mean = the mean across the features (last dimension) of each token
var = the variance across those same features (unbiased=False, so it divides by n rather than n - 1)
norm_x = each embedding normalized to have mean 0 and variance 1
It then returns the normalized output, rescaled and shifted by the learnable scale and shift parameters.
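
The forward pass described above can be checked on a small batch. This is a sketch under assumptions: the input shape (2, 4, 8) and the seed are arbitrary, and the comparison against PyTorch's built-in `nn.LayerNorm` is only there as a sanity check, since it also uses the biased variance and an eps of 1e-5 by default.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

torch.manual_seed(0)
# (batch_size, seq_len, emb_dim), deliberately off-center and spread out
x = torch.randn(2, 4, 8) * 5 + 3

ln = LayerNorm(emb_dim=8)
out = ln(x)

# Per-token statistics after normalization
row_mean = out.mean(dim=-1)
row_var = out.var(dim=-1, unbiased=False)

# Built-in layer for comparison
reference = nn.LayerNorm(8, eps=1e-5)(x)

print(out.shape)
print(row_mean)
print(row_var)
```

Note that the earlier cells used `std()` with its default (unbiased) setting, while the class uses `unbiased=False`; for an emb_dim in the hundreds the difference is negligible.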

How this all fits in:

Layer normalization makes every token's feature vector in the (b, seq_len, emb_dim) tensor stable, learnably rescaled, and ready for multi-head attention (MHA), which computes the attention scores and weights used to create the context vectors.

MHA then takes these bad boys and computes dot products to work out which vectors attend to which.

What happens if this normalization isn't done:

Activations can blow up or shrink toward zero from layer to layer, gradients explode or vanish along with them, and training becomes chaos. Starting with a clean slate before anything else is necessary.
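
A toy sketch of the blow-up. Everything here is an assumption made for illustration: the depth of 10, the 64-dimensional vectors, and the unscaled random weight matrices (real models use careful initialization, so the effect is exaggerated). The `normalize` helper is the same math as the class above without the learnable scale and shift.

```python
import torch

torch.manual_seed(0)
emb_dim, depth = 64, 10
x = torch.randn(4, emb_dim)
weights = [torch.randn(emb_dim, emb_dim) for _ in range(depth)]

def normalize(h, eps=1e-5):
    # Layer normalization over the last dimension, no scale/shift
    mean = h.mean(dim=-1, keepdim=True)
    var = h.var(dim=-1, keepdim=True, unbiased=False)
    return (h - mean) / torch.sqrt(var + eps)

raw, normed = x, x
for w in weights:
    raw = raw @ w                     # no normalization between layers
    normed = normalize(normed @ w)    # normalize after every layer

raw_size = raw.abs().mean().item()
normed_size = normed.abs().mean().item()

print(f"without normalization: {raw_size:.3e}")
print(f"with normalization:    {normed_size:.3f}")
```

Without normalization the typical value grows by roughly a factor of sqrt(emb_dim) per layer in this setup, while the normalized path stays at a typical magnitude below 1 no matter how many layers are stacked.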

