### Layer normalization
Layer normalization rescales the outputs of a layer so that they have a mean of 0 and a variance of 1 across the embedding dimension. This helps with training stability. In GPT-2-style models it is applied inside each transformer block, before the attention and feed-forward sublayers, and once more to the final output before the token decoding.

But take all this with a pinch of salt. We want to use layer normalization (mean=0, variance=1) only if it really helps. What if it doesn't? That's the beauty of scale and shift below - these are trainable parameters that the model can update during training to adjust the layer norm output to best suit the training needs.

Kinda beautiful if you think about it. Fence sitting. But still beautiful.

In [12]:
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # embedded_dim is the size of the embedding dimension (the last dimension of x)
    def __init__(self, embedded_dim):
        super().__init__()
        self.epsilon = 1e-5  # added to the variance in forward() to avoid a division by 0
        self.scale = nn.Parameter(torch.ones(embedded_dim))
        self.shift = nn.Parameter(torch.zeros(embedded_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm = (x - mean)/torch.sqrt(var + self.epsilon)
        scaled_shifted_norm = (norm * self.scale) + self.shift
        return scaled_shifted_norm


batch = torch.randn(2, 5)
l = LayerNorm(batch.shape[-1])
out = l(batch)
out

torch.set_printoptions(sci_mode=False)
out.mean(dim=-1, keepdim=True), out.var(dim=-1, keepdim=True, unbiased=False)

(tensor([[    0.0000],
         [    0.0000]], grad_fn=<MeanBackward1>),
 tensor([[1.0000],
         [1.0000]], grad_fn=<VarBackward0>))
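
As a sanity check, the same normalization can be compared against PyTorch's built-in `torch.nn.LayerNorm`. The sketch below recomputes the formula by hand so that it runs on its own; the seed, the input shape, and the scale/shift values of 2.0 and 0.5 are arbitrary choices for illustration, not values a trained model would necessarily learn. It also shows how a non-default scale and shift move the output away from mean 0 and variance 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5)
eps = 1e-5

# Manual layer norm over the last dimension (biased variance, as in the class above)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + eps)

# Built-in layer norm; its weight and bias are initialized to ones and zeros
builtin = nn.LayerNorm(x.shape[-1], eps=eps)
matches_builtin = torch.allclose(manual, builtin(x), atol=1e-5)

# Made-up scale and shift values, standing in for learned parameters
scale, shift = 2.0, 0.5
adjusted = manual * scale + shift
adjusted_mean = adjusted.mean(dim=-1)
adjusted_var = adjusted.var(dim=-1, unbiased=False)

print(matches_builtin)
print(adjusted_mean, adjusted_var)
```

With these values the per-row mean becomes roughly the shift and the variance roughly the scale squared, which is the sense in which the model can "undo" the normalization if that suits training.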

### GELU activation function
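Gaussian error linear unit. The exact form is GELU(x) = x * Φ(x), where Φ is the standard normal CDF. The version used here is the common tanh approximation:

GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))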

Preferred over ReLU in GPT models. No elbow at zero. Smoooooth, with small negative outputs for negative inputs instead of a hard zero.


In [13]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0/torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
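
To check the formula, the tanh approximation can be compared with PyTorch's own implementations. This is a minimal sketch, assuming a PyTorch version in which `torch.nn.functional.gelu` accepts the `approximate="tanh"` argument (1.12 or later, as far as I know). The helper name `gelu_tanh` and the input grid are made up for illustration.

```python
import math
import torch
import torch.nn.functional as F

def gelu_tanh(x):
    # Tanh approximation of GELU, same formula as the class above
    return 0.5 * x * (1 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    ))

x = torch.linspace(-3.0, 3.0, 13)

# Should agree with PyTorch's built-in tanh approximation
same_as_builtin = torch.allclose(
    gelu_tanh(x), F.gelu(x, approximate="tanh"), atol=1e-6
)

# Largest gap between the approximation and the exact GELU on this grid
max_gap = (gelu_tanh(x) - F.gelu(x)).abs().max().item()

# Unlike ReLU, GELU is slightly negative for small negative inputs
neg_input = torch.tensor([-1.0])
gelu_at_neg = gelu_tanh(neg_input).item()
relu_at_neg = F.relu(neg_input).item()

print(same_as_builtin, max_gap)
print(gelu_at_neg, relu_at_neg)
```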

### Feed forward layer
This one only made sense to me after understanding the GPT model as a whole, having trained and used it. Think of it as an expansion into a much higher dimensional space than the embedding space (4x the embedding size here), which gives the model a chance to tease out connections there, and then of course a projection back to the embedding dimensions. The activation function in between is GELU. The code is pretty straightforward but the concept is rich.

In [14]:
class FeedFoward(nn.Module):
    # we are going to use a configuration dict here to avoid having to pass in
    # random-looking parameters
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"])
        )

    def forward(self, x):
        return self.layers(x)
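
The expand-then-project idea is easiest to see by tracing tensor shapes. The sketch below rebuilds the same three layers standalone, assuming an `emb_dim` of 768 purely for illustration and using PyTorch's built-in `nn.GELU(approximate="tanh")` as a stand-in for the `GELU` class defined earlier. The batch size and token count are also arbitrary.

```python
import torch
import torch.nn as nn

cfg = {"emb_dim": 768}  # assumed value for illustration

layers = nn.Sequential(
    nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),   # expand to 4x
    nn.GELU(approximate="tanh"),                     # stand-in for the GELU class above
    nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),   # project back
)

x = torch.rand(2, 3, cfg["emb_dim"])  # (batch, tokens, emb_dim)
hidden = layers[0](x)                 # shape after the expansion
out = layers(x)                       # shape after the full feed forward pass

# Weights and biases of both linear layers
n_params = sum(p.numel() for p in layers.parameters())

print(hidden.shape, out.shape, n_params)
```

The output has the same shape as the input, which is what lets the layer sit inside a transformer block without any reshaping around it.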