# Lecture 5 Transformer - Layer Normalization

Layer Normalization is typically applied to the output of each sub-layer in the encoder. The code in this section implements it in NumPy and applies it inside a simplified encoder layer.

Layer Normalization normalizes the input across the features for each data point (rather than across the batch, as in Batch Normalization). It subtracts the mean and divides by the standard deviation, followed by scaling with a learned weight (gamma) and shifting with a learned bias (beta).
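To make the "across the features" versus "across the batch" distinction concrete, here is a minimal sketch that computes both kinds of statistics on a toy array. The values in `x` are made up for illustration, and the batch-norm side only shows which axis is reduced; it leaves out the learned parameters and running statistics a real Batch Normalization layer would have.

```python
import numpy as np

# Toy batch: 3 samples (rows), 4 features (columns). Values are illustrative only.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0],
              [-1.0, 0.0, 1.0, 2.0]])
eps = 1e-6

# Layer-norm style: one mean/std per sample (reduce over features, axis=-1).
ln_mean = x.mean(axis=-1, keepdims=True)   # shape (3, 1)
ln_std = x.std(axis=-1, keepdims=True)     # shape (3, 1)
x_ln = (x - ln_mean) / (ln_std + eps)

# Batch-norm style: one mean/std per feature (reduce over the batch, axis=0).
bn_mean = x.mean(axis=0, keepdims=True)    # shape (1, 4)
bn_std = x.std(axis=0, keepdims=True)      # shape (1, 4)
x_bn = (x - bn_mean) / (bn_std + eps)

print("row means after layer-norm style:   ", x_ln.mean(axis=-1))
print("column means after batch-norm style:", x_bn.mean(axis=0))
```

In this toy input the second row is ten times the first, and the layer-norm style output for those two rows is nearly identical, because each sample is rescaled by its own statistics regardless of what else is in the batch.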

import numpy as np

class LayerNorm:
    def __init__(self, d_model, eps=1e-6):
        """
        Initialize Layer Normalization parameters.
        
        Args:
            d_model: The dimension of the model (number of features).
            eps: A small constant for numerical stability (to avoid division by zero).
        """
        self.gamma = np.ones((d_model,))
        self.beta = np.zeros((d_model,))
        self.eps = eps

    def __call__(self, x):
        """
        Apply layer normalization to the input x.
        
        Args:
            x: Input matrix of shape (batch_size, d_model)
        
        Returns:
            Normalized input with the same shape as x.
        """
        mean = np.mean(x, axis=-1, keepdims=True)
        std = np.std(x, axis=-1, keepdims=True)
        # axis=-1 above refers to the last axis of the array, which here is the
        # d_model dimension (number of features). The mean and standard deviation
        # are therefore computed separately for each sample (each row of x),
        # across the elements of that row.
        x_normalized = (x - mean) / (std + self.eps)
        return self.gamma * x_normalized + self.beta

# Example usage of LayerNorm in a Transformer Encoder

class TransformerEncoderLayer:
    def __init__(self, d_model):
        """
        A simplified Transformer Encoder layer with layer normalization.
        
        Args:
            d_model: The dimension of the model.
        """
        self.d_model = d_model
        self.layer_norm = LayerNorm(d_model)
        
    def forward(self, x):
        """
        Forward pass through the encoder layer.
        
        Args:
            x: Input matrix of shape (batch_size, d_model)
        
        Returns:
            Output matrix of shape (batch_size, d_model)
        """
        # Sub-layers such as self-attention and feed-forward would go here (not implemented).
        
        # Apply layer normalization. With no sub-layers implemented, it is applied to x directly.
        x_norm = self.layer_norm(x)
        
        return x_norm

# Testing the encoder layer with layer normalization

if __name__ == "__main__":
    # Sample input (batch_size = 2, d_model = 4)
    x = np.array([[0.2, 0.8, 1.0, 1.2],
                  [1.0, 0.0, 0.5, 1.5]])

    encoder_layer = TransformerEncoderLayer(d_model=4)
    output = encoder_layer.forward(x)

    print("Input:")
    print(x)
    print("\n Output after Layer Normalization:")
    print(output)

Input:
[[0.2 0.8 1.  1.2]
 [1.  0.  0.5 1.5]]

 Output after Layer Normalization:
[[-1.60356317  0.          0.53452106  1.06904211]
 [ 0.4472128  -1.34163839 -0.4472128   1.34163839]]
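
The printed values can be checked by hand. For the first row, the mean is 0.8 and the variance is (0.36 + 0 + 0.04 + 0.16) / 4 = 0.14, so the standard deviation is about 0.3742 and the first element becomes (0.2 - 0.8) / 0.3742 ≈ -1.6036. The short sketch below repeats that arithmetic for both rows and confirms that each normalized row has roughly zero mean and unit standard deviation. It recomputes the formula directly rather than reusing the classes, and it relies on `np.std` using the population standard deviation (`ddof=0`) by default.

```python
import numpy as np

x = np.array([[0.2, 0.8, 1.0, 1.2],
              [1.0, 0.0, 0.5, 1.5]])
eps = 1e-6

# Per-row statistics over the feature axis.
mean = x.mean(axis=-1, keepdims=True)
std = x.std(axis=-1, keepdims=True)    # population std (ddof=0)
out = (x - mean) / (std + eps)         # gamma = 1 and beta = 0, so no scale or shift

print("row means:     ", mean.ravel())
print("row variances: ", (std ** 2).ravel())
print("output means:  ", out.mean(axis=-1))
print("output stds:   ", out.std(axis=-1))
```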


**Explanation:**

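1. LayerNorm class: Implements layer normalization using the formula:<br>
$\mathrm{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma + \epsilon} + \beta$

where:<br>
- $\mu$ is the mean of the input, computed over the features of each sample.
- $\sigma$ is the standard deviation over the same features.
- $\epsilon$ is a small constant that avoids division by zero. Here it is added to $\sigma$; some implementations instead add it to the variance inside the square root.
- $\gamma$ and $\beta$ are initialized to $\gamma = 1$ and $\beta = 0$. In this NumPy example they stay fixed; in a trained model they are learned parameters.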

2. TransformerEncoderLayer class: Represents a simplified encoder layer in the Transformer. Normally, it would have sub-layers like self-attention and feed-forward networks, but here we focus on layer normalization.

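3. Output: The code prints the input and its layer-normalized version. Because $\gamma = 1$ and $\beta = 0$, each output row has approximately zero mean and unit standard deviation.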
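
Item 2 notes that a full encoder layer would contain sub-layers. In the original Transformer, each sub-layer is wrapped as LayerNorm(x + Sublayer(x)): a residual connection followed by layer normalization. The sketch below illustrates that arrangement under some assumptions. `toy_feed_forward` is a hypothetical stand-in sub-layer with random, untrained weights, and `layer_norm` is a function form of the formula from item 1; neither is part of the lecture code.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize over the last axis, using the (std + eps) formula from item 1."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def toy_feed_forward(x, w1, w2):
    """Hypothetical stand-in sub-layer: linear -> ReLU -> linear (biases omitted)."""
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8                       # illustrative sizes
w1 = rng.normal(scale=0.1, size=(d_model, d_ff))
w2 = rng.normal(scale=0.1, size=(d_ff, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

x = np.array([[0.2, 0.8, 1.0, 1.2],
              [1.0, 0.0, 0.5, 1.5]])

# Add the residual connection, then normalize the sum.
out = layer_norm(x + toy_feed_forward(x, w1, w2), gamma, beta)
print(out.shape)
```

The output keeps the input shape `(batch_size, d_model)`, so layers built this way can be stacked.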