<a href="https://colab.research.google.com/github/vijaygwu/IntroToDeepLearning/blob/main/LayerNormVs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## LayerNorm

Layer Normalization (LayerNorm) is a normalization technique widely used in deep learning, particularly in transformer models. Unlike Batch Normalization, which normalizes each feature across the batch dimension, LayerNorm normalizes across the feature dimension within each individual data point. Its statistics therefore do not depend on the other examples in the batch, which makes it well suited to recurrent neural networks and transformer architectures, where the batch size may be small or variable.

### How Layer Normalization Works:

1. **Input to a Layer:**
   Consider an input tensor \( x \) of shape \( [N, D] \), where:
   - \( N \) is the number of data points (batch size).
   - \( D \) is the number of features in each data point.

2. **Computing the Statistics:**
   For each data point, LayerNorm computes the mean \( \mu_i \) and variance \( \sigma^2_i \) over that data point's \( D \) features, independently of every other data point:
   \[
   \mu_i = \frac{1}{D} \sum_{j=1}^{D} x_{ij}
   \]
   \[
   \sigma^2_i = \frac{1}{D} \sum_{j=1}^{D} (x_{ij} - \mu_i)^2
   \]
   where \( i \) indexes the data points and \( j \) indexes the features.

3. **Normalizing, Scaling, and Shifting:**
   Using the mean and variance of each data point, LayerNorm normalizes the input by subtracting the mean and dividing by the standard deviation, with a small constant \( \epsilon \) added to the variance for numerical stability:
   \[
   \hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma^2_i + \epsilon}}
   \]
   Then, two learnable parameters, gamma \( \gamma \) and beta \( \beta \), are introduced to scale and shift the normalized output:
   \[
   y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j
   \]
   This lets the model learn a separate scale and offset for each feature, rather than being restricted to zero-mean, unit-variance outputs.
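
The three steps above can be sketched from scratch using only the Python standard library, so that each arithmetic step is visible. This is a minimal illustration, not PyTorch's implementation: the function name `layer_norm_rows` is hypothetical, \( \gamma \) and \( \beta \) are assumed to be fixed at 1 and 0, and \( \epsilon \) is assumed to be `1e-5`.

```python
import math

def layer_norm_rows(x, gamma, beta, eps=1e-5):
    """Normalize each row (data point) across its own features."""
    out = []
    for row in x:
        d = len(row)
        mean = sum(row) / d                           # mu_i
        var = sum((v - mean) ** 2 for v in row) / d   # sigma^2_i (divides by D)
        x_hat = [(v - mean) / math.sqrt(var + eps) for v in row]
        # Scale and shift each feature j with gamma_j and beta_j
        out.append([g * h + b for g, h, b in zip(gamma, x_hat, beta)])
    return out

x = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [3.0, 6.0, 9.0, 12.0]]

# Assumed values for illustration: identity scale, zero shift
gamma = [1.0, 1.0, 1.0, 1.0]
beta = [0.0, 0.0, 0.0, 0.0]

y = layer_norm_rows(x, gamma, beta)
for row in y:
    print([round(v, 4) for v in row])
# Each row prints [-1.3416, -0.4472, 0.4472, 1.3416]
```

All three rows normalize to the same values because each row is a scalar multiple of the first, and LayerNorm removes each row's own mean and scale.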

### Key Differences Between LayerNorm and BatchNorm:
- **BatchNorm** normalizes each feature across the batch dimension, so its statistics depend on the batch. With small batches these statistics become noisy estimates, which can cause problems during training.
- **LayerNorm** normalizes across the features, making it independent of the batch size and more suitable for sequential tasks or tasks with varying batch sizes.

### LayerNorm in PyTorch:
In PyTorch, you can use `torch.nn.LayerNorm` to apply Layer Normalization.

Here’s an example:

```python
import torch
import torch.nn as nn

# Example input tensor with batch size 3 and feature size 4
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [3.0, 6.0, 9.0, 12.0]])

# Define a LayerNorm with feature size 4
layer_norm = nn.LayerNorm(4)

# Apply LayerNorm
output = layer_norm(x)

# Print the input and output
print("Input:\n", x)
print("Output after LayerNorm:\n", output)
```

### Explanation of the code:
- We define an input tensor `x` with shape `[3, 4]` (3 data points, 4 features each).
- We initialize `nn.LayerNorm` with the size of the feature dimension, which is `4` in this case.
- We apply LayerNorm to the input tensor and print both the original input and the normalized output.
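
To see where \( \gamma \) and \( \beta \) live in the module, the short sketch below inspects a freshly constructed layer. It relies on `nn.LayerNorm` exposing them as the `weight` and `bias` parameters, which are initialized to ones and zeros by default. This is why the output of an untrained layer equals the plain normalized values.

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(4)

# gamma: learnable per-feature scale, initialized to ones
print("weight (gamma):", layer_norm.weight.data)
# beta: learnable per-feature shift, initialized to zeros
print("bias (beta):", layer_norm.bias.data)
# epsilon added to the variance for numerical stability
print("eps:", layer_norm.eps)
```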

Layer Normalization is particularly useful in transformer-based architectures like BERT or GPT, where each token's representation is normalized across its feature (hidden) dimension, independently of the other tokens and of the other sequences in the batch.
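
As a rough illustration of the transformer case, the sketch below applies `nn.LayerNorm` to a tensor of shape `[batch, sequence, features]`. The sizes and the random input are made up for the example. Since the layer is constructed with only the feature size, it normalizes over the last dimension, so every token vector is normalized on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration
batch_size, seq_len, d_model = 2, 3, 8
tokens = torch.randn(batch_size, seq_len, d_model)

# Normalizes over the last dimension (the features of each token)
layer_norm = nn.LayerNorm(d_model)
normalized = layer_norm(tokens)

# Statistics of each token vector after normalization
per_token_mean = normalized.mean(dim=-1)
per_token_var = normalized.var(dim=-1, unbiased=False)

print(normalized.shape)
print(torch.allclose(per_token_mean, torch.zeros(batch_size, seq_len), atol=1e-5))  # True
print(torch.allclose(per_token_var, torch.ones(batch_size, seq_len), atol=1e-3))    # True
```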



In [4]:
import torch
import torch.nn as nn

# Example input tensor with batch size 3 and feature size 4
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [3.0, 6.0, 9.0, 12.0]])

# Define a LayerNorm with feature size 4
layer_norm = nn.LayerNorm(4)

# Apply LayerNorm
outputLayer = layer_norm(x)

# Print the input and output
print("Input:\n", x)
print("Output after LayerNorm:\n", outputLayer)

Input:
 tensor([[ 1.,  2.,  3.,  4.],
        [ 2.,  4.,  6.,  8.],
        [ 3.,  6.,  9., 12.]])
Output after LayerNorm:
 tensor([[-1.3416, -0.4472,  0.4472,  1.3416],
        [-1.3416, -0.4472,  0.4472,  1.3416],
        [-1.3416, -0.4472,  0.4472,  1.3416]],
       grad_fn=<NativeLayerNormBackward0>)


## LayerNorm vs BatchNorm

Unlike LayerNorm, BatchNorm normalizes across the batch dimension. It calculates the mean and variance for each feature across the entire batch and then normalizes those feature values.



### Batch Normalization Example in PyTorch:

```python
import torch
import torch.nn as nn

# Example input tensor with batch size 3 and feature size 4
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [3.0, 6.0, 9.0, 12.0]])

# Define a BatchNorm layer for 4 features (each feature is normalized across the batch)
batch_norm = nn.BatchNorm1d(4)

# Apply BatchNorm
output = batch_norm(x)

# Print the input and output
print("Input:\n", x)
print("Output after BatchNorm:\n", output)
```

### Explanation of the Code:
- The input tensor `x` has a shape of `[3, 4]`, representing 3 samples (batch size of 3), each with 4 features.
- We pass `4` to `nn.BatchNorm1d` as the number of features. The "1d" refers to the input layout: `BatchNorm1d` expects inputs of shape `[N, C]` or `[N, C, L]`, and our `[3, 4]` tensor matches the first form.
- The Batch Normalization layer will calculate the mean and variance for each feature across the entire batch and normalize the values accordingly.
- Each column (feature) of the output tensor therefore has approximately zero mean and unit variance.
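
The per-feature computation described above can be sketched from scratch with only the Python standard library. This is an illustrative approximation of what a BatchNorm layer does in training mode, not PyTorch's implementation: the function name `batch_norm_columns` is hypothetical, the scale and shift parameters are left out (equivalent to 1 and 0), and \( \epsilon \) is assumed to be `1e-5`.

```python
import math

def batch_norm_columns(x, eps=1e-5):
    """Normalize each column (feature) across all rows in the batch."""
    n, d = len(x), len(x[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [x[i][j] for i in range(n)]
        mean = sum(column) / n                           # mean of feature j over the batch
        var = sum((v - mean) ** 2 for v in column) / n   # variance of feature j (divides by N)
        for i in range(n):
            out[i][j] = (x[i][j] - mean) / math.sqrt(var + eps)
    return out

x = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [3.0, 6.0, 9.0, 12.0]]

y = batch_norm_columns(x)
for row in y:
    print([round(v, 4) for v in row])
# [-1.2247, -1.2247, -1.2247, -1.2247]
# [0.0, 0.0, 0.0, 0.0]
# [1.2247, 1.2247, 1.2247, 1.2247]
```

The only structural change from a row-wise LayerNorm computation is the axis of the loop: statistics are gathered down each column rather than along each row.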

### Key Difference from LayerNorm:
- **BatchNorm** computes the mean and variance for each feature across the entire batch, which depends on the batch size. It works well when the batch size is reasonably large.
- **LayerNorm**, on the other hand, normalizes across the features within each data point, making it independent of the batch size.
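
The batch-size dependence can be checked directly. In the sketch below, the same two samples are normalized once as part of the full batch of three and once as a batch of two. The variable names are chosen for this example only, and both layers are left in their default training mode.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [3.0, 6.0, 9.0, 12.0]])

layer_norm = nn.LayerNorm(4)
batch_norm = nn.BatchNorm1d(4)  # training mode by default, so batch statistics are used

# LayerNorm: the first two rows get the same result with or without the third row
ln_full = layer_norm(x)
ln_small = layer_norm(x[:2])
print(torch.allclose(ln_full[:2], ln_small))  # True

# BatchNorm: removing the third row changes the per-feature mean and variance
bn_full = batch_norm(x)
bn_small = batch_norm(x[:2])
print(torch.allclose(bn_full[:2], bn_small))  # False
```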



In [5]:
import torch
import torch.nn as nn

# Example input tensor with batch size 3 and feature size 4
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [3.0, 6.0, 9.0, 12.0]])

# Define a BatchNorm layer for 4 features (each feature is normalized across the batch)
batch_norm = nn.BatchNorm1d(4)

# Apply BatchNorm
outputBatch = batch_norm(x)

# Print the input and output
print("Input:\n", x)
print("Output after BatchNorm:\n", outputBatch)

Input:
 tensor([[ 1.,  2.,  3.,  4.],
        [ 2.,  4.,  6.,  8.],
        [ 3.,  6.,  9., 12.]])
Output after BatchNorm:
 tensor([[-1.2247, -1.2247, -1.2247, -1.2247],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 1.2247,  1.2247,  1.2247,  1.2247]],
       grad_fn=<NativeBatchNormBackward0>)


In [10]:
print("Input:\n\n", x)
print("\nOutput after LayerNorm:\n\n", outputLayer)
print("\nOutput after BatchNorm:\n\n", outputBatch)

Input:

 tensor([[ 1.,  2.,  3.,  4.],
        [ 2.,  4.,  6.,  8.],
        [ 3.,  6.,  9., 12.]])

Output after LayerNorm:

 tensor([[-1.3416, -0.4472,  0.4472,  1.3416],
        [-1.3416, -0.4472,  0.4472,  1.3416],
        [-1.3416, -0.4472,  0.4472,  1.3416]],
       grad_fn=<NativeLayerNormBackward0>)

Output after BatchNorm:

 tensor([[-1.2247, -1.2247, -1.2247, -1.2247],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 1.2247,  1.2247,  1.2247,  1.2247]],
       grad_fn=<NativeBatchNormBackward0>)
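
One way to read the two outputs side by side is to check along which axis the values average to zero. The sketch below recomputes both outputs so that it is self-contained. The names `out_layer` and `out_batch` are chosen for this example.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [3.0, 6.0, 9.0, 12.0]])

out_layer = nn.LayerNorm(4)(x)
out_batch = nn.BatchNorm1d(4)(x)

# LayerNorm: each row (data point) has zero mean
layer_row_means = out_layer.mean(dim=1)
# BatchNorm: each column (feature) has zero mean
batch_col_means = out_batch.mean(dim=0)

print(torch.allclose(layer_row_means, torch.zeros(3), atol=1e-5))  # True
print(torch.allclose(batch_col_means, torch.zeros(4), atol=1e-5))  # True

# The opposite axes are not centered
print(torch.allclose(out_layer.mean(dim=0), torch.zeros(4), atol=1e-5))  # False
print(torch.allclose(out_batch.mean(dim=1), torch.zeros(3), atol=1e-5))  # False
```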
