# Transformer Components for Neural Networks

This notebook contains PyTorch examples demonstrating transformer components.

## Table of Contents
1. [Layer Normalization](#layer-normalization)
2. [Residual Connections](#residual-connections)

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

## Layer Normalization

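**Formula:** $\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$

Normalizes each sample's activations over the feature (last) dimension to zero mean and unit variance, then applies a learnable scale $\gamma$ and shift $\beta$. Here $\mu$ and $\sigma^2$ are the mean and biased variance over that dimension, and $\epsilon$ is a small constant for numerical stability.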

In [None]:
# Manual layer normalization without the learnable scale/shift (gamma=1, beta=0)
def manual_layer_norm(x, eps=1e-5):
    mean = x.mean(dim=-1, keepdim=True)
    # Biased variance (divide by N) with eps inside the square root,
    # matching torch.nn.functional.layer_norm
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

# Compare with PyTorch implementation
x = torch.randn(2, 5, 10)  # (batch, seq_len, d_model)
manual_norm = manual_layer_norm(x)
pytorch_norm = torch.nn.functional.layer_norm(x, x.shape[-1:])

print(f"Manual norm mean: {manual_norm.mean(dim=-1)}")  # Should be ~0
print(f"Manual norm std: {manual_norm.std(dim=-1, unbiased=False)}")  # Should be ~1
print(f"Difference: {torch.norm(manual_norm - pytorch_norm):.6f}")  # Should be ~0
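
The formula above includes a scale $\gamma$ and shift $\beta$ that the manual function leaves out. The cell below is a minimal sketch of the full affine version, checked against `torch.nn.LayerNorm`. The names `layer_norm_affine` and `x_demo` are illustrative, and the random values written into `ln.weight` and `ln.bias` are only there to make the comparison non-trivial.

```python
import torch

def layer_norm_affine(x, gamma, beta, eps=1e-5):
    """Layer norm over the last dimension with a scale (gamma) and shift (beta)."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return x_hat * gamma + beta

torch.manual_seed(0)
d_model = 10
x_demo = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)

# Reference module: its weight plays the role of gamma, its bias the role of beta
ln = torch.nn.LayerNorm(d_model)
with torch.no_grad():
    # Overwrite the default gamma=1, beta=0 so the affine part is actually exercised
    ln.weight.uniform_(0.5, 1.5)
    ln.bias.uniform_(-0.5, 0.5)

manual_out = layer_norm_affine(x_demo, ln.weight, ln.bias, eps=ln.eps)
module_out = ln(x_demo)
max_diff = (manual_out - module_out).abs().max().item()
print(f"Max abs difference: {max_diff:.2e}")  # Should be ~0
```

Both parameters have shape `(d_model,)`, so they broadcast over the batch and sequence dimensions.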

## Residual Connections

**Formula:** $\mathbf{h}_{l+1} = \mathbf{h}_l + F(\mathbf{h}_l)$

The identity path lets gradients flow directly from later layers to earlier ones, which helps mitigate vanishing gradients in deep networks.
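
To see why the skip path helps, differentiate the formula above with respect to $\mathbf{h}_l$. This is a standard chain-rule sketch, not a derivation from the original notebook: $\mathcal{L}$ (the loss) and $L$ (the index of the final layer) are notation introduced here, and $F$ stands for each layer's own transformation.

$$\frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l} = \mathbf{I} + \frac{\partial F(\mathbf{h}_l)}{\partial \mathbf{h}_l}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \prod_{k=l}^{L-1} \left( \mathbf{I} + \frac{\partial F(\mathbf{h}_k)}{\partial \mathbf{h}_k} \right)$$

The factors are ordered from $k = L-1$ down to $k = l$. Expanding the product gives an identity term plus terms involving the Jacobians of $F$, so the gradient at layer $l$ always contains a copy of $\partial \mathcal{L} / \partial \mathbf{h}_L$ that has not passed through any weight matrix.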

In [None]:
class ResidualBlock(torch.nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.linear1 = torch.nn.Linear(d_model, d_model)
        self.linear2 = torch.nn.Linear(d_model, d_model)
        self.activation = torch.nn.ReLU()
    
    def forward(self, x):
        residual = x
        out = self.activation(self.linear1(x))
        out = self.linear2(out)
        return out + residual  # Residual connection

# Demonstrate gradient flow
deep_net = torch.nn.Sequential(*[ResidualBlock(64) for _ in range(20)])
x = torch.randn(1, 64, requires_grad=True)
output = deep_net(x)
loss = output.sum()
loss.backward()

print(f"Input gradient norm: {torch.norm(x.grad):.3f}")
print("A non-vanishing input gradient suggests the signal propagates back through all 20 blocks")
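
The gradient norm above is easier to interpret next to a baseline. The cell below is a rough sketch that builds the same 20-block stack with and without the skip connection and compares the gradient norm at the input. `PlainBlock`, `SkipBlock`, and `input_grad_norm` are hypothetical helpers introduced for this comparison, and it relies on PyTorch's default `Linear` initialization; other initializations may behave differently.

```python
import torch

class PlainBlock(torch.nn.Module):
    """Same layers as ResidualBlock, but without the skip connection."""
    def __init__(self, d_model):
        super().__init__()
        self.linear1 = torch.nn.Linear(d_model, d_model)
        self.linear2 = torch.nn.Linear(d_model, d_model)
        self.activation = torch.nn.ReLU()

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))

class SkipBlock(PlainBlock):
    """PlainBlock plus the identity path: h + F(h)."""
    def forward(self, x):
        return x + super().forward(x)

def input_grad_norm(block_cls, depth=20, d_model=64, seed=0):
    """Stack `depth` blocks and return the gradient norm at the input."""
    torch.manual_seed(seed)  # same seed for both variants
    net = torch.nn.Sequential(*[block_cls(d_model) for _ in range(depth)])
    x = torch.randn(1, d_model, requires_grad=True)
    net(x).sum().backward()
    return torch.norm(x.grad).item()

plain_norm = input_grad_norm(PlainBlock)
skip_norm = input_grad_norm(SkipBlock)
print(f"Input gradient norm without residuals: {plain_norm:.3e}")
print(f"Input gradient norm with residuals:    {skip_norm:.3e}")
```

This measures gradients only at initialization, so it illustrates gradient flow and is not evidence about final training quality.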