# 04 - Attention Mechanisms & Transformers

This notebook covers attention mechanisms and the transformer architecture that revolutionized NLP.

## Topics Covered:
- Attention concept and mechanisms
- Self-attention and cross-attention
- Transformer architecture components
- Multi-head attention
- Positional encoding
- Layer normalization and residual connections

In [3]:
pip install numpy


In [5]:
pip install matplotlib


In [6]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, Optional
import math

np.random.seed(42)

## 1. Attention Mechanisms

In [9]:
class AttentionMechanisms:
    """Implementation of various attention mechanisms."""
    
    @staticmethod
    def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, 
                                   mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, np.ndarray]:
        """Scaled dot-product attention."""
        d_k = Q.shape[-1]
        
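        # Compute attention scores. Swap the last two axes of K with np.swapaxes;
        # ndarray.transpose(-2, -1) raises "axes don't match array" on 3-D or 4-D input.
        scores = Q @ np.swapaxes(K, -2, -1) / math.sqrt(d_k)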
        
        # Apply mask if provided
        if mask is not None:
            scores = np.where(mask == 0, -1e9, scores)
        
        # Softmax to get attention weights
        attention_weights = AttentionMechanisms.softmax(scores)
        
        # Apply attention to values
        output = attention_weights @ V
        
        return output, attention_weights
    
    @staticmethod
    def softmax(x: np.ndarray) -> np.ndarray:
        """Numerically stable softmax."""
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
    
    @staticmethod
    def additive_attention(query: np.ndarray, keys: np.ndarray, values: np.ndarray,
                          W_q: np.ndarray, W_k: np.ndarray, v: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """Additive (Bahdanau) attention."""
        # Transform query and keys
        query_proj = query @ W_q  # (batch, d_model) @ (d_model, d_att) -> (batch, d_att)
        keys_proj = keys @ W_k    # (batch, seq_len, d_model) @ (d_model, d_att) -> (batch, seq_len, d_att)
        
        # Add query to each key position
        combined = np.tanh(query_proj[:, None, :] + keys_proj)  # (batch, seq_len, d_att)
        
        # Compute attention scores
        scores = combined @ v  # (batch, seq_len, d_att) @ (d_att, 1) -> (batch, seq_len, 1)
        scores = scores.squeeze(-1)  # (batch, seq_len)
        
        # Softmax
        attention_weights = AttentionMechanisms.softmax(scores)
        
        # Apply attention
        output = np.sum(attention_weights[:, :, None] * values, axis=1)
        
        return output, attention_weights

# Demonstrate different attention mechanisms
def compare_attention_mechanisms():
    """Compare scaled dot-product vs additive attention."""
    
    batch_size, seq_len, d_model = 2, 5, 8
    
    # Create sample data
    Q = np.random.randn(batch_size, seq_len, d_model)
    K = np.random.randn(batch_size, seq_len, d_model)
    V = np.random.randn(batch_size, seq_len, d_model)
    
    # Scaled dot-product attention
    sdp_output, sdp_weights = AttentionMechanisms.scaled_dot_product_attention(Q, K, V)
    
    # Additive attention (simplified - using first query)
    d_att = 4
    W_q = np.random.randn(d_model, d_att)
    W_k = np.random.randn(d_model, d_att)
    v = np.random.randn(d_att, 1)
    
    add_output, add_weights = AttentionMechanisms.additive_attention(
        Q[:, 0, :], K, V, W_q, W_k, v)
    
    # Visualize attention weights
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Scaled dot-product attention weights
    im1 = axes[0].imshow(sdp_weights[0], cmap='Blues', aspect='auto')
    axes[0].set_title('Scaled Dot-Product Attention')
    axes[0].set_xlabel('Key Position')
    axes[0].set_ylabel('Query Position')
    plt.colorbar(im1, ax=axes[0])
    
    # Additive attention weights
    im2 = axes[1].imshow(add_weights[0:1], cmap='Blues', aspect='auto')
    axes[1].set_title('Additive Attention (Single Query)')
    axes[1].set_xlabel('Key Position')
    axes[1].set_ylabel('Query')
    plt.colorbar(im2, ax=axes[1])
    
    plt.tight_layout()
    plt.show()
    
    print("Attention Mechanisms Comparison:")
    print("Scaled Dot-Product: O(n²d) time, no extra parameters, maps onto batched matrix multiplies")
    print("Additive: O(n²·d_att) time, extra parameters (W_q, W_k, v) and a tanh per query-key pair")
    print(f"SDP output shape: {sdp_output.shape}")
    print(f"Additive output shape: {add_output.shape}")

compare_attention_mechanisms()

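
The `mask` argument of `scaled_dot_product_attention` is never exercised in the comparison above. As a minimal sketch, assuming the same convention (mask entries equal to 0 are blocked), a lower-triangular causal mask keeps each query from attending to later positions. The helper names `softmax` and `causal_attention` are illustrative and not part of the notebook's classes.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def causal_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Scaled dot-product attention with a lower-triangular (causal) mask."""
    seq_len, d_k = Q.shape[-2], Q.shape[-1]
    # swapaxes works for any number of leading batch dimensions
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)
    # mask[i, j] == 1 where query i may attend to key j (j <= i)
    mask = np.tril(np.ones((seq_len, seq_len)))
    # The (seq_len, seq_len) mask broadcasts over the batch axis
    scores = np.where(mask == 0, -1e9, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 5, 8))
K = rng.standard_normal((2, 5, 8))
V = rng.standard_normal((2, 5, 8))
out, w = causal_attention(Q, K, V)
print(np.round(w[0], 2))  # upper triangle is zero; the first row puts all weight on key 0
```

Each row of `w` still sums to 1 because the softmax renormalizes over the unmasked keys only.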

## 2. Self-Attention

In [None]:
class SelfAttention:
    """Self-attention implementation."""
    
    def __init__(self, d_model: int, d_k: int, d_v: int):
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        
        # Linear projections for Q, K, V
        self.W_q = np.random.randn(d_model, d_k) / np.sqrt(d_model)
        self.W_k = np.random.randn(d_model, d_k) / np.sqrt(d_model)
        self.W_v = np.random.randn(d_model, d_v) / np.sqrt(d_model)
        
        # Output projection
        self.W_o = np.random.randn(d_v, d_model) / np.sqrt(d_v)
    
    def forward(self, x: np.ndarray, mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, np.ndarray]:
        """Forward pass of self-attention."""
        batch_size, seq_len, d_model = x.shape
        
        # Linear projections
        Q = x @ self.W_q  # (batch, seq_len, d_k)
        K = x @ self.W_k  # (batch, seq_len, d_k)
        V = x @ self.W_v  # (batch, seq_len, d_v)
        
        # Scaled dot-product attention
        attention_output, attention_weights = AttentionMechanisms.scaled_dot_product_attention(
            Q, K, V, mask)
        
        # Output projection
        output = attention_output @ self.W_o
        
        return output, attention_weights

def demonstrate_self_attention():
    """Demonstrate self-attention on a sequence."""
    
    # Create a simple sequence with patterns
    seq_len, d_model = 8, 16
    
    # Create sequence with some structure
    x = np.random.randn(1, seq_len, d_model)
    
    # Add some patterns (make positions 2,3 and 5,6 similar)
    x[0, 2] = x[0, 3] + 0.1 * np.random.randn(d_model)
    x[0, 5] = x[0, 6] + 0.1 * np.random.randn(d_model)
    
    # Initialize self-attention
    self_attn = SelfAttention(d_model=d_model, d_k=d_model//2, d_v=d_model//2)
    
    # Forward pass
    output, attention_weights = self_attn.forward(x)
    
    # Visualize attention pattern
    plt.figure(figsize=(10, 8))
    
    plt.subplot(2, 2, 1)
    plt.imshow(attention_weights[0], cmap='Blues', aspect='auto')
    plt.title('Self-Attention Weights')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    plt.colorbar()
    
    # Show input similarity matrix
    plt.subplot(2, 2, 2)
    similarity = x[0] @ x[0].T
    plt.imshow(similarity, cmap='Reds', aspect='auto')
    plt.title('Input Similarity Matrix')
    plt.xlabel('Position')
    plt.ylabel('Position')
    plt.colorbar()
    
    # Show attention weights for specific positions
    plt.subplot(2, 2, 3)
    positions = [0, 2, 5]
    for pos in positions:
        plt.plot(attention_weights[0, pos], label=f'Query {pos}', marker='o')
    plt.title('Attention Weights by Query Position')
    plt.xlabel('Key Position')
    plt.ylabel('Attention Weight')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Show output vs input norms
    plt.subplot(2, 2, 4)
    input_norms = np.linalg.norm(x[0], axis=1)
    output_norms = np.linalg.norm(output[0], axis=1)
    plt.plot(input_norms, label='Input', marker='o')
    plt.plot(output_norms, label='Output', marker='s')
    plt.title('Vector Norms')
    plt.xlabel('Position')
    plt.ylabel('L2 Norm')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("Self-Attention Properties:")
    print("- Each position attends to all positions in the sequence")
    print("- Attention weights show which positions are most relevant")
    print("- With trained projections, related positions tend to get higher weights; here the projections are random")
    print(f"- Input shape: {x.shape}")
    print(f"- Output shape: {output.shape}")
    print(f"- Attention weights shape: {attention_weights.shape}")

demonstrate_self_attention()
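
Self-attention on its own has no notion of order: permuting the input positions permutes the output rows in the same way. The following is a minimal sketch of that property, assuming a single head with random, untrained projections; the function name `self_attention` is illustrative. It also suggests why the positional encodings of Section 4 are needed.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention without an output projection."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8
x = rng.standard_normal((1, seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, W_q, W_k, W_v)
out_perm = self_attention(x[:, perm], W_q, W_k, W_v)

# Permuting the input rows permutes the output rows identically
equivariant = np.allclose(out[:, perm], out_perm)
print(equivariant)
```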

## 3. Multi-Head Attention

In [None]:
class MultiHeadAttention:
    """Multi-head attention implementation."""
    
    def __init__(self, d_model: int, num_heads: int):
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections for all heads
        self.W_q = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        self.W_k = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        self.W_v = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        
        # Output projection
        self.W_o = np.random.randn(d_model, d_model) / np.sqrt(d_model)
    
    def forward(self, x: np.ndarray, mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, np.ndarray]:
        """Forward pass of multi-head attention."""
        batch_size, seq_len, d_model = x.shape
        
        # Linear projections
        Q = x @ self.W_q  # (batch, seq_len, d_model)
        K = x @ self.W_k
        V = x @ self.W_v
        
        # Reshape for multi-head attention
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        # Shape: (batch, num_heads, seq_len, d_k)
        
        # Apply attention to each head
        attention_outputs = []
        attention_weights_all = []
        
        for h in range(self.num_heads):
            attn_output, attn_weights = AttentionMechanisms.scaled_dot_product_attention(
                Q[:, h], K[:, h], V[:, h], mask)
            attention_outputs.append(attn_output)
            attention_weights_all.append(attn_weights)
        
        # Concatenate heads
        concat_output = np.concatenate(attention_outputs, axis=-1)
        
        # Final linear projection
        output = concat_output @ self.W_o
        
        # Stack attention weights
        attention_weights = np.stack(attention_weights_all, axis=1)
        
        return output, attention_weights

def visualize_multi_head_attention():
    """Visualize different attention heads."""
    
    # Create sample sequence
    batch_size, seq_len, d_model = 1, 10, 64
    num_heads = 8
    
    # Create structured input
    x = np.random.randn(batch_size, seq_len, d_model)
    
    # Add some patterns
    # Pattern 1: positions 1-3 are similar
    base_pattern1 = np.random.randn(d_model)
    for i in range(1, 4):
        x[0, i] = base_pattern1 + 0.1 * np.random.randn(d_model)
    
    # Pattern 2: positions 6-8 are similar
    base_pattern2 = np.random.randn(d_model)
    for i in range(6, 9):
        x[0, i] = base_pattern2 + 0.1 * np.random.randn(d_model)
    
    # Initialize multi-head attention
    mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
    
    # Forward pass
    output, attention_weights = mha.forward(x)
    
    # Visualize different heads
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    axes = axes.flatten()
    
    for head in range(num_heads):
        im = axes[head].imshow(attention_weights[0, head], cmap='Blues', aspect='auto')
        axes[head].set_title(f'Head {head + 1}')
        axes[head].set_xlabel('Key Position')
        axes[head].set_ylabel('Query Position')
        plt.colorbar(im, ax=axes[head])
    
    plt.tight_layout()
    plt.show()
    
    # Analyze head specialization
    print("Multi-Head Attention Analysis:")
    print(f"Number of heads: {num_heads}")
    print(f"Dimension per head: {d_model // num_heads}")
    
    # Compute attention entropy for each head (measure of focus)
    entropies = []
    for head in range(num_heads):
        head_weights = attention_weights[0, head]
        # Compute entropy for each query position
        head_entropy = -np.sum(head_weights * np.log(head_weights + 1e-10), axis=1)
        entropies.append(np.mean(head_entropy))
    
    plt.figure(figsize=(10, 4))
    plt.bar(range(num_heads), entropies)
    plt.title('Attention Entropy by Head (Lower = More Focused)')
    plt.xlabel('Head')
    plt.ylabel('Average Entropy')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    print("\nHead Specialization:")
    for i, entropy in enumerate(entropies):
        focus_level = "High" if entropy < np.mean(entropies) else "Low"
        print(f"Head {i+1}: Entropy = {entropy:.3f} (Focus: {focus_level})")

visualize_multi_head_attention()
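
The per-head loop in `MultiHeadAttention.forward` is easy to read, but the same result can be obtained with one batched matrix multiply, because `@` broadcasts over leading axes. The sketch below compares the two on random data; `split_heads`, `merge_heads`, and `attention` are hypothetical helpers, and the output projection is left out to keep the comparison short.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over the last two axes."""
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)."""
    b, n, d = x.shape
    return x.reshape(b, n, num_heads, d // num_heads).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)."""
    b, h, n, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, n, h * d_k)

rng = np.random.default_rng(0)
batch, seq_len, d_model, num_heads = 2, 10, 64, 8
Q, K, V = (split_heads(rng.standard_normal((batch, seq_len, d_model)), num_heads)
           for _ in range(3))

# Loop version, mirroring the per-head loop in MultiHeadAttention.forward
looped = np.concatenate(
    [attention(Q[:, h], K[:, h], V[:, h]) for h in range(num_heads)], axis=-1)

# Batched version: matmul broadcasts over the (batch, head) axes
batched = merge_heads(attention(Q, K, V))

same = np.allclose(looped, batched)
print(looped.shape, same)
```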

## 4. Positional Encoding

In [None]:
class PositionalEncoding:
    """Positional encoding implementations."""
    
    @staticmethod
    def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
        """Sinusoidal positional encoding."""
        pe = np.zeros((seq_len, d_model))
        
        position = np.arange(seq_len)[:, np.newaxis]
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)
        
        return pe
    
    @staticmethod
    def learned_encoding(seq_len: int, d_model: int) -> np.ndarray:
        """Learned positional encoding (random initialization)."""
        return np.random.randn(seq_len, d_model) * 0.1
    
    @staticmethod
    def relative_encoding(seq_len: int, d_model: int, max_relative_position: int = 32) -> np.ndarray:
        """Relative positional encoding."""
        # Simplified relative encoding
        relative_positions = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
        relative_positions = np.clip(relative_positions, -max_relative_position, max_relative_position)
        
        # Convert to embeddings (simplified)
        relative_embeddings = np.random.randn(2 * max_relative_position + 1, d_model) * 0.1
        
        return relative_embeddings[relative_positions + max_relative_position]

def analyze_positional_encodings():
    """Analyze different positional encoding methods."""
    
    seq_len, d_model = 50, 64
    
    # Generate different encodings
    sin_pe = PositionalEncoding.sinusoidal_encoding(seq_len, d_model)
    learned_pe = PositionalEncoding.learned_encoding(seq_len, d_model)
    
    # Visualize encodings
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    
    # Sinusoidal encoding
    im1 = axes[0, 0].imshow(sin_pe.T, cmap='RdBu', aspect='auto')
    axes[0, 0].set_title('Sinusoidal Positional Encoding')
    axes[0, 0].set_xlabel('Position')
    axes[0, 0].set_ylabel('Dimension')
    plt.colorbar(im1, ax=axes[0, 0])
    
    # Learned encoding
    im2 = axes[0, 1].imshow(learned_pe.T, cmap='RdBu', aspect='auto')
    axes[0, 1].set_title('Learned Positional Encoding')
    axes[0, 1].set_xlabel('Position')
    axes[0, 1].set_ylabel('Dimension')
    plt.colorbar(im2, ax=axes[0, 1])
    
    # Show specific dimensions of sinusoidal encoding
    positions = np.arange(seq_len)
    for dim in [0, 1, 10, 20]:
        axes[0, 2].plot(positions, sin_pe[:, dim], label=f'Dim {dim}')
    axes[0, 2].set_title('Sinusoidal Encoding Dimensions')
    axes[0, 2].set_xlabel('Position')
    axes[0, 2].set_ylabel('Value')
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)
    
    # Compute similarity matrices
    sin_similarity = sin_pe @ sin_pe.T
    learned_similarity = learned_pe @ learned_pe.T
    
    im3 = axes[1, 0].imshow(sin_similarity, cmap='Blues', aspect='auto')
    axes[1, 0].set_title('Sinusoidal PE Similarity')
    axes[1, 0].set_xlabel('Position')
    axes[1, 0].set_ylabel('Position')
    plt.colorbar(im3, ax=axes[1, 0])
    
    im4 = axes[1, 1].imshow(learned_similarity, cmap='Blues', aspect='auto')
    axes[1, 1].set_title('Learned PE Similarity')
    axes[1, 1].set_xlabel('Position')
    axes[1, 1].set_ylabel('Position')
    plt.colorbar(im4, ax=axes[1, 1])
    
    # Show distance decay for sinusoidal encoding
    distances = []
    similarities = []
    
    for i in range(seq_len):
        for j in range(i+1, min(i+20, seq_len)):
            distance = j - i
            similarity = np.dot(sin_pe[i], sin_pe[j]) / (np.linalg.norm(sin_pe[i]) * np.linalg.norm(sin_pe[j]))
            distances.append(distance)
            similarities.append(similarity)
    
    axes[1, 2].scatter(distances, similarities, alpha=0.6)
    axes[1, 2].set_title('Sinusoidal PE: Distance vs Similarity')
    axes[1, 2].set_xlabel('Position Distance')
    axes[1, 2].set_ylabel('Cosine Similarity')
    axes[1, 2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("Positional Encoding Properties:")
    print("\nSinusoidal Encoding:")
    print("- Fixed, deterministic patterns")
    print("- Defined for any position, so it can be computed for sequences longer than those seen in training")
    print("- Different frequencies for different dimensions")
    print("- Relative offsets are recoverable: PE[pos + k] is a linear function of PE[pos]")
    
    print("\nLearned Encoding:")
    print("- Trainable parameters")
    print("- Can adapt to specific tasks")
    print("- Limited to training sequence length")
    print("- May learn task-specific position patterns")

analyze_positional_encodings()
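
The relative-position property of the sinusoidal encoding can be checked numerically. Since sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b), the dot product of two encodings reduces to a sum of cos((i − j)·ω_k) over the frequencies ω_k, so it depends only on the offset i − j. The sketch below re-implements the encoding so it runs on its own; the variable names are illustrative.

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Same formula as PositionalEncoding.sinusoidal_encoding."""
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

seq_len, d_model = 50, 64
pe = sinusoidal_encoding(seq_len, d_model)
gram = pe @ pe.T  # gram[i, j] = PE[i] . PE[j]

# Every diagonal of the Gram matrix (fixed offset j - i) should be constant
offsets_constant = all(
    np.allclose(np.diagonal(gram, offset=k), np.diagonal(gram, offset=k)[0])
    for k in range(seq_len)
)

# Closed form for offset k: sum over frequencies of cos(k * w)
div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
closed_form = np.array([np.cos(k * div_term).sum() for k in range(seq_len)])
matches_closed_form = np.allclose(gram[0], closed_form)

print(offsets_constant, matches_closed_form)
```

A consequence is that every encoding has the same squared norm, `d_model / 2`, which is the offset-0 case of the closed form.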

## 5. Complete Transformer Block

In [None]:
class LayerNorm:
    """Layer normalization."""
    
    def __init__(self, d_model: int, eps: float = 1e-6):
        self.eps = eps
        self.gamma = np.ones(d_model)
        self.beta = np.zeros(d_model)
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """Apply layer normalization over the last (feature) axis."""
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        # eps goes inside the square root, matching the usual definition
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

class FeedForward:
    """Position-wise feed-forward network."""
    
    def __init__(self, d_model: int, d_ff: int):
        self.W1 = np.random.randn(d_model, d_ff) / np.sqrt(d_model)
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) / np.sqrt(d_ff)
        self.b2 = np.zeros(d_model)
    
    def gelu(self, x: np.ndarray) -> np.ndarray:
        """GELU activation function (tanh approximation)."""
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """Forward pass through feed-forward network."""
        hidden = self.gelu(x @ self.W1 + self.b1)
        output = hidden @ self.W2 + self.b2
        return output

class TransformerBlock:
    """Complete transformer block."""
    
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        self.d_model = d_model
        self.num_heads = num_heads
        
        # Multi-head attention
        self.mha = MultiHeadAttention(d_model, num_heads)
        
        # Feed-forward network
        self.ffn = FeedForward(d_model, d_ff)
        
        # Layer normalization
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)
        
        self.dropout = dropout
    
    def apply_dropout(self, x: np.ndarray, training: bool = True) -> np.ndarray:
        """Apply dropout (simplified)."""
        if training and self.dropout > 0:
            mask = np.random.binomial(1, 1 - self.dropout, x.shape) / (1 - self.dropout)
            return x * mask
        return x
    
    def forward(self, x: np.ndarray, mask: Optional[np.ndarray] = None, training: bool = True) -> np.ndarray:
        """Forward pass through transformer block."""
        # Multi-head attention, then residual connection and layer norm (post-LN ordering)
        attn_output, _ = self.mha.forward(x, mask)
        attn_output = self.apply_dropout(attn_output, training)
        x1 = self.ln1.forward(x + attn_output)  # Residual connection
        
        # Feed-forward with residual connection and layer norm
        ffn_output = self.ffn.forward(x1)
        ffn_output = self.apply_dropout(ffn_output, training)
        x2 = self.ln2.forward(x1 + ffn_output)  # Residual connection
        
        return x2

def demonstrate_transformer_block():
    """Demonstrate complete transformer block."""
    
    # Parameters
    batch_size, seq_len, d_model = 2, 16, 128
    num_heads, d_ff = 8, 512
    
    # Create input
    x = np.random.randn(batch_size, seq_len, d_model)
    
    # Add positional encoding
    pos_encoding = PositionalEncoding.sinusoidal_encoding(seq_len, d_model)
    x_with_pos = x + pos_encoding[None, :, :]
    
    # Initialize transformer block
    transformer = TransformerBlock(d_model, num_heads, d_ff)
    
    # Forward pass
    output = transformer.forward(x_with_pos, training=False)
    
    # Analyze the transformation
    plt.figure(figsize=(15, 10))
    
    # Input vs output statistics
    plt.subplot(2, 3, 1)
    plt.hist(x.flatten(), bins=50, alpha=0.7, label='Input', density=True)
    plt.hist(output.flatten(), bins=50, alpha=0.7, label='Output', density=True)
    plt.title('Value Distribution')
    plt.xlabel('Value')
    plt.ylabel('Density')
    plt.legend()
    
    # Sequence-wise norms
    plt.subplot(2, 3, 2)
    input_norms = np.linalg.norm(x[0], axis=1)
    output_norms = np.linalg.norm(output[0], axis=1)
    plt.plot(input_norms, label='Input', marker='o')
    plt.plot(output_norms, label='Output', marker='s')
    plt.title('Vector Norms by Position')
    plt.xlabel('Position')
    plt.ylabel('L2 Norm')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Attention pattern from MHA
    _, attention_weights = transformer.mha.forward(x_with_pos)
    avg_attention = np.mean(attention_weights[0], axis=0)  # Average over heads
    
    plt.subplot(2, 3, 3)
    plt.imshow(avg_attention, cmap='Blues', aspect='auto')
    plt.title('Average Attention Weights')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    plt.colorbar()
    
    # Feature similarity before and after
    input_sim = x[0] @ x[0].T
    output_sim = output[0] @ output[0].T
    
    plt.subplot(2, 3, 4)
    plt.imshow(input_sim, cmap='RdBu', aspect='auto')
    plt.title('Input Similarity Matrix')
    plt.xlabel('Position')
    plt.ylabel('Position')
    plt.colorbar()
    
    plt.subplot(2, 3, 5)
    plt.imshow(output_sim, cmap='RdBu', aspect='auto')
    plt.title('Output Similarity Matrix')
    plt.xlabel('Position')
    plt.ylabel('Position')
    plt.colorbar()
    
    # Layer norm effect
    plt.subplot(2, 3, 6)
    # Show mean and std before and after layer norm
    input_means = np.mean(x[0], axis=1)
    input_stds = np.std(x[0], axis=1)
    output_means = np.mean(output[0], axis=1)
    output_stds = np.std(output[0], axis=1)
    
    positions = np.arange(seq_len)
    plt.plot(positions, input_means, label='Input Mean', alpha=0.7)
    plt.plot(positions, output_means, label='Output Mean', alpha=0.7)
    plt.plot(positions, input_stds, label='Input Std', alpha=0.7, linestyle='--')
    plt.plot(positions, output_stds, label='Output Std', alpha=0.7, linestyle='--')
    plt.title('Statistics by Position')
    plt.xlabel('Position')
    plt.ylabel('Value')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("Transformer Block Analysis:")
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Parameters: ~{d_model * d_model * 4 + d_model * d_ff * 2:,} (approximate)")
    
    print("\nKey Components:")
    print("1. Multi-Head Attention: Captures relationships between positions")
    print("2. Feed-Forward Network: Processes each position independently")
    print("3. Residual Connections: Enable deep networks and gradient flow")
    print("4. Layer Normalization: Stabilizes training and normalizes features")
    
    print("\nOutput statistics:")
    print(f"Mean: {np.mean(output):.4f}, Std: {np.std(output):.4f}")
    print("With gamma=1 and beta=0, the final layer norm gives each position mean≈0, std≈1")

demonstrate_transformer_block()
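
The parameter estimate printed above counts only the weight matrices. As a rough sketch, assuming the classes exactly as defined in this notebook (attention projections without biases, a feed-forward network with biases `b1` and `b2`, and two `LayerNorm` instances with `gamma` and `beta`), the exact count can be tallied as follows. The function name `count_block_parameters` is hypothetical.

```python
def count_block_parameters(d_model: int, d_ff: int) -> dict:
    """Count parameters of one TransformerBlock as defined in this notebook."""
    attention = 4 * d_model * d_model                        # W_q, W_k, W_v, W_o (no biases)
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model   # W1, b1, W2, b2
    layer_norm = 2 * (2 * d_model)                           # gamma and beta for ln1 and ln2
    return {
        "attention": attention,
        "ffn": ffn,
        "layer_norm": layer_norm,
        "total": attention + ffn + layer_norm,
    }

counts = count_block_parameters(d_model=128, d_ff=512)
print(counts)
# {'attention': 65536, 'ffn': 131712, 'layer_norm': 512, 'total': 197760}
```

For `d_model=128` and `d_ff=512`, the biases and layer-norm parameters add 1,152 to the approximate figure of 196,608.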

## Summary

This notebook covered the transformer architecture and attention mechanisms:

### Key Innovations:

1. **Attention Mechanisms**:
   - Scaled dot-product attention: Efficient and parallelizable
   - Self-attention: Each position attends to all positions
   - Cross-attention: Attention between different sequences

2. **Multi-Head Attention**:
   - Multiple attention heads capture different types of relationships
   - Parallel computation of attention in different subspaces
   - Concatenation and projection of head outputs

3. **Positional Encoding**:
   - Sinusoidal: Fixed patterns with extrapolation capability
   - Learned: Trainable but limited to training length
   - Relative: Encodes relative position relationships

4. **Transformer Block Components**:
   - Multi-head attention for relationship modeling
   - Feed-forward networks for position-wise processing
   - Residual connections for gradient flow
   - Layer normalization for training stability
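
Cross-attention is listed above but not implemented in the notebook's cells. A minimal sketch, assuming a single head with random, untrained projections: queries come from one sequence (for example a decoder state) while keys and values come from another (for example an encoder output). The names `cross_attention`, `target`, and `source` are illustrative.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(target, source, W_q, W_k, W_v):
    """Queries from `target`; keys and values from `source`."""
    Q = target @ W_q   # (batch, target_len, d_k)
    K = source @ W_k   # (batch, source_len, d_k)
    V = source @ W_v   # (batch, source_len, d_k)
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(Q.shape[-1])
    weights = softmax(scores)  # (batch, target_len, source_len)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
target = rng.standard_normal((1, 4, d_model))  # 4 target positions
source = rng.standard_normal((1, 7, d_model))  # 7 source positions
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))

out, weights = cross_attention(target, source, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (1, 4, 8) (1, 4, 7)
```

Unlike self-attention, the weight matrix is rectangular: one row per target position and one column per source position.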

### Advantages over RNNs:
- **Parallelization**: All positions processed simultaneously
- **Long-range dependencies**: Direct connections between all positions
- **Scalability**: Efficient for large models and datasets
- **Interpretability**: Attention weights give a partial view of which positions the model focuses on

### Impact:
The transformer architecture revolutionized NLP by:
- Enabling efficient training of very large models
- Achieving state-of-the-art results across many tasks
- Providing the foundation for modern LLMs (GPT, BERT, T5, etc.)
- Extending beyond NLP to vision, speech, and multimodal tasks

The next notebook will explore how transformers are used for language modeling and the training of large language models.