# Lecture 4: Self-Attention - Multi-Head Attention
## Example 1

#### Multi-Head Attention

Multi-head attention is a fundamental component of transformer architectures. It runs several attention operations (heads) in parallel, each on its own projection of the input, so the model can attend to different parts of the sequence and capture different relationships and dependencies at the same time.
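
For reference, the standard formulation can be written as follows. The symbols here are notation introduced for this note rather than names from the code below: $h$ is the number of heads, $d_k$ is the per-head dimension, and $W_i^Q, W_i^K, W_i^V$ are the per-head projection matrices (the column blocks of the full projection weights).

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{head}_i = \mathrm{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O
$$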

Below are two Python examples demonstrating how multi-head attention works:
- From Scratch Implementation Using NumPy
- Using PyTorch's Built-in Modules

#### 1. Multi-Head Attention from Scratch Using NumPy

This example illustrates the core mechanics of multi-head attention without relying on deep learning libraries. It includes the creation of query, key, and value matrices, splitting them into multiple heads, performing scaled dot-product attention, and concatenating the results.

**Step-by-Step Explanation**

- Initialize Parameters:
    - Define dimensions for embedding, queries, keys, values, and the number of heads.
    - Initialize random weights for linear transformations.

- Create Query, Key, and Value Matrices:
    - The input X is multiplied by the weight matrices W_q, W_k, and W_v to produce the queries (Q), keys (K), and values (V).

- Split into Multiple Heads:
    - Q, K, and V are each reshaped into num_heads heads of size head_dim = embedding_dim // num_heads, so that each head attends to information from a different representation subspace.

- Scaled Dot-Product Attention:
    - For each head, compute attention scores as the dot product of Q with the transpose of K, divide them by sqrt(head_dim), apply softmax over the key axis, and multiply the resulting weights by V.

- Concatenate Heads and Final Linear Transformation:
    - Concatenate the outputs from all heads and apply a final linear transformation (W_o) to produce the output.
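
The steps above can also be sketched with an explicit loop over heads, which makes it easier to see that each head works on its own column slice of Q, K, and V. This is a minimal sketch under stated assumptions, not the implementation used in this lecture: `multi_head_attention_loop` is a hypothetical helper name, the input is a single unbatched sequence, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def multi_head_attention_loop(X, W_q, W_k, W_v, W_o, num_heads):
    """Hypothetical helper: multi-head attention with one loop iteration per head.

    X has shape (seq_length, embedding_dim); no batch axis is assumed here.
    """
    seq_length, embedding_dim = X.shape
    head_dim = embedding_dim // num_heads

    # Linear projections, each of shape (seq_length, embedding_dim)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    head_outputs = []
    for h in range(num_heads):
        # Head h owns a contiguous block of head_dim columns
        cols = slice(h * head_dim, (h + 1) * head_dim)
        q, k, v = Q[:, cols], K[:, cols], V[:, cols]

        # Scaled dot-product attention for this head
        scores = q @ k.T / np.sqrt(head_dim)   # (seq_length, seq_length)
        weights = softmax(scores, axis=-1)     # each row sums to 1
        head_outputs.append(weights @ v)       # (seq_length, head_dim)

    # Concatenate heads, then apply the output projection
    context = np.concatenate(head_outputs, axis=-1)  # (seq_length, embedding_dim)
    return context @ W_o

# Illustrative random data (assumed sizes: 4 tokens, 8 dimensions, 2 heads)
rng = np.random.default_rng(0)
X = rng.random((4, 8))
W_q, W_k, W_v, W_o = (rng.random((8, 8)) for _ in range(4))

out = multi_head_attention_loop(X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape)  # (4, 8)
```

The reshape-and-transpose approach computes the same quantities for all heads in one batched matrix product; the loop is only there to make the per-head slicing visible.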

```python
import numpy as np

def softmax(x, axis=-1):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

# Parameters
np.random.seed(0)
batch_size = 1
seq_length = 4
embedding_dim = 8
num_heads = 2
head_dim = embedding_dim // num_heads

# Random input
X = np.random.rand(batch_size, seq_length, embedding_dim)

# Initialize weights
W_q = np.random.rand(embedding_dim, embedding_dim)
W_k = np.random.rand(embedding_dim, embedding_dim)
W_v = np.random.rand(embedding_dim, embedding_dim)
W_o = np.random.rand(embedding_dim, embedding_dim)

# Linear projections
Q = X @ W_q  # Shape: (batch_size, seq_length, embedding_dim)
K = X @ W_k
V = X @ W_v

# Split into heads
def split_heads(x, num_heads, head_dim):
    batch_size, seq_length, embed_dim = x.shape
    x = x.reshape(batch_size, seq_length, num_heads, head_dim)
    return x.transpose(0, 2, 1, 3)  # (batch_size, num_heads, seq_length, head_dim)

Q = split_heads(Q, num_heads, head_dim)
K = split_heads(K, num_heads, head_dim)
V = split_heads(V, num_heads, head_dim)

# Scaled Dot-Product Attention
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(head_dim)  # (batch_size, num_heads, seq_length, seq_length)
attention = softmax(scores, axis=-1)
context = attention @ V  # (batch_size, num_heads, seq_length, head_dim)

# Concatenate heads
context = context.transpose(0, 2, 1, 3).reshape(batch_size, seq_length, embedding_dim)

# Final linear layer
output = context @ W_o  # (batch_size, seq_length, embedding_dim)

print("Input X:\n", X)
print("\nOutput after Multi-Head Attention:\n", output)
```

```text
Input X:
 [[[0.5488135  0.71518937 0.60276338 0.54488318 0.4236548  0.64589411
   0.43758721 0.891773  ]
  [0.96366276 0.38344152 0.79172504 0.52889492 0.56804456 0.92559664
   0.07103606 0.0871293 ]
  [0.0202184  0.83261985 0.77815675 0.87001215 0.97861834 0.79915856
   0.46147936 0.78052918]
  [0.11827443 0.63992102 0.14335329 0.94466892 0.52184832 0.41466194
   0.26455561 0.77423369]]]

Output after Multi-Head Attention:
 [[[ 9.19301463 10.44328382  9.2244454   8.05737673 10.98670376
    9.43520132 10.65160547  9.78990228]
  [ 9.10985062 10.36255368  9.14890231  7.99563435 10.88414448
    9.35370385 10.56035521  9.70807359]
  [ 9.21809014 10.45357218  9.24102286  8.0718153  11.01227309
    9.45719822 10.66877882  9.80798268]
  [ 9.05051238 10.29643008  9.09708768  7.94442927 10.80930673
    9.29190295 10.48942527  9.64204537]]]
```


**Output Explanation**
- Input X: Randomly generated input tensor with shape (batch_size, seq_length, embedding_dim), here (1, 4, 8).
- Output: The result of the multi-head attention mechanism, with the same shape as the input, (1, 4, 8).
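
The shape claim can be confirmed with a few quick checks. The block below is a compact, self-contained restatement of the NumPy example, reusing its parameter names and seed; the assertions are additions for illustration and are not part of the original notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

np.random.seed(0)
batch_size, seq_length, embedding_dim, num_heads = 1, 4, 8, 2
head_dim = embedding_dim // num_heads

X = np.random.rand(batch_size, seq_length, embedding_dim)
W_q, W_k, W_v, W_o = (np.random.rand(embedding_dim, embedding_dim) for _ in range(4))

def split_heads(x):
    # (batch, seq, embed) -> (batch, heads, seq, head_dim)
    return x.reshape(batch_size, seq_length, num_heads, head_dim).transpose(0, 2, 1, 3)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
attention = softmax(Q @ K.transpose(0, 1, 3, 2) / np.sqrt(head_dim), axis=-1)
context = (attention @ V).transpose(0, 2, 1, 3).reshape(batch_size, seq_length, embedding_dim)
output = context @ W_o

# The output keeps the input's shape
assert output.shape == X.shape

# One (seq_length x seq_length) weight matrix per head
assert attention.shape == (batch_size, num_heads, seq_length, seq_length)

# Softmax makes every row of attention weights sum to 1
assert np.allclose(attention.sum(axis=-1), 1.0)

print("All checks passed")
```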

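Note: Because the code calls np.random.seed(0), rerunning it reproduces the values shown above (up to floating-point rounding). Changing or removing the seed produces different numbers.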