# Lecture 4 Self Attention - Multi Head Attention
## Example 2

#### Multi-Head Attention

Multi-head attention is a fundamental component of transformer architectures, allowing models to focus on different parts of the input simultaneously. It enhances the model's ability to capture various relationships and dependencies within the data.
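
For reference, the operation can be written in the standard notation of the transformer formulation. The symbols below ($Q$, $K$, $V$, the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$, the output projection $W^O$, the number of heads $h$, and the per-head dimension $d_k$) do not appear in the original notes and are introduced here only for illustration:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O
$$

In the PyTorch example in this section, $h$ corresponds to num_heads = 2 and $d_k$ to embedding_dim / num_heads = 4.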

Below are two Python examples demonstrating how multi-head attention works:
- From Scratch Implementation Using NumPy
- Using PyTorch's Built-in Modules

#### 2. Multi-Head Attention Using PyTorch

Leveraging PyTorch's built-in nn.MultiheadAttention module simplifies the implementation, handling much of the complexity internally. This example demonstrates how to use this module within a simple transformer-like setup.

**Step-by-Step Explanation**
- Import Libraries and Set Parameters:
    - Define dimensions, number of heads, and other necessary parameters.
- Initialize the MultiheadAttention Module:
    - Create an instance of nn.MultiheadAttention with the specified embedding dimension and number of heads.
- Prepare Input Data:
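    - By default, nn.MultiheadAttention expects inputs of shape (seq_length, batch_size, embedding_dim); with batch_first=True, as used in this example, the shape is (batch_size, seq_length, embedding_dim).
    - Initialize a single random input tensor X, which serves as the queries, keys, and values (self-attention).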
- Perform Multi-Head Attention:
    - Pass the queries, keys, and values to the forward method of the MultiheadAttention module.
    - Obtain the output and attention weights.
- Output the Results:
    - Display the attention output and the attention weights.
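
The input-shape convention mentioned in the steps above can be checked with a minimal sketch. It assumes a PyTorch version that supports the batch_first argument; the variable names (attn_seq_first, attn_batch_first, and so on) are illustrative and not part of the original notebook. The same weights are loaded into both modules so that only the layout differs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, embedding_dim, num_heads = 1, 4, 8, 2

# Default layout (batch_first=False): (seq_length, batch_size, embedding_dim)
attn_seq_first = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads)
x_seq_first = torch.rand(seq_length, batch_size, embedding_dim)
out_seq_first, w_seq_first = attn_seq_first(x_seq_first, x_seq_first, x_seq_first)

# batch_first=True layout: (batch_size, seq_length, embedding_dim)
attn_batch_first = nn.MultiheadAttention(
    embed_dim=embedding_dim, num_heads=num_heads, batch_first=True
)
# Copy the parameters so both modules compute the same function
attn_batch_first.load_state_dict(attn_seq_first.state_dict())
x_batch_first = x_seq_first.transpose(0, 1)
out_batch_first, w_batch_first = attn_batch_first(x_batch_first, x_batch_first, x_batch_first)

print(out_seq_first.shape)    # torch.Size([4, 1, 8])
print(out_batch_first.shape)  # torch.Size([1, 4, 8])
# The averaged attention weights are (batch_size, seq_length, seq_length) in both layouts
print(w_seq_first.shape, w_batch_first.shape)
# Up to the transpose, the two layouts give the same output
same_output = torch.allclose(out_seq_first.transpose(0, 1), out_batch_first, atol=1e-5)
print(same_output)
```

Only the position of the batch dimension in the inputs and the output changes; the attention weights keep the batch dimension first in both cases.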

```python
import torch
import torch.nn as nn

# Parameters
torch.manual_seed(0)
batch_size = 1
seq_length = 4
embedding_dim = 8
num_heads = 2

# Initialize MultiheadAttention
multihead_attn = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads, batch_first=True)

# Random input
X = torch.rand(batch_size, seq_length, embedding_dim)

# In PyTorch's MultiheadAttention, inputs are of shape (batch_size, seq_length, embedding_dim) if batch_first=True
# Otherwise, shape should be (seq_length, batch_size, embedding_dim)

# Perform Multi-Head Attention
# Here, we use the same tensor for queries, keys, and values (self-attention)
attn_output, attn_weights = multihead_attn(X, X, X)

print("Input X:\n", X)
print("\nAttention Output:\n", attn_output)
print("\nAttention Weights:\n", attn_weights)
```

```text
Input X:
 tensor([[[0.6814, 0.3330, 0.3603, 0.6477, 0.9110, 0.6359, 0.2634, 0.2650],
         [0.0273, 0.6080, 0.2194, 0.0542, 0.9384, 0.1753, 0.4431, 0.6432],
         [0.5159, 0.1636, 0.0958, 0.8985, 0.5814, 0.9148, 0.3324, 0.6473],
         [0.3857, 0.4778, 0.1955, 0.6691, 0.6581, 0.4897, 0.3875, 0.1918]]])

Attention Output:
 tensor([[[ 0.0371, -0.0003,  0.0308, -0.0156,  0.0638, -0.0011, -0.0370,
           0.0329],
         [ 0.0410, -0.0081,  0.0399, -0.0159,  0.0690, -0.0020, -0.0392,
           0.0409],
         [ 0.0385, -0.0004,  0.0322, -0.0143,  0.0650,  0.0003, -0.0388,
           0.0346],
         [ 0.0398, -0.0007,  0.0333, -0.0137,  0.0671,  0.0011, -0.0399,
           0.0359]]], grad_fn=<TransposeBackward0>)

Attention Weights:
 tensor([[[0.2563, 0.2470, 0.2437, 0.2530],
         [0.2554, 0.2512, 0.2436, 0.2498],
         [0.2536, 0.2448, 0.2509, 0.2507],
         [0.2554, 0.2424, 0.2496, 0.2525]]], grad_fn=<MeanBackward1>)
```


**Output Explanation**

- Input X: Randomly generated input tensor with shape (batch_size, seq_length, embedding_dim).
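- Attention Output: The result of the multi-head attention operation, with the same shape as the input, (batch_size, seq_length, embedding_dim). The outputs of the individual heads are concatenated and passed through the module's output projection.
- Attention Weights: The attention weights averaged across heads (the module's default behavior, reflected in grad_fn=<MeanBackward1>), with shape (batch_size, seq_length, seq_length). Each row shows how much one position attends to every position and sums to 1. The values are all close to 0.25 because the module is randomly initialized and untrained.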

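Note: torch.manual_seed(0) fixes the random initialization, so repeated runs in the same environment give the same values. The exact numbers may still differ across PyTorch versions or platforms.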
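
By default, nn.MultiheadAttention returns attention weights averaged over the heads. To inspect each head separately, the forward call accepts average_attn_weights=False. The sketch below assumes a PyTorch version that supports this argument (1.11 or later); the names avg_weights and head_weights are illustrative and not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, embedding_dim, num_heads = 1, 4, 8, 2

multihead_attn = nn.MultiheadAttention(
    embed_dim=embedding_dim, num_heads=num_heads, batch_first=True
)
X = torch.rand(batch_size, seq_length, embedding_dim)

# Default: weights averaged over heads -> (batch_size, seq_length, seq_length)
_, avg_weights = multihead_attn(X, X, X)

# Per-head weights -> (batch_size, num_heads, seq_length, seq_length)
_, head_weights = multihead_attn(X, X, X, average_attn_weights=False)

print(avg_weights.shape)   # torch.Size([1, 4, 4])
print(head_weights.shape)  # torch.Size([1, 2, 4, 4])

# Each row of each head is a softmax distribution, so it sums to 1
row_sums = head_weights.sum(dim=-1)
print(row_sums)

# Averaging the per-head weights over the head dimension recovers the default output
matches_default = torch.allclose(head_weights.mean(dim=1), avg_weights, atol=1e-6)
print(matches_default)
```

With an untrained module the two heads look almost identical; differences between heads only become meaningful after training.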