# The Attention Mechanism

Welcome to the second notebook in the “Transformers Explained” series! In this notebook, we dive into the core of the Transformer architecture: the attention mechanism, and self-attention in particular.

## What is Attention?

In neural networks, attention is a mechanism that allows the model to weigh the importance of different parts of the input (or output) sequence when making a prediction. This enables the model to focus on the most relevant parts instead of treating everything equally.

![Attention Mechanism](https://upload.wikimedia.org/wikipedia/commons/thumb/4/44/Attention-mechanism-01.svg/1280px-Attention-mechanism-01.svg.png)

## Self-Attention

Self-attention is a type of attention where the model learns the relationships between different positions in the *same* sequence. This means each word can "look at" every word in the sentence, itself included, to better understand its context.

### How does it work?

Self-attention computes three vectors for each input word, each obtained from the word's embedding through a learned linear projection:

1. **Query (Q)** – What the word is looking for.
2. **Key (K)** – What the word offers.
3. **Value (V)** – The actual content.
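
To make the three roles concrete, here is a minimal sketch of how the vectors can be produced from word embeddings. It assumes a toy embedding size of 8 and a query/key/value size of 4; these numbers and the layer names `to_q`, `to_k`, and `to_v` are illustrative choices, not something fixed by this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_size, d_k = 8, 4   # illustrative sizes (assumption)
seq_len = 3              # e.g. a three-word sentence

# One embedding vector per word: (seq_len, embed_size)
x = torch.rand(seq_len, embed_size)

# Three separate learned projections (hypothetical names)
to_q = nn.Linear(embed_size, d_k, bias=False)
to_k = nn.Linear(embed_size, d_k, bias=False)
to_v = nn.Linear(embed_size, d_k, bias=False)

# Every word gets its own query, key, and value vector
Q, K, V = to_q(x), to_k(x), to_v(x)
print(Q.shape, K.shape, V.shape)  # each has shape (seq_len, d_k)
```

Because all three come from the same input `x`, this is *self*-attention; only the learned weights differ between the three projections.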

The steps of self-attention are:

1. **Compute similarity scores** between Query and all Keys (dot product).
2. **Scale** the scores by dividing them by \( \sqrt{d_k} \).
3. **Apply softmax** to get attention weights.
4. **Weight the Values** using the attention weights to get the output.
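
The four steps above can be sketched directly on small tensors. This is a minimal illustration for a single sequence and a single head, assuming random `Q`, `K`, and `V` matrices and toy sizes chosen only for readability.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_k = 4, 8   # illustrative sizes (assumption)
Q = torch.rand(seq_len, d_k)
K = torch.rand(seq_len, d_k)
V = torch.rand(seq_len, d_k)

# 1. Similarity scores: dot product of every query with every key
scores = Q @ K.T                          # (seq_len, seq_len)

# 2. Scale by dividing by sqrt(d_k)
scaled = scores / math.sqrt(d_k)

# 3. Softmax over the key dimension, so each row of weights sums to 1
weights = torch.softmax(scaled, dim=-1)

# 4. Weighted sum of the value vectors
output = weights @ V                      # (seq_len, d_k)

print(weights.sum(dim=-1))  # each entry is 1, up to floating-point error
print(output.shape)
```

Row `i` of `weights` says how much word `i` attends to each word in the sequence, and row `i` of `output` is the corresponding mixture of value vectors.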

### Formula

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where:
- \( Q \): Query matrix
- \( K \): Key matrix
- \( V \): Value matrix
- \( d_k \): Dimension of the key (and query) vectors
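
A common motivation for the \( \sqrt{d_k} \) term: if the entries of a query and a key are independent with zero mean and unit variance, their dot product has variance \( d_k \), so unscaled scores grow with the dimension and push softmax toward nearly one-hot weights. The sketch below checks this numerically, assuming standard normal random vectors; the helper `score_std` is a hypothetical name introduced only for this illustration.

```python
import math
import torch

torch.manual_seed(0)

def score_std(d_k, n_samples=10_000):
    """Empirical standard deviation of q.k before and after scaling."""
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    dots = (q * k).sum(dim=-1)            # one dot product per sample
    scaled = dots / math.sqrt(d_k)
    return dots.std().item(), scaled.std().item()

for d_k in (4, 64, 256):
    raw, scaled = score_std(d_k)
    print(f"d_k={d_k:>3}  raw std={raw:.2f}  scaled std={scaled:.2f}")
```

Under these assumptions the raw standard deviation should come out close to \( \sqrt{d_k} \), while the scaled one stays close to 1 regardless of the dimension.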

![Self Attention](https://upload.wikimedia.org/wikipedia/commons/9/96/TransformerSelfAttention.png)

## Multi-Head Self-Attention

Multi-head self-attention runs several attention operations, called “heads”, in parallel, each with its own learned projections of the queries, keys, and values. The outputs of the heads are concatenated and passed through a final linear transformation. This allows the model to capture different types of relationships simultaneously.
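
The split, attend, and concatenate pattern can be sketched in a few lines. This is one common way to organize the computation, under assumed toy sizes; the layer names (`to_q`, `to_k`, `to_v`, `to_out`) and the helper `split_heads` are hypothetical.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, embed_size, heads = 2, 5, 16, 4   # illustrative sizes
head_dim = embed_size // heads

x = torch.rand(batch, seq_len, embed_size)

# One projection per role, plus the final output projection
to_q = nn.Linear(embed_size, embed_size, bias=False)
to_k = nn.Linear(embed_size, embed_size, bias=False)
to_v = nn.Linear(embed_size, embed_size, bias=False)
to_out = nn.Linear(embed_size, embed_size)

def split_heads(t):
    # (batch, seq_len, embed_size) -> (batch, heads, seq_len, head_dim)
    return t.reshape(batch, seq_len, heads, head_dim).transpose(1, 2)

Q, K, V = split_heads(to_q(x)), split_heads(to_k(x)), split_heads(to_v(x))

# Each head runs scaled dot-product attention independently
weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(head_dim), dim=-1)
per_head = weights @ V                    # (batch, heads, seq_len, head_dim)

# Concatenate the heads, then apply the final linear transformation
concat = per_head.transpose(1, 2).reshape(batch, seq_len, embed_size)
out = to_out(concat)

print(weights.shape)  # (batch, heads, seq_len, seq_len): one attention map per head
print(out.shape)      # (batch, seq_len, embed_size)
```

Note that the output has the same shape as the input, which is what lets attention layers be stacked.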

![Multi-Head Attention](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Transformer_Multi-Head_Attention.svg/1920px-Transformer_Multi-Head_Attention.svg.png)


In [None]:
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert self.head_dim * heads == embed_size, "Embed size must be divisible by number of heads"

        # Project the full embedding first and split into heads afterwards,
        # so every head gets its own learned projection of the whole input.
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask=None):
        N = query.shape[0]  # batch size
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Linear projections: (N, len, embed_size) -> (N, len, embed_size)
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(query)

        # Split the embedding dimension into heads: (N, len, heads, head_dim)
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        # Dot product of every query with every key, per head:
        # (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

        if mask is not None:
            # Masked positions get a very negative score, so softmax
            # assigns them a weight of (effectively) zero.
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale, then normalize over the key dimension
        attention = torch.softmax(energy / math.sqrt(self.head_dim), dim=-1)

        # Weighted sum of the values: (N, query_len, heads, head_dim)
        out = torch.einsum("nhqk,nkhd->nqhd", attention, values)

        # Concatenate the heads and apply the output projection
        out = out.reshape(N, query_len, self.heads * self.head_dim)
        return self.fc_out(out)

# Parameters
embed_size = 256
heads = 8
sequence_length = 10
batch_size = 2

attention_layer = SelfAttention(embed_size, heads)
input_data = torch.rand(batch_size, sequence_length, embed_size)
output = attention_layer(input_data, input_data, input_data)

print(f"Input shape: {input_data.shape}")
print(f"Output shape: {output.shape}")
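
The `mask` argument of `forward` is not exercised in the run above. One common use is a causal (lower-triangular) mask that stops each position from attending to later positions. The standalone sketch below assumes such a mask and works on toy 2-D tensors rather than on the class itself, to show what the `masked_fill` step does to the attention weights.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_k = 4, 8   # illustrative sizes (assumption)
Q = torch.rand(seq_len, d_k)
K = torch.rand(seq_len, d_k)

# Lower-triangular mask: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = Q @ K.T
# Same convention as in forward(): entries where mask == 0 are suppressed
scores = scores.masked_fill(mask == 0, float("-1e20"))
weights = torch.softmax(scores / math.sqrt(d_k), dim=-1)

print(weights)  # entries above the diagonal are zero; each row still sums to 1
```

To use a mask like this with the layer, it has to be broadcastable to the shape of `energy`, that is `(N, heads, query_len, key_len)`.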

## Conclusion

The attention mechanism is the foundation of Transformers. It lets the model process all positions of a sequence in parallel rather than one step at a time, and it captures dependencies between positions regardless of how far apart they are. Multi-head self-attention further enhances this by allowing the model to focus on multiple relationships in parallel.

In the next notebook, we’ll use these concepts to build a Transformer from scratch.