# Attention Is All You Need

At its core, the self-attention mechanism revolves around the interplay of three components: **query**, **key**, and **value**. Together they determine how information is weighted and propagated in attention models such as the Transformer.

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right) \cdot V $$

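Here $d_k$ is the dimension of the key vectors. When $Q = K$, the term $QK^{T}$ contains the pairwise dot products between the rows of $Q$, indicating how similar the elements of $Q$ are to one another. This is the self-attention case.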

## Why Use $\sqrt{d_k}$?

[proof.pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/5df62dc5-72a1-4bf6-ac20-b804fe52d000/proof.pdf)

### Lemma 1

Assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_{i}k_{i}$, has mean 0 and variance $d_{k}$.

The mean can be determined using the **linearity of expectation**:

$$ E[q \cdot k] = E\left[\sum_{i=1}^{d_k} q_i k_i\right] $$

$$ = \sum_{i=1}^{d_k} E[q_ik_i] $$

Since $q_i$ and $k_i$ are independent and each has mean 0, the expectation of each product factorizes:

$$ = \sum_{i=1}^{d_k} E[q_i]E[k_i] = 0 $$

Thus, the mean of $q \cdot k$ equals 0.

For the variance, note that variance is not linear in general, but the variance of a sum of independent random variables is the sum of their variances. The products $q_ik_i$ are independent across $i$, so:

$$ \text{var}[q \cdot k] = \text{var}\left[\sum_{i=1}^{d_k}q_ik_i\right] $$

$$ = \sum_{i=1}^{d_k}\text{var}[q_ik_i] $$

Each term has variance 1, because $q_i$ and $k_i$ are independent with mean 0 and variance 1:

$$ \text{var}[q_ik_i] = E[q_i^2]E[k_i^2] - \left(E[q_i]E[k_i]\right)^2 = 1 \cdot 1 - 0 = 1 $$

Hence $\text{var}[q \cdot k] = d_k$.

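To give the dot product a mean of 0 and a standard deviation of 1, it is divided by $\sqrt{d_k}$. This keeps the softmax inputs in a range where the softmax does not saturate as $d_k$ grows. Note that the argument relies on the assumptions of Lemma 1, which hold only approximately in practice, especially when the inputs are not normalized (for example, when layer normalization is not used).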
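Lemma 1 can be sanity-checked with a short simulation. The sketch below assumes standard normal components (one distribution that satisfies the mean 0, variance 1 assumption) and a hypothetical `d_k = 64`. The helper names are illustrative and do not come from any library.

```python
import math
import random


def sample_dot_products(d_k, n_samples, rng):
    """Draw n_samples dot products q.k with independent N(0, 1) components."""
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(qi * ki for qi, ki in zip(q, k)))
    return dots


def mean_and_variance(xs):
    """Return the sample mean and the (population-style) sample variance."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, var


rng = random.Random(0)   # fixed seed so the run is repeatable
d_k = 64                 # assumed key dimension, chosen for illustration

raw = sample_dot_products(d_k, 10_000, rng)
scaled = [x / math.sqrt(d_k) for x in raw]

raw_mean, raw_var = mean_and_variance(raw)
scaled_mean, scaled_var = mean_and_variance(scaled)

print(f"raw:    mean={raw_mean:.3f}, var={raw_var:.3f}")      # var should be near d_k
print(f"scaled: mean={scaled_mean:.3f}, var={scaled_var:.3f}")  # var should be near 1
```

Under these assumptions, the raw variance should land close to `d_k` and the scaled variance close to 1, up to sampling noise.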

### Scaled Dot Product Attention

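**Scaled Dot Product Attention** is the name for this computation: the dot-product scores $QK^{T}$ are scaled by $1/\sqrt{d_k}$, passed through a softmax, and used to weight the values.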

Suppose **Query**, **Key**, and **Value** are all the same matrix with three rows, where each row $v_i$ is a vector of dimension $d_k$:

$$ 
Q = K = V = \begin{bmatrix} 
v_1 \\ 
v_2 \\ 
v_3 
\end{bmatrix} 
$$

Then $QK^{T}$ is a $3 \times 3$ matrix of pairwise dot products:

$$ 
QK^T = \begin{bmatrix} 
v_1 \cdot v_1 & v_1 \cdot v_2 & v_1 \cdot v_3 \\ 
v_2 \cdot v_1 & v_2 \cdot v_2 & v_2 \cdot v_3 \\ 
v_3 \cdot v_1 & v_3 \cdot v_2 & v_3 \cdot v_3 
\end{bmatrix} 
$$

We then divide $QK^{T}$ by $\sqrt{d_k}$ and apply a row-wise softmax, obtaining the **attention weights**, where each row sums to 1:

$$ 
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) = \begin{bmatrix} 
w_{11} & w_{12} & w_{13} \\ 
w_{21} & w_{22} & w_{23} \\ 
w_{31} & w_{32} & w_{33} 
\end{bmatrix} 
$$

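Multiplying the attention weights by the value matrix gives the output, where each $y_i = \sum_{j} w_{ij} v_j$ is a weighted average of the value rows: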

$$ 
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V = \begin{bmatrix} 
y_1 \\ 
y_2 \\ 
y_3 
\end{bmatrix} 
$$

The attention mechanism gauges the similarity between a *query* (the word we're focusing on) and each *key* (a word we're comparing against). The resulting similarity scores are then used to weight the corresponding rows of the **Value** matrix.
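The three-token walkthrough above can be traced with concrete numbers. The following is a minimal pure-Python sketch, assuming hypothetical 2-dimensional embeddings for the three tokens; the function names and the values in `X` are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    # each row of weights sums to 1
    weights = [softmax(row) for row in scores]
    # output row i is the weighted average of the rows of V
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# Hypothetical embeddings v_1, v_2, v_3 (assumed values, d_k = 2)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: Q = K = V = X
Y, W = scaled_dot_product_attention(X, X, X)

for i, (w_row, y_row) in enumerate(zip(W, Y), start=1):
    print(f"token {i}: weights={[round(w, 3) for w in w_row]}, "
          f"output={[round(y, 3) for y in y_row]}")
```

With these assumed inputs, the first token attends equally to itself and to the third token, since $v_1 \cdot v_1 = v_1 \cdot v_3 = 1$, and gives less weight to the second token, since $v_1 \cdot v_2 = 0$.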

### Scaled Dot Product Attention in PyTorch

```python
# !pip3 install -U torch

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention with optional masking and dropout."""

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        # temperature is the scaling factor, e.g. sqrt(d_k) to match the formula above
        self.temperature = temperature
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        # q, k, v are expected to be 4-D: (batch, heads, seq_len, dim)
        # scaled similarity scores: (q / temperature) @ k^T
        attn = torch.matmul(q / self.temperature, k.transpose(2, 3))

        # positions where mask == 0 get a large negative score,
        # so their softmax weight is effectively zero
        if mask is not None:
            attn = attn.masked_fill(mask == 0, -1e9)

        # softmax over the key dimension, then dropout on the weights
        attn = self.dropout(F.softmax(attn, dim=-1))

        # weighted sum of the values: softmax(QK^T / temperature) @ V
        output = torch.matmul(attn, v)

        # return the output and the attention weights
        return output, attn
```
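
To illustrate how this kind of attention is typically called, the sketch below applies it with a causal mask. It is written as a standalone function so that it runs on its own. It assumes tensors laid out as `(batch, heads, seq_len, d_k)`, a temperature of $\sqrt{d_k}$, and no dropout; the function name `attention` and all sizes are hypothetical.

```python
import math

import torch
import torch.nn.functional as F


def attention(q, k, v, mask=None):
    """Standalone functional sketch of the forward pass above, without dropout."""
    temperature = math.sqrt(q.size(-1))  # assumed choice: sqrt(d_k)
    attn = torch.matmul(q / temperature, k.transpose(-2, -1))
    if mask is not None:
        # positions where mask == 0 are excluded from the softmax
        attn = attn.masked_fill(mask == 0, -1e9)
    attn = F.softmax(attn, dim=-1)
    return torch.matmul(attn, v), attn


torch.manual_seed(0)
batch, heads, seq_len, d_k = 2, 4, 5, 16  # illustrative sizes
x = torch.randn(batch, heads, seq_len, d_k)

# Lower-triangular causal mask: position i may attend only to positions <= i.
# Shape (1, 1, seq_len, seq_len) broadcasts over batch and heads.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)

# Self-attention: q = k = v = x
output, attn = attention(x, x, x, mask=causal_mask)

print(output.shape)  # torch.Size([2, 4, 5, 16])
print(attn.shape)    # torch.Size([2, 4, 5, 5])
```

The mask entries that equal 0 receive a score of `-1e9` before the softmax, so the corresponding attention weights are effectively zero and each position ignores later positions.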