# Multihead Self-Attention

*VietAI Advanced NLP*

In this exercise, we will build Multihead Self-Attention in PyTorch, fully parallelized over multiple queries, multiple heads, and multiple sequences in a batch (using broadcasting). This component will then be used to build the Transformer architecture.

Notes:
- Lowercase characters (e.g., `q`, `k`, `v`) represent a single vector (a 1-dimensional tensor).
- Uppercase characters (e.g., `Q`, `K`, `V`) represent tensors with 2 or more dimensions (the number of dimensions is given in the docstring of each function).
- You need to complete the sections marked as follows:

```python
########### YOUR CODE HERE #################
###########################################
```

We start by installing the necessary libraries & some auxiliary functions:

In [None]:
!pip install einops
!wget -c https://gist.githubusercontent.com/Luvata/55f7b3e9ae451122b9e3faf0a7387b4f/raw/440fac5c6e7153fd39e4eb9ebec6e51c9520ef1f/visualize.py
!pip install --upgrade graphviz


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
import numpy as np
import time
import matplotlib.pyplot as plt
%matplotlib inline
from visualize import display_module

In [None]:
def benchmark(input_generator, slow_function, fast_function, n_tries=100):
    """
    This function verifies that the results of `slow_function` and
    `fast_function` are equal for all `n_tries` inputs from `input_generator`.
    It also prints the average run time of each function.
    """
    def _one_try(input, function):
        start_time = time.time()
        output = function(*input)
        return time.time() - start_time, output
    
    def _stat_str(list_times):
        return f"AVG: {np.mean(list_times)}+-{np.std(list_times)}"
    
    slow_times = []
    fast_times = []
    for i in range(n_tries):
        input = next(input_generator)
        slow_time, slow_result = _one_try(input, slow_function)
        fast_time, fast_result = _one_try(input, fast_function)
        assert torch.allclose(slow_result, fast_result)
        slow_times.append(slow_time)
        fast_times.append(fast_time)
        
    print("Your output is correct")
    print("Timing of slow function: ", _stat_str(slow_times))  
    print("Timing of fast function: ", _stat_str(fast_times))
    print(f"Speedup: {np.mean(slow_times) / np.mean(fast_times)} times")

$d_k$, $d_v$, $n_{head}$, ... are Transformer hyperparameters. `N_queries` and `M_keys` are settings used to test attention. `batch_size` is used to test broadcasting over multiple samples in a single forward pass.

In [None]:
d_k = 64
d_v = 64
N_queries = 32
M_keys = 32
n_head = 8
batch_size = 32

Scaled dot-product attention is calculated as follows:
$$Attention(q, K, V) = Softmax\left(\frac{qK^T}{\sqrt{d_k}}\right)V = \sum_{i}\frac{e^{score(q,k_i)}}{\sum_j e^{score(q, k_j)}}v_i$$

with $$score(q, k) = \frac{q \cdot k}{\sqrt{d_k}}$$

If $score(q, k_i) = -\infty$, then $e^{score(q, k_i)} = e^{-\infty} = 0$, so the weight of $v_i$ is zero and $v_i$ contributes no information to the attention result. In other words, $v_i$ has been masked when calculating attention.
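As a small sanity check of the formula above, the sketch below compares the matrix form with the explicit weighted sum over keys, and confirms that a score of $-\infty$ yields a weight of exactly zero. The tensor sizes and the index of the masked key are arbitrary values chosen for illustration, not part of the exercise.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k, d_v, M = 4, 3, 5  # small illustrative sizes (assumed)
q = torch.rand(1, d_k)
K = torch.rand(M, d_k)
V = torch.rand(M, d_v)

# Matrix form: softmax(q K^T / sqrt(d_k)) V
scores = (q @ K.T) / d_k ** 0.5              # (1, M)
matrix_form = F.softmax(scores, dim=-1) @ V  # (1, d_v)

# Explicit sum over keys: sum_i weight_i * v_i
exp_scores = torch.exp(scores[0])            # (M,)
weights = exp_scores / exp_scores.sum()      # (M,)
sum_form = sum(w * v for w, v in zip(weights, V)).unsqueeze(0)  # (1, d_v)

# A score of -inf drives the corresponding weight to exactly zero
masked_scores = scores.clone()
masked_scores[0, 2] = float("-inf")
masked_weights = F.softmax(masked_scores, dim=-1)

print(torch.allclose(matrix_form, sum_form, atol=1e-6))
print(masked_weights)
```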

```
sequence              <BOS>     I    go    to    school  <EOS> <PAD> <PAD>
                        |       |     |    |       |       |     |     |
                        v       v     v    v       v       v     v     v
PAD mask                0       0     0    0       0       0     1     1 
```

In the LSTM seq2seq + Attention architecture (Bahdanau et al., Luong et al.), the attention of the decoder hidden state at one decoding step is calculated over the embeddings of the $M$ encoder tokens. A `mask` is applied to the `PAD` tokens of the input sentence, because a `PAD` token carries no information about the sentence and only serves to pad sequences to equal length.

In attention, a `mask` is necessary whenever we need to remove the contribution of a (k, v) pair that carries no information (e.g., a `PAD` token) or that has no link to the current query (e.g., the causal mask in the Transformer decoder).
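The two kinds of mask mentioned above could be built roughly as follows. This is only a sketch: `PAD_ID` and the token ids are made-up values for illustration, and the convention `True` = masked matches the one used in the rest of this notebook.

```python
import torch

PAD_ID = 0  # hypothetical padding id, assumed for this example
# <BOS> I go to school <EOS> <PAD> <PAD>, with made-up token ids
token_ids = torch.tensor([[5, 8, 2, 9, 7, 3, PAD_ID, PAD_ID]])  # (1, M)

# PAD mask: True where the key is a PAD token
pad_mask = token_ids == PAD_ID  # (1, M)

# Causal mask: query i may not attend to keys j > i
M = token_ids.shape[1]
causal_mask = torch.triu(torch.ones(M, M), diagonal=1).bool()  # (M, M)

# Both constraints at once: (M, M) | (1, M) broadcasts to (M, M)
combined_mask = causal_mask | pad_mask

print(pad_mask)
print(causal_mask[:3, :3])
```

A position is masked in `combined_mask` if either mask marks it, which is why the two are joined with a logical OR.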

In PyTorch, a `mask` is applied to a tensor with the `masked_fill` function as follows:

In [None]:
print("Original:")
similar_score = torch.tensor([[0.5, 2.3, 1.4, -1.3, 3.1]])
print("similar score  :", similar_score)
print("softmax:", F.softmax(similar_score, dim=-1))

n_step = similar_score.shape[1]
mask = torch.ones_like(similar_score).bool() # [[True, True, True, True, True]]

for step_idx in range(n_step):
    mask[0, step_idx] = False
    masked_score = similar_score.masked_fill(mask, value=-1e9)
    print(f"step #{step_idx}:", "mask:", mask, "softmax out: ", F.softmax(masked_score, dim=-1))

Original:
similar score  : tensor([[ 0.5000,  2.3000,  1.4000, -1.3000,  3.1000]])
softmax: tensor([[0.0432, 0.2615, 0.1063, 0.0071, 0.5819]])
step #0: mask: tensor([[False,  True,  True,  True,  True]]) softmax out:  tensor([[1., 0., 0., 0., 0.]])
step #1: mask: tensor([[False, False,  True,  True,  True]]) softmax out:  tensor([[0.1419, 0.8581, 0.0000, 0.0000, 0.0000]])
step #2: mask: tensor([[False, False, False,  True,  True]]) softmax out:  tensor([[0.1052, 0.6362, 0.2587, 0.0000, 0.0000]])
step #3: mask: tensor([[False, False, False, False,  True]]) softmax out:  tensor([[0.1034, 0.6253, 0.2542, 0.0171, 0.0000]])
step #4: mask: tensor([[False, False, False, False, False]]) softmax out:  tensor([[0.0432, 0.2615, 0.1063, 0.0071, 0.5819]])


At step #4, nothing is masked, so the softmax output equals the softmax result without a mask. In the other steps, when the key at position $i$ is masked (`mask[0][i] == True`), the softmax weight at that position is `0`. A scaled dot-product attention function using a mask in PyTorch can be defined as follows:

![scale-dot-product attention](https://raw.githubusercontent.com/Luvata/gifs/main/figures/scale_dot_product.png)

In [None]:
def scale_dot_product_attention(q, K, V, mask):
    """Scaled dot-product attention on a single query
    Arguments:
        q: torch.Tensor shape (1, d_k)
        K: torch.Tensor shape (M, d_k)
        V: torch.Tensor shape (M, d_v)
        mask: torch.BoolTensor shape (1, M)
        
        if mask[0, i] == True, the pair (k, v) at index `i` will be masked
        when calculating attention
    Return:
        scaled-dot attention: torch.Tensor shape (1, d_v)
    """
    _, d_k = q.shape
    scale = d_k ** -0.5
    similar_score = q @ K.T * scale # (1, d_k) @ (d_k, M) -> (1, M)
    similar_score = similar_score.masked_fill(mask, value=float("-inf"))
    attention_weight = F.softmax(similar_score, dim=-1) # (1, M)
    attention = attention_weight @ V # (1, M) @ (M, d_v) -> (1, d_v)
    return attention



- `@` is matrix multiplication in PyTorch (equivalent to `torch.matmul`).
- In the `F.softmax` call, `dim=-1` applies the softmax along the last axis (`M`).
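Both notes can be checked quickly on small random tensors. The shapes below are arbitrary and only serve as an illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# `@` performs matrix multiplication, just like torch.matmul
A = torch.rand(2, 4)
B = torch.rand(4, 3)
same_as_matmul = torch.allclose(A @ B, torch.matmul(A, B))

# Softmax over the last axis: every row becomes a distribution over the keys
scores = torch.rand(2, 5)
weights = F.softmax(scores, dim=-1)
row_sums = weights.sum(dim=-1)  # one value per row

print(same_as_matmul)
print(row_sums)
```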

In [None]:
q = torch.rand(1, d_k)
K = torch.rand(M_keys, d_k)
V = torch.rand(M_keys, d_v)
mask = torch.randint(low=0, high=2, size=(1, M_keys)).bool()

print(mask)
print(scale_dot_product_attention(q, K, V, mask).shape)

tensor([[False, False, False,  True, False,  True, False,  True,  True,  True,
          True,  True, False, False,  True, False,  True,  True, False,  True,
          True,  True, False, False, False, False, False, False,  True,  True,
          True,  True]])
torch.Size([1, 64])
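One edge case is worth knowing about: `scale_dot_product_attention` fills masked scores with `float("-inf")`, while the earlier `masked_fill` demo used `-1e9`. The two behave differently when every key of a query is masked. The sketch below uses a made-up score vector to show the difference; with the random masks used in this notebook, a fully masked row is extremely unlikely.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[0.5, 2.3, 1.4]])
all_masked = torch.ones_like(scores).bool()  # every key is masked

# -inf everywhere: softmax has nothing to normalize and returns NaN
with_inf = F.softmax(scores.masked_fill(all_masked, float("-inf")), dim=-1)

# Large finite negative value everywhere: softmax falls back to uniform weights
with_big_neg = F.softmax(scores.masked_fill(all_masked, -1e9), dim=-1)

print(with_inf)
print(with_big_neg)
```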


## 1. Broadcasting with multiple queries

*Let's get started!*

The `scale_dot_product_attention` function above only works with a single query. In this section, you need to complete the `queries_attention` function so that it runs in parallel (using broadcasting) over `N` queries. The output should be the same as that of `slow_queries_attention`. Note that in this section $mask \in \{0, 1\}^{N \times M}$, where `mask[i]` is the mask vector of `Q[i]` over the `M` keys.

In [None]:
def slow_queries_attention(Q, K, V, mask):
    """Attention on many queries
    Arguments:
        Q: torch.Tensor shape (N, d_k)
        K: torch.Tensor shape (M, d_k)
        V: torch.Tensor shape (M, d_v)
        mask: torch.BoolTensor shape (N, M)
        mask[i, j] = True means K[j] was masked for Q[i]

    Return:
        scaled-dot attention: torch.Tensor shape (N, d_v)
    """
    attentions = []
    for query, single_mask in zip(Q, mask):
        query_vector = query.unsqueeze(0)  # (d_k) -> (1, d_k)
        mask_vector = single_mask.unsqueeze(0) # (M) -> (1, M)
        attentions.append(scale_dot_product_attention(query_vector, K, V, mask_vector))
    attentions = torch.stack(attentions)  # (N, 1, d_v)
    attentions = attentions.squeeze(1)  # (N, d_v)
    return attentions

In [None]:
def queries_attention(Q, K, V, mask):
    """Attention on many queries
    Arguments:
        Q: torch.Tensor shape (N, d_k)
        K: torch.Tensor shape (M, d_k)
        V: torch.Tensor shape (M, d_v)
        mask: torch.BoolTensor shape (N, M)
        mask[i, j] = True means K[j] was masked for Q[i]

    Return:
        scaled-dot attention: torch.Tensor shape (N, d_v)
    """
    N, d_k = Q.shape
    ########### YOUR CODE HERE #################
    
    ###########################################
    return attentions

In [None]:
## Test
Q = torch.rand(N_queries, d_k)
K = torch.rand(M_keys, d_k)
V = torch.rand(M_keys, d_v)
mask = torch.randint(low=0, high=2, size=(N_queries, M_keys)).bool()

slow_attn =  slow_queries_attention(Q, K, V, mask)
parl_attn =  queries_attention(Q, K, V, mask)

assert parl_attn.shape == slow_attn.shape
assert torch.allclose(parl_attn, slow_attn)

def a1_generator():
    while True:
        Q = torch.rand(N_queries, d_k)
        K = torch.rand(M_keys, d_k)
        V = torch.rand(M_keys, d_v)
        mask = torch.randint(low=0, high=2, size=(N_queries, M_keys)).bool()
        yield (Q, K, V, mask)
        
generator = a1_generator()
benchmark(generator, slow_queries_attention, queries_attention)

Your outputs is correct
Timing of slow function:  AVG: 0.001767120361328125+-0.0003138023833246862
Timing of fast function:  AVG: 7.760763168334961e-05+-1.664969291348885e-05
Speedup: 22.769930263279164 times


## 2. Broadcasting with multi-heads (Multi-head attention)
![heads](https://raw.githubusercontent.com/Luvata/gifs/main/figures/transformer_heads.png)

As in the previous exercise, Multi-head Attention still calculates the scaled dot-product attention of `N` queries and `M` keys, but in parallel across multiple heads.

You need to complete the `heads_attention` function using broadcasting and the result should be the same as the `slow_heads_attention` function.

Suggestions:
   - You can use `transpose` to swap two axes of a tensor.
   - The same `mask` is applied to all heads.
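The two suggestions can be illustrated in isolation, without solving the exercise. In the sketch below, the sizes and the all-zero `scores` tensor are placeholders chosen only to make the effect visible.

```python
import torch

torch.manual_seed(0)
N, M, n_head, d_k = 3, 4, 2, 5  # small illustrative sizes (assumed)

# `transpose` swaps two axes: (N, n_head, d_k) -> (n_head, N, d_k)
Q = torch.rand(N, n_head, d_k)
Q_heads_first = Q.transpose(0, 1)

# A (N, M) mask broadcasts over a leading head axis in `masked_fill`
scores = torch.zeros(n_head, N, M)          # placeholder scores
mask = torch.zeros(N, M, dtype=torch.bool)
mask[:, -1] = True                          # mask the last key for every query
masked = scores.masked_fill(mask, float("-inf"))

print(Q_heads_first.shape)
print(masked[0])
print(masked[1])
```

Because the mask has no head axis, broadcasting reuses it for each head, which is exactly the "same mask for all heads" behaviour described above.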

In [None]:
def slow_heads_attention(Q, K, V, mask):
    """Slow Attention on many queries and many heads
    Arguments:
        Q: torch.Tensor shape (N, n_head, d_k)
        K: torch.Tensor shape (M, n_head, d_k)
        V: torch.Tensor shape (M, n_head, d_v)
        mask: torch.BoolTensor shape (N, M)
        where mask[i, j] = True means K[j] was masked for Q[i]

    Return:
        scaled-dot attention: torch.Tensor shape (N, n_head, d_v)
    """
    N, n_head, d_k = Q.shape
    attentions = []

    for i in range(n_head):
        queries = Q[:, i, :]  # (N, d_k)
        keys = K[:, i, :]  # (M, d_k)
        values = V[:, i, :]  # (M, d_v)
        attentions.append(slow_queries_attention(queries, keys, values, mask)) # Apply the same mask for all heads

    attentions = torch.stack(attentions)  # (n_head, N, d_v)
    attentions = torch.transpose(attentions, 0, 1)  # (N, n_head, d_v)

    return attentions

In [None]:
def heads_attention(Q, K, V, mask):
    """Attention on many queries and many heads
    Arguments:
        Q: torch.Tensor shape (N, n_head, d_k)
        K: torch.Tensor shape (M, n_head, d_k)
        V: torch.Tensor shape (M, n_head, d_v)
        mask: torch.BoolTensor shape (N, M)
        where mask[i, j] = True means K[j] was masked for Q[i]
        
    Return:
        scaled-dot attention: torch.Tensor shape (N, n_head, d_v)
    """
    N, n_head, d_k = Q.shape 
    M, n_head, d_k = K.shape 
    ########### YOUR CODE HERE #################
    
    ###########################################
    return attentions

In [None]:
Q = torch.rand(N_queries, n_head, d_k)
K = torch.rand(M_keys, n_head, d_k)
V = torch.rand(M_keys, n_head, d_v)
mask = torch.randint(low=0, high=2, size=(N_queries, M_keys)).bool()

slow_attn = slow_heads_attention(Q, K, V, mask)
parl_attn = heads_attention(Q, K, V, mask)

assert parl_attn.shape == slow_attn.shape
assert torch.allclose(parl_attn, slow_attn)

def a2_generator():
    while True:
        Q = torch.rand(N_queries, n_head, d_k)
        K = torch.rand(M_keys, n_head, d_k)
        V = torch.rand(M_keys, n_head, d_v)
        mask = torch.randint(low=0, high=2, size=(N_queries, M_keys)).bool()
        
        yield (Q, K, V, mask)
        
generator = a2_generator()
benchmark(generator, slow_heads_attention, heads_attention)

Your outputs is correct
Timing of slow function:  AVG: 0.013750030994415283+-0.0009599313956945842
Timing of fast function:  AVG: 0.0003452897071838379+-5.010303190142049e-05
Speedup: 39.82172276885897 times


## 3. Batch broadcasting

In the previous exercise, you completed Multi-head Attention for a single sequence! To make the most of parallel computing, in this exercise you will build Multi-head Attention using broadcasting over multiple (Q, K, V) inputs in a batch.

You need to complete the `multi_head_attention` function using broadcasting; the output should be the same as the output of `slow_multi_head_attention`.
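As a hint on how broadcasting behaves with extra leading axes, the sketch below shows that `@` treats all but the last two axes as batch axes, and that `unsqueeze` can insert a size-1 axis so a `(B, N, M)` mask lines up with a 4-dimensional score tensor. The tensor names and sizes are illustrative assumptions, and this is not a complete solution to the exercise.

```python
import torch

torch.manual_seed(0)
B, n_head, N, M, d_k = 2, 3, 4, 5, 6  # small illustrative sizes (assumed)

X = torch.rand(B, n_head, N, d_k)
Y = torch.rand(B, n_head, M, d_k)

# `@` multiplies the last two axes and batches over the leading ones
scores = X @ Y.transpose(-1, -2)  # (B, n_head, N, M)

# A (B, N, M) mask gets a size-1 head axis so it broadcasts over heads
mask = torch.rand(B, N, M) > 0.5
mask_4d = mask.unsqueeze(1)       # (B, 1, N, M)
masked = scores.masked_fill(mask_4d, float("-inf"))

print(scores.shape)
print(mask_4d.shape)
```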

In [None]:
def slow_multi_head_attention(Q, K, V, mask):
    """Multi-head attention on a batch of Q, K, V
    Arguments:
        Q: torch.Tensor shape (B, N, n_head, d_k)
        K: torch.Tensor shape (B, M, n_head, d_k)
        V: torch.Tensor shape (B, M, n_head, d_v)
        mask: torch.BoolTensor shape (B, N, M)
        where mask[i] is `mask` for attention of record i: (Q[i], K[i], V[i])

    Return:
        scaled-dot attention: torch.Tensor shape (B, N, n_head, d_v)
    """
    B, N, n_head, d_k = Q.shape

    attentions = []
    for single_Q, single_K, single_V, single_mask in zip(Q, K, V, mask):
        # single_Q: (N, n_head, d_k), single_K: (M, n_head, d_k)
        # single_V: (M, n_head, d_v)
        # single_mask: (N, M)
        attention = slow_heads_attention(single_Q, single_K, single_V, single_mask)
        attentions.append(attention)

    attentions = torch.stack(attentions)  # (B, N, n_head, d_v)
    return attentions

In [None]:
def multi_head_attention(Q, K, V, mask):
    """Multi-head attention on a batch of Q, K, V
    Arguments:
        Q: torch.Tensor shape (B, N, n_head, d_k)
        K: torch.Tensor shape (B, M, n_head, d_k)
        V: torch.Tensor shape (B, M, n_head, d_v)
        mask: torch.BoolTensor shape (B, N, M)
        where mask[i] is `mask` for attention of record i: (Q[i], K[i], V[i])

    Return:
        scaled-dot attention: torch.Tensor shape (B, N, n_head, d_v)
    """
    B, N, n_head, d_k = Q.shape
    ########### YOUR CODE HERE #################
    
    ###########################################
    return attentions

In [None]:
Q = torch.rand(batch_size, N_queries, n_head, d_k)
K = torch.rand(batch_size, M_keys, n_head, d_k)
V = torch.rand(batch_size, M_keys, n_head, d_v)
mask = torch.randint(low=0, high=2, size=(batch_size, N_queries, M_keys)).bool()

slow_attn = slow_multi_head_attention(Q, K, V, mask)
parl_attn = multi_head_attention(Q, K, V, mask)

assert parl_attn.shape == slow_attn.shape
assert torch.allclose(slow_attn, parl_attn)

def a3_generator():
    while True:
        Q = torch.rand(batch_size, N_queries, n_head, d_k)
        K = torch.rand(batch_size, M_keys, n_head, d_k)
        V = torch.rand(batch_size, M_keys, n_head, d_v)
        mask = torch.randint(low=0, high=2, size=(batch_size, N_queries, M_keys)).bool()
        yield (Q, K, V, mask)

generator = a3_generator() # it's gonna take a while ...
benchmark(generator, slow_multi_head_attention, multi_head_attention)

Your outputs is correct
Timing of slow function:  AVG: 0.43443290710449217+-0.022188473290526433
Timing of fast function:  AVG: 0.008474128246307373+-0.0004484408840083568
Speedup: 51.26579330372981 times


## 4. `torch.einsum` and `einops.rearrange`

Tensor operations (e.g., `transpose`, `matmul`, `stack`, `view`, ...) are often hard to read on their own (imagine doing the exercises above without comments). In this exercise, you will familiarize yourself with einops's `rearrange` and PyTorch's `einsum`, both of which use notation based on the [Einstein summation convention](https://en.wikipedia.org/wiki/Einstein_notation): a tensor operation is expressed as a string naming the axes of the input and output tensor(s). Besides `torch.einsum`, this exercise also introduces the most popular function in `einops`, `einops.rearrange`.

Read more:
- Highly recommend: [Mat Kelcey : An illustrative einsum example](https://www.youtube.com/watch?v=SOaYrnQtd9g)
- [Writing a better code with pytorch and einops](http://einops.rocks/pytorch-examples.html)

`einops.rearrange` and `torch.einsum` differ in how axes are named: `einops` supports axis names that are **strings** of multiple characters (e.g., `d_k`, `n_head`), while in `torch.einsum` each axis is named by **a single letter** (`h` stands for `n_head`).

`heads_attention` in Section 2 can be done with `einsum` & `rearrange` as follows:

In [None]:
def heads_attention_with_einops1(Q, K, V, mask):
    """Attention on many queries and many heads
    More explicit, since we also introduce rearrange
    
    Arguments:
        Q: torch.Tensor shape (N, n_head, d_k)
        K: torch.Tensor shape (M, n_head, d_k)
        V: torch.Tensor shape (M, n_head, d_v)
        mask: torch.BoolTensor shape (N, M)
        where mask[i, j] = True means K[j] was masked for Q[i]
        
    Return:
        scaled-dot attention: torch.Tensor shape (N, n_head, d_v)
    """
    N, n_head, d_k = Q.shape
    
    # Similar to transpose/permute, but more expressive
    Q = rearrange(Q, "N n_head d_k -> n_head N d_k")
    K = rearrange(K, "M n_head d_k -> n_head M d_k")
    V = rearrange(V, "M n_head d_v -> n_head M d_v")
    
    similar_score = torch.einsum('hnd,hmd->hnm', Q, K) / (d_k ** 0.5) # Keep dimension h, reduce on `d`
    similar_score = similar_score.masked_fill(mask, value=float("-inf"))
    
    attention_weight = F.softmax(similar_score, dim=-1)
    attentions = torch.einsum('hnm,hmd->hnd', attention_weight, V)

    # The pattern itself documents the shapes, so no shape comment is needed
    attentions = rearrange(attentions, 'n_head N d_v -> N n_head d_v')
    return attentions

generator = a2_generator()
benchmark(generator, slow_heads_attention, heads_attention_with_einops1)

Your outputs is correct
Timing of slow function:  AVG: 0.0136411452293396+-0.0009147664697613982
Timing of fast function:  AVG: 0.0005348515510559082+-0.0003580011645660165
Speedup: 25.504544583275756 times


In the example above:
- `Q = rearrange(Q, "N n_head d_k -> n_head N d_k")` is equivalent to `Q.transpose(0, 1)`.
- The string `hnd,hmd->hnm` describes a calculation on 2 tensors (in our case, `Q` and `K`).
- `Q` has the shape `(n_head N d_k)`, abbreviated as `hnd`
- `K` has the shape `(n_head M d_k)`, abbreviated as `hmd`
- `similar_score` is calculated by doing dot-product on the `d_k` axis, resulting in a tensor of shape `(n_head N M)` abbreviated as `hnm` and located to the right of the `->`
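The reading of `hnd,hmd->hnm` given in the list above can be checked against a plain batched matrix multiplication and against an explicit sum over `d` for one entry. The sizes and the chosen indices are arbitrary and only for illustration.

```python
import torch

torch.manual_seed(0)
n_head, N, M, d_k = 2, 3, 4, 5  # small illustrative sizes (assumed)
Q = torch.rand(n_head, N, d_k)  # axes: h n d
K = torch.rand(n_head, M, d_k)  # axes: h m d

# `d` is missing from the output pattern, so it is summed over
via_einsum = torch.einsum("hnd,hmd->hnm", Q, K)

# Same computation with a batched matmul
via_matmul = Q @ K.transpose(-1, -2)  # (n_head, N, M)

# One output entry written as an explicit dot product over d
h, n, m = 1, 2, 3
manual = sum(Q[h, n, d] * K[h, m, d] for d in range(d_k))

print(via_einsum.shape)
print(torch.allclose(via_einsum, via_matmul, atol=1e-6))
```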

You can also do everything with `einsum` alone, as follows (once you have mastered `einsum`):

In [None]:
def heads_attention_with_einops2(Q, K, V, mask):
    """Attention on many queries and many heads
    Arguments:
        Q: torch.Tensor shape (N, n_head, d_k)
        K: torch.Tensor shape (M, n_head, d_k)
        V: torch.Tensor shape (M, n_head, d_v)
        mask: torch.BoolTensor shape (N, M)
        where mask[i, j] = True means K[j] was masked for Q[i]
        
    Return:
        scaled-dot attention: torch.Tensor shape (N, n_head, d_v)
    """
    N, n_head, d_k = Q.shape
    similar_score = torch.einsum('nhd,mhd->hnm', Q, K) / (d_k ** 0.5)
    similar_score = similar_score.masked_fill(mask, value=float("-inf"))
    attention_weight = F.softmax(similar_score, dim=-1)
    attentions = torch.einsum('hnm, mhd->nhd', attention_weight, V)
    return attentions

generator = a2_generator()
benchmark(generator, slow_heads_attention, heads_attention_with_einops2)

Your outputs is correct
Timing of slow function:  AVG: 0.013489282131195069+-0.0006958524786590557
Timing of fast function:  AVG: 0.0003887104988098144+-4.368779148447676e-05
Speedup: 34.70264418506229 times


The speed of `einsum` and `rearrange` is on par with the broadcasting implementation.

Now it's your turn to complete the `multi_head_attention_einops` function using `einsum` and `rearrange`, like the examples above!

Hint: you only need to add one more axis `b` to the `einsum` / `rearrange` patterns!

In [None]:
def multi_head_attention_einops(Q, K, V, mask):
    """Multi-head attention on a batch of Q, K, V
    Arguments:
        Q: torch.Tensor shape (B, N, n_head, d_k)
        K: torch.Tensor shape (B, M, n_head, d_k)
        V: torch.Tensor shape (B, M, n_head, d_v)
        mask: torch.BoolTensor shape (B, N, M)
        where mask[i] is `mask` for attention of record i: (Q[i], K[i], V[i])
        
    Return:
        scaled-dot attention: torch.Tensor shape (B, N, n_head, d_v)
    """
    B, N, n_head, d_k = Q.shape
    ########### YOUR CODE HERE #################
    
    ###########################################
    return attentions

In [None]:
generator = a3_generator()
benchmark(generator, slow_multi_head_attention, multi_head_attention_einops)

Your outputs is correct
Timing of slow function:  AVG: 0.4410352683067322+-0.03237853759710465
Timing of fast function:  AVG: 0.008850009441375732+-0.00023571950844938092
Speedup: 49.83444042949782 times


Well done! Keep up the good work!

## References

- [Thomas Viehmann - visualize-jit-models](https://github.com/t-vi/pytorch-tvmisc/blob/master/hacks/visualize-jit-models.ipynb)
- [Sasha Rush - The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)
- [Andrej Karpathy - min-gpt](https://github.com/karpathy/minGPT/tree/master/mingpt)