##### Example 1

In [27]:
import math
import torch
import torch.nn.functional as F

$\operatorname{Attention}(Q, K, V)= Z =\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

Write a function that computes the attention and returns the output $Z$ and the attention weights.

In [28]:
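def scaled_dot_product_attention(Q, K, V, d_k):
    # Similarity scores between queries and keys: (n_q, d_k) x (d_k, n_k) -> (n_q, n_k)
    QK = torch.matmul(Q, K.transpose(-2, -1))

    # Scale by sqrt(d_k), as in the formula above
    matmul_scaled = QK / math.sqrt(d_k)

    # Normalize over the key axis so each query's weights sum to 1
    attention_weights = F.softmax(matmul_scaled, dim=-1)

    # Weighted sum of the value vectors
    output = torch.matmul(attention_weights, V)

    return output, attention_weights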

In [None]:
output, attention_weights = scaled_dot_product_attention(Q, K, V, d_k)
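A minimal sketch of how this call might look with concrete inputs. The random tensors `Q`, `K`, `V` and the sizes `n_q`, `n_k`, `d_k`, `d_v` are assumptions chosen for illustration; the notebook does not define them at this point. The sketch only checks the output shapes and that each row of the attention weights sums to 1.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, d_k):
    # Same computation as the function defined above
    scores = torch.matmul(Q, K.T) / math.sqrt(d_k)
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V), attention_weights

torch.manual_seed(0)
n_q, n_k, d_k, d_v = 2, 4, 3, 5   # hypothetical sizes, not from the notebook
Q = torch.randn(n_q, d_k)         # one row per query
K = torch.randn(n_k, d_k)         # one row per key
V = torch.randn(n_k, d_v)         # one row per value (paired with the keys)

output, attention_weights = scaled_dot_product_attention(Q, K, V, d_k)
print(output.shape)                   # (n_q, d_v)
print(attention_weights.shape)        # (n_q, n_k)
print(attention_weights.sum(dim=-1))  # each row sums to 1 (up to float error)
```

Note that `K` and `V` must have the same number of rows, while `Q` and `K` must share the feature dimension `d_k`.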

##### Example 2

In [29]:
temp_k = torch.Tensor([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]])  # (4, 3)

temp_v = torch.Tensor([[   1,0, 1],
                      [  10,0, 2],
                      [ 100,5, 0],
                      [1000,6, 0]])  # (4, 3)

In [30]:
temp_q = torch.Tensor([[0, 10, 0]])  # (1, 3)

`scaled_dot_product_attention` is the function that computes the scaled dot-product attention.

In [56]:
temp_q, temp_k, temp_v

(tensor([[ 0., 10.,  0.]]),
 tensor([[10.,  0.,  0.],
         [ 0., 10.,  0.],
         [ 0.,  0., 10.],
         [ 0.,  0., 10.]]),
 tensor([[   1.,    0.,    1.],
         [  10.,    0.,    2.],
         [ 100.,    5.,    0.],
         [1000.,    6.,    0.]]))

In [57]:
output, attention_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v, d_k=3)  # d_k is the key dimension: temp_k has shape (4, 3)

In [58]:
torch.round(attention_weights, decimals=3), attention_weights.shape

(tensor([[0., 1., 0., 0.]]), torch.Size([1, 4]))

Interpret the output of `attention_weights`

**Explain**
- The query vector is `temp_q`: `[ 0., 10.,  0.]`
- Attention compares `temp_q` with each of the `4` vectors in `temp_k` using the dot product
    + High value => similar
    + Low value => not similar
- `attention_weights` has shape `1 x 4`: its single row holds the similarity of `temp_q` to each vector in `temp_k`, normalized by the softmax
    + Positions `0`, `2` and `3` are (approximately) zero => not similar - (1)
    + Position `1` has the highest value => similar - (2)

**Conclusion**
- `temp_k[0]`, `temp_k[2]`, `temp_k[3]` are not similar to `temp_q` - (1)
- `temp_k[1]` is similar to `temp_q` - (2)
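The reasoning above can be checked by computing the intermediate steps by hand. This is a small sketch, assuming `d_k` is the key dimension (`3` here); the variable names `scores`, `scaled` and `weights` are introduced only for illustration. In this example the rounded weights do not depend on the exact value of `d_k`, because one score dominates all the others.

```python
import math
import torch
import torch.nn.functional as F

temp_q = torch.tensor([[0., 10., 0.]])   # (1, 3)
temp_k = torch.tensor([[10., 0., 0.],
                       [0., 10., 0.],
                       [0., 0., 10.],
                       [0., 0., 10.]])   # (4, 3)

d_k = temp_k.shape[-1]                   # assumption: d_k = key dimension = 3

scores = temp_q @ temp_k.T               # dot products: [0, 100, 0, 0]
scaled = scores / math.sqrt(d_k)         # only position 1 is non-zero
weights = F.softmax(scaled, dim=-1)      # almost all mass on position 1

print(scores)
print(torch.round(weights, decimals=3))
print(weights.argmax(dim=-1))            # index of the most similar key
```

Only `temp_k[1]` has a non-zero dot product with `temp_q`, so after the softmax practically all of the weight goes to position `1`.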

##### Example 3

In [43]:
torch.round(output, decimals=3)

tensor([[10.,  0.,  2.]])

In [35]:
output.shape

torch.Size([1, 3])
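The output is a weighted sum of the rows of `temp_v`, so with all the weight on position `1` it equals `temp_v[1]`. The sketch below repeats that case and adds a second, hypothetical query `temp_q2` (not part of the notebook) that matches two keys equally, to show how the values get averaged. It assumes `d_k = 3`, the key dimension.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, d_k):
    # Same computation as the notebook's function
    weights = F.softmax(torch.matmul(Q, K.T) / math.sqrt(d_k), dim=-1)
    return torch.matmul(weights, V), weights

temp_k = torch.tensor([[10., 0., 0.], [0., 10., 0.], [0., 0., 10.], [0., 0., 10.]])
temp_v = torch.tensor([[1., 0., 1.], [10., 0., 2.], [100., 5., 0.], [1000., 6., 0.]])

# Query from the notebook: matches only temp_k[1], so the output is temp_v[1]
temp_q = torch.tensor([[0., 10., 0.]])
out_one, w_one = scaled_dot_product_attention(temp_q, temp_k, temp_v, d_k=3)
print(torch.round(out_one, decimals=3))

# Hypothetical query: matches temp_k[2] and temp_k[3] equally,
# so the output is the average of temp_v[2] and temp_v[3]
temp_q2 = torch.tensor([[0., 0., 10.]])
out_two, w_two = scaled_dot_product_attention(temp_q2, temp_k, temp_v, d_k=3)
print(torch.round(w_two, decimals=3))
print(torch.round(out_two, decimals=3))
```

In the second case the weights split evenly between positions `2` and `3`, so the output is halfway between the two corresponding value vectors.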

In [22]:
n_words = 4
d_h = 5

https://youtu.be/1BFE1Tfs8tM?t=1928