# Illustrated: Self-Attention
Step-by-step guide to self-attention with illustrations and code

[Medium article](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a)


> Colab made by: [Manuel Romero](https://twitter.com/mrm8488)

References

0. [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
1. [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
2. [Transformer model analysis (Self-Attention)](https://wdprogrammer.tistory.com/72)
3. [Attention mechanism and transformer (self-attention)](https://medium.com/platfarm/%EC%96%B4%ED%85%90%EC%85%98-%EB%A9%94%EC%BB%A4%EB%8B%88%EC%A6%98%EA%B3%BC-transfomer-self-attention-842498fd3225)
4. [Why multi-head self attention works: math, intuitions and 10+1 hidden insights](https://theaisummer.com/self-attention/)



![Illustrated self-attention](https://miro.medium.com/max/1973/1*_92bnsMJy8Bl539G4v93yg.gif)

What do *BERT, RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, MobileBERT, TinyBERT and CamemBERT* all have in common? And I’m not looking for the answer “BERT” 🤭.
Answer: **self-attention** 🤗. We are not only talking about architectures bearing the name “BERT”, but more correctly about **Transformer-based architectures**. Transformer-based architectures, which are primarily used for modelling language understanding tasks, eschew recurrent neural networks (RNNs) and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs. But what’s the math behind this?

The main goal of this notebook is to walk you through the mathematical operations involved in a self-attention module.

### Step 0. What is self-attention?

If you’re wondering whether self-attention is similar to attention, the answer is yes! They fundamentally share the same concept and many common mathematical operations.
A self-attention module takes in n inputs, and returns n outputs. What happens in this module? In layman’s terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

<img src= 'https://pbs.twimg.com/media/Et1JSjMVEAAGgLT?format=png&name=900x900'>



### Self-Attention at a High Level

Say the following sentence is an input sentence we want to translate:

“The animal didn't cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

<img src='http://jalammar.github.io/images/t/transformer_self-attention_visualization.png'>

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".


### Self-Attention in Detail


**The first step** in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

<img src='http://jalammar.github.io/images/t/transformer_self_attention_vectors.png'>

  - Multiplying $x_1$ by the $W^Q$ weight matrix produces $q_1$, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
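As a minimal sketch of this first step, the snippet below projects a single word embedding into its query, key and value vectors. It uses NumPy rather than TensorFlow to stay dependency-light. The toy sizes (`d_model = 4`, `d_k = 3`), the random matrices standing in for trained weights, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

d_model, d_k = 4, 3  # assumed toy sizes (the paper uses 512 and 64)

# Random stand-ins for the trained projection matrices W^Q, W^K, W^V.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

x1 = rng.normal(size=d_model)  # stand-in embedding of the first word

# Each projection is a vector-matrix product: (d_model,) @ (d_model, d_k) -> (d_k,)
q1 = x1 @ W_Q
k1 = x1 @ W_K
v1 = x1 @ W_V

print(q1.shape, k1.shape, v1.shape)
```

The same three matrices are reused for every word in the sentence; only the embedding changes.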

**The second step** in calculating self-attention is to calculate a score. 

Say we’re calculating the self-attention for the first word in this example, “Thinking”.

<img src='http://jalammar.github.io/images/t/transformer_self_attention_score.png'>

  -  We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

__The third and fourth steps__ are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64), then pass the result through a softmax operation. Other scaling values are possible, but this is the default. Softmax normalizes the scores so they’re all positive and add up to 1.

<img src= 'http://jalammar.github.io/images/t/self-attention_softmax.png'>
  
  - This softmax score determines how much each word will be expressed at this position. The word at this position will often have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.
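One way to see why the scores are divided by $\sqrt{d_k}$: if the components of a query and a key are independent with mean 0 and variance 1 (the assumption made in the paper), their dot product has variance $d_k$, so raw scores grow with the dimension. The sketch below checks this empirically with NumPy. The sample count and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64            # key/query dimension used in the paper
n_samples = 10_000  # assumed sample count for the estimate

# Queries and keys with independent, zero-mean, unit-variance components.
q = rng.normal(size=(n_samples, d_k))
k = rng.normal(size=(n_samples, d_k))

raw_scores = np.sum(q * k, axis=1)         # one dot product per (q, k) pair
scaled_scores = raw_scores / np.sqrt(d_k)  # divide by sqrt(64) = 8

raw_std = raw_scores.std()        # close to 8
scaled_std = scaled_scores.std()  # close to 1
print(round(float(raw_std), 2), round(float(scaled_std), 2))
```

Keeping the scores at roughly unit scale stops the softmax from saturating on a single position purely because the vectors are long.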

__The fifth step__ is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

__The sixth step__ is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

<img src='http://jalammar.github.io/images/t/self-attention-output.png'>
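The fifth and sixth steps amount to a weighted sum of value vectors. The NumPy sketch below works through them for the first word of a two-word sentence. The softmax weights and value vectors are hypothetical numbers picked only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical softmax weights of the first word over a two-word sentence.
weights = np.array([0.88, 0.12])

# Hypothetical value vectors v1 and v2, one row per word.
values = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

# Step 5: scale each value vector by its softmax weight.
weighted = weights[:, None] * values

# Step 6: sum the weighted value vectors to get the output for word 1.
z1 = weighted.sum(axis=0)
print(z1)
```

Because the weights sum to 1, the output is a convex combination of the value vectors and stays within their range.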



### The Math: __"Self-attention as two matrix multiplications"__

Given our well-known inputs:

$$\textbf{X} \in \mathbb{R}^{batch \times tokens \times dim}$$
 
And trainable weight matrices (with $dim = d_{\text{model}}$):

$$\textbf{W}^{Q}, \textbf{W}^{K}, \textbf{W}^{V} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}$$

__First Step__

We create 3 distinct representations (the query, the key, and the value):

$$\textbf{Q} = \textbf{X} \textbf{W}^Q, \textbf{K} = \textbf{X} \textbf{W}^K, \textbf{V} = \textbf{X} \textbf{W}^V$$ 

<img src='http://jalammar.github.io/images/t/self-attention-matrix-calculation.png'>


__Second Step__

Then, we can define content-based (self) attention as:

$$\operatorname{Attention}(\textbf{Q}, \textbf{K}, \textbf{V})=\operatorname{softmax}\left(\frac{\textbf{Q} \textbf{K}^{T}}{\sqrt{d_{k}}}\right) \textbf{V}$$

Where the attention scoring function is calculated using the dot-product similarity and is happening right here:

$$\operatorname{Dot-scores} = \left(\frac{\textbf{Q} \textbf{K}^{T}}{\sqrt{d_{k}}}\right)$$

<img src='http://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png'>
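The two formulas above can be sketched in a few lines of NumPy, including the batch dimension of $\textbf{X}$. This is a minimal sketch under stated assumptions, not a reference implementation: the function names, the toy sizes and the random weights are all made up for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Softmax along `axis`, shifted by the max for numerical stability."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for X of shape (batch, tokens, d_model)."""
    # First step: project the inputs into queries, keys and values.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    # Second step: scaled dot-product scores, shape (batch, tokens, tokens).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # output shape (batch, tokens, d_k)

rng = np.random.default_rng(0)
batch, tokens, d_model, d_k = 2, 3, 4, 5  # assumed toy sizes
X = rng.normal(size=(batch, tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape, weights.shape)
```

The module returns one output vector per input token, and the attention weights form a `tokens × tokens` matrix for each item in the batch.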

#### The Query-Key matrix multiplication

Content-based attention uses three distinct representations, which can be read through a database analogy:

  - The query matrix in the attention layer is conceptually the “search” in the database. 
  - The keys will account for where we will be looking while the values will actually give us the desired content. 
  - Consider the keys and values as components of our database.

Intuitively, __the keys are the bridge between the queries (what we are looking for) and the values (what we will actually get).__

Keep in mind that each vector-to-vector multiplication is a dot-product similarity. __We can use the keys to guide our “search” and tell us where to look with respect to the input elements.__

In other words, the keys determine how the values are weighted for a particular query.
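The database analogy can be made concrete as a "soft" lookup: instead of returning the value of the one matching key, attention returns a blend of all values, weighted by how similar each key is to the query. The NumPy sketch below uses a hypothetical three-entry database and a query that points strongly at the second key.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# A tiny "database": one key and one value per row (hypothetical numbers).
keys = np.eye(3)
values = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# A query that points strongly in the direction of the second key.
query = np.array([0.0, 10.0, 0.0])

scores = keys @ query      # dot-product similarity with every key
weights = softmax(scores)  # almost all weight lands on the second key
result = weights @ values  # therefore close to the second value

print(np.round(weights, 3), np.round(result, 3))
```

When the similarity scores are less sharply peaked, the result becomes a genuine mixture of several values, which is what lets a word borrow context from more than one other word.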


Next, we are going to explain and implement the following steps:
1. Prepare inputs
2. Initialise weights
3. Derive key, query and value
4. Calculate attention scores for Input 1
5. Calculate softmax
6. Multiply scores with values
7. Sum weighted values to get Output 1
8. Repeat steps 4–7 for Input 2 & Input 3

In [None]:
%tensorflow_version 2.x
import tensorflow as tf
import numpy as np

### Step 1: Prepare inputs

For this tutorial, for the sake of simplicity, we start with 3 inputs, each with dimension 4.

![Step 1: prepare inputs](https://miro.medium.com/max/1973/1*hmvdDXrxhJsGhOQClQdkBA.png)


In [None]:
x = [
  [1, 0, 1, 0], # Input 1
  [0, 2, 0, 2], # Input 2
  [1, 1, 1, 1]  # Input 3
 ]

x = tf.constant(x, dtype=tf.float32)
print(x)

tf.Tensor(
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]], shape=(3, 4), dtype=float32)


### Step 2: Initialise weights

Every input must have three representations (see diagram below). These representations are called **key** (orange), **query** (red), and **value** (purple). For this example, let’s say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.

![Step 2: initialise weights](https://miro.medium.com/max/1975/1*VPvXYMGjv0kRuoYqgFvCag.gif)

In [None]:
w_key = [
  [0, 0, 1],
  [1, 1, 0],
  [0, 1, 0],
  [1, 1, 0]
]
w_query = [
  [1, 0, 1],
  [1, 0, 0],
  [0, 0, 1],
  [0, 1, 1]
]
w_value = [
  [0, 2, 0],
  [0, 3, 0],
  [1, 0, 3],
  [1, 1, 0]
]
w_key = tf.Variable(w_key, dtype=tf.float32)
w_query = tf.Variable(w_query, dtype=tf.float32)
w_value = tf.Variable(w_value, dtype=tf.float32)

print("Weights for key: \n", w_key)
print("Weights for query: \n", w_query)
print("Weights for value: \n", w_value)

Weights for key: 
 <tf.Variable 'Variable:0' shape=(4, 3) dtype=float32, numpy=
array([[0., 0., 1.],
       [1., 1., 0.],
       [0., 1., 0.],
       [1., 1., 0.]], dtype=float32)>
Weights for query: 
 <tf.Variable 'Variable:0' shape=(4, 3) dtype=float32, numpy=
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 1.]], dtype=float32)>
Weights for value: 
 <tf.Variable 'Variable:0' shape=(4, 3) dtype=float32, numpy=
array([[0., 2., 0.],
       [0., 3., 0.],
       [1., 0., 3.],
       [1., 1., 0.]], dtype=float32)>


Note: *In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate scheme such as Gaussian, Xavier or Kaiming initialisation.*
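As an illustration of such a random initialisation, the NumPy sketch below draws 4×3 weight matrices with the Xavier (Glorot) uniform scheme, which samples from $U(-l, l)$ with $l = \sqrt{6 / (\text{fan}_{in} + \text{fan}_{out})}$. The helper name `glorot_uniform` and the `*_init` variable names are assumptions; this is not how the hand-picked weights above were produced.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # input dimension and representation dimension from this step

w_key_init = glorot_uniform(d_in, d_out, rng)
w_query_init = glorot_uniform(d_in, d_out, rng)
w_value_init = glorot_uniform(d_in, d_out, rng)

print(w_key_init.shape, w_query_init.shape, w_value_init.shape)
```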

### Step 3: Derive key, query and value

Now that we have the three sets of weights, let’s actually obtain the **key**, **query** and **value** representations for every input.

Obtaining the keys:
```
               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]
```
![Deriving the keys](https://miro.medium.com/max/1975/1*dr6NIaTfTxEWzxB2rc0JWg.gif)

Obtaining the values:
```
               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3] 
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]
```
![Deriving the values](https://miro.medium.com/max/1975/1*5kqW7yEwvcC0tjDOW3Ia-A.gif)


Obtaining the queries:
```
               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
```
![Deriving the queries](https://miro.medium.com/max/1975/1*wO_UqfkWkv3WmGQVHvrMJw.gif)

Note: *In practice, a bias vector may be added to the product of the matrix multiplication.*
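A short NumPy sketch of what adding a bias looks like for the keys. The bias vector `b_key` is a hypothetical value chosen for illustration; the tutorial itself uses no bias.

```python
import numpy as np

x = np.array([[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]], dtype=np.float32)
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]], dtype=np.float32)

# Hypothetical bias with one entry per key dimension.
b_key = np.array([1, 0, -1], dtype=np.float32)

# The bias is broadcast, so the same vector is added to every input's key.
keys_with_bias = x @ w_key + b_key
print(keys_with_bias)
```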

In [None]:
keys = tf.matmul(x, w_key)
querys = tf.matmul(x, w_query)
values = tf.matmul(x, w_value)
print(keys)
print(querys)
print(values)

tf.Tensor(
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]], shape=(3, 3), dtype=float32)
tf.Tensor(
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]], shape=(3, 3), dtype=float32)
tf.Tensor(
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]], shape=(3, 3), dtype=float32)


### Step 4: Calculate attention scores
![Step 4: calculate attention scores](https://miro.medium.com/max/1973/1*u27nhUppoWYIGkRDmYFN2A.gif)

To obtain **attention scores**, we start off by taking the dot product of Input 1’s **query** (red) with **all keys** (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

```
            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]
```
Notice that we only use the query from Input 1. Later we’ll work on repeating this same step for the other queries.

Note: *The above operation is known as dot-product attention, one of several score functions. Other score functions include scaled dot-product (the variant in the attention formula earlier in this notebook) and additive/concat.*
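To show how the scaled dot-product variant differs on this tutorial's numbers, the NumPy sketch below divides the same scores by $\sqrt{d_k}$ with $d_k = 3$ and compares the resulting softmax weights. The `softmax` helper and variable names are assumptions made for illustration; the rest of the tutorial continues with the unscaled scores.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Queries and keys derived in Step 3.
queries = np.array([[1, 0, 2], [2, 2, 2], [2, 1, 3]], dtype=np.float64)
keys = np.array([[0, 1, 1], [4, 4, 0], [2, 3, 1]], dtype=np.float64)

scores = queries @ keys.T                         # plain dot-product scores
scaled_scores = scores / np.sqrt(keys.shape[-1])  # scaled dot-product, d_k = 3

plain_weights = softmax(scores)
scaled_weights = softmax(scaled_scores)
print(np.round(plain_weights, 2))
print(np.round(scaled_weights, 2))
```

Scaling shrinks the gaps between scores, so each row of softmax weights is less sharply peaked than its unscaled counterpart.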

In [None]:
# Scores for all three queries at once: row i holds query i scored against every key
attn_scores = tf.matmul(querys, keys, transpose_b=True)
print(attn_scores)

tf.Tensor(
[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]], shape=(3, 3), dtype=float32)


### Step 5: Calculate softmax
![Step 5: calculate softmax](https://miro.medium.com/max/1973/1*jf__2D8RNCzefwS0TP1Kyg.gif)

Take the **softmax** across these **attention scores** (blue). The result below is rounded for readability:
```
softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]
```

In [None]:
attn_scores_softmax = tf.nn.softmax(attn_scores, axis=-1)
print(np.around(attn_scores_softmax,2))

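# For readability, approximate the weights above by hand so the worked arithmetic in Steps 6 and 7 stays simple (the final outputs are therefore approximate too)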
attn_scores_softmax = [
  [0.0, 0.5, 0.5],
  [0.0, 1.0, 0.0],
  [0.0, 0.9, 0.1]
]
attn_scores_softmax = tf.Variable(attn_scores_softmax)
print(attn_scores_softmax)

[[0.06 0.47 0.47]
 [0.   0.98 0.02]
 [0.   0.88 0.12]]
<tf.Variable 'Variable:0' shape=(3, 3) dtype=float32, numpy=
array([[0. , 0.5, 0.5],
       [0. , 1. , 0. ],
       [0. , 0.9, 0.1]], dtype=float32)>
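For reference, a row-wise softmax can be written by hand in NumPy. The sketch below subtracts each row's maximum before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large scores. The function name `stable_softmax` and the extra test arrays are assumptions made for illustration.

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax that subtracts each row's max before exponentiating."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Attention scores from Step 4.
attn_scores = np.array([[2, 4, 4], [4, 16, 12], [4, 12, 10]], dtype=np.float64)
weights = stable_softmax(attn_scores)
print(np.round(weights, 2))

# Shifting every score in a row by the same constant does not change the
# softmax, so scores that would overflow a naive np.exp are handled safely.
large = np.array([[1000.0, 1001.0, 1002.0]])
small = np.array([[0.0, 1.0, 2.0]])
same = bool(np.allclose(stable_softmax(large), stable_softmax(small)))
print(same)
```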


### Step 6: Multiply scores with values
![Step 6: multiply scores with values](https://miro.medium.com/max/1973/1*9cTaJGgXPbiJ4AOCc6QHyA.gif)

The softmaxed attention scores for each input (blue) are multiplied with the corresponding **value** (purple). This results in 3 alignment vectors (yellow). In this tutorial, we’ll refer to them as **weighted values**.
```
1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]
``` 

In [None]:
# Broadcast so that weighted_values[j, i] = attn_scores_softmax[i, j] * values[j]
weighted_values = values[:,None] * tf.transpose(attn_scores_softmax)[:,:,None]
print(weighted_values)

tf.Tensor(
[[[0.  0.  0. ]
  [0.  0.  0. ]
  [0.  0.  0. ]]

 [[1.  4.  0. ]
  [2.  8.  0. ]
  [1.8 7.2 0. ]]

 [[1.  3.  1.5]
  [0.  0.  0. ]
  [0.2 0.6 0.3]]], shape=(3, 3, 3), dtype=float32)


### Step 7: Sum weighted values
![Step 7: sum weighted values](https://miro.medium.com/max/1973/1*1je5TwhVAwwnIeDFvww3ew.gif)

Take all the **weighted values** (yellow) and sum them element-wise:

```
  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]
```

The resulting vector ```[2.0, 7.0, 1.5]``` (dark green) **is Output 1**, which is based on the **query representation from Input 1** interacting with all keys, including its own.


### Step 8: Repeat for Input 2 & Input 3
![Step 8: repeat for Input 2 and Input 3](https://miro.medium.com/max/1973/1*G8thyDVqeD8WHim_QzjvFg.gif)

Note: *The dimension of **query** and **key** must always be the same because of the dot product score function. However, the dimension of **value** may be different from **query** and **key**. The resulting output will consequently follow the dimension of **value**.*
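The note on dimensions can be checked with a quick NumPy sketch in which queries and keys share a dimension of 3 while values use a dimension of 2. The sizes, the random weights and the variable names are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_inputs, d_model = 3, 4
d_k, d_v = 3, 2  # assumed sizes: queries/keys share d_k, values use a different d_v

x = rng.normal(size=(n_inputs, d_model))
w_query = rng.normal(size=(d_model, d_k))
w_key = rng.normal(size=(d_model, d_k))
w_value = rng.normal(size=(d_model, d_v))

# Query-key dot products need matching dimensions: (3, d_k) @ (d_k, 3) -> (3, 3)
weights = softmax((x @ w_query) @ (x @ w_key).T)

# The output inherits the value dimension: (3, 3) @ (3, d_v) -> (3, d_v)
outputs = weights @ (x @ w_value)
print(weights.shape, outputs.shape)
```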

In [None]:
# Sum over the first axis (the inputs being attended to) to get one output per query
outputs = tf.reduce_sum(weighted_values, axis=0)
print(outputs)

# Expected, up to floating-point rounding:
# [[2.0, 7.0, 1.5],   # Output 1
#  [2.0, 8.0, 0.0],   # Output 2
#  [2.0, 7.8, 0.3]]   # Output 3

tf.Tensor(
[[2.        7.        1.5      ]
 [2.        8.        0.       ]
 [2.        7.7999997 0.3      ]], shape=(3, 3), dtype=float32)
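The whole walkthrough can also be condensed into a few matrix operations. The NumPy sketch below reruns Steps 1 to 8 with the exact softmax weights instead of the hand-rounded ones, so its outputs differ slightly from the rounded results printed above. The `softmax` helper and the `outputs_exact`, `outputs_rounded` and `max_gap` names are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Inputs and weights from Steps 1 and 2.
x = np.array([[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]], dtype=np.float64)
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]], dtype=np.float64)
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]], dtype=np.float64)
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]], dtype=np.float64)

# Step 3: derive key, query and value representations.
keys, queries, values = x @ w_key, x @ w_query, x @ w_value

# Steps 4 and 5: unscaled dot-product scores, then exact softmax (no rounding).
weights = softmax(queries @ keys.T)

# Steps 6 to 8: multiplying and summing is a single matrix product.
outputs_exact = weights @ values
print(np.round(outputs_exact, 4))

# Outputs obtained above with the hand-rounded softmax weights.
outputs_rounded = np.array([[2.0, 7.0, 1.5], [2.0, 8.0, 0.0], [2.0, 7.8, 0.3]])
max_gap = np.abs(outputs_exact - outputs_rounded).max()
print(round(float(max_gap), 3))
```

The gap between the two sets of outputs comes entirely from rounding the softmax weights in Step 5.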
