# Self-Attention

Self-attention, also known as intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence. It has proved useful in machine reading, abstractive summarization, and image description generation.

The <a href="https://arxiv.org/pdf/1601.06733.pdf" target="_blank">Long Short-Term Memory-Networks for Machine Reading</a> paper used self-attention for machine reading. In the example below, the self-attention mechanism lets the model learn the correlation between the current word and the preceding part of the sentence.

![title](img/sa.png)

### What is self-attention?

> **Self-attention is similar to attention**

A self-attention module takes in n inputs, and returns n outputs. What happens in this module? In layman’s terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.



### Mathematical operations involved in a self-attention module

#### Step By Step

1. Prepare inputs
2. Initialise weights
3. Derive key, query and value
4. Calculate attention scores for Input 1
5. Calculate softmax
6. Multiply scores with values
7. Sum weighted values to get Output 1
8. Repeat steps 4–7 for Input 2 & Input 3

#### 1: Prepare inputs

In this example, we start with 3 inputs, each with dimension 4.

![title](img/sa2.png)


    Input 1: [1, 0, 1, 0] 
    Input 2: [0, 2, 0, 2]
    Input 3: [1, 1, 1, 1]

#### 2: Initialise weights

Every input must have three representations (see diagram below). These representations are called key (orange), query (red), and value (purple). For this example, let’s say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.


To obtain these representations, every input (green) is multiplied by a set of weights for keys, a set of weights for queries, and a set of weights for values.


In our example, we initialise the three sets of weights as follows.

Weights for key:

    [[0, 0, 1],
     [1, 1, 0],
     [0, 1, 0],
     [1, 1, 0]]

Weights for query:

    [[1, 0, 1],
     [1, 0, 0],
     [0, 0, 1],
     [0, 1, 1]]

Weights for value:

    [[0, 2, 0],
     [0, 3, 0],
     [1, 0, 3],
     [1, 1, 0]]
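
The inputs and weights above can be written down as arrays. This is a minimal sketch that assumes NumPy (the walkthrough itself does not name a library); the variable names `x`, `w_key`, `w_query` and `w_value` are chosen here for illustration.

```python
import numpy as np

# The three inputs from step 1, stacked as rows: shape (3, 4).
x = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]])

# The hand-picked weights from this step; each maps dimension 4 to dimension 3.
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

# Every weight matrix has shape (input dimension, representation dimension).
print(x.shape, w_key.shape, w_query.shape, w_value.shape)
```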


#### 3: Derive key, query and value

Now that we have the three sets of weights, let’s actually obtain the key, query and value representations for every input.

Key representation for Input 1:

                   [0, 0, 1]
    [1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
                   [0, 1, 0]
                   [1, 1, 0]

Use the same set of weights to get the key representation for Input 2:

                   [0, 0, 1]
    [0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
                   [0, 1, 0]
                   [1, 1, 0]

Use the same set of weights to get the key representation for Input 3:

                   [0, 0, 1]
    [1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
                   [0, 1, 0]
                   [1, 1, 0]

A faster way is to vectorise the above operations:

                   [0, 0, 1]
    [1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
    [0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
    [1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

![title](img/sa3.gif)
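
To see that the vectorised product gives the same keys as the three separate products, here is a small sketch, again assuming NumPy and illustrative variable names.

```python
import numpy as np

x = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]])
w_key = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]])

# One input at a time: a (4,) vector times a (4, 3) matrix gives a (3,) key.
keys_one_by_one = np.stack([row @ w_key for row in x])

# Vectorised: a single (3, 4) x (4, 3) product yields all three keys as rows.
keys = x @ w_key

assert np.array_equal(keys, keys_one_by_one)
print(keys.tolist())  # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
```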

Let’s do the same to obtain the value representations for every input:

                   [0, 2, 0]
    [1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3] 
    [0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
    [1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]
    
    
![title](img/sa4.gif)    
    
And finally, the query representations:

                   [1, 0, 1]
    [1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
    [0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
    [1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
    
    
![title](img/sa5.gif)
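
All three projections from this step can be reproduced in a few lines. The sketch below assumes NumPy; the names `keys`, `queries` and `values` are illustrative.

```python
import numpy as np

x = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]])
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

# Row i of each result is the representation of Input i + 1.
keys = x @ w_key
queries = x @ w_query
values = x @ w_value

for name, matrix in [("keys", keys), ("queries", queries), ("values", values)]:
    print(name, matrix.tolist())
```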

#### 4: Calculate attention scores for Input 1

![title](img/sa6.gif)


To obtain attention scores, we start by taking the dot product of Input 1’s query (red) with every key (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue). In the product below, the key matrix is transposed, so each of its columns is one key.

                [0, 4, 2]
    [1, 0, 2] x [1, 4, 3] = [2, 4, 4]
                [1, 0, 1]

The above operation is known as **dot product attention**.

Notice that we only use the query from Input 1. Later we’ll repeat this same step for the other queries.
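
The score computation for Input 1 can be sketched as follows, assuming NumPy and using the key and query values derived in step 3.

```python
import numpy as np

keys = np.array([[0, 1, 1],
                 [4, 4, 0],
                 [2, 3, 1]])
query_1 = np.array([1, 0, 2])  # query representation of Input 1

# keys.T holds one key per column, so this is three dot products at once.
scores_1 = query_1 @ keys.T

print(scores_1.tolist())  # [2, 4, 4]
```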


#### 5: Calculate softmax

![title](img/sa7.gif)

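Take the softmax across these attention scores (blue). The result is rounded to keep the following steps easy to read; the unrounded weights are approximately [0.06, 0.47, 0.47].

    softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]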
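
A minimal softmax sketch, assuming NumPy. The helper name `softmax` and the max-subtraction trick are choices made here, not something the walkthrough prescribes; the code prints the unrounded weights.

```python
import numpy as np

def softmax(scores):
    """Exponentiate and normalise so the weights are positive and sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

weights_1 = softmax(np.array([2.0, 4.0, 4.0]))

print(np.round(weights_1, 4))
print(weights_1.sum())
```

The two equal scores get equal weights, and the smaller score still receives a small non-zero weight.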
    
    
#### 6: Multiply scores with values

![title](img/sa8.gif)


The softmaxed attention score for each input (blue) is multiplied by its corresponding value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we’ll refer to them as weighted values.

    1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
    2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
    3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]
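
The same scaling can be done in one broadcasted multiplication. This sketch assumes NumPy and uses the rounded weights from the walkthrough so the numbers match the lines above.

```python
import numpy as np

values = np.array([[1, 2, 3],
                   [2, 8, 0],
                   [2, 6, 3]])
weights_1 = np.array([0.0, 0.5, 0.5])  # rounded softmax weights for Input 1

# weights_1[:, None] has shape (3, 1), so each weight scales one row of values.
weighted_values = weights_1[:, None] * values

print(weighted_values.tolist())
# [[0.0, 0.0, 0.0], [1.0, 4.0, 0.0], [1.0, 3.0, 1.5]]
```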



#### 7: Sum weighted values to get Output 1

![title](img/sa9.gif)

Take all the weighted values (yellow) and sum them element-wise:

      [0.0, 0.0, 0.0]
    + [1.0, 4.0, 0.0]
    + [1.0, 3.0, 1.5]
    -----------------
    = [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all keys, including its own.
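
Scaling the values and summing them is the same as one vector-matrix product. The sketch below assumes NumPy and the rounded weights [0.0, 0.5, 0.5] used in the walkthrough.

```python
import numpy as np

values = np.array([[1, 2, 3],
                   [2, 8, 0],
                   [2, 6, 3]])
weights_1 = np.array([0.0, 0.5, 0.5])  # rounded softmax weights for Input 1

# Scale each value row by its weight, then sum the rows element-wise.
output_1 = (weights_1[:, None] * values).sum(axis=0)

# The same result as a single product of the weights with the value matrix.
assert np.allclose(output_1, weights_1 @ values)

print(output_1.tolist())  # [2.0, 7.0, 1.5]
```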

#### 8: Repeat for Input 2 & Input 3

Now that we’re done with Output 1, we repeat Steps 4 to 7 for Output 2 and Output 3. I trust that I can leave you to work out the operations yourself.
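
If you want to check your hand calculations, steps 3 to 7 can be run for all three inputs at once. This is a sketch under the assumption that NumPy is available; the `softmax` helper is written here for illustration. It uses unrounded softmax weights, so Output 1 comes out close to, but not exactly, [2.0, 7.0, 1.5].

```python
import numpy as np

def softmax(scores, axis=-1):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

x = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]])
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

keys, queries, values = x @ w_key, x @ w_query, x @ w_value

scores = queries @ keys.T            # (3, 3): row i holds the scores for Input i + 1
weights = softmax(scores, axis=-1)   # each row sums to 1
outputs = weights @ values           # (3, 3): row i is Output i + 1

print(np.round(outputs, 2))
```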

Finally!!!

![title](img/sa10.gif)

> **Note:** The dimension of query and key must always be the same because the score is their dot product. However, the dimension of value may differ from that of query and key. The resulting output then has the same dimension as value.
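
The note on dimensions can be checked with a small shape experiment. This sketch assumes NumPy; the sizes `d_k = 3` and `d_v = 5` and the random weights are hypothetical and chosen only to make the value dimension differ from the key and query dimension.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n, d_in, d_k, d_v = 3, 4, 3, 5  # hypothetical sizes; d_v deliberately differs from d_k
x = rng.normal(size=(n, d_in))
w_query = rng.normal(size=(d_in, d_k))
w_key = rng.normal(size=(d_in, d_k))    # must share d_k with the queries
w_value = rng.normal(size=(d_in, d_v))  # free to use a different width

scores = (x @ w_query) @ (x @ w_key).T        # (n, n); requires matching d_k
scores -= scores.max(axis=1, keepdims=True)   # stabilise before exponentiating
exps = np.exp(scores)
weights = exps / exps.sum(axis=1, keepdims=True)
outputs = weights @ (x @ w_value)

print(outputs.shape)  # (3, 5)
```

The output has one row per input and as many columns as the value representation.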