## What is attention?

Much like human attention, attention in a model answers the question "How much should we focus on each element?".

Suppose we have an array [1, 2, 3, 4, 5] and 100% of our focus to spend on it. Looking at each element in turn, we could split that focus like this:

[0.1, 0.2, 0.5, 0.1, 0.1]

In this example, the value 3 receives the most attention (0.5).
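As a minimal sketch of this idea in plain Python (the names `values` and `focus` are chosen here only for illustration), we can check that the focus weights sum to 1, find the most-attended element, and blend the values according to the weights:

```python
values = [1, 2, 3, 4, 5]
focus = [0.1, 0.2, 0.5, 0.1, 0.1]  # how the 100% focus is split

# The weights describe a full split of our focus, so they sum to 1.
assert abs(sum(focus) - 1.0) < 1e-9

# The element with the largest weight receives the most attention.
most_attended = values[focus.index(max(focus))]

# A focus-weighted average blends the values according to the weights.
weighted_average = sum(v * w for v, w in zip(values, focus))

print(most_attended)
print(round(weighted_average, 2))
```

The weighted average is pulled toward 3 because that element holds half of the total focus.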

## How we represent the attention

Attention can be calculated from an input -> target pair processed with a weight. For example, if we concatenate a user embedding and an item embedding and map the result to how strongly the user relates to the item, the output we are looking for is the attention for user -> item.

$$
\text{attention} = (\text{input} \rightarrow \text{target}) \cdot \text{weight}
$$
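The user -> item example could be sketched roughly as follows. The embedding values, the weight vector, and the helper `attention_score` are all hypothetical and chosen only for illustration; in a real model the weights would be learned rather than written by hand.

```python
# Hypothetical embeddings and weights, chosen only for illustration.
user_embedding = [0.5, 1.0]
item_embedding = [2.0, 0.0]
weight = [0.2, 0.4, 0.6, 0.8]  # one weight per concatenated feature


def attention_score(user, item, w):
    """Concatenate user and item, then take the dot product with w."""
    pair = user + item  # list concatenation: the "input -> target" pair
    return sum(p * wi for p, wi in zip(pair, w))


score = attention_score(user_embedding, item_embedding, weight)
print(score)
```

A larger score here would be read as the user paying more attention to that item.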

## Now, let's talk about self-attention

Self-attention is the special case where the input and the target are the same object.

Take a look at the following formula:

$$
\mathrm{softmax}(X X^{\top}) \, X
$$

Let's understand this step by step:

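1. $X X^{\top}$ computes a heatmap of how much each element of $X$ focuses on every element of $X$, including itself. The scores in the table below are illustrative only; an actual $X X^{\top}$ is always symmetric.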

| X | a | b | c |
|---|---|---|---|
| a | 2 | 1 | 4 |
| b | 3 | 3 | 1 |
| c | 8 | 1 | 2 |

2. Softmax normalizes this table so that the scores in each row sum to 1.
3. The normalized scores are then multiplied with $X$ in another dot product, which gives, for each element, a weighted sum of all elements. A dot product can represent the relationship between two things.
4. After these steps we have an attention tensor; a higher score means a stronger relationship between the two elements.
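The steps above can be sketched in plain Python without any libraries. The input `X` and the helpers `matmul`, `transpose`, and `softmax_rows` are assumptions made for this illustration, not part of any library:

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply softmax to each row so that every row sums to 1."""
    result = []
    for row in m:
        shift = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Hypothetical input: three elements (a, b, c), each a 2-dimensional vector.
X = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

scores = matmul(X, transpose(X))  # step 1: pairwise dot products
weights = softmax_rows(scores)    # step 2: normalize each row
output = matmul(weights, X)       # step 3: weighted sum of the rows of X

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how one element distributes its focus over all three elements, and each row of `output` is the corresponding blend of the rows of `X`.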


## Self-attention in a neural network in action

![self-attention](./assert/self-attentions.jpg)


In [5]:
import torch
from torch.nn.functional import softmax
import math

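inputs = torch.tensor([
    [1, 0, 1, 0],  # input 1
    [0, 2, 0, 2],  # input 2
    [1, 1, 1, 1]   # input 3
], dtype=torch.float32)

# Random projection matrices (4 -> 3); in a trained network these are learned.
w_query = torch.rand(4, 3, dtype=torch.float32)
w_key = torch.rand(4, 3, dtype=torch.float32)
w_value = torch.rand(4, 3, dtype=torch.float32)

# Project the same input into queries, keys, and values.
query = inputs @ w_query
key = inputs @ w_key
value = inputs @ w_value

# Scaled dot-product scores, normalized over the keys (last dimension).
attention = softmax(query @ key.T / math.sqrt(key.shape[1]), dim=-1)

# Each output row is an attention-weighted sum of the value rows.
output = attention @ value

print(output)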



tensor([[1.5887, 2.5548, 1.3881],
        [1.6586, 2.6938, 1.5145],
        [1.6646, 2.6732, 1.4455]])
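
The cell above appears to follow the standard scaled dot-product form of attention. Written as a formula, with $Q$, $K$, and $V$ standing for `query`, `key`, and `value`, and $d_k$ for `key.shape[1]` (3 in this example):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Compared with the earlier self-attention formula, the raw input is replaced by three projections of the same input, and the scores are divided by $\sqrt{d_k}$ before the softmax. Because the weights come from `torch.rand`, the printed numbers will differ from run to run.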


