<a href="https://colab.research.google.com/github/venkatasl/AIML_TRAINING_VENKAT/blob/venkat_creation/3_SelfAttention_Hands_On.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Self Attention

Having explored attention in depth, we now turn to self-attention. A self-attention module takes n inputs and returns n outputs. What happens inside the module? Put simply, the self-attention mechanism lets the inputs interact with each other (hence "self") and determine which elements deserve more attention ("attention"). Each output aggregates these interactions, weighted by the corresponding attention scores.

The goal of this exercise is to walk you through the mathematical operations in a self-attention module. By the end, you should be able to write out or code a self-attention module from scratch.

Follow the diagram carefully to understand the details. We'll code each sub-section interactively.


![image.png](https://miro.medium.com/v2/resize:fit:1400/1*_92bnsMJy8Bl539G4v93yg.gif)


# Step 1: Prepare the inputs




In [None]:

import torch

x = [
  [1, 0, 1, 0], # Input 1
  [0, 2, 0, 2], # Input 2
  [1, 1, 1, 1]  # Input 3
 ]
x = torch.tensor(x, dtype=torch.float32)
print(x)

tensor([[1., 0., 1., 0.],
        [0., 2., 0., 2.],
        [1., 1., 1., 1.]])


# Step 2: Initialise weights


Each input needs three representations (as depicted in the diagram): key (orange), query (red), and value (purple). For this example, we want these representations to have dimension 3. Because every input has dimension 4, each set of weights must have shape 4×3, so that multiplying a 1×4 input by a 4×3 weight matrix yields a 1×3 representation.
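As a quick sanity check on the shape argument, here is a minimal sketch. The name `w_demo` is a hypothetical placeholder filled with zeros, not one of the notebook's weight matrices; only its shape matters.

```python
import torch

# Three inputs of dimension 4, as in Step 1.
x = torch.tensor([[1, 0, 1, 0],
                  [0, 2, 0, 2],
                  [1, 1, 1, 1]], dtype=torch.float32)

d_in = x.shape[1]   # input dimension: 4
d_out = 3           # chosen size of the key/query/value representations

# Hypothetical placeholder weights; only the shape is of interest here.
w_demo = torch.zeros(d_in, d_out)

projected = x @ w_demo   # (3, 4) @ (4, 3) -> (3, 3)
print(w_demo.shape, projected.shape)
```

Each of the three inputs ends up with one 3-dimensional representation, which is why the keys, queries and values in the next step are all 3×3.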

In [None]:

w_key = [
  [0, 0, 1],
  [1, 1, 0],
  [0, 1, 0],
  [1, 1, 0]
]
w_query = [
  [1, 0, 1],
  [1, 0, 0],
  [0, 0, 1],
  [0, 1, 1]
]
w_value = [
  [0, 2, 0],
  [0, 3, 0],
  [1, 0, 3],
  [1, 1, 0]
]
w_key = torch.tensor(w_key, dtype=torch.float32)
w_query = torch.tensor(w_query, dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)

print("Weights for key: \n", w_key)
print("Weights for query: \n", w_query)
print("Weights for value: \n", w_value)

Weights for key: 
 tensor([[0., 0., 1.],
        [1., 1., 0.],
        [0., 1., 0.],
        [1., 1., 0.]])
Weights for query: 
 tensor([[1., 0., 1.],
        [1., 0., 0.],
        [0., 0., 1.],
        [0., 1., 1.]])
Weights for value: 
 tensor([[0., 2., 0.],
        [0., 3., 0.],
        [1., 0., 3.],
        [1., 1., 0.]])


# Step 3: Derive key, query and value

Now that we have the three sets of weights, let’s actually obtain the **key**, **query** and **value** representations for every input.

Computing the keys:
```
               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]
```
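The hand computation above can be checked one dot product at a time. This is a small sketch that rebuilds the key for Input 1 column by column and compares it with the matrix product; `key_1` is an illustrative name that the notebook does not use elsewhere.

```python
import torch

x = torch.tensor([[1, 0, 1, 0],
                  [0, 2, 0, 2],
                  [1, 1, 1, 1]], dtype=torch.float32)
w_key = torch.tensor([[0, 0, 1],
                      [1, 1, 0],
                      [0, 1, 0],
                      [1, 1, 0]], dtype=torch.float32)

# Key for Input 1: one dot product per column of w_key.
key_1 = torch.stack(
    [torch.dot(x[0], w_key[:, j]) for j in range(w_key.shape[1])]
)
print(key_1)  # tensor([0., 1., 1.])

# The matrix product performs the same computation for every input at once.
keys = x @ w_key
print(torch.equal(keys[0], key_1))  # True
```

The queries and values follow the same pattern with `w_query` and `w_value`.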

In [None]:
keys = x @ w_key

#TO-DO
querys = x @ w_query

#TO-DO
values = x @ w_value

print("Keys: \n", keys)
print("Querys: \n", querys)
print("Values: \n", values)


Keys: 
 tensor([[0., 1., 1.],
        [4., 4., 0.],
        [2., 3., 1.]])
Querys: 
 tensor([[1., 0., 2.],
        [2., 2., 2.],
        [2., 1., 3.]])
Values: 
 tensor([[1., 2., 3.],
        [2., 8., 0.],
        [2., 6., 3.]])


# Step 4: Calculate attention scores

To derive the attention scores, we start by taking the dot product between the query representation (red) of Input 1 and every key (orange), including its own. Since there are three keys (one per input), we obtain three attention scores (blue).

```
            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]
```
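The worked example above uses only the query from Input 1. The code cell for this step computes the scores for all three queries at once: row i of the resulting matrix holds the scores of query i against every key. It then applies softmax to each row, so that the scores for a given query are positive and sum to 1.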
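To connect the hand computation with the full score matrix, the sketch below computes the scores for the first query alone and compares them with the first row of the all-queries product. The names `scores_1` and `all_scores` are illustrative only.

```python
import torch

# Keys and queries as printed in Step 3.
keys = torch.tensor([[0, 1, 1],
                     [4, 4, 0],
                     [2, 3, 1]], dtype=torch.float32)
querys = torch.tensor([[1, 0, 2],
                       [2, 2, 2],
                       [2, 1, 3]], dtype=torch.float32)

# Scores for query 1 only: its dot product with each key.
scores_1 = querys[0] @ keys.T
print(scores_1)  # tensor([2., 4., 4.])

# All queries at once: row i holds the scores for query i.
all_scores = querys @ keys.T
print(torch.equal(all_scores[0], scores_1))  # True
```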

In [None]:
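#TO-DO
# Dot product of every query with every key: row i holds the scores for query i.
attn_scores = torch.matmul(querys, keys.T)
print(attn_scores)

from torch.nn.functional import softmax

#TO-DO
# Normalise each row so that the scores for one query sum to 1.
# Passing dim explicitly avoids the implicit-dimension deprecation warning.
attn_scores_softmax = softmax(attn_scores, dim=-1)
print(attn_scores_softmax)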


tensor([[ 2.,  4.,  4.],
        [ 4., 16., 12.],
        [ 4., 12., 10.]])
tensor([[6.3379e-02, 4.6831e-01, 4.6831e-01],
        [6.0337e-06, 9.8201e-01, 1.7986e-02],
        [2.9539e-04, 8.8054e-01, 1.1917e-01]])




# Step 5: Multiply scores with values

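The softmaxed attention scores for each input (blue) are multiplied by the corresponding **value** (purple). This results in 3 alignment vectors (yellow). In this tutorial, we'll refer to them as **weighted values**. For readability, the worked example for Input 1 rounds its softmaxed scores of roughly `[0.06, 0.47, 0.47]` to `[0.0, 0.5, 0.5]`.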

```
1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]
```
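The rounded figures above can be compared with the exact ones. The sketch below writes softmax out by hand for the scores of query 1, checks it against `torch.softmax`, and then scales each value by its weight. The names `weights_1` and `weighted_1` are illustrative only.

```python
import torch

# Attention scores for query 1 and the values from Step 3.
scores_1 = torch.tensor([2.0, 4.0, 4.0])
values = torch.tensor([[1, 2, 3],
                       [2, 8, 0],
                       [2, 6, 3]], dtype=torch.float32)

# Softmax by hand: exponentiate, then divide by the sum.
exp_scores = torch.exp(scores_1)
weights_manual = exp_scores / exp_scores.sum()

# The library call gives the same weights.
weights_1 = torch.softmax(scores_1, dim=-1)
print(torch.allclose(weights_manual, weights_1))  # True

# Scale each value (row) by its weight: (3, 1) * (3, 3) -> (3, 3).
weighted_1 = weights_1.unsqueeze(-1) * values
print(weights_1)
print(weighted_1)
```

The exact weights are close to, but not equal to, the rounded `[0.0, 0.5, 0.5]`, which is why the notebook's printed tensors differ slightly from the hand-worked numbers.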

In [None]:
attn_scores_softmax.unsqueeze(-1).shape

torch.Size([3, 3, 1])

In [None]:
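#TO-DO
# Broadcasting: (3, 3) values * (3, 3, 1) weights -> (3, 3, 3).
# weighted_values[i, j] is value j scaled by the attention weight of query i on key j.
weighted_values = values * attn_scores_softmax.unsqueeze(-1)
print(weighted_values)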

tensor([[[6.3379e-02, 1.2676e-01, 1.9014e-01],
         [9.3662e-01, 3.7465e+00, 0.0000e+00],
         [9.3662e-01, 2.8099e+00, 1.4049e+00]],

        [[6.0337e-06, 1.2067e-05, 1.8101e-05],
         [1.9640e+00, 7.8561e+00, 0.0000e+00],
         [3.5972e-02, 1.0792e-01, 5.3958e-02]],

        [[2.9539e-04, 5.9077e-04, 8.8616e-04],
         [1.7611e+00, 7.0443e+00, 0.0000e+00],
         [2.3834e-01, 7.1501e-01, 3.5750e-01]]])


# Step 6: Sum weighted values


Take all the **weighted values** (yellow) and sum them element-wise:

```
  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]
```

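The resulting vector `[2.0, 7.0, 1.5]` (dark green) is **Output 1**, which is based on the **query representation from Input 1** interacting with all the keys, including its own. Because the worked example uses rounded softmax weights, the exact result computed in code is approximately `[1.9366, 6.6831, 1.5951]`.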
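Scaling each value by its weight and then summing is the same operation as a vector-matrix product. The sketch below illustrates this for Output 1 with the unrounded softmax weights; `output_1_sum` and `output_1_matmul` are illustrative names.

```python
import torch

values = torch.tensor([[1, 2, 3],
                       [2, 8, 0],
                       [2, 6, 3]], dtype=torch.float32)

# Softmaxed attention weights for query 1 (scores [2, 4, 4]).
weights_1 = torch.softmax(torch.tensor([2.0, 4.0, 4.0]), dim=-1)

# Scale each value by its weight, then sum over the three inputs.
output_1_sum = (weights_1.unsqueeze(-1) * values).sum(dim=0)

# The same weighted sum written as a vector-matrix product.
output_1_matmul = weights_1 @ values

print(output_1_sum)
print(torch.allclose(output_1_sum, output_1_matmul))  # True
```

This equivalence is what allows Steps 5 and 6 to be collapsed into a single matrix multiplication when all queries are processed together.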

In [None]:
#TO-DO
# Sum over the value axis (dim=1): one output vector per query.
outputs = torch.sum(weighted_values, dim=1)
print(outputs)

tensor([[1.9366, 6.6831, 1.5951],
        [2.0000, 7.9640, 0.0540],
        [1.9997, 7.7599, 0.3584]])
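Putting the steps together, one possible compact form is sketched below. The function name `self_attention` is a hypothetical helper, not part of the notebook or of PyTorch. Like the walkthrough, it leaves out the division by the square root of the key dimension that scaled dot-product attention applies before the softmax.

```python
import torch


def self_attention(x, w_query, w_key, w_value):
    """Unscaled self-attention, following Steps 3 to 6 of this notebook."""
    querys = x @ w_query                            # Step 3: project the inputs
    keys = x @ w_key
    values = x @ w_value
    attn = torch.softmax(querys @ keys.T, dim=-1)   # Step 4: scores, then softmax per query
    return attn @ values                            # Steps 5 and 6: weighted sum of values


x = torch.tensor([[1, 0, 1, 0],
                  [0, 2, 0, 2],
                  [1, 1, 1, 1]], dtype=torch.float32)
w_key = torch.tensor([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]], dtype=torch.float32)
w_query = torch.tensor([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]], dtype=torch.float32)
w_value = torch.tensor([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]], dtype=torch.float32)

outputs = self_attention(x, w_query, w_key, w_value)
print(outputs)
```

With the weights from Step 2, this should reproduce the output tensor printed above up to floating-point rounding.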
