# Self Attention in Transformers

## Generate Data

In [1]:
import numpy as np
import math

L, d_k, d_v = 4, 8, 8
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)

1. q: query vectors, k: key vectors, v: value vectors (one row per word).
2. L: length of the sequence. Here the sequence is "My name is Ajay", so the length is 4.
3. d_k, d_v: the dimensions of the k and v vectors, respectively. The q vectors also have dimension d_k so that they can be dotted with the keys.
4. d_k = d_v = 8 here (you can choose any dimension).
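
Because the generator above is unseeded, the printed numbers change on every run. A minimal sketch of a reproducible setup, assuming NumPy's `default_rng` and an arbitrary seed of 0 (neither is used in the original notebook):

```python
import numpy as np

# Assumption: seed 0 is arbitrary; any fixed integer gives repeatable draws.
rng = np.random.default_rng(0)

L, d_k, d_v = 4, 8, 8
q = rng.standard_normal((L, d_k))  # one query vector per word
k = rng.standard_normal((L, d_k))  # one key vector per word
v = rng.standard_normal((L, d_v))  # one value vector per word

print(q.shape, k.shape, v.shape)
```

With a fixed seed, rerunning the cell yields the same q, k and v, which makes the later outputs easier to compare.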

In [2]:
print("Q\n", q)
print("K\n", k)
print("V\n", v)

Q
 [[-1.36415637  0.59692338  0.81469767 -0.17490486  0.60457722  0.97285416
   0.65119482 -0.09519678]
 [-1.30969525 -0.94479345 -0.3673079  -0.12212057  1.7617716   1.49037266
   0.02568805  1.07899398]
 [ 1.11876351  0.16859557 -0.33980155 -1.04705635 -0.85774411 -0.0475422
  -0.25589572  1.22947487]
 [-0.7141838  -0.01038527  2.44348975 -0.8276257  -0.62324254  0.4149842
  -0.47792485  0.06031733]]
K
 [[ 1.9203213   1.14009187  0.19172904  0.75344955 -0.87181049  1.53841683
   0.21507366  0.18913319]
 [ 0.04653665  0.62462718 -1.50235362 -0.42596872  0.06980895  0.34065727
  -0.22526977 -1.0466694 ]
 [ 1.50227701  0.73966022 -0.27537201  0.26149526 -0.53322466 -0.08374696
  -0.6863729  -1.31370446]
 [ 0.04921155 -0.52548654 -0.54641947  0.89833533  1.41068005 -0.77247713
   0.60306728  0.69031295]]
V
 [[ 1.02017695  0.96976545  0.61554372  1.69284593  0.71702866 -0.59883186
  -0.67487099  0.13410379]
 [-0.50207642  0.6248769  -0.10653077 -0.18484618 -0.42154398  1.19186699
  -0.231

In [3]:
print(q.shape)
print(k.shape)
print(v.shape)

(4, 8)
(4, 8)
(4, 8)


#### Shape of the vectors
1. 4 >> length of the sequence.
2. 8 >> each word in the sequence is represented by an 8-dimensional vector.
3. Every single word has an 8-dimensional q vector, an 8-dimensional k vector and an 8-dimensional v vector.

## Self Attention

$$
\text{self attention} = \text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}} + M\bigg)
$$

$$
\text{new V} = \text{self attention} \cdot V
$$

To create the initial self-attention matrix, every word needs to look at every other word (and at itself) to measure how strong its affinity towards that word is.

In [4]:
np.matmul(q, k.T)

array([[-0.82302342, -0.51352891, -2.60365311, -0.55474118],
       [-2.78814047, -0.55168962, -5.09648387,  2.61737547],
       [ 2.33869455, -0.19139374,  0.6469928 , -1.26734931],
       [-0.44800176, -3.21577647, -1.42350222, -3.55468917]])

q is 4 x 8 and k.T is 8 x 4, hence the result is 4 x 4.

1. Each number is the raw (not yet normalized) attention score that one word gives to another word.
2. The second number in the first row, `[0, 1]`, tells us how much "My" is focusing on "Name".
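
To make the indexing concrete, here is a small sketch showing that entry `[0, 1]` of the score matrix is just the dot product of the first query with the second key. The seeded data is an assumption for repeatability, so the numbers differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
L, d_k = 4, 8
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))

scores = np.matmul(q, k.T)  # shape (L, L)

# Entry [0, 1] is the dot product of the first query ("My")
# with the second key ("Name").
manual = np.dot(q[0], k[1])
print(scores[0, 1], manual)
```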

![](Data/attention_matrix.png)

In [6]:
# Why do we need sqrt(d_k) in the denominator?
q.var(), k.var(), np.matmul(q, k.T).var()

(0.8189305952168686, 0.7046191727529998, 3.9175175362098447)

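Answer: To keep the variance of `np.matmul(q, k.T)` under control. Each score is a sum of d_k products, so its variance grows with d_k; dividing by sqrt(d_k) brings it back to roughly the scale of q and k, which keeps softmax from saturating.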

In [7]:
scaled = np.matmul(q, k.T) / math.sqrt(d_k)
q.var(), k.var(), scaled.var()

(0.8189305952168686, 0.7046191727529998, 0.4896896920262304)

Notice the reduction in the variance of the product: from about 3.92 to about 0.49, a factor of d_k = 8.
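
With only 4 x 4 scores the variance estimate is noisy. The sketch below repeats the check on many independent query/key pairs; the sample size `n`, the dimension `d_k = 64` and the seed are assumptions chosen only to make the effect easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed
n, d_k = 10_000, 64             # assumption: many samples, larger d_k
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

# Row-wise dot products: one score per (q_i, k_i) pair.
raw = np.sum(q * k, axis=-1)
scaled = raw / np.sqrt(d_k)

print(raw.var())     # close to d_k
print(scaled.var())  # close to 1
```

For unit-variance inputs the unscaled variance tracks d_k, so the division by sqrt(d_k) matters more as the vectors get wider.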

In [8]:
scaled

array([[-0.29098272, -0.18155989, -0.92053039, -0.19613062],
       [-0.98575652, -0.19505174, -1.80187915,  0.92538197],
       [ 0.82685339, -0.06766791,  0.2287465 , -0.44807565],
       [-0.15839254, -1.13694867, -0.50328404, -1.25677241]])

![](Data/attention_matrix.png)

## Masking

- This ensures words don't get context from words generated in the future.
- Not required in the encoder, but required in the decoder.

Masking is done so that the network doesn't look at a future word when generating the context of the current word.

In [9]:
mask = np.tril(np.ones( (L, L) ))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

* All of the values on and below the diagonal are one, and all of the values above the diagonal are zero.
* This encodes the rule above: each word may look only at itself and at earlier words.

|      | My | Name| is |Ajay|
|------|----|-----|----|----|
| My   | 1  | 0   | 0  | 0  |
| Name | 1  | 1   | 0  | 0  |
| is   | 1  | 1   | 1  | 0  |
| Ajay | 1  | 1   | 1  | 1  |

* "My" can look only at "My" while generating the first word.
* "Name" can look at "My" and "Name".
* "is" can look at "My", "Name" and "is".
* "Ajay" can look at "My", "Name", "is" and "Ajay".

In [10]:
mask[mask == 0] = -np.inf  # blocked (future) positions
mask[mask == 1] = 0        # allowed positions

In [11]:
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])

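This transformation (0 >> -infinity and 1 >> 0) is required because the mask is added to the scores before softmax: adding 0 leaves a score unchanged, while adding -infinity drives its softmax weight to 0.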
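
The same additive mask can also be built in one step. This is a sketch of an alternative construction using `np.where`, not the notebook's own approach; it also shows why `-inf` works, since `np.exp(-np.inf)` is exactly 0.

```python
import numpy as np

L = 4
# Additive causal mask: 0 on and below the diagonal, -inf strictly above it.
allowed = np.tril(np.ones((L, L), dtype=bool))
mask = np.where(allowed, 0.0, -np.inf)

# exp(-inf) is exactly 0, so masked positions get zero softmax weight.
print(np.exp(-np.inf))
print(mask)
```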

In [12]:
# Applying the mask on the scaled output
scaled + mask

array([[-0.29098272,        -inf,        -inf,        -inf],
       [-0.98575652, -0.19505174,        -inf,        -inf],
       [ 0.82685339, -0.06766791,  0.2287465 ,        -inf],
       [-0.15839254, -1.13694867, -0.50328404, -1.25677241]])

`-inf` means the current word gets no context from future words while it is being generated.

## Softmax

$$
\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$

In [13]:
def softmax(x):
  return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T
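
The `softmax` above exponentiates the raw scores, which can overflow when scores are large. A common variant subtracts each row's maximum first, which leaves the result unchanged. The sketch below uses a hypothetical helper name, `stable_softmax`, and made-up scores; it assumes every row has at least one finite entry, which holds for a causal mask.

```python
import numpy as np

def stable_softmax(x):
    # Hypothetical helper, not part of the original notebook.
    # Subtracting the row maximum does not change the result
    # but keeps np.exp from overflowing on large scores.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Made-up scores: large values plus masked (-inf) positions.
scores = np.array([[1000.0, 1001.0, -np.inf],
                   [0.5, -np.inf, -np.inf]])
weights = stable_softmax(scores)
print(weights)
print(weights.sum(axis=-1))
```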

In [14]:
attention = softmax(scaled + mask)

In [15]:
attention

array([[1.        , 0.        , 0.        , 0.        ],
       [0.31201736, 0.68798264, 0.        , 0.        ],
       [0.51055448, 0.20871633, 0.28072919, 0.        ],
       [0.41363996, 0.15546798, 0.29298003, 0.13791204]])

In [16]:
new_v = np.matmul(attention, v)
new_v
# after attention

array([[ 1.02017695,  0.96976545,  0.61554372,  1.69284593,  0.71702866,
        -0.59883186, -0.67487099,  0.13410379],
       [-0.02710694,  0.73248812,  0.11876901,  0.40102635, -0.06628955,
         0.63313787, -0.36965068, -0.38028982],
       [ 0.35546974,  0.13066996,  0.26648324,  1.16944436,  0.13436219,
        -0.63271586, -0.28240907, -0.14603994],
       [ 0.16824467, -0.14904602,  0.03000079,  1.16845379, -0.11712071,
        -0.73562332, -0.3235134 , -0.19843113]])

This new_v is now context-aware: each row is a weighted average of the value vectors that the word is allowed to attend to.
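
The weighted-average view can be checked directly. In the sketch below (seeded data is an assumption, so the numbers differ from the outputs above), the first word attends only to itself under the causal mask, so its output row equals `v[0]`, and the second row is a mix of `v[0]` and `v[1]`.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed
L, d_k, d_v = 4, 8, 8
q = rng.standard_normal((L, d_k))
k = rng.standard_normal((L, d_k))
v = rng.standard_normal((L, d_v))

# Causal mask, scaled scores, row-wise softmax.
mask = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)
scores = np.matmul(q, k.T) / math.sqrt(d_k) + mask
exp = np.exp(scores)
attention = exp / exp.sum(axis=-1, keepdims=True)
new_v = np.matmul(attention, v)

# The first word can only attend to itself, so its output is v[0] unchanged.
print(np.allclose(new_v[0], v[0]))

# The second word's output is a weighted average of v[0] and v[1].
mix = attention[1, 0] * v[0] + attention[1, 1] * v[1]
print(np.allclose(new_v[1], mix))
```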

In [None]:
# before attention
v

array([[-0.00368231,  1.43739233, -0.59614565, -1.23171219,  1.12030717,
        -0.98620738, -0.15461465, -1.03106383],
       [ 0.85585446, -1.79878344,  0.67321704,  0.05607552, -0.15542661,
        -1.41264124, -0.40136933, -1.17626611],
       [ 0.50465335,  2.28693419,  0.67128338,  0.2506863 ,  1.78802234,
         0.14775751, -0.11405725,  0.88026286],
       [-0.68069105,  0.68385101,  0.17994557, -1.68013201,  0.91543969,
        -0.19108312,  0.03160471,  1.40527326]])

## Function

In [17]:
def softmax(x):
  return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v, mask=None):
  d_k = q.shape[-1]
  scaled = np.matmul(q, k.T) / math.sqrt(d_k)
  if mask is not None:
    scaled = scaled + mask
  attention = softmax(scaled)
  out = np.matmul(attention, v)
  return out, attention
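
A few sanity checks can confirm that the function behaves as described. This sketch repeats the two definitions so it runs on its own, and uses seeded data and a `causal` mask variable that are assumptions for illustration.

```python
import math
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        scaled = scaled + mask
    attention = softmax(scaled)
    return np.matmul(attention, v), attention

rng = np.random.default_rng(0)  # assumption: arbitrary seed
L, d_k, d_v = 4, 8, 8
q, k, v = (rng.standard_normal((L, d)) for d in (d_k, d_k, d_v))
causal = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)

_, enc_attention = scaled_dot_product_attention(q, k, v)
_, dec_attention = scaled_dot_product_attention(q, k, v, mask=causal)

# Every row is a probability distribution in both cases.
print(np.allclose(enc_attention.sum(axis=-1), 1.0))
print(np.allclose(dec_attention.sum(axis=-1), 1.0))
# Without a mask all weights are positive; with it, future positions are zero.
print(np.all(enc_attention > 0))
print(np.all(np.triu(dec_attention, k=1) == 0))
# The last word sees the whole sequence either way, so its row is identical.
print(np.allclose(enc_attention[-1], dec_attention[-1]))
```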

In [19]:
# Encoder
values, attention = scaled_dot_product_attention(q, k, v, mask=None)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[-1.36415637  0.59692338  0.81469767 -0.17490486  0.60457722  0.97285416
   0.65119482 -0.09519678]
 [-1.30969525 -0.94479345 -0.3673079  -0.12212057  1.7617716   1.49037266
   0.02568805  1.07899398]
 [ 1.11876351  0.16859557 -0.33980155 -1.04705635 -0.85774411 -0.0475422
  -0.25589572  1.22947487]
 [-0.7141838  -0.01038527  2.44348975 -0.8276257  -0.62324254  0.4149842
  -0.47792485  0.06031733]]
K
 [[ 1.9203213   1.14009187  0.19172904  0.75344955 -0.87181049  1.53841683
   0.21507366  0.18913319]
 [ 0.04653665  0.62462718 -1.50235362 -0.42596872  0.06980895  0.34065727
  -0.22526977 -1.0466694 ]
 [ 1.50227701  0.73966022 -0.27537201  0.26149526 -0.53322466 -0.08374696
  -0.6863729  -1.31370446]
 [ 0.04921155 -0.52548654 -0.54641947  0.89833533  1.41068005 -0.77247713
   0.60306728  0.69031295]]
V
 [[ 1.02017695  0.96976545  0.61554372  1.69284593  0.71702866 -0.59883186
  -0.67487099  0.13410379]
 [-0.50207642  0.6248769  -0.10653077 -0.18484618 -0.42154398  1.19186699
  -0.231

Attention
 [[0.2668116  0.29766409 0.14216596 0.29335835]
 [0.0960811  0.21185401 0.04248156 0.64958333]
 [0.44680644 0.18265592 0.24567723 0.12486042]
 [0.41363996 0.15546798 0.29298003 0.13791204]]

Notice that here every word is looking at every other word, whereas in the masked (decoder) case below it is not.

In [20]:
# Decoder
values, attention = scaled_dot_product_attention(q, k, v, mask=mask)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[-1.36415637  0.59692338  0.81469767 -0.17490486  0.60457722  0.97285416
   0.65119482 -0.09519678]
 [-1.30969525 -0.94479345 -0.3673079  -0.12212057  1.7617716   1.49037266
   0.02568805  1.07899398]
 [ 1.11876351  0.16859557 -0.33980155 -1.04705635 -0.85774411 -0.0475422
  -0.25589572  1.22947487]
 [-0.7141838  -0.01038527  2.44348975 -0.8276257  -0.62324254  0.4149842
  -0.47792485  0.06031733]]
K
 [[ 1.9203213   1.14009187  0.19172904  0.75344955 -0.87181049  1.53841683
   0.21507366  0.18913319]
 [ 0.04653665  0.62462718 -1.50235362 -0.42596872  0.06980895  0.34065727
  -0.22526977 -1.0466694 ]
 [ 1.50227701  0.73966022 -0.27537201  0.26149526 -0.53322466 -0.08374696
  -0.6863729  -1.31370446]
 [ 0.04921155 -0.52548654 -0.54641947  0.89833533  1.41068005 -0.77247713
   0.60306728  0.69031295]]
V
 [[ 1.02017695  0.96976545  0.61554372  1.69284593  0.71702866 -0.59883186
  -0.67487099  0.13410379]
 [-0.50207642  0.6248769  -0.10653077 -0.18484618 -0.42154398  1.19186699
  -0.231