https://www.youtube.com/watch?v=QCJQG4DuHT0&list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4

In [1]:
import numpy as np
import math

In [2]:
input_seq = "My name is Ajay"

In [3]:
L = 4 # length of the input sequence in terms of words
d_k = 8 # number of dimensions in the key vector
d_v = 8 #  number of dimensions in the value vector

In [4]:
# Generate for each word a query, key, and value vector
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)

For each word, print its query, key, and value vectors:

In [5]:
for i, word in enumerate(input_seq.split(" ")):
    print(f"word : '{word}'")
    print("Q\n", q[i])  # row i of q, k, and v belongs to word i
    print("K\n", k[i])
    print("V\n", v[i])
    print("----------------------------\n")

word : 'My'
Q
 [ 1.60870184 -0.33850311  0.61360379  0.90066393 -1.17032812 -0.08006709
 -0.74460439 -0.84119747]
K
 [ 0.01330805 -0.80453143  0.13958882 -1.83548747 -2.00376175 -0.6174246
   0.5314595  -0.44841173]
[remaining output truncated]

## Self Attention

Build an attention matrix that lets every word in the input sequence look at every other word and measure how strong its affinity to that word is.

$$\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where:
- $Q$ = what I am looking for
- $K$ = what I currently have
- $V$ = the content that is passed on, weighted by the attention scores
- $M$ = the mask (see the Masking section below)



In [10]:
np.matmul(q, k.T)

array([[ 1.10220978,  1.44093158,  0.9190615 ,  0.26651833],
       [-1.83120489, -4.00947714,  1.85162327,  3.19053453],
       [ 0.96064523, -0.32591597,  1.32951125, -1.43763971],
       [-0.6289311 ,  2.33419412, -1.39226873, -0.39662222]])

The product $QK^T$ is a 4x4 matrix of raw scores: the larger the value, the more attention one word pays to another. The first row shows how much attention each word receives when the query is "My", the second row when the query is "name", and so on. In this random example the largest score in the first row is 1.44, for "name"; because the vectors are random, the scores carry no linguistic meaning.
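
To make the row and column reading concrete, the sketch below checks that entry (i, j) of the score matrix is the dot product of word i's query with word j's key. It draws its own seeded random vectors (the seed and the `_demo` names are assumptions for illustration), so its numbers will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
L, d_k = 4, 8
q_demo = rng.standard_normal((L, d_k))  # one query vector per word
k_demo = rng.standard_normal((L, d_k))  # one key vector per word

# Shape (L, L): rows index the query word, columns index the key word.
scores = q_demo @ k_demo.T

# Entry (i, j) is the dot product of query i with key j.
i, j = 1, 3
manual = np.dot(q_demo[i], k_demo[j])
print(scores.shape)
print(np.isclose(scores[i, j], manual))
```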

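Dividing by $\sqrt{d_k}$ reduces the variance of $QK^T$, which stabilizes the values fed into the softmax. As the paper "Attention Is All You Need" puts it, for large $d_k$ "the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients". Because every score is divided by $\sqrt{d_k}$, the variance is divided by $d_k$ (here 8).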

**before scaling:**

In [7]:
np.matmul(q, k.T).var()

np.float64(3.083509890285003)

**after scaling:**

In [8]:
attention_matrix_scaled = np.matmul(q, k.T) / math.sqrt(d_k)
attention_matrix_scaled.var()

np.float64(0.38543873628562525)
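
The effect of the scaling is easier to see with many samples. The following is a rough sketch, assuming the query and key entries are independent with zero mean and unit variance (as with `np.random.randn`); under that assumption each dot product has variance close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1. The sample size, the seed, and the `results` dictionary are arbitrary choices for this example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_samples = 10_000

results = {}
for d_k in (8, 64, 256):
    q_s = rng.standard_normal((n_samples, d_k))
    k_s = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q_s * k_s, axis=1)   # one dot product per sample
    scaled = dots / math.sqrt(d_k)     # same scaling as in the attention formula
    results[d_k] = (dots.var(), scaled.var())
    print(f"d_k={d_k}: raw variance={dots.var():.2f}, scaled variance={scaled.var():.2f}")
```

The raw variance grows roughly in proportion to `d_k`, while the scaled variance stays near 1 regardless of the dimension.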

## Masking

Masking the words ahead of the current word is required in the decoder: in the real world, we cannot predict a word based on words that come after it. This restriction is not necessary when modelling embeddings with the encoder.

In [38]:
mask = np.tril(np.ones((L,L)))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

The above matrix encodes the fact that a given word can only see itself and the words that come before it: "My" can only see "My", "name" can only see "My" and "name", etc.

In [47]:
mask[mask == 0] = -np.inf  # future positions: exp(-inf) = 0, so they get zero weight after softmax
mask[mask == 1] = 0  # visible positions: adding 0 leaves the attention scores unchanged
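
The same additive mask can also be built in one step. This is a sketch of an alternative construction, not the notebook's original cell: `np.triu(..., k=1)` marks the strictly upper triangular (future) positions, and `np.where` writes `-inf` there and `0` everywhere else. The names `future`, `additive_mask`, and `two_step` are made up for this example.

```python
import numpy as np

L = 4

# True above the diagonal, i.e. at positions a word is not allowed to see.
future = np.triu(np.ones((L, L), dtype=bool), k=1)
additive_mask = np.where(future, -np.inf, 0.0)
print(additive_mask)

# Cross-check against the two-step construction used in the cell above.
two_step = np.tril(np.ones((L, L)))
two_step[two_step == 0] = -np.inf
two_step[two_step == 1] = 0
print(np.array_equal(additive_mask, two_step))
```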

## Softmax

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

For a given word, turn its attention values into a probability distribution: divide each exponentiated attention value by the sum of all exponentiated attention values in that row. Repeat for each word.

In [51]:
def softmax(x):
    # exponentiate every score, then divide each row by its own sum
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

In [54]:
softmax(attention_matrix_scaled + mask)

array([[1.        , 0.        , 0.        , 0.        ],
       [0.65257547, 0.34742453, 0.        , 0.        ],
       [0.72320796, 0.05172176, 0.22507028, 0.        ],
       [0.08825281, 0.36919315, 0.28033537, 0.26221867]])
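
A common variant subtracts each row's maximum before exponentiating, which leaves the result mathematically unchanged but avoids overflow for large scores. The sketch below is one such version (the name `stable_softmax` and the toy scores are assumptions for illustration). It relies on every row having at least one finite entry, which holds for the causal mask because the diagonal is never masked.

```python
import numpy as np

def stable_softmax(x):
    # Shift each row so its largest entry is 0; exp() can then never overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    # keepdims=True keeps the row sums as a column, so no transposes are needed.
    return exp / np.sum(exp, axis=-1, keepdims=True)

toy_scores = np.array([
    [1000.0, -np.inf, -np.inf],  # a plain exp(1000) would overflow
    [1.0, 2.0, -np.inf],
    [0.5, 0.5, 0.5],
])
weights = stable_softmax(toy_scores)
print(weights)
print(weights.sum(axis=-1))  # each row sums to 1
```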

To see what softmax does internally, look at its numerator and denominator separately.

In [66]:
x = attention_matrix_scaled + mask

In [67]:
np.exp(x).T

array([[1.93507226, 0.58845159, 4.99045031, 0.34683577],
       [0.        , 0.31328563, 0.3569027 , 1.45093838],
       [0.        , 0.        , 1.5530831 , 1.10172509],
       [0.        , 0.        , 0.        , 1.03052599]])

In [68]:
np.sum(np.exp(x), axis=-1)

array([1.93507226, 0.90173721, 6.90043611, 3.93002524])
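
Putting the pieces together, the formula from the Self Attention section can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation: the names `scaled_dot_product_attention`, `row_softmax`, and the `_demo` variables are made up here, and the inputs are fresh seeded random vectors, so the numbers will not match the outputs above.

```python
import math
import numpy as np

def row_softmax(x):
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    exp = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp / np.sum(exp, axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    scores = q @ k.T / math.sqrt(d_k)  # (L, L) scaled attention scores
    if mask is not None:
        scores = scores + mask         # -inf entries end up with zero weight
    weights = row_softmax(scores)      # each row is a probability distribution
    return weights @ v, weights        # weighted sum of the value vectors

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
L, d_k, d_v = 4, 8, 8
q_demo = rng.standard_normal((L, d_k))
k_demo = rng.standard_normal((L, d_k))
v_demo = rng.standard_normal((L, d_v))

# Causal mask: -inf above the diagonal (future words), 0 elsewhere.
causal_mask = np.where(np.triu(np.ones((L, L), dtype=bool), k=1), -np.inf, 0.0)

out, weights = scaled_dot_product_attention(q_demo, k_demo, v_demo, mask=causal_mask)
print(out.shape)
print(np.allclose(out[0], v_demo[0]))
```

With the causal mask, the first word can only attend to itself, so its output row equals its own value vector; later rows are mixtures of the value vectors of the words seen so far.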