In [1]:
import numpy as np
import tensorflow as tf

Reference: https://machinelearningmastery.com/the-attention-mechanism-from-scratch/

First, we calculate the attention output for the first word in a sequence of four. Then we calculate the attention output for all four words at once in matrix form.

Start by defining the word embeddings of the four words for which the attention is calculated. In practice, these word embeddings would be generated by an encoder; for this example, you will define them manually.

In [2]:
word_1 = tf.constant([[0, 1, 1, 2]])
word_2 = tf.constant([[1, 0, 0, 2]])
word_3 = tf.constant([[1, 1, 0, 2]])
word_4 = tf.constant([[2, 1, 1, 1]])

Stacking the word embeddings into a single tensor of shape (1, 4, 4)

In [3]:
word_emb_stack = tf.stack([word_1, word_2, word_3, word_4], axis = 1)

word_emb_stack

<tf.Tensor: shape=(1, 4, 4), dtype=int32, numpy=
array([[[0, 1, 1, 2],
        [1, 0, 0, 2],
        [1, 1, 0, 2],
        [2, 1, 1, 1]]], dtype=int32)>

Next, generate the weight matrices, which are later multiplied with the word embeddings to produce the queries, keys, and values. Here, the weight matrices are generated randomly, so their values change on every run unless a seed is set, and the outputs below correspond to one particular draw. In practice, these matrices would be learned during training.

In [4]:
shape = (4, 4)

W_Q = tf.random.uniform(shape = shape, minval = 0, maxval = 3, dtype = tf.int32)
W_K = tf.random.uniform(shape = shape, minval = 0, maxval = 3, dtype = tf.int32)
W_V = tf.random.uniform(shape = shape, minval = 0, maxval = 3, dtype = tf.int32)

W_Q, W_K, W_V

(<tf.Tensor: shape=(4, 4), dtype=int32, numpy=
 array([[1, 1, 0, 0],
        [0, 0, 2, 2],
        [1, 2, 2, 2],
        [0, 0, 2, 2]], dtype=int32)>,
 <tf.Tensor: shape=(4, 4), dtype=int32, numpy=
 array([[1, 1, 0, 2],
        [1, 1, 2, 0],
        [2, 1, 0, 2],
        [0, 0, 0, 1]], dtype=int32)>,
 <tf.Tensor: shape=(4, 4), dtype=int32, numpy=
 array([[1, 2, 1, 1],
        [1, 2, 0, 1],
        [0, 0, 2, 0],
        [2, 1, 1, 2]], dtype=int32)>)

Notice how the number of rows of each of these matrices is equal to the dimensionality of the word embeddings (which in this case is four) to allow us to perform the matrix multiplication.
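The shape requirement can be checked directly. The following is a minimal NumPy sketch, assuming the `W_Q` values printed above are copied in by hand (they come from one random draw); the names `W_Q_fixed`, `word_1_np`, and `query_1_np` are illustrative and not part of the original notebook.

```python
import numpy as np

# Weight matrix copied from the printed output above (one random draw).
W_Q_fixed = np.array([[1, 1, 0, 0],
                      [0, 0, 2, 2],
                      [1, 2, 2, 2],
                      [0, 0, 2, 2]])

# First word embedding, shape (1, 4).
word_1_np = np.array([[0, 1, 1, 2]])

# The embedding's column count must equal the weight matrix's row count.
assert word_1_np.shape[1] == W_Q_fixed.shape[0]

# (1, 4) @ (4, 4) -> (1, 4)
query_1_np = word_1_np @ W_Q_fixed
print(query_1_np.shape)  # (1, 4)
print(query_1_np)        # [[1 2 8 8]]
```

Pinning the matrices this way is one option for making the later outputs reproducible; setting a random seed before drawing them would be another.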

Subsequently, the query, key, and value vectors for each word are generated by multiplying each word embedding by each of the weight matrices.

In [5]:
# generating the queries, keys and values
query_1 = tf.matmul(word_1, W_Q)
key_1 = tf.matmul(word_1, W_K)
value_1 = tf.matmul(word_1, W_V)

query_2 = tf.matmul(word_2, W_Q)
key_2 = tf.matmul(word_2, W_K)
value_2 = tf.matmul(word_2, W_V)

query_3 = tf.matmul(word_3, W_Q)
key_3 = tf.matmul(word_3, W_K)
value_3 = tf.matmul(word_3, W_V)

query_4 = tf.matmul(word_4, W_Q)
key_4 = tf.matmul(word_4, W_K)
value_4 = tf.matmul(word_4, W_V)

Checking the query, key, and value for the first word

In [6]:
query_1, key_1, value_1

(<tf.Tensor: shape=(1, 4), dtype=int32, numpy=array([[1, 2, 8, 8]], dtype=int32)>,
 <tf.Tensor: shape=(1, 4), dtype=int32, numpy=array([[3, 2, 2, 4]], dtype=int32)>,
 <tf.Tensor: shape=(1, 4), dtype=int32, numpy=array([[5, 4, 4, 5]], dtype=int32)>)

Scoring the first query vector against all key vectors

In [7]:
scores = tf.stack([
    tf.matmul(query_1, key_1, transpose_b = True),
    tf.matmul(query_1, key_2, transpose_b = True),
    tf.matmul(query_1, key_3, transpose_b = True),
    tf.matmul(query_1, key_4, transpose_b = True)]
)

scores

<tf.Tensor: shape=(4, 1, 1), dtype=int32, numpy=
array([[[55]],

       [[35]],

       [[54]],

       [[85]]], dtype=int32)>

The score values are subsequently passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, four, so the divisor is 2) to keep the gradients stable.

In [8]:
weights = tf.nn.softmax(
    tf.cast(tf.squeeze(scores), tf.float32) / tf.math.sqrt(tf.cast(tf.shape(key_1)[-1], tf.float32))
)

weights

<tf.Tensor: shape=(4,), dtype=float32, numpy=
array([3.0590218e-07, 1.3887937e-11, 1.8553905e-07, 9.9999952e-01],
      dtype=float32)>
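To see what the scaling does, the sketch below applies a softmax to the four scores for word_1 with and without the division by the square root of the key dimensionality. It is a NumPy illustration rather than the notebook's TensorFlow code; the helper `softmax` and the names `scores_np`, `d_k`, `unscaled`, and `scaled` are assumptions introduced here.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores_np = np.array([55.0, 35.0, 54.0, 85.0])  # scores for word_1 from above
d_k = 4                                          # dimensionality of the key vectors

unscaled = softmax(scores_np)                # softmax on the raw scores
scaled = softmax(scores_np / np.sqrt(d_k))   # softmax on the scaled scores

print(unscaled)
print(scaled)
```

Both distributions are dominated by the fourth word, but the unscaled one is even more sharply peaked. With larger key dimensions the raw dot products tend to grow, which is why the scaling is applied.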

Converting the value tensors to float32 so that they can be multiplied by the float32 weights

In [9]:
value_1 = tf.cast(value_1, tf.float32)
value_2 = tf.cast(value_2, tf.float32)
value_3 = tf.cast(value_3, tf.float32)
value_4 = tf.cast(value_4, tf.float32)

Computing the attention output as a weighted sum of the value vectors.
This attention output is for word_1.

In [10]:
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)

print(attention)

tf.Tensor([[5.0000005 6.999999  4.9999995 5.0000005]], shape=(1, 4), dtype=float32)
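The weighted sum can also be written as a single vector-matrix product. The sketch below is a NumPy illustration under the assumption that the weights and the four value vectors are copied from the notebook's outputs for this particular random draw; `weights_np`, `values_np`, and `attention_np` are illustrative names.

```python
import numpy as np

# Attention weights for word_1, copied from the softmax output above.
weights_np = np.array([3.0590218e-07, 1.3887937e-11, 1.8553905e-07, 9.9999952e-01])

# Value vectors of the four words, one per row (word_i @ W_V for this draw).
values_np = np.array([[5.0, 4.0, 4.0, 5.0],
                      [5.0, 4.0, 3.0, 5.0],
                      [6.0, 6.0, 3.0, 6.0],
                      [5.0, 7.0, 5.0, 5.0]])

# sum_i weights[i] * values[i] is the same as weights @ values.
attention_np = weights_np @ values_np
print(attention_np)
```

Because almost all of the weight falls on the fourth word, the result is very close to that word's value vector.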


For faster processing, the same calculations can be implemented in matrix form to generate an attention output for all four words in one go.

First, remove the leading dimension of size one from the stacked embeddings, then generate the queries, keys, and values

In [11]:
word_emb_stack

<tf.Tensor: shape=(1, 4, 4), dtype=int32, numpy=
array([[[0, 1, 1, 2],
        [1, 0, 0, 2],
        [1, 1, 0, 2],
        [2, 1, 1, 1]]], dtype=int32)>

In [12]:
word_emb_stack = tf.squeeze(word_emb_stack)

word_emb_stack

<tf.Tensor: shape=(4, 4), dtype=int32, numpy=
array([[0, 1, 1, 2],
       [1, 0, 0, 2],
       [1, 1, 0, 2],
       [2, 1, 1, 1]], dtype=int32)>

In [13]:
Query = word_emb_stack @ W_Q
Key = word_emb_stack @ W_K
Value = word_emb_stack @ W_V

Query, Key, Value

(<tf.Tensor: shape=(4, 4), dtype=int32, numpy=
 array([[1, 2, 8, 8],
        [1, 1, 4, 4],
        [1, 1, 6, 6],
        [3, 4, 6, 6]], dtype=int32)>,
 <tf.Tensor: shape=(4, 4), dtype=int32, numpy=
 array([[3, 2, 2, 4],
        [1, 1, 0, 4],
        [2, 2, 2, 4],
        [5, 4, 2, 7]], dtype=int32)>,
 <tf.Tensor: shape=(4, 4), dtype=int32, numpy=
 array([[5, 4, 4, 5],
        [5, 4, 3, 5],
        [6, 6, 3, 6],
        [5, 7, 5, 5]], dtype=int32)>)

Computing the scores of every query vector against every key vector

In [14]:
scores = tf.matmul(Query, Key, transpose_b = True)

scores

<tf.Tensor: shape=(4, 4), dtype=int32, numpy=
array([[55, 35, 54, 85],
       [29, 18, 28, 45],
       [41, 26, 40, 63],
       [53, 31, 50, 85]], dtype=int32)>

Calculating the weights by applying a softmax operation to each row of the scaled scores

In [15]:
weights = tf.nn.softmax(tf.cast(scores, tf.float32) / tf.math.sqrt(tf.cast(tf.shape(Key)[-1], tf.float32)))

weights

<tf.Tensor: shape=(4, 4), dtype=float32, numpy=
array([[3.0590218e-07, 1.3887937e-11, 1.8553907e-07, 9.9999952e-01],
       [3.3528148e-04, 1.3702189e-06, 2.0335850e-04, 9.9946004e-01],
       [1.6701253e-05, 9.2372021e-09, 1.0129822e-05, 9.9997318e-01],
       [1.1253516e-07, 1.8795284e-12, 2.5109987e-08, 9.9999988e-01]],
      dtype=float32)>

Finally, computing the attention output as a weighted sum of the value vectors

In [16]:
attention = tf.matmul(weights, tf.cast(Value, tf.float32))

attention

<tf.Tensor: shape=(4, 4), dtype=float32, numpy=
array([[5.0000005, 6.999999 , 4.9999995, 5.0000005],
       [5.0002036, 6.998787 , 4.9992557, 5.0002036],
       [5.0000105, 6.99994  , 4.9999633, 5.0000105],
       [5.       , 6.9999995, 5.       , 5.       ]], dtype=float32)>
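The matrix-form steps can be collected into a single function. The following is a minimal NumPy sketch rather than the notebook's TensorFlow code; the function name `scaled_dot_product_attention` and its argument names are assumptions made for illustration, and the weight matrices are the ones printed earlier for this particular random draw.

```python
import numpy as np

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Return the attention output for every row (word) of x."""
    # Project the embeddings into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Score every query against every key and scale by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ v

x = np.array([[0, 1, 1, 2],
              [1, 0, 0, 2],
              [1, 1, 0, 2],
              [2, 1, 1, 1]], dtype=float)
w_q = np.array([[1, 1, 0, 0], [0, 0, 2, 2], [1, 2, 2, 2], [0, 0, 2, 2]], dtype=float)
w_k = np.array([[1, 1, 0, 2], [1, 1, 2, 0], [2, 1, 0, 2], [0, 0, 0, 1]], dtype=float)
w_v = np.array([[1, 2, 1, 1], [1, 2, 0, 1], [0, 0, 2, 0], [2, 1, 1, 2]], dtype=float)

out = scaled_dot_product_attention(x, w_q, w_k, w_v)
print(np.round(out, 4))
```

With these inputs the result should agree with the TensorFlow output above up to floating-point precision, since NumPy computes in float64 here while the notebook uses float32.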