##### Self-attention addresses a major limitation of RNNs: they struggle to process long texts and documents. RNNs also suffer from vanishing and exploding gradients, and often need many training steps to reach a local or global minimum
Ref: https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
Self-attention gives the model access to all sequence elements at each time step. It lets the model weigh the importance of every element in the sequence and use those weights when generating the output.
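As a minimal sketch of that weighting idea, the toy example below uses plain Python and made-up numbers: the `scores` and `values` lists are hypothetical, not taken from the notebook. One token's relevance scores against a 4-token sequence are turned into weights with a softmax, and the output is the weighted sum of the values.

```python
import math


def softmax(scores):
    # Subtract the maximum first for numerical stability.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical relevance scores of one token against a 4-token sequence.
scores = [2.0, 0.5, 0.1, 1.0]
weights = softmax(scores)

# Hypothetical scalar "values", one per token in the sequence.
values = [10.0, 20.0, 30.0, 40.0]
output = sum(w * v for w, v in zip(weights, values))

print([round(w, 3) for w in weights])
print(round(output, 3))
```

The weights sum to 1, and the token with the highest score contributes most to the output.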

There are several variants of attention; scaled dot-product attention is a popular one.
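For reference, scaled dot-product attention is usually written as below, where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the key vectors:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```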

In [84]:
import torch
import math
import torch.nn.functional as F

In [76]:
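# build a word-to-index dictionary for the sentence
# (repeated words such as "the" and "to" keep their last index, so the ids have gaps)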
sentence = "history can be misleading guide to the future of the economy and stock market because it does not account for structural changes that are relevant to today's world"
#sentence = 'Life is short, eat dessert first'
dc = {s:i for i,s in enumerate(sorted(sentence.replace(",","").split()))}
print(dc)

{'account': 0, 'and': 1, 'are': 2, 'be': 3, 'because': 4, 'can': 5, 'changes': 6, 'does': 7, 'economy': 8, 'for': 9, 'future': 10, 'guide': 11, 'history': 12, 'it': 13, 'market': 14, 'misleading': 15, 'not': 16, 'of': 17, 'relevant': 18, 'stock': 19, 'structural': 20, 'that': 21, 'the': 23, 'to': 25, "today's": 26, 'world': 27}
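The printed dictionary skips some ids (for example 22 and 24) because repeated words are enumerated more than once. This is harmless here, since the embedding table is much larger than the vocabulary. If contiguous ids are preferred, one possible variant is to de-duplicate before enumerating. The names `words` and `vocab` are introduced only for this sketch.

```python
sentence = ("history can be misleading guide to the future of the economy and "
            "stock market because it does not account for structural changes "
            "that are relevant to today's world")

words = sentence.replace(",", "").split()

# De-duplicate before enumerating so every unique word gets a contiguous id.
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}

print(len(words), len(vocab))
print(vocab)
```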


In [77]:
# convert the sentence into a tensor of word ids

sentence_int = torch.tensor([dc[s] for s in sentence.replace(",","").split()])
#print(sentence_int)
print(sentence_int.shape)

torch.Size([28])


In [78]:
# use an embedding to encode the numerical representation of the sentence (sentence_int)
torch.manual_seed(123)
# the sentence has 28 words, but the embedding table is sized for a vocabulary of 50000; assume each word is represented by a 16-dimensional vector
embedding = torch.nn.Embedding(50000, 16)
embedded_sentence = embedding(sentence_int).detach()
print(embedded_sentence.shape)
#print(embedded_sentence)

torch.Size([28, 16])
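An embedding layer can be thought of as a lookup table: each word id selects one row of the layer's weight matrix. The sketch below checks this for a few ids. The ids in `token_ids` are arbitrary picks for illustration.

```python
import torch

torch.manual_seed(123)
embedding = torch.nn.Embedding(50000, 16)

# Arbitrary word ids chosen for illustration.
token_ids = torch.tensor([12, 5, 3])

# Calling the layer returns the rows of the weight matrix at those ids.
looked_up = embedding(token_ids).detach()
indexed = embedding.weight[token_ids].detach()

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```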


Self-attention uses 3 weight matrices: wq, wk, and wv. These weights are adjusted during training. The matrices project the inputs into the query, key, and value components of the sequence.
Each of these sequences is obtained by multiplying the embedded input by the respective weight matrix.

The query and key vectors must have the same dimension, because we need to compute the dot product between them (i.e., d_q = d_k).
The dimension of the value vector, d_v, is arbitrary; it determines the size of the output context vector.
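The dimension constraints can be checked with a small sketch. The tensors below are random stand-ins, and the sizes 20 and 7 are arbitrary choices for illustration: the product of queries and keys only works when d_q equals d_k, while d_v can be anything and sets the output width.

```python
import torch

Q = torch.rand(28, 24)       # 28 tokens, d_q = 24
K_same = torch.rand(28, 24)  # d_k = 24, matches d_q
K_diff = torch.rand(28, 20)  # d_k = 20, does not match d_q

print((Q @ K_same.T).shape)

# A mismatched key dimension makes the matrix product fail.
try:
    Q @ K_diff.T
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)

# d_v is free to differ from d_q and d_k.
V = torch.rand(28, 7)
weights = torch.softmax(Q @ K_same.T, dim=-1)
print((weights @ V).shape)
```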

Query, key, and value are produced by 3 linear layers. A search analogy helps:
Query = the text being searched for.
Key = the title/key of an article or video.
Value = the content inside the artifact.

The 3 matrices together can be considered a **single attention head**.
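One way to package the three projections as a single head is a small module built from three bias-free linear layers. This is only a sketch: the class name `SingleHeadSelfAttention` is hypothetical, the input is a random stand-in for the embedded sentence, and the sizes (16, 24, 28) are assumptions chosen to mirror the notebook.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadSelfAttention(nn.Module):
    """Hypothetical name: one attention head built from three linear layers."""

    def __init__(self, d_in, d_k, d_v):
        super().__init__()
        self.d_k = d_k
        # bias=False makes each layer a plain matrix product.
        self.query = nn.Linear(d_in, d_k, bias=False)
        self.key = nn.Linear(d_in, d_k, bias=False)
        self.value = nn.Linear(d_in, d_v, bias=False)

    def forward(self, x):
        Q, K, V = self.query(x), self.key(x), self.value(x)
        scores = Q @ K.T / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        return weights @ V


torch.manual_seed(123)
x = torch.randn(28, 16)  # random stand-in for the embedded sentence
head = SingleHeadSelfAttention(d_in=16, d_k=24, d_v=28)
context = head(x)
print(context.shape)
```

Multi-head attention would run several such heads in parallel and concatenate their outputs; that is outside the scope of this sketch.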

In [79]:
d = embedded_sentence.shape[1]
d_q, d_k, d_v = 24, 24, 28
# initialize the weight matrices
wq = torch.nn.Parameter(torch.rand(d, d_q))
wk = torch.nn.Parameter(torch.rand(d, d_k))
wv = torch.nn.Parameter(torch.rand(d, d_v))

print(wq.shape)

torch.Size([16, 24])


In [83]:
# project the input into queries, keys, and values, then compute the unnormalized attention weights with matmul (@)
x = embedded_sentence
Q = x @ wq
K = x @ wk
V = x @ wv

#print(Q.shape)
#print(K.shape)
#print(V.shape)
# dot-product similarity between every query and every key (not length-normalized, so this is not cosine similarity)
omega = Q @ K.T
print(omega.shape)

torch.Size([28, 28])
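Note that the scores in `omega` are raw dot products. Cosine similarity would also divide by the vector lengths. A small sketch of the difference, using random stand-in vectors `q` and `k` introduced only for this example:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.rand(24)  # random stand-in for one query vector
k = torch.rand(24)  # random stand-in for one key vector

dot = q @ k
# Cosine similarity is the dot product divided by both vector norms.
cosine_manual = dot / (q.norm() * k.norm())
cosine_builtin = F.cosine_similarity(q, k, dim=0)

print(dot.item())
print(cosine_manual.item(), cosine_builtin.item())
```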


In [92]:
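# scale the scores by sqrt(d_k) so that large dot products do not saturate the softmax
attention_score = omega / math.sqrt(d_k)
# normalize each row so that the weights for one query sum to 1
attention_weights = F.softmax(attention_score, dim=1)
# each context vector is a weighted sum of the value vectors
context_vector = attention_weights @ V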
print(context_vector.shape)
#print(context_vector)
print(" the input dimension is {}, but the output dimension is {}".format(embedded_sentence.shape[1], context_vector.shape[1]))

torch.Size([28, 28])
 the input dimension is 16, but the output dimension is 28
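To see what the division by sqrt(d_k) does, the sketch below compares the softmax of raw and scaled scores. It uses random stand-ins for the projected queries and keys rather than the notebook's `Q` and `K`, so the printed numbers are only indicative.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(123)
d_k = 24
Q = torch.rand(28, d_k)  # random stand-in for the projected queries
K = torch.rand(28, d_k)  # random stand-in for the projected keys

scores = Q @ K.T
unscaled = F.softmax(scores, dim=-1)
scaled = F.softmax(scores / math.sqrt(d_k), dim=-1)

# Every row is a probability distribution over the 28 positions.
print(torch.allclose(scaled.sum(dim=-1), torch.ones(28)))

# Average of the largest weight per row: scaling spreads attention more evenly.
print(unscaled.max(dim=-1).values.mean().item())
print(scaled.max(dim=-1).values.mean().item())
```

Dividing the scores by a constant greater than 1 flattens each softmax row, so the largest weight in a row can only shrink or stay the same.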
