In [1]:
import torch

<div class="alert alert-block alert-info">
<b>Attention Mechanism</b> 
<p>    
RNNs were widely used for natural language processing.
They worked well for word-to-word transformation, but they have a limitation.

When the context gets long, an RNN struggles to capture long-range dependencies, so it cannot
make sense of the overall context of a sentence or paragraph.

Hence the need for an attention mechanism, which captures long-range dependencies across a sentence or paragraph.
</div>

In [None]:
sentence = "Your journey starts with one step"

<div class="alert alert-block alert-info">
<b>Attention Mechanism</b> 
<p>    
Let's take a look at the example sentence above: "Your journey starts with one step".
After token embedding, the tokens (words) of this sentence are projected into a higher-dimensional vector space.

The embeddings capture the semantics and meaning of the individual words.
However, a problem remains: they do not capture the relationship of a particular word to the rest of the sentence or paragraph.

This is why we need an attention mechanism: it relates each token to every other token, that is, it decides how much attention to pay to the other tokens with respect to the one in question.
Example: take the word "journey". How is it related to all the other words in the sentence, and how much weight should be assigned to each of them when "journey" is the word in context?

As part of the attention mechanism, we create a context vector for each token (token embedding).
</div>

In [4]:
# Let's treat the tensor below as the embedding vectors for the sentence "Your journey starts with one step".
# Its shape is (number of tokens, embedding size),
# but for simplicity we use a much smaller embedding dimension than a real model would.

inputs = torch.tensor(
 [[0.43, 0.15, 0.89],  # Your
 [0.55, 0.87, 0.66],  # journey
 [0.57, 0.85, 0.64],  # starts
 [0.22, 0.58, 0.33],  # with
 [0.77, 0.25, 0.10],  # one
 [0.05, 0.80, 0.55]] # step
)

In [3]:
inputs.shape

torch.Size([6, 3])

<div class="alert alert-block alert-info">
<b>Context vector</b> 
<p>    
The first step in creating the context vector for each word in the sentence is to calculate the intermediate attention scores.

The word for which the context vector is generated is called the "QUERY".
</div>

In [11]:
# We will initially create the context vector for only one word, "journey"
# So "journey" is our QUERY


query=inputs[1]

<div class="alert alert-block alert-info">
<b>Attention Score</b> 
<p>    
The next step is to calculate the attention scores using the dot product.
It is a way to measure how closely the query is related to the other tokens.

The dot product quantifies how much two vectors are aligned.

The higher the dot product, the more similar the tokens are.
The lower the dot product, the less related or similar they are to each other.
</p>
</div>

In [15]:
# We create an empty tensor with one entry per token
# We want to calculate the attention score of the "journey" token with respect to every token (including itself)

attention_score_token2 = torch.empty(inputs.shape[0])
attention_score_token2.shape

torch.Size([6])

In [18]:
# These attention scores are calculated only for the 2nd token, to find out
# how this token is related to the other tokens.

for i , token_vector in enumerate(inputs):
    attention_score_token2[i] = torch.dot(token_vector , query)

print(attention_score_token2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
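
The loop above computes one dot product per token. As a minimal sketch, the same scores can be obtained with a single matrix-vector product. The names `loop_scores` and `vectorized_scores` are introduced here only for illustration and are not part of the notebook.

```python
import torch

# Same toy embeddings as in the notebook
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]  # "journey"

# Loop version: one dot product per token, stacked into a 1-D tensor
loop_scores = torch.stack([torch.dot(token_vector, query) for token_vector in inputs])

# Vectorized version: (6, 3) @ (3,) gives a tensor of shape (6,)
vectorized_scores = inputs @ query

print(vectorized_scores)
print(torch.allclose(loop_scores, vectorized_scores))
```

Both versions compute the same six numbers. The matrix form is what later generalizes to all tokens at once.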


<div class="alert alert-block alert-info">
<b>Normalization of Attention Score</b> 
<p>    
Normalization rescales the raw attention scores, some of which are greater than 1, so that each lies between 0 and 1 and together they sum to 1.

This is required so that they can be interpreted as probabilities, which are easier for an LLM to work with.

It means we can say: give this percentage of attention to this token, and so on.

We use softmax for this: first a hand-written version, then torch.softmax.
</p>
</div>

In [25]:
def my_softmax(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

weights = my_softmax(attention_score_token2)
print(weights)
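
A hand-written softmax like `my_softmax` works for the small scores in this notebook, but it can overflow when the scores are large. The sketch below illustrates this under the assumption of float32 inputs. The helper names `naive_softmax` and `stable_softmax` and the example values are hypothetical. Subtracting the maximum before exponentiating is a common way to avoid the overflow.

```python
import torch

def naive_softmax(x):
    # Same formula as my_softmax in the notebook
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def stable_softmax(x):
    # Shifting by the maximum leaves the result unchanged mathematically,
    # but keeps the exponentials from overflowing
    shifted = x - x.max()
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=0)

# Illustrative large scores, not taken from the notebook
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

naive_result = naive_softmax(large_scores)    # exp overflows to inf, and inf / inf is nan
stable_result = stable_softmax(large_scores)  # finite values that sum to 1

print(naive_result)
print(stable_result)
print(torch.softmax(large_scores, dim=0))
```

This is one reason to prefer `torch.softmax` over a hand-written version in practice.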

In [33]:
# Now we can read the attention scores as probabilities:
# if we sum them, the total is 1

attention_score_token2_norm=torch.softmax(attention_score_token2, dim=0)
print("Sum:", attention_score_token2_norm.sum())
print("Attention Scores:" ,attention_score_token2_norm)
print("Probabilities in % : {}".format(attention_score_token2_norm *100))

Sum: tensor(1.)
Attention Scores: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Probabilities in % : tensor([13.8548, 23.7891, 23.3274, 12.3992, 10.8182, 15.8114])


<div class="alert alert-block alert-info">
<b>Calculate the Context Vector</b> 
<p>    
Now we multiply each token embedding vector by its corresponding attention weight and then sum the resulting vectors.
</p>
</div>

In [41]:
# This context vector is only for the 2nd token
# It is the enriched token embedding vector for the 2nd token
# Each token contributes to this enriched vector in proportion to its attention weight

query = inputs[1]  # 2nd Token is the query = journey
print(query.shape)
context_vector_for_query = torch.zeros(query.shape)

torch.Size([3])


In [42]:
# this computation is only for 2nd Token 

for i , tokens_vector in enumerate(inputs):
    context_vector_temp=attention_score_token2_norm[i]*tokens_vector
    context_vector_for_query=context_vector_for_query+context_vector_temp

print(context_vector_for_query)

tensor([0.4419, 0.6515, 0.5683])
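
The weighted sum in the loop above is a vector-matrix product in disguise. The following sketch recomputes the context vector for "journey" both ways and compares them. The names `attn_weights`, `context_loop` and `context_matmul` are illustrative and do not appear in the notebook.

```python
import torch

# Same toy embeddings as in the notebook
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]  # "journey"

# Attention weights for the query: dot-product scores, then softmax
attn_weights = torch.softmax(inputs @ query, dim=0)

# Loop version: scale each token vector by its weight and accumulate
context_loop = torch.zeros(query.shape)
for weight, token_vector in zip(attn_weights, inputs):
    context_loop = context_loop + weight * token_vector

# Vectorized version: (6,) @ (6, 3) gives a tensor of shape (3,)
context_matmul = attn_weights @ inputs

print(context_matmul)
print(torch.allclose(context_loop, context_matmul))
```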


<div class="alert alert-block alert-info">
<b>Context Vector for all tokens</b> 
<p>    
We now generate the context vectors for all the tokens:

1. Compute the attention scores for all tokens
2. Normalize the attention scores using softmax
3. Multiply each token embedding by its normalized attention score (weight) to scale it
4. Sum the scaled vectors to get the context vector of each token
</p>
</div>

In [44]:
# Calculate the Attention score

attn_scores_all_tokens = inputs @ inputs.T
print(attn_scores_all_tokens)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [46]:
# Normalize using softmax

attn_scores_all_tokens_norms= torch.softmax(attn_scores_all_tokens , dim=1)
print(attn_scores_all_tokens_norms)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
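
The choice of `dim=1` matters: each row holds the weights for one query token, so softmax must normalize along the row. As a quick sanity check (a sketch, with `row_sums` and `col_sums` being names introduced here), the rows should sum to 1 while the columns in general do not.

```python
import torch

# Same toy embeddings as in the notebook
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

attn_scores = inputs @ inputs.T
# For a 2-D tensor, dim=1 and dim=-1 both refer to the last dimension
attn_weights = torch.softmax(attn_scores, dim=1)

row_sums = attn_weights.sum(dim=1)  # one sum per query token
col_sums = attn_weights.sum(dim=0)  # not normalized, shown for contrast

print("Row sums:", row_sums)
print("Column sums:", col_sums)
```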


In [54]:
# Calculate the context vectors for all tokens using matrix multiplication
# The result has the same shape as the original token embedding matrix

context_vector_all_tokens= attn_scores_all_tokens_norms @ inputs

print(context_vector_all_tokens)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
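
The three cells above can be folded into one small function. This is only a sketch of the simplified attention shown in this notebook, with no trainable parameters. The function name `simple_self_attention` is hypothetical.

```python
import torch

def simple_self_attention(x):
    """Return one context vector per row of x (hypothetical helper).

    x has shape (number of tokens, embedding size).
    """
    scores = x @ x.T                         # pairwise dot-product scores
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ x                       # weighted sum of token vectors

# Same toy embeddings as in the notebook
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

context_vectors = simple_self_attention(inputs)
print(context_vectors.shape)
print(context_vectors[1])  # context vector for "journey"
```

The second row should agree with the context vector computed earlier for "journey" alone, which is a useful cross-check between the loop-based and matrix-based versions.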
