#### Causal Self-Attention

We need to mask the upper triangle of the attention scores (or attention weights) so that a token cannot see the context of future tokens. There are two ways to do this:

* Mask the attention weights after the softmax is applied: set the upper triangle of the attention weight matrix to 0, then perform another round of normalisation.
* Mask the upper triangle of the attention score matrix with negative infinity, then apply the softmax normalisation.

The 2nd method is more efficient, as it avoids a second normalisation step and uses less computation.

**Efficient method**

Attention scores --> Upper triangle -ve infinity mask --> softmax 
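
A short derivation, added here as a sketch, shows why masking before or after the softmax gives identical weights. Write $s_{ij}$ for the (scaled) attention score of query $i$ against key $j$, $a_{ij}$ for its softmax weight, and $n$ for the number of tokens; these symbols are introduced for illustration and do not appear in the code.

$$
a_{ij} = \frac{e^{s_{ij}}}{\sum_{k=1}^{n} e^{s_{ik}}},
\qquad
\frac{a_{ij}}{\sum_{k \le i} a_{ik}} = \frac{e^{s_{ij}}}{\sum_{k \le i} e^{s_{ik}}}
\quad \text{for } j \le i
$$

The right-hand side is exactly the softmax of row $i$ after setting $s_{ij} = -\infty$ for $j > i$, because $e^{-\infty} = 0$ removes those terms from the denominator. Masking the scores first therefore reaches the same weights with a single normalisation.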

In [3]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np

In [11]:
inputs = torch.tensor(
    [[0.43,0.15,0.89], # Your
    [0.55, 0.87, 0.66], # journey
    [0.57, 0.85, 0.64], # starts
    [0.22, 0.58, 0.33], # with
    [0.77, 0.25, 0.10], # one
    [0.05, 0.80, 0.55]] # step
)

Defining elements

* A: The second input element
* B: Input embedding size, d_in=3
* C: Output embedding size, d_out=2 

In [12]:
d_in = inputs.shape[1]
print(f"Input shape is {d_in}")
d_out = 2
print(f"Output shape is {d_out}")

# Initialising the weight matrices

"""
requires_grad is set to False to reduce clutter. If we were to use the weight matrices for training, we would set it to True
so that the matrices are updated during model training.
"""
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad = False) 
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad = False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad = False)

print(f"Query weights:{W_query}")
print(" ")
print(f"Key weights:{W_key}")
print(" ")
print(f"Value weights:{W_value}")
print(" ")

"""
For GPT-like models the input and output dimensions are usually the same. For demonstration we use different dimensions.
"""
x_2 = inputs[1]
print(f"The tensor value for the second token is {x_2}")


Input shape is 3
Output shape is 2
Query weights:Parameter containing:
tensor([[0.2961, 0.5166],
        [0.2517, 0.6886],
        [0.0740, 0.8665]])
 
Key weights:Parameter containing:
tensor([[0.1366, 0.1025],
        [0.1841, 0.7264],
        [0.3153, 0.6871]])
 
Value weights:Parameter containing:
tensor([[0.0756, 0.1966],
        [0.3164, 0.4017],
        [0.1186, 0.8274]])
 
The tensor value for the second token is tensor([0.5500, 0.8700, 0.6600])


In [13]:
"""
We get a 2-dimensional query, key and value vector for the 2nd token. Even though our temporary goal is to compute only the one context vector z(2),
we still require the key and value vectors for all the inputs, as these are needed to calculate the attention weights
with respect to the query q(2).
"""
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print(f"The query vector for the 2nd token is {query_2}")
print(f"The key vector for the 2nd token is {key_2}")
print(f"The value vector for the 2nd token is {value_2}")

The query vector for the 2nd token is tensor([0.4306, 1.4551])
The key vector for the 2nd token is tensor([0.4433, 1.1419])
The value vector for the 2nd token is tensor([0.3951, 1.0037])


In [14]:
# Computing the query, key and value vectors

queries = inputs @ W_query
keys = inputs @ W_key
values = inputs @ W_value

# We have projected the 6 input tokens from a 3D space onto a 2D embedding space.
print(f"Shape of queries matrix: {queries.shape}")
print(f"Shape of keys matrix: {keys.shape}")
print(f"Shape of values matrix: {values.shape}")
print(" ")

# Computing the attention score for the 2nd token
keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
print(f"Attention score between the 2nd token and the 2nd token: {attn_score_22}")
print(" ")

# Generalising the computation to get all attention scores by matrix multiplication for the 2nd token
attn_score_2 = query_2 @ keys.T #all attention scores for the 2nd token(query)
print(f"Attention score for the entire 2nd token: {attn_score_2}")
print(" ")

# Entire attention score matrix
attn_score = queries @ keys.T
print(f"Entire attention score matrix: {attn_score}")

Shape of queries matrix: torch.Size([6, 2])
Shape of keys matrix: torch.Size([6, 2])
Shape of values matrix: torch.Size([6, 2])
 
Attention score between the 2nd token and the 2nd token: 1.8523844480514526
 
Attention score for the entire 2nd token: tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])
 
Entire attention score matrix: tensor([[0.9231, 1.3545, 1.3241, 0.7910, 0.4032, 1.1330],
        [1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440],
        [1.2544, 1.8284, 1.7877, 1.0654, 0.5508, 1.5238],
        [0.6973, 1.0167, 0.9941, 0.5925, 0.3061, 0.8475],
        [0.6114, 0.8819, 0.8626, 0.5121, 0.2707, 0.7307],
        [0.8995, 1.3165, 1.2871, 0.7682, 0.3937, 1.0996]])


The next step is to calculate the attention weights by scaling the attention scores and applying a softmax. For causal attention we first need to mask the upper triangle with negative infinity. We can reuse the class SelfAttention_v2 from the multi_head_attention.ipynb notebook.
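
To make the scaling step concrete, the sketch below applies the softmax to a single row of scores with and without the division by the square root of the key dimension. The values are copied from the second row of the attention score matrix printed above, rounded to four decimals; the names `row` and `d_k` are illustrative and not part of the notebook.

```python
import torch

# Second row of the attention score matrix printed earlier (rounded values).
row = torch.tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])
d_k = 2  # key dimension, equal to d_out in this notebook

unscaled = torch.softmax(row, dim=-1)
scaled = torch.softmax(row / d_k**0.5, dim=-1)

# Both results are probability distributions over the six tokens.
print(unscaled.sum(), scaled.sum())
# Dividing by sqrt(d_k) flattens the distribution, so the largest weight shrinks.
print(unscaled.max(), scaled.max())
```

With `d_k = 2` the effect is small. The scaling matters more for large key dimensions, where unscaled dot products tend to grow and push the softmax towards near one-hot outputs.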

In [15]:
inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your
     [0.55, 0.87, 0.66], # journey
     [0.57, 0.85, 0.64], # starts
     [0.22, 0.58, 0.33], # with
     [0.77, 0.25, 0.10], # one
     [0.05, 0.80, 0.55]] # step
)

d_in = 3
d_out = 2

In [16]:
class SelfAttention_v2(nn.Module):
    
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias = qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias = qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias = qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x) 

        attn_scores = queries @ keys.T
        attn_weight = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim = -1)

        context_vec = attn_weight @ values
        return context_vec
        
sa_v2 = SelfAttention_v2(d_in, d_out)

In [17]:
queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim = 1 )
print(attn_weights)

tensor([[0.1362, 0.1730, 0.1736, 0.1713, 0.1792, 0.1666],
        [0.1359, 0.1730, 0.1735, 0.1716, 0.1790, 0.1670],
        [0.1366, 0.1729, 0.1734, 0.1714, 0.1788, 0.1669],
        [0.1493, 0.1701, 0.1704, 0.1697, 0.1732, 0.1674],
        [0.1589, 0.1690, 0.1692, 0.1667, 0.1712, 0.1649],
        [0.1408, 0.1715, 0.1718, 0.1717, 0.1758, 0.1684]],
       grad_fn=<SoftmaxBackward0>)


In [18]:
# 1st Method --> Setting the attention weights above the diagonal to zero and then renormalising

# We can use PyTorch tril function to create a mask where the values above the diagonal are zero
context_length = attn_scores.shape[0]
print(torch.ones(context_length, context_length))
print(" ")

mask_simple = torch.tril(torch.ones(context_length, context_length)) # Masking upper diagonal with zero
print(mask_simple)
print(" ")

# Multiplying the attention weights by the mask to zero out the values above the diagonal.
masked_simple = attn_weights * mask_simple
print(masked_simple)
print(" ")

# The elements above the diagonal are zeroed out, but each row must be renormalised so that it sums to 1
row_sums = masked_simple.sum(dim =1, keepdim = True)
masked_simple_norm = masked_simple/row_sums
print(masked_simple_norm)

tensor([[1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.]])
 
tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])
 
tensor([[0.1362, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1359, 0.1730, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1366, 0.1729, 0.1734, 0.0000, 0.0000, 0.0000],
        [0.1493, 0.1701, 0.1704, 0.1697, 0.0000, 0.0000],
        [0.1589, 0.1690, 0.1692, 0.1667, 0.1712, 0.0000],
        [0.1408, 0.1715, 0.1718, 0.1717, 0.1758, 0.1684]],
       grad_fn=<MulBackward0>)
 
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4400, 0.5600, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2830, 0.3580, 0.3590, 0.0000, 0.0000, 0.0000],
        [0.2264, 0.2579, 0.258

In [19]:
# 2nd Method --> Setting attention scores above the diagonal to -ve infinity and then applying scaling and softmax normalisation to get the attention weights
print(attn_scores)  
print(" ")

mask = torch.triu(torch.ones(context_length,context_length), diagonal = 1 )
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
print(" ")

# Applying softmax to the masked matrix turns the -ve infinity entries into 0s and makes every row sum to 1
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim = 1)
print(attn_weights) 

# Both the methods give us the same answer. But the 2nd method is more efficient than the 1st one.

tensor([[-0.2327,  0.1055,  0.1098,  0.0913,  0.1549,  0.0521],
        [-0.2396,  0.1015,  0.1057,  0.0902,  0.1501,  0.0518],
        [-0.2323,  0.1004,  0.1045,  0.0885,  0.1481,  0.0507],
        [-0.1344,  0.0502,  0.0523,  0.0470,  0.0753,  0.0272],
        [-0.0349,  0.0520,  0.0538,  0.0331,  0.0708,  0.0174],
        [-0.2142,  0.0650,  0.0679,  0.0668,  0.1004,  0.0395]],
       grad_fn=<MmBackward0>)
 
tensor([[-0.2327,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-0.2396,  0.1015,    -inf,    -inf,    -inf,    -inf],
        [-0.2323,  0.1004,  0.1045,    -inf,    -inf,    -inf],
        [-0.1344,  0.0502,  0.0523,  0.0470,    -inf,    -inf],
        [-0.0349,  0.0520,  0.0538,  0.0331,  0.0708,    -inf],
        [-0.2142,  0.0650,  0.0679,  0.0668,  0.1004,  0.0395]],
       grad_fn=<MaskedFillBackward0>)
 
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4400, 0.5600, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2830, 0.3580, 0.3590, 0.0000, 0
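
The claim that both methods agree can also be checked numerically. The following is a minimal, self-contained sketch on random scores rather than the notebook's `attn_scores`; the seed, `n`, and `d_k` are arbitrary assumptions, and `float("-inf")` is used in place of `-torch.inf` only to avoid depending on the PyTorch version.

```python
import torch

torch.manual_seed(0)
n, d_k = 6, 2               # context length and key dimension (illustrative)
scores = torch.randn(n, n)  # stand-in for queries @ keys.T

# Method 1: softmax first, zero the upper triangle, renormalise each row.
weights = torch.softmax(scores / d_k**0.5, dim=-1)
lower = torch.tril(torch.ones(n, n))
method_1 = weights * lower
method_1 = method_1 / method_1.sum(dim=-1, keepdim=True)

# Method 2: fill the upper triangle with -inf, then apply a single softmax.
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
masked = scores.masked_fill(future, float("-inf"))
method_2 = torch.softmax(masked / d_k**0.5, dim=-1)

# Compare the two weight matrices element-wise.
print(torch.allclose(method_1, method_2, atol=1e-6))
```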

In [27]:
# Masking additional weights with dropout 
"""
Using a 50% dropout rate masks half of the attention weights. When training GPT models, a lower dropout rate (0.1 or 0.2) is typically preferred.
Applying PyTorch's dropout implementation to a 6x6 tensor consisting of ones.
"""

example = torch.ones(6,6)
print(example)
print(" ")

"""The dropout rate holds on average; each row does not necessarily have exactly 50% of its entries zeroed.
With a dropout rate of 0.5, the elements that are not set to zero are scaled by 1 / (1 - 0.5) = 2. The scaling maintains the overall balance of the
attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.
"""
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)
print(f"Dropout example: {dropout(example)}\n")

# Applying dropout to the attention weight matrix
torch.manual_seed(123)
print(f"Dropout example for the attention weights: {dropout(attn_weights)}")

tensor([[1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.]])
 
Dropout example: tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])

Dropout example for the attention weights: tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5659, 0.7160, 0.7181, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.5159, 0.5167, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4047, 0.0000, 0.3993, 0.0000, 0.0000],
        [0.0000, 0.3430, 0.3437, 0.3434, 0.3516, 0.0000]],
       grad_fn=<MulBackward0>)
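
Putting the pieces together, the causal mask and dropout can be folded into the attention module itself. The class below is a sketch under assumptions, not code from this notebook: the name `CausalAttention`, the `context_length` and `dropout` arguments, and the use of `register_buffer` for the mask are illustrative choices. Like `SelfAttention_v2` above, it handles a single unbatched sequence of shape `(num_tokens, d_in)`.

```python
import torch
import torch.nn as nn


class CausalAttention(nn.Module):
    """Single-head causal self-attention (illustrative sketch)."""

    def __init__(self, d_in, d_out, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # True above the diagonal marks the future positions to hide.
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        num_tokens = x.shape[0]
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        attn_scores = queries @ keys.T
        # Hide future tokens, then scale and normalise with one softmax.
        attn_scores = attn_scores.masked_fill(
            self.mask[:num_tokens, :num_tokens], float("-inf")
        )
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)  # active only in training mode
        return attn_weights @ values


torch.manual_seed(123)
x = torch.rand(6, 3)  # six tokens, d_in = 3 (random stand-in for `inputs`)
ca = CausalAttention(d_in=3, d_out=2, context_length=6, dropout=0.5)
ca.eval()  # disable dropout so the output is deterministic

out = ca(x)
# Changing the last token must not affect the context vectors of earlier tokens.
x_changed = x.clone()
x_changed[-1] += 1.0
out_changed = ca(x_changed)
print(out.shape)
print(torch.allclose(out[:-1], out_changed[:-1]))
```

Calling `ca.train()` re-enables dropout, so repeated forward passes on the same input would then generally give different outputs.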
