# Lecture 4 Self Attention - Masked Attention
## Example 1

#### Masked Attention

Masked attention is a crucial mechanism in transformer architectures, particularly in tasks like language modeling, where it's essential to prevent the model from accessing future tokens during training. This ensures that the predictions for a particular position depend only on the known outputs at positions before it.
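
As a compact way to state the idea, masked attention is often written as ordinary scaled dot-product attention with an additive mask. The symbols $d_k$ (the key dimension) and $M$ (the mask matrix) are notation assumed here for illustration; in the NumPy example below, $d_k$ corresponds to `embedding_dim` and $-\infty$ is approximated by subtracting `1e9`.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}
$$

Row $i$ of the softmax therefore places zero weight on every position $j > i$.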

Below are two Python examples demonstrating masked attention:
- From Scratch Implementation Using NumPy
- Using PyTorch's Built-in Modules

#### 1. Masked Attention from Scratch Using NumPy
This example illustrates how to implement masked attention manually using NumPy. It demonstrates how to prevent each position in the sequence from attending to future positions by applying a mask to the attention scores.

**Step-by-Step Explanation**
- Initialize Parameters: 
    - Define the batch size, sequence length, and embedding dimension (queries, keys, and values all share the embedding dimension in this example).
    - Initialize random weights for linear transformations.
- Create Query, Key, and Value Matrices:
    - Transform the input data into queries (Q), keys (K), and values (V) using the initialized weights.
- Compute Scaled Dot-Product Attention with Masking:
    - Calculate attention scores as the dot product of Q and the transpose of K, scaled by the square root of the embedding dimension.
    - Apply a mask to prevent attention to future positions by subtracting a large value (1e9) from those scores.
    - Apply softmax to the masked scores and multiply by V to obtain the context.
- Final Linear Transformation:
    - Apply a final linear transformation to produce the output.
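
To see the masking step in isolation, here is a minimal sketch on a toy 3x3 score matrix. The all-zero scores and the `toy_` names are assumptions chosen so the result can be checked by hand; without the mask, every row would be uniform.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

# Hypothetical scores: all zeros, so unmasked attention would be uniform
toy_scores = np.zeros((3, 3))

# Ones strictly above the diagonal mark the "future" positions
toy_mask = np.triu(np.ones((3, 3)), k=1)

# Subtracting a large value drives the masked entries to zero after softmax
toy_weights = softmax(toy_scores - toy_mask * 1e9, axis=-1)
print(toy_weights)
```

Row 0 can only see position 0, row 1 splits its weight evenly over positions 0 and 1, and row 2 spreads it evenly over all three positions.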

```python
import numpy as np

def softmax(x, axis=-1):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

# Parameters
np.random.seed(42)  # For reproducibility
batch_size = 1
seq_length = 4
embedding_dim = 8

# Random input
X = np.random.rand(batch_size, seq_length, embedding_dim)

# Initialize weights
W_q = np.random.rand(embedding_dim, embedding_dim)
W_k = np.random.rand(embedding_dim, embedding_dim)
W_v = np.random.rand(embedding_dim, embedding_dim)
W_o = np.random.rand(embedding_dim, embedding_dim)

# Linear projections
Q = X @ W_q  # Shape: (batch_size, seq_length, embedding_dim)
K = X @ W_k
V = X @ W_v

# Scaled Dot-Product Attention with Masking
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(embedding_dim)  # (batch_size, seq_length, seq_length)

# Create mask: prevent attention to future positions
mask = np.triu(np.ones((seq_length, seq_length)), k=1)  # Upper triangular matrix with zeros on and below the diagonal
scores = scores - mask * 1e9  # Large negative value to mask

# Apply softmax
attention_weights = softmax(scores, axis=-1)  # (batch_size, seq_length, seq_length)

# Multiply by V
context = attention_weights @ V  # (batch_size, seq_length, embedding_dim)

# Final linear layer
output = context @ W_o  # (batch_size, seq_length, embedding_dim)

# Display Results
np.set_printoptions(precision=4, suppress=True)
print("Input X:\n", X)
print("\nAttention Weights:\n", attention_weights)
print("\nOutput after Masked Attention:\n", output)
```

```
Input X:
 [[[0.3745 0.9507 0.732  0.5987 0.156  0.156  0.0581 0.8662]
  [0.6011 0.7081 0.0206 0.9699 0.8324 0.2123 0.1818 0.1834]
  [0.3042 0.5248 0.4319 0.2912 0.6119 0.1395 0.2921 0.3664]
  [0.4561 0.7852 0.1997 0.5142 0.5924 0.0465 0.6075 0.1705]]]

Attention Weights:
 [[[1.     0.     0.     0.    ]
  [0.7894 0.2106 0.     0.    ]
  [0.7407 0.1907 0.0686 0.    ]
  [0.6808 0.182  0.0573 0.0799]]]

Output after Masked Attention:
 [[[12.0616 10.0441  8.4291  6.0198  7.4703  7.4587  7.5161  9.2008]
  [12.0264  9.9897  8.401   6.0266  7.5096  7.3866  7.5354  9.1647]
  [11.8611  9.8539  8.293   5.9457  7.4273  7.2924  7.443   9.0455]
  [11.7684  9.7834  8.2206  5.9009  7.3752  7.2382  7.3845  8.977 ]]]
```

**Explanation of Output**
- Input X: A randomly generated input tensor with shape (batch_size, seq_length, embedding_dim).
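- Attention Weights: After applying the mask, each position can attend only to itself and earlier positions, so the matrix is lower triangular and every row sums to 1. For example, the first position attends only to itself (weight 1), the second splits its weight between the first and second positions (0.7894 and 0.2106), and so on.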
- Output: The result of the masked attention mechanism, maintaining the same shape as the input.
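
The properties described above can also be checked programmatically. The following is a sketch, not part of the original example: `causal_attention` is a hypothetical helper that wraps the same steps, and the seed and the perturbation of the last token are arbitrary choices. It checks that the weights form a lower-triangular matrix whose rows sum to 1, and that changing the last token leaves the outputs at all earlier positions untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def causal_attention(X, W_q, W_k, W_v, W_o):
    """Hypothetical helper: masked attention with the same steps as above."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    n, d = X.shape[1], Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - np.triu(np.ones((n, n)), k=1) * 1e9  # hide future positions
    weights = softmax(scores, axis=-1)
    return weights, weights @ V @ W_o

rng = np.random.default_rng(0)  # assumption: any seed would do
X_demo = rng.random((1, 4, 8))
W_q, W_k, W_v, W_o = (rng.random((8, 8)) for _ in range(4))

weights, out = causal_attention(X_demo, W_q, W_k, W_v, W_o)

# Each row is a probability distribution with no weight on future positions
rows_sum_to_one = np.allclose(weights.sum(axis=-1), 1.0)
no_future_weight = np.allclose(np.triu(weights[0], k=1), 0.0)

# Perturb only the last token and recompute
X_changed = X_demo.copy()
X_changed[0, -1] += 1.0
_, out_changed = causal_attention(X_changed, W_q, W_k, W_v, W_o)

# Earlier positions cannot see the last token, so their outputs are unchanged
earlier_unchanged = np.allclose(out[0, :-1], out_changed[0, :-1])
last_changed = not np.allclose(out[0, -1], out_changed[0, -1])

print(rows_sum_to_one, no_future_weight, earlier_unchanged, last_changed)
```

The third check is the practical meaning of the mask: information from a later token never reaches the output of an earlier position.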

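Note: Because the seed is fixed with `np.random.seed(42)`, rerunning the cell should reproduce these values. Changing or removing the seed produces different numbers, but the zeros above the diagonal of the attention weights remain.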