# Lecture 5 Transformer - Cross Attention

This lecture implements cross-attention between a transformer's decoder and encoder from scratch in Python and NumPy. The focus is on how attention weights are computed between queries (from the decoder) and key-value pairs (from the encoder), and how those weights turn the encoder's values into context vectors.

**Overview of Cross-Attention**
- Query (Q) comes from the decoder.
- Key (K) and Value (V) come from the encoder.
- Compute the attention scores as the dot product between Q and K.
- Scale the attention scores by dividing by the square root of the dimensionality of the keys.
- Apply the softmax function to obtain the attention weights.
- Use these attention weights to get a weighted sum of the V (values) to produce the final context.
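
The steps above combine into the standard scaled dot-product attention formula. It is written here with the symbols from the list; the shape annotations assume unbatched 2-D matrices, as in the code that follows.

$$
\text{CrossAttention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
Q \in \mathbb{R}^{n_{\text{dec}} \times d_k},\;
K \in \mathbb{R}^{n_{\text{enc}} \times d_k},\;
V \in \mathbb{R}^{n_{\text{enc}} \times d_v}
$$

The softmax is taken row by row, so each of the $n_{\text{dec}}$ rows of the weight matrix sums to 1, and the result has shape $n_{\text{dec}} \times d_v$.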

In [None]:
import numpy as np

# Helper function for softmax
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))  # subtract max for numerical stability
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Function to compute cross-attention
def cross_attention(Q, K, V, d_k):
    # Step 1: Compute raw attention scores, Q @ K^T
    # (seq_len_dec x d_k) @ (d_k x seq_len_enc) -> (seq_len_dec x seq_len_enc)
    attention_scores = Q @ K.T

    # Step 2: Scale the attention scores by sqrt(d_k)
    # (not done in place, so integer inputs do not raise a casting error)
    attention_scores = attention_scores / np.sqrt(d_k)

    # Step 3: Apply softmax over the encoder axis to get attention weights
    # (seq_len_dec x seq_len_enc), each row sums to 1
    attention_weights = softmax(attention_scores)

    # Step 4: Compute the final output as a weighted sum of the values
    # (seq_len_dec x seq_len_enc) @ (seq_len_enc x d_model) -> (seq_len_dec x d_model)
    attention_output = attention_weights @ V
    
    return attention_output, attention_weights

# Example dimensions
batch_size = 1    # Single example (not used below: Q, K, V are unbatched 2-D arrays)
seq_len_enc = 5   # Encoder sequence length
seq_len_dec = 3   # Decoder sequence length
d_model = 4       # Model dimensionality
d_k = d_model     # Key dimension (same as model dimensionality here)

# Example encoder output (K, V) and decoder input (Q)
np.random.seed(42)  # For reproducibility

# Encoder output (K, V)
K = np.random.rand(seq_len_enc, d_model)  # Key matrix from encoder
V = np.random.rand(seq_len_enc, d_model)  # Value matrix from encoder

# Decoder input (Q)
Q = np.random.rand(seq_len_dec, d_model)  # Query matrix from decoder

# Compute cross-attention
attention_output, attention_weights = cross_attention(Q, K, V, d_k)

print("Attention Weights:\n", attention_weights)
print("\nAttention Output (context vectors):\n", attention_output)

**Explanation:**
- Q (Query) represents the input from the decoder, which attends to the encoder's output.
- K (Key) and V (Value) are derived from the encoder's output.
- The dot product of Q and K gives us the attention scores.
- These scores are divided by the square root of the key dimension (d_k) so that large dot products do not saturate the softmax, which keeps gradients stable.
- Softmax is applied to convert scores into probabilities (attention weights).
- Finally, the attention weights are used to compute a weighted sum of the value vectors, resulting in the context vector, which will be passed back to the decoder.
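
The effect of the scaling step can be seen in a small experiment. The sketch below is illustrative only: `d_k_large`, `Q_demo`, and `K_demo` are hypothetical names, and the inputs are standard-normal random vectors rather than real encoder or decoder states. It compares the softmax of raw dot products with the softmax of the scaled ones for a large key dimension.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k_large = 512                               # hypothetical large key dimension
Q_demo = rng.standard_normal((3, d_k_large))  # 3 decoder positions
K_demo = rng.standard_normal((5, d_k_large))  # 5 encoder positions

# For unit-variance inputs, the spread of the raw scores grows like sqrt(d_k)
raw_scores = Q_demo @ K_demo.T

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(raw_scores / np.sqrt(d_k_large))

print("Largest weight per row, unscaled:", unscaled_weights.max(axis=-1))
print("Largest weight per row, scaled:  ", scaled_weights.max(axis=-1))
```

Dividing the scores by a constant greater than 1 can only flatten the softmax, so the scaled rows are never more peaked than the unscaled ones. With a large `d_k`, the unscaled rows tend to collapse onto a single encoder token, which is the saturated regime where gradients become very small.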

**Output:**
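- Attention Weights: A matrix of shape (seq_len_dec, seq_len_enc). Each row is a probability distribution over the encoder's sequence, representing how much focus that decoder token puts on each encoder token.
- Attention Output: A matrix of shape (seq_len_dec, d_model) holding one contextualized vector per decoder token, each a weighted average of the encoder's value vectors.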
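
These properties can be checked directly. The following is a minimal standalone sketch, assuming the same dimensions as the example; it draws fresh random inputs with NumPy's `default_rng`, so the numbers differ from the cell above, and it inlines the attention computation instead of calling `cross_attention`.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len_enc, seq_len_dec, d_model = 5, 3, 4

K = rng.random((seq_len_enc, d_model))  # keys from the encoder
V = rng.random((seq_len_enc, d_model))  # values from the encoder
Q = rng.random((seq_len_dec, d_model))  # queries from the decoder

weights = softmax(Q @ K.T / np.sqrt(d_model))
output = weights @ V

print(weights.shape)                           # (3, 5)
print(output.shape)                            # (3, 4)
print(np.allclose(weights.sum(axis=-1), 1.0))  # True

# Each output row is a convex combination of the rows of V, so every
# component lies between the column-wise minimum and maximum of V.
inside = np.all(output >= V.min(axis=0) - 1e-12) and np.all(output <= V.max(axis=0) + 1e-12)
print(inside)                                  # True
```

The last check follows from the weights being non-negative and summing to 1 per row: attention can only mix the encoder's value vectors, never extrapolate beyond them.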

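This is a simplified version: it omits multi-head attention, masking, and other optimizations, and it uses random Q, K, and V directly, whereas in a full transformer they are learned linear projections of the decoder and encoder states. The core idea of cross-attention is to align the decoder's queries with the encoder's keys to extract relevant information (values) from the encoder.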
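
For comparison, here is one way the multi-head variant could be sketched in NumPy. This is an illustration under stated assumptions, not a reference implementation: the function name `multi_head_cross_attention` and the projection matrices `W_q`, `W_k`, `W_v`, and `W_o` are hypothetical, and the matrices are random placeholders standing in for learned parameters. Biases, masking, and batching are left out.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def multi_head_cross_attention(dec_states, enc_states, W_q, W_k, W_v, W_o, num_heads):
    """Hypothetical sketch: unbatched multi-head cross-attention."""
    seq_len_dec, d_model = dec_states.shape
    seq_len_enc = enc_states.shape[0]
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    # Project decoder states to queries, encoder states to keys and values
    Q = dec_states @ W_q  # (seq_len_dec, d_model)
    K = enc_states @ W_k  # (seq_len_enc, d_model)
    V = enc_states @ W_v  # (seq_len_enc, d_model)

    # Split the feature axis into heads: (num_heads, seq_len, d_head)
    Qh = Q.reshape(seq_len_dec, num_heads, d_head).transpose(1, 0, 2)
    Kh = K.reshape(seq_len_enc, num_heads, d_head).transpose(1, 0, 2)
    Vh = V.reshape(seq_len_enc, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently per head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len_dec, seq_len_enc)
    weights = softmax(scores)
    heads = weights @ Vh                                   # (num_heads, seq_len_dec, d_head)

    # Concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len_dec, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
d_model, num_heads = 4, 2
enc_states = rng.random((5, d_model))  # stand-in for encoder output
dec_states = rng.random((3, d_model))  # stand-in for decoder states
# Random placeholders for what would be learned weights
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

mh_output, mh_weights = multi_head_cross_attention(
    dec_states, enc_states, W_q, W_k, W_v, W_o, num_heads
)
print(mh_output.shape)   # (3, 4)
print(mh_weights.shape)  # (2, 3, 5)
```

Each head attends over the encoder sequence with its own slice of the projected features, so the weights gain a leading `num_heads` axis while the output keeps the shape (seq_len_dec, d_model).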