# Chapter 3 - Interactive

### This notebook contains code blocks, images, and GIFs to build your understanding and intuition of the topics listed below:

- #### attention

## Attention

#### Instead of processing a word in isolation or in a fixed window, attention allows the model to:

* #### dynamically focus on other words in the input sequence

* #### weigh the importance of those words based on context

* #### capture long-range dependencies, even between words far apart in the text


### For example, in the sentence:
#### "The cat, which had been hiding under the couch, finally emerged."

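#### When predicting the word "emerged", the model may attend more to "cat" than to "couch", even though "cat" is farther away. Attention makes this possible because every token is compared directly with every other token in the sequence, regardless of the distance between them.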

#### An example is shown below:

<div style="max-width:700px">
    
![](images/interactive_1.gif)

</div>

#### Below is an example of calculating the context vector for a token in a sequence.

<div style="max-width:800px">
    
![](images/interactive_3.gif)

</div>
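
#### The calculation can also be written compactly. The following is a sketch in the usual scaled dot-product form. The symbols $q_i$ (query of token $i$), $k_j$ and $v_j$ (key and value of token $j$), $s_{ij}$, $\alpha_{ij}$, $c_i$, and $d_k$ (key dimension) are notation introduced here for illustration. Dividing by $\sqrt{d_k}$ is the standard Transformer convention; the animation above and the small NumPy example later in this notebook may omit it for simplicity.

$$
s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}, \qquad
\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{m} \exp(s_{im})}, \qquad
c_i = \sum_{j} \alpha_{ij}\, v_j
$$

#### Here $s_{ij}$ is the attention score between token $i$ and token $j$, the weights $\alpha_{ij}$ sum to 1 over $j$, and $c_i$ is the context vector for token $i$.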

#### Now let's look at the difference between single-head and multi-head attention.

#### Single-head attention:

#### In single-head attention, each token uses a single query vector, which is compared against all key vectors. This produces one set of attention scores, which a softmax turns into attention weights. The weights are then used to compute a single context vector as a weighted sum of the value vectors. This helps the model focus on the most relevant parts of the input, but it attends from only one representation subspace.

#### Limitation: single-head attention can miss important contextual relationships because it uses only one perspective, or feature space, to calculate attention.
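
#### The steps described above can be sketched for every token in a sequence at once. This is a minimal illustration, not a reference implementation: the function names `softmax_rows` and `single_head_attention` and the toy matrices `Q`, `K`, and `V` are hypothetical, and the division by the square root of the key dimension is the standard scaled dot-product convention.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax; the row max is subtracted for numerical stability."""
    e_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def single_head_attention(Q, K, V):
    """Scaled dot-product attention for all tokens at once.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns the context vectors and the attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len): every query against every key
    weights = softmax_rows(scores)    # each row sums to 1
    context = weights @ V             # weighted sum of the value vectors
    return context, weights

# Hypothetical toy inputs: 3 tokens with 2-dimensional queries, keys, and values.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
V = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])

context, weights = single_head_attention(Q, K, V)
print("Attention weights (one row per query token):")
print(weights)
print("Context vectors (one row per query token):")
print(context)
```

#### Each row of `weights` is one token's single attention distribution over the sequence, which is exactly the "one perspective" that a single head provides.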

#### Multi-head attention:

#### Multi-head attention overcomes this limitation by projecting the same token into multiple sets of query, key, and value spaces (via learned weight matrices). Each head computes its own attention scores and context vector independently, so different heads can focus on different types of relationships, for example syntactic structure in one head and semantic similarity in another. The heads' context vectors are then concatenated (some variants average them instead) and, in the standard Transformer, passed through a final learned linear projection to form the output representation. This gives the model a richer understanding of context.


<div style="max-width:800px">
    
![](images/interactive_2.gif)

</div>
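
#### As a rough sketch of the description above, the example below gives each head its own query, key, and value projections and then applies an output projection to the concatenated heads. Everything here is an assumption made for illustration: the names `multi_head_attention`, `heads`, and `W_o` are hypothetical, the sizes are arbitrary, and the weight matrices are random rather than learned.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax; the row max is subtracted for numerical stability."""
    e_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """X: (seq_len, d_model) token embeddings.
    heads: list of (W_q, W_k, W_v) tuples, each matrix of shape (d_model, d_head).
    W_o: (num_heads * d_head, d_model) output projection.
    """
    head_outputs = []
    for W_q, W_k, W_v in heads:
        # Each head projects the same embeddings into its own subspace.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        head_outputs.append(softmax_rows(scores) @ V)
    # Concatenate the per-head context vectors, then mix them with W_o.
    concat = np.concatenate(head_outputs, axis=-1)
    return concat @ W_o

# Hypothetical sizes and random (not learned) weights, seeded for repeatability.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

X = rng.normal(size=(seq_len, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(num_heads)
]
W_o = rng.normal(size=(num_heads * d_head, d_model))

output = multi_head_attention(X, heads, W_o)
print(output.shape)  # (4, 8): one d_model-sized vector per token
```

#### Splitting `d_model` across the heads keeps the concatenated size equal to the embedding size, which is a common design choice but not the only possible one.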

In [8]:
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))  # numerical stability
    return e_x / e_x.sum()

# === SINGLE-HEAD ATTENTION ===
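# Toy setup: one query attends over two tokens, each with a 2-D key and value.
q = np.array([1, 0])  # query for the token
keys = np.array([[1, 1], [1, 0]])    # one row per token: k1, k2
values = np.array([[2, 0], [0, 2]])  # one row per token: v1, v2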

attn_scores = q @ keys.T  # [q · k1, q · k2]
attn_weights = softmax(attn_scores)
context_vector = attn_weights @ values

print("=== SINGLE-HEAD ATTENTION ===")
print(f"Query: {q}")
print(f"Attention scores: {attn_scores}")
print(f"Softmaxed weights: {attn_weights}")
print(f"Context vector (single-head): {context_vector}\n")

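# === MULTI-HEAD ATTENTION ===
# Simplification: each head projects only the query; both heads reuse the
# keys and values defined above. In a full implementation each head would
# also have its own key and value projections.
token_embedding = np.array([1, 1])  # original embedding for a single token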
print("=== MULTI-HEAD ATTENTION (2 heads on same token) ===")
print(f"Original token embedding: {token_embedding}\n")

# --- Head 1 ---
W_q1 = np.array([[1, 0], [0, 1]])  # Identity
q1 = token_embedding @ W_q1
attn_scores_1 = q1 @ keys.T
attn_weights_1 = softmax(attn_scores_1)
context_vector_1 = attn_weights_1 @ values

print("Head 1:")
print(f"  q1 = token_embedding @ W_q1 = {q1}")
print(f"  Attention weights: {attn_weights_1}")
print(f"  Context vector: {context_vector_1}")

# --- Head 2 ---
W_q2 = np.array([[2, -1], [0, 0.5]])  # a different projection, so head 2 forms a different query
q2 = token_embedding @ W_q2
attn_scores_2 = q2 @ keys.T
attn_weights_2 = softmax(attn_scores_2)
context_vector_2 = attn_weights_2 @ values

print("Head 2:")
print(f"  q2 = token_embedding @ W_q2 = {q2}")
print(f"  Attention weights: {attn_weights_2}")
print(f"  Context vector: {context_vector_2}")

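# Final output: concatenate the per-head context vectors
# (this toy example applies no output projection afterwards)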
multihead_output = np.concatenate([context_vector_1, context_vector_2])
print(f"\nFinal Multi-head Context (concatenated): {multihead_output}")


=== SINGLE-HEAD ATTENTION ===
Query: [1 0]
Attention scores: [1 1]
Softmaxed weights: [0.5 0.5]
Context vector (single-head): [1. 1.]

=== MULTI-HEAD ATTENTION (2 heads on same token) ===
Original token embedding: [1 1]

Head 1:
  q1 = token_embedding @ W_q1 = [1 1]
  Attention weights: [0.73105858 0.26894142]
  Context vector: [1.46211716 0.53788284]
Head 2:
  q2 = token_embedding @ W_q2 = [ 2.  -0.5]
  Attention weights: [0.37754067 0.62245933]
  Context vector: [0.75508134 1.24491866]

Final Multi-head Context (concatenated): [1.46211716 0.53788284 0.75508134 1.24491866]
