# Chapter 03 : Coding Attention Mechanism

In [1]:
from importlib.metadata import version

print(" torch version:", version("torch"))

 torch version: 2.9.1



Workflow till now

<div align = "center">
    <img src = "../Ref_images/3.0.png" center width="500">
</div>

- In this chapter we will code 

<img src="../Ref_images/3.2.png" width=500>

### 3.3.1 A Simple self-attention mechanism without trainable weights
<img src ="../Ref_images/3.7.png">

- Input sequence $x$, consisting of $T$ elements represented as $x^{(1)}$ to $x^{(T)}$

Example: "Your journey starts with one step"
- $x^{(1)}$ corresponds to a d-dimensional embedding vector representing a specific token; for example, "Your" is represented here by a 3-dimensional embedding.

In **self-attention** your goal is to calculate context vectors $z^{(i)}$ for each element in the input sequence.

- A ***Context Vector*** can be interpreted as an enriched embedding vector.

- We illustrate this with the word "journey", $x^{(2)}$, and its context vector $z^{(2)}$, which contains information about $x^{(2)}$ and all other input elements, $x^{(1)}$ to $x^{(T)}$.

- First we will do simplified attention and later we will add trainable weights.


In [1]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2) # query
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

- The first step of implementing self-attention is to compute the intermediate values *w*, referred to as ***attention scores***.

<img src="../Ref_images/3.8.png">

### Attention Scores

In [2]:
query = inputs[1] #  journey  (x^2)

attn_scores_2 = torch.empty(inputs.shape[0])  # one score per input token
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


```
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
        x^1     x^2     x^3     x^4     x^5     x^6
        Your   journey starts   with    one    step
```
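The loop above can also be written as a single matrix-vector product. The following is a minimal sketch that assumes the same `inputs` tensor; the name `attn_scores_2_vec` is introduced here only for illustration.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2) # query
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]

# Loop version, as in the cell above
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

# Vectorized version: each row of `inputs` is dotted with `query`
attn_scores_2_vec = inputs @ query

print(torch.allclose(attn_scores_2, attn_scores_2_vec))
```

The two versions compute the same six scores; the matrix form avoids the Python loop and scales better to longer sequences.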

### Understanding the dot product (bilinear - 2 vectors)
A dot product is a concise way of multiplying two vectors element-wise and then summing the products, as demonstrated below:

In [3]:
inputs[0] # Your

tensor([0.4300, 0.1500, 0.8900])

In [4]:
### Understanding the dot product (bilinear - 2 vectors)
print("=== Standard Attention (dot product of 2 vectors) ===")
res = 0
for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx]*query[idx] # element wise multiplication and addition
print(res)
print(torch.dot(inputs[0],query))

=== Standard Attention (dot product of 2 vectors) ===
tensor(0.9544)
tensor(0.9544)


#### 2-simplicial Attention scores: 2D scores (for single query)


In [5]:
attn_scores_2s = torch.empty(inputs.shape[0], inputs.shape[0])  # 6 x 6 = number of tokens x number of tokens
for j, k_j in enumerate(inputs):
    for k, k_prime_k in enumerate(inputs):
        attn_scores_2s[j,k] = torch.sum(query* k_j* k_prime_k)

print("\n2-Simplicial attention scores: ")
print(attn_scores_2s)


2-Simplicial attention scores: 
tensor([[0.6441, 0.6313, 0.6217, 0.3216, 0.2735, 0.4393],
        [0.6313, 1.1124, 1.0946, 0.6493, 0.4657, 0.8602],
        [0.6217, 1.0946, 1.0776, 0.6373, 0.4685, 0.8396],
        [0.3216, 0.6493, 0.6373, 0.3912, 0.2411, 0.5295],
        [0.2735, 0.4657, 0.4685, 0.2411, 0.3871, 0.2315],
        [0.4393, 0.8602, 0.8396, 0.5295, 0.2315, 0.7578]])


```
                k=0     k=1     k=2     k=3     k=4     k=5
                Your   journey starts   with    one    step
            ┌──────────────────────────────────────────────  ┐
j=0  Your   │ 0.6441  0.6313  0.6217  0.3216  0.2735  0.4393 │
j=1 journey │ 0.6313  1.1124  1.0946  0.6493  0.4657  0.8602 │
j=2 starts  │ 0.6217  1.0946  1.0776  0.6373  0.4685  0.8396 │
j=3  with   │ 0.3216  0.6493  0.6373  0.3912  0.2411  0.5295 │
j=4  one    │ 0.2735  0.4657  0.4685  0.2411  0.3871  0.2315 │
j=5  step   │ 0.4393  0.8602  0.8396  0.5295  0.2315  0.7578 │
            └──────────────────────────────────────────────  ┘
```

**Interpretation:** 36 scores (6×6) representing how much the query (journey) attends to each pair of tokens.

- [0,0] = 0.6441 (Your, Your): journey attending to "Your" paired with "Your"
- [0,2] = 0.6217 (Your, starts): journey attending to "Your" paired with "starts"
- [1,2] = 1.0946 (journey, starts): journey attending to "journey" paired with "starts"
- [4,5] = 0.2315 (one, step): journey attending to "one" paired with "step"

```
Standard:     query → token           (6 pairs)
              journey → Your, journey → starts, ...

2-Simplicial: query → (token, token)  (36 triplets)
              journey → (Your, Your), journey → (Your, journey), journey → (Your, starts), ...

```

So instead of 6 attention scores, we have 36 attention scores - one for every possible pair of tokens that the query can attend to together!
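The nested loop that fills the 6×6 score matrix can be expressed with `torch.einsum`. This is a sketch under the same assumptions as the cell above (the query is `inputs[1]`, and both key slots reuse `inputs`); the name `attn_scores_2s_vec` is illustrative.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2) # query
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]

# Sum over the embedding dimension d of query[d] * inputs[j, d] * inputs[k, d]
attn_scores_2s_vec = torch.einsum("d,jd,kd->jk", query, inputs, inputs)

print(attn_scores_2s_vec.shape)
print(attn_scores_2s_vec[0, 2])
```

Because both key slots draw from the same `inputs` in this simplified setting, the resulting matrix is symmetric, which matches the table above.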


#### Understanding the trilinear product (3 vectors)

In [6]:
print("\n=== 2-Simplicial Attention (trilinear product of 3 vectors) ===")
# For A[query, j=0, k=2] = trilinear(query, inputs[0], inputs[2])
j = 0  # Your
k = 2  # starts

res = 0
for idx in range(len(query)):
    res += query[idx] * inputs[j][idx] * inputs[k][idx]  # element-wise multiplication of 3 vectors, then addition

print(res)
print(torch.sum(query * inputs[j] * inputs[k]))



=== 2-Simplicial Attention (trilinear product of 3 vectors) ===
tensor(0.6217)
tensor(0.6217)


#### Standard Attention (dot product -2 vectors)

```
query = inputs[1] = [0.55, 0.87, 0.66]  (journey)
inputs[0] = [0.43, 0.15, 0.89]  (Your)

attn_score[0] = (0.43*0.55) + (0.15*0.87) + (0.89*0.66) = 0.9544
```

**Interpretation:** This is the attention score for inputs[0] (Your) with respect to inputs[1] (journey). It measures how much "journey" should attend to "Your".

#### 2-Simplicial Attention (trilinear product - 3 vectors)

```
query = inputs[1] = [0.55, 0.87, 0.66]  (journey)
inputs[0] = [0.43, 0.15, 0.89]  (Your)
inputs[2] = [0.57, 0.85, 0.64]  (starts)

attn_score_2s[0,2] = (0.55*0.43*0.57) + (0.87*0.15*0.85) + (0.66*0.89*0.64) = 0.6217
```
**Interpretation:** This is the attention score for the pair (inputs[0], inputs[2]) i.e. (Your, starts) with respect to inputs[1] (journey). It measures how much "journey" should attend to the combination of "Your" AND "starts" together.
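
The two worked examples can be summarized in one pair of formulas. The symbols $\omega_{2,i}$ and $A_{2,jk}$ are introduced here only as shorthand for the scores of query $x^{(2)}$; $d$ is the embedding dimension (3 in this example) and $m$ indexes its components.

$$
\omega_{2,i} = \sum_{m=1}^{d} x^{(2)}_m \, x^{(i)}_m
\qquad \text{(standard: one score per token } i\text{)}
$$

$$
A_{2,jk} = \sum_{m=1}^{d} x^{(2)}_m \, x^{(j)}_m \, x^{(k)}_m
\qquad \text{(2-simplicial: one score per pair } (j, k)\text{)}
$$

Since both key positions use the same input vectors in this simplified version, $A_{2,jk} = A_{2,kj}$, which is why the 6×6 score matrix is symmetric.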


### Comparison of Text vs Point Transformer

Text Transformer Pipeline

```
Raw Text: "Your journey starts with one step"
    ↓
Tokenization: [token_1, token_2, token_3, token_4, token_5, token_6]
    ↓
Token Embedding Layer: nn.Embedding(vocab_size, d_model)
    → Each token → lookup → embedding vector
    ↓
+ Positional Encoding: (RoPE / Sinusoidal / Learned)
    ↓
Features for Attention: [feat_1, feat_2, ..., feat_6]  shape: (6, d_model)

```
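
The text pipeline above can be sketched in a few lines. The vocabulary size, token IDs, and the choice of a learned positional embedding are hypothetical values picked only to reproduce the `(6, d_model)` shape; they do not come from a real tokenizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical toy sizes chosen for illustration
vocab_size, d_model, context_len = 10, 3, 6

# Hypothetical token IDs standing in for "Your journey starts with one step"
token_ids = torch.tensor([2, 5, 7, 1, 4, 9])

tok_emb = nn.Embedding(vocab_size, d_model)   # token lookup table
pos_emb = nn.Embedding(context_len, d_model)  # learned positional variant

# Each token ID is looked up, then its position embedding is added
features = tok_emb(token_ids) + pos_emb(torch.arange(context_len))
print(features.shape)
```

The result has one `d_model`-dimensional row per token, which is the same layout as the hand-written `inputs` tensor used in this chapter.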

Point Transformer V3 Pipeline

```
Raw Point Cloud: N points with (x, y, z, r, g, b) or (x, y, z, intensity)
    ↓
Voxelization / Grid Sampling: 
    → coord: (N, 3) - 3D coordinates
    → feat: (N, C) - initial features (RGB, intensity, normals)
    ↓
Serialization: (Z-order / Hilbert curve)
    → Converts unordered points to ordered sequence
    ↓
Embedding Layer (Sparse Conv): 
    → Projects initial features to higher dimension
    → Acts like "token embedding" for point clouds
    → Also serves as positional encoding (xCPE - enhanced Conditional Position Encoding)
    ↓
Features for Attention: [feat_1, feat_2, ..., feat_N]  shape: (N, d_model)
```
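
To make the serialization step concrete, here is a simplified Z-order (Morton) sketch. It is an illustration of the idea only, not the Point Transformer V3 implementation; the function name `z_order_code`, the grid size, and the bit width are assumptions.

```python
import torch

def z_order_code(grid_coord: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Interleave the lowest `bits` bits of integer (x, y, z) into one key."""
    code = torch.zeros(grid_coord.shape[0], dtype=torch.int64)
    for b in range(bits):
        for axis in range(3):  # 0 = x, 1 = y, 2 = z
            bit = (grid_coord[:, axis] >> b) & 1
            code |= bit << (3 * b + axis)
    return code

coord = torch.tensor([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
])

grid_size = 0.5  # hypothetical voxel size
grid_coord = torch.floor(coord / grid_size).to(torch.int64)

codes = z_order_code(grid_coord)
order = torch.argsort(codes)  # sequence order for the unordered points

print(codes)
print(order)
```

Sorting by the interleaved key places points that are close in 3D near each other in the 1D sequence, which is what lets attention treat the point cloud like a token sequence.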



In [7]:
import torch

# === RAW POINT CLOUD INPUT ===
# 6 points from a simple scene

# coord: 3D spatial coordinates (x, y, z) - WHERE the point is in space
coord = torch.tensor([
    [0.0, 0.0, 0.0],   # Point 1: floor corner
    [1.0, 0.0, 0.0],   # Point 2: floor edge        # query point
    [1.0, 1.0, 0.0],   # Point 3: floor edge
    [0.0, 0.0, 1.0],   # Point 4: wall corner
    [1.0, 0.0, 1.0],   # Point 5: wall edge
    [0.5, 0.5, 0.5]    # Point 6: mid-air point
])

# Initial features (e.g., RGB color, intensity) - raw input
raw_feat = torch.tensor([
    [0.8, 0.2, 0.1],   # Point 1: reddish
    [0.7, 0.3, 0.2],   # Point 2: reddish
    [0.6, 0.4, 0.3],   # Point 3: brownish
    [0.9, 0.9, 0.9],   # Point 4: white (wall)
    [0.85, 0.85, 0.85],# Point 5: white (wall)
    [0.1, 0.5, 0.8]    # Point 6: bluish (object)
])

# === AFTER SPARSE CONV EMBEDDING (like token embedding + pos encoding) ===
# In PTv3, a sparse conv layer projects raw_feat to higher dimension
# and implicitly encodes positional information from coord

# Simulated output after sparse conv embedding layer
# This is what actually goes into attention!
feat = torch.tensor([
    [0.43, 0.15, 0.89],  # Point 1: embedded feature
    [0.55, 0.87, 0.66],  # Point 2: embedded feature  # query point
    [0.57, 0.85, 0.64],  # Point 3: embedded feature
    [0.22, 0.58, 0.33],  # Point 4: embedded feature
    [0.77, 0.25, 0.10],  # Point 5: embedded feature
    [0.05, 0.80, 0.55]   # Point 6: embedded feature
])

# This "feat" is analogous to your text "inputs" tensor!
# Now attention operates on these embedded features

![Point Transformer v3 vs Text data](../Ref_images/PTv3_vs_Text.png)

## Attention Weights

#### 1-simplicial (Standard) Attention Weights

![Attention weights](../Ref_images/3.9.png)

- We normalize each of the attention scores we computed previously. The main goal of the normalization is to obtain attention weights that sum up to 1.
- This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here is a straightforward method for achieving this normalization step:

In [60]:
attn_weights_2_tmp = attn_scores_2/attn_scores_2.sum()
print("\nAttention scores:", attn_scores_2)
print("\nsum of attention scores:", attn_scores_2.sum())
print("\nAttention weights:", attn_weights_2_tmp)
print("\nSum of Attention weights:", attn_weights_2_tmp.sum())



Attention scores: tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

sum of attention scores: tensor(6.5617)

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])

Sum of Attention weights: tensor(1.0000)


#### For 2-simplicial Attention Weights

- **Key Difference:** 
In 2-simplicial attention, we have a 2D matrix of scores (T × T), so we normalize over both dimensions (all T² values sum to 1).

In [61]:
# 2-Simplicial attention scores (6x6 matrix)
attn_scores_2s = torch.empty(inputs.shape[0], inputs.shape[0])
for j, k_j in enumerate(inputs):
    for k, k_prime_k in enumerate(inputs):
        attn_scores_2s[j, k] = torch.sum(query * k_j * k_prime_k)

# Normalize: divide by sum of ALL elements in the matrix
attn_weights_2s_tmp = attn_scores_2s / attn_scores_2s.sum()

print("\nAttention scores (2D matrix):\n", attn_scores_2s)
print("\nSum of attention scores:", attn_scores_2s.sum())
print("\nAttention weights (2D matrix):\n", attn_weights_2s_tmp)
print("\nSum of Attention weights:", attn_weights_2s_tmp.sum())


Attention scores (2D matrix):
 tensor([[0.6441, 0.6313, 0.6217, 0.3216, 0.2735, 0.4393],
        [0.6313, 1.1124, 1.0946, 0.6493, 0.4657, 0.8602],
        [0.6217, 1.0946, 1.0776, 0.6373, 0.4685, 0.8396],
        [0.3216, 0.6493, 0.6373, 0.3912, 0.2411, 0.5295],
        [0.2735, 0.4657, 0.4685, 0.2411, 0.3871, 0.2315],
        [0.4393, 0.8602, 0.8396, 0.5295, 0.2315, 0.7578]])

Sum of attention scores: tensor(20.9792)

Attention weights (2D matrix):
 tensor([[0.0307, 0.0301, 0.0296, 0.0153, 0.0130, 0.0209],
        [0.0301, 0.0530, 0.0522, 0.0309, 0.0222, 0.0410],
        [0.0296, 0.0522, 0.0514, 0.0304, 0.0223, 0.0400],
        [0.0153, 0.0309, 0.0304, 0.0186, 0.0115, 0.0252],
        [0.0130, 0.0222, 0.0223, 0.0115, 0.0185, 0.0110],
        [0.0209, 0.0410, 0.0400, 0.0252, 0.0110, 0.0361]])

Sum of Attention weights: tensor(1.)


#### **Softmax Normalization**: 
In practice, it is more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. The following is a basic implementation of the softmax function for normalizing the attention scores.

In [62]:
def softmax_naive(x):
    return torch.exp(x)/torch.exp(x).sum(dim=0)

In [63]:

attn_weights_2_naive = softmax_naive(attn_scores_2)

print("Attention weights:",attn_weights_2_naive)
print("\nAttention weights Sum:",attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])

Attention weights Sum: tensor(1.)
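
One concrete difference between the two normalizations shows up when a score is negative. The sketch below uses made-up scores (not taken from the example sentence) to compare dividing by the sum with softmax.

```python
import torch

# Hypothetical scores containing a negative value
scores = torch.tensor([-1.0, 0.5, 1.5])

# Sum normalization: weights sum to 1, but one weight is negative
sum_norm = scores / scores.sum()

# Softmax: exponentiation makes every weight strictly positive
soft = torch.exp(scores) / torch.exp(scores).sum()

print("Sum-normalized:", sum_norm)
print("Softmax:       ", soft)
```

With sum normalization the negative score yields a negative weight, which is hard to interpret as relative importance; softmax keeps all weights positive while still summing to 1.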


#### **Softmax Normalization for 2-Simplicial Attention:**

- For 2-simplicial attention, we apply softmax over both dimensions (flatten the matrix, apply softmax, reshape back). This ensures all T×T weights sum to 1.

**Why Flatten?**

- In standard attention: query chooses among T tokens --> softmax over T options.
- In 2-Simplicial: query chooses among T*T pairs --> softmax over $T^2$ options.

In [64]:
def softmax_naive_2d(x):
    # Flatten the 2D matrix, apply softmax, reshape back
    flat = x.flatten()
    softmax_flat = torch.exp(flat) / torch.exp(flat).sum()
    return softmax_flat.view(x.shape)

attn_weights_2s_naive = softmax_naive_2d(attn_scores_2s)

print("Attention weights (2D matrix):\n", attn_weights_2s_naive)
print("\nAttention weights Sum:", attn_weights_2s_naive.sum())

Attention weights (2D matrix):
 tensor([[0.0285, 0.0282, 0.0279, 0.0207, 0.0197, 0.0233],
        [0.0282, 0.0456, 0.0448, 0.0287, 0.0239, 0.0354],
        [0.0279, 0.0448, 0.0440, 0.0283, 0.0239, 0.0347],
        [0.0207, 0.0287, 0.0283, 0.0222, 0.0191, 0.0254],
        [0.0197, 0.0239, 0.0239, 0.0191, 0.0221, 0.0189],
        [0.0233, 0.0354, 0.0347, 0.0254, 0.0189, 0.0320]])

Attention weights Sum: tensor(1.)


- The softmax function also meets the objective and normalizes the attention weights such that they sum to 1.
- In addition, the softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance.
- **Note**: This naive softmax implementation (softmax_naive) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore it is advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance:

In [65]:
attn_weights_2_torch = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:",attn_weights_2_torch)
print("\nAttention weights Sum:",attn_weights_2_torch.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])

Attention weights Sum: tensor(1.)


- In this case, it yields the same results as our previous softmax_naive function
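
The overflow problem of the naive version is easy to trigger with large scores. The sketch below uses made-up values and also shows a common remedy, subtracting the maximum before exponentiating, which leaves the softmax result unchanged because softmax is invariant to shifting all inputs by a constant.

```python
import torch

# Hypothetical large scores that overflow float32 when exponentiated
big_scores = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive softmax: exp() returns inf, and inf / inf gives nan
naive = torch.exp(big_scores) / torch.exp(big_scores).sum()

# Shift by the maximum so the largest exponent is exp(0) = 1
shifted = big_scores - big_scores.max()
stable = torch.exp(shifted) / torch.exp(shifted).sum()

print("Naive:        ", naive)
print("Max-shifted:  ", stable)
print("torch.softmax:", torch.softmax(big_scores, dim=0))
```

The shifted version agrees with `torch.softmax`, while the naive one returns only `nan` values.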

#### PyTorch Softmax for 2-Simplicial Attention
- **Note:** The naive softmax implementation (softmax_naive_2d) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore it is advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance.
- **Key Difference:** For 2-simplicial attention, softmax must run over all T×T scores at once. We flatten the 2D matrix, apply softmax with dim=0, and reshape back; reshaping to a single row and using dim=-1 is equivalent.

In [66]:
attn_weights_2s_torch = torch.softmax(attn_scores_2s.flatten(),dim=0).view(attn_scores_2s.shape)
print("Attention weights (2D matrix):\n",attn_weights_2s_torch)
print("Attention weights Sum:", attn_weights_2s_torch.sum())

Attention weights (2D matrix):
 tensor([[0.0285, 0.0282, 0.0279, 0.0207, 0.0197, 0.0233],
        [0.0282, 0.0456, 0.0448, 0.0287, 0.0239, 0.0354],
        [0.0279, 0.0448, 0.0440, 0.0283, 0.0239, 0.0347],
        [0.0207, 0.0287, 0.0283, 0.0222, 0.0191, 0.0254],
        [0.0197, 0.0239, 0.0239, 0.0191, 0.0221, 0.0189],
        [0.0233, 0.0354, 0.0347, 0.0254, 0.0189, 0.0320]])
Attention weights Sum: tensor(1.0000)
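
It is worth checking what happens if the flattening step is skipped. The sketch below, which rebuilds the score matrix with `torch.einsum` so that it is self-contained, compares the flattened softmax with a plain `dim=-1` softmax on the 2D matrix; the variable names are illustrative.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89],
   [0.55, 0.87, 0.66],
   [0.57, 0.85, 0.64],
   [0.22, 0.58, 0.33],
   [0.77, 0.25, 0.10],
   [0.05, 0.80, 0.55]]
)
query = inputs[1]
T = inputs.shape[0]
scores = torch.einsum("d,jd,kd->jk", query, inputs, inputs)

# Joint normalization over all T*T pairs (two equivalent spellings)
flat = torch.softmax(scores.flatten(), dim=0).view(T, T)
one_row = torch.softmax(scores.reshape(1, -1), dim=-1).reshape(T, T)

# Without flattening, dim=-1 normalizes every row separately
per_row = torch.softmax(scores, dim=-1)

print("Joint sum:  ", flat.sum())
print("Per-row sum:", per_row.sum())
```

The per-row variant produces T separate distributions, so its weights sum to T rather than 1; only the flattened version treats the T×T pairs as a single set of options for the query.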


#### **Stable Softmax Normalization**: (ICLR'25)

- Helps with grokking (delayed generalization); see the [Paper Link](https://arxiv.org/pdf/2501.04697). As discussed above, softmax can overflow or underflow when the inputs are very large or very small.
- [Github Link](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability)

    ![Stable max](../Ref_images/Stable_max.png)



- Both of the above versions can be used:
    - For a faster implementation, use the linear transformation: (x+1) and 1/(1-x)
    - For better compatibility, use log(x+1) and -log(-x+1), since we later want to use other properties of softmax.
    - Here we implement only the log version.

In [67]:
def g(x):
    return torch.where(
        x>=0,
        torch.log(x+1),
        -torch.log(-x+1)
    )
attn_weights_2_stablemax = torch.softmax(g(attn_scores_2), dim=0)
print("Attention weights:",attn_weights_2_stablemax)
print("\nAttention weights Sum:",attn_weights_2_stablemax.sum())

Attention weights: tensor([0.1556, 0.1986, 0.1971, 0.1467, 0.1359, 0.1661])

Attention weights Sum: tensor(1.)
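
The two variants listed above give the same weights, because exponentiating the log version recovers the linear one. The following sketch checks this on made-up scores, assuming the linear form is x + 1 for x ≥ 0 and 1/(1 - x) otherwise; the names `s_linear`, `g_log`, and `stablemax_linear` are illustrative.

```python
import torch

def s_linear(x):
    # Linear variant: grows linearly for x >= 0, decays as 1/(1 - x) for x < 0
    return torch.where(x >= 0, x + 1, 1 / (1 - x))

def g_log(x):
    # Log variant, to be followed by a regular softmax
    return torch.where(x >= 0, torch.log(x + 1), -torch.log(-x + 1))

def stablemax_linear(x, dim=0):
    # Normalize the transformed values directly, without exp()
    s = s_linear(x)
    return s / s.sum(dim=dim, keepdim=True)

# Hypothetical scores with both negative and positive entries
scores = torch.tensor([-2.0, -0.5, 0.0, 0.9544, 3.0])

w_linear = stablemax_linear(scores)
w_log = torch.softmax(g_log(scores), dim=0)

print(w_linear)
print(w_log)
```

The linear variant skips the exponential entirely, while the log variant lets existing softmax-based code be reused unchanged.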


#### **Stable Softmax Normalization for 2-Simplicial Attention:** (ICLR'25)


In [68]:
def g(x):
    return torch.where(
        x >= 0,
        torch.log(x + 1),
        -torch.log(-x + 1)
    )

# Apply g() to the 2D scores matrix, then flatten, softmax, reshape
attn_weights_2s_stablemax = torch.softmax(g(attn_scores_2s).flatten(), dim = 0).view(attn_scores_2s.shape)

print("Attention weights (2D matrix):\n",attn_weights_2s_stablemax)
print("\nAttention weights Sum:", attn_weights_2s_stablemax.sum())

Attention weights (2D matrix):
 tensor([[0.0289, 0.0286, 0.0285, 0.0232, 0.0223, 0.0253],
        [0.0286, 0.0371, 0.0368, 0.0289, 0.0257, 0.0326],
        [0.0285, 0.0368, 0.0365, 0.0287, 0.0258, 0.0323],
        [0.0232, 0.0289, 0.0287, 0.0244, 0.0218, 0.0268],
        [0.0223, 0.0257, 0.0258, 0.0218, 0.0243, 0.0216],
        [0.0253, 0.0326, 0.0323, 0.0268, 0.0216, 0.0309]])

Attention weights Sum: tensor(1.)


**Why stable softmax matters for 2-Simplicial:**

- 2-simplicial has T×T scores instead of T scores
- More values → higher chance of extreme values
- Trilinear products can produce larger/smaller values than bilinear (dot product)
- Stable transformation g(x) compresses the range before softmax



## **Context Vector**:
Now that we have computed the normalized attention weights, we are ready for the final step, shown in figure 3.10: calculating the **Context Vector** $z^{(2)}$ by multiplying the embedded input tokens $x^{(i)}$ by the corresponding attention weights and then summing the resulting vectors.

- Thus, the context vector $z^{(2)}$ is the weighted sum of all input vectors, obtained by multiplying each input vector by its corresponding attention weight:

![3.10](../Ref_images/3.10.png)

In [69]:
# Context vector with torch softmax
query = inputs[1] # The second input token is the query

context_vec_2_torch = torch.zeros(query.shape) # query.shape: torch.Size([3])
for i, x_i in enumerate(inputs): # going through the whole input (0 to 5)
    context_vec_2_torch += attn_weights_2_torch[i]*x_i

print(context_vec_2_torch)

tensor([0.4419, 0.6515, 0.5683])


```
i | attn_weight | input_vector       | weight × vector
--+-------------+--------------------+-------------------------
0 | 0.1385      | [0.43, 0.15, 0.89] | [0.0595, 0.0207, 0.1232]
1 | 0.2379      | [0.55, 0.87, 0.66] | [0.1308, 0.2069, 0.1570]
2 | 0.2333      | [0.57, 0.85, 0.64] | [0.1329, 0.1983, 0.1493]
3 | 0.1240      | [0.22, 0.58, 0.33] | [0.0272, 0.0719, 0.0409]
4 | 0.1082      | [0.77, 0.25, 0.10] | [0.0833, 0.0270, 0.0108]
5 | 0.1581      | [0.05, 0.80, 0.55] | [0.0079, 0.1264, 0.0869]
--+-------------+--------------------+-------------------------
  | SUM         |                    | [0.4419, 0.6515, 0.5683]
```

*Manual Calculations*: for better understanding
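
The weighted sum in the table is a vector-matrix product, so the loop can be replaced by a single `@`. This is a minimal sketch that recomputes the softmax weights to stay self-contained; the names `attn_weights_2` and `context_vec_2` are illustrative.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2) # query
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]

# Scores -> softmax weights -> weighted sum of the input rows
attn_weights_2 = torch.softmax(inputs @ query, dim=0)  # shape (6,)
context_vec_2 = attn_weights_2 @ inputs                # shape (3,)

print(context_vec_2)
```

Each component of `context_vec_2` is the corresponding column sum of the "weight × vector" entries in the table.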



In [70]:
# Context vector with torch stablemax
query = inputs[1] # The second input token is the query

context_vec_2_satblemax = torch.zeros(query.shape) # query.shape: torch.Size([3])

for i, x_i in enumerate(inputs):
    context_vec_2_satblemax += attn_weights_2_stablemax[i]*x_i

print(context_vec_2_satblemax)

tensor([0.4337, 0.6156, 0.5490])


### Computing Attention Weights for all inputs