# Chapter 03: Coding Attention Mechanisms

In [1]:
from importlib.metadata import version

print(" torch version:", version("torch"))

 torch version: 2.9.1



Workflow so far

<div align = "center">
    <img src = "/DATA/pyare/Routine/LLM/Reasoning/LLMs-from-scratch-pyare/chapter-3-coding-attention/Ref_images/3.0.png" center width="500">
</div>

- In this chapter we will code the attention mechanisms shown below:

<img src="/DATA/pyare/Routine/LLM/Reasoning/LLMs-from-scratch-pyare/chapter-3-coding-attention/Ref_images/3.2.png" width=500>

### 3.3.1 A Simple self-attention mechanism without trainable weights
<img src ="/DATA/pyare/Routine/LLM/Reasoning/LLMs-from-scratch-pyare/chapter-3-coding-attention/Ref_images/3.7.png">

- Input sequence $x$, consisting of $T$ elements represented as $x^{(1)}$ to $x^{(T)}$

Example: "Your journey starts with one step"
- Each element $x^{(i)}$ is a $d$-dimensional embedding vector representing a specific token. Here $d = 3$, so "Your" ($x^{(1)}$) is represented by a 3-dimensional embedding.

In **self-attention** the goal is to calculate a context vector $z^{(i)}$ for each element $x^{(i)}$ in the input sequence.

- A ***Context Vector*** can be interpreted as an enriched embedding vector.

- We illustrate this with the word "journey", $x^{(2)}$. Its context vector $z^{(2)}$ contains information about $x^{(2)}$ and all other input elements, $x^{(1)}$ to $x^{(T)}$.

- First we implement simplified attention; trainable weights are added later.


In [None]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2) # query
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

- The first step of implementing self-attention is to compute the intermediate values *w*, referred to as ***attention scores***.

<img src="/DATA/pyare/Routine/LLM/Reasoning/LLMs-from-scratch-pyare/chapter-3-coding-attention/Ref_images/3.8.png">

### Attention Scores

In [7]:
query = inputs[1] #  journey  (x^2)

attn_scores_2 = torch.empty(inputs.shape[0]) # one score per input token
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


```
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
        x^1     x^2     x^3     x^4     x^5     x^6
        Your   journey starts   with    one    step
```
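
The loop above can also be written as a single matrix-vector product. The following is a minimal sketch, assuming the same `inputs` tensor and the same query ("journey"); the name `attn_scores_2_vec` is introduced here only for illustration.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2) # query
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

query = inputs[1] # journey (x^2)

# Loop version: one dot product per input token
attn_scores_2 = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# Vectorized version: (6, 3) @ (3,) -> (6,)
attn_scores_2_vec = inputs @ query

print(attn_scores_2_vec)
print(torch.allclose(attn_scores_2, attn_scores_2_vec))
```

Both versions compute the same six scores; the matrix form avoids the Python loop, which matters once sequences get longer.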

### Understanding the dot product (bilinear - 2 vectors)
A dot product is a concise way of multiplying two vectors element-wise and then summing the products, as demonstrated below:

In [8]:
inputs[0] # Your

tensor([0.4300, 0.1500, 0.8900])

In [16]:
# Dot product of two vectors, computed manually and with torch.dot
print("=== Standard Attention (dot product of 2 vectors) ===")
res = 0
for idx in range(len(query)):
    res += inputs[0][idx] * query[idx] # multiply element-wise and accumulate the sum
print(res)
print(torch.dot(inputs[0],query))

=== Standard Attention (dot product of 2 vectors) ===
tensor(0.9544)
tensor(0.9544)


#### 2-simplicial Attention scores: 2D scores (for single query)


In [14]:
attn_scores_2s = torch.empty(inputs.shape[0], inputs.shape[0]) # 6 x 6 = number of tokens x number of tokens
for j, k_j in enumerate(inputs):
    for k, k_prime_k in enumerate(inputs):
        # trilinear product of the query with the key pair (k_j, k_prime_k)
        attn_scores_2s[j, k] = torch.sum(query * k_j * k_prime_k)

print("\n2-Simplicial attention scores: ")
print(attn_scores_2s)


2-Simplicial attention scores: 
tensor([[0.6441, 0.6313, 0.6217, 0.3216, 0.2735, 0.4393],
        [0.6313, 1.1124, 1.0946, 0.6493, 0.4657, 0.8602],
        [0.6217, 1.0946, 1.0776, 0.6373, 0.4685, 0.8396],
        [0.3216, 0.6493, 0.6373, 0.3912, 0.2411, 0.5295],
        [0.2735, 0.4657, 0.4685, 0.2411, 0.3871, 0.2315],
        [0.4393, 0.8602, 0.8396, 0.5295, 0.2315, 0.7578]])


```
                k=0     k=1     k=2     k=3     k=4     k=5
                Your   journey starts   with    one    step
            ┌──────────────────────────────────────────────  ┐
j=0  Your   │ 0.6441  0.6313  0.6217  0.3216  0.2735  0.4393 │
j=1 journey │ 0.6313  1.1124  1.0946  0.6493  0.4657  0.8602 │
j=2 starts  │ 0.6217  1.0946  1.0776  0.6373  0.4685  0.8396 │
j=3  with   │ 0.3216  0.6493  0.6373  0.3912  0.2411  0.5295 │
j=4  one    │ 0.2735  0.4657  0.4685  0.2411  0.3871  0.2315 │
j=5  step   │ 0.4393  0.8602  0.8396  0.5295  0.2315  0.7578 │
            └──────────────────────────────────────────────  ┘
```

**Interpretation:** 36 scores (6×6) representing how much the query ("journey") attends to each pair of tokens.

- [0,0] = 0.6441 (Your, Your): "journey" attending to "Your" paired with "Your"
- [0,2] = 0.6217 (Your, starts): "journey" attending to "Your" paired with "starts"
- [1,2] = 1.0946 (journey, starts): "journey" attending to "journey" paired with "starts"
- [4,5] = 0.2315 (one, step): "journey" attending to "one" paired with "step"

```
Standard:     query → token           (6 pairs)
              journey → Your, journey → starts, ...

2-Simplicial: query → (token, token)  (36 triplets)
              journey → (Your, Your), journey → (Your, journey), journey → (Your, starts), ...

```

So instead of 6 attention scores, we have 36 attention scores - one for every possible pair of tokens that the query can attend to together!
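
The double loop over token pairs can be collapsed into one `torch.einsum` call. This is a minimal sketch under the same simplification used in this section (no trainable weights, both keys of a pair taken from `inputs`); the names `attn_scores_2s_vec` and `attn_scores_2s_mm` are introduced here only for illustration.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2) # query
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

query = inputs[1] # journey (x^2)

# A[j, k] = sum_d query[d] * inputs[j, d] * inputs[k, d]
attn_scores_2s_vec = torch.einsum("d,jd,kd->jk", query, inputs, inputs)

# Equivalent form: scale every key by the query, then take all pairwise dot products
attn_scores_2s_mm = (inputs * query) @ inputs.T

print(attn_scores_2s_vec.shape)
print(torch.allclose(attn_scores_2s_vec, attn_scores_2s_mm))

# Both keys come from the same tensor here, so swapping j and k changes nothing
print(torch.allclose(attn_scores_2s_vec, attn_scores_2s_vec.T))
```

The symmetry of the score matrix is a consequence of this simplified setup, where the two keys of each pair are drawn from the same `inputs`; it would not hold in general if the two keys came from different tensors.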


#### Understanding the trilinear product (3 vectors)

In [15]:
print("\n=== 2-Simplicial Attention (trilinear product of 3 vectors) ===")
# For A[query, j=0, k=2] = trilinear(query, inputs[0], inputs[2])
j = 0  # Your
k = 2  # starts

res = 0
for idx in range(len(query)):
    res += query[idx] * inputs[j][idx] * inputs[k][idx] # element-wise multiplication of 3 vectors, then addition

print(res)
print(torch.sum(query * inputs[j] * inputs[k]))



=== 2-Simplicial Attention (trilinear product of 3 vectors) ===
tensor(0.6217)
tensor(0.6217)


#### Standard Attention (dot product - 2 vectors)

```
query = inputs[1] = [0.55, 0.87, 0.66]  (journey)
inputs[0] = [0.43, 0.15, 0.89]  (Your)

attn_score[0] = (0.43*0.55) + (0.15*0.87) + (0.89*0.66) = 0.9544
```

**Interpretation:** This is the attention score for inputs[0] (Your) with respect to inputs[1] (journey). It measures how much "journey" should attend to "Your".

#### 2-Simplicial Attention (trilinear product - 3 vectors)

```
query = inputs[1] = [0.55, 0.87, 0.66]  (journey)
inputs[0] = [0.43, 0.15, 0.89]  (Your)
inputs[2] = [0.57, 0.85, 0.64]  (starts)

attn_score_2s[0,2] = (0.55*0.43*0.57) + (0.87*0.15*0.85) + (0.66*0.89*0.64) = 0.6217
```
**Interpretation:** This is the attention score for the pair (inputs[0], inputs[2]) i.e. (Your, starts) with respect to inputs[1] (journey). It measures how much "journey" should attend to the combination of "Your" AND "starts" together.
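
The two hand calculations above are instances of the following general formulas. The symbols $q$, $k_j$, $k'_k$, $s_j$ and $A_{jk}$ are notation assumed for this note: $q$ is the query vector ("journey"), $k_j$ and $k'_k$ are rows $j$ and $k$ of `inputs`, and $D = 3$ is the embedding dimension.

$$
s_j = \sum_{d=1}^{D} q_d \, k_{j,d}
\qquad \text{(dot product, 2 vectors)}
$$

$$
A_{jk} = \sum_{d=1}^{D} q_d \, k_{j,d} \, k'_{k,d}
\qquad \text{(trilinear product, 3 vectors)}
$$

For a sequence of $T$ tokens, a single query therefore yields $T$ scores $s_j$ in standard attention and $T \times T$ scores $A_{jk}$ in the 2-simplicial case, which is where the 6 versus 36 above comes from.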


### Comparison of Text vs Point Transformer

Text Transformer Pipeline

```
Raw Text: "Your journey starts with one step"
    ↓
Tokenization: [token_1, token_2, token_3, token_4, token_5, token_6]
    ↓
Token Embedding Layer: nn.Embedding(vocab_size, d_model)
    → Each token → lookup → embedding vector
    ↓
+ Positional Encoding: (RoPE / Sinusoidal / Learned)
    ↓
Features for Attention: [feat_1, feat_2, ..., feat_6]  shape: (6, d_model)

```
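
The text pipeline above can be sketched in a few lines of PyTorch. This is an illustration only: the vocabulary size, the token IDs and the layer names are hypothetical, and of the positional encoding options listed, the sketch uses a learned absolute position embedding.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical sizes and token IDs, chosen only for illustration
vocab_size = 100
context_length = 6
d_model = 3
token_ids = torch.tensor([10, 20, 30, 40, 50, 60]) # stand-ins for the 6 tokens

tok_emb_layer = nn.Embedding(vocab_size, d_model)     # token ID -> embedding vector
pos_emb_layer = nn.Embedding(context_length, d_model) # position index -> learned vector

tok_emb = tok_emb_layer(token_ids)                    # shape: (6, d_model)
pos_emb = pos_emb_layer(torch.arange(len(token_ids))) # shape: (6, d_model)

# Features for attention: token embedding plus positional embedding
features = tok_emb + pos_emb
print(features.shape)
```

The resulting `features` tensor has shape `(6, d_model)`, the same role that the hand-written `inputs` tensor plays earlier in this chapter.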

Point Transformer V3 Pipeline

```
Raw Point Cloud: N points with (x, y, z, r, g, b) or (x, y, z, intensity)
    ↓
Voxelization / Grid Sampling: 
    → coord: (N, 3) - 3D coordinates
    → feat: (N, C) - initial features (RGB, intensity, normals)
    ↓
Serialization: (Z-order / Hilbert curve)
    → Converts unordered points to ordered sequence
    ↓
Embedding Layer (Sparse Conv): 
    → Projects initial features to higher dimension
    → Acts like "token embedding" for point clouds
    → Also serves as positional encoding (xCPE - enhanced Conditional Position Encoding)
    ↓
Features for Attention: [feat_1, feat_2, ..., feat_N]  shape: (N, d_model)
```
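
To make the grid sampling and serialization steps concrete, here is a simplified sketch of Z-order (Morton) serialization. It is not the Point Transformer V3 implementation: the `grid_size` value, the helper `z_order_code` and the choice of which axis takes the lowest bit are assumptions made for illustration, and it ignores batching and the Hilbert curve variant.

```python
import torch

# Six example points (x, y, z)
coord = torch.tensor([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
])

grid_size = 0.5 # hypothetical voxel edge length
grid_coord = torch.floor(coord / grid_size).long() # (N, 3) integer voxel indices


def z_order_code(grid_coord: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Interleave the bits of (x, y, z) into one Morton code per point."""
    x, y, z = grid_coord[:, 0], grid_coord[:, 1], grid_coord[:, 2]
    code = torch.zeros(grid_coord.shape[0], dtype=torch.long)
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)     # x takes bit positions 0, 3, 6, ...
        code |= ((y >> b) & 1) << (3 * b + 1) # y takes bit positions 1, 4, 7, ...
        code |= ((z >> b) & 1) << (3 * b + 2) # z takes bit positions 2, 5, 8, ...
    return code


codes = z_order_code(grid_coord)
order = torch.argsort(codes) # serialized order of the points

print(codes)
print(order)
print(coord[order]) # points rearranged into a 1D sequence
```

Sorting by the interleaved code turns the unordered point set into a sequence in which points that are close in 3D tend to be close in the sequence, which is what lets attention be applied to it in the same way as to a token sequence.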



In [1]:
import torch

# === RAW POINT CLOUD INPUT ===
# 6 points from a simple scene

# coord: 3D spatial coordinates (x, y, z) - WHERE the point is in space
coord = torch.tensor([
    [0.0, 0.0, 0.0],   # Point 1: floor corner
    [1.0, 0.0, 0.0],   # Point 2: floor edge        # query point
    [1.0, 1.0, 0.0],   # Point 3: floor edge
    [0.0, 0.0, 1.0],   # Point 4: wall corner
    [1.0, 0.0, 1.0],   # Point 5: wall edge
    [0.5, 0.5, 0.5]    # Point 6: mid-air point
])

# Initial features (e.g., RGB color, intensity) - raw input
raw_feat = torch.tensor([
    [0.8, 0.2, 0.1],   # Point 1: reddish
    [0.7, 0.3, 0.2],   # Point 2: reddish
    [0.6, 0.4, 0.3],   # Point 3: brownish
    [0.9, 0.9, 0.9],   # Point 4: white (wall)
    [0.85, 0.85, 0.85],# Point 5: white (wall)
    [0.1, 0.5, 0.8]    # Point 6: bluish (object)
])

# === AFTER SPARSE CONV EMBEDDING (like token embedding + pos encoding) ===
# In PTv3, a sparse conv layer projects raw_feat to higher dimension
# and implicitly encodes positional information from coord

# Simulated output after sparse conv embedding layer
# This is what actually goes into attention!
feat = torch.tensor([
    [0.43, 0.15, 0.89],  # Point 1: embedded feature
    [0.55, 0.87, 0.66],  # Point 2: embedded feature  # query point
    [0.57, 0.85, 0.64],  # Point 3: embedded feature
    [0.22, 0.58, 0.33],  # Point 4: embedded feature
    [0.77, 0.25, 0.10],  # Point 5: embedded feature
    [0.05, 0.80, 0.55]   # Point 6: embedded feature
])

# This "feat" is analogous to your text "inputs" tensor!
# Now attention operates on these embedded features

![Point Transformer v3 vs Text data](/DATA/pyare/Routine/LLM/Reasoning/LLMs-from-scratch-pyare/chapter-3-coding-attention/Ref_images/PTv3_vs_Text.png)

## Attention Weights

- We normalize each of the attention scores we computed previously to obtain attention weights that sum to 1.