```{contents}
```

## Positional Encodings

Transformers use **self-attention**, which processes all tokens **in parallel** and, on its own, is **blind to the order** in which they appear.

Unlike RNNs or CNNs:

* There is **no recurrence** (no left→right order)
* There is **no convolution window** (no locality)
* Tokens have **no inherent notion of position**

This means:

```
["The", "cat", "sat"]
```

and

```
["sat", "cat", "The"]
```

produce the **same attention outputs**, merely reordered, because each word has the same embedding in both sequences.
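The claim can be checked numerically. The sketch below is a minimal single-head attention with random weights; the helper `self_attention` and the random stand-in embeddings are illustrative assumptions, not part of any library. Reversing the input rows should only reverse the output rows.

```python
import torch

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

d_model = 8
# Random stand-ins for the embeddings of "The", "cat", "sat"
x = torch.randn(3, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))

perm = torch.tensor([2, 1, 0])                      # reverse the sentence
out_original = self_attention(x, w_q, w_k, w_v)
out_reversed = self_attention(x[perm], w_q, w_k, w_v)

# Each token receives the same output vector, just in the new row order
same_up_to_order = torch.allclose(out_original[perm], out_reversed, atol=1e-4)
print(same_up_to_order)
```

This property is usually called permutation equivariance: the layer has no way to tell the two orderings apart.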

### Problem:

**Self-attention alone cannot distinguish one ordering of a sequence from another.**
Without extra information, the model cannot tell:

* which word comes first
* which word comes after another
* how far apart two words are
* grammatical structure that depends on word order

Thus, Transformers need a mechanism to inject **order information** into token embeddings.

---

### What Are Positional Encodings?

Positional encodings are **vectors added to token embeddings** to give the model information about **word positions in a sequence**.

If:

* token embedding = *content meaning*
* positional encoding = *position meaning*

Then:

```
final_embedding = token_embedding + positional_encoding
```

This preserves:

* semantic meaning (from token)
* sequence order (from position)

---

### Types of Positional Encodings

#### **Absolute Sinusoidal Positional Encoding** (Original Transformer)

Uses **sinusoidal patterns**:

For each position `pos` and pair index `i` (0 ≤ i < d/2, where `d` is the embedding dimension):

$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$
$$
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$

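**Why sinusoidal?**

* The encoding is defined for every position, so it may let the model extrapolate to sequences longer than those seen in training
* The wavelengths form a geometric progression from 2π to 10000·2π, so each dimension pair encodes position at a different scale
* For a fixed offset `k`, `PE(pos + k)` is a linear function of `PE(pos)`, which makes relative distances easy for the model to use

Intuition:

* fast-oscillating dimensions distinguish neighboring positions, while slow-oscillating ones distinguish distant positions
* shifting the position by `k` rotates each (sin, cos) pair by a fixed angle that depends only on `k`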
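The relative-offset property follows from the angle-addition identities for sine and cosine. The sketch below checks it numerically using only the standard library; `pe_pair` and `shift_pair` are hypothetical helper names introduced for this illustration.

```python
import math

def pe_pair(pos, i, d_model):
    """(sin, cos) pair of the sinusoidal encoding at position pos and pair index i."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    return math.sin(omega * pos), math.cos(omega * pos)

def shift_pair(pair, k, i, d_model):
    """Rotate a (sin, cos) pair by the angle for offset k; pos is never used."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    s, c = pair
    return (s * math.cos(omega * k) + c * math.sin(omega * k),
            c * math.cos(omega * k) - s * math.sin(omega * k))

d_model, k = 16, 7
max_error = 0.0
for pos in (0, 3, 50):
    for i in range(d_model // 2):
        shifted = shift_pair(pe_pair(pos, i, d_model), k, i, d_model)
        target = pe_pair(pos + k, i, d_model)
        max_error = max(max_error,
                        abs(shifted[0] - target[0]),
                        abs(shifted[1] - target[1]))

print(max_error < 1e-9)
```

Because the rotation depends only on `k` and `i`, the same linear map moves any position forward by `k` steps.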

---

#### **Learned Positional Embeddings**

Instead of predefined sin/cos, the model learns a position embedding table:

```
position_embedding = nn.Embedding(max_length, embedding_dim)
```

This is used in BERT, GPT, RoBERTa, etc.
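One practical consequence of a lookup table is a hard length limit: the table has a row only for positions `0` to `max_length - 1`. The sketch below uses deliberately small, made-up sizes and shows that a longer sequence is expected to fail at lookup time. The exact exception type may vary by device and PyTorch version, so both common ones are caught.

```python
import torch
import torch.nn as nn

max_len, d_model = 8, 4                              # illustrative sizes
pos_embed = nn.Embedding(max_len, d_model)

in_range = pos_embed(torch.arange(max_len))          # positions 0..7 have rows

try:
    pos_embed(torch.arange(max_len + 1))             # position 8 has no row
    failed = False
except (IndexError, RuntimeError):
    failed = True

print(in_range.shape, failed)
```

Sinusoidal encodings do not have this limit, because they can be computed for any position on demand.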

---

#### **Relative Positional Encodings**

Used in Transformer-XL, T5, and modern LLMs.

These represent **distance between tokens**, not absolute positions.

Example:

* “7 tokens away” matters more than “token #73”

These tend to work better for:

* sequences longer than those seen during training
* reusing cached states across segments, as in Transformer-XL
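A minimal sketch of the idea, assuming the simplest possible scheme: one learned scalar per clipped relative offset, added to the attention scores. The class `RelativePositionBias` is a hypothetical illustration. It does not reproduce T5's bucketing or Transformer-XL's formulation.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Hypothetical helper: one learned scalar per clipped relative offset."""
    def __init__(self, max_distance):
        super().__init__()
        self.max_distance = max_distance
        # Offsets are clipped to [-max_distance, max_distance]
        self.bias = nn.Embedding(2 * max_distance + 1, 1)

    def forward(self, seq_len):
        positions = torch.arange(seq_len)
        # rel[i, j] = j - i, the signed offset from query i to key j
        rel = positions.unsqueeze(0) - positions.unsqueeze(1)
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).squeeze(-1)            # (seq_len, seq_len)

torch.manual_seed(0)
bias_layer = RelativePositionBias(max_distance=4)
scores = torch.randn(6, 6)                           # stand-in for q @ k.T / sqrt(d)
bias = bias_layer(6)
attn = torch.softmax(scores + bias, dim=-1)          # position-aware attention weights
print(bias.shape)
```

Every pair of tokens with the same offset shares one bias value, so nothing in the layer depends on absolute position.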

---

### PyTorch Implementation: Absolute Sinusoidal Positional Encodings

#### Minimal and faithful version:

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(d_model).unsqueeze(0)                          # (1, d_model)

    # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model)
    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                                      # (seq_len, d_model)

    # Apply sin to even indices (0,2,4,...), cos to odd (1,3,5,...)
    pe = torch.zeros_like(angles)
    pe[:, 0::2] = torch.sin(angles[:, 0::2])
    pe[:, 1::2] = torch.cos(angles[:, 1::2])

    return pe

# Example usage
seq_len = 10
d_model = 16

pe = sinusoidal_positional_encoding(seq_len, d_model)
print(pe.shape)     # torch.Size([10, 16])
print(pe)           # positional encodings
```

You would **add** these to token embeddings:

```python
x = torch.randn(seq_len, d_model)
x = x + pe
```
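In a model, the table is usually computed once and stored on the module rather than rebuilt on every call. The following is one possible sketch, assuming batched input of shape `(batch, seq_len, d_model)` and an even `d_model`. The class name `SinusoidalPositionalEncoding` is chosen for this example.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        # One frequency per (sin, cos) pair: 10000^(-2i / d_model)
        two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
        freqs = torch.exp(-math.log(10000.0) * two_i / d_model)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * freqs)
        pe[:, 1::2] = torch.cos(pos * freqs)
        # A buffer is saved and moved with the module but is not trained
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); the pe slice broadcasts over the batch
        return x + self.pe[: x.size(1)]

layer = SinusoidalPositionalEncoding(max_len=50, d_model=16)
x = torch.randn(2, 10, 16)
out = layer(x)
print(out.shape)
```

Registering the table as a buffer keeps it out of the optimizer while still letting `.to(device)` move it together with the rest of the model.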

---

### PyTorch Implementation: Learned Positional Embeddings

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        # One trainable vector per position; seq_len must not exceed max_len
        self.pos_embed = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # positions: [0, 1, ..., seq_len - 1], shape (1, seq_len)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        pos_encoding = self.pos_embed(positions)      # (1, seq_len, d_model)
        # The positional term broadcasts over the batch dimension
        return x + pos_encoding

# Example
batch_size = 2
seq_len = 5
d_model = 32

x = torch.randn(batch_size, seq_len, d_model)
pe_layer = LearnedPositionalEncoding(max_len=100, d_model=d_model)
out = pe_layer(x)
print(out.shape)   # torch.Size([2, 5, 32])
```

---

### Demonstration: Why Positional Encoding Helps

Without positional encoding:

```
"The dog chased the cat"
"The cat chased the dog"
```

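After embedding and self-attention, both sentences yield the same set of token representations because:

* self-attention sees which tokens are present but not their order
* there is no directional flow of information
* nothing marks which noun is the subject and which is the object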

With positional encoding added:

```
emb("The") + pos[0]
emb("dog") + pos[1]
emb("chased") + pos[2]
...
```

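The model can now learn patterns such as:

* the noun before the verb is usually the subject (here, position 1)
* the verb tends to follow it (here, position 2)
* relative relationships between tokens
* dependency chains

This makes order-dependent grammar learnable, which correct generation requires.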
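The two sentences can be compared directly. The sketch below is a toy setup with a four-word vocabulary, random weights, and mean pooling over one attention layer. All of these are illustrative assumptions, not a trained model. The pooled vectors should match without positional encodings and differ once they are added.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, two_i / d_model)     # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return torch.softmax(q @ k.T / (k.shape[-1] ** 0.5), dim=-1) @ v

vocab = {"the": 0, "dog": 1, "chased": 2, "cat": 3}
s1 = torch.tensor([vocab[w] for w in "the dog chased the cat".split()])
s2 = torch.tensor([vocab[w] for w in "the cat chased the dog".split()])

d_model = 16
embed = nn.Embedding(len(vocab), d_model)
weights = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(3)]
pe = sinusoidal_pe(5, d_model)

with torch.no_grad():
    # Mean-pool the attention outputs into one vector per sentence
    no_pe_1 = self_attention(embed(s1), *weights).mean(dim=0)
    no_pe_2 = self_attention(embed(s2), *weights).mean(dim=0)
    pe_1 = self_attention(embed(s1) + pe, *weights).mean(dim=0)
    pe_2 = self_attention(embed(s2) + pe, *weights).mean(dim=0)

same_without_pe = torch.allclose(no_pe_1, no_pe_2, atol=1e-4)
same_with_pe = torch.allclose(pe_1, pe_2, atol=1e-4)
print(same_without_pe, same_with_pe)
```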

---

**Summary**

| Feature                | Self-Attention Alone       | With Positional Encoding  |
| ---------------------- | -------------------------- | ------------------------- |
| Order awareness        | None                       | Yes                       |
| Order-dependent syntax | Not learnable              | Learnable                 |
| Parallelism            | High                       | High (unchanged)          |
| Long-range modeling    | Strong, but distance-blind | Strong and distance-aware |

Positional encodings **solve the core limitation of attention**:
the inability to encode order on its own.

