<a href="https://colab.research.google.com/github/shreyans-sureja/llm-101/blob/main/part5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Position Embeddings

1. Until now, we have looked only at token embeddings.

**Issue** - the two sentences below use the same words in a different order, but token embeddings alone carry no information about that order:

*   The cat sat on the mat
*   On the mat the cat sat


2. In the embedding layer, the same token ID is always mapped to the same vector representation, regardless of where the token is positioned in the input sequence.

3. It is therefore helpful to inject additional position information into the LLM.
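The point about identical vectors can be checked with a small sketch. The layer size, token IDs, and variable names below are toy values chosen for illustration, not the GPT-2 settings used later in this notebook.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Toy sizes for illustration (assumption: 10 tokens, 4 dimensions).
toy_embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

# Token ID 5 appears at position 0 and again at position 2.
token_ids = torch.tensor([5, 9, 5])
vectors = toy_embedding(token_ids)

# Both occurrences of ID 5 receive exactly the same vector,
# so the embedding layer alone cannot tell the two positions apart.
same_vector = torch.equal(vectors[0], vectors[2])
print(same_vector)
```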

There are two types of positional embeddings.

1.   Absolute - For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. Positional vectors have the same dimension as the original token embeddings.
2.   Relative - The emphasis is on the relative position, or distance, between tokens. The model learns relationships in terms of "how far apart" rather than "at which exact position".
    - Advantage: the model can generalize better to sequences of varying lengths, even lengths it has not seen during training.
    - Longer sequences also tend to be handled better with relative positional embeddings.



Both types of positional embeddings enable an LLM to understand the order of, and relationships between, tokens, which supports accurate and context-aware predictions.

The choice between the two depends on the specific application and the nature of the data being processed.

Absolute - suitable when the fixed order of tokens is crucial, such as sequence generation (GPT models are trained with absolute positional embeddings).

Relative - suitable for tasks like language modeling over long sequences, where the same phrase can appear in different parts of the sequence.

# Absolute vs Relative Positional Embeddings

---

## 🔹 Absolute Positional Embeddings
**Definition**: Assigns each position in the sequence a unique encoding (added to token embeddings).  
- **Fixed (sinusoidal)**: deterministic function of position.  
- **Learned**: trainable embedding per position index.
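The two variants can be sketched side by side. The fixed version below assumes the sine/cosine formulation from the original Transformer paper (sine on even dimensions, cosine on odd ones, base 10000); the function name `sinusoidal_positions` and the toy sizes are illustrative, not part of this notebook.

```python
import math
import torch

def sinusoidal_positions(num_positions, dim):
    """Return a [num_positions, dim] tensor of fixed encodings (dim must be even)."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # [num_positions, 1]
    # Frequencies 1 / 10000^(2i / dim) for i = 0 .. dim/2 - 1
    inv_freq = torch.exp(
        -math.log(10000.0) * torch.arange(0, dim, 2, dtype=torch.float32) / dim
    )
    angles = positions * inv_freq  # [num_positions, dim/2]
    encoding = torch.zeros(num_positions, dim)
    encoding[:, 0::2] = torch.sin(angles)  # even dimensions
    encoding[:, 1::2] = torch.cos(angles)  # odd dimensions
    return encoding

# Fixed: computed from the position index, no trainable parameters.
fixed = sinusoidal_positions(num_positions=4, dim=8)

# Learned: one trainable row per position index.
learned = torch.nn.Embedding(4, 8)(torch.arange(4))

print(fixed.shape, learned.shape)
```

Both tensors have the same shape, so either one can be added to token embeddings of matching dimension.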

### ✅ Pros
- Simple to implement.  
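- Fixed sinusoidal → can be computed for any position, so it can in principle extend to longer sequences (quality beyond the training length is not guaranteed).  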
- Stable and widely adopted in early models (BERT, GPT-2).

### ❌ Cons
- Position is treated as a *global index* (e.g., “position 7”), not relative to others.  
- Learned version fails to generalize beyond max training length.  
- Less expressive for local dependencies (e.g., “token just before this one”).
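The length limit of the learned variant is easy to see in a toy setting: a learned table simply has no row for positions past its size. This is a minimal sketch that assumes a table of 4 positions and runs on CPU; the names are illustrative.

```python
import torch

max_positions = 4  # toy "training" length, chosen for illustration
learned_positions = torch.nn.Embedding(max_positions, 8)

# Positions 0..3 exist in the table.
in_range = learned_positions(torch.arange(max_positions))
print(in_range.shape)

# Position 4 has no row, so the lookup fails on CPU with an IndexError.
try:
    learned_positions(torch.tensor([max_positions]))
    lookup_failed = False
except IndexError:
    lookup_failed = True

print(lookup_failed)
```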

### 📌 Where Used
- NLP models with bounded input length (BERT, GPT-2).  
- Tasks where **absolute position matters** (e.g., machine translation, classification).  

---

## 🔹 Relative Positional Embeddings
**Definition**: Encodes *distance between tokens* directly into the attention mechanism.  
- Example: “this token is 2 steps behind” instead of “I’m at index 7”.
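As a rough illustration of "distance between tokens", the sketch below builds the matrix of pairwise offsets. In real models these offsets typically index a learned bias or embedding inside attention; that part is omitted here, and `relative_offsets` is a hypothetical helper name.

```python
import torch

def relative_offsets(seq_len):
    """Return a [seq_len, seq_len] matrix whose entry (i, j) is j - i."""
    positions = torch.arange(seq_len)
    return positions.unsqueeze(0) - positions.unsqueeze(1)

offsets = relative_offsets(4)
print(offsets)

# Entry (2, 0) is -2: seen from token 2, token 0 is "2 steps behind".
# The same offsets reappear for a longer sequence, which is why relative
# schemes do not depend on the absolute index of a token.
longer = relative_offsets(6)
print(torch.equal(longer[:4, :4], offsets))
```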

### ✅ Pros
- Captures **local order relations** naturally.  
- Generalizes better to longer/unseen sequences.  
- Strong performance in long-context tasks.  
- More robust to shifts in input.  

### ❌ Cons
- More complex to implement (modifies attention calculation).  
- Slightly higher compute cost.  
- Can overweight nearby positions if not balanced.

### 📌 Where Used
- Long-context or extrapolative tasks (Transformer-XL, T5, DeBERTa, LLaMA).  
- Language modeling, document QA, music/audio, protein sequences.  

---

## 🔹 Quick Rule of Thumb
- **Absolute** → use if you want **simplicity** and your task has a fixed maximum input length.  
- **Relative** → use if you want **scalability**, **generalization to longer sequences**, or your task is highly dependent on relative order.  


### Practical

In [3]:
vocab_size = 50257
output_dim = 256

In [5]:
import torch

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [9]:
print(token_embedding_layer)

Embedding(50257, 256)


In [10]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
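To see what the sliding window produces, here is a small standalone sketch that uses the same loop bounds as `GPTDatasetV1` on a plain list of integers. The helper name `sliding_windows` and the toy IDs are hypothetical and only serve the illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is the input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10, 20))  # hypothetical token IDs 10..19
pairs = sliding_windows(toy_ids, max_length=4, stride=4)

for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
```

With `stride` equal to `max_length`, the input chunks do not overlap; a smaller stride would produce overlapping windows.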

In [11]:
import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [13]:
import requests

url = "https://raw.githubusercontent.com/shreyans-sureja/llm-101/main/data/the-verdict.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
raw_text = response.text

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [14]:
max_length = 4

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [15]:
print("Token IDs:\n", inputs)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])


In [17]:
print(inputs.shape)

torch.Size([8, 4])


In [22]:
print(token_embedding_layer(torch.tensor(1)).shape)

torch.Size([256])


In [23]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])
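The two shapes above follow from how an embedding layer works: it is a lookup into its weight matrix, applied independently to every ID. A minimal sketch with toy sizes (assumed for illustration only):

```python
import torch

toy_layer = torch.nn.Embedding(6, 3)  # toy sizes: 6 token IDs, 3 dimensions

# A single ID returns one row of the weight matrix.
row = toy_layer(torch.tensor(1))
matches_row = torch.equal(row, toy_layer.weight[1])

# A [2, 4] batch of IDs returns a [2, 4, 3] tensor: one vector per token.
batch = torch.tensor([[0, 1, 2, 3], [3, 2, 1, 0]])
batch_vectors = toy_layer(batch)

print(matches_row, batch_vectors.shape)
```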


Now we need to add positional embeddings to these token embeddings.

To do so, we create a second embedding layer for the positional encoding.

Here max_length is 4, so each input sequence contains only 4 token vectors. We therefore need a positional embedding tensor of shape [4, 256]: one 256-dimensional vector per position.

In [24]:
context_length = max_length
# One learnable vector per position (0 .. context_length - 1)
positional_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [25]:
print(positional_embedding_layer)

Embedding(4, 256)


In [27]:
positional_embedding = positional_embedding_layer(torch.arange(max_length))
print(positional_embedding.shape)

torch.Size([4, 256])


In [31]:
print(positional_embedding[1])

tensor([-0.4954, -0.0626, -1.9312,  1.2758, -0.4081,  0.5276,  1.0432, -1.1776,
         0.8742,  1.9163,  1.0273,  1.2036,  1.9936, -0.3877, -0.7666,  0.1392,
        -0.0487, -0.9630, -1.0842,  0.1706,  1.2167,  2.7928,  0.7048, -0.3764,
        -0.5912, -0.6708, -0.6660,  0.4840,  0.3936, -1.1552,  1.0938, -1.9101,
        -1.0789,  1.6486,  1.5196, -0.3195,  0.4708, -1.7364,  0.0060,  0.1909,
         0.4287,  0.9278,  0.6019,  0.5021, -0.8357, -1.1665,  0.4483, -0.8482,
        -1.6166,  1.2787, -0.6010, -0.7984,  0.9516, -1.1353,  0.8852,  1.5233,
         0.2669, -0.4390, -0.6122,  1.1514,  0.8585,  0.5500,  0.1262,  0.0712,
         0.1361, -0.6033,  1.2824,  0.7581,  0.6890, -0.6455, -0.1948, -0.0847,
         2.3061,  0.0301,  0.1472, -0.8104,  2.0441, -0.4239, -0.0200, -2.0380,
        -0.7801, -1.4706,  0.2495,  0.5418,  0.0195,  1.2017, -1.6508,  1.1710,
        -0.1590, -2.2909, -0.1686,  0.5532, -1.3528,  0.0075,  1.4115,  0.7623,
         1.6240, -0.1515, -2.0096, -0.14... (output truncated)

In [33]:
# PyTorch broadcasting: the [4, 256] positional embeddings are added to each of the 8 sequences in the batch
input_embeddings = token_embeddings + positional_embedding
print(input_embeddings.shape)

torch.Size([8, 4, 256])
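The addition works because the positional tensor has no batch dimension and is broadcast across it. The sketch below uses small, hand-picked shapes and zero-valued token vectors (both assumptions for readability) to show that every sequence in the batch receives the same positional values.

```python
import torch

# [batch, seq_len, dim]; zeros so the sum equals the positional part.
tokens = torch.zeros(2, 3, 4)

# [seq_len, dim]; distinct values so each position is recognizable.
positions = torch.arange(3 * 4, dtype=torch.float32).reshape(3, 4)

# positions is broadcast across the batch dimension.
combined = tokens + positions
print(combined.shape)

same_for_every_sequence = torch.equal(combined[0], combined[1])
print(same_for_every_sequence)
```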
