```{contents}
```
## Rotary Positional Embeddings (RoPE)

**Rotary Embeddings** are a method of encoding token positions directly into the **attention mechanism** by rotating the query and key vectors based on their position.

They are the positional encoding used in many modern LLMs, including **LLaMA, GPT-NeoX, Mistral, Falcon, and PaLM**.

---

### **Core Intuition**

Language understanding depends heavily on **relative position**:
how far apart two tokens are often matters more than their absolute index.

RoPE injects position information by **rotating vectors in a geometric space**, so that the dot-product attention score naturally encodes relative distance.

> **Position becomes part of the geometry of attention.**

---

### **How It Works (Conceptually)**

Each pair of dimensions in a query or key vector is treated as a 2D vector and **rotated** by an angle proportional to the token’s position. Each pair has its own frequency $\theta$, so different pairs rotate at different speeds.

For position $p$:

$$
R(p) =
\begin{pmatrix}
\cos(\theta p) & -\sin(\theta p) \\
\sin(\theta p) & \cos(\theta p)
\end{pmatrix}
$$

This rotation is applied to the query and key vectors **before** computing attention.
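
The rotation can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `rope_angles` and `apply_rope` are hypothetical, the frequency schedule `base ** (-2i / dim)` with `base = 10000` is assumed from the original RoFormer formulation, and adjacent dimensions are assumed to form each pair (some implementations instead pair dimension `i` with `i + dim/2`).

```python
import numpy as np

def rope_angles(seq_len: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Return angles of shape (seq_len, dim // 2), where angles[p, i] = p * theta_i."""
    # Assumed schedule: theta_i = base ** (-2i / dim), one frequency per dimension pair.
    theta = base ** (-np.arange(0, dim, 2) / dim)
    positions = np.arange(seq_len)
    return np.outer(positions, theta)

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) dimension pair of x, shape (seq_len, dim), by its position angle."""
    x = np.asarray(x, dtype=float)
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("dim must be even so that dimensions can be paired")
    angles = rope_angles(seq_len, dim, base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to every pair at once.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because each pair is only rotated, the vector at position 0 is left unchanged and the norm of every row is preserved.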

As a result, the dot product between a query and a key depends on their positions only through the **relative distance** between them.
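
The step from rotation to relative distance can be made explicit. Here $q$ and $k$ denote one 2D pair of a query at position $m$ and a key at position $n$ (symbols introduced only for this illustration). Rotation matrices satisfy $R(a)^\top = R(-a)$ and $R(a)\,R(b) = R(a+b)$, so:

$$
\langle R(m)\,q,\; R(n)\,k \rangle
= q^\top R(m)^\top R(n)\, k
= q^\top R(n-m)\, k
$$

The absolute positions $m$ and $n$ cancel, and only the offset $n - m$ remains in the score.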

---

### **Why RoPE Is Powerful**

| Property                           | Benefit                     |
| ---------------------------------- | --------------------------- |
| Encodes relative positions         | Better long-range reasoning |
| No extra parameters                | Memory efficient            |
| Extensible to longer context       | Supports long sequences, usually with frequency scaling |
| Integrates directly into attention | No additive embeddings      |

---

### **Architecture Placement**

```
Token Embedding
   ↓
Linear Projection → Q, K, V
   ↓
Apply Rotary Rotation to Q and K (position-dependent)
   ↓
Attention(Q, K, V)
```
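
The placement above can be sketched as a single attention head in NumPy. This is an illustrative sketch under stated assumptions: `rope_complex` and `rope_attention` are hypothetical names, the projection matrices are passed in explicitly, there is no causal mask, and the rotation is written in the equivalent complex-number form (each dimension pair is treated as a complex number and multiplied by a unit-magnitude phase).

```python
import numpy as np

def rope_complex(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate adjacent dimension pairs of x, shape (seq_len, dim), by position-dependent angles."""
    dim = x.shape[1]
    theta = base ** (-np.arange(0, dim, 2) / dim)       # assumed frequency schedule, shape (dim/2,)
    phase = np.exp(1j * np.outer(positions, theta))     # unit complex numbers, shape (seq_len, dim/2)
    z = (x[:, 0::2] + 1j * x[:, 1::2]) * phase          # complex multiplication is a 2D rotation
    out = np.empty_like(x, dtype=float)
    out[:, 0::2], out[:, 1::2] = z.real, z.imag
    return out

def rope_attention(tokens, w_q, w_k, w_v, positions):
    """Single-head, unmasked attention with RoPE applied to Q and K only."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v  # linear projections
    q = rope_complex(q, positions)                      # rotate queries
    k = rope_complex(k, positions)                      # rotate keys; V is left unrotated
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Calling `rope_attention` with `positions = np.arange(n)` and again with `positions = np.arange(n) + 100` should produce the same attention weights, since only position differences enter the scores.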

---

### **Applications**

#### Large Language Models

GPT-NeoX, LLaMA, Mistral, Falcon, PaLM

#### Long-Context Modeling

Documents, books, codebases

#### Multimodal Transformers

Vision-language models and audio models

#### Efficient Attention Systems

Helps preserve ordering information without large embedding tables

---

### **RoPE vs Traditional Positional Encoding**

| Feature                      | Sinusoidal | Learned | RoPE   |
| ---------------------------- | ---------- | ------- | ------ |
| Encodes relative position    | Weak       | Weak    | Strong |
| Extrapolates to long context | Poor       | Poor    | Moderate (better with frequency scaling) |
| Integrated into attention    | No         | No      | Yes    |
| Parameter free               | Yes        | No      | Yes    |
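
The "Encodes relative position" row can be illustrated with a toy comparison. This sketch makes several simplifying assumptions: vectors are 2D with a single hypothetical frequency `THETA`, the additive encoding is added directly to the query and key (skipping the projections a real model would apply), and all function names are illustrative.

```python
import numpy as np

THETA = 0.5  # hypothetical single frequency for a 2D toy example

def rot(p: float) -> np.ndarray:
    """2x2 rotation matrix R(p) for position p."""
    a = THETA * p
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def pe(p: float) -> np.ndarray:
    """Additive sinusoidal encoding for position p (toy 2D version)."""
    a = THETA * p
    return np.array([np.sin(a), np.cos(a)])

def score_additive(q, k, m, n) -> float:
    """Dot product after adding position encodings to the query and key."""
    return float((q + pe(m)) @ (k + pe(n)))

def score_rope(q, k, m, n) -> float:
    """Dot product after rotating the query and key by their positions."""
    return float((rot(m) @ q) @ (rot(n) @ k))

q = np.array([1.0, 0.5])
k = np.array([-0.3, 2.0])

# Same offset (n - m = 3) at two different absolute locations.
rope_near, rope_far = score_rope(q, k, 2, 5), score_rope(q, k, 102, 105)
add_near, add_far = score_additive(q, k, 2, 5), score_additive(q, k, 102, 105)
print("RoPE scores match:    ", np.isclose(rope_near, rope_far))
print("Additive scores match:", np.isclose(add_near, add_far))
```

In this toy setting the RoPE score is unchanged when both positions shift by the same amount, while the additive score changes because of the cross terms between content and position vectors.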

---

### **Intuition Summary**

Rotary embeddings allow attention to **feel distance and direction between tokens** by turning position into geometry.

