# Appendix: Advanced Positional Embeddings for Long-Context Generalization

This appendix extends the material from Chapters 01–10 by surveying three modern positional embedding strategies designed to enhance transformer performance on long contexts: **No Position Embedding (NoPE)**, **Rotary Position Embedding (RoPE)**, and **Yet another RoPE extension (YaRN)**. We cover the intuition, mathematical formulation, and practical considerations for each method, and we conclude with minimal reference implementations.

## Motivation for Alternative Positional Embeddings

Classical transformers rely on either learned absolute position embeddings or deterministic sinusoidal embeddings (Vaswani et al., 2017). These approaches bind each token to a unique absolute index, which limits extrapolation beyond the context window seen during training. Long-context applications—retrieval-augmented generation, document understanding, and code completion—demand inductive biases that extrapolate gracefully when sequences exceed training lengths.

Modern position embedding schemes focus on either *relative* or *functionally continuous* encodings, preserving translational invariance or enabling smooth extension to longer contexts. Below we detail three representative methods that illustrate these principles.

## 1. No Position Embedding (NoPE)

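NoPE removes **explicit** positional encodings from the token embeddings. In the variant described here, position enters only through a bias added to the attention logits, in the spirit of ALiBi (Press et al., 2021); stricter formulations drop the bias as well and rely on the causal mask alone. The key insight is that attention itself can capture sequential structure when the model is trained with causal masking and left-to-right decoding.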

### Theory
For a standard transformer layer, the attention logits are

$$\text{Attn}(Q, K) = \frac{QK^\top}{\sqrt{d_k}} + B,$$

where $B$ optionally encodes relative position biases. In NoPE, no positional encoding is added to the token embeddings. Instead, a bias term $b_{ij}$ is added directly to the attention logit between query position $i$ and key position $j$:

$$\text{Attn}(Q, K)_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}} + b_{ij}.$$

The bias $b_{ij}$ depends only on the relative distance $(i - j)$, so the same value is shared by all position pairs at a given offset. This sharing is what enables extrapolation to longer contexts; ALiBi (Press et al., 2021), for example, uses a fixed linear penalty $b_{ij} = -m\,|i - j|$ with a per-head slope $m$.

### Practical Considerations
* Works best when paired with monotonic attention biases such as ALiBi.
* Requires carefully initializing or regularizing biases to prevent degenerate solutions where positional information is lost.
* Enables models to generalize beyond training lengths because the learned bias function can be evaluated at larger relative distances.
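
The extrapolation claim in the last bullet can be checked directly. The sketch below uses only the Python standard library and assumes an ALiBi-style linear bias with the geometric slope schedule $2^{-8h/H}$ for head $h$ of $H$; the helper names `alibi_slopes` and `relative_bias` are illustrative, not taken from any library.

```python
def alibi_slopes(num_heads):
    """Geometric per-head slopes 2^(-8h/num_heads) for h = 1..num_heads.

    Assumption: this is the schedule used when num_heads is a power of two.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def relative_bias(seq_len, slope):
    """Causal bias b[i][j] = -slope * (i - j) for j <= i, and -inf otherwise."""
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(num_heads=8)
short = relative_bias(seq_len=4, slope=slopes[0])
long = relative_bias(seq_len=8, slope=slopes[0])

# The bias is a function of (i - j) only, so the short matrix is exactly the
# top-left block of the long one: nothing has to be resized or re-learned.
top_left = [row[:4] for row in long[:4]]
print(top_left == short)  # prints True
```

Because the bias is generated from the offset rather than stored per position, moving to a longer sequence only means evaluating the same function at larger distances.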

## 2. Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) introduces relative positions by rotating query and key vectors in complex space. Each dimension pair is treated as a complex number and rotated by an angle proportional to the token index.

### Theory
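Given position $n$ and pair index $i \in \{0, \dots, d/2 - 1\}$, define frequencies $\theta_i = \omega^{-2i/d}$ with $\omega$ as a base (commonly $10{,}000$). Represent a 2-D feature pair $(x_{2i}, x_{2i+1})$ as the complex number $z_i = x_{2i} + j x_{2i+1}$, where $j$ is the imaginary unit. RoPE applies a rotation:

$$\text{RoPE}(z_i, n) = z_i \cdot e^{j n \theta_i}.$$

Writing $q_{n,i}$ and $k_{m,i}$ for the complex pairs of the query at position $n$ and the key at position $m$, the rotated queries and keys yield attention logits:

$$\text{Attn}(q_n, k_m) = \Re\left[ \sum_i \text{RoPE}(q_{n,i}, n) \cdot \overline{\text{RoPE}(k_{m,i}, m)} \right] = \Re\left[ \sum_i q_{n,i}\, \overline{k_{m,i}}\, e^{j (n - m) \theta_i} \right].$$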

Because the attention depends on $n-m$, RoPE implicitly encodes relative positions. Moreover, extending to longer contexts only requires evaluating the rotation for larger $n$.
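
The relative-position property can be verified numerically for a single feature pair. The sketch below uses Python's built-in complex numbers; the helper names `rope` and `logit` and the sample values are illustrative assumptions. It relies on the identity $\Re[q e^{jn\theta}\,\overline{k e^{jm\theta}}] = \Re[q\,\overline{k}\,e^{j(n-m)\theta}]$.

```python
import cmath


def rope(z, n, theta):
    """Rotate one complex feature pair z by the angle n * theta."""
    return z * cmath.exp(1j * n * theta)


def logit(q, k, n, m, theta):
    """Contribution of one feature pair to the attention logit."""
    return (rope(q, n, theta) * rope(k, m, theta).conjugate()).real


theta = 10000.0 ** (-2 * 3 / 64)  # frequency of pair i = 3 for d = 64
q, k = complex(0.3, -1.2), complex(0.7, 0.5)

a = logit(q, k, n=5, m=2, theta=theta)      # offset n - m = 3
b = logit(q, k, n=105, m=102, theta=theta)  # same offset, shifted by 100
c = logit(q, k, n=5, m=4, theta=theta)      # different offset, n - m = 1

# Same offset gives the same logit; a different offset changes it.
print(abs(a - b) < 1e-9, abs(a - c) > 1e-6)  # prints True True
```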

### Practical Considerations
* Each rotation is norm-preserving, so query and key magnitudes are unchanged and only their relative phase encodes position.
* Supports interpolation and extrapolation by rescaling angles (e.g., NTK-aware scaling).
* Widely adopted in GPT-NeoX, LLaMA, and other open-source models.
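
To make the angle-rescaling bullet concrete, the sketch below compares two common choices on the frequency table alone: linear position interpolation, which divides every frequency by the extension ratio, and NTK-aware scaling, assumed here to enlarge the base as $\omega' = \omega \cdot s^{d/(d-2)}$. The function names and the values `dim = 64`, `scale = 4.0` are illustrative assumptions.

```python
def rope_frequencies(dim, base=10000.0):
    """theta_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def position_interpolation(dim, scale, base=10000.0):
    """Linear interpolation: equivalent to dividing every frequency by scale."""
    return [t / scale for t in rope_frequencies(dim, base)]


def ntk_aware(dim, scale, base=10000.0):
    """NTK-aware scaling: enlarge the base so mostly low frequencies stretch."""
    new_base = base * scale ** (dim / (dim - 2))
    return rope_frequencies(dim, new_base)


dim, scale = 64, 4.0
plain = rope_frequencies(dim)
pi = position_interpolation(dim, scale)
ntk = ntk_aware(dim, scale)

# Highest frequency (i = 0): left untouched by NTK-aware scaling,
# divided by `scale` under linear interpolation.
print(plain[0], pi[0], ntk[0])
# Lowest frequency (i = dim/2 - 1): both methods slow it down by `scale`.
print(plain[-1] / pi[-1], plain[-1] / ntk[-1])
```

The design difference is where the stretching lands: linear interpolation compresses every pair equally, while the base change leaves fast-rotating pairs nearly intact and concentrates the change in the slow ones.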

## 3. YaRN: Yet another RoPE extensioN

YaRN (Peng et al., 2023) refines RoPE by combining extrapolation-friendly rescaling with interpolation for shorter contexts. It blends multiple rotation scales to reduce phase distortion when sequences exceed the training window.

### Theory
YaRN introduces two scaling factors: an *interpolation* factor $\alpha$ applied within the training window and an *extrapolation* factor $\beta$ for longer contexts. For a position $n$, frequencies are scaled as

$$\tilde{\theta}_i(n) = \begin{cases}
\alpha \theta_i, & n \leq N_{\text{train}} \\
\beta \theta_i, & n > N_{\text{train}}
\end{cases}$$

and the rotations become

$$\text{YaRN}(z_i, n) = z_i \cdot e^{j n \tilde{\theta}_i(n)}.$$

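The piecewise form above is a simplification: taken literally, the phase $n\tilde{\theta}_i(n)$ jumps at $n = N_{\text{train}}$ whenever $\alpha \neq \beta$. The published method (Peng et al., 2023) avoids this by making the scale depend on the frequency index $i$ rather than on the position $n$: high-frequency pairs are left unchanged, low-frequency pairs are interpolated by the context-extension ratio, and a linear ramp blends the two regimes. It also applies a length-dependent temperature to the attention logits. These choices are intended to preserve short-context behavior while allowing evaluation on contexts well beyond the training length.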

### Practical Considerations
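* Choose $(\alpha, \beta)$ empirically to preserve the attention patterns seen in training. A scale below 1 compresses phases toward the trained range (interpolation); a scale above 1 moves them further outside it, so values such as $\beta > 1$ should be validated carefully.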
* Implementation can share the same kernel as RoPE with position-dependent scaling coefficients.
* Reported to extend LLaMA-2 from 4k to 128k tokens when combined with additional fine-tuning at the longer length (Peng et al., 2023).
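
The piecewise formula in the Theory subsection scales by position. For comparison, the sketch below follows our reading of the frequency-dependent ("NTK-by-parts") rule in Peng et al. (2023), using only the standard library. Note that `alpha` and `beta` here are ramp thresholds on the number of full rotations a pair completes within the training length; they are not the scale factors $\alpha, \beta$ of the simplified formula. The function names and the defaults `alpha=1.0`, `beta=32.0` are assumptions for illustration.

```python
import math


def yarn_frequencies(dim, scale, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """Per-pair frequencies under an NTK-by-parts style ramp (sketch)."""
    freqs = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)
        # Number of full rotations this pair completes over the training window.
        rotations = train_len * theta / (2.0 * math.pi)
        # Linear ramp clipped to [0, 1].
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        # gamma = 1 keeps theta (fast pairs); gamma = 0 interpolates by 1/scale.
        freqs.append((1.0 - gamma) * theta / scale + gamma * theta)
    return freqs


def attention_scale(scale):
    """Assumed length-dependent factor 0.1 * ln(scale) + 1 applied to q and k."""
    return 0.1 * math.log(scale) + 1.0


freqs = yarn_frequencies(dim=64, scale=4.0, train_len=4096)
print(freqs[0], freqs[-1], attention_scale(4.0))
```

Since the scale depends only on the pair index, every position uses the same frequency table and the relative-position property of RoPE is retained.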

## Minimal Implementations

The following code snippets illustrate the mechanics of each approach using PyTorch. These are reference implementations for experimentation; production systems should use fused kernels for efficiency.

In [None]:
import torch

def apply_nope_bias(attn_scores, bias):
    """Add a relative bias matrix (NoPE-style).
    attn_scores: (batch, heads, seq, seq)
    bias: (seq, seq) where bias[i, j] depends on i - j
    """
    return attn_scores + bias

seq_len = 8
distance = torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]
bias = -torch.abs(distance).float()  # simple ALiBi-style slope
scores = torch.zeros(1, 1, seq_len, seq_len)
nope_scores = apply_nope_bias(scores, bias)
nope_scores[0, 0, :3, :3]

In [None]:
def rotary_angles(dim, base=10000.0):
    """Per-pair frequencies theta_i = base^(-2i/dim), i = 0 .. dim/2 - 1."""
    return base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.
    x: (..., seq, dim) with even dim
    positions: (seq,)
    """
    dim = x.size(-1)
    theta = rotary_angles(dim, base=base).to(x.device)
    # One angle per (position, feature pair): shape (seq, dim // 2).
    angles = positions.to(x.device).float()[:, None] * theta[None, :]
    cos, sin = angles.cos(), angles.sin()
    x_pairs = x.reshape(*x.shape[:-1], dim // 2, 2)
    x1, x2 = x_pairs[..., 0], x_pairs[..., 1]
    # 2-D rotation of each (x1, x2) pair, then flatten back to (..., seq, dim).
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.reshape(x.shape)

q = torch.randn(1, 4, 64)  # (batch, seq, head_dim)
pos = torch.arange(q.size(1))
rope_q = apply_rope(q, pos)
rope_q.shape

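In [None]:
def apply_yarn(x, positions, base=10000.0, alpha=0.8, beta=1.2, train_len=2048):
    """Simplified position-dependent scaling from the Theory subsection.
    x: (..., seq, dim) with even dim
    positions: (seq,)
    Note: this follows the piecewise formula, not the full published method.
    """
    dim = x.size(-1)
    # One frequency per feature pair: theta_i = base^(-2i/dim).
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32, device=x.device) / dim)
    pos = positions.to(x.device).float()
    # alpha inside the training window, beta beyond it.
    scale = torch.where(pos <= train_len, torch.full_like(pos, alpha), torch.full_like(pos, beta))
    angles = (pos * scale)[:, None] * theta[None, :]  # (seq, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x_pairs = x.reshape(*x.shape[:-1], dim // 2, 2)
    x1, x2 = x_pairs[..., 0], x_pairs[..., 1]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.reshape(x.shape)

long_positions = torch.arange(4096)
x = torch.randn(4096, 64)  # (seq, head_dim)
yarn_x = apply_yarn(x, long_positions)
yarn_x.shape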

## Comparison and Best Practices

| Method | Core Idea | Strengths | Weaknesses | Typical Use Cases |
| --- | --- | --- | --- | --- |
| NoPE | Remove explicit positional encodings; rely on biases | Simplicity, compatibility with ALiBi | Requires careful bias design; implicit structure | Autoregressive decoders, efficient inference |
| RoPE | Complex rotations encoding relative positions | Smooth extrapolation, widely adopted | Requires even-dimensional head sizes | General-purpose LLMs (GPT-NeoX, LLaMA) |
| YaRN | Scale RoPE frequencies for interpolation + extrapolation | Extends context without retraining from scratch | Additional hyperparameters, modest compute overhead | Long-context fine-tuning of pretrained RoPE models |

**Implementation tips:**
* Align head dimensions to multiples of 2 for RoPE/YaRN.
* When extending context windows, adjust attention masking and KV-cache sizes accordingly.
* Validate extrapolation empirically using synthetic tasks (e.g., copy or needle-in-a-haystack tests).

## References

* Press, O., Smith, N. A., & Lewis, M. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." *arXiv:2108.12409*.
* Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." *arXiv:2104.09864*.
* Peng, B., et al. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." *arXiv:2309.00071*.
* Vaswani, A., et al. (2017). "Attention Is All You Need." *NeurIPS*.