```{contents}
```
## Context Window and Sliding Window

### Context Window

A **context window** is the *maximum number of tokens* an LLM or Transformer model can attend to **at once**. It covers both the prompt and the tokens generated so far.

Examples:

* GPT-4 (original release): ~8k tokens
* GPT-4-Turbo: 128k tokens
* Claude 3 Opus: 200k tokens
* Some newer long-context models: 1 million tokens or more

#### Why does a context window exist?

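Because the Transformer architecture relies on self-attention, in which every token attends to every other token:

* **Self-attention → O(n²) compute and, in a naive implementation, O(n²) memory** for a sequence of n tokens
* Doubling the sequence length therefore roughly quadruples the attention cost

Thus, models cannot take unlimited text.
They read only **a limited chunk** at a time → the context window.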
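
To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes a naive implementation that stores one full n × n score matrix per attention head in 2-byte (fp16) precision. The function name `attention_scores_bytes` and the 32-head default are illustrative assumptions, not properties of any particular model. Optimized kernels avoid materializing this matrix, although the compute still grows quadratically.

```python
def attention_scores_bytes(n_tokens: int, n_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold the n x n attention scores of one layer (naive implementation)."""
    return n_tokens * n_tokens * n_heads * bytes_per_value


# Doubling the sequence length quadruples the memory for the score matrix.
for n in (1_000, 2_000, 4_000, 8_000):
    gib = attention_scores_bytes(n) / 2**30
    print(f"{n:>6} tokens -> {gib:8.2f} GiB per layer")
```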

---

#### Why Context Windows Matter

The context window determines:

* **How much text the model can understand at once**
* **How much previous conversation is remembered**
* **How long a document can be processed in a single pass**

Example:
If a model has a 4,096-token window:

```
Input > 4096 tokens → older tokens get truncated
```

This means the model:

* Cannot remember conversation turns that have fallen outside the window
* Cannot process long PDFs directly
* Cannot operate on extremely long code files at once

This is why you sometimes see “model forgot earlier lines” → context overflow.

---

#### What Happens When the Context Window Is Exceeded?

Two scenarios:

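##### A. **Truncation (common in chat applications)**

The oldest tokens are dropped first so that the most recent ones still fit. Many APIs instead reject an over-long request with an error and leave the truncation to the caller.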

```
[ Too old ] [ middle ] [ latest prompt ]
      X         ✓            ✓
```
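
The oldest-first truncation pictured above can be sketched in a few lines. This is a minimal illustration that treats the history as a flat list of token IDs; `truncate_to_window` is a hypothetical helper, not an API of any library.

```python
def truncate_to_window(tokens: list, max_tokens: int) -> list:
    """Keep only the most recent max_tokens tokens, dropping the oldest first."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]


history = list(range(10))                # stand-in for token IDs 0..9
print(truncate_to_window(history, 4))    # [6, 7, 8, 9]
```

Real chat applications often refine this, for example by always keeping the system prompt and truncating whole messages rather than individual tokens.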

##### B. **Window Reset**

Once enough history has been dropped, the model may respond as if the conversation had just started, because the earlier turns are no longer part of its input.

---

### Sliding Window (Intuition)

A **sliding window** is a technique to process long documents that exceed the context window **by splitting text into overlapping chunks**.

Think of it as reading a long book using a limited-size magnifying glass.
You read section by section, moving the magnifying glass forward.

Example with window size = 100 tokens, step size (stride) = 50:

```
Chunk 1: tokens 0–99
Chunk 2: tokens 50–149
Chunk 3: tokens 100–199
```

Each chunk:

* Shares some overlap (to preserve continuity)
* Is processed independently or sequentially
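
A minimal sketch of this chunking, assuming the text is already tokenized into a list. The helper name `sliding_window_chunks` is illustrative; tokenizer libraries usually offer their own overlap options.

```python
def sliding_window_chunks(tokens: list, window_size: int, stride: int) -> list:
    """Split tokens into overlapping chunks of at most window_size, advancing by stride."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this chunk already reaches the end of the text
        start += stride
    return chunks


tokens = list(range(200))  # stand-in for 200 token IDs
for chunk in sliding_window_chunks(tokens, window_size=100, stride=50):
    print(f"tokens {chunk[0]}-{chunk[-1]}")
```

The overlap between consecutive chunks is `window_size - stride`. A stride larger than the window size would leave gaps, so it is normally kept at or below the window size.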

Sliding windows help with:

* Long document summarization
* Large codebase analysis
* Long context retrieval

---

#### Why Sliding Windows Are Important

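Transformers cannot natively handle sequences longer than their context window, but sliding windows allow:

##### **Scalable document processing**

Summarizing a 300k-token PDF with a model that has only a 50k-token window.

##### **Retrieval augmentation**

Splitting documents into chunks so that the relevant ones can be retrieved in RAG pipelines.

##### **Local attention in LLMs**

The same idea also appears inside the model. Several efficient Transformers restrict attention to a sliding window of nearby tokens:

* Longformer (local windows plus a few global tokens)
* BigBird (local windows plus random and global tokens)
* Mistral 7B (sliding-window attention)

Attention kernels such as FlashAttention-2 also support a windowed (local) mode.

In these models most tokens attend only to nearby tokens, not the entire sequence, which reduces attention cost from O(n²) to roughly O(n · w) for a window of w tokens.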
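
As a rough illustration of local attention, the sketch below builds a causal sliding-window mask of the kind used in decoder-only models. It is a simplification: the function name is hypothetical, and encoder models such as Longformer and BigBird use bidirectional windows combined with extra global (and, for BigBird, random) connections.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Each position sees itself and the (window - 1) positions before it.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in sliding_window_causal_mask(seq_len=6, window=3):
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

Each row has at most `window` allowed positions, so the number of attended pairs grows linearly with sequence length instead of quadratically.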

---

### Concrete Example of Sliding Window

Say you want to process a 1000-token text, but the model only supports 200-token windows.

Let:

* window size = 200
* stride = 100

Chunks:

```
0–199
100–299
200–399
300–499
...
800–999
```

Each window is fed to the model separately, e.g. for:

* Embedding
* Summarization
* Classification

Then the outputs are combined, for example by averaging the per-chunk embeddings or by summarizing the per-chunk summaries.
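
The chunk, process, combine flow for this example might look like the sketch below. The names `window_starts` and `process_long_text` are hypothetical, and the "model" and the combiner are toy stand-ins (a mean over token IDs) chosen only so that the snippet runs without any external service.

```python
from typing import Callable, List, Sequence


def window_starts(n_tokens: int, window_size: int, stride: int) -> List[int]:
    """Start offsets of each window; the last window reaches the end of the text."""
    if n_tokens <= 0:
        return []
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + window_size >= n_tokens:
            break
        start += stride
    return starts


def process_long_text(
    tokens: Sequence[int],
    window_size: int,
    stride: int,
    process_chunk: Callable[[Sequence[int]], float],
    combine: Callable[[List[float]], float],
) -> float:
    """Run process_chunk on every window separately, then merge the outputs."""
    outputs = [
        process_chunk(tokens[s:s + window_size])
        for s in window_starts(len(tokens), window_size, stride)
    ]
    return combine(outputs)


tokens = list(range(1000))  # stand-in for a 1000-token text
print(window_starts(len(tokens), 200, 100))
# [0, 100, 200, 300, 400, 500, 600, 700, 800]

result = process_long_text(
    tokens,
    window_size=200,
    stride=100,
    process_chunk=lambda chunk: sum(chunk) / len(chunk),  # toy "model" call
    combine=lambda outs: sum(outs) / len(outs),           # toy combiner
)
print(result)  # 499.5
```

In practice `process_chunk` would call the model (embedding, summarization, or classification), and `combine` would be chosen to match the task.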

---

### Sliding Window vs Context Window

| Concept            | Meaning                                              | Purpose                                    |
| ------------------ | ---------------------------------------------------- | ------------------------------------------ |
| **Context Window** | Max tokens model can see at once                     | Model capability limit                     |
| **Sliding Window** | Technique to break long text into overlapping chunks | Process text longer than the model’s limit |

---

### Visual Summary

#### Context window:

```
[ A fixed-size box around text the model can see ]
```

#### Sliding window:

```
[Chunk1: 0–99]
          [Chunk2: 50–149]
                    [Chunk3: 100–199]
```