```{contents}
```
## Autoregressive Models

An **autoregressive (AR) model** generates data **one step at a time**, where **each output depends on previous outputs**.

In Generative AI, this means:

> **Predict the next token given all previous tokens.**

---

### Core Intuition

Think of writing a sentence:

You don’t write the entire paragraph at once.
You choose each next word based on what you’ve already written.

That is exactly how autoregressive models work.

---

### Mathematical Formulation

For a sequence $(x_1, x_2, \dots, x_T)$:

$$
P(x) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
$$

This is the chain rule of probability, so the factorization is exact. Each token is generated conditioned on all the tokens before it.
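As a small worked example, the product of conditionals can be computed as a sum of log-probabilities. The sketch below is illustrative only: the `COND` table and its numbers are made up, and it conditions on just the previous token (a bigram simplification) rather than the full prefix.

```python
import math

# Hypothetical toy model: P(next token | previous token).
# "<s>" marks the start of the sequence. Conditioning only on the last
# token is a simplification of conditioning on the whole prefix.
COND = {
    "<s>": {"the": 0.8, "cat": 0.1, "sleeps": 0.1},
    "the": {"the": 0.1, "cat": 0.7, "sleeps": 0.2},
    "cat": {"the": 0.1, "cat": 0.1, "sleeps": 0.8},
    "sleeps": {"the": 0.4, "cat": 0.3, "sleeps": 0.3},
}

def sequence_log_prob(tokens):
    """Sum the conditional log-probabilities (chain rule in log space)."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(COND[prev][tok])
        prev = tok
    return log_p

log_p = sequence_log_prob(["the", "cat", "sleeps"])
prob = math.exp(log_p)  # equals 0.8 * 0.7 * 0.8
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.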

---

### Architecture

Modern autoregressive models are usually **Transformer decoder-only models**:

```
x₁ → x₂ → x₃ → ... → xₜ → Next Token
        ↑
   Masked Self-Attention
```

**Masked (causal) self-attention** ensures that each position attends only to itself and earlier positions, so the model cannot see future tokens.
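The masking idea can be sketched in a few lines of plain Python. This is not a full attention layer: the helper names `causal_mask` and `masked_softmax` are assumptions for illustration, and the raw scores are set to zero so that only the effect of the mask is visible.

```python
import math

def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions get zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        # Masked entries behave like -inf scores: they contribute 0 after exp.
        m = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

T = 4
scores = [[0.0] * T for _ in range(T)]  # uniform raw scores, for illustration
weights = masked_softmax(scores, causal_mask(T))
# Row i spreads its weight evenly over positions 0..i and gives 0 to the rest.
```

With uniform scores, the first row puts all its weight on position 0, while the last row spreads weight evenly over every position.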

---

### Training Process

The model is trained by next-token prediction:

* Input: "The cat is"
* Target: "sleeping"

At each step the model outputs a probability distribution over the vocabulary, and training increases the probability it assigns to the token that actually comes next.
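A minimal sketch of how one sentence becomes several training examples, assuming a made-up four-word vocabulary and a made-up predicted distribution. The targets are simply the inputs shifted by one position, and the loss shown is the standard cross-entropy for a single step.

```python
import math

def next_token_pairs(token_ids):
    """Return (current token, next token) pairs: targets are inputs shifted by one.

    When predicting each target, the model sees all tokens up to and
    including the current one.
    """
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return list(zip(inputs, targets))

def cross_entropy(probs, target_id):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(probs[target_id])

# Hypothetical vocabulary, for illustration only.
vocab = {"The": 0, "cat": 1, "is": 2, "sleeping": 3}
ids = [vocab[w] for w in ["The", "cat", "is", "sleeping"]]
pairs = next_token_pairs(ids)  # one training pair per position

# A made-up predicted distribution after the prefix "The cat is".
predicted = [0.05, 0.05, 0.10, 0.80]
loss = cross_entropy(predicted, vocab["sleeping"])  # -log(0.8)
```

The loss shrinks toward zero as the probability assigned to the correct token approaches 1.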

---

### Generation Process

1. Start with a prompt
2. Predict the next token
3. Append the token to the sequence
4. Repeat until an end-of-sequence token or a length limit is reached
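The steps above can be sketched as a short loop. Everything here is a toy: `TOY_MODEL` is a hypothetical stand-in for a trained network, it looks only at the last token, and it uses greedy decoding (always picking the most likely token), which is one of several possible choices; sampling from the distribution is a common alternative.

```python
# Hypothetical stand-in for a trained model: maps the last token to a
# distribution over next tokens. A real model conditions on the whole prefix.
TOY_MODEL = {
    "The": {"cat": 0.9, "is": 0.1},
    "cat": {"is": 0.8, "<eos>": 0.2},
    "is": {"sleeping": 0.7, "<eos>": 0.3},
    "sleeping": {"<eos>": 1.0},
}

def next_token_distribution(tokens):
    return TOY_MODEL[tokens[-1]]

def generate(prompt, max_new_tokens=10, eos="<eos>"):
    tokens = list(prompt)                        # 1. start with the prompt
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)   # 2. predict the next token
        token = max(dist, key=dist.get)          #    greedy: most likely token
        if token == eos:                         # 4. stop at end-of-sequence
            break
        tokens.append(token)                     # 3. append, then repeat
    return tokens

result = generate(["The"])
```

Each iteration needs the output of the previous one, which is why generation is inherently sequential.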

---

### Why Autoregressive Models Are Powerful

| Advantage                 | Explanation                                          |
| ------------------------- | ---------------------------------------------------- |
| General-purpose           | Works for any data that can be written as a sequence |
| Flexible                  | Text, code, audio, images                            |
| Scales well               | Large models show emergent abilities                 |
| Simple training objective | Next-token prediction                                |

---

### Applications

| Domain | Application                            |
| ------ | -------------------------------------- |
| Text   | Chatbots, translation, summarization   |
| Code   | Code generation, review, debugging     |
| Speech | Text-to-speech synthesis               |
| Music  | Melody generation                      |
| Images | PixelCNN, autoregressive vision models |
| Video  | Frame-by-frame generation              |

---

### Strengths & Limitations

| Strengths                 | Limitations                                     |
| ------------------------- | ----------------------------------------------- |
| Very high-quality outputs | Slow generation (one token per step)            |
| Stable training           | Costly for long sequences (quadratic attention) |
| General-purpose           | Limited context window                          |

---

### Autoregressive vs Non-Autoregressive

| Feature    | Autoregressive | Non-Autoregressive |
| ---------- | -------------- | ------------------ |
| Generation | Sequential     | Parallel           |
| Quality    | Usually higher | Often lower        |
| Latency    | Higher         | Lower              |

---

### Why Modern LLMs Are Autoregressive

LLMs like GPT, Claude, and LLaMA are all **autoregressive**, because this framework scales extremely well and supports:

* Natural language
* Reasoning
* Tool use
* Multi-task learning