```{contents}
```
## Decoder-Only Models (Autoregressive Architecture)

A **decoder-only model** is a neural network architecture that generates text **one token at a time**, conditioning each new token only on the tokens that precede it (the prompt plus everything generated so far).
This is the architecture behind modern chat models such as **GPT**, **Claude**, **LLaMA**, **Mistral**, and **Gemini**.

---

### High-Level Idea

**Previous tokens → Decoder → Next token → Append → Repeat**

The model predicts the **next token** given all previous tokens.

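The idea above can be written as a factorization of the sequence probability. The notation is introduced here for illustration: $x_1, \dots, x_T$ denotes a token sequence of length $T$, and each factor is the distribution the model predicts at one step.

```{math}
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```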
---

### Core Structure

A decoder-only model consists of stacked **Transformer decoder blocks**.

Each block contains:

1. **Masked self-attention**
   Prevents the model from seeing future tokens.
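2. **Feedforward network**
   Applied to each token position independently.
3. **Layer normalization**
   Applied around each sublayer to stabilize training.
4. **Residual connections**
   Add each sublayer's input to its output.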

No encoder. No cross-attention.

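A minimal sketch of the masking step, assuming plain Python lists in place of tensors and made-up attention scores. The helper names `causal_mask` and `masked_softmax` are illustrative and do not come from any library. Position `i` may attend only to positions `j <= i`, and the softmax is taken over those allowed positions, so future tokens always receive zero weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over allowed positions; masked positions get weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        peak = max(s for s, m in zip(row_scores, row_mask) if m)
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores for a 4-token sequence (all equal here,
# so each row spreads its weight evenly over the positions it can see).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))

for row in weights:
    print(row)
```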
---

### Autoregressive Generation

At each step:

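1. The model reads the full context so far
2. Computes a probability distribution over the vocabulary for the next token
3. Selects the next token from that distribution (by sampling or by taking the most likely token)
4. Appends it to the context
5. Repeats until an end-of-sequence token is produced or a length limit is reached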

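The loop above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table, `next_token_probs` looks only at the last token (a real decoder conditions on the whole context), and `<bos>` / `<eos>` are assumed start and end markers.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token.
TOY_MODEL = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}

def next_token_probs(context):
    """Stand-in for a forward pass; a real model would use the full context."""
    return TOY_MODEL[context[-1]]

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)            # steps 1-2: read context, get distribution
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]  # step 3: sample
        context.append(token)                        # step 4: append
        if token == "<eos>":                         # step 5: stop condition
            break
    return context

print(generate(["<bos>"]))
```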
---

### Why Decoder-Only Models Dominate GenAI

| Advantage             | Explanation                               |
| --------------------- | ----------------------------------------- |
| General-purpose       | Can do almost any language task           |
| Simple interface      | Single sequence input                     |
| Scales extremely well | Parallelizable training                   |
| Flexible prompting    | Instructions, memory, tools in one stream |
| Powerful reasoning    | Emergent abilities with scale             |

---

### Training Process

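* Pretrained on large text corpora with next-token prediction
* Instruction-tuned on examples of instructions paired with desired responses
* Aligned using RLHF (reinforcement learning from human feedback) or similar methods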

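For the pretraining step, a rough sketch of the next-token objective, assuming a three-token sequence and invented model outputs. The function names and the probabilities in `predicted` are illustrative only. The inputs are the sequence without its last token, the targets are the same sequence shifted left by one, and the loss is the mean negative log-probability the model assigns to each correct next token.

```python
import math

def shift_for_next_token(tokens):
    """Inputs are tokens[:-1]; the target at each position is the following token."""
    return tokens[:-1], tokens[1:]

def cross_entropy(predicted_probs, targets):
    """Mean negative log-probability assigned to the correct next tokens."""
    total = sum(math.log(probs[target])
                for probs, target in zip(predicted_probs, targets))
    return -total / len(targets)

tokens = ["the", "cat", "sat"]
inputs, targets = shift_for_next_token(tokens)

# Hypothetical model outputs: one next-token distribution per input position.
predicted = [
    {"cat": 0.5, "dog": 0.5},    # after "the"
    {"sat": 0.25, "ran": 0.75},  # after "the cat"
]

loss = cross_entropy(predicted, targets)
print(inputs, targets, loss)
```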
---

### Key Characteristics

| Feature            | Decoder-Only                 |
| ------------------ | ---------------------------- |
| Architecture       | Single stack                 |
| Context handling   | Entire history in one stream |
| Task specification | Via prompt                   |
| Few-shot learning  | Native                       |
| Cross-attention    | None                         |

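To illustrate task specification via the prompt, here is a small sketch of how few-shot examples and a new query can be packed into a single input sequence. The `Input:` / `Output:` layout and the helper name `build_few_shot_prompt` are assumptions chosen for illustration, not a required format.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate worked examples and the new query into one text stream."""
    parts = [f"Input: {source}\nOutput: {target}" for source, target in examples]
    # The final entry leaves the output blank for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Hypothetical English-to-French examples followed by the query to complete.
prompt = build_few_shot_prompt(
    [("cheese", "fromage"), ("bread", "pain")],
    "water",
)
print(prompt)
```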
---

### Limitations

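* Inefficient for very long inputs: standard self-attention cost grows quadratically with sequence length
* No explicit separation of input and output: the prompt and the response share one token stream
* Context window is finite: tokens beyond the limit cannot be attended to
* Inference cost and memory grow with context size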

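A back-of-the-envelope sketch of why long contexts are expensive, assuming standard dense causal attention (no sparse or linear-attention variants). The function name is illustrative. It counts the query-key pairs scored per head per layer when a whole sequence is processed at once; doubling the context roughly quadruples that count.

```python
def attention_score_count(seq_len):
    """Query-key pairs left visible by a causal mask: 1 + 2 + ... + seq_len."""
    return seq_len * (seq_len + 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
```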
---

### Examples

* GPT-3 / GPT-4 / GPT-4o
* Claude
* LLaMA
* Mistral
* Falcon
* Gemma

---

### Encoder–Decoder vs Decoder-Only

| Aspect              | Encoder–Decoder | Decoder-Only |
| ------------------- | --------------- | ------------ |
| Task specialization | High            | Universal    |
| Prompt simplicity   | Moderate        | Very high    |
| Multi-task learning | Limited         | Excellent    |
| System complexity   | Higher          | Lower        |

