```{contents}
```
## Model Routing

### 1. Definition

**Model Routing** is the system-level mechanism that dynamically selects **which model (or combination of models)** should handle a given input request based on task characteristics, constraints, and objectives such as cost, latency, accuracy, and reliability.

It acts as the **control plane** of multi-model AI systems.

> **Input → Router → Selected Model(s) → Output**

---

### 2. Why Model Routing Is Needed

Modern AI stacks rarely rely on a single model.

Different models sit at different points along trade-offs such as:

* reasoning vs. generation
* long-context vs. low-latency
* code vs. vision vs. speech
* cheap inference vs. high accuracy

**Model routing solves:**

| Problem     | Without Routing         | With Routing           |
| ----------- | ----------------------- | ---------------------- |
| Cost        | Always expensive        | Cheapest acceptable    |
| Latency     | Always slow             | Fast when possible     |
| Accuracy    | Over- or under-powered  | Task-appropriate       |
| Scalability | Monolithic bottleneck   | Distributed load       |
| Reliability | Single point of failure | Fallbacks & redundancy |
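
The reliability row can be illustrated with a fallback chain. The following is a minimal sketch rather than a production pattern: `call_with_fallback`, `flaky_primary`, and `stable_backup` are hypothetical names introduced here, and a real system would likely add timeouts, retries, and logging.

```python
from typing import Callable, Sequence


def call_with_fallback(prompt: str, models: Sequence[Callable[[str], str]]) -> str:
    """Try each model in order and return the first successful response."""
    last_error = None
    for model in models:
        try:
            return model(prompt)
        except Exception as exc:  # broad catch, for illustration only
            last_error = exc
    raise RuntimeError("all models failed") from last_error


# Hypothetical stand-ins for real model clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary model unavailable")


def stable_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


result = call_with_fallback("hello", [flaky_primary, stable_backup])
print(result)  # backup answer to: hello
```

The order of `models` encodes the preference: the primary is tried first, and the backup is only called when the primary raises.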

---

### 3. Core Routing Dimensions

Routing decisions are typically based on:

| Signal                  | Description                             |
| ----------------------- | --------------------------------------- |
| **Task type**           | QA, coding, chat, vision, summarization |
| **Complexity**          | Shallow vs deep reasoning               |
| **Context length**      | Short vs long documents                 |
| **Quality requirement** | Draft vs production-grade               |
| **Latency budget**      | Real-time vs offline                    |
| **Cost budget**         | Cheap vs premium                        |
| **User tier**           | Free vs enterprise                      |
| **System load**         | Dynamic traffic balancing               |
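
One way to make these signals concrete is to collect them in a single record that the router consumes. This is only a sketch: the `RoutingSignals` class, its field names, and its units are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingSignals:
    """Hypothetical container for the routing signals listed in the table."""
    task_type: str              # e.g. "qa", "coding", "chat", "vision", "summarization"
    needs_deep_reasoning: bool  # complexity: shallow vs deep reasoning
    context_tokens: int         # context length
    production_grade: bool      # quality requirement: draft vs production-grade
    latency_budget_ms: int      # latency budget
    max_cost_usd: float         # cost budget per request
    user_tier: str              # "free" or "enterprise"
    system_load: float          # 0.0 (idle) to 1.0 (saturated)


signals = RoutingSignals(
    task_type="summarization",
    needs_deep_reasoning=False,
    context_tokens=12_000,
    production_grade=True,
    latency_budget_ms=2_000,
    max_cost_usd=0.05,
    user_tier="free",
    system_load=0.4,
)
print(signals.task_type, signals.context_tokens)
```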

---

### 4. Main Types of Model Routing

### A. Rule-Based Routing

Hand-written, deterministic rules map request features to a model.

```
if tokens < 1k and no code:
    use small_model
elif code or reasoning:
    use large_model
```

**Pros:** predictable, simple

**Cons:** brittle, hard to scale as rules and models multiply

---

### B. Classifier-Based Routing

A learned classifier predicts which candidate model is best suited to the input.

```
input → router_model → model_id
```

Features may include:

* embedding of input
* length, language, domain
* historical performance

**Pros:** adaptive, data-driven

**Cons:** requires training data (examples of which model performed best on which inputs)

---

### C. Multi-Objective Optimization Routing

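Selects the model that minimizes a weighted cost function:

$$
m^* = \arg\min_m \big( \alpha \cdot \text{latency}_m + \beta \cdot \text{cost}_m - \gamma \cdot \text{quality}_m \big)
$$

Here $\alpha$, $\beta$, and $\gamma$ are non-negative weights that set the trade-off between latency, cost, and quality. Formulations of this kind are used in production AI platforms.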
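
The objective above can be evaluated directly over a small candidate set. The sketch below assumes made-up numbers: the `Candidate` class, the model names, and the latency, cost, and quality values are all hypothetical, and in practice the three terms would usually be normalized to comparable scales before weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_s: float  # expected latency in seconds
    cost_usd: float   # expected cost per request
    quality: float    # offline quality score in [0, 1]


def select_model(candidates, alpha, beta, gamma):
    """Return the candidate minimizing alpha*latency + beta*cost - gamma*quality."""
    def score(m):
        return alpha * m.latency_s + beta * m.cost_usd - gamma * m.quality
    return min(candidates, key=score)


# Illustrative values only.
CANDIDATES = [
    Candidate("small", latency_s=0.2, cost_usd=0.001, quality=0.70),
    Candidate("medium", latency_s=0.8, cost_usd=0.010, quality=0.85),
    Candidate("large", latency_s=2.5, cost_usd=0.060, quality=0.95),
]

latency_sensitive = select_model(CANDIDATES, alpha=1.0, beta=1.0, gamma=1.0)
quality_sensitive = select_model(CANDIDATES, alpha=0.1, beta=1.0, gamma=5.0)
print(latency_sensitive.name)  # small
print(quality_sensitive.name)  # large
```

Raising the quality weight relative to the latency weight moves the choice from the small model to the large one, which is the behavior the weights are meant to control.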

---

### D. Cascaded Routing (Progressive Inference)

Try a cheap model first, then escalate to a larger model if its confidence is low.

```
small_model → confidence < τ → large_model
```

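**Cascades can cut cost substantially in practice**, because only the requests the cheap model is unsure about reach the expensive one. The savings depend on how often requests escalate.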
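
A quick back-of-the-envelope calculation shows where the savings come from. In a cascade, every request pays for the small model, and only the escalated fraction also pays for the large one. The function name and the numbers below (a 20x cost ratio and a 25% escalation rate) are illustrative assumptions, not measurements.

```python
def expected_cascade_cost(cost_small: float, cost_large: float, escalation_rate: float) -> float:
    """Expected per-request cost: always pay small, pay large only on escalation."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return cost_small + escalation_rate * cost_large


# Hypothetical relative costs: the large model is 20x the small one.
cascade = expected_cascade_cost(cost_small=1.0, cost_large=20.0, escalation_rate=0.25)
always_large = 20.0
print(cascade)                            # 6.0
print(f"{1 - cascade / always_large:.0%}")  # 70%
```

The cascade is only cheaper than always calling the large model while the escalation rate stays below `1 - cost_small / cost_large`; above that, the extra small-model call is wasted.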

---

### 5. End-to-End Routing Workflow

```
User Request
     ↓
Feature Extraction
     ↓
Routing Decision Engine
     ↓
Model Selection
     ↓
Model Inference
     ↓
Post-Processing
     ↓
Final Output
```
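
The stages of this workflow can be sketched as one small function per step. Everything here is a placeholder chosen for illustration: the feature set, the decision rule, and the `MODELS` registry of lambda stand-ins are assumptions, not a reference implementation.

```python
from typing import Callable, Dict


def extract_features(request: str) -> dict:
    """Feature extraction: cheap signals computed from the raw request."""
    return {
        "num_words": len(request.split()),
        "mentions_code": "code" in request.lower(),
    }


def decide(features: dict) -> str:
    """Routing decision engine: map features to a model id."""
    if features["mentions_code"] or features["num_words"] >= 100:
        return "large_model"
    return "small_model"


# Model selection: hypothetical registry mapping ids to callables.
MODELS: Dict[str, Callable[[str], str]] = {
    "small_model": lambda prompt: f"  [small] {prompt}  ",
    "large_model": lambda prompt: f"  [large] {prompt}  ",
}


def post_process(raw: str) -> str:
    """Post-processing: here, just trim surrounding whitespace."""
    return raw.strip()


def handle(request: str) -> str:
    features = extract_features(request)
    model = MODELS[decide(features)]
    raw = model(request)          # model inference
    return post_process(raw)      # final output


print(handle("What is the capital of France?"))  # [small] What is the capital of France?
print(handle("Review this code snippet"))        # [large] Review this code snippet
```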

---

### 6. Practical Demonstration

### Simple Rule-Based Router

```python
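def route(prompt: str) -> str:
    """Pick a model name from prompt length and a crude code check."""
    # Whitespace word count is only a rough proxy for the token count.
    tokens = len(prompt.split())

    if tokens < 100 and "code" not in prompt.lower():
        return "fast_model"
    if tokens < 1000:
        return "balanced_model"
    return "large_context_model"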
```

---

### Cascaded Router with Confidence Check

```python
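def answer(prompt, threshold=0.8):
    """Answer with the small model; escalate when its confidence is below the threshold."""
    # Placeholders: small_model is assumed to return (output, confidence),
    # large_model to return the output directly.
    out, confidence = small_model(prompt)

    if confidence >= threshold:
        return out
    # Confidence below the threshold (τ): escalate to the large model.
    return large_model(prompt)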
```

---

### Classifier-Based Router (Sketch)

```python
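import torch

class Router(torch.nn.Module):
    """Linear classifier over a prompt embedding, one output per candidate model."""

    def __init__(self, embed_dim=768, num_models=3):
        super().__init__()
        self.fc = torch.nn.Linear(embed_dim, num_models)

    def forward(self, embedding):
        # Probability that each candidate model is the best choice
        return torch.softmax(self.fc(embedding), dim=-1)

# Inference. `embed` (prompt -> tensor of shape [768]), `models` (a list of
# 3 candidate models), and `prompt` are placeholders defined elsewhere.
router = Router().eval()  # in practice, load trained weights first
with torch.no_grad():
    model_probs = router(embed(prompt))
chosen_model = models[torch.argmax(model_probs).item()]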
```

---

### 7. Advanced: Multi-Model Composition

Routing does not always pick **one** model:

| Pattern                  | Description                           |
| ------------------------ | ------------------------------------- |
| **Ensemble**             | Combine multiple model outputs        |
| **Tool routing**         | LLM → tool → specialist model         |
| **Hierarchical routing** | Global router → domain router → model |
| **Speculative routing**  | Run fast & slow models in parallel    |

---

### 8. Industrial Use Cases

| System                | Routing Strategy              |
| --------------------- | ----------------------------- |
| Search engines        | Query complexity routing      |
| Customer support bots | Tier-based routing            |
| Autonomous agents     | Tool & skill routing          |
| Code assistants       | Language + difficulty routing |
| Multimodal AI         | Modality-based routing        |

---

### 9. Summary Table

| Aspect     | Description                                |
| ---------- | ------------------------------------------ |
| Purpose    | Optimal model selection                    |
| Inputs     | Task, cost, latency, quality               |
| Output     | Model or model pipeline                    |
| Techniques | Rules, classifiers, optimization, cascades |
| Benefits   | Cost ↓, speed ↑, accuracy ↑, reliability ↑ |

---

### 10. Key Insight

> **Model routing is the operating system of modern AI platforms.**

Without it, multi-model systems are inefficient, expensive, and unscalable.
