```{contents}
```
## Model Serving

Model **serving** is the engineering discipline of making trained GenAI models reliably available for real-world use: low-latency inference, scalable traffic handling, observability, safety controls, and continuous updates.

---

### 1. Why Model Serving Matters

Training builds intelligence.
**Serving delivers intelligence.**

| Challenge    | Why it matters                                 |
| ------------ | ---------------------------------------------- |
| Latency      | Users expect responses in milliseconds–seconds |
| Scalability  | Traffic is bursty and unpredictable            |
| Cost control | GPUs are expensive                             |
| Reliability  | Production failures destroy trust              |
| Versioning   | Models evolve continuously                     |
| Safety       | Output must be filtered and governed           |

---

### 2. Core Serving Pipeline

```
User → API Gateway → Preprocessing → Model Inference → Postprocessing → Response
                          ↑                ↓
                     Feature Store     Vector DB / Cache
```

**Responsibilities**

1. **Request handling**
2. **Tokenization / input shaping**
3. **Inference execution**
4. **Decoding & safety filters**
5. **Caching & logging**
6. **Monitoring & rollback**

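The six responsibilities above can be sketched as one small class. This is a toy illustration, not a real serving stack: the `Pipeline` class, its method names, the reversed-token "model", and the blocked-word filter are all hypothetical stand-ins chosen for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    cache: dict = field(default_factory=dict)  # response cache keyed by raw prompt
    log: list = field(default_factory=list)    # request log used for monitoring

    def preprocess(self, prompt: str) -> list:
        # Stand-in for tokenization / input shaping.
        return prompt.strip().lower().split()

    def infer(self, tokens: list) -> list:
        # Stand-in for the model: returns the tokens in reverse order.
        return list(reversed(tokens))

    def postprocess(self, tokens: list) -> str:
        # Decoding plus a trivial safety filter that masks blocked words.
        blocked = {"secret"}
        return " ".join("[filtered]" if t in blocked else t for t in tokens)

    def handle(self, prompt: str) -> str:
        # Request handling: check the cache, run the stages, then log the outcome.
        start = time.perf_counter()
        hit = prompt in self.cache
        if not hit:
            tokens = self.preprocess(prompt)
            self.cache[prompt] = self.postprocess(self.infer(tokens))
        self.log.append({"cache_hit": hit, "latency_s": time.perf_counter() - start})
        return self.cache[prompt]
```
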
---

### 3. Serving Architectures

| Architecture         | Description                   | When to use         |
| -------------------- | ----------------------------- | ------------------- |
| Single Model Server  | One service hosts one model   | Simple deployments  |
| Multi-Model Server   | One service hosts many models | Cost optimization   |
| Microservice Mesh    | Each component is separate    | Large-scale systems |
| Serverless Inference | Auto-scaling functions        | Spiky workloads     |
| Edge Serving         | Model runs near users         | Ultra-low latency   |

---

### 4. Inference Modes

| Mode      | Behavior                                   | Use case           |
| --------- | ------------------------------------------ | ------------------ |
| Batch     | Process many inputs at once                | Offline jobs       |
| Online    | Synchronous request/response               | Chat, APIs         |
| Streaming | Token-by-token output                      | ChatGPT-style UX   |
| Async     | Submit a job, fetch or receive result later | Long-running tasks |

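The difference between the modes is mostly about when the caller receives output. A minimal sketch, assuming a hypothetical `fake_decode` generator in place of a real model's decoding loop:

```python
from typing import Iterator, List


def fake_decode(prompt: str) -> Iterator[str]:
    # Hypothetical stand-in for a decoding loop: yields one token at a time.
    for word in prompt.split():
        yield word.upper()


def online(prompt: str) -> str:
    # Online mode: the caller waits for the full completion.
    return " ".join(fake_decode(prompt))


def streaming(prompt: str) -> Iterator[str]:
    # Streaming mode: each token is forwarded as soon as it is produced.
    for token in fake_decode(prompt):
        yield token


def batch(prompts: List[str]) -> List[str]:
    # Batch mode: many inputs are processed together and returned at the end.
    return [online(p) for p in prompts]
```
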
---

### 5. Key Serving Techniques for GenAI

#### A. Dynamic Batching

Combine multiple user requests into one GPU batch.

```
Requests → Batch → GPU → Split responses
```

Improves throughput at the cost of a small queueing delay while the batch fills, so the wait window is kept short and the batch size is capped.

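One way to picture dynamic batching is a queue plus a worker that flushes when the batch is full or a short deadline passes. The sketch below uses only `asyncio`; the `DynamicBatcher` class, its parameter names, and the doubling "model" in `demo` are assumptions made for illustration, and error handling is omitted.

```python
import asyncio


class DynamicBatcher:
    def __init__(self, run_batch, max_batch_size=8, max_wait_s=0.01):
        self.run_batch = run_batch            # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        # Each request enqueues its input and waits on its own future.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.run_batch([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)           # split the batch result per request


async def demo():
    batch_sizes = []

    def run_batch(items):
        batch_sizes.append(len(items))
        return [x * 2 for x in items]         # stand-in for one GPU forward pass

    batcher = DynamicBatcher(run_batch, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, batch_sizes

# Usage: results, batch_sizes = asyncio.run(demo())
```

The two knobs, `max_batch_size` and `max_wait_s`, express the throughput versus latency trade-off directly: a longer wait fills larger batches but delays the first request in each batch.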
#### B. KV Cache Reuse

Stores attention keys/values to avoid recomputation during decoding.

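Each new token then attends to cached keys/values instead of reprocessing the whole sequence, which gives a large speedup for long prompts and streaming. The trade-off is GPU memory, which grows with sequence length and batch size.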

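A toy single-head attention loop makes the saving concrete. This is a pure-Python sketch under simplifying assumptions (random weights, hidden size 4, no batching); `projections` counts how many key/value projections each strategy computes, and every name here is made up for the example.

```python
import math
import random

random.seed(0)
D = 4  # toy hidden size


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]


W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
projections = {"count": 0}  # number of key/value projections computed


def project(x, W):
    # Row vector x times matrix W.
    return [sum(x[i] * W[i][j] for i in range(D)) for j in range(D)]


def project_kv(x):
    projections["count"] += 1
    return project(x, W_k), project(x, W_v)


def attend(q, keys, values):
    # Scaled dot-product attention for one query over all given positions.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(D)]


def decode_without_cache(xs):
    outputs = []
    for t in range(1, len(xs) + 1):
        kv = [project_kv(x) for x in xs[:t]]  # recompute K/V for the whole prefix
        keys, values = [k for k, _ in kv], [v for _, v in kv]
        outputs.append(attend(project(xs[t - 1], W_q), keys, values))
    return outputs


def decode_with_cache(xs):
    keys, values, outputs = [], [], []
    for x in xs:
        k, v = project_kv(x)                  # only the new token is projected
        keys.append(k)
        values.append(v)
        outputs.append(attend(project(x, W_q), keys, values))
    return outputs


xs = [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]
```

Both functions return the same attention outputs; for `n` tokens the uncached version computes `n(n+1)/2` key/value projections while the cached version computes `n`.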
#### C. Quantization

Reduce model precision:

| Type | Precision |
| ---- | --------- |
| FP32 | 32-bit    |
| FP16 | 16-bit    |
| INT8 | 8-bit     |
| INT4 | 4-bit     |

Reduces memory footprint and usually increases throughput, at the cost of some accuracy loss that tends to grow as precision drops.

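As a rough illustration, the sketch below shows symmetric per-tensor INT8 quantization on a handful of weights, plus the weight-memory arithmetic behind the table. Production engines typically use per-channel or group-wise schemes; the function names and sample weights are assumptions for the example.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127].
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    # Recover approximate float weights from the integers and the scale.
    return [v * scale for v in q]


def weight_memory_gb(num_params, bits):
    # Memory for the weights alone (excludes activations and KV cache).
    return num_params * bits / 8 / 1e9


weights = [0.02, -1.27, 0.5, 0.731, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale. For an 8-billion-parameter model, `weight_memory_gb(8e9, 16)` gives 16 GB for the weights, versus 4 GB at 4 bits.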
#### D. Speculative Decoding

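A small draft model proposes several tokens; the large model verifies them in one pass and keeps the prefix it agrees with.
This can give roughly a 2–4× decoding speedup, depending on how often the draft tokens are accepted.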

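The draft-then-verify loop can be sketched for the greedy case with two toy "models". Both `target_next` and `draft_next` are hypothetical deterministic functions, not real networks, and the verification loop here stands in for what a real engine does in a single forward pass of the large model. Sampling-based variants use a probabilistic acceptance rule that this sketch does not cover.

```python
def target_next(ctx):
    # Hypothetical "large" model: deterministic next token from the context.
    return (sum(ctx) * 31 + len(ctx)) % 10


def draft_next(ctx):
    # Hypothetical "small" model: agrees with the target except when the
    # context length is a multiple of 4.
    tok = target_next(ctx)
    return (tok + 1) % 10 if len(ctx) % 4 == 0 else tok


def greedy_decode(prompt, n_new):
    # Baseline: one target-model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt, n_new, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The target model checks all k positions (one pass in a real engine).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            accepted.append(expected)  # the target's token is always kept
            if expected != draft[i]:
                break                  # first mismatch: discard the rest of the draft
        out.extend(accepted)
    return out[:len(prompt) + n_new], target_passes
```
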
---

### 6. End-to-End Example (FastAPI + vLLM)

```python
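import threading

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Load the model once at startup (Hugging Face model ID).
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
# The offline LLM class is blocking, so calls are serialized here as a precaution.
llm_lock = threading.Lock()

class GenerateRequest(BaseModel):
    prompt: str  # sent as a JSON body: {"prompt": "..."}

@app.post("/generate")
def generate(req: GenerateRequest):
    # A plain `def` endpoint runs in FastAPI's thread pool, keeping the event loop free.
    params = SamplingParams(temperature=0.7, max_tokens=200)
    with llm_lock:
        outputs = llm.generate([req.prompt], params)
    return {"text": outputs[0].outputs[0].text}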
```

Run (assuming the code is saved as `app.py`):

```bash
uvicorn app:app --host 0.0.0.0 --port 8000
```

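vLLM's engine handles the following internally:

* KV caching
* GPU scheduling
* Continuous batching across the prompts passed to one `generate` call

This minimal wrapper handles one HTTP request at a time and does not stream tokens. For streaming and batching across concurrent requests, use vLLM's `AsyncLLMEngine` or its built-in OpenAI-compatible server (`vllm serve <model>`).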

---

### 7. Production Tooling Ecosystem

| Layer             | Popular Tools               |
| ----------------- | --------------------------- |
| Inference engine  | vLLM, TensorRT-LLM, TGI     |
| Serving framework | FastAPI, Ray Serve, BentoML |
| Orchestration     | Kubernetes, KServe          |
| Monitoring        | Prometheus, Grafana         |
| Caching           | Redis                       |
| Vector store      | FAISS, Milvus, Pinecone     |

---

### 8. Model Versioning & Rollout

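| Technique         | Purpose                                                        |
| ----------------- | -------------------------------------------------------------- |
| Shadow deployment | Run the new model on mirrored live traffic; outputs not served |
| Canary release    | Gradual rollout to a small share of traffic                    |
| A/B testing       | Compare versions on real user metrics                          |
| Rollback          | Immediate recovery to the last known-good version              |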

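A canary release needs a stable way to decide which users see the new version. One common approach, sketched here with a hypothetical `route` function, is to hash the user ID into a bucket from 0 to 99 and compare it with the rollout percentage:

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    # Hash the user ID into a stable bucket 0-99 so the same user always
    # lands on the same model version for a given rollout percentage.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "stable"


def canary_share(user_ids, canary_percent):
    # Fraction of the given users currently routed to the canary.
    hits = sum(route(u, canary_percent) == "canary" for u in user_ids)
    return hits / len(user_ids)
```

Raising `canary_percent` only adds users to the canary group, and setting it to 0 acts as a rollback.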
---

### 9. Safety & Governance in Serving

* Prompt injection detection
* Content filtering
* Rate limiting
* Logging & auditing
* Output moderation pipelines

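Of the controls above, rate limiting is the simplest to sketch. A token bucket is one common choice; the `TokenBucket` class below is an illustrative single-process version with an injectable clock, not a distributed limiter.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_s        # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```
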
---

### 10. Performance Metrics

| Metric            | Meaning                             |
| ----------------- | ----------------------------------- |
| P50 / P99 latency | Median and tail response time       |
| Throughput        | Requests/sec                        |
| GPU utilization   | Hardware efficiency                 |
| Tokens/sec        | Generation speed                    |
| Cost/request      | Business viability                  |

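Most of these metrics can be derived from a request log. The sketch below uses the nearest-rank percentile definition and assumes a single GPU billed per hour; `summarize` and its parameter names are hypothetical.

```python
import math


def percentile(values, p):
    # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, tokens_generated, window_s, gpu_cost_per_hour):
    # Aggregate one observation window of completed requests.
    n = len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "requests_per_s": n / window_s,
        "tokens_per_s": tokens_generated / window_s,
        "cost_per_request": gpu_cost_per_hour / 3600 * window_s / n,
    }
```
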
---

### 11. Serving vs Training

| Aspect             | Training            | Serving                  |
| ------------------ | ------------------- | ------------------------ |
| Primary goal       | Learn parameters    | Deliver predictions      |
| Compute pattern    | Heavy batch compute | Low-latency compute      |
| Failure tolerance  | High                | Very low                 |
| Optimization focus | Convergence         | Speed, cost, reliability |

---

### 12. Mental Model

> **Model training creates intelligence.
> Model serving operationalizes intelligence.**

Without robust serving, GenAI remains a research artifact.

