```{contents}
```
## Latency Optimization

Latency is the **end-to-end response time** between a user request and the generated output.
For Generative AI systems, low latency is essential for **usability, interactivity, and scalability**.

---

### 1. Where Latency Comes From

| Component         | Description                               |
| ----------------- | ----------------------------------------- |
| Model Inference   | Neural network forward pass               |
| Tokenization      | Text → tokens and tokens → text           |
| Model Loading     | Weight initialization and memory transfer |
| Network           | Request/response transfer                 |
| Decoding Strategy | Greedy, beam search, sampling             |
| Post-processing   | Formatting, safety checks, logging        |

Total latency for a request served by an already-loaded model (a cold start adds the model loading time on top):

$$
T = T_{network} + T_{tokenization} + T_{inference} + T_{decoding} + T_{post}
$$
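To see which term dominates in practice, each stage can be timed separately. The sketch below is a minimal illustration, not a profiling tool: `StageTimer` and `handle_request` are hypothetical names, and the `time.sleep` calls are placeholders standing in for real network, tokenization, inference, decoding, and post-processing work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def total(self):
        # T is the sum of the per-stage terms in the formula above.
        return sum(self.stages.values())


def handle_request(timer):
    # Each sleep is a placeholder for real work (assumption for illustration).
    for name, seconds in [
        ("network", 0.002),
        ("tokenization", 0.001),
        ("inference", 0.005),
        ("decoding", 0.002),
        ("post", 0.001),
    ]:
        with timer.stage(name):
            time.sleep(seconds)


timer = StageTimer()
handle_request(timer)
for name, seconds in timer.stages.items():
    print(f"{name:>12}: {seconds * 1000:.1f} ms ({seconds / timer.total():.0%})")
```

Wrapping real calls in `timer.stage(...)` gives a per-request breakdown, which is usually the first step before choosing which optimization layer to work on.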

---

### 2. Key Latency Metrics

| Metric     | Meaning                    |
| ---------- | -------------------------- |
| TTFT       | Time To First Token        |
| TPOT       | Time Per Output Token      |
| End-to-End | Total response time        |
| Throughput | Tokens/sec or requests/sec |
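These metrics can all be derived from the arrival time of each output token. The following sketch assumes you have recorded a request timestamp and one timestamp per streamed token; the function name `latency_metrics` and the sample timestamps are illustrative.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT, TPOT, end-to-end latency, and token throughput.

    request_time: timestamp (seconds) when the request was sent.
    token_times:  timestamps (seconds) at which each output token arrived.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    ttft = token_times[0] - request_time
    end_to_end = token_times[-1] - request_time
    # TPOT averages the gaps between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    throughput = n / end_to_end if end_to_end > 0 else float("inf")
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "end_to_end_s": end_to_end,
        "tokens_per_s": throughput,
    }


# Hypothetical trace: request at t=0, first token after 200 ms,
# then one token every 50 ms.
metrics = latency_metrics(0.0, [0.20, 0.25, 0.30, 0.35, 0.40])
print(metrics)
```

Under these definitions, end-to-end latency is approximately TTFT plus TPOT times the number of remaining tokens, which is why long outputs are dominated by TPOT and short ones by TTFT.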

---

### 3. Optimization Layers

### A. **Model-Level Optimization**

| Technique                | Effect                                                        |
| ------------------------ | ------------------------------------------------------------- |
| Quantization (INT8/INT4) | Lower memory; faster compute when kernels support the format  |
| Pruning                  | Removes redundant weights                                     |
| Distillation             | Smaller student model                                         |
| LoRA Adapters            | Lightweight fine-tuning; adapters can be merged into the base |
| Flash Attention          | Faster, more memory-efficient attention computation           |

Example: 8-bit quantization (Hugging Face Transformers with bitsandbytes)

```python
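from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package and a CUDA-capable GPU.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # Transformers-format checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # place layers on the available devices automatically
)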
```
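To make the idea behind INT8 quantization concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is only an illustration of the arithmetic: production libraries use more elaborate schemes (per-channel scales, outlier handling), and the function names and sample weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 0, 100]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each weight now needs one byte instead of four (versus float32), at the cost of a rounding error of at most half the scale per weight. Note how the small value `0.003` collapses to zero: that is the kind of precision loss quantization trades for memory.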

---

### B. **Inference Optimization**

#### 1. KV Caching

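Cache the attention keys and values of previous tokens so that each decoding step only processes the newest token instead of the whole sequence.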

```python
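# input_ids: prompt token IDs produced by the tokenizer.
# use_cache=True is the default in Transformers; shown here for emphasis.
outputs = model.generate(
    input_ids,
    max_new_tokens=64,
    use_cache=True,
)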
```
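The saving is easy to quantify with a simplified cost model that counts how many token positions must be projected into keys and values during generation. This is a rough sketch under the assumption that per-step cost is proportional to the number of positions processed; the function names are illustrative.

```python
def projections_without_cache(prompt_len, new_tokens):
    """Every step re-projects keys/values for the whole sequence so far."""
    total = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        total += seq_len  # recompute K/V for every token seen so far
        seq_len += 1      # the generated token joins the sequence
    return total


def projections_with_cache(prompt_len, new_tokens):
    """The prompt is projected once; each later step adds only one token."""
    if new_tokens <= 0:
        return 0
    # Prefill handles the prompt and yields the first new token;
    # every remaining step projects just the most recent token.
    return prompt_len + (new_tokens - 1)


print(projections_without_cache(100, 50))  # 6225
print(projections_with_cache(100, 50))     # 149
```

Without a cache the work grows quadratically with the output length; with a cache it grows linearly, at the price of keeping the cached keys and values in GPU memory.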

#### 2. Efficient Decoding

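| Strategy      | Latency                          | Quality                                    |
| ------------- | -------------------------------- | ------------------------------------------ |
| Greedy        | Lowest                           | Deterministic; can be repetitive           |
| Top-k / Top-p | Close to greedy                  | More diverse; suits open-ended generation  |
| Beam Search   | Highest (grows with beam width)  | Strong on constrained tasks (translation)  |

For chat systems, prefer greedy decoding or sampling with a small top-k; avoid beam search when latency matters.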
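The difference between greedy decoding and top-k sampling comes down to how the next token is picked from the model's logits. The sketch below works on a plain list of logits; `greedy` and `top_k_sample` are illustrative helpers, not functions from any library.

```python
import math
import random


def greedy(logits):
    """Pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample among the k highest logits (temperature must be > 0)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Softmax numerators; subtracting the peak keeps exp() numerically stable.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0, -1.0]
print(greedy(logits))                                      # 1
print(top_k_sample(logits, k=2, rng=random.Random(0)))     # 1 or 2
```

Both strategies need a single forward pass per token, so their latency is similar. Beam search, by contrast, keeps several candidate sequences alive and multiplies the per-step work accordingly.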

---

### C. **System-Level Optimization**

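| Method              | Benefit                                                   |
| ------------------- | --------------------------------------------------------- |
| Model Sharding      | Fits models too large for one device                      |
| Tensor Parallelism  | Splits each layer's compute across GPUs                   |
| Batching            | Higher throughput (may raise per-request latency)         |
| Continuous Batching | Steadier latency; finished requests free slots right away |
| GPU Utilization     | Keeps the accelerator busy instead of idle                |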

Example: Continuous batching with vLLM (enabled by default; the flags below tune the per-step token budget and the GPU memory share)

```bash
vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9
```
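Why continuous batching helps can be shown with a small simulation. The sketch assumes all requests arrive at time zero, every decode step takes one time unit and yields one token per active request, and prefill cost is ignored; the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Finish step of each request when batches run to completion in turn."""
    finish = []
    clock = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        finish.extend(clock + n for n in batch)
        clock += max(batch)  # the batch holds the GPU until its longest request ends
    return finish


def continuous_batching(lengths, batch_size):
    """Finish step of each request when freed slots are refilled immediately."""
    waiting = deque(enumerate(lengths))
    active = {}  # request index -> tokens still to generate
    finish = [0] * len(lengths)
    clock = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            idx, n = waiting.popleft()
            active[idx] = n
        clock += 1  # one decode step: one token for every active request
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                finish[idx] = clock
                del active[idx]
    return finish


lengths = [2, 8, 2, 2]  # output tokens needed per request (each >= 1)
print(static_batching(lengths, batch_size=2))      # [2, 8, 10, 10]
print(continuous_batching(lengths, batch_size=2))  # [2, 8, 4, 6]
```

In the static case the two short requests at the end wait for the long one to finish; with continuous batching they slip into the slot freed at step 2, so the mean completion time drops from 7.5 to 5 steps.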

---

### D. **Serving & Infrastructure**

| Technique            | Description                                                |
| -------------------- | ---------------------------------------------------------- |
| Warm Models          | Keep weights loaded to avoid cold starts                   |
| Async I/O            | Overlap compute & network                                  |
| Pinned Memory        | Faster CPU↔GPU transfer                                    |
| Speculative Decoding | Small draft model proposes tokens; large model verifies    |
| Edge Deployment      | Reduce network delay                                       |
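As an illustration of the async I/O row, the sketch below handles several requests concurrently with `asyncio`, so that time spent waiting on one request overlaps with the others. The `asyncio.sleep` call is a stand-in for awaiting a network call or a GPU worker, and `handle_request` and `serve_concurrently` are hypothetical names.

```python
import asyncio


async def handle_request(request_id, stats):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # placeholder for awaiting network or model output
    stats["in_flight"] -= 1
    return f"response-{request_id}"


async def serve_concurrently(request_ids):
    stats = {"in_flight": 0, "peak": 0}
    # gather() starts all handlers and returns their results in input order.
    responses = await asyncio.gather(
        *(handle_request(r, stats) for r in request_ids)
    )
    return responses, stats["peak"]


responses, peak = asyncio.run(serve_concurrently([1, 2, 3]))
print(responses, peak)
```

All three requests are in flight at once, so total wall time is close to the slowest single request rather than the sum of all three.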

Speculative Decoding Concept:

```text
Small model drafts k tokens → Large model verifies them in one forward pass → Accept the matching prefix → Fewer large-model passes per token
```
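The flow above can be sketched with two toy "models" that map a token sequence to the next token. This is the greedy variant only, under the assumption that both models decode deterministically; sampling-based speculative decoding uses a probabilistic acceptance rule instead. All names and the toy models are hypothetical.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Reference: one target-model pass per generated token."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target_next(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (sequence, target passes used)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target checks every drafted position. A real system scores
        #    all k positions in a single batched forward pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's token replaces the mismatch
                break
        else:
            accepted.append(target_next(seq + draft))  # bonus token from the same pass
        seq.extend(accepted)
    return seq[:len(prompt) + max_new_tokens], target_passes


# Toy models: the target counts upward modulo 10; the draft agrees except after a 4.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] % 5 == 4 else (seq[-1] + 1) % 10


out, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=8)
print(out, passes)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 2
```

The output matches plain greedy decoding with the target model, but here it takes 2 verification passes instead of 8 sequential ones. The gain depends entirely on how often the draft model agrees with the target.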

---

### 4. End-to-End Latency Optimization Workflow

```text
User Request
   ↓
Prompt Optimization (shorter context)
   ↓
Tokenization Optimization
   ↓
KV Cache + Efficient Decoding
   ↓
Quantized Model on GPU
   ↓
Streaming First Token
   ↓
Continuous Batching + Async Serving
```
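The first step of this workflow, shortening the context, can be as simple as keeping only the most recent conversation turns that fit a token budget. The sketch below uses whitespace splitting as a stand-in token counter; in a real system you would pass the model tokenizer's count instead. `trim_history` and the sample messages are illustrative.

```python
def trim_history(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "hello there",
    "how can I help you today",
    "summarize this report please",
]
print(trim_history(history, max_tokens=10))
# ['how can I help you today', 'summarize this report please']
```

A shorter prompt means less prefill work before the first token is produced, so trimming mainly improves TTFT.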

---

### 5. Practical Example: Fast Chat Pipeline

```python
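import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,  # half precision halves weight memory vs. float32
    device_map="auto",
)

result = generator(
    "Explain transformers in one sentence:",
    max_new_tokens=50,  # cap output length to bound generation time
    do_sample=False,    # greedy decoding for low latency
)
print(result[0]["generated_text"])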
```

---

### 6. Common Latency Trade-offs

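| Faster          | Slower          | Cost of the faster option                          |
| --------------- | --------------- | -------------------------------------------------- |
| Smaller model   | Larger model    | Lower output quality                               |
| Quantized       | Full precision  | Possible accuracy loss                             |
| Greedy decoding | Beam search     | Narrower search over candidate outputs             |
| Short context   | Long context    | Less information available to the model            |
| Edge inference  | Cloud inference | Limited hardware; wins only if network dominates   |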

---

### 7. Target Latency Benchmarks (Interactive AI)

| Use Case            | Ideal Latency |
| ------------------- | ------------- |
| Chat UI             | < 200 ms TTFT |
| Voice Assistant     | < 300 ms      |
| Search Assistant    | < 500 ms      |
| Document Generation | < 2 s         |
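Targets like these are usually checked against a tail percentile rather than the average, since a few slow requests dominate perceived responsiveness. The sketch below uses a nearest-rank percentile; the budget values are copied from the first three rows of the table, while the sample measurements and helper names are made up for illustration.

```python
import math

# Latency targets in milliseconds, taken from the table above.
LATENCY_BUDGET_MS = {"chat": 200, "voice": 300, "search": 500}


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(use_case, samples_ms, pct=95):
    """True if the chosen percentile is below the target for the use case."""
    return percentile(samples_ms, pct) < LATENCY_BUDGET_MS[use_case]


# Hypothetical measurements with one slow outlier.
samples_ms = [120, 150, 180, 90, 400, 130, 110, 160, 140, 170]
print(percentile(samples_ms, 50))          # 140
print(percentile(samples_ms, 95))          # 400
print(meets_budget("chat", samples_ms))    # False
print(meets_budget("search", samples_ms))  # True
```

Here the median looks healthy for a chat UI, but the single 400 ms outlier pushes p95 over the 200 ms target, which is exactly the kind of regression an average would hide.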

---

### 8. Summary

Latency optimization in Generative AI is achieved by **co-designing**:

* **Models** (quantization, distillation)
* **Inference** (KV cache, decoding strategies)
* **Systems** (batching, parallelism)
* **Infrastructure** (GPU utilization, edge deployment)