```{contents}
```
## Cost Optimization

Cost optimization is the systematic design of **models, data pipelines, infrastructure, and inference workflows** to minimize total operational expense **while preserving output quality and latency constraints**.

---

### 1. Why Cost Optimization Matters

Generative AI systems incur costs from:

| Cost Component | Source                                       |
| -------------- | -------------------------------------------- |
| Training       | GPUs, storage, electricity, engineering time |
| Inference      | GPU/CPU runtime, memory, bandwidth           |
| Model size     | Storage, loading latency, memory footprint   |
| Prompting      | Token usage in API-based systems             |
| Data           | Collection, labeling, cleaning               |
| Deployment     | Autoscaling, monitoring, redundancy          |

**Observation:**
For production systems, **inference cost typically dominates** (often 60–85% of total spend; see the breakdown in Section 2).

---

### 2. Cost Structure of a Generative AI System

```
Total Cost =
Training Cost
+ Inference Cost
+ Data Cost
+ Infrastructure Overhead
+ Maintenance & Monitoring
```
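
The formula above can be turned into a small calculation that shows each component's share of spend. This is a minimal sketch; the function name `cost_breakdown` and all dollar figures are hypothetical and chosen only for illustration.

```python
def cost_breakdown(costs):
    """Return each component's share of total spend as a percentage."""
    total = sum(costs.values())
    if total <= 0:
        raise ValueError("total cost must be positive")
    return {name: round(100 * value / total, 1) for name, value in costs.items()}


# Hypothetical monthly spend in USD (illustrative numbers only).
monthly_costs = {
    "training": 15_000,
    "inference": 70_000,
    "data": 7_000,
    "infrastructure": 5_000,
    "maintenance": 3_000,
}

shares = cost_breakdown(monthly_costs)
print(shares)
```

Tracking these shares over time is one way to check where optimization effort is likely to pay off first.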

#### Typical Breakdown (Production)

| Category    | % of Spend |
| ----------- | ---------- |
| Inference   | 60–85%     |
| Training    | 10–25%     |
| Data        | 5–10%      |
| Infra & Ops | 5–10%      |

---

### 3. Core Cost Optimization Strategies

| Layer          | Strategy                            | Purpose                     |
| -------------- | ----------------------------------- | --------------------------- |
| Model          | Distillation, quantization, pruning | Reduce model size & compute |
| Data           | Dataset curation, synthetic data    | Reduce training cost        |
| Training       | Mixed precision, early stopping     | Reduce GPU hours            |
| Inference      | Caching, batching, routing          | Reduce per-request cost     |
| Prompting      | Prompt compression, RAG             | Reduce token usage          |
| Infrastructure | Autoscaling, spot instances         | Reduce idle cost            |

---

### 4. Model-Level Optimization

#### A. Knowledge Distillation

Train a **small model (student)** to imitate a **large model (teacher)**.

```python
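# CE: cross-entropy on hard labels; KL: KL divergence between temperature-softened (T) teacher and student distributions
loss = α * CE(student_logits, labels) + (1 - α) * T**2 * KL(softmax(teacher_logits / T), softmax(student_logits / T))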
```

**Benefit:**
Up to **10× cheaper inference** with minimal quality loss.
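
To make the loss concrete, here is a dependency-free sketch for a single example. It assumes the common formulation with a temperature `T` and a `T²` scaling on the soft term, which the formula above does not spell out; the helper names and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a hard integer label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    hard = cross_entropy(student_logits, label)
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft


# Assumed example logits for a 3-class problem.
loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0)
print(loss)
```

When the student's logits match the teacher's, the KL term vanishes and only the weighted cross-entropy remains.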

---

#### B. Quantization

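Reduce the numeric precision of weights (and optionally activations). Memory reduction follows directly from the bit width; the speedups below are rough figures that depend on hardware and kernel support.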

| Type | Precision | Speedup | Memory Reduction |
| ---- | --------- | ------- | ---------------- |
| FP32 | 32-bit    | 1×      | baseline         |
| FP16 | 16-bit    | ~2×     | 2×               |
| INT8 | 8-bit     | ~4×     | 4×               |
| INT4 | 4-bit     | ~6–8×   | 8×               |

```python
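from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# "llama" is a placeholder; substitute a real model ID or local path.
# 8-bit loading requires the bitsandbytes package and a CUDA GPU.
model = AutoModelForCausalLM.from_pretrained(
    "llama",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)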
```
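
To show what INT8 quantization does to the numbers themselves, the following is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not how `bitsandbytes` works internally; the function names and sample weights are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Assumed example weights.
weights = [0.82, -1.27, 0.03, 0.5, -0.44]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale.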

---

#### C. Pruning

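Remove unimportant weights or neurons. Structured pruning (removing whole neurons, heads, or layers) cuts compute directly, while unstructured pruning needs sparse-kernel support before it translates into speedups.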

**Effect:**
Reduces FLOPs → faster inference → lower energy & hardware cost.
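
One simple way to decide which weights are "unimportant" is magnitude pruning. The sketch below zeroes out the smallest-magnitude weights in a flat list; the function name and the sample weights are hypothetical, and real frameworks operate on tensors rather than lists.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Assumed example weights.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

In practice pruning is usually followed by a short fine-tuning pass to recover any lost quality.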

---

### 5. Training-Level Optimization

#### A. Mixed Precision Training

```python
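import torch
# Assumes model, loss_fn, inputs, and targets are defined and on a CUDA device.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = loss_fn(model(inputs), targets)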
```

**Result:** ~50% memory reduction, ~30–50% speedup.

#### B. Early Stopping

Stop when validation loss plateaus.

```python
# epochs_since_improvement: epochs since val_loss last dropped below best_loss
if epochs_since_improvement >= 5:
    stop_training()
```

Prevents unnecessary GPU spending.
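
The patience logic can be packaged as a small helper. This is a sketch under stated assumptions: the class name `EarlyStopping`, the `min_delta` parameter, and the validation-loss history are all illustrative rather than part of any particular framework.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3)
history = [0.90, 0.70, 0.65, 0.66, 0.65, 0.67, 0.64]  # assumed validation losses
stopped_at = None
for epoch, val_loss in enumerate(history):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)  # 5 0.65
```

Note that the final value in `history` is never reached: the run stops after three epochs without improvement, which is the trade-off a patience setting makes.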

---

### 6. Inference-Level Optimization (Biggest Savings)

#### A. Request Batching

Combine multiple concurrent requests into a single forward pass so the accelerator processes them together.

```
8 separate requests → 1 batched request → ~5× throughput
```
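
A static version of batching can be sketched as follows. The names `make_batches`, `run_batched`, and `fake_model` are hypothetical; a production server would typically also batch dynamically with a timeout rather than over a fixed list.

```python
def make_batches(requests, max_batch_size):
    """Split a list of pending requests into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [requests[i:i + max_batch_size] for i in range(0, len(requests), max_batch_size)]


def run_batched(requests, model_fn, max_batch_size=8):
    """Call model_fn once per batch and return outputs in request order."""
    outputs = []
    for batch in make_batches(requests, max_batch_size):
        outputs.extend(model_fn(batch))
    return outputs


calls = []


def fake_model(batch):
    """Hypothetical stand-in for a batched forward pass."""
    calls.append(len(batch))
    return [text.upper() for text in batch]


prompts = [f"prompt {i}" for i in range(10)]
results = run_batched(prompts, fake_model, max_batch_size=8)
print(calls)  # [8, 2]
```

Ten requests are served with two model calls instead of ten; the throughput gain in a real system depends on how well the hardware is utilized at each batch size.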

#### B. KV Cache Reuse

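Reuse the attention keys/values already computed for earlier tokens so that each decoding step processes only the new token. This matters most for long prompts and multi-turn conversations. In Hugging Face `transformers`, `use_cache=True` is normally the default for `generate`.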

```python
model.generate(input_ids, use_cache=True)
```
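
The effect of the cache can be estimated with simple counting. The sketch below counts how many token positions are fed through the model during generation; it is a simplified model (the function name is hypothetical, and attention over cached keys still grows with sequence length), not a FLOP-accurate estimate.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model while generating new_tokens tokens."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # The prompt is processed once; each later step feeds only the newest token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-processes the full sequence generated so far.
    return sum(prompt_len + step for step in range(new_tokens))


with_cache = positions_processed(prompt_len=100, new_tokens=50, use_cache=True)
without_cache = positions_processed(prompt_len=100, new_tokens=50, use_cache=False)
print(with_cache, without_cache)  # 149 6225
```

The gap widens as the prompt or the generated output grows, since the uncached count is quadratic in sequence length.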

#### C. Response Caching

```
Same query → cached output → near-zero inference cost
```
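
A minimal in-memory version of this idea is sketched below. The class `ResponseCache`, its normalization rule (lowercasing and collapsing whitespace), and `fake_generate` are assumptions for illustration; exact-match caching only suits workloads where returning an identical earlier answer is acceptable.

```python
import hashlib
from collections import OrderedDict


class ResponseCache:
    """Small LRU cache keyed by a hash of the normalized prompt."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, generate_fn):
        """Return a cached response, or call generate_fn and store the result."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)
            return self._store[key]
        self.misses += 1
        response = generate_fn(prompt)
        self._store[key] = response
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return response


def fake_generate(prompt):
    """Hypothetical stand-in for an expensive model call."""
    return f"answer to: {prompt.strip()}"


cache = ResponseCache(max_entries=2)
first = cache.get_or_compute("What is RAG?", fake_generate)
second = cache.get_or_compute("  what is   rag? ", fake_generate)
print(cache.hits, cache.misses)  # 1 1
```

The second call differs only in case and spacing, so it is served from the cache without invoking the model.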

#### D. Model Routing

Use small model by default; escalate only when needed.

```python
if confidence < threshold:
    use_large_model()
else:
    use_small_model()
```

**Result:** often a 50–80% cost reduction, depending on the share of requests the small model can handle.
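
A slightly fuller sketch of the routing logic is shown below. It assumes the small model returns a confidence score alongside its answer, which is one of several possible routing signals; both model functions and the word-count heuristic are hypothetical placeholders.

```python
def route(prompt, small_model, large_model, threshold=0.8):
    """Answer with the small model unless its confidence falls below threshold."""
    answer, confidence = small_model(prompt)
    if confidence < threshold:
        return large_model(prompt), "large"
    return answer, "small"


def small_model(prompt):
    """Hypothetical small model: confident only on short prompts."""
    confidence = 0.95 if len(prompt.split()) <= 5 else 0.4
    return f"small:{prompt}", confidence


def large_model(prompt):
    """Hypothetical large model."""
    return f"large:{prompt}"


easy = route("Capital of France?", small_model, large_model)
hard = route(
    "Compare three approaches to distributed consensus in detail",
    small_model,
    large_model,
)
print(easy[1], hard[1])  # small large
```

In this design an escalated request pays for both models, so the savings depend on keeping the escalation rate low.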

---

### 7. Prompt & Token Optimization

#### A. Prompt Compression

```
Verbose prompt → compressed instructions → fewer tokens
```

#### B. Retrieval-Augmented Generation (RAG)

Instead of huge context windows:

```
Query → Retrieve relevant docs → Small context → Generate
```

Reduces token count by **5–20×**.
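
The retrieve-then-generate flow can be illustrated with a toy retriever. This sketch ranks documents by word overlap as a stand-in for embedding search, and uses whitespace splitting instead of a real tokenizer; all function names and documents are hypothetical.

```python
def tokenize(text):
    """Crude whitespace tokenizer; a real system would use the model's tokenizer."""
    return text.lower().split()


def retrieve(query, documents, top_k=2):
    """Rank documents by word overlap with the query and return the top_k."""
    query_terms = set(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & set(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents, top_k=2):
    """Assemble a prompt containing only the retrieved context."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"


documents = [
    "Spot instances are cheaper but can be interrupted at any time.",
    "Quantization reduces the numeric precision of model weights.",
    "Autoscaling adds or removes replicas based on load.",
    "Distillation trains a small student model to imitate a teacher.",
]
query = "how does quantization reduce precision"
prompt = build_prompt(query, documents, top_k=1)
full_context_tokens = sum(len(tokenize(d)) for d in documents)
retrieved_tokens = len(tokenize(retrieve(query, documents, top_k=1)[0]))
print(retrieved_tokens, full_context_tokens)
```

Only the single matching document enters the prompt, so the context is a fraction of what sending the whole corpus would cost.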

---

### 8. Infrastructure Optimization

| Technique          | Impact                 |
| ------------------ | ---------------------- |
| Spot Instances     | 60–90% cheaper GPUs    |
| Autoscaling        | Reduced idle GPU cost  |
| Cold-start control | Reduce wasted runtime  |
| Model sharding     | Efficient memory usage |
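
The combined effect of autoscaling and spot pricing can be approximated with simple arithmetic. The sketch below uses a hypothetical load profile, per-replica capacity, and hourly price, with a 70% spot discount picked from within the range in the table; it ignores interruptions and scaling lag.

```python
import math


def replicas_needed(rps, capacity_per_replica, min_replicas=1, max_replicas=20):
    """Replicas required to serve rps, clamped to the allowed range."""
    needed = math.ceil(rps / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def total_cost(replica_hours, price_per_hour, spot_discount=0.0):
    """Cost of replica_hours at an hourly price, with an optional spot discount."""
    return replica_hours * price_per_hour * (1 - spot_discount)


# Hypothetical load (requests/second) for four consecutive hours.
hourly_load = [5, 40, 120, 30]
capacity = 25  # assumed requests/second one replica can serve
price = 2.0    # assumed on-demand USD per replica-hour

autoscaled_hours = sum(replicas_needed(rps, capacity) for rps in hourly_load)
static_hours = replicas_needed(max(hourly_load), capacity) * len(hourly_load)

static_cost = total_cost(static_hours, price)
autoscaled_cost = total_cost(autoscaled_hours, price)
autoscaled_spot_cost = total_cost(autoscaled_hours, price, spot_discount=0.7)
print(static_cost, autoscaled_cost, round(autoscaled_spot_cost, 2))
```

Provisioning for peak load all day uses 20 replica-hours here, while scaling with demand uses 10; the spot discount then applies on top of that.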

---

### 9. Cost vs Quality Trade-off Curve

```
High cost ──● Large model
           │
           │
           ● Optimized model
           │
Low cost ──● Over-optimized (quality loss)
```

Goal: operate near the **Pareto frontier**.
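
Given measured cost and quality for several candidate configurations, the frontier can be found by discarding every configuration that another one beats on both axes. This is a minimal sketch; the configuration names, costs, and quality scores are hypothetical.

```python
def pareto_frontier(configs):
    """Return names of configs that no other config dominates.

    A config is dominated if another one costs no more and scores no lower,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, quality in configs:
        dominated = any(
            (c <= cost and q >= quality) and (c < cost or q > quality)
            for _, c, q in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, cost per 1K tokens in USD, quality score) tuples.
configs = [
    ("70B FP16", 0.12, 0.92),
    ("13B FP16", 0.04, 0.88),
    ("13B INT8", 0.02, 0.87),
    ("13B INT4", 0.015, 0.78),
    ("7B FP16", 0.03, 0.80),
]
frontier = pareto_frontier(configs)
print(frontier)  # ['70B FP16', '13B FP16', '13B INT8', '13B INT4']
```

In this example the 7B FP16 configuration is dropped because the 13B INT8 one is both cheaper and higher quality; the remaining points represent genuine trade-offs.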

---

### 10. Example: End-to-End Cost Optimization Pipeline

```
Large model (70B)
    ↓ Distillation
Medium model (13B)
    ↓ Quantization (INT8)
Small efficient model (13B INT8)
    ↓ Routing + Caching + RAG
Production deployment
```

**Illustrative Results (Typical):**

| Metric             | Before | After   |
| ------------------ | ------ | ------- |
| Latency            | 900 ms | 150 ms  |
| Cost per 1K tokens | $0.12  | $0.02   |
| Throughput         | 50 rps | 400 rps |

---

### 11. Summary Table

| Layer     | Main Techniques                     |
| --------- | ----------------------------------- |
| Model     | Distillation, quantization, pruning |
| Training  | Mixed precision, early stopping     |
| Inference | Caching, batching, routing          |
| Prompting | Prompt compression, RAG             |
| Infra     | Autoscaling, spot GPUs              |

---

### Key Principle

> **Most real-world GenAI savings come from inference optimization, not model training.**

Optimizing the serving pipeline yields the largest and most sustainable cost reductions.
