```{contents}
```
## Horizontal Scaling

**Horizontal scaling** (scale-out) means **adding more machines or service instances** instead of making a single machine bigger.

```
Scale Up  = Bigger server
Scale Out = More servers
```

Horizontal scaling underpins most cloud-native and AI systems, because capacity grows by adding instances rather than by replacing hardware.

---

### Where It Fits in the Architecture

```
Clients → Load Balancer → Service Instances (N) → Workers / LLM / DB
```

---

### Why Horizontal Scaling Is Essential for LLM Systems

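* LLM requests are compute-intensive, so a single instance saturates quickly
* User traffic is unpredictable, so capacity must grow and shrink on demand
* A failed instance must not take the whole service down
* Growth is handled by adding instances rather than redesigning the system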

---

### Basic Horizontal Scaling with FastAPI

#### Demonstration

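Run multiple instances of the same app, each on its own port. Each command blocks, so run them in separate terminals or background them with `&`:

```bash
uvicorn main:app --port 8001 &
uvicorn main:app --port 8002 &
uvicorn main:app --port 8003 &
```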

---

### Load Balancing Across Instances

```nginx
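# Inside the http {} context; nginx defaults to round-robin
upstream api_pool {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
}
server {
    listen 80;  # example port
    location / { proxy_pass http://api_pool; }
}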
```
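
To make the distribution concrete, here is a minimal Python sketch of round-robin selection, the strategy nginx applies to an `upstream` group by default. It only simulates the choice of backend and sends no traffic; the `round_robin` helper and the `BACKENDS` list are illustrative names, not part of nginx.

```python
from itertools import cycle

# Same three local instances as the upstream block above.
BACKENDS = ["127.0.0.1:8001", "127.0.0.1:8002", "127.0.0.1:8003"]


def round_robin(backends):
    """Return an endless iterator that cycles through the backends in order."""
    return cycle(backends)


picker = round_robin(BACKENDS)
assignments = [next(picker) for _ in range(6)]
for request_id, backend in enumerate(assignments, start=1):
    print(f"request {request_id} -> {backend}")
```

Six requests land twice on each instance, which is why every instance must be able to serve any request.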

---

### Stateless Application Design

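Horizontal scaling requires **stateless servers**: the load balancer may send consecutive requests from the same user to different instances, so state held in one process's memory is invisible to the others. Keep shared state in an external store instead.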

#### Demonstration

```python
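import json
import redis

# BAD: an in-process dict is visible only to the instance that wrote it
session = {}

# GOOD: store the session in Redis, which every instance can reach
r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis server
r.set("session:abc123", json.dumps({"user_id": 42}), ex=3600)  # example key, 1-hour TTL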
```
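
The failure mode is easier to see in a small simulation. The sketch below uses no real Redis: a plain dict stands in for the shared store, and the `StatefulInstance` and `StatelessInstance` classes are hypothetical names chosen for illustration.

```python
class StatefulInstance:
    """Keeps sessions in its own memory (the BAD pattern)."""

    def __init__(self):
        self.sessions = {}

    def login(self, session_id, user):
        self.sessions[session_id] = user

    def whoami(self, session_id):
        return self.sessions.get(session_id)


class StatelessInstance:
    """Keeps sessions in a store shared by all instances (the GOOD pattern)."""

    def __init__(self, store):
        self.store = store  # stand-in for a Redis client

    def login(self, session_id, user):
        self.store[session_id] = user

    def whoami(self, session_id):
        return self.store.get(session_id)


# Two stateful instances: the login hits A, the next request hits B.
a, b = StatefulInstance(), StatefulInstance()
a.login("s1", "alice")
lost = b.whoami("s1")  # None: B never saw the login

# Two stateless instances sharing one store.
shared = {}  # plain dict standing in for Redis
c, d = StatelessInstance(shared), StatelessInstance(shared)
c.login("s1", "alice")
found = d.whoami("s1")  # "alice": any instance can read the session

print(lost, found)
```

With local state the user appears logged out as soon as the load balancer picks a different instance; with a shared store the choice of instance no longer matters.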

---

### Worker Pool Scaling

```bash
uvicorn main:app --workers 4
```

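Each worker is a separate process on the same machine and handles requests independently, so workers share no in-memory state. This uses all CPU cores on one host and complements running more instances.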
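
How many workers to start is a tuning question. One commonly cited starting point, taken from the Gunicorn documentation rather than from Uvicorn, is `(2 x cores) + 1`. The sketch below encodes that heuristic; `suggested_workers` is a hypothetical helper, and compute-bound LLM workloads may well need fewer workers than it suggests.

```python
import os


def suggested_workers(cores=None):
    """Starting point for a worker count: (2 x cores) + 1."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() may return None
    return 2 * cores + 1


print(suggested_workers(4))  # 9
```

Treat the result as a first guess to validate with load tests, not as a rule.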

---

### Kubernetes Horizontal Scaling

#### Demonstration

```yaml
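apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa          # hypothetical name
spec:
  scaleTargetRef:        # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: api            # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  # no metrics set: the API default is 80% average CPU utilization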
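
The autoscaler's core decision can be sketched in a few lines of Python. The ratio used here, `ceil(currentReplicas * currentMetric / targetMetric)`, follows the algorithm described in the Kubernetes documentation; this sketch leaves out the tolerance band, stabilization windows, and pod-readiness handling that the real controller applies. `desired_replicas` is an illustrative name, and the bounds default to the `minReplicas` and `maxReplicas` values above.

```python
import math


def desired_replicas(current, metric, target, min_replicas=2, max_replicas=10):
    """Replica count suggested by the core HPA ratio, clamped to the bounds."""
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(current=4, metric=90, target=60))   # 6
print(desired_replicas(current=4, metric=300, target=60))  # 10 (capped at max)
print(desired_replicas(current=4, metric=10, target=60))   # 2 (held at min)
```

When load is 1.5 times the target, four replicas become six; the bounds keep a spike or a quiet period from scaling past the configured range.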

---

### LLM System Scaling Model

| Layer     | Scaling      |
| --------- | ------------ |
| API       | Horizontal   |
| Workers   | Horizontal   |
| Vector DB | Sharded      |
| Cache     | Distributed  |
| LLM       | Rate-limited |
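
The LLM row is the one layer that is throttled rather than multiplied. One common way to rate-limit calls is a token bucket, sketched below under simple assumptions: time is passed in explicitly so the behavior is deterministic, and the `TokenBucket` class with its `rate` and `capacity` parameters is an illustrative design, not any particular provider's API.

```python
class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = 0.0  # timestamp of the last refill

    def allow(self, now):
        """Return True if a call at time `now` (in seconds) may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1, capacity=2)
results = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)]
print(results)  # [True, True, False, True]
```

The first two calls drain the burst allowance, the third is rejected, and the fourth succeeds once enough time has passed. In a horizontally scaled API the bucket itself would need to live in a shared store, since a per-instance bucket limits each instance separately.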

---

### Mental Model

```
Horizontal Scaling = Adding lanes to a highway
```

---

### Key Takeaways

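* Horizontal scaling is a core principle of cloud and AI systems
* It requires a stateless architecture, with shared state held in an external store
* It depends on load balancing to spread traffic across instances
* It lets capacity grow by adding instances while tolerating individual failures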