```{contents}
```
## Load Balancing 

### What Is Load Balancing

**Load balancing** is the process of **distributing incoming traffic across multiple servers** so that no single machine becomes overloaded.

Combined with redundant servers and health checks, it helps provide:

* High availability
* Scalability
* Fault tolerance
* Low latency

---

### Where Load Balancing Fits

```
Clients → Load Balancer → Server Pool → Services / LLM / DB
```

---

### Why Load Balancing Is Critical for AI Systems

* LLM requests are compute-heavy
* Traffic patterns are bursty
* Failures are expensive
* User experience depends on latency

---

### Load Balancing Algorithms

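| Algorithm         | Description                                                      |
| ----------------- | ---------------------------------------------------------------- |
| Round Robin       | Requests go to each server in turn, one after another            |
| Least Connections | Each request goes to the server with the fewest active requests  |
| IP Hash           | A hash of the client IP picks the server, so a client stays put  |
| Weighted          | Servers with higher weights receive proportionally more traffic  |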
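
The other three algorithms in the table can be sketched in a few lines each. This is a minimal illustration, not how any particular load balancer implements them: the server names, the `active` connection counts, and the weights are all made-up values, and the helper names (`least_connections`, `ip_hash`, `weighted_cycle`) are hypothetical.

```python
import hashlib
import itertools

SERVERS = ["server1", "server2", "server3"]  # hypothetical backend names


def least_connections(active):
    """Pick the server with the fewest in-flight requests.

    `active` maps server name -> current number of active requests.
    """
    return min(active, key=active.get)


def ip_hash(client_ip, servers=SERVERS):
    """Map a client IP to a server using a stable hash of the address."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def weighted_cycle(weights):
    """Return an endless iterator where each server repeats per its weight."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


# Example usage with assumed values
print(least_connections({"server1": 4, "server2": 1, "server3": 2}))
print(ip_hash("203.0.113.7") == ip_hash("203.0.113.7"))
rotation = weighted_cycle({"server1": 2, "server2": 1})
print([next(rotation) for _ in range(6)])
```

Note that the simple modulo in `ip_hash` remaps most clients whenever the pool size changes; production systems often use consistent hashing to limit that churn.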

---

### Simple Python Load Balancer (Conceptual)

#### Demonstration

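```python
import itertools

# Backend pool (placeholder names)
servers = ["server1", "server2", "server3"]
# Endless iterator that yields the servers in order, then wraps around
cycle = itertools.cycle(servers)

def route_request():
    """Return the next server in round-robin order."""
    return next(cycle)

# Route 10 requests: server1, server2, server3, server1, ...
for _ in range(10):
    print(route_request())
```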

---

### Reverse Proxy Load Balancing (Nginx Example)

#### Demonstration

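```nginx
# nginx requires a top-level events block; the defaults are fine here
events {}

http {
    # Pool of backend servers (addresses and ports are examples)
    upstream llm_backend {
        least_conn;  # prefer the server with the fewest active connections
        server 127.0.0.1:8001;
        server 127.0.0.1:8002;
        server 127.0.0.1:8003;
    }

    server {
        listen 80;

        location / {
            # Forward every request to the upstream pool
            proxy_pass http://llm_backend;
        }
    }
}
```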

---

### FastAPI + Multiple Workers

```bash
# Start 4 worker processes on one machine for the `app` object in main.py
uvicorn main:app --workers 4
```

---

### Cloud Load Balancer Architecture

```
Users
  ↓
Cloud Load Balancer
  ↓
Kubernetes Service
  ↓
Pods (FastAPI + LangChain)
```

---

### Health Checks

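```nginx
# Goes inside an upstream block. This is a passive check: after 3 failed
# attempts within 30s, the server is marked unavailable for 30s.
server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
```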
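
To make the `max_fails` / `fail_timeout` idea concrete, here is a small Python sketch of passive health tracking. It is loosely modeled on those two parameters and is not nginx's actual implementation; the class name `PassiveHealthTracker` and its methods are hypothetical, and the clock is injectable only so the behavior can be exercised without waiting.

```python
import time


class PassiveHealthTracker:
    """Marks a server unavailable after repeated failures in a time window."""

    def __init__(self, servers, max_fails=3, fail_timeout=30.0, clock=time.monotonic):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.clock = clock  # injectable time source (assumption for testing)
        self.failures = {s: [] for s in servers}     # recent failure timestamps
        self.down_until = {s: 0.0 for s in servers}  # time each server comes back

    def record_failure(self, server):
        now = self.clock()
        # Keep only failures that fall inside the current window
        recent = [t for t in self.failures[server] if now - t < self.fail_timeout]
        recent.append(now)
        self.failures[server] = recent
        if len(recent) >= self.max_fails:
            # Too many failures: take the server out of rotation for a while
            self.down_until[server] = now + self.fail_timeout
            self.failures[server] = []

    def available(self):
        """Return the servers that are currently eligible for traffic."""
        now = self.clock()
        return [s for s in self.failures if now >= self.down_until[s]]
```

A router would call `record_failure` whenever a proxied request errors out, and pick its next target only from `available()`.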

---

### LLM-Specific Load Balancing Strategy

| Layer     | Strategy     |
| --------- | ------------ |
| API       | Round robin  |
| LLM       | Rate limited |
| Workers   | Auto-scaled  |
| Vector DB | Sharded      |
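
The table does not say how the LLM layer is rate limited. One common option is a token bucket, sketched below under that assumption; the `TokenBucket` class, its parameters, and the example numbers are illustrative rather than taken from any specific library.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable time source (assumption for testing)
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: roughly 5 LLM calls per second with bursts of up to 10 (made-up limits)
limiter = TokenBucket(rate=5, capacity=10)
print(limiter.allow())
```

Requests that are refused can be queued, retried with backoff, or rejected with an HTTP 429, depending on the latency budget.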

---

### Mental Model

```
Load Balancer = Traffic police for your servers
```

---

### Key Takeaways

* Needed once an AI system outgrows a single server
* Reduces the risk of downtime and overload
* Enables horizontal scaling
* Essential for production LLM infrastructure