```{contents}
```
## Retries & Timeout

---

### 1. Motivation & Intuition

Modern Generative AI systems are **distributed, network-based services**.
Failures are inevitable:

| Failure Source | Examples                                    |
| -------------- | ------------------------------------------- |
| Network        | packet loss, latency spikes, DNS failure    |
| Service        | model overload, internal crash, rate limits |
| Client         | transient CPU/memory pressure               |
| External APIs  | timeouts, 5xx errors                        |

Two reliability controls handle these failures:

* **Timeout** → *How long do we wait?*
* **Retry** → *What do we do if it fails?*

Together they determine **latency, throughput, cost, and user experience**.

---

### 2. Timeout: Concept

**Timeout** = maximum allowed time for a request to complete.

$$
T_{\text{timeout}} = \text{deadline for success}
$$

If exceeded → request is **aborted** and marked failed.

#### Why Timeouts Matter

Without timeouts:

* threads block
* queues grow
* system collapses under load

With proper timeouts:

* failures are detected early
* resources are freed
* system remains responsive
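
A minimal sketch of the idea, assuming a blocking call that the caller cannot interrupt directly. `slow_model_call` and `call_with_timeout` are hypothetical names used only for illustration. The caller stops waiting after `timeout_s` seconds and gets control back, although in this simplified version the worker thread itself keeps running until it finishes (a real HTTP client would also close the socket).

```python
import concurrent.futures
import time

def slow_model_call(seconds):
    # Hypothetical stand-in for a blocking LLM request.
    time.sleep(seconds)
    return "done"

def call_with_timeout(fn, timeout_s, *args):
    """Wait for fn(*args) at most timeout_s seconds; return (ok, value)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The deadline passed: report failure instead of blocking the caller.
        return False, None
    finally:
        # Do not wait for the worker; the caller is free to move on.
        pool.shutdown(wait=False)
```

The failure is detected after `timeout_s` seconds rather than whenever the slow call happens to return, which is what keeps queues from growing behind a stuck request.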

#### Timeout Types

| Type                     | Purpose                                             |
| ------------------------ | --------------------------------------------------- |
| Connection timeout       | limit on establishing the connection (handshake)    |
| Read timeout             | limit on waiting for response data (model output)   |
| Total request timeout    | limit on the full request lifecycle                 |
| Idle timeout             | limit on inactivity over an open connection         |
| Per-token timeout (LLMs) | limit on the gap between streamed tokens            |
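
The per-token timeout is the least familiar entry in the table, so here is a rough sketch of one way to enforce it. It assumes the stream is exposed as a plain Python iterator; `iter_with_token_timeout` and `fake_stream` are hypothetical helpers, not part of any provider SDK. A background thread pulls tokens into a queue, and the consumer gives up if no token arrives within the allowed gap.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the stream

def iter_with_token_timeout(token_iter, per_token_timeout_s):
    """Yield tokens; raise TimeoutError if the gap between tokens is too long."""
    q = queue.Queue()

    def pump():
        # Runs in the background and forwards tokens as they arrive.
        for tok in token_iter:
            q.put(tok)
        q.put(_DONE)

    threading.Thread(target=pump, daemon=True).start()
    while True:
        try:
            item = q.get(timeout=per_token_timeout_s)
        except queue.Empty:
            raise TimeoutError("no token within the per-token timeout") from None
        if item is _DONE:
            return
        yield item

def fake_stream(gaps):
    # Hypothetical token source: sleeps `gap` seconds before each token.
    for i, gap in enumerate(gaps):
        time.sleep(gap)
        yield f"tok{i}"
```

For example, `list(iter_with_token_timeout(fake_stream([0.01, 0.01]), 0.5))` consumes the whole stream, while a stream that stalls for longer than the limit raises `TimeoutError` after the tokens received so far. Errors raised inside the source iterator are not forwarded in this sketch.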

---

### 3. Retry: Concept

**Retry** = reattempting a failed request automatically.

Goal: **recover from transient faults**

#### When to Retry

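Usually safe to retry:

* network errors
* timeouts (if repeating the request is acceptable, since the first attempt may still have been processed)
* 429 (rate limited) and 5xx server errors

Never retry:

* invalid input (4xx other than 429, e.g. 400)
* authentication failure (401 / 403)
* deterministic model errors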
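
The two lists above can be condensed into a small classifier. This is only a sketch: `is_retryable` and its parameters are illustrative names, and a real client would usually derive these flags from the exception or response it received.

```python
def is_retryable(status_code=None, network_error=False, timed_out=False):
    """Return True if the failure looks transient and a retry is reasonable."""
    if network_error or timed_out:
        return True
    if status_code is None:
        return False
    # 429 means "slow down"; 5xx means the server failed, not the request.
    return status_code == 429 or 500 <= status_code <= 599
```

With this rule, `is_retryable(status_code=503)` and `is_retryable(status_code=429)` are true, while `is_retryable(status_code=400)` and `is_retryable(status_code=401)` are false.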

---

### 4. Retry Strategies

| Strategy                | Description                          |
| ----------------------- | ------------------------------------ |
| Immediate               | retry instantly                      |
| Fixed delay             | wait constant time                   |
| Linear backoff          | delay increases linearly             |
| **Exponential backoff** | delay doubles each retry             |
| Exponential + jitter    | backoff + randomness (best practice) |
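
The deterministic strategies in the table differ only in how the delay grows with the retry number. The sketch below makes that concrete; `delay_schedule` is a hypothetical helper, and reading "linear" as `base * (n + 1)` is an assumption, since other linear schedules are possible.

```python
def delay_schedule(strategy, base, retries):
    """Return the list of delays (seconds) before each of `retries` retries."""
    if strategy == "immediate":
        return [0.0 for _ in range(retries)]
    if strategy == "fixed":
        return [base for _ in range(retries)]
    if strategy == "linear":
        return [base * (n + 1) for n in range(retries)]
    if strategy == "exponential":
        return [base * 2 ** n for n in range(retries)]
    raise ValueError(f"unknown strategy: {strategy}")
```

With `base=0.5` and four retries, the linear schedule is `[0.5, 1.0, 1.5, 2.0]` and the exponential one is `[0.5, 1.0, 2.0, 4.0]`.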

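**Exponential Backoff with Jitter**

$$
\text{delay} = \text{random}(0,\ \text{base} \times 2^{n})
$$

Here *n* is the retry attempt number (starting at 0) and *base* is the initial delay. The randomness spreads clients out in time, which prevents synchronized retry storms.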
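
A sketch of the jittered delay as a function. The `cap` parameter is an addition not present in the formula: it is a common safeguard so the window does not keep doubling forever, and its default value here is arbitrary. `full_jitter_delay` is an illustrative name.

```python
import random

def full_jitter_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Pick a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    upper = min(cap, base * 2 ** attempt)
    return rng.uniform(0, upper)
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests. Two clients that fail at the same moment will, with high probability, draw different delays and therefore not retry in lockstep.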

---

### 5. Combined Workflow

```text
Client Request
     ↓
Set Timeout Clock
     ↓
Send to LLM API
     ↓
Response?
   /      \
Yes       No
 |         |
Return    Timeout/Error
            |
         Retry Policy?
          /     \
        Yes     No
         |       |
    Wait(backoff) Fail
         |
     Reattempt
```

---

### 6. Practical Code Example (Python)

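```python
import random
import time

import requests

MAX_RETRIES = 5    # total attempts, including the first one
BASE_DELAY = 0.5   # seconds; upper bound of the first backoff window
TIMEOUT = 8        # seconds; requests applies this to connect and to each read,
                   # not to the call as a whole

def call_llm(prompt):
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.post(
                "https://api.llm-provider.com/generate",
                json={"prompt": prompt},
                timeout=TIMEOUT,
            )
            response.raise_for_status()  # raises HTTPError on 4xx / 5xx
            return response.json()

        except (requests.exceptions.Timeout,
                requests.exceptions.ConnectionError,
                requests.exceptions.HTTPError) as e:

            # Fail fast on client errors such as 400 or 401:
            # only 429 and 5xx are treated as transient.
            if isinstance(e, requests.exceptions.HTTPError):
                status = e.response.status_code
                if status != 429 and status < 500:
                    raise

            if attempt == MAX_RETRIES - 1:
                raise RuntimeError("Request failed after retries") from e

            # Exponential backoff with full jitter before the next attempt.
            delay = random.uniform(0, BASE_DELAY * 2 ** attempt)
            time.sleep(delay)
```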

---

### 7. Application in LLM Pipelines

| Layer                | Usage                   |
| -------------------- | ----------------------- |
| Prompt → API         | retries + timeout       |
| Tool calls           | retries + timeout       |
| Vector DB retrieval  | retries                 |
| Streaming generation | token-level timeouts    |
| Orchestration        | global request deadline |
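
The orchestration row is worth a sketch, because a global deadline changes how retries behave: each attempt only gets the time that is left, and backoff never sleeps past the deadline. Everything named here (`call_with_deadline`, `DeadlineExceeded`, the `attempt_fn(timeout_s)` convention, the injectable `clock` and `sleep`) is an assumption made for illustration. Jitter is left out to keep the example deterministic.

```python
import time

class DeadlineExceeded(Exception):
    """Raised when the overall deadline or the attempt budget is used up."""

def call_with_deadline(attempt_fn, deadline_s, max_attempts=3, base_delay=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Run attempt_fn(timeout_s) with retries, all inside one global deadline."""
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        try:
            # The per-attempt timeout is whatever is left of the deadline.
            return attempt_fn(remaining)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
        if attempt < max_attempts - 1:
            remaining = deadline_s - (clock() - start)
            # Back off exponentially, but never sleep past the deadline.
            sleep(max(0.0, min(base_delay * 2 ** attempt, remaining)))
    raise DeadlineExceeded("deadline or attempt budget exhausted") from last_error
```

A layer that calls several downstream services could pass its own remaining time as `deadline_s`, so that no inner retry loop outlives the request that started it.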

---

### 8. Trade-offs & Design Guidelines

| Goal             | Design Choice                    |
| ---------------- | -------------------------------- |
| Low latency      | small timeout, few retries       |
| High reliability | more retries, larger timeout     |
| Cost control     | limited retries                  |
| User experience  | fast failure + graceful fallback |

**Rule of thumb**

> Short timeouts + smart retries beat long blocking calls.
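
To make the trade-off concrete, it helps to compute the worst-case wait a user could see. The sketch below assumes every attempt runs for its full timeout and every jittered sleep lands on its upper bound; `worst_case_latency` is a hypothetical helper. In practice the bound can be looser, because some HTTP clients apply the timeout per connect and per read rather than to the whole call.

```python
def worst_case_latency(max_attempts, timeout_s, base_delay):
    """Upper bound in seconds: all attempts time out, all sleeps are maximal."""
    # There is one backoff sleep between consecutive attempts.
    sleeps = sum(base_delay * 2 ** n for n in range(max_attempts - 1))
    return max_attempts * timeout_s + sleeps
```

With the constants from the Python example in section 6 (5 attempts, 8 s timeout, 0.5 s base delay), `worst_case_latency(5, 8, 0.5)` gives 47.5 seconds: 40 s of waiting on requests plus 0.5 + 1 + 2 + 4 = 7.5 s of backoff. Cutting to 2 attempts with a 3 s timeout brings the bound down to 6.5 s.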

---

### 9. Summary Table

| Component | Role                            |
| --------- | ------------------------------- |
| Timeout   | limits waiting time             |
| Retry     | recovers from transient failure |
| Backoff   | avoids overload                 |
| Jitter    | prevents retry storms           |
| Policy    | balances reliability & latency  |

---

### 10. Key Insight

**Retries and timeouts are not error-handling details — they are core architecture of production-grade Generative AI systems.**
