```{contents}
```
## Exception Backoff 

**Exception backoff** is a reliability mechanism that controls how LangGraph reacts to runtime failures by **pausing, retrying, and progressively delaying execution** after errors occur.
It is essential for building **robust, production-grade LLM systems** that interact with unreliable external services (LLMs, APIs, tools, databases).

---

### **1. Motivation**

LLM systems frequently encounter transient failures:

| Failure Type    | Example              |
| --------------- | -------------------- |
| Network error   | API timeout          |
| Model error     | Rate limit exceeded  |
| Tool failure    | Database unavailable |
| Invalid output  | Malformed JSON       |
| System overload | Memory exhaustion    |

Naive retries cause:

* cascading failures
* thundering herd problems
* wasted tokens and cost

**Exception backoff** solves this by spacing retries intelligently.

---

### **2. Core Idea**

> After a failure, wait for a delay before retrying.
> After each subsequent failure, increase the delay.

This creates a **controlled recovery behavior**.

---

### **3. Backoff Strategies**

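| Strategy             | Formula             | When Used                |
| -------------------- | ------------------- | ------------------------ |
| Fixed                | `delay = k`         | Simple systems           |
| Linear               | `delay = k × n`     | Gradual pressure release |
| Exponential          | `delay = base × 2ⁿ` | Production standard      |
| Exponential + Jitter | `random(base × 2ⁿ)` | Distributed systems      |

Here `k` and `base` are constant delays in seconds and `n` is the retry index. `random(...)` stands for a randomized delay derived from the exponential value, for example a uniform draw up to that value or a small random addition to it.

**For production LangGraph systems, exponential backoff with jitter is generally the preferred choice**, because the randomness keeps many clients from retrying at the same instant.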
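
The four formulas in the table can be sketched as small Python functions. The function names are illustrative, and reading `random(base × 2ⁿ)` as "full jitter" (a uniform draw between 0 and the exponential delay) is an assumption; other jitter schemes exist.

```python
import random
from typing import Optional

def fixed_delay(k: float, n: int) -> float:
    """Fixed: the same delay k for every retry."""
    return k

def linear_delay(k: float, n: int) -> float:
    """Linear: delay = k * n (n = 1 gives k, n = 2 gives 2k, ...)."""
    return k * n

def exponential_delay(base: float, n: int) -> float:
    """Exponential: delay = base * 2**n (n = 0 gives base)."""
    return base * (2 ** n)

def exponential_jitter_delay(base: float, n: int,
                             rng: Optional[random.Random] = None) -> float:
    """Assumed 'full jitter': uniform draw in [0, base * 2**n]."""
    source = rng if rng is not None else random
    return source.uniform(0, exponential_delay(base, n))

# Exponential schedule for the first four retries with base = 1 second
print([exponential_delay(1.0, n) for n in range(4)])  # [1.0, 2.0, 4.0, 8.0]
```

Note that the linear formula yields a zero delay for `n = 0`, so callers of `linear_delay` would typically count retries from 1.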

---

### **4. Where Backoff Fits in LangGraph**

LangGraph integrates backoff at the **node execution layer**.

```
Node → Execute → Exception?
            ↘ yes
         Backoff → Retry → Execute → ...
```

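Backoff policies are applied via **retry wrappers**: a function that wraps the node callable and retries it after a delay when it raises. LangGraph also ships a built-in `RetryPolicy` that can be attached to a node; its import path and parameter names depend on the installed version, so check the documentation for your release.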

---

### **5. Practical Implementation**

#### **Node with Backoff**

```python
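import functools
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Wrap a node function so failures are retried with exponential backoff."""
    @functools.wraps(fn)
    def wrapper(state):
        for attempt in range(max_retries):
            try:
                return fn(state)
            except Exception:
                # Last attempt: give up and let the exception propagate
                if attempt == max_retries - 1:
                    raise
                # Exponential delay: base_delay, then 2x, 4x, ...
                delay = base_delay * (2 ** attempt)
                # Add up to 10% random jitter so concurrent retries spread out
                delay += random.uniform(0, delay * 0.1)
                time.sleep(delay)
    return wrapper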
```

#### **Attach to LangGraph Node**

```python
builder.add_node("call_api", with_backoff(call_api_node))
```
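
Retrying on every `Exception` also retries programming errors that will never succeed. A minimal variant, sketched below, retries only selected exception types and accepts an injectable `sleep` function so it can be exercised without waiting. The names `with_selective_backoff`, `retry_on`, and `sleep` are hypothetical and not part of LangGraph.

```python
import random
import time

def with_selective_backoff(fn, retry_on=(ConnectionError, TimeoutError),
                           max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry fn with exponential backoff, but only for exceptions in retry_on."""
    def wrapper(state):
        for attempt in range(max_retries):
            try:
                return fn(state)
            except retry_on:
                # Other exception types are not caught and propagate immediately
                if attempt == max_retries - 1:
                    raise
                delay = base_delay * (2 ** attempt)
                delay += random.uniform(0, delay * 0.1)  # up to 10% jitter
                sleep(delay)
    return wrapper

# Usage: a node that fails twice with a retryable error, then succeeds
calls = {"n": 0}

def flaky_node(state):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary outage")
    return {"result": "ok"}

waits = []  # record the delays instead of sleeping
node = with_selective_backoff(flaky_node, sleep=waits.append)
print(node({}), len(waits))  # {'result': 'ok'} 2
```

Passing `sleep=waits.append` is a testing convenience; in a real graph the default `time.sleep` would be used.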

---

### **6. Example with External API Failure**

```python
import random

def unstable_api(state):
    # Simulate a flaky dependency: fail roughly 70% of the time
    if random.random() < 0.7:
        raise RuntimeError("Transient failure")
    return {"result": "success"}
```

```python
builder.add_node("api", with_backoff(unstable_api))
```
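
With a 70% failure rate and the wrapper defaults (`max_retries=5`, `base_delay=1.0`), it is worth checking how often retries still run out and how long the worst case waits. The helper names below are illustrative, and the calculation assumes independent failures and ignores jitter.

```python
def exhaustion_probability(p_fail: float, max_retries: int) -> float:
    """Probability that every one of max_retries attempts fails."""
    return p_fail ** max_retries

def worst_case_wait(base_delay: float, max_retries: int) -> float:
    """Total sleep time if all attempts fail (no sleep after the last one)."""
    return sum(base_delay * (2 ** attempt) for attempt in range(max_retries - 1))

print(round(exhaustion_probability(0.7, 5), 4))  # 0.1681
print(worst_case_wait(1.0, 5))                   # 15.0
```

Under these assumptions, roughly one run in six still exhausts all five attempts, after about 15 seconds of waiting, which is why a fallback path is still needed.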

---

### **7. Integration with Conditional Routing**

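When retries are exhausted, catch the final exception and record it in the state so that a conditional edge can route to a **fallback node**:

```python
# Retrying version of the node from the previous section
retrying_api = with_backoff(unstable_api)

def safe_api(state):
    try:
        return retrying_api(state)
    except Exception:
        # All retries failed: return an error marker instead of raising
        return {"error": "API failure"}
```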

```python
# Assumes "api" was registered with safe_api and the state schema declares an "error" key
builder.add_conditional_edges(
    "api",
    lambda s: "fallback" if s.get("error") else "next",
    {"fallback": "fallback_node", "next": "process_node"},
)
```
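
The routing decision can be checked without building a graph. The sketch below is plain Python: `make_safe` and `route_after_api` are hypothetical named versions of the pattern above, and `always_fails` / `always_works` are stand-in nodes used only for illustration.

```python
def make_safe(fn):
    """Wrap a node so an exception becomes an 'error' entry in the update."""
    def safe(state):
        try:
            return fn(state)
        except Exception:
            return {"error": "API failure"}
    return safe

def route_after_api(state):
    """Return the edge label: 'fallback' if an error was recorded, else 'next'."""
    return "fallback" if state.get("error") else "next"

def always_fails(state):
    raise RuntimeError("Transient failure")

def always_works(state):
    return {"result": "success"}

for node in (always_fails, always_works):
    state = {}
    state.update(make_safe(node)(state))  # mimic merging the node's update
    print(node.__name__, "->", route_after_api(state))
# always_fails -> fallback
# always_works -> next
```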

---

### **8. Backoff + Checkpointing**

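LangGraph checkpointing complements backoff. State is saved between node executions, so a run that crashes can **resume from the last checkpoint** instead of starting over. Retries that happen inside a wrapped node are not checkpointed individually, so the node should be safe to re-run. The three mechanisms play complementary roles: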

| Feature    | Benefit                |
| ---------- | ---------------------- |
| Checkpoint | Resume after crash     |
| Backoff    | Prevent overload       |
| Retry      | Self-healing execution |

---

### **9. Production Parameters**

| Parameter   | Typical Value |
| ----------- | ------------- |
| Max retries | 3–7           |
| Base delay  | 0.5–2 seconds |
| Max delay   | 30–60 seconds |
| Jitter      | 10–20%        |
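
The wrapper shown earlier has no upper bound on the delay. A minimal sketch of how the four parameters in the table might be combined is given below; `BackoffConfig` and its field names are hypothetical, and the defaults are simply picked from the typical ranges above.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class BackoffConfig:
    max_retries: int = 5      # total attempts before giving up
    base_delay: float = 1.0   # seconds before the first retry
    max_delay: float = 30.0   # cap on the exponential delay, in seconds
    jitter: float = 0.1       # extra random fraction (0.1 = up to 10%)

    def capped_delay(self, attempt: int) -> float:
        """Exponential delay for a 0-based attempt, capped at max_delay."""
        return min(self.base_delay * (2 ** attempt), self.max_delay)

    def delay(self, attempt: int, rng=random) -> float:
        """Capped delay plus additive jitter of up to `jitter` times the delay."""
        capped = self.capped_delay(attempt)
        return capped + rng.uniform(0, capped * self.jitter)

cfg = BackoffConfig()
print([cfg.capped_delay(a) for a in range(7)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

The cap matters once the retry count grows: without it, the seventh delay would already be 64 seconds.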

---

### **10. Why Backoff Is Critical in LLM Systems**

| Without Backoff      | With Backoff      |
| -------------------- | ----------------- |
| Request storms       | Stable traffic    |
| High cost            | Controlled cost   |
| Unreliable pipelines | Resilient systems |
| Poor user experience | Graceful recovery |

---

### **11. Mental Model**

Exception backoff makes a LangGraph workflow **resilient to transient failures**: instead of failing on the first error, a node pauses, retries, and continues once the dependency recovers.

> **Failure → Pause → Recover → Continue**


### Demonstration

```python
import random
import time
from typing import TypedDict

from langgraph.graph import StateGraph, END

# ---------------------------
# 1. Define shared state
# ---------------------------
class State(TypedDict):
    attempts: int
    result: str

# ---------------------------
# 2. Backoff wrapper
# ---------------------------
def with_backoff(fn, max_retries=5, base_delay=1.0):
    def wrapped(state: State):
        for attempt in range(max_retries):
            try:
                update = fn(state)
                # Report the attempt count through the returned update;
                # mutating `state` in place is not written back to the graph state.
                return {**update, "attempts": attempt + 1}
            except Exception:
                if attempt == max_retries - 1:
                    # Retries exhausted: signal failure so the router picks the fallback
                    return {"result": "FAILED_AFTER_RETRIES", "attempts": max_retries}
                delay = base_delay * (2 ** attempt)
                delay += random.uniform(0, delay * 0.1)  # jitter
                print(f"Retry {attempt+1} in {delay:.2f}s")
                time.sleep(delay)
        return state  # only reached when max_retries is 0
    return wrapped

# ---------------------------
# 3. Unstable node
# ---------------------------
def flaky_api(state: State):
    # Fail roughly 70% of the time to simulate a transient outage
    if random.random() < 0.7:
        raise RuntimeError("Transient API failure")
    return {"result": "SUCCESS"}

# ---------------------------
# 4. Fallback node
# ---------------------------
def fallback(state: State):
    return {"result": "FALLBACK_USED"}

# ---------------------------
# 5. Build graph
# ---------------------------
builder = StateGraph(State)

builder.add_node("api", with_backoff(flaky_api))
builder.add_node("fallback", fallback)

builder.set_entry_point("api")

builder.add_conditional_edges(
    "api",
    lambda s: END if s["result"] == "SUCCESS" else "fallback",
    {END: END, "fallback": "fallback"}
)

builder.add_edge("fallback", END)

graph = builder.compile()

# ---------------------------
# 6. Run
# ---------------------------
output = graph.invoke({"attempts": 0, "result": ""})
print("\nFinal Output:", output)
```

Sample output from a run in which the first attempt failed and the second succeeded (the number of retries and the delays vary between runs because failures and jitter are random):

```
Retry 1 in 1.03s

Final Output: {'attempts': 2, 'result': 'SUCCESS'}
```
