```{contents}
```
## Exception & Retry Policy 

In LangGraph, **exception handling and retry policies** are fundamental production features that ensure **fault tolerance, reliability, and robustness** of long-running LLM workflows.
They allow graphs to **survive transient failures**, **recover safely**, and **continue execution without corrupting state**.

---

### **1. Why Retry Policies Are Required**

LLM systems operate in unreliable environments:

| Failure Source | Examples                          |
| -------------- | --------------------------------- |
| LLM APIs       | rate limits, timeouts, 5xx errors |
| External tools | network failures, API outages     |
| Data pipelines | malformed inputs                  |
| Infrastructure | node crashes, resource exhaustion |

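Without retries, a workflow fails on the first error, even when the error is transient.
With a retry policy, transient failures such as rate limits and timeouts are absorbed at the node level. Permanent failures such as malformed inputs should not be retried; they need validation or a fallback path.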

---

### **2. Exception Model in LangGraph**

LangGraph exceptions occur at **node execution time**.

Each node execution may result in:

* **Success** → State update committed
* **Exception** → No state update is committed; the node is retried if a retry policy matches the exception, otherwise the error propagates to the caller

Exceptions **do not corrupt global state** because state updates are committed **only after successful node execution**.
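The commit-on-success rule can be illustrated without LangGraph. The sketch below is a simplified model, not LangGraph's implementation: `run_node`, `ok_node`, and `failing_node` are hypothetical names, and the state is a plain dictionary.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def run_node(state: State, node: Callable[[State], State]) -> State:
    """Run `node` on a copy of `state` and merge its update only if it returns."""
    update = node(dict(state))   # the node sees a copy, never the committed state
    return {**state, **update}   # reached only when the node did not raise


def ok_node(state: State) -> State:
    return {"result": "done"}


def failing_node(state: State) -> State:
    state["result"] = "half-written"  # mutates the private copy only
    raise TimeoutError("transient failure")


committed: State = {"input": "hello"}
committed = run_node(committed, ok_node)

try:
    committed = run_node(committed, failing_node)
except TimeoutError:
    pass  # the assignment never happened, so `committed` is unchanged

print(committed)  # {'input': 'hello', 'result': 'done'}
```

Because the failed attempt never reaches the merge step, a retry starts from the same committed state as the first attempt.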

---

### **3. Retry Policy Architecture**

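LangGraph attaches retry behavior to individual nodes through a **`RetryPolicy`** object (importable from `langgraph.types` in recent releases), passed when the node is added to the graph. `RunnableConfig` does not carry retry settings.

Core parameters:

| Parameter          | Description                                                        |
| ------------------ | ------------------------------------------------------------------ |
| `max_attempts`     | Maximum number of attempts, including the first call               |
| `retry_on`         | Exception class, sequence of classes, or predicate that triggers a retry |
| `initial_interval` | Delay in seconds before the first retry                            |
| `backoff_factor`   | Multiplier applied to the delay after each retry                   |
| `max_interval`     | Upper bound on the delay between retries                           |
| `jitter`           | Whether to add random jitter to each delay                         |

Timeouts and fallback routing are not part of the retry policy. Fallbacks are built from ordinary nodes and conditional edges.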

---

### **4. Basic Retry Example**

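```python
from langgraph.graph import StateGraph, END
from langgraph.types import RetryPolicy

# The retry policy is attached to the node, not passed through RunnableConfig.
# The keyword is `retry_policy` in recent releases; older releases use `retry`.
builder.add_node("call_model", call_model, retry_policy=RetryPolicy(max_attempts=3))
graph = builder.compile()

result = graph.invoke({"input": "hello"})
```

`max_attempts=3` allows **up to 3 attempts in total** for this node: the first call plus two retries. Nodes added without a policy are not retried. Which exceptions count as retryable is controlled by `retry_on`; check the default for your installed version.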

---

### **5. Selective Exception Retry**

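```python
import random

def flaky_node(state):
    # Fails about 70% of the time to simulate a transient error.
    if random.random() < 0.7:
        raise ValueError("Temporary failure")
    return {"result": "success"}
```

```python
from langgraph.types import RetryPolicy

builder.add_node(
    "flaky",
    flaky_node,
    retry_policy=RetryPolicy(max_attempts=5, retry_on=ValueError),
)
```

Only `ValueError` triggers a retry; any other exception propagates immediately. Listing the exception explicitly matters here, because the default `retry_on` predicate treats common programming errors such as `ValueError` as non-retryable.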
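The matching logic behind selective retry is small enough to sketch in plain Python. This is an illustration of the idea, not LangGraph's code; `call_with_retry`, `flaky`, and `broken` are hypothetical names.

```python
from typing import Callable, Dict, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int,
    retry_on: Tuple[Type[BaseException], ...],
) -> T:
    """Call `fn`, retrying only when it raises one of the `retry_on` types."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
    raise AssertionError("unreachable")


calls: Dict[str, int] = {"flaky": 0, "broken": 0}


def flaky() -> str:
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ValueError("temporary failure")
    return "success"


def broken() -> str:
    calls["broken"] += 1
    raise KeyError("missing field")  # not in retry_on, so never retried


outcome = call_with_retry(flaky, max_attempts=5, retry_on=(ValueError,))
print(outcome, calls["flaky"])  # success 3

try:
    call_with_retry(broken, max_attempts=5, retry_on=(ValueError,))
except KeyError:
    print("broken was called", calls["broken"], "time(s)")
```

The non-matching `KeyError` escapes on the first call, which is the behavior to expect from a `retry_on` filter: errors outside the list are treated as permanent.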

---

### **6. Backoff Strategy**

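```python
from langgraph.types import RetryPolicy

retry_policy = RetryPolicy(
    max_attempts=6,        # first call + 5 retries
    initial_interval=1.0,  # seconds before the first retry
    backoff_factor=2.0,    # double the delay after each retry
    jitter=False,          # deterministic delays, for illustration only
)
```

Retry schedule (delay before each of the five retries):

```
1s → 2s → 4s → 8s → 16s
```

Growing delays give a rate-limited or overloaded service time to recover instead of hitting it with immediate retries. In production, leave jitter enabled so that many clients do not retry in lockstep.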
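The schedule above follows from a simple formula: the delay before retry `n` (counting from zero) is `initial * factor ** n`, capped at a maximum. A minimal sketch, with `backoff_delays` and its parameter names chosen for illustration:

```python
import random
from typing import List


def backoff_delays(
    retries: int,
    initial: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
    jitter: bool = False,
) -> List[float]:
    """Return the delay in seconds before each retry."""
    delays = []
    for n in range(retries):
        delay = min(max_delay, initial * factor ** n)
        if jitter:
            delay += random.uniform(0, 1)  # spread out clients retrying together
        delays.append(delay)
    return delays


schedule = backoff_delays(5)
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0]

capped = backoff_delays(5, max_delay=5.0)
print(capped)    # [1.0, 2.0, 4.0, 5.0, 5.0]
```

The cap matters for long retry chains: without it, the tenth retry in this schedule would wait more than eight minutes.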

---

### **7. Failure Routing & Fallback Nodes**

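LangGraph has no dedicated failure edge: when a node's retries are exhausted, the exception propagates out of the run. **Fallback routing** is built from ordinary pieces. The node catches its final error and records it in state (here a `failed` flag, an illustrative name), and a conditional edge routes on that flag.

```python
def route(state):
    # safe_node sets `failed` when it catches an error it cannot recover from.
    return "fallback" if state.get("failed") else END

builder.add_node("safe_node", safe_node)
builder.add_node("fallback", fallback_node)
builder.add_conditional_edges("safe_node", route, {"fallback": "fallback", END: END})
builder.add_edge("fallback", END)
```

If `safe_node` reports failure → `fallback`; otherwise the run ends normally.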
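The retry-then-fallback control flow can be sketched independently of the graph API. The helper below is a hypothetical illustration (`run_with_fallback`, `primary`, and `fallback` are not LangGraph names): it tries the primary step a fixed number of times, then hands the last error to the fallback.

```python
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]


def run_with_fallback(
    primary: Callable[[State], State],
    fallback: Callable[[State], State],
    state: State,
    max_attempts: int = 3,
) -> State:
    """Try `primary` up to `max_attempts` times, then route to `fallback`."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return primary(state)
        except Exception as exc:  # a real system would narrow this to retryable types
            last_error = exc
    # Attempts exhausted: pass the failure reason along to the fallback.
    return fallback({**state, "error": str(last_error)})


attempts = []


def primary(state: State) -> State:
    attempts.append(1)
    raise ConnectionError("service unavailable")


def fallback(state: State) -> State:
    return {"result": "Recovered via fallback", "error": state["error"]}


final = run_with_fallback(primary, fallback, {"input": "hello"})
print(len(attempts), final)
```

Passing the error text into the fallback keeps the failure visible in the final state, which helps when the degraded result is later inspected.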

---

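### **8. Checkpointing + Retry = Safe Replay of State**

When a graph is compiled with a checkpointer, LangGraph persists state at step boundaries:

```
Checkpoint → Node Execution → Commit → Next Node
```

If a node fails and is retried, it runs again against the last committed state, because the failed attempt's updates were never written.
This provides:

* No partial state updates from failed attempts
* Resumption from the last checkpoint after a crash
* Safe replay of state transitions

These guarantees cover graph state only. External side effects such as API calls, emails, or database writes are not rolled back, so a retried node may repeat them. Execution is therefore at-least-once, and side-effecting nodes should be idempotent.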
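Safe replay of a node that touches an external system usually depends on that side effect being idempotent. One common approach, sketched here under assumptions, is an idempotency key derived from the node name and its input; `PaymentLedger` and `idempotency_key` are hypothetical names standing in for a real service and helper.

```python
import hashlib
import json
from typing import Any, Dict


class PaymentLedger:
    """Hypothetical external service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.receipts: Dict[str, str] = {}
        self.charges_made = 0

    def charge(self, key: str, amount: int) -> str:
        if key in self.receipts:
            return self.receipts[key]  # replayed request: return the stored result
        self.charges_made += 1
        self.receipts[key] = f"charged {amount}"
        return self.receipts[key]


def idempotency_key(node_name: str, state: Dict[str, Any]) -> str:
    """Derive a stable key from the node name and its input state."""
    payload = json.dumps({"node": node_name, "state": state}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ledger = PaymentLedger()
state = {"order_id": 42, "amount": 100}
key = idempotency_key("charge_customer", state)

first = ledger.charge(key, state["amount"])
second = ledger.charge(key, state["amount"])  # simulated retry of the same node

print(first == second, ledger.charges_made)  # True 1
```

Because the key depends only on the node and its input, a retry or a replay from a checkpoint maps to the same key and the charge is applied once.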

---

### **9. Production Retry Patterns**

| Pattern             | Use Case             |
| ------------------- | -------------------- |
| Transient Retry     | API timeouts         |
| Exponential Backoff | Rate limits          |
| Circuit Breaker     | Repeated failures    |
| Fallback Routing    | Graceful degradation |
| Human Escalation    | Critical failure     |
| Dead-Letter Path    | Permanent failures   |
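Of the patterns in the table, the circuit breaker is the least obvious to implement. The sketch below is a minimal single-threaded version under stated assumptions: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, not LangGraph classes, and the clock is injectable so the example runs without sleeping.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit
        return result


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])


def failing() -> str:
    raise ConnectionError("down")


outcomes: List[str] = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # cool-down has elapsed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)  # ['failed', 'failed', 'skipped', 'ok']
```

Inside a node, a breaker like this would wrap the external call, and the node could treat `CircuitOpenError` as a signal to route to a fallback rather than retry.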

---

### **10. Advanced Example — Resilient Node**

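```python
def resilient_tool(state):
    try:
        # call_external_api() is assumed to return a state update (a dict).
        return call_external_api()
    except TimeoutError:
        # Transient: re-raise unchanged so the retry policy can match it.
        raise
    except Exception as e:
        # Anything else is treated as permanent; keep the cause for debugging.
        log_error(e)
        raise RuntimeError("Non-recoverable") from e
```

```python
from langgraph.types import RetryPolicy

builder.add_node(
    "resilient_tool",
    resilient_tool,
    # First call + 4 retries; only TimeoutError is retried.
    retry_policy=RetryPolicy(max_attempts=5, retry_on=TimeoutError),
)
```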

---

### **11. Operational Guarantees**

| Guarantee         | Provided By       |
| ----------------- | ----------------- |
| No partial state  | Checkpointing     |
| Safe recovery     | State restoration |
| No infinite loops | Retry limits      |
| High availability | Fallback routing  |
| Observability     | Failure traces    |

---

### **12. Conceptual Summary**

LangGraph’s retry & exception model provides:

> **Transactional execution + fault tolerance + self-healing workflows**

This moves LLM workflows from fragile demos toward **production-grade systems**, provided that side-effecting nodes are idempotent and failure paths are designed explicitly.



### Demonstration

```python
import itertools
import random
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langgraph.types import RetryPolicy

MAX_ATTEMPTS = 4  # first call + 3 retries

# -----------------------------
# 1. State
# -----------------------------
class State(TypedDict):
    attempts: int
    success: bool
    result: str

# -----------------------------
# 2. Flaky Node (Retryable but Safe)
# -----------------------------
# Updates from a failed attempt are discarded, so attempts are counted
# outside the graph state (one counter per process, for demonstration only).
attempt_counter = itertools.count(1)

def flaky_node(state: State):
    attempt = next(attempt_counter)
    print(f"Attempt {attempt}")

    try:
        if random.random() < 0.8:
            raise TimeoutError("Transient API failure")

        return {"attempts": attempt, "success": True, "result": "Success after retry"}

    except TimeoutError:
        if attempt >= MAX_ATTEMPTS:
            # Last attempt: report failure in state so the router can reach the fallback.
            return {"attempts": attempt, "success": False}
        raise

# -----------------------------
# 3. Router
# -----------------------------
def route(state: State):
    if state.get("success"):
        return END
    return "fallback"

# -----------------------------
# 4. Fallback Node
# -----------------------------
def fallback_node(state: State):
    print("Entering fallback...")
    return {"result": "Recovered via fallback"}

# -----------------------------
# 5. Graph with Retry Policy
# -----------------------------
builder = StateGraph(State)

# The retry policy is attached to the node. The keyword is `retry_policy`
# in recent LangGraph releases; older releases call it `retry`.
builder.add_node(
    "flaky",
    flaky_node,
    retry_policy=RetryPolicy(
        max_attempts=MAX_ATTEMPTS,
        retry_on=TimeoutError,
        backoff_factor=1.5,
    ),
)
builder.add_node("fallback", fallback_node)

builder.set_entry_point("flaky")
builder.add_conditional_edges("flaky", route, {
    "fallback": "fallback",
    END: END
})

builder.add_edge("fallback", END)

graph = builder.compile()

# -----------------------------
# 6. Execute
# -----------------------------
result = graph.invoke({"attempts": 0, "success": False, "result": ""})

print("\nFinal State:", result)
```

Retry settings passed as `RunnableConfig` keys (`max_retries`, `retry_on`, `retry_backoff`) are not recognized as retry settings and have no effect. A run configured that way stops at the first failure:

```
Attempt 1

TimeoutError: Transient API failure
```

With the node-level `RetryPolicy` above, the node is attempted up to `MAX_ATTEMPTS` times before the router sends the run to `fallback`.