```{contents}
```
## **Exception Failover in LangGraph**

**Exception Failover** in LangGraph is a **control-flow and reliability pattern** that keeps a graph operating correctly when a node fails, by **detecting errors, switching execution paths, retrying operations, and restoring state**.

It is the foundation for **fault-tolerant, production-grade LLM systems**.

---

### **1. Why Exception Failover Is Required**

LLM systems interact with unreliable components:

| Failure Source | Examples                              |
| -------------- | ------------------------------------- |
| LLM APIs       | Timeouts, rate limits, hallucinations |
| Tools          | Network errors, API failures          |
| Data           | Corrupt input, missing fields         |
| Logic          | Bugs, invalid state                   |
| Humans         | Incorrect interventions               |

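Without failover, an unhandled exception in any node **aborts the run**, and any state that was not checkpointed is lost.
LangGraph provides the building blocks to prevent this: conditional routing, bounded retries, and checkpoint recovery.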

---

### **2. Core Concepts**

| Concept             | Role                    |
| ------------------- | ----------------------- |
| Exception Detection | Identify failure        |
| Fallback Node       | Alternative execution   |
| Retry Policy        | Controlled repetition   |
| Checkpoint Recovery | Restore last safe state |
| Conditional Routing | Decide next step        |
| Circuit Breaker     | Stop repeated failures  |
| Compensating Action | Undo side effects       |
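Of the concepts above, the compensating action is the least obvious, so here is a minimal plain-Python sketch of it. It is an illustration, not a LangGraph API: `run_with_compensation` and the step names are hypothetical. Each step pairs an action with its undo, and a failure triggers the undos of the completed steps in reverse order.

```python
from typing import Callable, List, Tuple

# Each step is a pair: (action, undo). Both names are illustrative.
Step = Tuple[Callable[[], None], Callable[[], None]]

def run_with_compensation(steps: List[Step]) -> bool:
    """Run actions in order; if one fails, undo the completed ones in reverse."""
    completed: List[Callable[[], None]] = []
    for action, undo in steps:
        try:
            action()
            completed.append(undo)
        except Exception:
            for compensate in reversed(completed):
                compensate()
            return False
    return True

log: List[str] = []

def charge_card() -> None:
    # Simulated side-effecting call that fails.
    raise RuntimeError("charge failed")

ok = run_with_compensation([
    (lambda: log.append("reserve"), lambda: log.append("release")),
    (charge_card, lambda: log.append("refund")),
])
print(ok, log)
```

The undo for the failing step is never run, because that step did not complete. Inside a graph, the same idea would typically live in a dedicated recovery node.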

---

### **3. Execution Model With Failover**

```
Node A ──success──▶ Node B ──success──▶ Node C
   │
   └─exception──▶ Error Handler ──▶ Fallback Path ──▶ Recovery ──▶ Continue
```

---

### **4. Implementing Exception Handling in LangGraph**

#### **State Schema**

```python
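from typing import TypedDict

# `str | None` requires Python 3.10+; use Optional[str] on older versions.
class State(TypedDict):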
    data: str
    error: str | None
    retries: int
```

---

#### **Primary Node**

```python
import random

# Simulates an unreliable call that fails about half the time.
def fragile_node(state):
    if random.random() < 0.5:
        raise ValueError("API failure")
    return {"data": "processed", "error": None}
```

---

#### **Failover Node**

```python
def error_handler(state):
    return {
        "error": "Recovered from failure",
        "retries": state["retries"] + 1
    }
```

---

#### **Graph Wiring**

```python
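from langgraph.graph import StateGraph, END

def safe_main(state):
    # An uncaught exception aborts the run, so record it in state instead.
    try:
        return fragile_node(state)
    except ValueError as exc:
        return {"error": str(exc)}

builder = StateGraph(State)
builder.add_node("main", safe_main)
builder.add_node("recover", error_handler)

builder.set_entry_point("main")

# The conditional edge alone decides what follows "main",
# so no separate main -> END edge is needed.
builder.add_conditional_edges(
    "main",
    lambda s: "recover" if s.get("error") else END,
    {"recover": "recover", END: END},
)

# Retry the primary node after recovery.
builder.add_edge("recover", "main")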
```

---

### **5. Retry & Backoff Control**

```python
def should_retry(state):
    if state["retries"] >= 3:
        return END
    return "main"
```

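Used as the routing function on the `recover` node, in place of the unconditional `recover → main` edge, this caps the loop at three retries and yields **bounded retries**.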
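The retry limit above bounds how often the graph retries, but not how quickly. A minimal sketch of exponential backoff with jitter follows. The names `backoff_delay` and `recover_with_backoff` and all default values are assumptions for illustration, not LangGraph APIs. Recent LangGraph releases also offer a per-node retry policy, so check the documentation for the version you use before hand-rolling this.

```python
import random
import time
from typing import Callable

def backoff_delay(attempt: int, base: float = 0.5, factor: float = 2.0,
                  cap: float = 30.0, jitter: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    delay = min(cap, base * (factor ** attempt))
    if jitter:
        # "Full jitter": pick a random delay between 0 and the computed value.
        delay = random.uniform(0, delay)
    return delay

def recover_with_backoff(state: dict,
                         sleep: Callable[[float], None] = time.sleep) -> dict:
    """Recovery node sketch: wait, then bump the retry counter."""
    sleep(backoff_delay(state["retries"]))
    return {"retries": state["retries"] + 1}
```

The `sleep` parameter is injectable so the node can be tested without actually waiting. Jitter spreads retries out so that many failing runs do not hit a recovering service at the same instant.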

---

### **6. Production Failover Patterns**

| Pattern            | Purpose                   |
| ------------------ | ------------------------- |
| Retry with Backoff | Handle transient failures |
| Fallback Model     | Switch LLM provider       |
| Alternate Tool     | Use backup API            |
| Degraded Mode      | Return partial results    |
| Human Escalation   | Manual resolution         |
| Checkpoint Restore | Resume execution          |
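The fallback-model and alternate-tool patterns share one shape: try providers in priority order and return the first success. A minimal sketch under that assumption is shown below. `call_with_fallback`, `flaky_primary`, and `backup` are hypothetical stand-ins, not real provider clients.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, callable taking a prompt)

def call_with_fallback(prompt: str, providers: List[Provider]) -> dict:
    """Try each provider in order; return the first success as a state update."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return {"data": call(prompt), "provider": name, "error": None}
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    # Every provider failed: report all errors so a router can escalate.
    return {"data": None, "provider": None, "error": "; ".join(errors)}

def flaky_primary(prompt: str) -> str:
    # Stand-in for a provider that is currently down.
    raise TimeoutError("primary timed out")

def backup(prompt: str) -> str:
    # Stand-in for a working backup provider.
    return f"backup answer to: {prompt}"

result = call_with_fallback("ping", [("primary", flaky_primary),
                                     ("backup", backup)])
print(result)
```

Returning a dictionary instead of raising keeps the failure visible to conditional routing, which matches how the nodes in section 4 report errors through state.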

---

### **7. Safety Controls**

| Control         | Benefit                |
| --------------- | ---------------------- |
| Retry limit     | Prevent infinite loops |
| Timeouts        | Avoid hung calls       |
| Circuit breaker | Stop cascades          |
| Audit logging   | Forensics              |
| Human override  | Critical workflows     |
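A circuit breaker can be sketched in a few lines of plain Python. The class below is a simplified illustration with assumed names and thresholds, not a LangGraph feature: it opens after a number of consecutive failures, blocks calls during a cool-down, and lets calls through again once the cool-down has elapsed.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, for `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, let calls through again; a further
        # failure re-opens the circuit and restarts the cool-down.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A node would call `allow()` before contacting the external service and route to a fallback or escalation path when it returns `False`.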

---

### **8. Common Enterprise Use Cases**

* LLM provider outage → switch to backup model
* Tool API failure → alternate service
* Invalid result → reflection + retry
* Corrupt state → rollback from checkpoint
* Repeated errors → escalate to human
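The use cases above amount to a routing decision: given what went wrong, which node runs next? The sketch below expresses that as a function that could serve as a conditional-edge router. The `error_kind` field, the route names, and `MAX_RETRIES` are assumptions made for illustration.

```python
from typing import Optional, TypedDict

class FailoverState(TypedDict):
    error_kind: Optional[str]  # hypothetical field set by the failing node
    retries: int

MAX_RETRIES = 3  # assumed limit

# Maps each failure category from the list above to a recovery node name.
ROUTES = {
    "provider_outage": "backup_model",
    "tool_failure": "alternate_tool",
    "invalid_result": "reflect_and_retry",
    "corrupt_state": "restore_checkpoint",
}

def route_after_failure(state: FailoverState) -> str:
    """Pick the next node name based on the recorded failure."""
    kind = state["error_kind"]
    if kind is None:
        return "continue"
    if state["retries"] >= MAX_RETRIES:
        # Repeated errors: stop automated recovery and involve a person.
        return "escalate_to_human"
    # Unknown failure kinds also go to a human instead of looping.
    return ROUTES.get(kind, "escalate_to_human")
```

Keeping the mapping in a dictionary makes the policy easy to audit and to change without touching the graph wiring.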

---

### **9. Mental Model**

LangGraph lets you apply **distributed systems reliability** patterns inside LLM workflows:

> **Detect → Isolate → Recover → Resume**

This transforms fragile LLM pipelines into **self-healing AI systems**.


---

### **10. Demonstration**

```python
# One-cell demonstration: Exception Failover in LangGraph

from typing import TypedDict, Optional
from langgraph.graph import StateGraph, END
import random

# ----------------------------
# 1. Define State
# ----------------------------
class State(TypedDict):
    data: Optional[str]
    error: Optional[str]
    retries: int

# ----------------------------
# 2. Fragile Node (simulates failure)
# ----------------------------
def fragile_node(state: State):
    print(f"Attempt {state['retries'] + 1}")
    if random.random() < 0.6:   # 60% chance of failure
        raise Exception("Simulated API failure")
    return {"data": "SUCCESS", "error": None}

# ----------------------------
# 3. Error Handler / Recovery
# ----------------------------
def recovery_node(state: State):
    print("Recovery engaged")
    return {"error": "Recovered", "retries": state["retries"] + 1}

# ----------------------------
# 4. Safe Wrapper for Exception Capture
# ----------------------------
def safe_wrapper(fn):
    def wrapped(state):
        try:
            return fn(state)
        except Exception as e:
            return {"error": str(e)}
    return wrapped

# ----------------------------
# 5. Retry Logic
# ----------------------------
def router(state: State):
    if state["error"] and state["retries"] < 3:
        return "recover"
    return END

# ----------------------------
# 6. Build Graph
# ----------------------------
builder = StateGraph(State)

builder.add_node("main", safe_wrapper(fragile_node))
builder.add_node("recover", recovery_node)

builder.set_entry_point("main")

builder.add_conditional_edges("main", router, {
    "recover": "recover",
    END: END
})

builder.add_edge("recover", "main")

graph = builder.compile()

# ----------------------------
# 7. Execute
# ----------------------------
result = graph.invoke({"data": None, "error": None, "retries": 0})
print("\nFinal State:", result)
```


Sample output (the failure is random, so the number of attempts varies between runs):

```
Attempt 1
Recovery engaged
Attempt 2
Recovery engaged
Attempt 3

Final State: {'data': 'SUCCESS', 'error': None, 'retries': 2}
```
