```{contents}
```
## **Error Handling in LangGraph**

Error handling in **LangGraph** is the systematic design of **fault-tolerant, recoverable, and safe execution** for LLM workflows.
Unlike traditional pipelines, LangGraph error handling operates at the **graph, node, state, and execution** levels, enabling production-grade reliability.

---

### **1. Why Error Handling Is Critical in LangGraph**

LLM systems fail frequently due to:

* Model hallucinations
* Tool timeouts
* API failures
* Invalid state transitions
* Partial outputs
* Human interruptions

The patterns in this guide treat failure as a **first-class control flow event**: a node catches the exception and records it in state, and the graph routes on that state.

---

### **2. Error Taxonomy in LangGraph**

| Layer     | Error Type         | Example            |
| --------- | ------------------ | ------------------ |
| Node      | Runtime error      | Tool failure       |
| State     | Invalid state      | Missing field      |
| Control   | Infinite loop      | No termination     |
| Execution | Timeout            | Slow API           |
| External  | Dependency failure | Vector DB down     |
| Human     | Wrong input        | Approval rejection |

---

### **3. Error Handling Mechanisms**

#### **A. Node-Level Try/Catch**

```python
def tool_node(state):
    try:
        result = external_api_call(state["query"])
        return {"result": result}
    except Exception as e:
        return {"error": str(e)}
```

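With this pattern the node does not raise and halt the graph; it catches the exception and **returns error state**. An exception that a node does not catch still propagates out of `invoke`.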
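The same try/except can be factored into a decorator so every node reports failures the same way. This is a minimal sketch in plain Python, not a LangGraph API: `catch_errors`, `flaky_node`, and the `error_type` key are hypothetical names chosen for illustration.

```python
import functools
from typing import Any, Callable, Dict

StateDict = Dict[str, Any]


def catch_errors(node: Callable[[StateDict], StateDict]) -> Callable[[StateDict], StateDict]:
    """Wrap a node so any exception is returned as state instead of raised."""
    @functools.wraps(node)
    def wrapper(state: StateDict) -> StateDict:
        try:
            return node(state)
        except Exception as exc:
            # Keep the exception class name so a router can tell failures apart.
            return {"error": str(exc), "error_type": type(exc).__name__}
    return wrapper


@catch_errors
def flaky_node(state: StateDict) -> StateDict:
    # Hypothetical node: fails when the query is missing or empty.
    if not state.get("query"):
        raise ValueError("empty query")
    return {"result": state["query"].upper()}
```

If `error_type` is used in a real graph, it would also need to be declared in the state schema.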

---

#### **B. Error Routing with Conditional Edges**

```python
def route(state):
    # Check the value, not just the key: the key may exist with an empty string.
    if state.get("error"):
        return "error_handler"
    return "next_step"

builder.add_conditional_edges("tool", route, {
    "error_handler": "handle_error",
    "next_step": "continue"
})
```

This converts exceptions into **explicit graph transitions**.

---

### **4. Dedicated Error Nodes**

```python
def handle_error(state):
    log_error(state["error"])
    return {"status": "recovered"}
```

Error nodes can:

* Retry
* Ask human
* Roll back state
* Switch models
* Abort execution
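One way to pick among these actions is to branch on the kind of failure and on how many attempts have been made. The sketch below is plain Python under stated assumptions: the `error_type` and `attempts` keys, the `MAX_ATTEMPTS` budget, and the returned step names are all hypothetical, and the mapping from error type to action is only an example policy.

```python
from typing import Any, Dict

MAX_ATTEMPTS = 3  # assumed retry budget


def choose_recovery(state: Dict[str, Any]) -> str:
    """Return the name of the next step for a failed run."""
    error_type = state.get("error_type", "")
    attempts = state.get("attempts", 0)
    if error_type in {"TimeoutError", "ConnectionError"} and attempts < MAX_ATTEMPTS:
        return "retry"          # transient failure: try the same call again
    if error_type == "ValueError":
        return "ask_human"      # bad input: a person has to correct it
    if attempts < MAX_ATTEMPTS:
        return "switch_model"   # unknown failure: try a fallback model
    return "abort"              # budget exhausted
```

A function like this can serve as the routing function of a conditional edge leaving the error node, with each returned name mapped to a node.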

---

### **5. Retry Policies**

```python
def retry_node(state):
    if state.get("attempts", 0) >= 3:
        return {"failed": True}
    return {"attempts": state.get("attempts", 0) + 1}
```

Use it in a loop with the task node:

```
Task → Retry → Task → Retry → ...
```

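Add a recursion limit so a retry loop cannot run unbounded; LangGraph raises `GraphRecursionError` when the limit is reached: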

```python
graph.invoke(input, config={"recursion_limit": 10})
```
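Retrying immediately can hammer a service that is already struggling, so retries are often spaced out with exponential backoff. The following is a sketch in plain Python rather than a LangGraph feature; `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time
from typing import Any, Callable, Dict


def call_with_backoff(
    fn: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fn up to max_attempts times, doubling the wait after each failure."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"result": fn(), "attempts": attempt}
        except Exception as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                # Waits base_delay, then 2x, then 4x, and so on.
                sleep(base_delay * 2 ** (attempt - 1))
    return {"failed": True, "attempts": max_attempts, "error": last_error}
```

A node can call this helper around its external request and return the resulting dictionary as its state update.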

---

### **6. Timeout & Circuit Breaking**

Timeout control via node logic:

```python
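import time

def safe_call(state):
    start = time.monotonic()
    result = api()  # placeholder for the external call
    # This check runs only after api() returns, so it flags slow calls
    # rather than interrupting them.
    if time.monotonic() - start > 5:
        return {"error": "timeout"}
    return {"result": result}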
```
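To stop waiting once a deadline passes, rather than measuring elapsed time after the call returns, one option is to run the call in a worker thread and bound the wait. This is a sketch using only the standard library; `call_with_timeout` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable, Dict


def call_with_timeout(fn: Callable[[], Any], timeout_s: float = 5.0) -> Dict[str, Any]:
    """Wait at most timeout_s for fn; report a timeout as error state."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return {"result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        return {"error": "timeout"}
    except Exception as exc:
        return {"error": str(exc)}
    finally:
        # Do not block on the worker. A thread cannot be killed, so a
        # timed-out call keeps running in the background until it returns.
        pool.shutdown(wait=False)
```

Because the abandoned call still runs to completion, this bounds how long the node waits, not how long the external work takes. Where the client library accepts its own timeout argument, that is usually the cleaner choice.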

Circuit breaker pattern:

| State             | Action             |
| ----------------- | ------------------ |
| Repeated failures | Disable path       |
| Cooldown          | Wait               |
| Recovery          | Resume normal flow |
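The three rows of the table can be sketched as a small state holder. This is an illustrative plain-Python class, not part of LangGraph; `CircuitBreaker`, its method names, and the default thresholds are assumptions, and the `clock` parameter is there so time can be controlled in tests.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed, then open after repeated failures, then a trial call after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if the protected path may be tried now."""
        if self.opened_at is None:
            return True  # closed: normal flow
        # Open: blocked until the cooldown has elapsed, then a trial call is allowed.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # disable (or re-disable) the path

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # recovery: resume normal flow
```

A routing function could consult `allow()` before sending the run to the protected node and return the name of a fallback path when it is `False`.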

---

### **7. Checkpointing & Recovery**

```python
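import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver

# SqliteSaver wraps an open sqlite3 connection (langgraph-checkpoint-sqlite package).
conn = sqlite3.connect("checkpoints.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)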
graph = builder.compile(checkpointer=checkpointer)
```

When each run is invoked with a `thread_id` in its config, checkpointing allows you to:

* Resume after crash
* Debug previous states
* Rollback execution

---

### **8. Human-in-the-Loop Error Recovery**

```python
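from langgraph.types import interrupt

def human_review(state):
    # The value supplied on resume is returned here and applied as the state update.
    return interrupt(state)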
```

Execution pauses at `interrupt`, which requires a checkpointer, until the run is resumed. The human can:

* Edit state
* Approve continuation
* Abort execution

---

### **9. Production Error Handling Architecture**

| Layer     | Strategy            |
| --------- | ------------------- |
| Node      | Try/catch           |
| Graph     | Conditional routing |
| State     | Validation & schema |
| Execution | Checkpointing       |
| Ops       | Logging & tracing   |
| Human     | Approval & override |
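For the State row, validation can run as its own step before the expensive nodes. The sketch below is plain Python with an assumed schema: `REQUIRED_FIELDS`, `validate_state`, and `validation_node` are hypothetical names, and a schema library could replace the manual checks.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = {"query": str}  # assumed schema, for illustration only


def validate_state(state: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the state is valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in state:
            problems.append(f"missing field: {name}")
        elif not isinstance(state[name], expected):
            actual = type(state[name]).__name__
            problems.append(f"{name} should be {expected.__name__}, got {actual}")
    return problems


def validation_node(state: Dict[str, Any]) -> Dict[str, Any]:
    problems = validate_state(state)
    # Report problems through the same error field the other nodes use.
    return {"error": "; ".join(problems)} if problems else {}
```

Reporting through the shared `error` field means the existing error routing handles invalid state without a separate path.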

---

### **10. Minimal Example: Fault-Tolerant Graph**

```python
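from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    query: str
    result: str
    error: str
    status: str  # written by handle_error

# tool_node, route, and handle_error are the functions from sections 3 and 4.
builder = StateGraph(State)
builder.add_node("call_api", tool_node)
builder.add_node("handle_error", handle_error)

builder.set_entry_point("call_api")
# The mapping keys must match the strings returned by route().
builder.add_conditional_edges("call_api", route, {
    "error_handler": "handle_error",
    "next_step": END,
})
builder.add_edge("handle_error", END)

graph = builder.compile()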
```

---

### **11. Design Principles**

1. **No silent failures**
2. **Every error becomes state**
3. **Failures are routable**
4. **Recovery is explicit**
5. **Humans can always intervene**

---

### **12. Mental Model**

LangGraph error handling = **workflow control system**

> Failure is not an exception; it is a transition.

This makes LangGraph suitable for **enterprise-grade autonomous systems**.



### Demonstration

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import InMemorySaver
import random

class State(TypedDict):
    attempts: int
    result: str
    error: str

def unreliable_api(state: State):
    if random.random() < 0.5:
        return {"error": "API Failure", "result": "", "attempts": state["attempts"]}
    return {"result": "Success!", "error": "", "attempts": state["attempts"]}

def retry(state: State):
    return {"attempts": state["attempts"] + 1}

def router(state: State):
    if state["error"]:
        if state["attempts"] >= 3:
            return "abort"
        return "retry"
    return "finish"

def abort(state: State):
    return {"result": "Execution aborted after retries"}

builder = StateGraph(State)

builder.add_node("call_api", unreliable_api)
builder.add_node("retry", retry)
builder.add_node("abort", abort)

builder.set_entry_point("call_api")
builder.add_edge("retry", "call_api")

builder.add_conditional_edges("call_api", router, {
    "retry": "retry",
    "abort": "abort",
    "finish": END
})

builder.add_edge("abort", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)

output = graph.invoke(
    {"attempts": 0, "result": "", "error": ""},
    config={
        "recursion_limit": 10,
        "configurable": {"thread_id": "demo_run_1"}
    }
)

print(output)
```

Sample output from one run (the result varies because the failure is random):

```
{'attempts': 0, 'result': 'Success!', 'error': ''}
```
