```{contents}
```
## Recovery Strategy

A **Recovery Strategy** in LangGraph is the systematic design of mechanisms that allow an LLM workflow to **detect failure, preserve progress, correct behavior, and continue execution safely** without losing correctness, consistency, or user trust.

This is essential for **long-running, multi-agent, production-grade AI systems**.

---

### **1. Why Recovery is Fundamental**

LLM systems fail in many ways:

| Failure Type      | Example         |
| ----------------- | --------------- |
| LLM hallucination | Wrong answer    |
| Tool failure      | API timeout     |
| Node crash        | Exception       |
| Bad state         | Invalid data    |
| Human rejection   | Approval denied |
| System outage     | Pod crash       |

LangGraph handles this using **stateful execution + control flow + persistence**.

---

### **2. Core Recovery Components**

| Component          | Purpose                    |
| ------------------ | -------------------------- |
| Checkpointing      | Preserve progress          |
| State persistence  | Survive crashes            |
| Retries            | Recover transient failures |
| Fallback nodes     | Alternative execution      |
| Human intervention | Correct critical failures  |
| Replay             | Re-execute safely          |
| Compensation       | Undo harmful actions       |

---

### **3. Recovery Workflow Model**

```
Execute Node
   |
Failure Detected?
   ├── No → Continue
   └── Yes
        |
    Save Checkpoint
        |
    Classify Failure
        |
    Apply Recovery Strategy
        |
    Resume Execution
```
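
The flow above can be sketched in plain Python, independent of LangGraph. The names `run_with_recovery` and `classify_failure`, and the failure labels, are hypothetical and chosen only for illustration; checkpoints are kept in a list rather than in a real checkpointer.

```python
import copy


def classify_failure(exc: Exception) -> str:
    """Map an exception to a coarse failure class (illustrative labels)."""
    if isinstance(exc, TimeoutError):
        return "transient"
    if isinstance(exc, ValueError):
        return "bad_state"
    return "fatal"


def run_with_recovery(node, state: dict, max_attempts: int = 3):
    """Run `node` on `state`; checkpoint, classify, and retry on failure."""
    checkpoints = []
    for attempt in range(1, max_attempts + 1):
        try:
            update = node(state)
            return {**state, **update, "error": ""}, checkpoints
        except Exception as exc:
            # Save progress before deciding how to recover.
            checkpoints.append(copy.deepcopy(state))
            kind = classify_failure(exc)
            state = {**state, "error": kind, "attempts": attempt}
            if kind != "transient":
                break  # only transient failures are retried in this sketch
    return state, checkpoints


# Usage: a node that times out twice, then succeeds.
calls = {"n": 0}


def flaky_node(state):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timed out")
    return {"output": f"Processed: {state['input']}"}


final_state, saved = run_with_recovery(flaky_node, {"input": "critical request"})
```

Non-transient failures leave the loop with the failure class recorded in `state["error"]`, so a later step (a fallback or a human reviewer) can decide what to do with it.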

---

### **4. Checkpointing & State Recovery**

When a graph is compiled with a checkpointer, LangGraph saves a checkpoint of the state after each step of execution, keyed by a `thread_id`.

```python
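import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver

# SqliteSaver wraps a sqlite3 connection (needs the langgraph-checkpoint-sqlite package).
conn = sqlite3.connect("workflow.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

graph = builder.compile(checkpointer=checkpointer)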
```

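If the process crashes, invoking the graph again with the same `thread_id` **resumes from the last checkpoint**. The `thread_id` must be passed under the `configurable` key.

```python
config = {"configurable": {"thread_id": "job-123"}}
graph.invoke(inputs, config=config)

# After a crash, pass None as the input to continue from the saved checkpoint.
graph.invoke(None, config=config)
```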
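
To make the resume behavior concrete, here is a conceptual sketch of checkpoint-and-resume using only `sqlite3` and `json`. It is not how LangGraph's checkpointers are implemented; the table layout and the helpers `save_checkpoint`, `load_latest`, and `run` are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkpoints (thread_id TEXT, step INTEGER, state TEXT)")


def save_checkpoint(conn, thread_id, step, state):
    conn.execute(
        "INSERT INTO checkpoints (thread_id, step, state) VALUES (?, ?, ?)",
        (thread_id, step, json.dumps(state)),
    )
    conn.commit()


def load_latest(conn, thread_id):
    """Return (completed_steps, state) for the thread, or (0, None) if new."""
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE thread_id = ? "
        "ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, None)


# Three toy steps standing in for graph nodes.
steps = [lambda s: {**s, "a": 1}, lambda s: {**s, "b": 2}, lambda s: {**s, "c": 3}]


def run(thread_id, initial, crash_at=None):
    done, saved = load_latest(conn, thread_id)
    state = saved if saved is not None else initial
    for i in range(done, len(steps)):  # skip steps that already completed
        if crash_at == i:
            raise RuntimeError("simulated crash")
        state = steps[i](state)
        save_checkpoint(conn, thread_id, i + 1, state)
    return state


# The first run crashes before the third step; the second run resumes there.
try:
    run("job-123", {"input": "x"}, crash_at=2)
except RuntimeError:
    pass
resumed = run("job-123", {"input": "ignored on resume"})
```

On the second call the initial input is ignored because a checkpoint already exists for `job-123`, which mirrors the idea of resuming a thread rather than restarting it.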

---

### **5. Retry & Backoff**

```python
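import time

def tool_node(state):
    # Record the failure in state instead of re-raising, so the graph can route on it.
    try:
        return {"result": call_api(), "error": ""}  # call_api is a placeholder for the real tool call
    except TimeoutError:
        return {"error": "tool_timeout"}

def retry_node(state):
    # Count the attempt and back off exponentially before routing back to "tool".
    attempts = state.get("attempts", 0) + 1
    time.sleep(min(2 ** attempts, 30))
    return {"attempts": attempts}

builder.add_node("tool", tool_node)
builder.add_node("retry", retry_node)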
```

| Strategy            | Purpose                |
| ------------------- | ---------------------- |
| Immediate retry     | Handle flaky calls     |
| Exponential backoff | Reduce overload        |
| Retry limit         | Prevent infinite loops |
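
The three strategies in the table can be combined in one helper. This is a minimal sketch in plain Python; `retry_with_backoff` and its parameters are hypothetical names, and the injectable `sleep` argument exists only so the example runs without waiting.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `fn` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retry limit reached: surface the error
            # Exponential backoff: 0.5s, 1s, 2s, ... capped at max_delay,
            # plus up to 10% jitter so clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))


# Usage: a call that times out twice, then succeeds. Delays are recorded, not slept.
delays = []
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("API timeout")
    return "ok"


result = retry_with_backoff(flaky_call, sleep=delays.append)
```

In real code the `except` clause would usually be narrowed to errors known to be transient, since retrying a validation error only wastes attempts.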

---

### **6. Conditional Recovery Paths**

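```python
from langgraph.graph import END

def router(state):
    # Routing function: returns the key of the edge to follow, based on the recorded error.
    if state["error"] == "tool_timeout":
        return "retry"
    if state["error"] == "bad_output":
        return "revise"
    return END
```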

```python
builder.add_conditional_edges("router", router, {
    "retry": "tool",
    "revise": "reflect",
    END: END
})
```

---

### **7. Fallback Nodes**

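When the primary path fails, route to an alternative model, tool, or piece of logic. A plain `add_edge` would run the fallback after every execution of the primary node, so the edge should be conditional on the recorded error.

```python
# Take the fallback edge only when the primary node recorded an error (END comes from langgraph.graph).
builder.add_conditional_edges("primary_llm", lambda s: "fallback_llm" if s.get("error") else END)
```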

---

### **8. Human-in-the-Loop Recovery**

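```python
# interrupt (from langgraph.types) is called inside a node; it pauses the run until a human resumes it.
builder.add_node("human_review", lambda state: {"approved": interrupt("Approve this output?")})
```

Interrupts require a checkpointer. The run is resumed by invoking the graph with `Command(resume=...)` (also from `langgraph.types`) and the same thread config.

Used when:

* Safety is at risk
* A legal or compliance check fails
* An error is detected with high confidence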

---

### **9. Compensation & Rollback**

For actions that modify external systems:

| Step         | Example                |
| ------------ | ---------------------- |
| Action       | Create record          |
| Failure      | Later validation fails |
| Compensation | Delete record          |

This preserves **system integrity**.
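
The table above can be sketched as a list of (action, compensation) pairs that are undone in reverse order when a later step fails. This is a framework-independent illustration; `run_with_compensation`, the `records` set, and the step functions are hypothetical stand-ins for real external systems.

```python
def run_with_compensation(steps, state):
    """Run (action, compensation) pairs; undo completed actions on failure."""
    undo_stack = []
    try:
        for action, compensate in steps:
            action(state)
            undo_stack.append(compensate)
    except Exception as exc:
        # Undo in reverse order so the most recent action is reverted first.
        while undo_stack:
            undo_stack.pop()(state)
        state["error"] = str(exc)
    return state


records = set()  # stand-in for an external system


def create_record(state):
    records.add(state["record_id"])


def delete_record(state):
    records.discard(state["record_id"])


def validate(state):
    raise ValueError("validation failed")


result = run_with_compensation(
    [(create_record, delete_record), (validate, lambda s: None)],
    {"record_id": "r-1"},
)
```

After the run, `records` is empty again and the failure is recorded in `result["error"]`, so the external system and the workflow state agree.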

---

### **10. Replay & Forensics**

Stored checkpoints enable:

* Full execution replay
* Failure diagnosis
* Training data collection
* Audit compliance
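
As a rough illustration of failure diagnosis and replay, the sketch below scans an ordered list of saved states for the first failure and re-runs a node from the state saved just before it. The history is hand-written here; in LangGraph a comparable list could be obtained from the compiled graph's `get_state_history(config)`. The helpers `first_failure` and `replay_from` are hypothetical.

```python
def first_failure(history):
    """Return (index, checkpoint) of the first checkpoint carrying an error."""
    for i, checkpoint in enumerate(history):
        if checkpoint.get("error"):
            return i, checkpoint
    return None


def replay_from(history, index, node):
    """Re-run `node` starting from the state saved just before `index`."""
    before = dict(history[max(index - 1, 0)])
    return {**before, **node(before)}


# Oldest checkpoint first.
history = [
    {"input": "critical request", "attempts": 0, "error": ""},
    {"input": "critical request", "attempts": 1, "error": "llm_failed"},
    {"input": "critical request", "attempts": 2, "error": ""},
]

index, failed = first_failure(history)
fixed = replay_from(
    history,
    index,
    lambda s: {"attempts": s["attempts"] + 1, "error": "", "output": "ok"},
)
```

Replaying from a copy of the earlier state leaves the stored history untouched, which matters when the same checkpoints also serve as an audit trail.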

---

### **11. Production Recovery Blueprint**

| Layer  | Recovery Control        |
| ------ | ----------------------- |
| Node   | try/except, retry       |
| Graph  | fallback routing        |
| State  | checkpoint, persistence |
| Agents | self-correction loops   |
| Human  | approval & override     |
| System | restart + resume        |

---

### **12. Key Design Principle**

> **LangGraph recovery is state-driven, not exception-driven.**

Failures become **data** that the graph reasons about.
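
One way to apply this principle uniformly is a decorator that converts exceptions into an `error` field in the node's state update. This is a minimal sketch; `errors_as_state` and the mapping from exception types to error codes are assumptions, with the codes borrowed from the router in section 6.

```python
import functools


def errors_as_state(node):
    """Wrap a node so known exceptions become an `error` field in its update."""
    @functools.wraps(node)
    def wrapper(state):
        try:
            return {**node(state), "error": ""}
        except TimeoutError:
            return {"error": "tool_timeout"}
        except ValueError:
            return {"error": "bad_output"}
    return wrapper


@errors_as_state
def tool_node(state):
    if state["input"] == "slow":
        raise TimeoutError("API timeout")
    return {"output": f"Processed: {state['input']}"}


def router(state):
    """Decide the next step from data in state, not from a raised exception."""
    if state["error"] == "tool_timeout":
        return "retry"
    if state["error"] == "bad_output":
        return "revise"
    return "done"


failed_update = tool_node({"input": "slow"})
ok_update = tool_node({"input": "critical request"})
```

Exceptions that are not listed still propagate, which keeps genuinely unexpected failures visible instead of silently folding them into state.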



### Demonstration

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import InMemorySaver
import random

# -----------------------------
# 1. State
# -----------------------------

class State(TypedDict):
    input: str
    output: str
    error: str
    attempts: int

# -----------------------------
# 2. Nodes (ALWAYS return dict)
# -----------------------------

def primary_llm(state):
    attempts = state["attempts"] + 1

    # Simulate an unreliable model call: fail 60% of the time.
    if random.random() < 0.6:
        return {"attempts": attempts, "error": "llm_failed"}

    return {"attempts": attempts, "error": "", "output": f"Processed: {state['input']}"}

def fallback_llm(state):
    return {"output": f"[Fallback] Processed: {state['input']}", "error": ""}

# Pass-through node: it changes nothing and only anchors the conditional edges.
def router_node(state):
    return {}

# -----------------------------
# 3. Routing Logic (control flow only)
# -----------------------------

def route(state):
    if state["error"] and state["attempts"] < 3:
        return "retry"
    if state["error"]:
        return "fallback"
    return END

# -----------------------------
# 4. Graph
# -----------------------------

builder = StateGraph(State)

builder.add_node("primary", primary_llm)
builder.add_node("fallback", fallback_llm)
builder.add_node("router", router_node)

builder.set_entry_point("primary")
builder.add_edge("primary", "router")

builder.add_conditional_edges(
    "router",
    route,
    {
        "retry": "primary",
        "fallback": "fallback",
        END: END,
    }
)

# The fallback is the last resort, so the run ends after it.
builder.add_edge("fallback", END)

# -----------------------------
# 5. Compile with Checkpointing
# -----------------------------

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)

# -----------------------------
# 6. Run
# -----------------------------

# The thread_id must be nested under "configurable" when a checkpointer is used.
result = graph.invoke(
    {"input": "critical request", "output": "", "error": "", "attempts": 0},
    config={"configurable": {"thread_id": "job-42"}}
)

print(result)
```

One possible output (the primary node fails at random, so `attempts` and `output` vary between runs):

```
{'input': 'critical request', 'output': 'Processed: critical request', 'error': '', 'attempts': 1}
```
