```{contents}
```
## State Recovery

**State Recovery** in LangGraph is the ability of a workflow to **resume execution from a previously saved point** after an interruption, failure, crash, or manual pause, without re-running the entire graph.
It is a foundational mechanism for **fault tolerance, long-running workflows, human-in-the-loop systems, and production reliability**.

---

### **1. Motivation**

LLM systems in production are:

* **Long-running**
* **Distributed**
* **Failure-prone**
* **Cost-sensitive**

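Re-running an entire graph after a failure is **expensive, slow, and unsafe**: completed LLM calls are paid for again, and side effects such as tool calls may run twice.
State recovery makes execution **resilient** by resuming from the last saved checkpoint instead.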

---

### **2. Core Concepts**

| Concept     | Role                             |
| ----------- | -------------------------------- |
| Checkpoint  | Saved snapshot of execution      |
| State Store | Persistent storage backend       |
| Thread ID   | Unique workflow identity         |
| Run ID      | Unique execution instance        |
| Snapshot    | Serialized state + position      |
| Resume      | Continue execution from snapshot |
| Replay      | Deterministic re-execution       |
| Rollback    | Restore earlier snapshot         |

---

### **3. How State Recovery Works**

#### **Execution Model**

```
Start → Node A → Checkpoint → Node B → Checkpoint → Node C → ...
```

If a failure occurs after Node B has been checkpointed:

```
Recover → Load Checkpoint → Resume at Node C
```

LangGraph stores:

* **Graph position** (which node)
* **Full state**
* **Execution metadata**
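
The execution model above can be sketched in plain Python. This is an illustrative toy, not LangGraph's internals: the `run_with_checkpoints` helper, the dict-based `store`, and the snapshot layout (`state` plus `next_index`) are all assumptions made for the example.

```python
import json
from typing import Callable

calls = []  # records which nodes actually ran, across both attempts
crash_once = {"armed": True}  # makes node C fail on its first attempt only


def node_a(state: dict) -> dict:
    calls.append("A")
    return {**state, "count": state["count"] + 1}


def node_b(state: dict) -> dict:
    calls.append("B")
    return {**state, "count": state["count"] + 1}


def node_c(state: dict) -> dict:
    calls.append("C")
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash in node C")
    return {**state, "count": state["count"] + 1}


def run_with_checkpoints(
    nodes: list[Callable[[dict], dict]],
    initial_state: dict,
    store: dict,
    thread_id: str,
) -> dict:
    """Run nodes in order, saving state + position after each completed node."""
    saved = store.get(thread_id)
    if saved is not None:
        # Resume: restore the state and skip nodes that already completed.
        snapshot = json.loads(saved)
        state, start = snapshot["state"], snapshot["next_index"]
    else:
        state, start = initial_state, 0

    for index in range(start, len(nodes)):
        state = nodes[index](state)
        # Checkpoint only after the node finished successfully.
        store[thread_id] = json.dumps({"state": state, "next_index": index + 1})
    return state


store: dict = {}
nodes = [node_a, node_b, node_c]

try:
    run_with_checkpoints(nodes, {"count": 0}, store, "user-42")
except RuntimeError as exc:
    print(f"crashed: {exc}")

# Second attempt: A and B are not re-run; execution resumes at C.
final = run_with_checkpoints(nodes, {"count": 0}, store, "user-42")
print(final, calls)
```

Note that node C runs twice: the checkpoint records completed nodes only, so a node that fails partway is re-executed from its start.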

---

### **4. Storage Architecture**

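LangGraph uses a pluggable **Checkpoint Store** (a checkpointer). In-memory, SQLite, and Postgres checkpointers ship as LangGraph packages; the other backends below depend on third-party or custom implementations: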

| Backend            | Usage               |
| ------------------ | ------------------- |
| Memory             | Development         |
| SQLite             | Local workflows     |
| Postgres           | Enterprise          |
| Redis              | High-throughput     |
| S3 / Cloud storage | Distributed systems |
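
To illustrate what "pluggable" means here, the sketch below defines a tiny store interface with two interchangeable backends. It is a simplified assumption for teaching purposes and not LangGraph's actual checkpointer interface; `CheckpointStore`, `MemoryStore`, `SqliteStore`, and the table layout are hypothetical names.

```python
import json
import sqlite3
from typing import Optional, Protocol


class CheckpointStore(Protocol):
    """Hypothetical minimal interface every backend implements."""

    def put(self, thread_id: str, snapshot: dict) -> None: ...

    def get_latest(self, thread_id: str) -> Optional[dict]: ...


class MemoryStore:
    """Keeps snapshots in a dict; everything is lost when the process exits."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = {}

    def put(self, thread_id: str, snapshot: dict) -> None:
        self._data.setdefault(thread_id, []).append(json.dumps(snapshot))

    def get_latest(self, thread_id: str) -> Optional[dict]:
        history = self._data.get(thread_id)
        return json.loads(history[-1]) if history else None


class SqliteStore:
    """Persists snapshots in SQLite; pass a file path to survive restarts."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "thread_id TEXT NOT NULL, "
            "snapshot TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, thread_id: str, snapshot: dict) -> None:
        self._conn.execute(
            "INSERT INTO checkpoints (thread_id, snapshot) VALUES (?, ?)",
            (thread_id, json.dumps(snapshot)),
        )
        self._conn.commit()

    def get_latest(self, thread_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT snapshot FROM checkpoints WHERE thread_id = ? "
            "ORDER BY id DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None


def save_and_load(store: CheckpointStore) -> Optional[dict]:
    """Caller code depends only on the interface, not on the backend."""
    store.put("user-42", {"state": {"count": 1}, "next": "step2"})
    store.put("user-42", {"state": {"count": 2}, "next": "step3"})
    return store.get_latest("user-42")


results = [save_and_load(MemoryStore()), save_and_load(SqliteStore())]
print(results)
```

Because the calling code only touches `put` and `get_latest`, swapping the development backend for a persistent one is a one-line change, which is the same property the `checkpointer=` argument provides when compiling a graph.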

---

### **5. Implementation Example**

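```python
import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import StateGraph

# SqliteSaver wraps an open sqlite3 connection
# (requires the langgraph-checkpoint-sqlite package).
conn = sqlite3.connect("state.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

# `builder` is a StateGraph whose nodes and edges are already defined.
graph = builder.compile(checkpointer=checkpointer)
```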

With a checkpointer attached, every step is automatically checkpointed, provided each invocation passes a `thread_id` in its config.

---

### **6. Crash Simulation & Recovery**

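```python
# The thread_id must be nested under the "configurable" key.
config = {"configurable": {"thread_id": "user-42"}}

result = graph.invoke(input_data, config=config)
```

If the process crashes mid-execution, restart it and call `invoke` with `None` as the input and the same `thread_id`. This requires a persistent checkpointer; an in-memory one loses its checkpoints with the process.

```python
config = {"configurable": {"thread_id": "user-42"}}

# Passing None tells LangGraph to continue from the last checkpoint
# instead of starting a new run.
result = graph.invoke(None, config=config)
```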

LangGraph will:

1. Load last checkpoint
2. Restore state
3. Continue execution automatically

---

### **7. Human-in-the-Loop Recovery**

State recovery enables **manual intervention**:

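```python
config = {"configurable": {"thread_id": "user-42"}}

snapshot = graph.get_state(config)              # inspect the paused state
graph.update_state(config, {"approved": True})  # apply the human decision
graph.invoke(None, config=config)               # resume from the checkpoint
```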

---

### **8. Failure Scenarios Covered**

| Failure            | Recovery Behavior                              |
| ------------------ | ---------------------------------------------- |
| Process crash      | Resume from the last checkpointed node         |
| Network failure    | Resume from the last checkpoint                |
| Human pause        | Continue later from the paused state           |
| Timeout            | Replay from checkpoint                         |
| Model failure      | Retry the failed node without redoing earlier ones |
| Deployment restart | Continue execution (persistent backend needed) |

---

### **9. Determinism & Replay**

A node that fails partway through is re-executed from its start on resume, so safe recovery depends on the following:

* Node functions must be **pure or idempotent**
* External calls should be **logged**
* Randomness must be seeded
* Tool effects must be controlled
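
One way to satisfy these rules is to route side effects through a log keyed by an idempotency key, and to derive randomness from a seed stored in the state. The sketch below is an assumption-laden illustration: `EffectLog`, `send_email`, and `notify_node` are hypothetical, and in a real system the log would need to live in persistent storage alongside the checkpoints.

```python
import random
from typing import Callable


class EffectLog:
    """Hypothetical log of completed external calls, keyed by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        # Execute the effect only if this key has never completed before;
        # otherwise return the recorded result.
        if key not in self._results:
            self._results[key] = effect()
        return self._results[key]


sent_emails: list[str] = []  # stands in for the outside world


def send_email(to: str) -> str:
    sent_emails.append(to)
    return f"sent:{to}"


def notify_node(state: dict, log: EffectLog) -> dict:
    # The key is derived from the state, so a replay produces the same key.
    key = f"{state['thread_id']}:notify:{state['order_id']}"
    receipt = log.run_once(key, lambda: send_email(state["email"]))

    # Seeded randomness: the same state always yields the same sample.
    rng = random.Random(state["seed"])
    sample = rng.randint(0, 999)

    return {**state, "receipt": receipt, "sample": sample}


log = EffectLog()
state = {
    "thread_id": "user-42",
    "order_id": 7,
    "email": "a@example.com",
    "seed": 123,
}

first = notify_node(state, log)
second = notify_node(state, log)  # simulated replay after a crash

print(first == second, len(sent_emails))
```

The replayed call returns the same result as the original, and the email is sent only once, which is the property that makes re-executing a node after recovery safe.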

---

### **10. Production Best Practices**

| Practice                   | Benefit            |
| -------------------------- | ------------------ |
| Checkpoint after each node | Minimal loss       |
| Use persistent backend     | Crash-safe         |
| Tag with thread_id         | Correct resumption |
| Enable tracing             | Debug recovery     |
| Limit checkpoint size      | Performance        |
| Encrypt stored state       | Security           |

---

### **11. Mental Model**

LangGraph state recovery behaves like:

> **Transactional database execution for AI workflows**

Every node commit creates a safe restore point.

---

### **12. Demonstration**

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import InMemorySaver
import time

class State(TypedDict):
    count: int

def step1(state):
    print("Step 1")
    time.sleep(1)
    return {"count": state["count"] + 1}

def step2(state):
    print("Step 2")
    time.sleep(1)
    return {"count": state["count"] + 1}

def step3(state):
    print("Step 3 (will crash)")
    time.sleep(1)
    raise RuntimeError("Simulated crash")
```

```python
builder = StateGraph(State)

builder.add_node("step1", step1)
builder.add_node("step2", step2)
builder.add_node("step3", step3)

builder.set_entry_point("step1")
builder.add_edge("step1", "step2")
builder.add_edge("step2", "step3")
builder.add_edge("step3", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)
```

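```python
# The thread_id must be nested under the "configurable" key.
config = {"configurable": {"thread_id": "job-1"}}

try:
    graph.invoke({"count": 0}, config=config)
except RuntimeError:
    print("Process crashed")
```

```
Step 1
Step 2
Step 3 (will crash)
Process crashed
```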
