```{contents}
```
## **State Checkpointing in LangGraph**

**State checkpointing** in LangGraph is the mechanism for **persisting execution state at runtime** so that a graph can be **paused, resumed, inspected, replayed, audited, and recovered from failure**.
It is the foundation for **reliability, human-in-the-loop workflows, long-running agents, and production-grade fault tolerance**.

---

### **1. Motivation & Intuition**

LLM workflows are:

* Long-running
* Non-deterministic
* Failure-prone (network, tools, models, humans)

Without checkpoints, a failure means **starting from scratch**.
With checkpoints, a run can restart from its last saved step, so LangGraph behaves much like a **transactional system**.

> **Checkpoint = persistent snapshot of the entire graph state + execution position**

---

### **2. What Exactly Is Stored**

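A checkpoint captures roughly the following:

| Component | Description                                                   |
| --------- | ------------------------------------------------------------- |
| State     | Values of the full shared state object at that step           |
| Node      | Node(s) scheduled to run next                                 |
| Edge      | Next transitions, resolved from the compiled graph            |
| Metadata  | Timestamps, step count, run id                                |
| Version   | State version                                                 |
| History   | Link to previous snapshots                                    |

Formally, the core of a checkpoint at step t is:

```
Checkpoint_t = (State_t, Node_t, Metadata_t)
```

Version and history information link each checkpoint to the ones before it.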

---

### **3. How Checkpointing Works**

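With a checkpointer attached, LangGraph saves state:

* **After each execution step**, so the state before and after every node is recoverable
* At **human interrupt points**
* At **error boundaries** (the last successful step stays saved when a node fails)
* At **explicit user-defined points**, such as manual state updates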

Execution then becomes:

```
Load Checkpoint → Execute Node → Save Checkpoint → Transition
```
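
The load, execute, save, transition cycle can be illustrated with a toy loop in plain Python. This is a conceptual sketch, not LangGraph's implementation: the `store` dictionary, the `run` function, and the `add_one` node are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical checkpoint store: thread_id -> (state, next_node)
store: dict = {}


def add_one(state: dict) -> tuple:
    """Toy node: increments n and loops until n reaches 3."""
    new_state = {**state, "n": state["n"] + 1}
    next_node = "add_one" if new_state["n"] < 3 else None
    return new_state, next_node


NODES: dict = {"add_one": add_one}


def run(thread_id: str, initial: Optional[dict] = None, entry: str = "add_one") -> dict:
    # Load the checkpoint if one exists, otherwise start from the initial state.
    if thread_id in store:
        state, node = store[thread_id]
    else:
        state, node = dict(initial or {}), entry
    while node is not None:
        state, node = NODES[node](state)  # execute node and pick the transition
        store[thread_id] = (state, node)  # save checkpoint after every step
    return state


final = run("t1", {"n": 0})
print(final)      # final state after three steps
print(run("t1"))  # thread already finished: nothing runs, saved state returned
```

Because the position (`node`) is saved together with the state, a second call for the same thread picks up wherever the first one left off.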

---

### **4. Enabling Checkpointing**

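```python
import sqlite3

# Provided by the separate langgraph-checkpoint-sqlite package
from langgraph.checkpoint.sqlite import SqliteSaver

# In recent releases from_conn_string() is a context manager, so build the
# saver from an explicit connection instead.
conn = sqlite3.connect("state.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

# `builder` is a StateGraph defined elsewhere
graph = builder.compile(checkpointer=checkpointer)
```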

Now every execution step is **persisted**.

---

### **5. Thread-Based Persistence**

Each persisted execution is identified by a **thread_id**, passed in the `configurable` section of the run config.

```python
graph.invoke(
    input_state,
    config={"configurable": {"thread_id": "user-123"}}
)
```

This allows:

* Session continuity
* Conversation memory
* Workflow recovery

---

### **6. Resuming from Checkpoint**

```python
graph.invoke(
    None,
    config={"configurable": {"thread_id": "user-123"}}
)
```

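Passing `None` as the input tells LangGraph to continue from the **last saved checkpoint** of that thread instead of starting a new run. If the thread had already finished, no nodes run and the saved final state is returned.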

---

### **7. Human-in-the-Loop with Checkpoints**

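```python
from langgraph.types import interrupt  # interrupt lives in langgraph.types

def approval_node(state):
    # Pauses the run here; on resume, interrupt() returns the human's reply.
    decision = interrupt("Waiting for approval")
    # "decision" is an illustrative state key
    return {"decision": decision}
```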

The graph:

1. Saves checkpoint
2. Pauses execution
3. Waits for human
4. Resumes from the same state when invoked again on the same `thread_id` with the human's reply

---

### **8. Recovery & Replay**

| Capability     | How                        |
| -------------- | -------------------------- |
| Crash recovery | Reload last checkpoint     |
| Step replay    | Load previous snapshot     |
| Audit          | Inspect state history      |
| Rollback       | Restore older version      |
| Debugging      | Trace exact execution path |

---

### **9. Checkpoint Storage Options**

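| Backend    | Use Case                                                   |
| ---------- | ---------------------------------------------------------- |
| In-memory  | Tests and notebooks (state is lost when the process exits) |
| SQLite     | Local dev                                                  |
| PostgreSQL | Production                                                 |
| Redis      | Fast ephemeral                                             |
| S3         | Long-term archival                                         |

Availability varies by backend: in-memory, SQLite, and PostgreSQL savers are published by the LangGraph project, while backends such as Redis or S3 typically rely on separately maintained or custom checkpointer implementations.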

---

### **10. Example: Fault-Tolerant Agent**

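```python
from langgraph.graph import StateGraph

# State and checkpointer are defined as in the earlier sections
builder = StateGraph(State)
...
graph = builder.compile(checkpointer=checkpointer)

result = graph.invoke(
    {"task": "analyze report"},
    config={"configurable": {"thread_id": "job-7"}},
)
```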

If the process crashes mid-run:

```python
graph.invoke(None, config={"configurable": {"thread_id": "job-7"}})
```

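The agent continues from the last saved checkpoint: nodes that completed before the crash are not re-run, while the node that was in progress starts over. Surviving a process crash requires a durable backend such as SQLite or PostgreSQL, since an in-memory saver is lost with the process.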

---

### **11. Why Checkpointing Is Essential for Production**

| Without Checkpoints | With Checkpoints    |
| ------------------- | ------------------- |
| Stateless           | Fully stateful      |
| Fragile             | Fault tolerant      |
| No audit            | Full traceability   |
| Manual recovery     | Automatic recovery  |
| Unsafe autonomy     | Controlled autonomy |

---

### **12. Design Patterns Enabled**

* Long-running agents
* Multi-day workflows
* Approval pipelines
* Compliance auditing
* Self-healing systems
* Distributed execution

---

### **13. Mental Model**

LangGraph with checkpoints behaves like:

> **A database transaction system for LLM workflows**

Given a durable backend, every step is:

* Atomic
* Durable
* Recoverable
* Auditable

---

### **14. Demonstration**

```python
from typing import TypedDict

class State(TypedDict):
    step: int
    message: str
```


```python
from langgraph.graph import StateGraph, END

def step_node(state: State):
    print("Running step:", state["step"])
    return {"step": state["step"] + 1}

def decision_node(state: State):
    if state["step"] >= 3:
        return {"message": "Finished"}
    return {}
```


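```python
builder = StateGraph(State)

# The node is registered as "advance" rather than "step" because some
# LangGraph versions reject node names that collide with a state key.
builder.add_node("advance", step_node)
builder.add_node("decide", decision_node)

builder.set_entry_point("advance")
builder.add_edge("advance", "decide")

# Loop back to "advance" until three steps have run, then stop.
builder.add_conditional_edges(
    "decide",
    lambda s: END if s["step"] >= 3 else "advance",
    {"advance": "advance", END: END}
)
```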

```python
from langgraph.checkpoint.memory import InMemorySaver

checkpointer = InMemorySaver()

graph = builder.compile(checkpointer=checkpointer)
```


```python
thread_id = "job-42"

graph.invoke({"step": 0, "message": ""},
    config={"configurable": {"thread_id": thread_id}})
```

Output:

```
Running step: 0
Running step: 1
Running step: 2
{'step': 3, 'message': 'Finished'}
```

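```python
graph.invoke(None,
    config={"configurable": {"thread_id": "job-42"}})
```

Output:

```
{'step': 3, 'message': 'Finished'}
```

The thread `job-42` had already run to completion, so invoking it again with `None` executes no nodes and returns the last persisted state.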