```{contents}
```
## Dead Letter Queue (DLQ)

A **Dead Letter Queue (DLQ)** in LangGraph is a **reliability and fault-tolerance pattern** that captures **failed node executions, invalid states, and unrecoverable events** so that they do **not break the main workflow** and can be **analyzed, repaired, and replayed later**. In the approach shown here it is built from ordinary nodes and conditional edges rather than a dedicated API.

---

### **1. Why DLQ Is Necessary in LangGraph**

LLM workflows operate with:

* Unreliable external APIs
* Non-deterministic LLM behavior
* Long-running, multi-step execution
* Human interactions
* Distributed infrastructure

Failures are inevitable.

Without a DLQ:

> **One unhandled failure → the entire graph run aborts**

With a DLQ:

> **Failure is isolated, persisted, and recoverable**

---

### **2. What Goes Into a LangGraph DLQ**

| Stored Item            | Purpose                    |
| ---------------------- | -------------------------- |
| Failed state snapshot  | Resume or debug            |
| Node name              | Identify failing component |
| Exception & stacktrace | Root cause analysis        |
| Input & output         | Reproducibility            |
| Timestamp              | Ordering & auditing        |
| Retry count            | Failure classification     |
| Thread / run ID        | Correlation                |
| Metadata               | System diagnostics         |
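
The table above maps naturally onto a single record type. The following is a minimal sketch, not a LangGraph type: `DeadLetterRecord`, `record_from_exception`, and every field name are hypothetical, and the node's input and output are assumed to live inside the state snapshot.

```python
import traceback
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any


@dataclass
class DeadLetterRecord:
    """One DLQ entry (illustrative field names, not part of LangGraph)."""
    node: str                  # name of the failing node
    state: dict[str, Any]      # state snapshot at failure, including input/output
    error: str                 # exception message
    stacktrace: str            # formatted traceback for root cause analysis
    retries: int               # attempts made before giving up
    thread_id: str             # correlates the record with a run
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    metadata: dict[str, Any] = field(default_factory=dict)


def record_from_exception(node, state, exc, retries, thread_id, **metadata):
    """Build a record from a caught exception."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return DeadLetterRecord(
        node=node, state=dict(state), error=str(exc), stacktrace=stack,
        retries=retries, thread_id=thread_id, metadata=metadata,
    )


# Usage: capture a failure and turn it into a plain dict ready for serialization.
try:
    raise RuntimeError("LLM API failure")
except RuntimeError as exc:
    record = record_from_exception("work", {"input": "hi"}, exc, 3, "thread-1", env="dev")

row = asdict(record)
```

Keeping the record as a plain dataclass makes it easy to serialize with `json.dumps(row)` regardless of where the DLQ is stored.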

---

### **3. DLQ Execution Model in LangGraph**

```
Normal Path
   │
   ▼
[ Node Execution ]
   │
   ├── Success → Continue graph
   │
   └── Failure → Retry → Retry → Exhausted
                         │
                         ▼
                      Dead Letter Queue
```

---

### **4. Implementing DLQ in LangGraph**

### **State Schema**

```python
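from typing import TypedDict

class State(TypedDict):
    input: str    # original request
    result: str   # set on success
    error: str    # last error message, empty when the node succeeded
    retries: int  # number of failed attempts so far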
```

---

### **Fault-Tolerant Node**

```python
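import random

MAX_RETRIES = 3

def fragile_node(state: State):
    """Simulate a flaky call; report failures in state instead of raising."""
    try:
        if random.random() < 0.7:  # fails about 70% of the time
            raise RuntimeError("LLM API failure")
        # Clear any earlier error so a success after a retry is not misread.
        return {"result": "success", "error": ""}
    except Exception as e:
        return {
            "error": str(e),
            "retries": state.get("retries", 0) + 1
        }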
```

---

### **DLQ Router**

```python
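from langgraph.graph import END

def route_after_failure(state: State):
    """Pick the next step after "work" runs."""
    if state.get("result"):
        return END       # success: finish instead of looping back into "work"
    if state.get("retries", 0) >= MAX_RETRIES:
        return "dlq"     # retry budget exhausted
    return "retry"
```

---

### **DLQ Node**

```python
import json

def dead_letter_node(state: State):
    # Append the failed state as one JSON line so it can be inspected or replayed.
    with open("dlq.log", "a") as f:
        f.write(json.dumps(state) + "\n")
    return state
```

---

### **Graph Wiring**

```python
from langgraph.graph import StateGraph, END

builder = StateGraph(State)
builder.add_node("work", fragile_node)
builder.add_node("dlq", dead_letter_node)

builder.set_entry_point("work")

# Every value the router can return needs a target in this mapping.
builder.add_conditional_edges("work", route_after_failure, {
    "retry": "work",
    "dlq": "dlq",
    END: END
})

builder.add_edge("dlq", END)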
```

---

### **5. Production DLQ Architecture**

| Layer         | Implementation                |
| ------------- | ----------------------------- |
| Storage       | Kafka, SQS, Redis, Postgres   |
| Retention     | 7–90 days                     |
| Replay Engine | LangGraph checkpoint recovery |
| Alerting      | Prometheus / Slack            |
| Dashboards    | Grafana                       |
| Governance    | Audit logging                 |
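
To make the storage and retention rows concrete, here is a minimal sketch that uses SQLite from the standard library as a stand-in for Postgres. The `SqliteDLQ` class, the `dlq` table, and its columns are assumptions made for illustration; a production store would also need indexing, access control, and alerting hooks.

```python
import json
import sqlite3
import time


class SqliteDLQ:
    """Tiny DLQ store; class, table, and column names are hypothetical."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS dlq ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "created_at REAL NOT NULL, "
            "thread_id TEXT NOT NULL, "
            "node TEXT NOT NULL, "
            "payload TEXT NOT NULL)"
        )

    def put(self, thread_id, node, state, now=None):
        """Persist one failed state as a JSON payload."""
        created = time.time() if now is None else now
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO dlq (created_at, thread_id, node, payload) "
                "VALUES (?, ?, ?, ?)",
                (created, thread_id, node, json.dumps(state)),
            )

    def depth(self):
        """Number of records currently held; a candidate health metric."""
        return self.conn.execute("SELECT COUNT(*) FROM dlq").fetchone()[0]

    def purge_older_than(self, days, now=None):
        """Apply the retention window and return how many rows were removed."""
        cutoff = (time.time() if now is None else now) - days * 86400
        with self.conn:
            cur = self.conn.execute("DELETE FROM dlq WHERE created_at < ?", (cutoff,))
        return cur.rowcount


# Usage with explicit timestamps so the retention behavior is easy to follow.
dlq = SqliteDLQ()
dlq.put("thread-1", "work", {"error": "timeout"}, now=0.0)                # day 0
dlq.put("thread-2", "work", {"error": "bad input"}, now=100 * 86400.0)    # day 100
removed = dlq.purge_older_than(days=90, now=100 * 86400.0)
```

The `depth()` value is the kind of number that could be exported to a metrics system for the alerting and dashboard layers.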

---

### **6. DLQ Replay Workflow**

```
DLQ → Analyze → Fix Code / State → Replay Checkpoint → Resume Graph
```

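LangGraph supports **resuming from checkpoints** when the graph is compiled with a checkpointer and each run is invoked with a thread ID. This makes the DLQ a practical recovery tool: the thread ID stored with each record points back to the checkpoint to resume from.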
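
The replay step can be sketched without any LangGraph machinery. The example below is a simplified illustration that re-runs stored state snapshots rather than resuming a checkpoint. It assumes the DLQ holds one JSON object per line (the format written by the DLQ node earlier) and that `run` is any callable taking and returning a state dict, such as a compiled graph's `invoke`. `replay` and `fake_run` are hypothetical names.

```python
import json
from typing import Callable, Iterable


def replay(lines: Iterable[str], run: Callable[[dict], dict]) -> tuple[list[dict], list[dict]]:
    """Re-run each dead-lettered state; return (recovered, still_failing)."""
    recovered, still_failing = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the log
        state = json.loads(line)
        # Reset failure bookkeeping so the retry budget starts fresh.
        state.update({"error": "", "retries": 0})
        result = run(state)
        (still_failing if result.get("error") else recovered).append(result)
    return recovered, still_failing


# Stand-in for the real workflow (hypothetical): fails only on empty input.
def fake_run(state: dict) -> dict:
    if not state.get("input"):
        return {**state, "error": "empty input"}
    return {**state, "result": "success"}


dlq_lines = [
    json.dumps({"input": "hello", "result": "", "error": "LLM API failure", "retries": 3}),
    json.dumps({"input": "", "result": "", "error": "LLM API failure", "retries": 3}),
]
recovered, still_failing = replay(dlq_lines, fake_run)
```

Records that fail again stay in `still_failing`, so they can be written back to the DLQ instead of being lost.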

---

### **7. DLQ Use Cases**

| Scenario          | Role of DLQ              |
| ----------------- | ------------------------ |
| LLM hallucination | Capture unsafe output    |
| Tool failure      | Preserve request & retry |
| Timeout           | Prevent graph crash      |
| Invalid state     | Debug data corruption    |
| Human rejection   | Store rejected path      |

---

### **8. Design Best Practices**

* Never lose state on failure
* Store **full execution context**
* Classify failures (transient vs fatal)
* Enable selective replay
* Monitor DLQ volume (system health indicator)
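
Failure classification can be sketched as a small decision function. The grouping of exception types below is an assumption chosen for illustration, and `classify` and `next_step` are hypothetical helpers. Real systems usually classify on the error types of the specific client libraries they call.

```python
# Assumed groupings; adjust to the exceptions your tools and clients raise.
TRANSIENT = (TimeoutError, ConnectionError)   # worth retrying
FATAL = (ValueError, KeyError, TypeError)     # retrying will not help


def classify(exc: BaseException) -> str:
    """Label an exception as 'fatal', 'transient', or 'unknown'."""
    if isinstance(exc, FATAL):
        return "fatal"
    if isinstance(exc, TRANSIENT):
        return "transient"
    return "unknown"


def next_step(exc: BaseException, retries: int, max_retries: int = 3) -> str:
    """Send fatal errors straight to the DLQ; retry the rest until the budget runs out."""
    if classify(exc) == "fatal":
        return "dlq"
    return "retry" if retries < max_retries else "dlq"


decisions = [
    next_step(TimeoutError("upstream timeout"), retries=0),
    next_step(TimeoutError("upstream timeout"), retries=3),
    next_step(ValueError("malformed state"), retries=0),
]
```

Sending fatal errors to the DLQ immediately avoids spending the retry budget on failures that cannot succeed, and the label can be stored with the record to support selective replay.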

---

### **9. Mental Model**

In LangGraph:

> **DLQ = insurance policy for autonomous systems**

It turns unpredictable LLM pipelines into **reliable production systems**.


### Demonstration

```python
# Dead Letter Queue (DLQ) demonstration in one cell

import random, json
from typing import TypedDict
from langgraph.graph import StateGraph, END

# ---------------------------
# State definition
# ---------------------------
class State(TypedDict):
    retries: int
    result: str
    error: str

MAX_RETRIES = 3

# ---------------------------
# Worker node (may fail)
# ---------------------------
def fragile_node(state: State):
    try:
        if random.random() < 0.7:
            raise RuntimeError("Simulated LLM / Tool failure")
        return {"result": "SUCCESS", "error": ""}
    except Exception as e:
        return {
            "error": str(e),
            "retries": state.get("retries", 0) + 1
        }

# ---------------------------
# Router after failure
# ---------------------------
def router(state: State):
    if state.get("retries", 0) >= MAX_RETRIES:
        return "dlq"
    if state.get("error"):
        return "retry"
    return END

# ---------------------------
# Dead Letter Queue node
# ---------------------------
def dead_letter_node(state: State):
    print("\n💀 DEAD LETTER QUEUE TRIGGERED")
    print(json.dumps(state, indent=2))
    return state

# ---------------------------
# Build graph
# ---------------------------
builder = StateGraph(State)

builder.add_node("work", fragile_node)
builder.add_node("dlq", dead_letter_node)

builder.set_entry_point("work")

builder.add_conditional_edges("work", router, {
    "retry": "work",
    "dlq": "dlq",
    END: END
})

builder.add_edge("dlq", END)

graph = builder.compile()

# ---------------------------
# Run
# ---------------------------
print("\n▶ Running DLQ Demo\n")
result = graph.invoke({"retries": 0, "result": "", "error": ""})

print("\nFINAL STATE")
print(json.dumps(result, indent=2))
```

Sample output (varies between runs because failures are random):

```
▶ Running DLQ Demo


FINAL STATE
{
  "retries": 1,
  "result": "SUCCESS",
  "error": ""
}
```
