```{contents}
```
## **Retry Node in LangGraph**

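A **Retry Node** in LangGraph is an ordinary node with a **retry policy** attached: a **fault-tolerance control mechanism** that automatically re-executes the node when it raises an exception the policy is configured to retry. Timeouts and invalid results trigger a retry only if the node surfaces them as exceptions.
It is a fundamental building block for **reliable, production-grade LLM systems**.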

---

### **1. Why Retry Nodes Exist**

LLM systems operate over **unreliable components**:

| Failure Source | Examples                          |
| -------------- | --------------------------------- |
| LLM APIs       | Rate limits, timeouts             |
| Tools          | Network failures, invalid outputs |
| Data           | Corruption, missing fields        |
| Logic          | Non-deterministic behavior        |

A retry node prevents **entire workflow failure** when transient errors occur.

---

### **2. Core Design Philosophy**

LangGraph treats retries as part of the **execution graph**, not as ad-hoc try/except blocks.

> **Failures are expected events in the control flow.**

---

### **3. Where Retries Live in LangGraph**

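Retries are configured at **node level**, when the node is added to the graph. `StateGraph.compile()` does not accept a retry policy, so to share one policy across nodes, pass the same `RetryPolicy` object to each `add_node` call.

```python
from langgraph.types import RetryPolicy  # older releases: langgraph.pregel

# One policy object, reused by every node that should share it.
default_policy = RetryPolicy(
    max_attempts=3,                       # total attempts, including the first
    backoff_factor=2.0,                   # delay multiplier between attempts
    retry_on=(TimeoutError, ValueError),  # exception types that trigger a retry
)

# Recent releases name the keyword `retry_policy`; older ones use `retry`.
# `fetch_fn` and `llm_fn` are placeholder node functions.
builder.add_node("fetch", fetch_fn, retry_policy=default_policy)
builder.add_node("llm_call", llm_fn, retry_policy=default_policy)

graph = builder.compile()
```

A policy applies only to the nodes it is attached to; a node without one is not retried.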

---

### **4. Retry Policy Parameters**

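| Parameter          | Purpose                                                      |
| ------------------ | ------------------------------------------------------------ |
| `max_attempts`     | Maximum number of attempts, including the first              |
| `initial_interval` | Delay in seconds before the first retry                      |
| `backoff_factor`   | Delay multiplier between attempts                            |
| `max_interval`     | Upper bound in seconds on the delay between attempts         |
| `jitter`           | Adds a random delay to avoid a thundering herd               |
| `retry_on`         | Exception type(s), or a predicate, that trigger a retry      |

`RetryPolicy` has no per-attempt timeout field. Enforce timeouts inside the node (for example on the HTTP client) so that a hung call raises an exception the policy can retry.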
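
To see how the delay multiplier and jitter interact, here is a minimal sketch of an exponential backoff schedule. It is plain Python written for illustration, not LangGraph's internal code; the function name `backoff_delays`, its parameter names, and the jitter range of up to one second are assumptions.

```python
import random


def backoff_delays(attempts, initial=0.5, factor=2.0, cap=128.0,
                   jitter=False, rng=random.random):
    """Return the wait in seconds before each retry for a run of `attempts` tries."""
    delays = []
    delay = initial
    for _ in range(attempts - 1):  # no wait after the final attempt
        wait = min(delay, cap)     # never wait longer than the cap
        if jitter:
            wait += rng()          # add up to 1 s of random spread
        delays.append(wait)
        delay *= factor            # grow the delay for the next retry
    return delays


print(backoff_delays(5))  # → [0.5, 1.0, 2.0, 4.0]
```

Five attempts produce four waits, because nothing follows the last attempt. With jitter enabled, concurrent workers that failed together retry at slightly different times.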

---

### **5. Node-Specific Retry**

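```python
from langgraph.types import RetryPolicy  # older releases: langgraph.pregel

builder.add_node(
    "llm_call",
    llm_fn,
    # Keyword is `retry_policy` in recent releases, `retry` in older ones.
    retry_policy=RetryPolicy(max_attempts=5, backoff_factor=1.5),
)
```

The policy applies to this node only; other nodes are unaffected.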

---

### **6. Execution Semantics**

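When a node raises an exception:

1. The exception is captured
2. It is checked against `retry_on`; a non-matching exception propagates immediately
3. The attempt counter increments
4. The backoff delay is applied
5. The node re-executes with the same input state
6. On success → continue
7. On exhaustion → propagate the failure

**State remains consistent** because only the update returned by a successful attempt is applied to the graph state. With a checkpointer configured, LangGraph saves the state after each completed step.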
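
The steps above can be sketched as a plain-Python loop. This is an illustration of the control flow, not LangGraph's implementation; `run_with_retry` and its parameters are hypothetical names, and the delay is fixed rather than exponential to keep the sketch short.

```python
import time


def run_with_retry(fn, max_attempts=3, retry_on=(TimeoutError,),
                   delay=0.0, sleep=time.sleep):
    """Call fn() until it succeeds or the attempts are used up."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn()                 # success: hand the result back
        except Exception as exc:
            if not isinstance(exc, retry_on):
                raise                   # not retryable: fail at once
            if attempt >= max_attempts:
                raise                   # exhausted: propagate the failure
            sleep(delay)                # wait before the next attempt


calls = []


def flaky():
    # Fails on the first two calls, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"


print(run_with_retry(flaky, max_attempts=5))  # → ok
```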

---

### **7. Retry + Checkpointing**

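LangGraph integrates retries with **state checkpoints** when a checkpointer is configured.

| Event      | Effect                                              |
| ---------- | --------------------------------------------------- |
| Node fails | The failed attempt's writes are discarded           |
| Retry      | Node re-executes with the same input state          |
| Crash      | Resume from the last checkpoint                     |

This gives **at-least-once execution semantics**: the state update is applied once, from the attempt that succeeds, but the node body may run several times. Side effects inside a node (payments, emails, database writes) should therefore be idempotent.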
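
A retried node body can run more than once before it succeeds, so side effects inside it are worth guarding. Below is a minimal sketch of making a side effect idempotent with a caller-supplied key. The `processed` dictionary and the `charge` function are hypothetical; in practice the store would need to be durable.

```python
processed = {}  # idempotency key -> result; stands in for a durable store


def charge(key, amount):
    """Perform the side effect at most once per idempotency key."""
    if key in processed:
        return processed[key]      # repeat call: reuse the first result
    receipt = f"charged {amount}"  # the real side effect would go here
    processed[key] = receipt
    return receipt


first = charge("order-42", 10)
second = charge("order-42", 10)    # simulates a retry of the same node
print(first == second, len(processed))  # → True 1
```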

---

### **8. Minimal Working Example**

```python
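import random
from typing import TypedDict

from langgraph.graph import END, StateGraph
from langgraph.types import RetryPolicy  # older releases: langgraph.pregel


class State(TypedDict):
    value: str


def unstable_node(state: State) -> dict:
    # Fails about 70% of the time to simulate a transient error.
    if random.random() < 0.7:
        raise TimeoutError("Transient failure")
    return {"value": "success"}


builder = StateGraph(State)
builder.add_node(
    "unstable",
    unstable_node,
    # Keyword is `retry_policy` in recent releases, `retry` in older ones.
    # TimeoutError is listed explicitly because the default predicate
    # does not retry every built-in exception type.
    retry_policy=RetryPolicy(
        max_attempts=5, backoff_factor=1.0, retry_on=TimeoutError
    ),
)
builder.set_entry_point("unstable")
builder.add_edge("unstable", END)

graph = builder.compile()

# Raises TimeoutError if all five attempts fail (0.7 ** 5, about 17% of runs).
print(graph.invoke({"value": ""}))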
```

---

### **9. Retry with Conditional Validation**

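Retries can also enforce **semantic validation** by raising an exception when the output is unacceptable:

```python
def validate_output(state):
    # Raise so that the retry policy treats an invalid answer as a failure.
    if not state["answer"].startswith("Yes"):
        raise ValueError("Invalid output")
    return state
```

Run this check inside the node that calls the LLM, and list `ValueError` in `retry_on`. Each retry then calls the model again, up to `max_attempts` times. Attached to a separate validation node, the policy would only re-check the same unchanged state and fail every time.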
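
Whether a validation error is retried depends on how the retry condition is defined. As a sketch, a predicate can separate errors worth retrying from permanent ones; `RetryPolicy`'s `retry_on` accepts such a callable as well as exception types. The class `InvalidOutputError` and the function `should_retry` are illustrative names, not LangGraph APIs.

```python
class InvalidOutputError(ValueError):
    """Raised when the model's answer fails validation (hypothetical)."""


def should_retry(exc: Exception) -> bool:
    # Retry transient faults and invalid model output; treat everything
    # else (for example a KeyError caused by a bug) as permanent.
    return isinstance(exc, (TimeoutError, ConnectionError, InvalidOutputError))


print(should_retry(InvalidOutputError("no 'Yes' prefix")))  # → True
print(should_retry(KeyError("answer")))                     # → False
```

Using a dedicated exception class keeps unrelated `ValueError`s, such as a parsing bug, from being retried by accident.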

---

### **10. Production Best Practices**

| Guideline                     | Reason                                   |
| ----------------------------- | ---------------------------------------- |
| Retry only transient failures | Permanent errors fail on every attempt   |
| Use exponential backoff       | Reduce load on the failing service       |
| Log every attempt             | Debugging                                |
| Cap attempts per node         | Cost control                             |
| Combine with timeout          | Prevent hanging                          |
| Escalate after exhaustion     | Graceful failure                         |
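
Two of these guidelines, logging every attempt and escalating after exhaustion, can be sketched together in plain Python. The helper `call_with_fallback` and the fallback message are assumptions made for illustration; inside LangGraph the same effect is usually reached by catching the final exception around the graph call or routing to a fallback node.

```python
import logging

logger = logging.getLogger("retry_demo")


def call_with_fallback(fn, fallback, max_attempts=3):
    """Try fn() up to max_attempts times, logging each failure, then escalate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TimeoutError as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
    return fallback()  # all attempts failed: degrade gracefully


def always_times_out():
    raise TimeoutError("upstream unavailable")


print(call_with_fallback(always_times_out, lambda: "queued for human review"))
# → queued for human review
```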

---

### **11. Conceptual Model**

```
Execute → Fail → Discard writes → Wait → Retry → Success → Continue
```

The retry node transforms unreliable components into a **stable execution pipeline**.

---

### **12. When to Use Retry Nodes**

* LLM calls
* API calls
* Network tools
* Data fetch operations
* Validation steps
* Critical business actions, provided they are idempotent (a retried node may run more than once)


### Demonstration

```python
import random
from typing import TypedDict


class State(TypedDict):
    attempt: int
    result: str


attempts = 0  # module-level counter, so the count survives across retries


def unreliable_node(state: State) -> dict:
    global attempts
    attempts += 1
    print(f"Attempt #{attempts}")

    # 70% chance of failure
    if random.random() < 0.7:
        raise TimeoutError("Transient failure")

    # Return updates instead of mutating the input state.
    return {"attempt": attempts, "result": "Success"}
```


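```python
from langgraph.graph import StateGraph, END
from langgraph.types import RetryPolicy  # older releases: langgraph.pregel

builder = StateGraph(State)

# Attach the policy to the node. The keyword is `retry_policy` in recent
# releases and `retry` in older ones.
builder.add_node(
    "unstable",
    unreliable_node,
    retry_policy=RetryPolicy(
        max_attempts=5, backoff_factor=1.0, retry_on=TimeoutError
    ),
)
builder.set_entry_point("unstable")
builder.add_edge("unstable", END)

graph = builder.compile()
```

```python
result = graph.invoke({"attempt": 0, "result": ""})

print("Final State:", result)
```

Retry settings passed through `config` (for example `config={"max_retries": 5, "retry_on": [TimeoutError], "backoff": 1.0}`) are not read as a retry policy. A run configured that way, with no policy on the node, stops at the first failure:

```
Attempt #1

TimeoutError: Transient failure
```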