```{contents}
```
## Self-Healing Graph

A **Self-Healing Graph** is a LangGraph design pattern in which the workflow **automatically detects failures, diagnoses their causes, applies corrective actions, and resumes execution**, escalating to a human only when automated repair is exhausted.
It enables **resilient, long-running, autonomous LLM systems**.

---

### **1. Why Self-Healing Is Needed**

LLM systems in production face:

| Failure Type   | Examples                        |
| -------------- | ------------------------------- |
| Model errors   | hallucinations, invalid output  |
| Tool failures  | API down, timeout, bad response |
| Logic errors   | wrong plan, infinite loop       |
| Data issues    | missing fields, corrupted state |
| Infrastructure | memory loss, crash              |

A self-healing graph **closes the loop between failure and recovery**.

---

### **2. Conceptual Loop**

```
Execute → Detect Failure → Diagnose → Repair → Resume
   ↑                                     ↓
   └────────────── Self-Healing Cycle ───┘
```

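In LangGraph, this loop is implemented as a **cycle in the graph**: a conditional edge sends failed runs to a repair step, and a normal edge leads from the repair step back to the execution node.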
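The same loop can be sketched in plain Python, without LangGraph, to make the control flow explicit. This is a minimal illustration under stated assumptions: `run_with_healing`, `flaky_task`, and `switch_endpoint` are hypothetical names, and the "diagnosis" is reduced to a single repair callback.

```python
from typing import Callable


def run_with_healing(
    task: Callable[[dict], str],
    repair: Callable[[dict, Exception], dict],
    context: dict,
    max_repairs: int = 3,
) -> str:
    """Execute, detect failure, repair, and resume, bounded by max_repairs."""
    repairs = 0
    while True:
        try:
            return task(context)            # Execute
        except Exception as exc:            # Detect failure
            if repairs >= max_repairs:
                raise                       # Budget exhausted: surface the error
            context = repair(context, exc)  # Diagnose and repair
            repairs += 1                    # Resume on the next loop iteration


# Hypothetical task: fails unless it is pointed at the backup endpoint.
def flaky_task(ctx: dict) -> str:
    if ctx.get("endpoint") != "backup":
        raise ConnectionError("primary endpoint unreachable")
    return f"fetched from {ctx['endpoint']}"


# Hypothetical repair: switch to the backup endpoint.
def switch_endpoint(ctx: dict, exc: Exception) -> dict:
    return {**ctx, "endpoint": "backup"}


outcome = run_with_healing(flaky_task, switch_endpoint, {"endpoint": "primary"})
print(outcome)
```

The graph version replaces the `while` loop with edges, which lets each step be checkpointed and inspected separately.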

---

### **3. Core Components**

| Component      | Role                    |
| -------------- | ----------------------- |
| Execution Node | Runs task               |
| Monitor Node   | Validates output        |
| Failure State  | Encoded error condition |
| Diagnoser Node | Determines cause        |
| Repair Node    | Applies fix             |
| Router         | Decides next step       |
| Checkpoint     | Stores safe state       |
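As an illustration of the Monitor Node and Failure State rows, the sketch below validates a node's output and encodes any problem as a short error string in state. It assumes the execution node produces a JSON string; `MonitorState`, `REQUIRED_FIELDS`, and the error codes are hypothetical and not part of LangGraph.

```python
import json
from typing import Optional, TypedDict


class MonitorState(TypedDict):
    result: Optional[str]
    error: Optional[str]


# Hypothetical contract for the execution node's JSON output.
REQUIRED_FIELDS = ("answer", "confidence")


def monitor(state: MonitorState) -> dict:
    """Validate the result and return a partial state update with an error code."""
    raw = state.get("result")
    if raw is None:
        return {"error": "missing_result"}
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "invalid_json"}
    if not isinstance(payload, dict):
        return {"error": "not_an_object"}
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return {"error": "missing_fields:" + ",".join(missing)}
    return {"error": None}  # Output passed validation


print(monitor({"result": '{"answer": "42"}', "error": None}))
```

Encoding failures as stable codes rather than free text keeps the router and diagnoser simple, since they can branch on the code.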

---

### **4. State Schema**

```python
from typing import TypedDict

class State(TypedDict):
    input: str
    result: str
    error: str | None
    retries: int
    healed: bool
```
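To show how these fields might be used, here is a small sketch of an initial state and a repair step that sets `healed`. The helpers `initial_state` and `repair` are hypothetical. The dictionary merge only approximates LangGraph's default behavior, in which a node returns a partial update and keys without a reducer are overwritten.

```python
from typing import Optional, TypedDict


class State(TypedDict):
    input: str
    result: str
    error: Optional[str]
    retries: int
    healed: bool


def initial_state(user_input: str) -> State:
    """Hypothetical helper: a clean starting state for one run."""
    return {"input": user_input, "result": "", "error": None,
            "retries": 0, "healed": False}


def repair(state: State) -> dict:
    """Hypothetical repair node: returns only the keys it changes."""
    return {"error": None, "retries": state["retries"] + 1, "healed": True}


state = initial_state("summarize report")
state_after_failure = {**state, "error": "timeout"}
# Approximate the framework's merge of a partial update into the state.
state_after_repair = {**state_after_failure, **repair(state_after_failure)}
print(state_after_repair)
```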

---

### **5. Minimal Self-Healing Graph**

```python
from langgraph.graph import StateGraph, END
from typing import TypedDict

class State(TypedDict):
    result: str
    error: str | None
    retries: int

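def execute(state):
    # Simulate a tool that fails until two repair cycles have run.
    if state["retries"] < 2:
        return {"error": "Tool failed"}
    return {"result": "Success", "error": None}

def diagnose(state):
    # Placeholder repair step: it only counts the attempt.
    return {"retries": state["retries"] + 1}

def route(state):
    # Loop back through diagnose while an error is present and the budget allows.
    if state["error"] and state["retries"] < 3:
        return "diagnose"
    return END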

builder = StateGraph(State)

builder.add_node("execute", execute)
builder.add_node("diagnose", diagnose)

builder.set_entry_point("execute")
builder.add_edge("diagnose", "execute")

builder.add_conditional_edges("execute", route, {
    "diagnose": "diagnose",
    END: END
})

graph = builder.compile()
```

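This graph loops from `execute` through `diagnose` and back until the task succeeds or `retries` reaches 3. Here `diagnose` only increments the counter; a real repair node would also change something (the prompt, the tool, or the input) before the next attempt.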

---

### **6. Healing Strategies**

| Strategy          | Example             |
| ----------------- | ------------------- |
| Retry             | Re-execute tool     |
| Replan            | Generate new plan   |
| Model switch      | Change LLM          |
| Prompt repair     | Fix format          |
| Data repair       | Fill missing fields |
| Tool substitution | Use backup service  |
| Human fallback    | Manual intervention |
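One way to combine these strategies is a repair step that dispatches on the diagnosed error kind and falls back to a human when nothing else applies. The sketch below is illustrative only: the error kinds, the `backup_search` tool name, the state keys, and `MAX_RETRIES` are all assumptions.

```python
from typing import Callable, Dict

MAX_RETRIES = 3  # Assumed retry budget


def retry(state: dict) -> dict:
    return {}  # Nothing to change: simply run the step again


def prompt_repair(state: dict) -> dict:
    return {"prompt": state["prompt"] + "\nRespond with valid JSON only."}


def tool_substitution(state: dict) -> dict:
    return {"tool": "backup_search"}  # Hypothetical backup tool name


def human_fallback(state: dict) -> dict:
    return {"needs_human": True}


# Hypothetical mapping from diagnosed error kind to healing strategy.
STRATEGIES: Dict[str, Callable[[dict], dict]] = {
    "timeout": retry,
    "invalid_output": prompt_repair,
    "tool_down": tool_substitution,
}


def repair(state: dict) -> dict:
    """Pick a strategy for the error kind and return a partial state update."""
    if state["retries"] >= MAX_RETRIES:
        return human_fallback(state)  # Budget exhausted: escalate
    strategy = STRATEGIES.get(state["error_kind"], human_fallback)
    return {**strategy(state), "retries": state["retries"] + 1}


example = repair({"prompt": "List three risks.", "tool": "search",
                  "error_kind": "invalid_output", "retries": 0})
print(example)
```

Unknown error kinds go straight to the human fallback, which avoids spending the retry budget on failures the graph has no strategy for.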

---

### **7. Production Controls**

| Mechanism        | Purpose                 |
| ---------------- | ----------------------- |
| Retry limits     | Prevent infinite loops  |
| Timeouts         | Avoid deadlock          |
| Checkpointing    | Resume safely           |
| Audit logs       | Debug & compliance      |
| Circuit breakers | Stop cascading failures |
| Health metrics   | Monitor reliability     |

```python
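# recursion_limit caps the number of graph steps in one run;
# exceeding it raises GraphRecursionError instead of looping forever.
graph.invoke(initial_state, config={"recursion_limit": 10})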
```
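The circuit breaker row can be illustrated with a small stdlib-only class. This is a sketch under assumptions, not a LangGraph feature: `CircuitBreaker`, its thresholds, and the injectable `clock` parameter are hypothetical, and a node would call `allow()` before invoking a tool and record the outcome afterwards.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down period has passed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self._opened_at is None:
            return True  # Closed: calls flow normally
        # Open: allow calls again only once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # Close the breaker

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # Open, or reopen after a failed trial


breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # The breaker is open immediately after the second failure
```

Injecting the clock keeps the class easy to test, since elapsed time can be simulated instead of waited for.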

---

### **8. Enterprise Use Cases**

| System                | Benefit                      |
| --------------------- | ---------------------------- |
| Autonomous agents     | Stable long-running behavior |
| Data pipelines        | Auto-recovery                |
| Customer support bots | Fault tolerance              |
| DevOps automation     | Safe execution               |
| Cybersecurity         | Automatic remediation        |

---

### **9. Mental Model**

A Self-Healing Graph is a **closed-loop control system**:

> **Plan → Execute → Verify → Repair → Continue**

This is the foundation of **reliable autonomous AI systems**.


### Demonstration

```python
from langgraph.graph import StateGraph, END
from typing import TypedDict

# -----------------------------
# 1. Define State
# -----------------------------
class State(TypedDict):
    result: str | None
    error: str | None
    retries: int

# -----------------------------
# 2. Execution Node (Fails first 2 times)
# -----------------------------
def execute(state: State):
    if state["retries"] < 2:
        return {"error": "External tool failure", "result": None}
    return {"error": None, "result": "Task completed successfully"}

# -----------------------------
# 3. Diagnosis & Repair Node
# -----------------------------
def diagnose_and_repair(state: State):
    print(f"Healing attempt {state['retries'] + 1}")
    return {"retries": state["retries"] + 1}

# -----------------------------
# 4. Router Logic
# -----------------------------
def router(state: State):
    if state["error"] and state["retries"] < 3:
        return "heal"
    return END

# -----------------------------
# 5. Build Self-Healing Graph
# -----------------------------
builder = StateGraph(State)

builder.add_node("execute", execute)
builder.add_node("heal", diagnose_and_repair)

builder.set_entry_point("execute")
builder.add_edge("heal", "execute")

builder.add_conditional_edges("execute", router, {
    "heal": "heal",
    END: END
})

graph = builder.compile()

# -----------------------------
# 6. Run
# -----------------------------
final_state = graph.invoke({"result": None, "error": None, "retries": 0})
print("\nFinal State:", final_state)
```

Output:

```
Healing attempt 1
Healing attempt 2

Final State: {'result': 'Task completed successfully', 'error': None, 'retries': 2}
```
