```{contents}
```
## Alerting

**Alerting** in LangGraph refers to the **systematic detection, signaling, and handling of abnormal, risky, or significant events during graph execution**, enabling operators and systems to **respond in real time** to failures, performance degradation, policy violations, and business-critical conditions.

Alerting is part of what makes a LangGraph workflow operable in production: problems in long-running runs surface while there is still time to act on them.

---

### **1. Why Alerting Is Required in LangGraph**

LLM workflows are:

* Long-running
* Non-deterministic
* Tool-dependent
* Cost-sensitive
* Business-critical

Failures cannot remain silent.

| Without Alerting          | With Alerting        |
| ------------------------- | -------------------- |
| Silent failures           | Immediate visibility |
| Hidden cost explosions    | Budget protection    |
| Undetected hallucinations | Safety enforcement   |
| Manual debugging          | Automated response   |

---

### **2. Alerting Architecture**

```
Graph Execution
      |
State + Events
      |
Alert Conditions
      |
Alert Engine
      |
Channels (Slack, Email, PagerDuty, Webhooks)
```

Alerting is **orthogonal to execution**: it observes execution without altering control flow unless configured to do so.
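The pipeline above can be sketched in plain Python. This is a minimal illustration, not a LangGraph API: `Alert`, `AlertEngine`, `token_budget_rule`, and the event fields are all hypothetical names chosen for this example, and the "channel" is just a list standing in for Slack or email.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Alert:
    name: str
    message: str


@dataclass
class AlertEngine:
    # Each rule maps an event dict to an Alert, or None when nothing is wrong.
    rules: list[Callable[[dict], Optional[Alert]]] = field(default_factory=list)
    # Each channel is any callable that delivers an Alert (Slack, email, ...).
    channels: list[Callable[[Alert], None]] = field(default_factory=list)

    def process(self, event: dict) -> list[Alert]:
        """Evaluate every rule against one event and fan out the alerts."""
        fired = []
        for rule in self.rules:
            alert = rule(event)
            if alert is not None:
                fired.append(alert)
        for alert in fired:
            for send in self.channels:
                send(alert)
        return fired


def token_budget_rule(event: dict) -> Optional[Alert]:
    # Hypothetical rule: the 50,000 threshold mirrors the cost example later on.
    usage = event.get("token_usage", 0)
    if usage > 50_000:
        return Alert("token_budget", f"token usage {usage} exceeds 50000")
    return None


delivered = []  # stand-in channel that just collects alerts
engine = AlertEngine(rules=[token_budget_rule], channels=[delivered.append])
engine.process({"node": "llm", "token_usage": 12_000})  # below budget, no alert
engine.process({"node": "llm", "token_usage": 66_712})  # above budget, one alert
print([a.name for a in delivered])
```

Because the engine only reads events and calls channels, it leaves the graph's control flow untouched, which is the sense in which alerting is orthogonal to execution.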

---

### **3. Alert Triggers in LangGraph**

Alerts are generated from **runtime signals**:

| Category    | Examples                                       |
| ----------- | ---------------------------------------------- |
| Execution   | Node failure, timeout, infinite loop           |
| State       | Invalid state, missing fields, corruption      |
| Performance | Latency spike, retry storm                     |
| Cost        | Token overrun, model budget breach             |
| Safety      | Policy violation, unsafe tool usage            |
| Quality     | Low confidence output, hallucination detection |
| Business    | SLA breach, workflow failure                   |
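As a rough sketch, several of these categories can be expressed as predicates over a state snapshot. The field names (`steps`, `latency_s`, `confidence`, and so on) and every threshold below are assumptions made for illustration; a real graph would use whatever its own state schema records.

```python
# Each rule is a predicate over a state snapshot (a plain dict).
# All field names and thresholds here are illustrative assumptions.
RULES = {
    "execution": lambda s: s.get("steps", 0) > s.get("max_steps", 25),
    "state": lambda s: "token_usage" not in s,
    "performance": lambda s: s.get("latency_s", 0.0) > 30.0,
    "cost": lambda s: s.get("token_usage", 0) > 50_000,
    "quality": lambda s: s.get("confidence", 1.0) < 0.5,
}


def triggered(snapshot: dict) -> list[str]:
    """Return the categories whose rule fires for this snapshot."""
    return [name for name, check in RULES.items() if check(snapshot)]


print(triggered({"token_usage": 60_000, "latency_s": 1.0, "steps": 3}))
print(triggered({"token_usage": 100, "latency_s": 45.0, "confidence": 0.2}))
```

Keeping rules as data makes it easy to add, remove, or tune a trigger without touching the nodes that produce the signals.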

---

### **4. Alerting Implementation Pattern**

#### **A. Define Alert Conditions**

```python
TOKEN_BUDGET = 50_000  # alert threshold, in tokens

def cost_alert(state):
    # True once cumulative token usage exceeds the budget.
    return state["token_usage"] > TOKEN_BUDGET
```

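#### **B. Attach Alert Node**

```python
def alert_node(state):
    # send_slack is a placeholder for your own notification helper.
    send_slack(f"High cost detected: {state['token_usage']}")
    return {}  # no state update; this node only notifies
```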

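#### **C. Integrate into Graph**

```python
# Assumes "monitor" and "next_step" are already registered on the builder.
builder.add_node("alert", alert_node)

# After "monitor" runs, route to "alert" when the cost condition holds.
builder.add_conditional_edges(
    "monitor",
    lambda s: "alert" if cost_alert(s) else "next_step",
    {"alert": "alert", "next_step": "next_step"}
)
```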

---

### **5. Centralized Alert Manager (Production)**

In production, alerting is implemented as a **cross-cutting service**:

```
Graph Runtime → Event Bus → Alert Manager → Notification Systems
```

#### Responsibilities

* Event aggregation
* Rule evaluation
* De-duplication
* Throttling
* Escalation
* Audit logging
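Two of these responsibilities, de-duplication and throttling, are easy to get subtly wrong, so a small sketch may help. The `AlertManager` class below is hypothetical and kept in memory; the window sizes are arbitrary, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import deque


class AlertManager:
    """In-memory sketch: de-duplicates by key, throttles globally, logs all."""

    def __init__(self, dedup_window_s=300.0, max_per_minute=5, clock=time.monotonic):
        self.dedup_window_s = dedup_window_s
        self.max_per_minute = max_per_minute
        self.clock = clock
        self._last_sent = {}    # alert key -> time it was last delivered
        self._recent = deque()  # delivery timestamps within the last minute
        self.audit_log = []     # every decision, delivered or not

    def submit(self, key, message):
        now = self.clock()
        # Drop timestamps that have aged out of the one-minute throttle window.
        while self._recent and now - self._recent[0] >= 60.0:
            self._recent.popleft()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window_s:
            decision = "deduplicated"
        elif len(self._recent) >= self.max_per_minute:
            decision = "throttled"
        else:
            decision = "sent"
            self._last_sent[key] = now
            self._recent.append(now)
        self.audit_log.append((now, key, decision, message))
        return decision


fake_time = [0.0]  # mutable cell so the test clock can be advanced
manager = AlertManager(dedup_window_s=300, max_per_minute=2, clock=lambda: fake_time[0])
results = [
    manager.submit("cost", "budget exceeded"),   # first of its kind
    manager.submit("cost", "budget exceeded"),   # same key inside the window
    manager.submit("latency", "slow node"),      # new key, still under the cap
    manager.submit("safety", "policy hit"),      # cap of 2 per minute reached
]
fake_time[0] = 61.0
results.append(manager.submit("safety", "policy hit"))  # throttle window expired
print(results)
```

Suppressed alerts are still written to `audit_log`, so the audit trail stays complete even when notifications are dropped. Escalation and persistent storage are left out of this sketch.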

---

### **6. Alert Severity Levels**

| Level    | Meaning       |
| -------- | ------------- |
| INFO     | Observational |
| WARNING  | Degradation   |
| ERROR    | Failure       |
| CRITICAL | System outage |
| SECURITY | Policy breach |
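One way to make these levels usable in code is an enum plus a routing table. The level names come from the table above; the mapping from level to channel is purely an assumption for illustration, and each team would choose its own.

```python
from enum import Enum


class Severity(Enum):
    INFO = "INFO"
    WARNING = "WARNING"
    ERROR = "ERROR"
    CRITICAL = "CRITICAL"
    SECURITY = "SECURITY"


# Hypothetical routing policy: which channels each level notifies.
ROUTES = {
    Severity.INFO: ["log"],
    Severity.WARNING: ["log", "slack"],
    Severity.ERROR: ["log", "slack"],
    Severity.CRITICAL: ["log", "slack", "pagerduty"],
    Severity.SECURITY: ["log", "slack", "pagerduty"],
}


def channels_for(severity: Severity) -> list[str]:
    """Look up the channels configured for a severity level."""
    return ROUTES[severity]


print(channels_for(Severity.WARNING))
print(channels_for(Severity("CRITICAL")))  # lookup from a string value
```

A plain `Enum` is used rather than an ordered one because SECURITY describes a kind of event, not a point on the INFO-to-CRITICAL scale.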

---

### **7. Automated Responses**

Alerts can trigger **control actions**:

| Action                 | Effect             |
| ---------------------- | ------------------ |
| Pause graph            | Prevent damage     |
| Rollback state         | Recover            |
| Switch model           | Degrade gracefully |
| Request human approval | Safety             |
| Kill workflow          | Emergency stop     |

```python
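from langgraph.types import interrupt  # use inside a node; needs a checkpointer

if alert.severity == "CRITICAL":
    interrupt({"reason": "critical alert"})  # pause the run until it is resumed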
```
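Most of the actions in the table can be modeled as functions that return state updates, which is how a LangGraph node reports changes. The sketch below is framework-free and hypothetical throughout: the handler names, the `status` values, and the `model` / `fallback_model` fields are assumptions. Rollback is omitted because it depends on having a saved checkpoint to restore.

```python
def pause(state):
    return {"status": "paused"}


def switch_model(state):
    # Fall back to a cheaper model if one is configured; otherwise keep the current one.
    return {"model": state.get("fallback_model", state["model"])}


def request_approval(state):
    return {"status": "awaiting_approval"}


def kill(state):
    return {"status": "killed"}


# Dispatch table from action name to handler.
ACTIONS = {
    "pause": pause,
    "switch_model": switch_model,
    "request_approval": request_approval,
    "kill": kill,
}


def apply_action(state: dict, action: str) -> dict:
    """Return a new state with the chosen action's updates merged in."""
    updates = ACTIONS[action](state)
    return {**state, **updates}


state = {"status": "running", "model": "large", "fallback_model": "small"}
state = apply_action(state, "switch_model")
state = apply_action(state, "request_approval")
print(state)
```

Returning a new dict instead of mutating the input keeps each action easy to test and to log alongside the alert that triggered it.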

---

### **8. Observability Integration**

| Tool       | Purpose           |
| ---------- | ----------------- |
| LangSmith  | Execution tracing |
| Prometheus | Metrics           |
| Grafana    | Dashboards        |
| PagerDuty  | On-call           |
| Slack      | Team alerts       |
| ELK Stack  | Log analysis      |
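As one concrete integration, a Slack incoming webhook accepts a JSON body with a `text` field. The sketch below only builds the HTTP request using the standard library; the webhook URL is a placeholder, `build_slack_request` is a hypothetical helper name, and nothing is sent unless the final commented-out call is enabled.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: substitute the webhook URL issued for your workspace.
req = build_slack_request(
    "https://hooks.slack.example/services/PLACEHOLDER",
    "High cost detected: 66712 tokens",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To deliver the alert for real:
# urllib.request.urlopen(req, timeout=5)
```

Separating request construction from delivery makes the notification logic testable without network access.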

---

### **9. Example: Production Alert Scenario**

**Case: Cost Explosion**

1. LLM token usage exceeds threshold
2. Alert engine triggers
3. Slack + PagerDuty notified
4. Graph paused
5. Model downgraded
6. Human review required

---

### **10. Alerting vs Exception Handling**

| Feature     | Exception Handling | Alerting         |
| ----------- | ------------------ | ---------------- |
| Scope       | Code-level         | System-level     |
| Purpose     | Stop execution     | Notify & respond |
| Audience    | Developer          | Operator / SRE   |
| Persistence | Local              | Logged & audited |
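The two mechanisms work together rather than replacing each other. A minimal sketch, assuming a hypothetical `alert_on_failure` decorator and an in-memory `alert_log` standing in for a durable alert store: the wrapper records an alert for operators, then re-raises so normal exception handling still decides what happens to the run.

```python
import functools

alert_log = []  # stand-in for a durable, audited alert store


def alert_on_failure(node_fn):
    """Record an alert when a node raises, then let the exception propagate."""

    @functools.wraps(node_fn)
    def wrapper(state):
        try:
            return node_fn(state)
        except Exception as exc:
            alert_log.append(
                {"node": node_fn.__name__, "error": repr(exc), "severity": "ERROR"}
            )
            raise  # exception handling still controls the run itself

    return wrapper


@alert_on_failure
def flaky_tool_node(state):
    # Simulated failure, for demonstration only.
    raise TimeoutError("tool did not respond")


try:
    flaky_tool_node({})
except TimeoutError:
    pass  # the caller handles the exception as usual

print(alert_log)
```

The exception is the code-level signal; the logged entry is the system-level one that an operator or SRE would see.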

---

### **11. Design Principles**

* Alerts must be **actionable**
* Minimize noise
* Tie alerts to **business impact**
* Enforce **automatic safety responses**
* Maintain **auditability**

---

### **12. Mental Model**

> **Alerting = Nervous System of a LangGraph Application**

It senses pain, triggers reflexes, and calls for help.


### Demonstration

```python
# --- One-Cell LangGraph Alerting Demo ---

from langgraph.graph import StateGraph, END
from typing import TypedDict
import random

# ---------- 1. Define State ----------

class State(TypedDict):
    token_usage: int
    status: str

# ---------- 2. Simulated Nodes ----------

def llm_node(state):
    # Simulate an LLM call that consumes a random number of tokens.
    new_tokens = random.randint(5_000, 20_000)
    total = state["token_usage"] + new_tokens
    print(f"LLM used {new_tokens} tokens | total = {total}")
    return {"token_usage": total, "status": "running"}

def monitor_node(state):
    # Pass-through node; the routing decision is made on its outgoing edge.
    return {}

# ---------- 3. Alert Logic ----------

MAX_TOKENS = 50_000

def alert_condition(state):
    # Route to the alert node once the token budget is exceeded.
    return "alert" if state["token_usage"] > MAX_TOKENS else "continue"

def alert_node(state):
    print("🚨 ALERT: Token budget exceeded!")
    print("Pausing workflow and requesting human review.")
    return {"status": "paused"}

# ---------- 4. Build Graph ----------

builder = StateGraph(State)

builder.add_node("llm", llm_node)
builder.add_node("monitor", monitor_node)
builder.add_node("alert", alert_node)

builder.set_entry_point("llm")
builder.add_edge("llm", "monitor")

# Loop back to the LLM until the alert condition fires.
builder.add_conditional_edges(
    "monitor",
    alert_condition,
    {"alert": "alert", "continue": "llm"}
)

builder.add_edge("alert", END)

graph = builder.compile()

# ---------- 5. Run ----------

# The worst case needs 11 LLM calls (5,000 tokens each), which is more than
# 20 steps; a limit of 10 can raise GraphRecursionError on unlucky runs.
result = graph.invoke({"token_usage": 0, "status": "start"},
                      config={"recursion_limit": 30})

print("\nFinal State:", result)
```


Sample output (token counts vary between runs):

```
LLM used 10769 tokens | total = 10769
LLM used 19857 tokens | total = 30626
LLM used 17246 tokens | total = 47872
LLM used 18840 tokens | total = 66712
🚨 ALERT: Token budget exceeded!
Pausing workflow and requesting human review.

Final State: {'token_usage': 66712, 'status': 'paused'}
```
