```{contents}
```
## Latency Optimization 

**Latency optimization** in LangGraph is the systematic process of **minimizing the end-to-end response time** of an LLM workflow by tuning graph structure, execution flow, model usage, tool invocation, memory access, and infrastructure behavior, while preserving correctness and reliability.

---

### **1. Where Latency Comes From in LangGraph**

| Source           | Description                 |
| ---------------- | --------------------------- |
| LLM Inference    | Model compute + network     |
| Tool Calls       | External APIs, databases    |
| Graph Overhead   | Scheduling & routing        |
| State Operations | Serialization, persistence  |
| Memory Access    | Vector search, DB IO        |
| Network          | Cross-service communication |
| Retries          | Failure recovery delays     |

Total latency is the **sum of node execution times along the critical path, plus orchestration overhead**. Nodes that run in parallel contribute only the time of the slowest branch.
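As a rough illustration of how parallel branches change the total, the sketch below compares running every node back to back with the slowest path through a small fan-out graph. The node names, latencies, and the `critical_path` helper are hypothetical and only illustrate the arithmetic; this is not LangGraph code.

```python
from functools import lru_cache

# Hypothetical per-node latencies in seconds (illustrative numbers only).
node_latency = {"A": 0.2, "B": 1.0, "C": 0.6, "D": 0.3}
# Edges of a small DAG: A fans out to B and C, which both feed D.
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

@lru_cache(maxsize=None)
def critical_path(node: str) -> float:
    """Latency of the slowest path that starts at `node`."""
    downstream = [critical_path(nxt) for nxt in successors[node]]
    return node_latency[node] + max(downstream, default=0.0)

sequential_total = sum(node_latency.values())  # every node run back to back
parallel_total = critical_path("A")            # B and C overlap

print(f"sequential: {sequential_total:.1f}s, critical path: {parallel_total:.1f}s")
```

With these numbers the faster branch `C` disappears from the total, because only the slower branch `B` sits on the critical path.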

---

### **2. Latency Optimization Strategy Map**

```
Graph Design
   ↓
Execution Model
   ↓
Model & Prompt Design
   ↓
Tool & Memory Optimization
   ↓
Infrastructure Optimization
```

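Work through the layers from top to bottom: structural changes to the graph usually pay off before tuning at the lower layers does.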

---

### **3. Graph-Level Optimizations**

### **3.1 Reduce Critical Path Length**

Minimize sequential nodes on the main execution path.

❌ Slow:

```
A → B → C → D → E
```

✅ Faster:

```
A → (B || C) → D → E
```

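```python
# add_edge takes a single target node, so fan out with one edge per branch
builder.add_edge("A", "B")
builder.add_edge("A", "C")
builder.add_edge(["B", "C"], "D")   # join: D runs after both B and C finish
```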

---

### **3.2 Parallel Execution**

Execute independent tasks concurrently.

Use **async nodes** for independent, I/O-bound operations, and run the graph with `ainvoke` or `astream` so their waiting time can overlap.

```python
async def tool_a(state): ...
async def tool_b(state): ...
```
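The effect of overlapping two independent I/O-bound calls can be sketched with plain `asyncio`, outside of any graph. The sleep durations and the `run_both` helper are assumptions made for illustration; the `tool_a` and `tool_b` bodies stand in for real tool calls.

```python
import asyncio
import time

async def tool_a(state: dict) -> dict:
    await asyncio.sleep(0.2)  # stand-in for an I/O-bound call
    return {"a": "result A"}

async def tool_b(state: dict) -> dict:
    await asyncio.sleep(0.2)  # stand-in for another independent call
    return {"b": "result B"}

async def run_both(state: dict) -> tuple[dict, float]:
    start = time.perf_counter()
    # Both coroutines wait concurrently, so elapsed time is close to
    # the slower call rather than the sum of the two.
    results = await asyncio.gather(tool_a(state), tool_b(state))
    merged = {key: value for result in results for key, value in result.items()}
    return merged, time.perf_counter() - start

merged, elapsed = asyncio.run(run_both({}))
print(merged, f"{elapsed:.2f}s")
```

Run sequentially, the two calls would take roughly 0.4 s; gathered, they finish in roughly 0.2 s.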

---

### **3.3 Early Exit Routing**

Terminate execution as soon as a good-enough result is available.

```python
def router(state):
    if state["confidence"] > 0.95:
        return END
    return "verify"
```

---

### **3.4 Limit Cycles**

Every loop iteration adds another full pass through the nodes in the cycle, so cap the number of steps.

```python
graph.invoke(data, config={"recursion_limit": 5})
```
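Since hitting `recursion_limit` aborts the run with an error rather than returning a result, it can help to also track an explicit attempt budget in state so the loop exits gracefully. The sketch below shows that idea in plain Python; `MAX_ATTEMPTS`, `refine`, and `should_continue` are hypothetical names, and the `while` loop only simulates what a conditional edge would do.

```python
MAX_ATTEMPTS = 3  # hypothetical budget for refinement passes

def refine(state: dict) -> dict:
    # Stand-in for an LLM refinement step that improves confidence a little.
    return {
        "attempts": state["attempts"] + 1,
        "confidence": state["confidence"] + 0.1,
    }

def should_continue(state: dict) -> str:
    # Routing function: stop when the answer is good enough
    # or when the attempt budget is spent.
    if state["confidence"] >= 0.9 or state["attempts"] >= MAX_ATTEMPTS:
        return "done"
    return "refine"

state = {"attempts": 0, "confidence": 0.3}
while should_continue(state) == "refine":
    state = {**state, **refine(state)}

print(state["attempts"])
```

Here the loop stops on the budget, not on confidence, which keeps worst-case latency bounded at three refinement passes.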

---

### **4. Model-Level Optimizations**

### **4.1 Right-Size the Model**

| Task       | Model        |
| ---------- | ------------ |
| Routing    | Small / fast |
| Extraction | Medium       |
| Reasoning  | Large        |

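```python
from langchain_openai import ChatOpenAI

router_llm = ChatOpenAI(model="gpt-4o-mini")  # small and fast, for routing
reason_llm = ChatOpenAI(model="gpt-4o")       # larger, for reasoning steps
```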

---

### **4.2 Prompt Compression**

* Remove redundancy
* Use structured prompts
* Limit the context sent to the model (fewer input tokens means faster inference)
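One simple way to limit context is to keep only the most recent messages that fit a budget. The sketch below uses character count as a crude stand-in for tokens; `trim_history` and the budget value are assumptions for illustration, and a real system would count tokens with the model's tokenizer.

```python
def trim_history(messages: list[str], max_chars: int) -> list[str]:
    """Keep the most recent messages whose combined length fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        if used + len(message) > max_chars:
            break
        kept.append(message)
        used += len(message)
    return list(reversed(kept))  # restore chronological order

history = ["first question", "long answer " * 20, "follow-up", "short reply"]
trimmed = trim_history(history, max_chars=40)
print(trimmed)
```

The long answer in the middle exceeds the budget, so only the two newest messages are kept.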

---

### **4.3 Streaming**

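Reduce **perceived latency** by emitting output as soon as it is produced instead of waiting for the full run. By default `stream` yields an update as each node finishes; token-level output from an LLM node needs `stream_mode="messages"`.

```python
for chunk in graph.stream(data):
    print(chunk)
```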
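The gain from streaming is best measured as time to first token rather than total time. The sketch below simulates a token stream with an async generator; `fake_token_stream` and its delays are hypothetical and stand in for a real model stream.

```python
import asyncio
import time

async def fake_token_stream(n_tokens: int = 5, delay: float = 0.05):
    # Stand-in for a model that emits one token every `delay` seconds.
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        yield f"tok{i}"

async def measure() -> tuple[float, float, list[str]]:
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    async for token in fake_token_stream():
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        tokens.append(token)
    total = time.perf_counter() - start
    return first_token_at, total, tokens

first_token_at, total, tokens = asyncio.run(measure())
print(f"first token: {first_token_at:.2f}s, full response: {total:.2f}s")
```

The total time is unchanged by streaming, but the user sees output after the first token instead of after the last.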

---

### **5. Tool & Memory Optimization**

### **5.1 Cache Everything**

| Layer         | Cache           |
| ------------- | --------------- |
| LLM responses | Redis           |
| Tool calls    | HTTP cache      |
| Vector search | Embedding cache |
| State         | In-memory store |
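As a minimal sketch of the tool-call row, the class below caches results in process memory with a time-to-live. `TTLCache`, `slow_lookup`, and the 60-second TTL are hypothetical; a shared cache such as Redis would follow the same get-or-compute pattern.

```python
import time
from typing import Any, Callable

class TTLCache:
    """In-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[Any, tuple[float, Any]] = {}

    def get_or_compute(self, key: Any, compute: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]          # fresh entry: skip the slow call
        value = compute()          # miss or expired: do the real work
        self._store[key] = (now, value)
        return value

call_count = {"n": 0}

def slow_lookup(city: str) -> str:
    call_count["n"] += 1
    time.sleep(0.1)  # stand-in for a slow external API
    return f"weather for {city}"

cache = TTLCache(ttl_seconds=60)
key = ("weather", "Paris")
first = cache.get_or_compute(key, lambda: slow_lookup("Paris"))
second = cache.get_or_compute(key, lambda: slow_lookup("Paris"))
print(first, call_count["n"])
```

The second request is served from the cache, so the slow call runs only once.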

---

### **5.2 Batch Operations**

Batch tool calls and vector queries so that many items share one round trip instead of paying for one each.
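A batching helper might look like the sketch below, which splits items into fixed-size chunks and sends the chunks concurrently. `embed_batch` is a hypothetical endpoint that costs one round trip per batch, and the batch size of 4 is arbitrary.

```python
import asyncio

async def embed_batch(texts: list[str]) -> list[int]:
    """Hypothetical batched endpoint: one round trip per batch."""
    await asyncio.sleep(0.1)  # simulated network latency
    return [len(text) for text in texts]  # stand-in for real embeddings

def chunked(items: list[str], size: int) -> list[list[str]]:
    return [items[i:i + size] for i in range(0, len(items), size)]

async def embed_all(texts: list[str], batch_size: int = 4) -> list[int]:
    batches = chunked(texts, batch_size)
    # gather preserves input order, so results line up with `texts`.
    results = await asyncio.gather(*(embed_batch(batch) for batch in batches))
    return [value for batch in results for value in batch]

texts = [f"doc-{i}" for i in range(10)]
vectors = asyncio.run(embed_all(texts))
print(len(chunked(texts, 4)), "round trips for", len(texts), "items")
```

Ten items need three round trips here instead of ten, and the three run concurrently.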

---

### **5.3 Lazy Memory Loading**

Load memory only when a node actually needs it, rather than on every run.
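One way to sketch lazy loading in Python is `functools.cached_property`, which defers the lookup until first access and reuses the result afterwards. `SessionMemory`, `answer`, and the keyword check are hypothetical names used only for illustration.

```python
from functools import cached_property

class SessionMemory:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.loads = 0  # counts how often the expensive lookup runs

    @cached_property
    def long_term(self) -> list[str]:
        # Stand-in for a vector search or database read.
        self.loads += 1
        return [f"note for {self.user_id}"]

def answer(query: str, memory: SessionMemory) -> str:
    if "remember" not in query:
        return "no memory needed"  # fast path never touches memory
    return f"using {len(memory.long_term)} note(s)"

memory = SessionMemory("user-1")
print(answer("what is 2 + 2?", memory), memory.loads)
print(answer("remember my last order", memory), memory.loads)
print(answer("remember my address", memory), memory.loads)
```

The first query skips the lookup entirely, and the two memory-dependent queries share a single load.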

---

### **6. State & Persistence Optimization**

* Avoid large state objects
* Store references, not raw blobs
* Reduce checkpoint frequency
* Use in-memory store for hot paths
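The "references, not raw blobs" point can be illustrated by comparing the serialized size of two state dictionaries. The `blob_store` dictionary, the `put_blob` helper, and the `blob://` scheme are assumptions standing in for an external object store.

```python
import json
import uuid

blob_store: dict[str, str] = {}  # hypothetical external store

def put_blob(content: str) -> str:
    """Store content out of band and return a short reference to it."""
    ref = f"blob://{uuid.uuid4().hex}"
    blob_store[ref] = content
    return ref

document = "lorem ipsum " * 10_000

heavy_state = {"query": "summarize", "document": document}
light_state = {"query": "summarize", "document_ref": put_blob(document)}

# Serialized size approximates what each checkpoint has to write.
heavy_size = len(json.dumps(heavy_state))
light_size = len(json.dumps(light_state))
print(f"heavy: {heavy_size} bytes, light: {light_size} bytes")
```

Only the node that needs the document resolves the reference; every other step serializes a state of well under a kilobyte.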

---

### **7. Infrastructure Optimizations**

| Component   | Optimization       |
| ----------- | ------------------ |
| LLM API     | Regional routing   |
| Network     | Co-locate services |
| Compute     | Warm containers    |
| Storage     | SSD-backed stores  |
| Scaling     | Auto-scale workers |
| Concurrency | Async runtime      |

---

### **8. Measuring & Enforcing Latency**

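```python
# "timeout" is not a standard config key and is not enforced;
# a per-step timeout (in seconds) is set on the compiled graph instead.
graph.step_timeout = 8
graph.invoke(data, config={"recursion_limit": 4})
```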

Monitor:

* P50 / P90 / P99 latency
* Node execution times
* Queue depth
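Node execution times and percentiles can be collected with a small decorator and the standard library, as in the sketch below. The `timed` decorator, the `node_timings` registry, and the `lookup` node are hypothetical; a production setup would more likely export these numbers to a tracing or metrics system.

```python
import statistics
import time
from collections import defaultdict
from functools import wraps

node_timings: dict[str, list[float]] = defaultdict(list)

def timed(name: str):
    """Record the wall-clock time of each call to the wrapped node function."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(state):
            start = time.perf_counter()
            try:
                return fn(state)
            finally:
                node_timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

def percentiles(samples: list[float]) -> dict[str, float]:
    """P50/P90/P99 with linear interpolation between samples."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

@timed("lookup")
def lookup(state: dict) -> dict:
    time.sleep(0.01)  # stand-in for real work
    return {"done": True}

for _ in range(20):
    lookup({})

print(percentiles(node_timings["lookup"]))
```

Tracking P99 alongside P50 matters because a slow tail is easy to miss when looking only at the median.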

---

### **9. Production Latency Playbook**

| Symptom           | Fix                   |
| ----------------- | --------------------- |
| Slow responses    | Reduce model size     |
| Long tail latency | Parallelize           |
| Spikes            | Add caching           |
| Timeouts          | Early exit            |
| High cost         | Prompt & model tuning |

---

### **10. Final Principle**

> **Optimize the graph before optimizing the model.
> Most latency problems are orchestration problems.**


### Demonstration

```python
# ======== Latency-Optimized LangGraph Demo (Correct) ========

import asyncio
import time
from typing import TypedDict
from langgraph.graph import StateGraph, END

# ---- State ----
class State(TypedDict):
    query: str
    confidence: float
    answer: str

# ---- Fast & Slow models (simulated) ----
async def fast_router(state: State):
    await asyncio.sleep(0.1)
    if "simple" in state["query"]:
        return {"confidence": 0.99, "answer": "Quick answer."}
    return {"confidence": 0.4}

async def slow_reasoner(state: State):
    await asyncio.sleep(1.5)
    return {"answer": f"Deep reasoning for: {state['query']}", "confidence": 0.96}

# ---- Router logic (sync) ----
def route(state: State):
    if state["confidence"] > 0.95:
        return END
    return "reason"

# ---- Graph ----
builder = StateGraph(State)
builder.add_node("router", fast_router)
builder.add_node("reason", slow_reasoner)

builder.set_entry_point("router")
builder.add_conditional_edges("router", route, {
    "reason": "reason",
    END: END
})
builder.add_edge("reason", END)

graph = builder.compile()

# ---- Run correctly with async runtime ----
async def run():
    start = time.time()
    # "timeout" is not a standard config key, so the deadline is
    # enforced with asyncio.wait_for around the whole run instead.
    result = await asyncio.wait_for(
        graph.ainvoke(
            {"query": "simple question", "confidence": 0.0, "answer": ""},
            config={"recursion_limit": 3},
        ),
        timeout=5,
    )
    print("Result:", result)
    print("Latency:", round(time.time() - start, 2), "seconds")

await run()  # top-level await works in a notebook; use asyncio.run(run()) in a script
```

Output:

```
Result: {'query': 'simple question', 'confidence': 0.99, 'answer': 'Quick answer.'}
Latency: 0.15 seconds
```
