```{contents}
```
## Latency Tracking

**Latency tracking** measures **how long each part of an LLM workflow takes to execute**, from the start of a request to the final response.
It helps identify **slow components** such as retrieval, prompt building, LLM calls, retries, or tools.

In LLM systems, latency answers:

* *Why is the response slow?*
* *Which step is the bottleneck?*
* *Is latency caused by retrieval, model, or retries?*

Latency tracking is commonly implemented using:

* Timers (manual)
* Callback handlers
* Tracing systems

---

### Why Latency Tracking Is Critical

Without latency tracking:

* Users complain about slowness
* You guess where the problem is
* Scaling decisions are blind

With latency tracking:

* Pinpoint slow steps
* Optimize prompts, chunking, models
* Enforce SLAs
* Compare models and configurations
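
SLAs are usually checked against a percentile rather than the mean, because a few slow requests can hide behind a good average. A minimal sketch, assuming latencies have already been collected in seconds; the sample values and the 1.5-second p95 budget are made up for illustration:

```python
import statistics

def percentile(samples, pct):
    """Return the pct-th percentile (1-99) of samples, interpolating between points."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]

# Hypothetical end-to-end latencies in seconds (one slow outlier)
latencies = [0.42, 0.51, 0.48, 0.55, 0.47, 0.50, 0.53, 0.49, 0.46, 3.10]

mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
SLA_P95_SECONDS = 1.5  # assumed budget, not a recommendation

print(f"mean={mean:.2f}s p50={p50:.2f}s p95={p95:.2f}s")
print("SLA met" if p95 <= SLA_P95_SECONDS else "SLA violated")
```

Here the mean looks acceptable while the p95 does not, which is why tail percentiles are the more useful number to alert on.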

---

### What Contributes to Latency

| Component       | Typical Cause                 |
| --------------- | ----------------------------- |
| Retriever       | Vector DB I/O                 |
| Prompt building | Large context                 |
| LLM call        | Model size / load             |
| Retries         | Rate limits / timeouts        |
| Streaming       | Does not reduce total latency; lowers time to first token |
| Tools           | External APIs                 |
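
The streaming row is easier to reason about when time to first chunk and total time are measured separately. A minimal sketch that works on any iterator of text chunks; `fake_stream` is a hypothetical stand-in for a real streaming response:

```python
import time

def measure_stream(chunks):
    """Consume an iterator of text chunks.

    Returns (time_to_first_chunk, total_time, full_text) with times in seconds.
    """
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return first, total, "".join(parts)

def fake_stream():
    # Stand-in for a streaming LLM response: three chunks, 50 ms apart
    for word in ["Latency ", "tracking ", "demo"]:
        time.sleep(0.05)
        yield word

ttft, total, text = measure_stream(fake_stream())
print(f"first chunk: {ttft:.3f}s, total: {total:.3f}s")
```

The total is unchanged by streaming, but the user starts reading after the first chunk arrives.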

---

### Architecture View

![Image](https://portkey.ai/blog/content/images/size/w1200/2025/11/end-to-endllm-observability.png)

![Image](https://bentoml.com/llm/assets/images/llm-inference-ttft-latency-3419154284149af2052def0403380a30.png)

![Image](https://miro.medium.com/v2/resize%3Afit%3A2000/1%2AKfvrQNLtHCnFVD0js7tBMg.png)


---

### Manual Latency Tracking

#### Simple Timing with `time`

```python
import time
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()

# perf_counter() is a monotonic clock intended for measuring intervals
start = time.perf_counter()
response = llm.invoke("Explain latency tracking")
elapsed = time.perf_counter() - start

print(f"Latency: {elapsed:.3f}s")
```

This gives **end-to-end latency only**; it does not show which step inside the call was slow.
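
Manual timing gets repetitive quickly, so it can help to wrap it in a small context manager. A minimal sketch using only the standard library; the `timer` helper and the `timings` dict are illustrative names, and `time.sleep` stands in for the model call:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, sink):
    """Record elapsed wall-clock seconds for the wrapped block in sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls are still timed
        sink[label] = time.perf_counter() - start

timings = {}
with timer("llm_call", timings):
    time.sleep(0.02)  # stand-in for llm.invoke(...)

print(timings)
```

Storing durations in a dict instead of printing them makes it easier to log or aggregate them later.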

---

#### Step-Level Latency Tracking

```python
import time
from langchain_core.runnables import RunnableLambda

def timed_step(name, fn):
    """Wrap fn in a Runnable that prints how long each call takes."""
    def wrapper(x):
        start = time.perf_counter()
        result = fn(x)
        duration = time.perf_counter() - start
        print(f"{name} latency: {duration:.3f}s")
        return result
    return RunnableLambda(wrapper)

chain = (
    timed_step("Normalize", lambda x: x.strip())
    | timed_step("Uppercase", lambda x: x.upper())
)
```

```python
chain.invoke(" latency demo ")
```

You now see **per-step latency**.

---

### Latency Tracking with Callback Handlers (Recommended)

#### Custom Latency Callback

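```python
import time
from langchain_core.callbacks import BaseCallbackHandler

class LatencyCallback(BaseCallbackHandler):
    """Prints LLM latency and the total latency of the top-level chain."""

    def __init__(self):
        # Start times keyed by run_id, so nested or concurrent runs
        # do not overwrite each other
        self._starts = {}

    def on_chain_start(self, serialized, inputs, *, run_id, parent_run_id=None, **kwargs):
        # Every runnable in a chain emits chain events; time only the root run
        if parent_run_id is None:
            self._starts[run_id] = time.perf_counter()

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        # Chat models fall back to this hook when on_chat_model_start
        # is not implemented
        self._starts[run_id] = time.perf_counter()

    def on_llm_end(self, response, *, run_id, **kwargs):
        start = self._starts.pop(run_id, None)
        if start is not None:
            print(f"LLM latency: {time.perf_counter() - start:.3f}s")

    def on_chain_end(self, outputs, *, run_id, **kwargs):
        start = self._starts.pop(run_id, None)
        if start is not None:
            print(f"Total chain latency: {time.perf_counter() - start:.3f}s")
```

---

#### Attach Latency Callback

Pass the handler through `config` so that both chain-level and LLM-level events reach it. A handler attached only to the model constructor never receives chain events.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Explain {topic} briefly"
)

llm = ChatOpenAI()

chain = prompt | llm

chain.invoke(
    {"topic": "latency tracking"},
    config={"callbacks": [LatencyCallback()]},
)
```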

**Output (example)**

```
LLM latency: 0.612s
Total chain latency: 0.640s
```

---

### Latency Tracking in RAG Pipelines

```python
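import time
from langchain_core.callbacks import BaseCallbackHandler

class RAGLatencyCallback(BaseCallbackHandler):
    def on_retriever_start(self, serialized, query, **kwargs):
        self.retriever_start = time.perf_counter()

    def on_retriever_end(self, documents, **kwargs):
        print(f"Retriever latency: {time.perf_counter() - self.retriever_start:.3f}s")

    def on_llm_start(self, serialized, prompts, **kwargs):
        # Record the start time; without this hook, on_llm_end has nothing to subtract from
        self.llm_start = time.perf_counter()

    def on_llm_end(self, response, **kwargs):
        print(f"LLM latency: {time.perf_counter() - self.llm_start:.3f}s")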

This reports two timings separately:

* Retrieval time
* Generation time

---

### Latency Tracking with Async Execution

```python
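import asyncio
import time

async def run():
    # `llm` is the ChatOpenAI instance created earlier
    start = time.perf_counter()
    await llm.ainvoke("Explain async latency tracking")
    print(f"Async latency: {time.perf_counter() - start:.3f}s")

asyncio.run(run())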
```

Async latency tracking is essential for:

* FastAPI
* High-concurrency systems
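
Under concurrency, per-request latency and wall-clock time diverge, so both are worth recording. A minimal sketch, with `asyncio.sleep` standing in for `await llm.ainvoke(...)`; the `timed` and `fake_llm_call` helpers are illustrative names:

```python
import asyncio
import time

async def timed(name, coro):
    """Await coro and return (name, elapsed_seconds)."""
    start = time.perf_counter()
    await coro
    return name, time.perf_counter() - start

async def fake_llm_call(delay):
    # Stand-in for an async model call
    await asyncio.sleep(delay)

async def main():
    wall_start = time.perf_counter()
    # Both requests run concurrently on the same event loop
    results = await asyncio.gather(
        timed("req-1", fake_llm_call(0.05)),
        timed("req-2", fake_llm_call(0.10)),
    )
    wall = time.perf_counter() - wall_start
    return dict(results), wall

per_request, wall = asyncio.run(main())
print(per_request, f"wall={wall:.3f}s")
```

The wall-clock time is close to the slowest request rather than the sum of both, which is the effect concurrent serving relies on.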

---

### Latency Tracking via Tracing (Automatic)

When LangSmith tracing is enabled (an API key is also required):

```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
```

Each trace automatically records:

* Step-level latency
* LLM call duration
* Tool and retriever timing
* Retry delays

These timings can be inspected visually in the tracing UI.

---

### Latency Tracking vs Cost Tracking

| Aspect       | Latency | Cost   |
| ------------ | ------- | ------ |
| Measures     | Time    | Money  |
| Unit         | ms / s  | $      |
| User impact  | UX      | Budget |
| Optimization | Speed   | Spend  |

They should **always be tracked together**.

---

### Common Latency Problems

* Large context windows
* Too many retrieved chunks
* Slow vector DB
* Unbounded retries
* Heavy callback logic
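
Unbounded retries are the easiest of these to guard against in code. A minimal sketch of a bounded retry loop that also reports how much time the retries added; `call_with_retries` and `flaky` are hypothetical helpers, and the backoff values are arbitrary:

```python
import time

def call_with_retries(fn, max_attempts=3, backoff_s=0.01):
    """Call fn up to max_attempts times; return (result, attempts, total_seconds)."""
    start = time.perf_counter()
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            return result, attempt, time.perf_counter() - start
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff

# Hypothetical flaky call: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result, attempts, elapsed = call_with_retries(flaky)
print(f"{result} after {attempts} attempts in {elapsed:.3f}s")
```

The elapsed time includes the backoff sleeps, so retry overhead shows up in the measured latency instead of disappearing into the total.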

---

### Latency Optimization Hooks

Typical actions based on latency:

* Reduce `k` in retrieval
* Switch to a smaller model
* Enable streaming (improves perceived latency, not total time)
* Cache results
* Parallelize independent steps
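
Parallelizing independent steps is often the cheapest of these to try. A minimal sketch using a thread pool, with `time.sleep` standing in for I/O-bound work; `fetch_docs` and `fetch_profile` are hypothetical steps that do not depend on each other:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_docs(query):
    time.sleep(0.05)  # stand-in for a vector DB lookup
    return [f"doc for {query}"]

def fetch_profile(user_id):
    time.sleep(0.05)  # stand-in for an external API call
    return {"user": user_id}

# Sequential: the two latencies add up
start = time.perf_counter()
docs, profile = fetch_docs("latency"), fetch_profile(42)
sequential = time.perf_counter() - start

# Parallel: total is roughly the slower of the two
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    docs_future = pool.submit(fetch_docs, "latency")
    profile_future = pool.submit(fetch_profile, 42)
    docs, profile = docs_future.result(), profile_future.result()
parallel = time.perf_counter() - start

print(f"sequential={sequential:.3f}s parallel={parallel:.3f}s")
```

Threads suit I/O-bound steps like these; the gain only appears when the steps really are independent.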

---

### Mental Model

Latency tracking is a **stopwatch attached to every step**.

```
Pipeline runs → timers record → slow step identified → optimized
```
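
Once the timers have recorded a duration per step, identifying the slow step is a one-liner. A minimal sketch; the step names and timings are made up for illustration:

```python
# Hypothetical per-step timings in seconds
step_timings = {"retriever": 0.18, "prompt_build": 0.01, "llm": 0.92, "tool": 0.25}

total = sum(step_timings.values())
bottleneck = max(step_timings, key=step_timings.get)

# Print steps from slowest to fastest with their share of the total
for name, seconds in sorted(step_timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13}{seconds:.2f}s  {seconds / total:6.1%}")
print(f"bottleneck: {bottleneck}")
```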

---

### Key Takeaways

* Latency tracking measures **time per step**
* Can be manual, callback-based, or tracing-based
* Essential for performance optimization
* Required for SLAs and production readiness