```{contents}
```

## Performance Metrics 


**Performance metrics** measure **how fast, stable, and scalable** your LLM system is.

They answer:

> **Can this system handle real users reliably at scale?**

Even the most accurate model fails in production if responses are slow or the service is unreliable.

---

### Core Performance Metrics

| Metric                         | What It Measures         | Why It Matters       |
| ------------------------------ | ------------------------ | -------------------- |
| **Latency**                    | Total response time      | UX & SLA             |
| **p50 / p95 / p99 Latency**    | Median and tail latency  | Bottleneck detection |
| **TTFT (Time To First Token)** | Streaming responsiveness | Perceived speed      |
| **Throughput**                 | Requests per second      | Scalability          |
| **Error Rate**                 | Failed responses %       | Stability            |
| **Availability**               | Uptime                   | Production readiness |

---

### Demonstration Setup

```python
import time
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
```

---

### Latency Measurement

```python
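# perf_counter() is monotonic and high-resolution, so it suits interval timing
# better than time.time(), which can jump if the system clock is adjusted.
start = time.perf_counter()
llm.invoke("Explain performance metrics")
latency = time.perf_counter() - start
print(f"Latency: {latency:.3f} s")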
```
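
To keep timing logic out of call sites, the measurement above can be wrapped in a small helper. This is a minimal sketch, not part of LangChain: `timed_call` is a hypothetical helper name, and `fake_llm` is a stand-in for `llm.invoke` so the example runs without an API key.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for llm.invoke; sleeps to simulate model work."""
    time.sleep(0.05)
    return f"echo: {prompt}"


result, latency = timed_call(fake_llm, "Explain performance metrics")
print(f"Latency: {latency:.3f} s")
```

With a real client, `timed_call(llm.invoke, "Explain performance metrics")` would return the model response together with its latency.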

---

### p50 / p95 / p99 Latency

```python
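import numpy as np

# Collect one latency sample per request (sequential calls).
latencies = []
for _ in range(100):
    start = time.perf_counter()
    llm.invoke("Ping")
    latencies.append(time.perf_counter() - start)

# p50 is the median; p95 and p99 describe the slow tail.
print("p50:", np.percentile(latencies, 50))
print("p95:", np.percentile(latencies, 95))
print("p99:", np.percentile(latencies, 99))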
```
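
To see why percentiles are reported alongside the average, consider a small worked example. The latency values below are made up for illustration (90 fast requests and 10 slow ones); they are not measurements from any real system.

```python
import numpy as np

# Hypothetical sample: 90 requests at 0.2 s and 10 requests at 3.0 s.
latencies = np.array([0.2] * 90 + [3.0] * 10)

mean = latencies.mean()
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])

# The mean sits between the two groups and describes neither of them,
# while p50 shows the typical request and p95/p99 expose the slow tail.
print(f"mean={mean:.2f}s  p50={p50:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")
```

In this sample one request in ten takes 3 seconds, which the median hides completely and the mean understates.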

---

### TTFT (Streaming Speed)

```python
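from langchain_openai import ChatOpenAI

llm_stream = ChatOpenAI(streaming=True)

start = time.perf_counter()
for chunk in llm_stream.stream("Explain streaming"):
    # Skip empty chunks so TTFT reflects the first visible text.
    if chunk.content:
        ttft = time.perf_counter() - start
        print(f"TTFT: {ttft:.3f} s")
        break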
```
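
TTFT is most useful next to total response time, since the gap between the two is what streaming hides from the user. The sketch below measures both for any iterable of text chunks. `measure_stream` and `fake_stream` are hypothetical names introduced for illustration, and the fake stream replaces a real model so the example runs offline.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[Optional[float], float, str]:
    """Consume text chunks and return (ttft_seconds, total_seconds, full_text).

    ttft_seconds is None if the stream never yields non-empty text.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for text in chunks:
        if text and ttft is None:
            ttft = time.perf_counter() - start  # first non-empty chunk
        parts.append(text)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)


def fake_stream() -> Iterator[str]:
    """Hypothetical token stream: one empty chunk, then three delayed tokens."""
    yield ""
    for token in ["Streaming ", "feels ", "fast"]:
        time.sleep(0.02)
        yield token


ttft, total, text = measure_stream(fake_stream())
print(f"TTFT: {ttft:.3f} s, total: {total:.3f} s")
```

Assuming chunks expose their text as `chunk.content`, as in the loop above, a real stream could be passed in as `(chunk.content for chunk in llm_stream.stream("Explain streaming"))`.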

---

### Throughput (Requests Per Second)

```python
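# Sequential baseline: one request at a time, so this equals 1 / mean latency.
# Real throughput depends on how many requests are in flight concurrently.
n_requests = 50
start = time.perf_counter()
for _ in range(n_requests):
    llm.invoke("Ping")
elapsed = time.perf_counter() - start

throughput = n_requests / elapsed
print(f"Throughput: {throughput:.2f} requests/sec")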
```
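
A sequential loop only reports the inverse of the average latency. To estimate how throughput changes with concurrency, requests can be issued in parallel. The sketch below uses a thread pool and a hypothetical `fake_llm` stand-in with a fixed 50 ms delay; `measure_throughput` is an illustrative helper, and real results would depend on provider rate limits and network conditions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for llm.invoke with a fixed 50 ms delay."""
    time.sleep(0.05)
    return "pong"


def measure_throughput(fn: Callable[[str], str], prompts: List[str], max_workers: int) -> float:
    """Run fn over prompts with max_workers threads; return requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fn, prompts))  # list() forces all calls to complete
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed


prompts = ["Ping"] * 20
sequential_rps = measure_throughput(fake_llm, prompts, max_workers=1)
concurrent_rps = measure_throughput(fake_llm, prompts, max_workers=10)
print(f"1 worker:   {sequential_rps:.1f} req/s")
print(f"10 workers: {concurrent_rps:.1f} req/s")
```

Threads fit this case because the calls are I/O-bound: each worker spends most of its time waiting on the network rather than holding the interpreter.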

---

### Error Rate

```python
success = 0
fail = 0

for _ in range(50):
    try:
        llm.invoke("Ping")
        success += 1
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        fail += 1

error_rate = fail / (success + fail)
print(f"Error rate: {error_rate:.1%}")
```
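
A single error percentage does not say what is failing. One possible refinement, sketched below, is to count failures by exception type so that timeouts can be told apart from other errors. `error_rate_by_type` and `flaky_call` are hypothetical names; the flaky function simulates a service where every fifth call times out.

```python
from collections import Counter
from typing import Callable, Dict, Tuple


def error_rate_by_type(fn: Callable[[], object], n: int) -> Tuple[float, Dict[str, int]]:
    """Call fn n times; return (error_rate, failure counts keyed by exception name)."""
    errors: Counter = Counter()
    for _ in range(n):
        try:
            fn()
        except Exception as exc:
            errors[type(exc).__name__] += 1
    return sum(errors.values()) / n, dict(errors)


calls = {"count": 0}


def flaky_call() -> str:
    """Hypothetical stand-in for llm.invoke: every 5th call raises TimeoutError."""
    calls["count"] += 1
    if calls["count"] % 5 == 0:
        raise TimeoutError("simulated timeout")
    return "pong"


rate, by_type = error_rate_by_type(flaky_call, 50)
print(f"Error rate: {rate:.1%}", by_type)
```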

---

### Availability (Uptime)

```python
# Request-based approximation; both counters come from your monitoring system.
uptime = successful_requests / total_requests
```

In production, track this ratio over rolling windows (for example, daily and monthly) rather than as a single number.
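
An availability target is easier to reason about when converted into a downtime budget. The sketch below shows the arithmetic. Both function names are hypothetical, and the request counts in the usage lines are made-up illustrative values.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Request-based availability as a fraction between 0 and 1."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def downtime_budget_minutes(target: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` while still meeting `target`."""
    return (1.0 - target) * days * 24 * 60


# Illustrative numbers only: 99,950 successes out of 100,000 requests.
print(f"Availability: {availability(99_950, 100_000):.2%}")
# A 99.9% target over 30 days: 0.001 * 43,200 minutes.
print(f"Downtime budget: {downtime_budget_minutes(0.999):.1f} min per 30 days")
```

A 99.9% target therefore leaves roughly 43 minutes of downtime in a 30-day month.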

---

### Why These Metrics Matter

| Problem            | Detected By   |
| ------------------ | ------------- |
| Slow UI            | Latency, TTFT |
| System overload    | Throughput    |
| Hidden bottlenecks | p95/p99       |
| Crashes            | Error rate    |
| Outages            | Availability  |

---

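### Production Thresholds (Typical)

Treat these as starting points for interactive workloads; adjust them to your model, prompt length, and SLA.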

| Metric       | Target    |
| ------------ | --------- |
| p95 latency  | < 1.5 sec |
| TTFT         | < 300 ms  |
| Error rate   | < 1%      |
| Availability | > 99.9%   |
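
Targets like these can be checked automatically against measured values. The sketch below encodes the table as a dictionary and reports which metrics miss their target. The `TARGETS` layout, the `check_targets` helper, and the measured values are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Targets from the table above: ("max", x) means value must be below x,
# ("min", x) means value must be above x.
TARGETS: Dict[str, Tuple[str, float]] = {
    "p95_latency_s": ("max", 1.5),
    "ttft_s": ("max", 0.3),
    "error_rate": ("max", 0.01),
    "availability": ("min", 0.999),
}


def check_targets(measured: Dict[str, float]) -> List[str]:
    """Return the names of metrics that miss their target."""
    violations = []
    for name, (kind, limit) in TARGETS.items():
        value = measured[name]
        ok = value < limit if kind == "max" else value > limit
        if not ok:
            violations.append(name)
    return violations


# Hypothetical measurements for one monitoring window.
measured = {
    "p95_latency_s": 1.2,
    "ttft_s": 0.45,
    "error_rate": 0.004,
    "availability": 0.9995,
}
violations = check_targets(measured)
print("Violations:", violations)
```

In this made-up window only TTFT misses its target, so a check like this could feed an alert or fail a deployment gate.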

---

### Mental Model

```
Performance = Speed + Scale + Stability
```

---

### Key Takeaways

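* Performance metrics are mandatory for production systems
* Tail latency (p95/p99) reveals problems that average latency hides
* Streaming improves perceived performance by lowering TTFT, even when total latency is unchanged
* Metrics must be monitored continuously, not measured once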
