
```{contents}
```
## Reliability & Stability Metrics


These metrics measure whether your LLM system is:

* **Consistently available**
* **Behaving predictably**
* **Resilient to failures**
* **Not degrading over time**

They answer:

> **Can this system be trusted in production long-term?**

---

### Core Reliability & Stability Metrics

| Metric                       | What It Measures                | Why It Matters              |
| ---------------------------- | ------------------------------- | --------------------------- |
| **Error Rate**               | % of failed requests            | Detects outages & bugs      |
| **Timeout Rate**             | % of requests past the deadline | Indicates performance risk  |
| **Retry Rate**               | How often requests are retried  | Reveals instability         |
| **Fallback Rate**            | Frequency of backup model usage | Detects capacity issues     |
| **Availability (Uptime)**    | % of time system is usable      | Business continuity         |
| **Consistency Drift**        | Variation in outputs over time  | Model stability             |
| **Quality Regression Delta** | Quality change vs baseline      | Prevents silent degradation |
| **Model Degradation Rate**   | Long-term quality decay         | Model health                |
| **Incident Frequency**       | How often failures occur        | Operational health          |

---

### Practical Demonstrations

### Error Rate

```python
error_rate = failed_requests / total_requests
```

---

### Timeout Rate

```python
timeout_rate = timed_out_requests / total_requests
```

---

### Retry Rate

```python
retry_rate = retried_requests / total_requests
```

---

### Fallback Rate

```python
fallback_rate = fallback_requests / total_requests
```
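
The four request-level ratios above (error, timeout, retry, fallback) share a denominator, so they can be computed in one pass over a request log. The following is a minimal sketch under stated assumptions: `RequestRecord` and its field names are hypothetical, and a request counts as "retried" if it was retried at least once.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One logged request; the field names are hypothetical."""
    failed: bool = False
    timed_out: bool = False
    retries: int = 0
    used_fallback: bool = False


def request_rates(records):
    """Return error, timeout, retry, and fallback rates as fractions in [0, 1]."""
    total = len(records)
    if total == 0:
        # No traffic: report zeros instead of dividing by zero.
        return {"error": 0.0, "timeout": 0.0, "retry": 0.0, "fallback": 0.0}
    return {
        "error": sum(r.failed for r in records) / total,
        "timeout": sum(r.timed_out for r in records) / total,
        "retry": sum(r.retries > 0 for r in records) / total,
        "fallback": sum(r.used_fallback for r in records) / total,
    }


# Illustrative data only: four requests, one of each kind.
records = [
    RequestRecord(),
    RequestRecord(failed=True, timed_out=True),
    RequestRecord(retries=2),
    RequestRecord(used_fallback=True),
]
rates = request_rates(records)
print(rates)
```

The rates come back as fractions; multiply by 100 to compare them with percentage targets.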

---

### Availability (Uptime)

```python
availability = uptime_minutes / total_minutes
```
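
In practice, uptime is often derived from recorded downtime rather than measured directly. The sketch below assumes downtime is tracked in minutes; the function names `availability` and `allowed_downtime` are illustrative, not part of any library. It also shows how an availability target translates into a downtime budget.

```python
def availability(total_minutes, downtime_minutes):
    """Fraction of the window during which the system was usable."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    uptime_minutes = total_minutes - downtime_minutes
    return uptime_minutes / total_minutes


def allowed_downtime(total_minutes, target):
    """Downtime budget in minutes permitted by an availability target."""
    return total_minutes * (1 - target)


minutes_in_30_days = 30 * 24 * 60  # 43,200 minutes
observed = availability(minutes_in_30_days, downtime_minutes=30)
budget = allowed_downtime(minutes_in_30_days, target=0.999)
print(observed, budget)
```

Over a 30-day window, a 99.9% target leaves a budget of roughly 43 minutes of downtime.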

---

### Quality Regression Delta

```python
quality_delta = new_quality_score - baseline_quality_score
```
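
A single pair of scores can hide noise, so one option is to compute the delta from per-example scores on a fixed evaluation set. This is a sketch, assuming higher scores are better; the `tolerance` parameter and the sample scores are illustrative assumptions, not recommended values.

```python
from statistics import mean


def quality_delta(baseline_scores, new_scores):
    """Mean score of the new version minus mean score of the baseline."""
    return mean(new_scores) - mean(baseline_scores)


def is_regression(delta, tolerance=0.01):
    """Flag a regression only when the drop exceeds a noise tolerance."""
    return delta < -tolerance


# Illustrative per-example scores on the same evaluation set.
baseline = [0.82, 0.79, 0.88, 0.91]
candidate = [0.80, 0.75, 0.86, 0.87]
delta = quality_delta(baseline, candidate)
print(round(delta, 3), is_regression(delta))
```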

---

### Consistency Drift

```python
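import numpy as np

# Rows are answer embeddings for the same prompt, sampled over time.
# Average the per-dimension spread; np.std without axis would flatten the array.
drift = np.mean(np.std(answer_embeddings_over_time, axis=0))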
```
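
Embedding spread can be defined in several ways. One alternative, sketched here as an assumption rather than a standard definition, is the mean cosine distance of each answer embedding from the centroid of all answers. It assumes a 2-D array with one non-zero embedding per row and a non-zero centroid.

```python
import numpy as np


def consistency_drift(embeddings):
    """Mean cosine distance of each answer embedding from the centroid."""
    x = np.asarray(embeddings, dtype=float)
    # Normalize rows so that only direction, not magnitude, counts.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    centroid = x.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    cosine_similarity = x @ centroid
    return float(np.mean(1.0 - cosine_similarity))


# Toy 2-D embeddings for illustration only.
stable = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
shifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stable_drift = consistency_drift(stable)
shifting_drift = consistency_drift(shifting)
print(stable_drift, shifting_drift)
```

Identical answers give a drift of zero, and the value grows as answers to the same prompt diverge in direction.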

---

### Model Degradation Rate

```python
# Positive when quality has dropped; units are quality points per unit of time_elapsed.
degradation_rate = (baseline_quality - current_quality) / time_elapsed
```
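
The two-point formula is sensitive to noise in either measurement. When several quality measurements are available, a least-squares line fit is one way to estimate the trend. The sketch below uses `np.polyfit`; the weekly sample scores are made up for illustration.

```python
import numpy as np


def degradation_rate(days, quality_scores):
    """Quality lost per day, estimated from a least-squares line fit."""
    # polyfit returns coefficients from highest degree to lowest.
    slope, _intercept = np.polyfit(days, quality_scores, deg=1)
    # Negate so that a positive value means quality is decaying.
    return float(-slope)


# Illustrative weekly measurements.
days = [0, 7, 14, 21, 28]
scores = [0.90, 0.89, 0.88, 0.87, 0.86]
rate = degradation_rate(days, scores)
print(rate)
```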

---
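
### Incident Frequency

Incident Frequency is listed in the core table as how often failures occur. One possible formulation, given here as an assumption, is the number of incidents per 30 days within an observation window. The function name and the sample dates are illustrative.

```python
from datetime import datetime, timedelta


def incident_frequency(incident_starts, window_start, window_end):
    """Incidents per 30 days within [window_start, window_end)."""
    window = window_end - window_start
    if window <= timedelta(0):
        raise ValueError("window_end must be after window_start")
    count = sum(window_start <= t < window_end for t in incident_starts)
    # Dividing two timedeltas yields a plain float (number of 30-day periods).
    return count / (window / timedelta(days=30))


# Illustrative incident start times in a 90-day window.
incidents = [datetime(2024, 1, 3), datetime(2024, 2, 10), datetime(2024, 3, 20)]
freq = incident_frequency(incidents, datetime(2024, 1, 1), datetime(2024, 3, 31))
print(freq)
```

---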

### Production Thresholds (Typical)

| Metric        | Target  |
| ------------- | ------- |
| Availability  | ≥ 99.9% |
| Error Rate    | ≤ 1%    |
| Retry Rate    | ≤ 3%    |
| Fallback Rate | ≤ 10%   |
| Timeout Rate  | ≤ 0.5%  |
| Quality Delta | ≥ 0     |
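
The targets in the table can be encoded as data and checked automatically. This is a minimal sketch: the metric key names and the `violations` helper are hypothetical, and the targets are expressed as fractions rather than percentages.

```python
import operator

# Targets copied from the table above, as fractions rather than percentages.
THRESHOLDS = {
    "availability": (operator.ge, 0.999),
    "error_rate": (operator.le, 0.01),
    "retry_rate": (operator.le, 0.03),
    "fallback_rate": (operator.le, 0.10),
    "timeout_rate": (operator.le, 0.005),
    "quality_delta": (operator.ge, 0.0),
}


def violations(metrics):
    """Return the names of metrics that are present and miss their target."""
    return [
        name
        for name, (meets, target) in THRESHOLDS.items()
        if name in metrics and not meets(metrics[name], target)
    ]


# Illustrative observed values.
observed = {
    "availability": 0.9995,
    "error_rate": 0.02,
    "timeout_rate": 0.001,
    "quality_delta": -0.03,
}
failing = violations(observed)
print(failing)
```

Metrics missing from the input are skipped rather than treated as failures; a stricter check could report them separately.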

---

### Why These Metrics Matter

| Failure Mode         | Detected By       |
| -------------------- | ----------------- |
| Hidden outages       | Error rate        |
| Performance collapse | Timeout rate      |
| Model instability    | Consistency drift |
| Silent degradation   | Quality delta     |
| Capacity problems    | Fallback rate     |

---

### Mental Model

```
Reliability & Stability =
Does the system behave predictably under stress and over time?
```

---

### Key Takeaways

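* These metrics determine **production readiness**
* They detect silent failures and long-term decay
* They must be monitored continuously, not only at release time
* In production, they often matter more than raw quality metrics, because a high-quality model that is unavailable or unstable still fails its users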
