```{contents}
```
## Monitoring

### 1. Definition & Motivation

**Monitoring** in Generative AI is the continuous measurement, analysis, and control of a deployed model’s **behavior, performance, safety, cost, and reliability** in real-world usage.

Unlike classical ML systems, Generative AI systems:

* interact directly with users,
* produce open-ended outputs,
* evolve with data, prompts, and tool integrations.

They therefore require **real-time, multi-dimensional monitoring**.

---

### 2. What Must Be Monitored

| Category                  | What is Measured                            | Why It Matters             |
| ------------------------- | ------------------------------------------- | -------------------------- |
| **Model Quality**         | Accuracy, relevance, hallucination rate     | Prevents degradation       |
| **Safety & Alignment**    | Toxicity, bias, harmful content, jailbreaks | Regulatory & ethical risk  |
| **Drift**                 | Data drift, concept drift, prompt drift     | Flags model/usage mismatch |
| **Latency & Reliability** | Response time, uptime, error rate           | UX and system stability    |
| **Cost & Usage**          | Tokens, requests, tool calls                | Budget control             |
| **Security**              | Prompt injection, leakage, abuse            | Prevents exploitation      |

---

### 3. Monitoring Architecture

```
User → Prompt → LLM → Output
         ↓         ↓
    Prompt Logs   Output Logs
         ↓         ↓
   Metrics Engine → Alerts → Dashboards
         ↓
   Drift / Safety / Quality Evaluators
```
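
The logging stages in the diagram can be sketched as one structured record per interaction. This is a minimal sketch, assuming JSON Lines as the log format; the `make_log_record` helper and its field names (`request_id`, `latency_s`, and so on) are illustrative and not taken from any particular logging library.

```python
import json
import time
import uuid


def make_log_record(prompt, output, model, latency_s, input_tokens, output_tokens):
    """Build one JSON-serializable record per interaction (hypothetical schema)."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "output": output,
        "latency_s": latency_s,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }


def to_json_line(record):
    """Serialize a record as a single JSON Lines entry."""
    return json.dumps(record, ensure_ascii=False)


record = make_log_record("What is drift?", "Drift is ...", "example-model", 0.42, 12, 30)
line = to_json_line(record)
```

Keeping prompt, output, and metadata in the same record lets the metrics engine and the evaluators read from a single source.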

---

### 4. Core Monitoring Dimensions

### 4.1 Quality Monitoring

Key metrics:

* **Relevance Score**
* **Factuality / Hallucination Rate**
* **Helpfulness**
* **Coherence**

**Automated evaluation example**

```python
from openai import OpenAI
client = OpenAI()

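def evaluate_response(prompt, response):
    """Ask a judge model to score a response; returns the judge's raw text."""
    judge = client.responses.create(
        model="gpt-4.1-mini",
        input=(
            "Score the response for factuality and relevance, each from 0 to 1.\n"
            f"Prompt: {prompt}\nResponse: {response}"
        ),
    )
    return judge.output_text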
```
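
The judge returns free text, so its scores have to be parsed before they can be tracked as metrics. The sketch below assumes the judge has been instructed to reply with JSON such as `{"factuality": 0.9, "relevance": 0.8}`; that reply format, the helper names `parse_scores` and `hallucination_rate`, and the 0.5 threshold are all assumptions made for illustration.

```python
import json


def parse_scores(judge_text):
    """Parse a judge reply assumed to be JSON; return None if it is malformed."""
    try:
        data = json.loads(judge_text)
        scores = {k: float(data[k]) for k in ("factuality", "relevance")}
    except (ValueError, KeyError, TypeError):
        return None
    # Reject scores outside the requested 0-1 range.
    if not all(0.0 <= v <= 1.0 for v in scores.values()):
        return None
    return scores


def hallucination_rate(score_list, threshold=0.5):
    """Share of parsed responses whose factuality falls below the threshold."""
    valid = [s for s in score_list if s is not None]
    if not valid:
        return 0.0
    return sum(s["factuality"] < threshold for s in valid) / len(valid)


replies = ['{"factuality": 0.9, "relevance": 0.8}',
           '{"factuality": 0.2, "relevance": 0.7}',
           "not json"]
parsed = [parse_scores(r) for r in replies]
rate = hallucination_rate(parsed)
```

Malformed replies are dropped rather than counted as failures, so the share of unparseable judge outputs is worth tracking as a separate metric.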

---

### 4.2 Safety & Alignment Monitoring

Detect:

* toxicity
* hate / violence
* self-harm
* policy violations
* jailbreak attempts

```python
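# `client` is the OpenAI client created earlier; `response_text` is the model output to screen.
safety = client.moderations.create(
    model="omni-moderation-latest",
    input=response_text
)
flagged = safety.results[0].flagged  # True if any category was flagged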
```
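
A moderation verdict still has to be turned into an action. The following is a minimal sketch, assuming the moderation result has already been reduced to a `flagged` boolean and a dict of per-category scores; the `safety_action` helper, the action names, and the 0.8 block threshold are hypothetical.

```python
def safety_action(flagged, category_scores, block_threshold=0.8):
    """Map a moderation verdict to an action (thresholds are illustrative)."""
    worst_category = max(category_scores, key=category_scores.get)
    worst_score = category_scores[worst_category]
    if worst_score >= block_threshold:
        return "block", worst_category
    if flagged:
        return "review", worst_category
    return "allow", worst_category


# Hand-written scores standing in for a real moderation result.
action, category = safety_action(
    flagged=True,
    category_scores={"violence": 0.35, "self-harm": 0.02, "hate": 0.10},
)
```

Because the function takes plain values, the policy can be unit-tested without calling any API.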

---

### 4.3 Drift Monitoring

| Drift Type        | What Changes                                          |
| ----------------- | ----------------------------------------------------- |
| **Data Drift**    | Distribution of user inputs                           |
| **Prompt Drift**  | System and user instructions as they evolve           |
| **Concept Drift** | Relationship between inputs and the expected output   |

```python
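import numpy as np
from scipy.stats import wasserstein_distance

# wasserstein_distance expects 1-D samples, so compare the embedding
# matrices (shape: n_samples x n_dims) one dimension at a time and average.
drift = np.mean([
    wasserstein_distance(old_embeddings[:, i], new_embeddings[:, i])
    for i in range(old_embeddings.shape[1])
])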
```
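
Data drift can also be checked without embeddings by tracking a cheap one-dimensional summary of the inputs. The sketch below uses prompt word count as that summary and computes the 1-D Wasserstein distance by hand, which for two equal-sized samples is the mean absolute difference between their sorted values. The helper names, the sample prompts, and the threshold of 2.0 are assumptions for illustration.

```python
def wasserstein_1d(a, b):
    """1-D Wasserstein distance between two equal-sized samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def prompt_lengths(prompts):
    """Use word count as a cheap 1-D summary of each prompt."""
    return [len(p.split()) for p in prompts]


baseline = ["reset my password", "where is my order", "cancel my plan"]
current = ["write a python script that parses csv files",
           "explain this stack trace line by line please",
           "refactor the function below to be async"]

drift_score = wasserstein_1d(prompt_lengths(baseline), prompt_lengths(current))
drift_detected = drift_score > 2.0  # illustrative threshold
```

In practice the threshold would be calibrated by measuring the distance between two windows of traffic known to be similar.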

---

### 4.4 Performance & Reliability

Monitor:

* p50 / p95 latency
* token throughput
* API error rate
* tool-call success rate

```python
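# Wall-clock latency of one request, in seconds (e.g. timestamps from time.perf_counter()).
latency = end_time - start_time
# Fraction of failed requests in the window; guard against an empty window.
error_rate = failed_requests / total_requests if total_requests else 0.0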
```
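
Percentile latency is computed over a window of requests rather than a single call. A minimal sketch using the nearest-rank method follows; the `percentile` helper and the request log are hypothetical, and a production system would more likely rely on its metrics backend for this.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request log: (latency in seconds, succeeded?)
requests = [(0.21, True), (0.35, True), (0.30, True), (1.80, False),
            (0.25, True), (0.40, True), (0.28, True), (0.33, True),
            (2.50, False), (0.31, True)]

latencies = [lat for lat, _ in requests]
p50 = percentile(latencies, 50)   # 0.31
p95 = percentile(latencies, 95)   # 2.50
error_rate = sum(not ok for _, ok in requests) / len(requests)  # 0.2
```

The gap between p50 and p95 here shows why the median alone hides the slow tail that users actually notice.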

---

### 4.5 Cost & Usage

```python
# Simplest estimate; input and output tokens are often priced differently.
cost = tokens_used * cost_per_token
```

Track:

* tokens per user
* tokens per feature
* cost per workflow
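
The breakdowns above can be sketched as a grouped sum over usage records. This sketch assumes separate input and output token prices; the prices, record fields, and helper names are hypothetical, and real prices vary by provider and model.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens.
PRICE_PER_M = {"input": 1.00, "output": 4.00}


def request_cost(input_tokens, output_tokens):
    """Cost of one request with separate input and output token prices."""
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000


def cost_by(records, key):
    """Sum request cost grouped by a record field such as 'user' or 'feature'."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += request_cost(r["input_tokens"], r["output_tokens"])
    return dict(totals)


usage = [
    {"user": "u1", "feature": "chat", "input_tokens": 1000, "output_tokens": 500},
    {"user": "u1", "feature": "summarize", "input_tokens": 4000, "output_tokens": 250},
    {"user": "u2", "feature": "chat", "input_tokens": 2000, "output_tokens": 1000},
]
per_user = cost_by(usage, "user")
per_feature = cost_by(usage, "feature")
```

The same `cost_by` call works for any field in the record, so adding a `workflow` field is enough to get cost per workflow.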

---

### 5. Alerting & Governance

| Condition       | Action                    |
| --------------- | ------------------------- |
| Hallucination ↑ | Retrain / tighten prompts |
| Toxicity spike  | Block + investigate       |
| Drift detected  | Refresh data / prompts    |
| Latency ↑       | Scale infra               |
| Cost ↑          | Enforce quotas            |
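
The table above can be encoded as a small set of threshold rules. This is a sketch only: the metric names, thresholds, and the `evaluate_alerts` helper are assumptions, and the thresholds should be tuned against your own baselines.

```python
# (metric name, threshold, action); all values are illustrative.
RULES = [
    ("hallucination_rate", 0.10, "tighten prompts / retrain"),
    ("toxicity_rate", 0.01, "block + investigate"),
    ("drift_score", 2.0, "refresh data / prompts"),
    ("p95_latency_s", 3.0, "scale infra"),
    ("daily_cost_usd", 500.0, "enforce quotas"),
]


def evaluate_alerts(metrics):
    """Return (metric, value, action) for every metric above its threshold."""
    alerts = []
    for name, threshold, action in RULES:
        value = metrics.get(name)
        if value is not None and value > threshold:
            alerts.append((name, value, action))
    return alerts


alerts = evaluate_alerts(
    {"hallucination_rate": 0.04, "p95_latency_s": 4.2, "daily_cost_usd": 120.0}
)
```

Metrics missing from the input are skipped, so the same rules can run against partial snapshots.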

---

### 6. Human-in-the-Loop Monitoring

Certain failures require **human validation**:

* subjective quality
* edge-case safety
* regulatory compliance
* domain-specific correctness

```text
Auto metrics → Uncertain cases → Human review → Feedback → Model updates
```
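
The "uncertain cases" step in the flow above can be sketched as a score band: confident cases are handled automatically and everything in between goes to a reviewer. The band edges (0.3 and 0.8), the route names, and the helper functions are hypothetical.

```python
def route(score, low=0.3, high=0.8):
    """Route by automated quality score; the band edges are illustrative."""
    if score >= high:
        return "auto_accept"
    if score < low:
        return "auto_reject"
    return "human_review"


def build_review_queue(scored_cases):
    """Collect uncertain cases, lowest-scoring first."""
    uncertain = [c for c in scored_cases if route(c["score"]) == "human_review"]
    return sorted(uncertain, key=lambda c: c["score"])


cases = [
    {"id": "a", "score": 0.95},
    {"id": "b", "score": 0.55},
    {"id": "c", "score": 0.10},
    {"id": "d", "score": 0.35},
]
queue = build_review_queue(cases)
queue_ids = [c["id"] for c in queue]
```

Widening the band sends more cases to humans and raises review cost, so the edges are a trade-off between coverage and reviewer capacity.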

---

### 7. Production Workflow Summary

1. **Log everything** (prompt, output, metadata)
2. **Compute metrics continuously**
3. **Detect drift and anomalies**
4. **Trigger alerts**
5. **Route critical cases to humans**
6. **Retrain / reconfigure**
7. **Repeat**

---

### 8. Why Monitoring is Mandatory for GenAI

| Without Monitoring  | With Monitoring      |
| ------------------- | -------------------- |
| Silent failures     | Early detection      |
| Safety risks        | Controlled behavior  |
| Cost overruns       | Budget governance    |
| Trust erosion       | Stable deployment    |
| Regulatory exposure | Compliance readiness |

---

### 9. Conceptual Comparison

| Traditional ML       | Generative AI             |
| -------------------- | ------------------------- |
| Static outputs       | Open-ended outputs        |
| Periodic evaluation  | Continuous evaluation     |
| Few metrics          | Multi-dimensional metrics |
| Offline drift checks | Real-time drift detection |

---

### Final Insight

**Monitoring is not an accessory in Generative AI; it is the control system of intelligence in production.**

Without monitoring, Generative AI systems become:

* untrustworthy,
* unsafe,
* expensive,
* and uncontrollable.

