```{contents}
```
## Quality Evaluation

### 1. Motivation

Generative AI systems produce **open-ended outputs** (text, images, code, audio).
Unlike classification, generation usually has **no single correct answer**, which makes evaluation a central scientific challenge.

Quality evaluation answers:

* *Is the output correct?*
* *Is it useful and aligned with the task?*
* *Is it safe and reliable?*
* *Is it better than previous models?*

---

### 2. Core Dimensions of Quality

| Dimension                      | Meaning                                  | Typical Failures            |
| ------------------------------ | ---------------------------------------- | --------------------------- |
| **Correctness / Faithfulness** | Matches facts, input, or ground truth    | Hallucinations              |
| **Relevance**                  | Addresses the prompt intent              | Off-topic generation        |
| **Coherence**                  | Logically consistent and well-structured | Contradictions              |
| **Fluency**                    | Grammatically and stylistically natural  | Awkward language            |
| **Completeness**               | Covers required information              | Missing key content         |
| **Safety & Alignment**         | Avoids harmful or biased outputs         | Toxicity, policy violations |
| **Usefulness**                 | Helps user accomplish task               | Vague, low utility          |

---

### 3. Taxonomy of Evaluation Methods

### 3.1 Automatic Metrics (Reference-Based)

Used when **ground truth references** exist.

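| Metric           | Domain           | Measures                                             |
| ---------------- | ---------------- | ---------------------------------------------------- |
| BLEU             | MT, text         | n-gram precision (with a brevity penalty)            |
| ROUGE            | Summarization    | n-gram recall (ROUGE-N), longest common subsequence (ROUGE-L) |
| METEOR           | MT               | Unigram alignment with stemming and synonym matching |
| CIDEr            | Image captioning | TF-IDF-weighted n-gram consensus across references   |
| Exact Match / F1 | QA               | Exact string match / token-level overlap             |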

**Limitation:**
Surface-level matching ≠ semantic correctness. A valid paraphrase can score low, and a fluent but factually wrong answer can score high.
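
To make the limitation concrete, the sketch below implements simplified versions of the overlap metrics in the table: clipped unigram precision (the building block of BLEU, without the brevity penalty or higher-order n-grams), unigram recall (as in ROUGE-1), their F1, and Exact Match. The helper names and the whitespace tokenizer are assumptions for illustration, not the reference implementations.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokens with edge punctuation stripped (a simplification)."""
    tokens = (tok.strip(".,!?").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def unigram_overlap(pred: str, ref: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over clipped unigram matches."""
    pred_counts = Counter(tokenize(pred))
    ref_counts = Counter(tokenize(ref))
    # Counter intersection keeps the minimum count per token (clipping).
    matches = sum((pred_counts & ref_counts).values())
    precision = matches / max(sum(pred_counts.values()), 1)
    recall = matches / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if matches == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(pred: str, ref: str) -> bool:
    """Exact Match after the same normalization."""
    return tokenize(pred) == tokenize(ref)


ref = "Paris is the capital of France."
paraphrase = "France's capital city is Paris."   # correct, different wording
wrong = "Paris is the capital of Germany."       # incorrect, near-identical wording

print(unigram_overlap(paraphrase, ref))
print(unigram_overlap(wrong, ref))
print(exact_match(wrong, ref))
```

With this toy tokenizer the factually wrong sentence gets a higher overlap F1 than the correct paraphrase, which is exactly the gap between surface matching and semantic correctness.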

---

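### 3.2 Automatic Metrics (Embedding-Based and Reference-Free)

Used for **open-ended** generation, where n-gram overlap with a single reference is a poor fit. Not every metric here is strictly reference-free, so the last column states what each one needs.

| Metric           | Idea                                                               | Reference needed?             |
| ---------------- | ------------------------------------------------------------------ | ----------------------------- |
| BERTScore        | Semantic similarity via contextual embeddings                      | Yes, one reference per output |
| Perplexity       | How predictable the text is under a language model (lower is more fluent) | No                     |
| MAUVE            | Distribution similarity (human vs. model text)                     | A corpus of human text        |
| Self-Consistency | Stability across multiple generations                              | No                            |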
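
Two of the signals in this table reduce to short formulas once model outputs are available. The sketch below assumes per-token natural-log probabilities have already been obtained from some language model (the values shown are invented) and treats self-consistency as the share of sampled answers that agree with the most common one. Both helper names are hypothetical.

```python
import math
from collections import Counter


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)


# Illustrative inputs only: these log-probs and samples are not real model output.
print(perplexity([-0.1, -0.3, -0.2]))
print(self_consistency(["Paris", "paris", "Lyon", "Paris "]))
```

Low perplexity indicates fluent, predictable text, not correct text, and high agreement across samples indicates stability, not truth, so both are best read alongside a correctness check.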

---

### 3.3 Model-Based Evaluation (LLM-as-a-Judge)

A strong LLM is prompted to score or compare outputs along explicitly named quality dimensions.

**Typical criteria:**

* Helpfulness
* Faithfulness
* Coherence
* Safety

**Advantages**

* Scalable
* Captures semantics
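* Often agrees closely with human judgments, although judges can favor longer answers, the first-listed answer, or their own outputs, so spot-check against human ratings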

---

### 3.4 Human Evaluation

Human judgment remains the gold standard for:

* Helpfulness
* Safety
* Preference
* Alignment

**Protocols**

| Method              | Description          |
| ------------------- | -------------------- |
| Likert scoring      | Rate from 1–5        |
| Pairwise preference | Choose better output |
| Rubric-based review | Structured checklist |
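
Once annotations are collected, two summary numbers are commonly reported: a win rate for pairwise preferences and an agreement statistic between annotators. The sketch below is a minimal version under stated assumptions: ties count as half a win (one common convention), and agreement is unweighted Cohen's kappa, which treats Likert ratings as unordered labels (a weighted variant would suit ordinal scales better). The function names and the sample annotations are invented for illustration.

```python
from collections import Counter


def win_rate(preferences: list[str], model: str = "A") -> float:
    """Share of pairwise comparisons won by `model`; a "tie" counts as half a win."""
    if not preferences:
        raise ValueError("need at least one comparison")
    wins = sum(1.0 if p == model else 0.5 if p == "tie" else 0.0 for p in preferences)
    return wins / len(preferences)


def cohens_kappa(rater_1: list[int], rater_2: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Agreement expected if each rater labeled independently at their own base rates.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented annotations for illustration.
print(win_rate(["A", "B", "A", "tie"]))
print(cohens_kappa([5, 4, 4, 2, 1], [5, 4, 3, 2, 1]))
```

Low agreement usually points to an ambiguous rubric rather than careless annotators, so it is worth checking before trusting the aggregated scores.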

---

### 4. Evaluation Workflow

```text
1. Define task & quality criteria
2. Collect representative prompts
3. Generate model outputs
4. Apply automatic metrics
5. Run LLM-judge or human review
6. Aggregate scores
7. Analyze failure modes
8. Iterate model & prompts
```
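
Steps 3 to 7 of this workflow can be sketched as a small harness. Everything here is a hypothetical scaffold: `run_eval`, the `generate` callable, and the metric signature `(prompt, output) -> float` are assumptions for illustration, and the toy model and metric exist only so the example runs without a real system.

```python
from statistics import mean
from typing import Callable


def run_eval(
    prompts: list[str],
    generate: Callable[[str], str],
    metrics: dict[str, Callable[[str, str], float]],
) -> dict:
    """Generate outputs, score them, aggregate, and surface the weakest cases."""
    rows = []
    for prompt in prompts:
        output = generate(prompt)  # step 3: generate model outputs
        # Steps 4-5: each metric may be automatic, an LLM judge, or a human rating lookup.
        scores = {name: fn(prompt, output) for name, fn in metrics.items()}
        rows.append({"prompt": prompt, "output": output, **scores})
    # Step 6: aggregate per metric.
    summary = {name: mean(row[name] for row in rows) for name in metrics}
    # Step 7: keep the lowest-scoring prompt per metric for manual inspection.
    worst = {name: min(rows, key=lambda row: row[name])["prompt"] for name in metrics}
    return {"rows": rows, "summary": summary, "worst": worst}


# Toy stand-ins so the sketch runs without a model.
def echo_model(prompt: str) -> str:
    return prompt.upper()


def length_ratio(prompt: str, output: str) -> float:
    return len(output) / max(len(prompt), 1)


report = run_eval(["hello", "evaluate me"], echo_model, {"length_ratio": length_ratio})
print(report["summary"])
```

Keeping per-example rows alongside the aggregate makes step 7 possible: averages hide which prompts fail.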

---

### 5. Practical Example: Text Generation Evaluation

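```python
# Requires the third-party package: pip install bert-score
from bert_score import score

preds = ["The capital of France is Paris."]   # model outputs
refs  = ["Paris is the capital of France."]   # one reference per output

# Returns precision, recall, and F1 tensors with one entry per pair.
# The first call downloads a pretrained model.
P, R, F1 = score(preds, refs, lang="en")
print(F1.mean().item())
```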

---

### 6. LLM-as-a-Judge Example

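```python
def judge(question: str, model_output: str, llm) -> str:
    """Ask a judge model to rate an answer.

    `llm` is a placeholder: any callable that maps a prompt string to the
    judge model's reply. Plug in whichever client you use.
    """
    judge_prompt = f"""
Evaluate the answer for correctness, coherence, and usefulness (1-5 each).

Question: {question}
Answer: {model_output}
"""
    return llm(judge_prompt)
```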
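
The judge's reply is free text, so it has to be parsed before scores can be aggregated. The sketch below assumes the judge answers with lines such as `Correctness: 4`. That reply format is an assumption (the prompt above does not enforce one), and `parse_judge_reply` is a hypothetical helper.

```python
import re

CRITERIA = ("correctness", "coherence", "usefulness")


def parse_judge_reply(reply: str) -> dict[str, int]:
    """Extract 1-5 scores from lines such as 'Correctness: 4' (assumed reply format)."""
    scores = {}
    for criterion in CRITERIA:
        match = re.search(rf"{criterion}\s*[:=]\s*([1-5])\b", reply, flags=re.IGNORECASE)
        if match is None:
            # Fail loudly instead of silently recording a missing or out-of-range score.
            raise ValueError(f"no valid score found for {criterion!r}")
        scores[criterion] = int(match.group(1))
    return scores


# Hand-written reply for illustration, not real model output.
example_reply = "Correctness: 5\nCoherence: 4\nUsefulness = 4"
print(parse_judge_reply(example_reply))
```

In practice it helps to state the expected reply format in the judge prompt itself, so that parsing failures stay rare and easy to spot.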

---

### 7. Composite Scoring

Evaluation pipelines often combine several signals into a single weighted score:

$$
Q = \alpha C + \beta R + \gamma U + \delta S
$$

Where:

* $C$: Correctness
* $R$: Relevance
* $U$: Usefulness
* $S$: Safety
* $\alpha, \beta, \gamma, \delta$: task-specific weights
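
The weighted sum above can be sketched in a few lines. This version makes two assumptions that the formula does not state: each dimension score has already been scaled to [0, 1], and the weights are normalized so they sum to 1. The function name and the example values are illustrative only.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Q = alpha*C + beta*R + gamma*U + delta*S, with weights normalized to sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")
    return sum(weights[k] * scores[k] for k in scores) / total


# Illustrative values: each score is assumed to be scaled to [0, 1] beforehand.
scores = {"correctness": 0.9, "relevance": 0.8, "usefulness": 0.7, "safety": 1.0}
weights = {"correctness": 0.4, "relevance": 0.2, "usefulness": 0.2, "safety": 0.2}
print(composite_score(scores, weights))
```

A weighted sum lets a high score on one dimension offset a low score on another. Whether that trade-off is acceptable, for safety in particular, is a task-level decision.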

---

### 8. Typical Failure Modes Discovered by Evaluation

| Failure         | Detection                        |
| --------------- | -------------------------------- |
| Hallucination   | Faithfulness checks              |
| Prompt drift    | Relevance metrics                |
| Mode collapse   | Diversity metrics                |
| Overconfidence  | Calibration tests                |
| Bias / toxicity | Safety classifiers + human audit |
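
Two of the detection methods in this table are simple enough to sketch directly: a distinct-n ratio as a diversity metric for mode collapse, and expected calibration error (ECE) as a calibration test for overconfidence. The sketch assumes whitespace tokenization and equal-width confidence bins; the function names and all data are invented for illustration.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all texts (higher = more diverse)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def expected_calibration_error(
    confidences: list[float], correct: list[bool], bins: int = 5
) -> float:
    """Weighted average gap between mean confidence and accuracy within each bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need matching, non-empty inputs")
    buckets = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * bins), bins - 1)  # confidence 1.0 falls in the last bin
        buckets[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            ece += len(bucket) / n * abs(avg_conf - accuracy)
    return ece


# Invented data for illustration.
collapsed = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran off", "birds fly south"]
print(distinct_n(collapsed), distinct_n(varied))
# A model that claims 90% confidence but is right half the time is overconfident.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, True]))
```

A low distinct-n ratio across many samples for different prompts suggests repeated outputs, and a large ECE means stated confidence cannot be taken at face value.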

---

### 9. Summary

Quality evaluation in Generative AI is:

* **Multi-dimensional**
* **Hybrid (automatic + human + LLM-based)**
* **Task-specific**
* **Central to safe deployment and model improvement**

Reliable evaluation is what converts **impressive generation** into **trustworthy AI systems**.
