```{contents}
```
## Relevance

**Relevance** measures **how well an LLM’s answer addresses the user’s question**.
It checks **alignment**, not truth or grounding.

In simple terms:

> **Does the answer actually answer what was asked?**

Relevance is a core evaluation dimension in LangChain-based evaluation pipelines and applies to **LLMs, RAG systems, and agents**.

---

### Why Relevance Matters

An answer can be:

* Factually correct ✅
* Well grounded ✅
* Yet completely off-topic ❌

Relevance ensures:

* User intent is satisfied
* No unnecessary or unrelated content
* High UX quality

In production, **low relevance feels like failure**, even if the answer is correct.

---

### Relevance vs Faithfulness vs Correctness

| Metric       | Checks                       |
| ------------ | ---------------------------- |
| Relevance    | Does it answer the question? |
| Faithfulness | Is it supported by context?  |
| Correctness  | Is it factually right?       |

An answer can be:

* Relevant but unfaithful
* Faithful but irrelevant
* Correct but irrelevant

---

### Conceptual Flow

![Image](https://images.ctfassets.net/otwaplf7zuwf/2aXqN1u0QPT1ST23Na7r8Y/03a9cb3f9206ed66362adef4ebcd3631/image.png)

![Image](https://miro.medium.com/v2/resize%3Afit%3A1400/1%2AYXv5YPOI0U9tWP7AG-gNJw.png)

![Image](https://www.deepchecks.com/wp-content/uploads/2025/04/img-rag-evaluation.jpg)

```
User Question
   ↓
LLM Answer
   ↓
Relevance Check
   → On-topic ✅
   → Off-topic ❌
```


---

### Example Question

#### Question

```text
What is Retrieval Augmented Generation (RAG)?
```

---

### Relevant vs Irrelevant Answers

#### Relevant Answer ✅

```text
RAG combines document retrieval with language model generation to answer questions.
```

✔ Directly answers *what RAG is*.

---

#### Irrelevant Answer ❌

```text
Large language models are trained on massive datasets using transformers.
```

✔ Factually correct

❌ Completely unrelated to the question

---

### Relevance Evaluation Using LangChain

#### Load a Relevance Evaluator

```python
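from langchain.evaluation import load_evaluator

# "criteria" is an LLM-as-a-judge evaluator, so it needs a judge model.
# With no `llm` argument it falls back to a default OpenAI chat model,
# which requires an OpenAI API key to be configured.
relevance_eval = load_evaluator(
    "criteria",
    criteria="relevance"
)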
```

---

### Evaluate a Relevant Answer

```python
result = relevance_eval.evaluate_strings(
    input="What is RAG?",
    prediction="RAG combines document retrieval with language model generation."
)

print(result)
```

**Example Output**

```python
{"reasoning": "The answer directly addresses the question.", "value": "Y", "score": 1}
```

---

### Evaluate an Irrelevant Answer

```python
result = relevance_eval.evaluate_strings(
    input="What is RAG?",
    prediction="Transformers are neural network architectures for NLP."
)

print(result)
```

**Example Output**

```python
{"reasoning": "The answer does not address the question.", "value": "N", "score": 0}
```

---

### Partial Relevance (Common Case)

#### Partially Relevant Answer ⚠️

```text
RAG uses retrieval, and transformers are commonly used in NLP models.
```

* First clause: relevant
* Second clause: unnecessary

Typical score from a graded judge (the binary `criteria` evaluator above returns only 0 or 1):

```
0.4 – 0.6
```
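One way a graded score in that range could arise is clause-level scoring: judge each clause separately, then take the fraction judged relevant. The sketch below assumes the per-clause labels already exist (for example, from a judge model). The function name `graded_relevance` and the labels are hypothetical, and this is not how LangChain computes its score.

```python
from typing import Sequence


def graded_relevance(clause_is_relevant: Sequence[bool]) -> float:
    """Fraction of clauses judged relevant; 0.0 for an empty answer."""
    if not clause_is_relevant:
        return 0.0
    return sum(clause_is_relevant) / len(clause_is_relevant)


# Hypothetical labels for the answer above:
# "RAG uses retrieval" -> relevant, "transformers are commonly used..." -> not
clause_labels = [True, False]
partial_score = graded_relevance(clause_labels)
print(partial_score)  # 0.5
```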

---

### Relevance in RAG Pipelines

#### Where Relevance Is Applied

In RAG, relevance is checked at **two levels**:

| Level               | What Is Checked                          |
| ------------------- | ---------------------------------------- |
| Answer relevance    | Does the answer match the question?      |
| Retrieval relevance | Are retrieved docs related to the query? |

Answer relevance ≠ Retrieval relevance.
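To make the distinction concrete, retrieval relevance can be sketched as the fraction of retrieved documents judged related to the query. It is computed before any answer exists. The judge is passed in as a function. `toy_judge` below is a deliberately naive keyword stand-in, used only so the example runs offline. A real pipeline would more likely plug in an LLM judge or a reranker score.

```python
import re
from typing import Callable, Sequence


def retrieval_relevance(
    query: str,
    docs: Sequence[str],
    is_relevant: Callable[[str, str], bool],
) -> float:
    """Fraction of retrieved docs judged relevant to the query."""
    if not docs:
        return 0.0
    hits = sum(1 for doc in docs if is_relevant(query, doc))
    return hits / len(docs)


def toy_judge(query: str, doc: str) -> bool:
    """Hypothetical stand-in judge: any longer query word appears in the doc."""
    terms = {w for w in re.findall(r"[a-z]+", query.lower()) if len(w) > 3}
    return any(term in doc.lower() for term in terms)


retrieved = [
    "Retrieval Augmented Generation pairs a retriever with a language model.",
    "Retrieval quality depends on the embedding model and index.",
    "Our office is closed on public holidays.",
]
retrieval_score = retrieval_relevance(
    "What is Retrieval Augmented Generation?", retrieved, toy_judge
)
print(retrieval_score)  # two of the three docs are judged relevant
```

A high retrieval score does not guarantee a relevant answer, and a low one does not rule it out, which is why the two levels are scored separately.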

---

### Relevance Check on RAG Output

```python
# `question` is the user query; `rag_answer` is the pipeline's final output.
result = relevance_eval.evaluate_strings(
    input=question,
    prediction=rag_answer
)
```

This checks that:

* The model didn’t drift off-topic
* The answer stayed focused on the question
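Running this check over a whole evaluation set can be sketched as a small report function. The helper `relevance_report` and the offline `stub_evaluate` are hypothetical. The stub only mimics the keyword-argument shape of `evaluate_strings` and returns a dict with a `"score"` key. In principle the real `relevance_eval.evaluate_strings` could be passed in its place.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (question, answer)


def relevance_report(
    pairs: List[Pair],
    evaluate: Callable[..., Dict],
    min_score: float = 1.0,
) -> Dict:
    """Score every (question, answer) pair and collect those below min_score."""
    scores = [evaluate(input=q, prediction=a)["score"] for q, a in pairs]
    failures = [pair for pair, s in zip(pairs, scores) if s < min_score]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"mean_score": mean, "failures": failures}


def stub_evaluate(*, input: str, prediction: str) -> Dict:
    """Offline stand-in for a judge: 'relevant' only if the answer mentions RAG."""
    return {"score": 1 if "RAG" in prediction else 0}


pairs = [
    ("What is RAG?", "RAG combines document retrieval with language model generation."),
    ("What is RAG?", "Transformers are neural network architectures for NLP."),
]
report = relevance_report(pairs, stub_evaluate)
print(report["mean_score"])  # 0.5
print(report["failures"])    # the transformer answer
```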

---

### Common Relevance Failure Patterns

| Pattern               | Example                           |
| --------------------- | --------------------------------- |
| Topic drift           | Answer shifts mid-way             |
| Over-generalization   | Talks broadly, not specifically   |
| Template leakage      | Canned boilerplate, not tailored  |
| Keyword matching only | Mentions terms but doesn’t answer |
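The last pattern is also why plain keyword overlap makes a poor relevance metric. The toy scorer below (a hypothetical `keyword_overlap`, not a LangChain function) gives a perfect score to an answer that merely repeats the question's terms and zero to one that actually answers in different words.

```python
import re


def keyword_overlap(question: str, answer: str) -> float:
    """Share of the question's words that also appear in the answer."""
    q_words = set(re.findall(r"[a-z]+", question.lower()))
    a_words = set(re.findall(r"[a-z]+", answer.lower()))
    if not q_words:
        return 0.0
    return len(q_words & a_words) / len(q_words)


question = "What is Retrieval Augmented Generation (RAG)?"

# Mentions every term but never says what RAG is.
evasive = "Retrieval Augmented Generation (RAG) is what many teams talk about."
# Answers the question using none of its words.
direct = "It combines document search with a language model to produce answers."

print(keyword_overlap(question, evasive))  # 1.0
print(keyword_overlap(question, direct))   # 0.0
```

The scores come out inverted relative to what a human reader would say, which is the motivation for a judge that reads for meaning.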

---

### Improving Relevance

* Use intent-focused prompts
  *“Answer **only** what is asked.”*
* Penalize verbosity
* Use relevance thresholds
* Combine with faithfulness
* Reject low-relevance outputs
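The threshold and reject steps can be combined into a simple regenerate-until-relevant gate. This is a minimal sketch: `answer_with_relevance_gate`, `fake_generate`, and `fake_score` are hypothetical names, and the generator and scorer are stubs standing in for a real model call and a real relevance evaluator.

```python
from typing import Callable, Optional


def answer_with_relevance_gate(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first answer scoring >= threshold, or None if all attempts fail."""
    for _ in range(max_attempts):
        answer = generate(question)
        if score(question, answer) >= threshold:
            return answer
    return None  # caller decides how to handle a rejected question


# Stubs so the sketch runs offline: the first attempt is off-topic.
canned = iter([
    "Large language models are trained on massive datasets.",
    "RAG combines document retrieval with language model generation.",
])


def fake_generate(question: str) -> str:
    return next(canned)


def fake_score(question: str, answer: str) -> float:
    return 1.0 if "RAG" in answer else 0.0


result = answer_with_relevance_gate("What is RAG?", fake_generate, fake_score)
print(result)  # the second, on-topic answer
```

Capping `max_attempts` keeps cost bounded. Returning `None` instead of the best failed attempt makes the rejection explicit to the caller.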

---

### Relevance Thresholds (Typical)

| Score     | Action              |
| --------- | ------------------- |
| ≥ 0.8     | Accept              |
| 0.5 – 0.8 | Review / regenerate |
| < 0.5     | Reject              |
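The table maps directly onto a small routing function. The sketch below assumes scores normalized to [0, 1] and sends the boundary value 0.8 to "accept", following the `≥ 0.8` row. The function name and action strings are illustrative.

```python
def relevance_action(score: float) -> str:
    """Map a relevance score in [0, 1] to an action using the thresholds above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.8:
        return "accept"
    if score >= 0.5:
        return "review"  # review or regenerate
    return "reject"


for s in (0.92, 0.8, 0.65, 0.5, 0.3):
    print(s, relevance_action(s))
```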

---

### Mental Model

Relevance answers:

> **“If I read only the question and the answer, do they match?”**

---

### Key Takeaways

* Relevance measures **question–answer alignment**
* Independent of correctness and faithfulness
* Essential for good UX
* Typically evaluated with an LLM-as-a-judge
* Mandatory metric for RAG and agents