```{contents}
```
## RAG and Knowledge Grounding Metrics

**RAG (Retrieval-Augmented Generation) and knowledge grounding metrics** evaluate whether:

1. The **right documents were retrieved**
2. The **retrieved knowledge was relevant**
3. The **LLM used only that knowledge**
4. The answer **did not hallucinate**

They answer the most important RAG question:

> **Did the model say only what it could justify from retrieved knowledge?**

These metrics are **essential** for enterprise, regulated, and high-trust systems, where an unsupported answer can carry legal, financial, or safety consequences.

---

### Why RAG Metrics Are Separate from Quality Metrics

A response can be:

* Relevant ✅
* Fluent ✅
* Confident ✅

and still be **hallucinated**.

RAG metrics focus on **evidence alignment**, not language quality.

---

### Categories of RAG & Grounding Metrics

```
RAG Metrics
│
├── Retrieval Quality Metrics
├── Context Quality Metrics
├── Grounding / Faithfulness Metrics
└── Hallucination Metrics
```

---

### Retrieval Quality Metrics

#### Retrieval Precision@K

**What it measures**
The fraction of the top-K retrieved documents that are actually relevant.

$$
\text{Precision@K} = \frac{\text{Relevant Docs Retrieved}}{K}
$$

**Demonstration**



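```python
def precision_at_k(retrieved_docs, relevant_docs, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    retrieved = retrieved_docs[:k]
    relevant = set(relevant_docs)  # set gives O(1) membership checks
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / k
```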




---

#### Retrieval Recall@K

**What it measures**
The fraction of all relevant documents that appear in the top-K results.

$$
\text{Recall@K} = \frac{\text{Relevant Docs Retrieved}}{\text{Total Relevant Docs}}
$$



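```python
def recall_at_k(retrieved_docs, relevant_docs, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    retrieved = set(retrieved_docs[:k])
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0  # no relevant documents defined; avoid division by zero
    return len(retrieved & relevant) / len(relevant)
```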




**Key rule**
If recall = 0, none of the required evidence reaches the model, so any answer it produces is ungrounded.

---

#### Mean Reciprocal Rank (MRR)

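**What it measures**
How early the first relevant document appears, averaged over a set of queries $Q$.

$$
\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}
$$

Here $\text{rank}_q$ is the rank of the first relevant document for query $q$. A query with no relevant document retrieved contributes 0.

```python
def reciprocal_rank(retrieved_docs, relevant_docs):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_docs)
    for i, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (retrieved_docs, relevant_docs) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```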
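
Retrieval metrics are usually reported as averages over an evaluation set rather than for a single query. The following is a minimal, self-contained sketch of that aggregation. The function name `evaluate_retrieval`, the document IDs, and the relevance labels are all hypothetical and chosen only for illustration.

```python
from statistics import mean

def evaluate_retrieval(runs, k):
    """Average Precision@K, Recall@K and reciprocal rank over (retrieved, relevant) pairs."""
    precisions, recalls, rranks = [], [], []
    for retrieved, relevant in runs:
        relevant = set(relevant)
        top_k = retrieved[:k]
        hits = [d for d in top_k if d in relevant]
        precisions.append(len(hits) / k)
        recalls.append(len(set(hits)) / len(relevant))
        # Reciprocal rank looks at the full ranking, not only the top k.
        rank = next((i for i, d in enumerate(retrieved, start=1) if d in relevant), None)
        rranks.append(1 / rank if rank else 0.0)
    return {
        "precision@k": mean(precisions),
        "recall@k": mean(recalls),
        "mrr": mean(rranks),
    }

# Hypothetical toy data: two queries with hand-labelled relevant document IDs.
runs = [
    (["d1", "d7", "d3", "d9"], ["d1", "d3", "d5"]),
    (["d4", "d2", "d8", "d6"], ["d2"]),
]
scores = evaluate_retrieval(runs, k=4)
print(scores)
```

For this toy data the averages are Precision@4 = 0.375, Recall@4 ≈ 0.833, and MRR = 0.75. The sketch assumes every query has at least one labelled relevant document.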




---

### Context Quality Metrics

#### Context Relevance

**What it measures**
Are the retrieved documents relevant to the question?



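```python
from langchain_classic.evaluation import load_evaluator

# Example question and context for demonstration
question = "What is RAG in machine learning?"
context_text = """
RAG stands for Retrieval-Augmented Generation. It is a technique that combines
document retrieval with language model generation. The system first retrieves
relevant documents from a knowledge base, then uses those documents as context
to generate accurate, grounded responses.
"""

# Built-in "relevance" criterion, judged by an LLM.
# With no `llm=` argument, load_evaluator falls back to a default chat model,
# so the corresponding API key must be configured.
context_relevance = load_evaluator("criteria", criteria="relevance")

context_relevance.evaluate_strings(
    input=question,
    prediction=context_text
)
```

```text
{'reasoning': 'The criterion asks if the submission is referring to a real quote from the text. However, there is no text provided for the submission to quote from. The submission is answering the question about what RAG in machine learning is, but it is not quoting from any text. Therefore, the submission does not meet the criterion.\n\nN',
 'value': 'N',
 'score': 0}
```

The score is 0 even though the context clearly addresses the question. As the judge's reasoning shows, the built-in `relevance` criterion asks whether the submission refers to a real quote from the text, which is not the same as asking whether the context is relevant to the question. To measure context relevance, define a custom criterion instead, as the Context Coverage example does for completeness.

Filtering out irrelevant context prevents **prompt pollution**.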

---

#### Context Coverage

**What it measures**
Does the context contain all necessary information?

Often measured by:

* Keyword coverage
* LLM-as-judge completeness
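
Keyword coverage, the first option in the list above, can be approximated without an LLM. The sketch below is one simple way to do it, assuming you can list the keywords an ideal context should contain. The function name `keyword_coverage` and the keyword list are hypothetical.

```python
import re

def keyword_coverage(required_keywords, context):
    """Fraction of required keywords that occur as whole words in the context."""
    # Lower-case and tokenize so that matching is case-insensitive.
    tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    keywords = [kw.lower() for kw in required_keywords]
    if not keywords:
        return 0.0
    found = [kw for kw in keywords if kw in tokens]
    return len(found) / len(keywords)

# Hypothetical example: the context mentions two of the three expected keywords.
context = "RAG combines document retrieval with language model generation."
coverage = keyword_coverage(["retrieval", "generation", "embedding"], context)
print(f"Keyword coverage: {coverage:.2f}")
```

Exact word matching is deliberately crude: it misses synonyms and multi-word phrases, which is why LLM-as-judge completeness is often used alongside it.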



LLM-as-judge completeness can be expressed as a custom criterion:

```python
# Reuses `load_evaluator`, `question`, and `context_text` from the Context Relevance example.
# Define custom completeness criteria
custom_criteria = {
    "completeness": "Does the context contain all necessary information to answer the question comprehensively?"
}

coverage_eval = load_evaluator("criteria", criteria=custom_criteria)

# Example usage
coverage_result = coverage_eval.evaluate_strings(
    input=question,
    prediction=context_text
)
print(f"Coverage score: {coverage_result['score']}")
print(f"Reasoning: {coverage_result['reasoning']}")
```

```text
Coverage score: 1
Reasoning: The criterion is completeness. The question asks for the meaning of RAG in machine learning. 

The submitted answer provides a definition of RAG, explaining that it stands for Retrieval-Augmented Generation. It also explains what this technique does, combining document retrieval with language model generation. The answer further explains how the system works, retrieving relevant documents from a knowledge base and using those documents as context to generate responses. 

The answer seems to cover all necessary information to answer the question comprehensively. It provides a definition, explains the technique, and describes how it works. 

Therefore, the submission meets the criterion of completeness. 

Y
```




---

### Knowledge Grounding / Faithfulness Metrics

#### Faithfulness (Groundedness)

**What it measures**
Is every claim in the answer supported by retrieved context?



```python
# Define answer for demonstration
answer = """
RAG (Retrieval-Augmented Generation) is a machine learning technique that combines
information retrieval with text generation. It retrieves relevant documents and uses
them as context for generating responses.
"""

# Option 1: Use context_qa evaluator (checks if answer is supported by context)
context_qa_eval = load_evaluator("context_qa")

faithfulness_result = context_qa_eval.evaluate_strings(
    input=question,
    prediction=answer,
    reference=context_text
)

print("Faithfulness evaluation:")
print(f"Score: {faithfulness_result.get('score', 'N/A')}")
print(f"Reasoning: {faithfulness_result.get('reasoning', 'N/A')}")

# Option 2: Use custom criteria for faithfulness
faithfulness_criteria = {
    "faithfulness": "Is every claim in the answer supported by the provided context? The answer should not include information that cannot be verified from the context."
}

# "labeled_criteria" passes `reference` to the judge; the plain "criteria" evaluator ignores it
faithfulness_eval = load_evaluator("labeled_criteria", criteria=faithfulness_criteria)

result = faithfulness_eval.evaluate_strings(
    input=question,
    prediction=answer,
    reference=context_text
)

print(f"\nCustom Faithfulness score: {result['score']}")
print(f"Reasoning: {result['reasoning']}")
```

Output for Option 1:

```text
Faithfulness evaluation:
Score: 1
Reasoning: CORRECT
```

Option 2 needs the `labeled_criteria` evaluator. The run recorded below used the plain `criteria` evaluator, which ignores `reference` and emits a warning. The judge never saw the retrieved context, treated the question as the only context, and scored a faithful answer 0:

```text
To use references, use the labeled_criteria instead.
  result = faithfulness_eval.evaluate_strings(

Custom Faithfulness score: 0
Reasoning: The criterion is faithfulness, which requires that every claim in the answer is supported by the provided context. The context in this case is the question "What is RAG in machine learning?"

The submission claims that RAG (Retrieval-Augmented Generation) is a machine learning technique that combines information retrieval with text generation. It also claims that RAG retrieves relevant documents and uses them as context for generating responses.

The context, which is the question, does not provide any information to verify these claims. Therefore, the submission does not meet the criterion of faithfulness.

N
```




This is the **single most important RAG metric**.

---

#### Attribution Accuracy

**What it measures**
Can the answer be traced to specific sources?

```text
Answer: RAG combines retrieval with generation. [Doc1]
```

Used in:

* Legal
* Healthcare
* Enterprise search
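
One rough way to score attribution is to check that every claim carries a citation and that each cited ID belongs to a document that was actually retrieved. The sketch below assumes the `[DocN]` citation style shown above, with citations placed after the sentence they support. The function name `attribution_accuracy`, the regular expressions, and the sample answer are hypothetical.

```python
import re

# A claim is one sentence plus any [DocN] citations that directly follow it.
CLAIM_PATTERN = re.compile(r"[^.!?]+[.!?](?:\s*\[\w+\])*")
CITATION_PATTERN = re.compile(r"\[(\w+)\]")

def attribution_accuracy(answer, retrieved_ids):
    """Share of claims that cite at least one source, all of which were retrieved."""
    claims = [c.strip() for c in CLAIM_PATTERN.findall(answer)]
    if not claims:
        return 0.0
    retrieved = set(retrieved_ids)
    attributed = 0
    for claim in claims:
        cited = CITATION_PATTERN.findall(claim)
        # Uncited claims and claims citing unknown documents both count as failures.
        if cited and all(doc_id in retrieved for doc_id in cited):
            attributed += 1
    return attributed / len(claims)

# Hypothetical answer: one valid citation, one uncited claim, one unknown source.
answer_with_citations = (
    "RAG combines retrieval with generation. [Doc1] "
    "It was invented in 1950. "
    "It uses a knowledge base. [Doc9]"
)
score = attribution_accuracy(answer_with_citations, retrieved_ids=["Doc1", "Doc2"])
print(f"Attribution accuracy: {score:.2f}")
```

This only verifies that citations exist and resolve to retrieved documents. Whether the cited document really supports the claim still needs a faithfulness check. The sentence splitter is naive and will mis-split text containing decimals or abbreviations.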

---

### Hallucination Metrics

#### Hallucination Rate

**What it measures**
Percentage of answers that include unsupported claims.

$$
\text{Hallucination Rate} = \frac{\text{Unfaithful Answers}}{\text{Total Answers}}
$$

```python
hallucination_rate = unfaithful_answers / total_answers
```
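
As a minimal sketch, the two counts can be derived from per-answer faithfulness scores. This assumes binary scores, where 1 means faithful and 0 means unfaithful, as the criteria-style evaluators in this section return. The function name and the sample scores are hypothetical.

```python
def hallucination_rate(faithfulness_scores):
    """Share of answers judged unfaithful (score 0) among all scored answers."""
    if not faithfulness_scores:
        raise ValueError("need at least one scored answer")
    unfaithful_answers = sum(1 for s in faithfulness_scores if s == 0)
    total_answers = len(faithfulness_scores)
    return unfaithful_answers / total_answers

# Hypothetical evaluation run: 2 of 10 answers were judged unfaithful.
scores = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
rate = hallucination_rate(scores)
print(f"Hallucination rate: {rate:.0%}")
```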

---

#### Unsupported Claim Ratio

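**What it measures**
The fraction of individual claims in an answer that the retrieved context does not support. Hallucination rate labels a whole answer as faithful or not, whereas this ratio captures **partial hallucinations**, where only some of the claims are unsupported.

$$
\text{Unsupported Claim Ratio} = \frac{\text{Unsupported Claims}}{\text{Total Claims}}
$$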
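
The sketch below illustrates a claim-level check using token overlap as a stand-in for a real support judgment. It assumes the answer has already been split into claims. The function names, the `min_overlap` threshold of 0.6, and the sample data are hypothetical, and in practice an entailment model or LLM judge would replace the overlap test.

```python
import re

def _tokens(text):
    """Lower-cased word tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unsupported_claim_ratio(claims, context, min_overlap=0.6):
    """Fraction of claims whose tokens are mostly absent from the context."""
    if not claims:
        return 0.0
    context_tokens = _tokens(context)
    unsupported = 0
    for claim in claims:
        claim_tokens = _tokens(claim)
        if claim_tokens:
            overlap = len(claim_tokens & context_tokens) / len(claim_tokens)
        else:
            overlap = 0.0
        # A claim counts as unsupported when too few of its tokens appear in the context.
        if overlap < min_overlap:
            unsupported += 1
    return unsupported / len(claims)

# Hypothetical data: the first claim is supported, the second is not.
context = "RAG retrieves relevant documents from a knowledge base and uses them as context."
claims = [
    "RAG retrieves relevant documents.",
    "RAG was invented in 1950.",
]
ratio = unsupported_claim_ratio(claims, context)
print(f"Unsupported claim ratio: {ratio:.2f}")
```

Lexical overlap cannot detect a claim that reuses the context's words but changes their meaning, so treat this as a cheap first filter rather than a faithful measure.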

---

### End-to-End RAG Evaluation Flow (Demo)

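```python
# Assumes `retriever` and `rag_chain` are already built, and that
# `faithfulness_eval` is the evaluator from the Faithfulness section
# (load it with "labeled_criteria" so that `reference` is actually used).
docs = retriever.invoke(question)
context = "\n".join(d.page_content for d in docs)

# `.content` assumes the chain returns a chat message; drop it if the chain returns a string
answer = rag_chain.invoke(question).content

faithfulness_score = faithfulness_eval.evaluate_strings(
    input=question,
    prediction=answer,
    reference=context
)["score"]
```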

---

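### Acceptance Thresholds (Industry Practice)

The values below are common starting points rather than formal standards. Calibrate them against your own data and risk tolerance.

| Metric             | Typical Threshold |
| ------------------ | ----------------- |
| Faithfulness       | ≥ 0.8             |
| Recall@K           | ≥ 0.9             |
| Precision@K        | ≥ 0.6             |
| Hallucination Rate | ≤ 5%              |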
---

### Common RAG Failure Patterns

| Failure                | Metric That Detects It |
| ---------------------- | ---------------------- |
| Wrong documents        | Recall@K               |
| Too much noise         | Precision@K            |
| Hallucinations         | Faithfulness           |
| Incomplete answers     | Context Coverage       |
| Over-confident answers | Attribution Accuracy   |

---

### Mental Model

```
RAG Quality =
Did we fetch the right knowledge?
Did we give enough knowledge?
Did the model stick to that knowledge?
```

If any answer is **no**, the RAG system failed.

---

### Key Takeaways

* RAG metrics evaluate **evidence usage**, not language quality
* Faithfulness is the most critical metric
* Retrieval failure leaves generation without evidence: if the right documents are missing, the answer cannot be grounded
* Hallucination detection is mandatory in production
* These metrics protect trust, safety, and compliance

