```{contents}
```
## Faithfulness


**Faithfulness** measures whether an LLM’s answer is **strictly grounded in the provided context** and **does not introduce unsupported facts** (hallucinations).

In RAG systems, faithfulness answers one critical question:

> **Is every claim in the answer supported by the retrieved documents?**

Faithfulness is a **core evaluation dimension** in LangChain and RAG systems.

```
Context (Sources)
   ↓
LLM Answer
   ↓
Faithfulness Check
   → Supported ✅
   → Unsupported ❌ (Hallucination)
```

---

### Why Faithfulness Is Critical

An answer can be fluent and confident and still be wrong:

* Fluent ✅
* Confident ✅
* Wrong ❌

Faithfulness ensures:

* No hallucinated facts
* Trustworthy answers
* Explainability via sources
* Production safety

In enterprise RAG, faithfulness is **more important than fluency**.

---

### Faithfulness vs Correctness

| Dimension              | Faithfulness | Correctness |
| ---------------------- | ------------ | ----------- |
| Depends on context     | ✅            | ❌           |
| Detects hallucinations | ✅            | ❌           |
| Checks factual truth   | ❌            | ✅           |
| RAG-specific           | ✅            | ❌           |

An answer can be **correct but unfaithful** if it uses outside knowledge. Conversely, it can be **faithful but incorrect** if the retrieved context itself is wrong.

---

### Architecture View

![Image](https://miro.medium.com/v2/resize%3Afit%3A1400/0%2A0VW2uHaAq2dB_AuH.png)

![Image](https://substackcdn.com/image/fetch/%24s_%21mGi3%21%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1c4ac0-aa62-40a6-8c6d-446756abee60_1400x1018.png)

![Image](https://miro.medium.com/v2/resize%3Afit%3A1208/0%2AgJeMETTHMmgG2siR.png)

---

### Example Context and Question

#### Retrieved Context

```text
RAG (Retrieval-Augmented Generation) combines document retrieval
with language model generation to answer user questions.
```

#### Question

```text
What is RAG?
```

---

### Faithful vs Unfaithful Answers

#### Faithful Answer ✅

```text
RAG combines document retrieval with language model generation.
```

✔ Every statement appears in the context.

---

#### Unfaithful Answer ❌

```text
RAG retrieves documents and fine-tunes the model in real time.
```

❌ “fine-tunes the model” is **not present** in the context → hallucination.

---

### Faithfulness Evaluation Using LangChain

#### Load the Faithfulness Evaluator

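LangChain's `load_evaluator` has no built-in `"faithfulness"` type, so faithfulness is defined here as a custom criterion on the `labeled_criteria` evaluator. The evaluator is LLM-backed and uses an OpenAI chat model unless an `llm=` argument is passed. The import path may differ across LangChain versions.

```python
from langchain.evaluation import load_evaluator

# Custom criterion: the "reference" slot will carry the retrieved context.
criteria = {"faithfulness": "Is every claim in the submission supported by the reference?"}
faithfulness_eval = load_evaluator("labeled_criteria", criteria=criteria)
```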

---

#### Evaluate a Faithful Answer

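```python
result = faithfulness_eval.evaluate_strings(
    input="What is RAG?",
    prediction="RAG combines document retrieval with language model generation.",
    # The retrieved context from the example above, passed as `reference`.
    reference=(
        "RAG (Retrieval-Augmented Generation) combines document retrieval "
        "with language model generation to answer user questions."
    ),
)

print(result)
```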

**Output (example)**

```python
{"score": 1.0, "reasoning": "All claims are supported by the context."}
```

---

#### Evaluate an Unfaithful Answer

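```python
result = faithfulness_eval.evaluate_strings(
    input="What is RAG?",
    prediction="RAG retrieves documents and fine-tunes the model in real time.",
    # The retrieved context from the example above, passed as `reference`.
    reference=(
        "RAG (Retrieval-Augmented Generation) combines document retrieval "
        "with language model generation to answer user questions."
    ),
)

print(result)
```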

**Output (example)**

```python
{"score": 0.0, "reasoning": "The answer includes unsupported claims."}
```

A score of 0 flags the answer as a **hallucination**.

---

### Faithfulness in a Real RAG Pipeline

#### RAG Output + Retrieved Docs

```python
# `response` is the chat model's reply; `retrieved_docs` are the retriever's documents.
answer = response.content
context = "\n".join(doc.page_content for doc in retrieved_docs)
```

---

#### Faithfulness Check on RAG Output

```python
faithfulness_eval.evaluate_strings(
    input=question,
    prediction=answer,
    reference=context
)
```

This is a common way for production RAG systems to **validate grounding** before returning an answer.
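The two steps above can be wrapped into a single helper. The sketch below is a minimal illustration, not LangChain API: `check_faithfulness`, `Doc`, and `StubEvaluator` are hypothetical names. The only assumption about the evaluator is that it exposes `evaluate_strings(input=..., prediction=..., reference=...)` and returns a dict with a `score` key. The stub replaces the LLM-backed evaluator with a toy word-overlap rule so the example runs offline.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document with a `page_content` field."""
    page_content: str


class StubEvaluator:
    """Hypothetical offline stand-in for an LLM-backed evaluator."""

    def evaluate_strings(self, *, input, prediction, reference):
        # Toy rule for the demo only: the answer counts as grounded
        # when every word it uses also occurs in the context.
        answer_words = set(re.findall(r"[a-z]+", prediction.lower()))
        context_words = set(re.findall(r"[a-z]+", reference.lower()))
        return {"score": 1.0 if answer_words <= context_words else 0.0}


def check_faithfulness(evaluator, question, answer, docs):
    """Join the retrieved documents into one context and score the answer against it."""
    context = "\n".join(doc.page_content for doc in docs)
    return evaluator.evaluate_strings(
        input=question, prediction=answer, reference=context
    )


docs = [Doc("RAG (Retrieval-Augmented Generation) combines document retrieval "
            "with language model generation to answer user questions.")]
grounded = check_faithfulness(
    StubEvaluator(), "What is RAG?",
    "RAG combines document retrieval with language model generation.", docs)
ungrounded = check_faithfulness(
    StubEvaluator(), "What is RAG?",
    "RAG retrieves documents and fine-tunes the model in real time.", docs)
print(grounded["score"], ungrounded["score"])
```

Passing the evaluator as an argument keeps the helper independent of any particular judge, so the stub can be swapped for an LLM-backed evaluator without changing the call site.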

---

### Interpreting Faithfulness Scores

| Score | Meaning            |
| ----- | ------------------ |
| 1.0   | Fully grounded     |
| 0.5   | Partially grounded |
| 0.0   | Hallucinated       |
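
One way intermediate scores such as 0.5 can arise is claim-level scoring: split the answer into claims, judge each one against the context, and report the supported fraction. The sketch below assumes that definition, which this page does not itself fix. `faithfulness_score` is a hypothetical helper, and the verdicts are hand-labeled for illustration; in practice an LLM judge would produce them.

```python
def faithfulness_score(verdicts):
    """Return the fraction of claims judged as supported by the context.

    `verdicts` maps each claim to True (supported) or False (unsupported).
    """
    if not verdicts:
        raise ValueError("at least one claim is required")
    return sum(verdicts.values()) / len(verdicts)


# Hand-labeled verdicts for the unfaithful answer from the earlier example.
verdicts = {
    "RAG retrieves documents": True,
    "RAG fine-tunes the model in real time": False,
}
score = faithfulness_score(verdicts)
print(score)  # 0.5
```

Under this definition the unfaithful answer scores 0.5 rather than 0.0, because one of its two claims is supported. A binary evaluator collapses the same answer to 0.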

Thresholds depend on the application; a common starting point is:

* **≥ 0.8 → Accept**
* **< 0.8 → Reject / regenerate**
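
The accept / regenerate rule can be sketched as a small gate around the generator. This is a minimal sketch under stated assumptions: `answer_with_gate` is a hypothetical helper, `generate` and `score` are caller-supplied callables, and `max_attempts=3` is an arbitrary choice. The stubs below replace the LLM and the evaluator so the example runs offline.

```python
def answer_with_gate(generate, score, threshold=0.8, max_attempts=3):
    """Regenerate until an answer meets the faithfulness threshold.

    `generate()` returns a candidate answer; `score(answer)` returns a float in [0, 1].
    Returns (answer, score) for the first accepted answer,
    or (None, best_score_seen) if every attempt is rejected.
    """
    best_score = 0.0
    for _ in range(max_attempts):
        answer = generate()
        value = score(answer)
        if value >= threshold:
            return answer, value
        best_score = max(best_score, value)
    return None, best_score


# Stubs for illustration: the first draft is rejected, the second is accepted.
drafts = iter([
    "RAG fine-tunes the model in real time.",
    "RAG combines document retrieval with language model generation.",
])
scores = {
    "RAG fine-tunes the model in real time.": 0.0,
    "RAG combines document retrieval with language model generation.": 1.0,
}

answer, value = answer_with_gate(lambda: next(drafts), scores.get)
print(answer, value)
```

Returning `None` when all attempts fail lets the caller fall back to a refusal message instead of shipping an ungrounded answer.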

---

### Common Faithfulness Failure Patterns

| Pattern             | Example                   |
| ------------------- | ------------------------- |
| Added facts         | “fine-tuned in real time” |
| External knowledge  | Info not in docs          |
| Over-generalization | Claims beyond context     |
| Assumptions         | “typically”, “usually”    |
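
The "Assumptions" pattern lends itself to a cheap lexical pre-check before an LLM judge is called. The sketch below is a rough heuristic, not a faithfulness metric: `unsupported_hedges` is a hypothetical helper, and the word list contains only the two examples from the table, so it would need extending for real use.

```python
import re

# Hedge words taken from the table above; extend as needed.
HEDGE_WORDS = {"typically", "usually"}


def unsupported_hedges(answer, context):
    """Return hedge words that occur in the answer but not in the context."""
    answer_words = set(re.findall(r"[a-z]+", answer.lower()))
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    return sorted((answer_words & HEDGE_WORDS) - context_words)


context = "RAG combines document retrieval with language model generation."
flagged = unsupported_hedges("RAG is typically used with vector databases.", context)
print(flagged)  # ['typically']
```

A hit only suggests that the answer generalizes beyond the sources. It does not prove a hallucination, and an empty result does not prove grounding.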

---

### Faithfulness vs Safety

Faithfulness protects against:

* Misinformation
* Legal risk
* Medical / financial hallucinations

It is a **precondition for safety**, not a replacement.

---

### Best Practices to Improve Faithfulness

* Use strict prompts:

  > “Answer using **only** the provided context”
* Reduce context size to cut irrelevant passages
* Improve retrieval precision
* Use citations
* Reject low-faithfulness outputs
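
The strict-prompt and citation practices can be combined in one prompt builder. This is a minimal sketch: `build_grounded_prompt` is a hypothetical helper, and the instruction wording and the `[n]` citation format are assumptions rather than a fixed convention.

```python
def build_grounded_prompt(context_chunks, question):
    """Build a prompt that restricts the model to numbered context chunks."""
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer using only the provided context. "
        "Cite the supporting chunk number in brackets after each claim. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    ["RAG combines document retrieval with language model generation."],
    "What is RAG?",
)
print(prompt)
```

Numbering the chunks gives each claim a citation target, which makes unsupported claims easier to spot during evaluation.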

---

### Mental Model

Faithfulness is a **contract**:

> *If it’s not in the context, it must not be in the answer.*

---

### Key Takeaways

* Faithfulness checks **groundedness**, not truth
* Core metric for RAG evaluation
* Detects hallucinations relative to the retrieved context
* More important than fluency
* Mandatory for production RAG systems