```{contents}
```
## Answer Correctness


**Answer correctness** measures whether an LLM’s output is **factually correct with respect to an expected ground truth**, regardless of *where the information came from*.

In simple terms:

> **Is the answer right?**

Answer correctness is a core evaluation metric in LangChain-based evaluation pipelines and applies to:

* Plain LLM Q&A
* RAG outputs (checked after faithfulness)
* Agent decisions
* Prompt/model regression testing

---

### Why Answer Correctness Matters

An answer can be:

* Relevant ✅
* Faithful ✅
* Fluent ✅

and still be **wrong**.

Correctness ensures:

* Factual accuracy
* Trust in system outputs
* Safe deployment in enterprise use cases

In many systems, **correctness is the final gate** before shipping.

---

### Correctness vs Relevance vs Faithfulness

| Metric       | Checks                              |
| ------------ | ----------------------------------- |
| Correctness  | Is the answer factually right?      |
| Relevance    | Does it answer the question?        |
| Faithfulness | Is it grounded in provided context? |

Examples:

* Correct but unfaithful → Uses outside knowledge
* Faithful but incorrect → Context itself is wrong
* Relevant but incorrect → Answers the question wrongly
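
The combinations above can be turned into a small triage helper. This is a minimal sketch that assumes each metric has already been reduced to a boolean; `Verdict` and `diagnose` are hypothetical names used for illustration and are not part of LangChain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    correct: bool    # matches the ground truth
    relevant: bool   # addresses the question
    faithful: bool   # grounded in the provided context


def diagnose(v: Verdict) -> str:
    """Map a combination of metric outcomes to a likely cause."""
    if v.correct and not v.faithful:
        return "correct but unfaithful: answer relies on outside knowledge"
    if v.faithful and not v.correct:
        return "faithful but incorrect: the context itself is likely wrong"
    if v.relevant and not v.correct:
        return "relevant but incorrect: answers the question wrongly"
    if v.correct:
        return "correct"
    return "incorrect"
```

For example, `diagnose(Verdict(correct=False, relevant=True, faithful=True))` points at the context rather than the model, which suggests fixing the source documents first.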

---

### Conceptual Flow

![Image](https://images.ctfassets.net/otwaplf7zuwf/2tNy3bcdnxBV6ced1QEjcW/149cebc79f9215159e79d1ac9836bc5f/image.png)

![Image](https://images.ctfassets.net/otwaplf7zuwf/77WwI4e8hpjjIzazrSEdTp/95391e9b0e5d9c39a1d57690ebdcdea9/image.png)

![Image](https://chatgen.ai/wp-content/uploads/2023/12/ezgif-2-dec5b52644-1024x566.jpeg)

```text
Question + Expected Answer
        ↓
   LLM Answer
        ↓
 Correctness Check
   → Correct ✅
   → Incorrect ❌
```
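
The flow above can be sketched in a few lines. This is an illustrative outline only: `EvalCase`, `normalized_match`, and `check_case` are hypothetical names, and the normalized exact-match check is a deliberately simple stand-in for a real correctness judge such as an LLM-based evaluator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected: str  # ground-truth answer


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def normalized_match(prediction: str, reference: str) -> bool:
    """Stand-in correctness check: normalized exact match."""
    return _normalize(prediction) == _normalize(reference)


def check_case(
    case: EvalCase,
    answer_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool] = normalized_match,
) -> bool:
    answer = answer_fn(case.question)         # LLM answer
    return is_correct(answer, case.expected)  # correctness check
```

Passing the check as a parameter keeps the flow the same when the exact-match stand-in is swapped for a semantic judge.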

---

### Example Question and Ground Truth

#### Question

```text
What is Retrieval Augmented Generation (RAG)?
```

#### Expected (Ground Truth) Answer

```text
RAG combines document retrieval with language model generation.
```

---

### Correct vs Incorrect Answers

#### Correct Answer ✅

```text
RAG combines document retrieval with language model generation.
```

* ✔ Matches the expected meaning
* ✔ Factually accurate

---

#### Incorrect Answer ❌

```text
RAG fine-tunes large language models during inference.
```

* ❌ Factually wrong
* ❌ Contradicts the known definition

---

### Answer Correctness Evaluation Using LangChain

#### Load a Correctness Evaluator

```python
from langchain.evaluation import load_evaluator

correctness_eval = load_evaluator(
    "labeled_criteria",
    criteria="correctness"
)
```

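This uses an **LLM-as-a-judge**: a judge model compares the prediction against the reference and returns a verdict together with its reasoning. Unless you pass an `llm=` argument, `load_evaluator` falls back to a default OpenAI chat model, which requires an API key.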

---

#### Evaluate a Correct Answer

```python
result = correctness_eval.evaluate_strings(
    input="What is RAG?",
    prediction="RAG combines document retrieval with language model generation.",
    reference="RAG combines document retrieval with language model generation."
)

print(result)
```

**Example Output**

```python
{"reasoning": "The answer is factually correct.", "value": "Y", "score": 1}
```

---

#### Evaluate an Incorrect Answer

```python
result = correctness_eval.evaluate_strings(
    input="What is RAG?",
    prediction="RAG fine-tunes models during inference.",
    reference="RAG combines document retrieval with language model generation."
)

print(result)
```

**Example Output**

```python
{"reasoning": "The answer contains factual errors.", "value": "N", "score": 0}
```

---

### Partial Correctness (Very Common)

#### Partially Correct Answer ⚠️

```text
RAG retrieves documents and uses them with an LLM.
```

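* ✔ Core idea present
* ❌ Incomplete definition (the generation step is only implied)

Typical score from a graded judge:

```text
0.6 – 0.8
```

The `labeled_criteria` evaluator loaded above returns a binary verdict (score 0 or 1), so it cannot produce a score in this range. Graded scores require a scoring evaluator, such as LangChain's `labeled_score_string`, which rates on a 1 to 10 scale that can be normalized to 0 to 1.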

---

### Correctness in RAG Pipelines

In RAG systems, correctness is evaluated **after**:

1. Retrieval quality
2. Faithfulness (groundedness)

```python
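# `question` and `rag_answer` come from your RAG pipeline, `expected_answer`
# from your ground-truth dataset; `correctness_eval` is the evaluator loaded earlier.
result = correctness_eval.evaluate_strings(
    input=question,
    prediction=rag_answer,
    reference=expected_answer
)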
```

Correctness ensures:

* Even grounded answers are **factually right**
* Outdated or misleading sources are detected
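
The ordering described in this section (retrieval, then faithfulness, then correctness) can be sketched as a staged gate. This is a rough outline under stated assumptions: `first_failed_stage` is a hypothetical helper, each scorer is assumed to return a value in [0, 1], and the 0.8 threshold is an illustrative default rather than a LangChain setting.

```python
from typing import Callable, Optional

Scorer = Callable[[], float]  # assumed to return a score in [0, 1]


def first_failed_stage(
    stages: list[tuple[str, Scorer]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Run stages in order; return the name of the first one below threshold."""
    for name, scorer in stages:
        if scorer() < threshold:
            return name  # later stages are skipped
    return None  # every stage passed


# Placeholder scores standing in for real evaluator calls.
stages = [
    ("retrieval", lambda: 0.9),
    ("faithfulness", lambda: 0.95),
    ("correctness", lambda: 0.4),
]
print(first_failed_stage(stages))
```

Stopping at the first failing stage makes it clear where to look: a correctness failure after retrieval and faithfulness have passed suggests the sources themselves are outdated or wrong.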

---

### Common Correctness Failure Patterns

| Pattern             | Example                 |
| ------------------- | ----------------------- |
| Outdated facts      | Old version numbers     |
| Confident errors    | Fluent but wrong        |
| Over-generalization | Missing key constraints |
| Misinterpretation   | Wrong definition        |

---

### Improving Answer Correctness

* Maintain high-quality ground truth datasets
* Use multiple evaluators (ensemble judging)
* Combine with relevance + faithfulness
* Reject low-score outputs
* Add human review for edge cases
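
Ensemble judging from the list above can be as simple as averaging or voting over several judge scores. This is a minimal sketch, assuming each judge has already produced a score in [0, 1]; `ensemble_score`, `majority_correct`, and the 0.5 pass mark are illustrative choices, not a LangChain API.

```python
from statistics import mean


def ensemble_score(scores: list[float]) -> float:
    """Average the scores from several independent judges."""
    if not scores:
        raise ValueError("at least one judge score is required")
    return mean(scores)


def majority_correct(scores: list[float], pass_mark: float = 0.5) -> bool:
    """True if strictly more than half of the judges rate the answer as passing."""
    votes = sum(score >= pass_mark for score in scores)
    return votes * 2 > len(scores)
```

Averaging suits graded judges, while majority voting suits binary verdicts. A tie counts as not correct here, which errs on the side of sending the answer to review.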

---

### Typical Correctness Thresholds

| Score     | Action |
| --------- | ------ |
| ≥ 0.8     | Accept |
| 0.6 – <0.8 | Review |
| < 0.6     | Reject |

Thresholds depend on the **risk level** of the application.
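
The threshold table can be applied with a small routing function. This is a sketch under stated assumptions: `route` is a hypothetical helper, scores are assumed to lie in [0, 1], and a score of exactly 0.8 is treated as "accept". The cutoffs are parameters so they can be tightened for higher-risk applications.

```python
def route(score: float, accept_at: float = 0.8, review_at: float = 0.6) -> str:
    """Map a correctness score in [0, 1] to an action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= accept_at:
        return "accept"
    if score >= review_at:
        return "review"
    return "reject"
```

A high-risk deployment might call `route(score, accept_at=0.95, review_at=0.8)` so that more answers go to human review.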

---

### Mental Model

Answer correctness answers one question:

> **“If a domain expert reads this answer, would they say it’s right?”**

---

### Key Takeaways

* Answer correctness measures **factual accuracy**
* Independent of relevance and faithfulness
* Evaluated using LLM-as-a-judge + ground truth
* Critical for regression testing and production readiness
* Final quality gate in many LLM systems
