```{contents}
```
## LLM Evaluation 


**LLM Evaluation** is the systematic process of **measuring the quality, reliability, and usefulness** of LLM outputs against defined criteria such as correctness, relevance, faithfulness, safety, latency, and cost.

Evaluation answers:

* *Is the model giving correct answers?*
* *Is RAG grounded in sources?*
* *Is one prompt/model better than another?*
* *Did performance regress after a change?*

LangChain provides evaluators through `langchain.evaluation`, and LangSmith can be used to store, compare, and visualize evaluation runs.

---

### Why LLM Evaluation Is Mandatory

Without evaluation:

* You rely on intuition
* Regressions go unnoticed
* Hallucinations ship to prod

With evaluation:

* Objective quality metrics
* Safe prompt/model iteration
* Continuous improvement
* Production confidence

---

### What Can Be Evaluated

| Dimension    | What It Measures                 |
| ------------ | -------------------------------- |
| Correctness  | Is the answer right?             |
| Relevance    | Does it answer the question?     |
| Faithfulness | Is it grounded in context (RAG)? |
| Safety       | Is the output policy-compliant?  |
| Consistency  | Is behavior stable across runs?  |
| Latency      | How fast is it?                  |
| Cost         | How expensive is it?             |

---

### Evaluation Architecture

![Image](https://www.researchgate.net/publication/383918543/figure/fig3/AS%3A11431281277260711%401726025711738/Evaluation-pipeline-Each-green-square-represents-a-call-to-an-LLM-while-the-blue-dotted.png)

![Image](https://images.ctfassets.net/otwaplf7zuwf/77WwI4e8hpjjIzazrSEdTp/95391e9b0e5d9c39a1d57690ebdcdea9/image.png)

![Image](https://i0.wp.com/gradientflow.com/wp-content/uploads/2024/08/RAG-Architecture.png?fit=1568%2C909\&ssl=1)

```
Dataset → Prompt/Chain → LLM Output → Evaluators → Scores → Report
```
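A minimal sketch of the pipeline above, using only the standard library. The names `fake_chain`, `exact_match`, `non_empty`, and `run_eval` are hypothetical, and the canned answer stands in for a real LLM call.

```python
from statistics import mean

# Hypothetical stand-in for the chain under test; a real run would call the LLM here.
def fake_chain(question: str) -> str:
    canned = {
        "What is RAG?": "Retrieval Augmented Generation combines retrieval with generation."
    }
    return canned.get(question, "I don't know.")

# Each evaluator maps (prediction, expected) to a score in [0, 1].
def exact_match(prediction: str, expected: str) -> float:
    return 1.0 if prediction.strip().lower() == expected.strip().lower() else 0.0

def non_empty(prediction: str, expected: str) -> float:
    return 1.0 if prediction.strip() else 0.0

def run_eval(dataset, chain, evaluators):
    """Dataset -> chain -> output -> evaluators -> scores -> report."""
    rows = []
    for item in dataset:
        output = chain(item["question"])
        scores = {name: fn(output, item["expected"]) for name, fn in evaluators.items()}
        rows.append({"question": item["question"], "output": output, "scores": scores})
    # The report is the mean of each evaluator's score over the dataset.
    report = {name: mean(row["scores"][name] for row in rows) for name in evaluators}
    return rows, report

sample_dataset = [
    {"question": "What is RAG?",
     "expected": "Retrieval Augmented Generation combines retrieval with generation."},
    {"question": "What is token streaming?",
     "expected": "Streaming returns tokens incrementally as they are generated."},
]

rows, report = run_eval(
    sample_dataset, fake_chain, {"exact_match": exact_match, "non_empty": non_empty}
)
print(report)
```

Keeping evaluators as plain functions with a shared signature makes it easy to add a metric without touching the loop.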

---

### Dataset-Based Evaluation (Offline)

#### Prepare an Evaluation Dataset

```python
dataset = [
    {
        "question": "What is RAG?",
        "expected": "Retrieval Augmented Generation combines retrieval with generation."
    },
    {
        "question": "What is token streaming?",
        "expected": "Streaming returns tokens incrementally as they are generated."
    }
]
```

---

#### Define the Chain Under Test

```python
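from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Requires OPENAI_API_KEY in the environment.
# temperature=0 reduces run-to-run variation, which makes scores easier to compare.
llm = ChatOpenAI(temperature=0)

prompt = ChatPromptTemplate.from_template(
    "Answer the question clearly:\n{question}"
)

# LCEL pipe: the formatted prompt is passed to the model.
chain = prompt | llm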
```

---

#### Run the Chain on the Dataset

```python
predictions = []
for item in dataset:
    out = chain.invoke({"question": item["question"]})
    predictions.append({
        "question": item["question"],
        "answer": out.content,
        "expected": item["expected"]
    })
```

---

### Automatic LLM-as-a-Judge Evaluation

#### Create a Correctness Evaluator

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator(
    "labeled_criteria",
    criteria="correctness"
)
```

---

#### Score Predictions

```python
for p in predictions:
    result = evaluator.evaluate_strings(
        prediction=p["answer"],
        reference=p["expected"],
        input=p["question"]
    )
    print(p["question"], "→", result["score"])
```

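**Typical Output**

```
What is RAG? → 1
What is token streaming? → 1
```

The criteria evaluator returns a binary score: `1` if the judge decides the criterion is met, `0` otherwise. The result also includes `value` (`"Y"` or `"N"`) and the judge's `reasoning`.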
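Per-example scores are easier to act on once aggregated. The sketch below is one possible summary, assuming scores where higher is better; `summarize`, the `threshold` default, and the example values are hypothetical and not part of LangChain.

```python
from statistics import mean

def summarize(scores, threshold=0.5):
    """Return the mean score and the share of examples at or above the threshold."""
    if not scores:
        raise ValueError("no scores to summarize")
    passed = sum(1 for s in scores if s >= threshold)
    return {"mean": mean(scores), "pass_rate": passed / len(scores), "n": len(scores)}

# Illustrative values only; in practice, collect result["score"] for each prediction.
example_scores = [1, 1, 0, 1]
summary = summarize(example_scores)
print(summary)
```

Reporting `n` next to the averages is a reminder that a small dataset gives a noisy estimate.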

---

### RAG Faithfulness Evaluation

#### Faithfulness 

```python
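from langchain.evaluation import load_evaluator

# load_evaluator has no built-in "faithfulness" type, so this defines a custom
# criterion and passes the retrieved context as the reference.
faithfulness = load_evaluator(
    "labeled_criteria",
    criteria={"faithfulness": "Is every claim in the submission supported by the reference?"},
)

result = faithfulness.evaluate_strings(
    prediction="RAG retrieves documents then generates answers.",
    input="Explain RAG",
    reference="Retrieved docs explain RAG combines retrieval + generation.",
)

print(result)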
```

This flags likely **hallucinations**: answers that the retrieved context does not support.

---

### Pairwise Model / Prompt Comparison

#### A/B Evaluation

```python
from langchain.evaluation import load_evaluator

pairwise = load_evaluator("pairwise_string")

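result = pairwise.evaluate_string_pairs(
    input="Explain RAG",
    prediction="RAG uses retrieval + LLM generation.",
    prediction_b="RAG just searches the web."
)

# "value" is the preferred answer ("A" or "B"); "reasoning" explains the choice.
print(result["value"], result["reasoning"])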
```

The evaluator returns a dict stating which answer the judge prefers (`value`: `"A"` or `"B"`) together with its `reasoning`.
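LLM judges can be sensitive to the order in which the two answers are shown. A common mitigation, sketched here under stated assumptions, is to judge both orderings and count a win only when the verdicts agree. `stub_judge` is a hypothetical stand-in that simply prefers the longer answer; in practice it would wrap a real pairwise evaluator.

```python
from collections import Counter

def stub_judge(question, answer_a, answer_b):
    """Hypothetical judge: prefers the longer answer, returns None on a tie."""
    if len(answer_a) == len(answer_b):
        return None
    return "A" if len(answer_a) > len(answer_b) else "B"

def compare(question, candidate, baseline, judge):
    """Judge both orderings; only count a win when the two verdicts agree."""
    first = judge(question, candidate, baseline)   # candidate shown as A
    second = judge(question, baseline, candidate)  # candidate shown as B
    if first == "A" and second == "B":
        return "candidate"
    if first == "B" and second == "A":
        return "baseline"
    return "tie"

pairs = [
    ("Explain RAG", "RAG uses retrieval + LLM generation.", "RAG just searches the web."),
    ("Explain RAG", "Same length.", "Same length."),
]

tally = Counter(compare(q, cand, base, stub_judge) for q, cand, base in pairs)
win_rate = tally["candidate"] / len(pairs)
print(dict(tally), win_rate)
```

Treating disagreements as ties is conservative: it lowers the reported win rate but avoids crediting wins that depend on presentation order.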

---

### Latency & Cost Evaluation (Quantitative)

#### Measure Latency and Cost

```python
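import time

from langchain_community.callbacks import get_openai_callback

# The callback accumulates token usage and estimated cost for OpenAI calls made inside the block.
with get_openai_callback() as cb:
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    chain.invoke({"question": "Explain evaluation"})
    latency = time.perf_counter() - start

print(f"Latency: {latency:.2f} s")
print("Tokens:", cb.total_tokens)
print(f"Cost: ${cb.total_cost:.4f}")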
```

You can now report **quality, latency, and cost** side by side.
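A single timing is noisy, so latency is usually reported as percentiles over several runs. The sketch below uses only the standard library; `measure_latencies`, `summarize_latency`, and `fake_call` are hypothetical names, and `fake_call` stands in for `chain.invoke(...)`.

```python
import time
from statistics import median, quantiles

def measure_latencies(fn, runs=20):
    """Call fn repeatedly and return the wall-clock duration of each call in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize_latency(latencies):
    """Return p50, p95, and max. Needs at least two data points."""
    # n=20 yields 19 cut points; index 18 is the 95th percentile.
    # method="inclusive" keeps the estimate within the observed range.
    p95 = quantiles(latencies, n=20, method="inclusive")[18]
    return {"p50": median(latencies), "p95": p95, "max": max(latencies)}

# Hypothetical stand-in for chain.invoke({"question": ...}).
def fake_call():
    time.sleep(0.001)

latencies = measure_latencies(fake_call, runs=20)
stats = summarize_latency(latencies)
print(stats)
```

The p95 figure is often more useful than the mean, since a few slow responses dominate how users perceive the system.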

---

### Continuous Evaluation with Tracing

#### Enable Tracing for Eval Runs

```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=llm-evaluation
```

Each traced evaluation run records:

* Inputs/outputs
* Scores
* Tokens
* Latency

Regressions are not recorded directly; they surface when you compare runs across prompt or model versions in LangSmith.

---

### Manual (Human-in-the-Loop) Evaluation

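For compliance-sensitive or high-risk use cases, add human review:

* Human reviewers score the outputs
* Their labels are stored alongside each example
* The labels serve as the gold standard
* Human labels are combined with automated evals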
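One way to combine the two is to check how closely an automated judge tracks human labels before relying on it at scale. The sketch below computes raw agreement and Cohen's kappa for binary labels; the `agreement` function and the label lists are made up for illustration.

```python
def agreement(human, judge):
    """Raw agreement and Cohen's kappa for two lists of binary (0/1) labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's share of positive labels.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    # Degenerate case: both raters gave the same single label to every example.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"observed": observed, "kappa": kappa}

# Illustrative labels only: 1 = acceptable, 0 = not acceptable.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1]

stats = agreement(human_labels, judge_labels)
print(stats)
```

Kappa corrects for agreement that would happen by chance, so it is a stricter check than raw agreement when most labels fall in one class.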

---

### Common Evaluation Pitfalls

* Evaluating without a dataset
* Using only one metric
* Ignoring faithfulness in RAG
* No baseline for comparison
* Evaluating only quality, not cost/latency

---

### Best-Practice Evaluation Strategy

| Stage        | Evaluation         |
| ------------ | ------------------ |
| Prompt dev   | LLM-as-judge       |
| RAG tuning   | Faithfulness       |
| Model choice | Pairwise           |
| Pre-prod     | Dataset regression |
| Prod         | Sampling + tracing |
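For the production row, sampling can be made deterministic by hashing a request identifier, so the same request is always in or out of the evaluated subset. This is a sketch under assumptions: `should_sample`, the `req-N` ID format, and the 5% rate are hypothetical.

```python
import hashlib

def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

request_ids = [f"req-{i}" for i in range(10_000)]
sampled = [rid for rid in request_ids if should_sample(rid, 0.05)]
print(len(sampled), "of", len(request_ids), "requests selected for evaluation")
```

Hash-based sampling needs no shared state between workers, which random sampling with a per-process seed does not guarantee.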

---

### Mental Model

LLM Evaluation is **unit testing for intelligence**.

```
Change prompt/model → run eval → compare scores → ship safely
```
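The "compare scores" step can be automated as a gate in CI. The sketch below assumes metrics where higher is better; `regression_gate`, the `tolerance` default, and the score values are hypothetical. Latency and cost would need the comparison inverted.

```python
def regression_gate(baseline, candidate, tolerance=0.02):
    """Fail if any metric is missing or drops by more than `tolerance` versus the baseline."""
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or base_score - new_score > tolerance:
            regressions[metric] = (base_score, new_score)
    return {"passed": not regressions, "regressions": regressions}

# Illustrative mean scores only, not real measurements.
baseline_scores = {"correctness": 0.90, "faithfulness": 0.85}
candidate_scores = {"correctness": 0.91, "faithfulness": 0.78}

gate = regression_gate(baseline_scores, candidate_scores)
print(gate)
```

A small tolerance keeps the gate from failing on judge noise while still catching meaningful drops.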

---

### Key Takeaways

* LLM evaluation makes quality measurable
* It covers correctness, faithfulness, safety, cost, and latency
* Can be automated or human-reviewed
* Essential for production-grade AI systems