```{contents}
```
## Regression Testing

**Regression testing** ensures that **existing behavior does not degrade** when you change something in your system—such as:

* Prompt updates
* Model upgrades
* Retriever / chunking changes
* Tool or agent logic changes

In LLM systems, regression testing answers:

> **Did this change make answers worse than before?**

It is a **mandatory practice** for production LLM, RAG, and agent systems.

It is commonly implemented with LangChain, with results tracked in LangSmith.

---

### Why Regression Testing Is Critical for LLMs

Unlike traditional software:

* LLM outputs are **non-deterministic**
* Small prompt changes can cause **large behavior shifts**
* Model upgrades can silently break quality

Without regression testing:

* Hallucinations slip into production
* Answer quality degrades unnoticed
* Trust is lost

---

### What Can Regress in LLM Systems

| Area         | Example Regression         |
| ------------ | -------------------------- |
| Prompt       | More verbose, less precise |
| RAG          | Wrong documents retrieved  |
| Faithfulness | New hallucinations         |
| Relevance    | Topic drift                |
| Correctness  | Factually wrong answers    |
| Latency      | Slower responses           |
| Cost         | Token explosion            |

---

### Regression Testing Architecture

![Image](https://cdn.prod.website-files.com/660ef16a9e0687d9cc27474a/66d91e930996182ebc2b81f4_668296a3c80f1adbb05ede7c_01_llm_regression_testing_process-min.png)

![Image](https://exactpro.com/sites/default/files/inline-images/4%20SOR%20RAG.png)

![Image](https://arize.com/wp-content/uploads/2024/12/image1-1024x256.png)

```
Baseline Version (v1)
        ↓
 Evaluation Dataset
        ↓
 Metrics (saved)
        ↓
 New Version (v2)
        ↓
 Same Dataset
        ↓
 Compare Metrics → Pass / Fail
```
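The "Metrics (saved)" and "Compare Metrics" steps in the flow above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names (`save_baseline`, `load_baseline`, `metric_deltas`), the JSON file layout, and the metric values are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(metrics: dict[str, float], path: Path) -> None:
    """Write baseline metrics to disk so later runs can be compared to them."""
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))


def load_baseline(path: Path) -> dict[str, float]:
    """Read back the metrics written by save_baseline."""
    return json.loads(path.read_text())


def metric_deltas(baseline: dict[str, float], new: dict[str, float]) -> dict[str, float]:
    """Return (new - baseline) for every metric present in both runs."""
    return {name: new[name] - baseline[name] for name in baseline if name in new}


# Hypothetical numbers, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline_metrics.json"
    save_baseline({"correctness": 0.90, "faithfulness": 0.95}, baseline_file)
    stored = load_baseline(baseline_file)

deltas = metric_deltas(stored, {"correctness": 0.80, "faithfulness": 0.96})
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

Storing the baseline as a file keeps the v1 numbers fixed, so v2 is always compared against the same reference rather than a fresh (and possibly different) v1 run.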

---

### Create a Fixed Regression Dataset

#### Regression Dataset (Golden Set)

```python
regression_data = [
    {
        "question": "What is RAG?",
        "expected": "RAG combines document retrieval with language model generation."
    },
    {
        "question": "What is token streaming?",
        "expected": "Token streaming sends output incrementally as it is generated."
    }
]
```

This dataset must:

* Stay **stable**
* Represent **real user queries**
* Cover **edge cases**
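One way to enforce the "stay stable" requirement is to pin a fingerprint of the dataset and check it before each run. The sketch below assumes the dataset is a list of JSON-serializable dicts; `dataset_fingerprint` and `golden_set` are hypothetical names chosen for this example.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict]) -> str:
    """Hash the dataset contents; key order within a row does not matter, row order does."""
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


golden_set = [
    {"question": "What is RAG?",
     "expected": "RAG combines document retrieval with language model generation."},
    {"question": "What is token streaming?",
     "expected": "Token streaming sends output incrementally as it is generated."},
]

# Record this value alongside the baseline metrics.
pinned = dataset_fingerprint(golden_set)

# Simulate an accidental edit to one expected answer.
edited = [dict(row) for row in golden_set]
edited[0]["expected"] = "RAG is a retrieval technique."

print(dataset_fingerprint(golden_set) == pinned)  # True
print(dataset_fingerprint(edited) == pinned)      # False
```

If the fingerprint differs from the pinned value, the baseline metrics no longer describe the same dataset and should be regenerated before any comparison.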

---

### Define Baseline and New Chains

#### Baseline Chain (v1)

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm_v1 = ChatOpenAI(model="gpt-3.5-turbo")

prompt_v1 = ChatPromptTemplate.from_template(
    "Answer clearly:\n{question}"
)

# Pipe the prompt into the model to form the baseline chain.
chain_v1 = prompt_v1 | llm_v1
```

---

#### New Chain (v2 – After Change)

```python
llm_v2 = ChatOpenAI(model="gpt-4")

prompt_v2 = ChatPromptTemplate.from_template(
    "Answer the question concisely and accurately:\n{question}"
)

chain_v2 = prompt_v2 | llm_v2
```

---

### Run Both Versions on the Same Dataset

#### Collect Outputs

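```python
def run_chain(chain, data):
    """Run every dataset question through the chain and collect the answers."""
    results = []
    for row in data:
        out = chain.invoke({"question": row["question"]})
        results.append({
            "question": row["question"],
            "answer": out.content,  # chat models return a message; keep its text
            "expected": row["expected"]
        })
    return results

# Same dataset for both versions, so the results are directly comparable.
baseline_results = run_chain(chain_v1, regression_data)
new_results = run_chain(chain_v2, regression_data)
```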

---

### Evaluate and Compare (Correctness Example)

#### Correctness Evaluation

```python
from langchain.evaluation import load_evaluator

# LLM-graded evaluator that judges each answer against its reference.
correctness_eval = load_evaluator(
    "labeled_criteria",
    criteria="correctness"
)

def score(results):
    """Return the mean correctness score (0 to 1) across all results."""
    scores = []
    for r in results:
        s = correctness_eval.evaluate_strings(
            input=r["question"],
            prediction=r["answer"],
            reference=r["expected"]
        )
        # If the grader's verdict cannot be parsed, the score is None; count it as 0.
        scores.append(s.get("score") or 0)
    return sum(scores) / len(scores)

baseline_score = score(baseline_results)
new_score = score(new_results)

print("Baseline score:", baseline_score)
print("New score:", new_score)
```

---

#### Regression Decision

```python
# Fail only when the new score drops more than 0.05 below the baseline.
if new_score < baseline_score - 0.05:
    raise RuntimeError("❌ Regression detected")
else:
    print("✅ No regression")
```

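This introduces a **tolerance band**: because LLM outputs vary from run to run, a small drop should not fail the check. The `0.05` margin is an example value; tune it to the noise you observe in your own evaluations.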
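Because a single run can be noisy, one option is to score each version several times and compare the means. The sketch below is an assumption-laden illustration: `is_regression` is a hypothetical helper, the scores are made-up values, and it assumes higher scores are better.

```python
from statistics import mean


def is_regression(baseline_scores: list[float],
                  new_scores: list[float],
                  tolerance: float = 0.05) -> bool:
    """True if the mean new score falls more than `tolerance` below the mean baseline."""
    return mean(new_scores) < mean(baseline_scores) - tolerance


# Hypothetical scores from three repeated evaluation runs per version.
baseline_runs = [0.90, 0.90, 0.90]
small_dip = [0.88, 0.90, 0.86]   # mean 0.88, inside the band
large_drop = [0.80, 0.78, 0.82]  # mean 0.80, outside the band

print(is_regression(baseline_runs, small_dip))
print(is_regression(baseline_runs, large_drop))
```

Averaging over repeated runs makes the decision less sensitive to one unlucky sample; the number of runs and the tolerance both trade evaluation cost against stability.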

---

### Regression Testing for RAG (Faithfulness)

#### Faithfulness Regression Check

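```python
from langchain.evaluation import load_evaluator

# LangChain has no built-in "faithfulness" evaluator, so define it as a
# custom criterion that is graded against the retrieved context.
faithfulness_eval = load_evaluator(
    "labeled_criteria",
    criteria={
        "faithfulness": "Is every claim in the submission supported by the reference context?"
    }
)

def faithfulness_score(answer, docs, question):
    # Join the retrieved documents into a single reference context.
    context = "\n".join(d.page_content for d in docs)
    return faithfulness_eval.evaluate_strings(
        input=question,
        prediction=answer,
        reference=context
    )["score"]
```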

Fail if:

* Faithfulness score drops
* New hallucinations appear
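The second condition, new hallucinations, is a per-example check rather than an average: an unchanged mean can hide one answer that flipped from faithful to unfaithful. A minimal sketch, assuming binary scores where 1 means faithful and 0 means unfaithful; `newly_unfaithful` and the sample scores are hypothetical.

```python
def newly_unfaithful(baseline: dict[str, int], new: dict[str, int]) -> list[str]:
    """Questions scored faithful (1) at baseline but unfaithful (0) in the new run."""
    return sorted(
        question
        for question, old_score in baseline.items()
        if old_score == 1 and new.get(question) == 0
    )


# Hypothetical per-question faithfulness scores for each version.
baseline_faithful = {
    "What is RAG?": 1,
    "What is token streaming?": 1,
    "What is chunking?": 0,
}
new_faithful = {
    "What is RAG?": 1,
    "What is token streaming?": 0,  # regressed
    "What is chunking?": 0,         # already failing, so not a new hallucination
}

flipped = newly_unfaithful(baseline_faithful, new_faithful)
print(flipped)
```

Reporting the specific questions that flipped gives reviewers something concrete to inspect, which an aggregate score alone does not.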

---

### Latency & Cost Regression

#### Performance Regression Check

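```python
import time
from langchain_community.callbacks import get_openai_callback

# The callback tallies token usage and estimated cost for OpenAI calls made inside the block.
with get_openai_callback() as cb:
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    chain_v2.invoke({"question": "Explain RAG"})
    latency = time.perf_counter() - start

print("Latency (s):", latency)
print("Cost (USD):", cb.total_cost)
```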

Fail if:

* Latency rises beyond the SLA
* Cost rises beyond budget
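A single timed call says little about tail latency, so these two conditions are usually checked over several samples. The sketch below is one possible shape, not a standard: `p95`, `check_budget`, the limits, and the sample values are all assumptions for illustration.

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile of the samples (needs at least two data points)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def check_budget(latencies_s: list[float],
                 costs_usd: list[float],
                 max_p95_latency_s: float,
                 max_mean_cost_usd: float) -> list[str]:
    """Return a list of human-readable budget violations (empty means pass)."""
    failures = []
    if p95(latencies_s) > max_p95_latency_s:
        failures.append(f"p95 latency {p95(latencies_s):.2f}s exceeds {max_p95_latency_s}s")
    if statistics.mean(costs_usd) > max_mean_cost_usd:
        failures.append(f"mean cost ${statistics.mean(costs_usd):.4f} exceeds ${max_mean_cost_usd}")
    return failures


# Hypothetical measurements from five calls; one call is a slow outlier.
latencies = [0.8, 0.9, 1.0, 1.1, 4.0]
costs = [0.002, 0.002, 0.002, 0.002, 0.002]

violations = check_budget(latencies, costs, max_p95_latency_s=2.0, max_mean_cost_usd=0.01)
for v in violations:
    print("FAIL:", v)
```

Using a percentile for latency and a mean for cost is a design choice: SLAs are typically phrased in terms of tail latency, while budgets are driven by total spend.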

---

### CI/CD Integration (Real-World)

Typical pipeline:

```
PR opened
  ↓
Run regression dataset
  ↓
Evaluate metrics
  ↓
Compare with baseline
  ↓
Fail build if regression
```
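The last two pipeline steps can be sketched as a gate that turns a metric comparison into a process exit code, which is what most CI systems use to pass or fail a build. Everything here is illustrative: `find_regressions`, `ci_exit_code`, the metric names, the 5% relative tolerance, and the numbers are assumptions.

```python
def find_regressions(baseline: dict[str, float],
                     new: dict[str, float],
                     rel_tol: float = 0.05,
                     lower_is_better: frozenset = frozenset({"latency_s", "cost_usd"})) -> list[str]:
    """Names of metrics that got worse by more than rel_tol relative to baseline."""
    regressed = []
    for name, old in baseline.items():
        current = new[name]  # raises KeyError if the new run is missing a metric
        if name in lower_is_better:
            worse = current > old * (1 + rel_tol)
        else:
            worse = current < old * (1 - rel_tol)
        if worse:
            regressed.append(name)
    return regressed


def ci_exit_code(baseline: dict[str, float], new: dict[str, float]) -> int:
    """Return 1 (fail the build) if any metric regressed, else 0."""
    regressed = find_regressions(baseline, new)
    for name in regressed:
        print(f"REGRESSION {name}: {baseline[name]} -> {new[name]}")
    return 1 if regressed else 0


# Hypothetical metrics: quality held steady but latency grew by 50%.
baseline_metrics = {"correctness": 0.90, "latency_s": 1.0}
candidate_metrics = {"correctness": 0.91, "latency_s": 1.5}

exit_code = ci_exit_code(baseline_metrics, candidate_metrics)
# In a CI script, finish with: raise SystemExit(exit_code)
```

Treating quality metrics as higher-is-better and latency and cost as lower-is-better in one place avoids separate ad hoc checks for each metric.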

LangSmith can:

* Store baselines
* Compare versions visually
* Track regressions over time

---

### Common Regression Failure Patterns

| Pattern            | Cause                |
| ------------------ | -------------------- |
| Lower correctness  | Prompt drift         |
| New hallucinations | Retrieval changes    |
| Lower relevance    | Over-verbose prompts |
| Latency spike      | Larger context       |
| Cost spike         | More tokens          |

---

### Best-Practice Regression Strategy

| Stage          | What to Regress         |
| -------------- | ----------------------- |
| Prompt changes | Relevance, correctness  |
| RAG changes    | Faithfulness, retrieval |
| Model upgrade  | All metrics             |
| Prod           | Sampled regression      |
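The table above can be encoded as a small lookup so that a pipeline runs only the evaluations a given change calls for. This is a sketch under assumptions: the change-type keys, the metric names, and the choice to fall back to all metrics for an unrecognized change type are illustrative, not part of any library.

```python
# Full metric set, taken from the metrics discussed in this section.
ALL_METRICS = frozenset(
    {"correctness", "relevance", "faithfulness", "retrieval", "latency", "cost"}
)

# Which metrics to re-check for each kind of change.
METRICS_BY_CHANGE = {
    "prompt": frozenset({"relevance", "correctness"}),
    "rag": frozenset({"faithfulness", "retrieval"}),
    "model": ALL_METRICS,
}


def metrics_for(change_types: list[str]) -> set[str]:
    """Union of the metrics required by every change type in the list."""
    selected: set[str] = set()
    for change in change_types:
        # Assumption: an unknown change type is treated conservatively.
        selected |= METRICS_BY_CHANGE.get(change, ALL_METRICS)
    return selected


print(sorted(metrics_for(["prompt"])))
print(sorted(metrics_for(["prompt", "rag"])))
```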

---

### Mental Model

Regression testing is **unit testing for LLM behavior**.

```
Known good behavior → lock it → compare every change
```

---

### Key Takeaways

* Regression testing prevents silent quality loss
* Always compare against a baseline
* Use fixed datasets
* Evaluate correctness, relevance, faithfulness, latency, cost
* Mandatory for production LLM and RAG systems