# Feature Track 1: Evaluation & Validation

---

Shipping a RAG system without systematic evaluation is like navigating without instruments. The pipeline may *seem* to work on the queries you tested by hand, but you have no way to know where it breaks, how often, or whether a change you made helped or hurt.

**Evaluation closes the feedback loop:**

```
Change a parameter  ──►  Measure quantitatively  ──►  Decide based on data
```

**Prerequisite:** Run `feature0_baseline_rag.ipynb` Steps 1–2 first to build the vector store.

| Notebook | Focus |
|---|---|
| Feature 0 | Working baseline prototype |
| **Feature Track 1 (this notebook)** | Quantitative evaluation |
| Feature Track 2 | Reliable, structured outputs |
| Feature Track 3 | Better retrieval strategies |
| Feature Track 4 | Multi-step agent workflows |

---

## Foundation

### Why Systematic Evaluation?
Suppose you change the chunk size from 800 to 400 characters. Did that help? How would you know?

Without metrics, you are forced to manually re-read answers for a handful of test queries and guess. With metrics, you run the evaluation suite and get a number: one you can track across changes and use to justify decisions.
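
A minimal sketch of that comparison, assuming you already have per-query scores for both configurations. The score lists, the `compare_runs` helper, and the 0.02 threshold are hypothetical and only illustrate the idea:

```python
from statistics import mean


def compare_runs(baseline: list[float], candidate: list[float], min_delta: float = 0.02) -> str:
    """Classify a change by the difference in mean score between two runs."""
    delta = mean(candidate) - mean(baseline)
    if delta > min_delta:
        return "improvement"
    if delta < -min_delta:
        return "regression"
    return "no clear change"


# Hypothetical per-query scores for the same test queries (not measured results).
scores_chunk_800 = [0.9, 0.7, 0.8, 0.6]
scores_chunk_400 = [0.9, 0.8, 0.9, 0.8]

verdict = compare_runs(scores_chunk_800, scores_chunk_400)
print(f"chunk size 800 -> 400: {verdict}")
```

With only a handful of queries, small differences are mostly noise, which is why the sketch treats anything inside the threshold as "no clear change".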

### A Concrete Example from Feature 0
In Feature 0 we saw that the baseline RAG sometimes:
- Described the **Lara Pallet** as if it exists (it doesn't)
- Cited the **outdated 2021 GWP figure** even though a newer, verified EPD supersedes it
- Reported the tesa **68% CO₂ reduction** without flagging it as unverified

These are not edge cases; they are exactly the queries that matter for compliance.
How often does this happen? You need an answer after every change to the pipeline.

### The Four Questions Evaluation Answers
| Question | Why it matters |
|---|---|
| **Is the retriever finding the right chunks?** | A perfect LLM cannot fix wrong retrieval |
| **Is the LLM hallucinating?** | A fabricated GWP figure could be passed on to clients |
| **Is the answer complete?** | Omitting a caveat such as "this is unverified" can mislead a user |
| **Did my change help?** | Without a baseline metric you cannot tell improvement from regression |

---

### The RAG Pipeline

Each arrow is a potential failure point. Evaluation targets a specific stage so you can isolate *where* the problem is.

```
Ingestion (run once)

  Documents  ──►  [1] Chunker  ──►  [2] Embedder  ──►  [3] Vector DB


Querying (every user question)

  User query  ──►  [2] Embedder  ──►  [3] Retriever  ──►  Top-k Chunks
                                                                 │
                                                          [4] LLM + Prompt
                                                                 │
                                                          Answer + Sources
```

| Step | What it does | If it fails |
|---|---|---|
| [1] Chunking | Split documents into searchable units | Context split mid-fact; tables broken; information lost |
| [2] Embedding | Convert text to vectors | Relevant chunks get low similarity scores, so the wrong chunks are returned even for a well-phrased query |
| [3] Vector search | Find most similar chunks to query | Relevant chunks not returned |
| [4] Generation | LLM answers using retrieved context | Hallucination; ignores context; incomplete answer |

---

### Stage-by-Stage Evaluation Map

| Stage | What to measure |
|---|---|
| **Ingestion / Parsing** | Text completeness; table structure preserved; reading order correct in multi-column layouts |
| **Chunking** | Chunk size distribution; percentage of chunks exceeding the embedding model's token limit |
| **Embedding** | Similarity gap between a relevant and an irrelevant chunk for the same query |
| **Vector search** | Fraction of queries where the correct chunk appears in the top-k results |
| **Retrieved context** | Relevance of the retrieved chunks to the query |
| **Faithfulness** | Fraction of answer claims that are directly supported by the retrieved context |
| **Answer relevance** | Whether the answer addresses the actual question, not a related but different one |
| **Answer correctness** | Factual accuracy of the answer compared to the known ground truth |
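
Two of the retrieval-side measurements in the table can be sketched in plain Python: the top-k hit rate for vector search and the similarity gap for embeddings. The chunk IDs, vectors, and helper names below are made up for illustration; in practice the IDs would come from the vector store and the vectors from the embedding model.

```python
import math


def hit_rate_at_k(retrieved: list[list[str]], expected: list[str], k: int) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results."""
    hits = sum(1 for ranked, want in zip(retrieved, expected) if want in ranked[:k])
    return hits / len(expected)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_gap(query: list[float], relevant: list[float], irrelevant: list[float]) -> float:
    """How much closer the relevant chunk is to the query than the irrelevant one."""
    return cosine(query, relevant) - cosine(query, irrelevant)


# Hypothetical ranked results for three queries and the chunk each one should find.
retrieved_ids = [["c1", "c7", "c3"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
expected_ids = ["c3", "c4", "c0"]

hit_rate = hit_rate_at_k(retrieved_ids, expected_ids, k=3)
gap = similarity_gap([1.0, 0.0], [0.8, 0.6], [0.0, 1.0])
print(f"hit rate@3 = {hit_rate:.2f}, similarity gap = {gap:.2f}")
```

Both numbers need a small labelled set: for each test query, the ID of the chunk that should be retrieved.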


---

## RAGAS

[RAGAS](https://docs.ragas.io) (*Retrieval Augmented Generation Assessment*) is a widely adopted open-source Python library for evaluating RAG pipelines.

#### How it works internally
Rather than asking a judge LLM to "rate this answer 0–5", RAGAS scores faithfulness by decomposing the answer into individual atomic claims and checking each one against the retrieved context:

```
Answer: "The Logypal 1 GWP is 3.2 kg CO₂e, verified by Bureau Veritas."

  Claim 1: "GWP is 3.2 kg CO₂e"          → supported by context?  ✓
  Claim 2: "verified by Bureau Veritas"  → supported by context?  ✓

  Faithfulness = 2 supported / 2 total = 1.0
```

This is more rigorous than a holistic score: it catches partial hallucination, e.g. a correct figure with a fabricated verifier name.
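
The arithmetic behind the claim-level score can be sketched as follows. In RAGAS the per-claim verdicts come from the judge LLM; here they are hard-coded, and the `faithfulness_score` helper is a hypothetical name used only for this illustration.

```python
import math


def faithfulness_score(verdicts: list[bool]) -> float:
    """Supported claims divided by total claims; NaN when there are no claims."""
    if not verdicts:
        return math.nan
    return sum(verdicts) / len(verdicts)


# Verdicts for the example answer above: both claims are supported by the context.
fully_supported = {"GWP is 3.2 kg CO₂e": True, "verified by Bureau Veritas": True}

# Illustrative partial hallucination: correct figure, fabricated verifier name.
partial_hallucination = {"GWP is 3.2 kg CO₂e": True, "verified by Bureau Veritas": False}

score_ok = faithfulness_score(list(fully_supported.values()))
score_partial = faithfulness_score(list(partial_hallucination.values()))
print(f"fully supported: {score_ok:.1f}, partial hallucination: {score_partial:.1f}")
```

A holistic judge might still rate the second answer highly because the headline figure is right. The claim-level score drops to 0.5 and makes the fabricated verifier visible.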

#### Metrics at a glance
| Metric | Ground truth? | What it catches |
|---|---|---|
| `Faithfulness` | No | Claims not supported by the retrieved context |
| `AnswerRelevancy` | No | Off-topic or evasive answers |
| `AnswerCorrectness` | Yes | Wrong or missing facts vs. the reference answer |
| `ContextPrecision` | Yes | Irrelevant chunks ranked above relevant ones |

#### Strengths
- Standardised, reproducible metrics widely used in industry
- `Faithfulness` and `AnswerRelevancy` require zero labelling effort
- Claim-level decomposition is more rigorous than holistic scoring

#### Weaknesses
- Needs a capable judge LLM; RAGAS defaults to OpenAI (requires `OPENAI_API_KEY`)
- LLM judge has its own biases; may be lenient on confident-sounding hallucinations
- Slow and costly at scale: ~3 LLM calls per sample per metric
- Metrics are proxies, not ground truth: a score of 0.9 does not mean that 90% of answers are correct
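
To get a feel for the cost, here is a back-of-the-envelope estimate using the rough figure of 3 judge calls per sample per metric. The real count varies by metric and RAGAS version, and `estimate_judge_calls` is a hypothetical helper, not part of any library.

```python
def estimate_judge_calls(num_samples: int, num_metrics: int, calls_per_metric: int = 3) -> int:
    """Rough number of judge-LLM calls for one evaluation run."""
    return num_samples * num_metrics * calls_per_metric


# A small run like the one in this notebook: 5 samples, 2 metrics.
calls_small = estimate_judge_calls(num_samples=5, num_metrics=2)

# A larger regression suite (illustrative sizes): 500 samples, 4 metrics.
calls_large = estimate_judge_calls(num_samples=500, num_metrics=4)

print(f"small run: ~{calls_small} calls, large suite: ~{calls_large} calls")
```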

---

### First Look at RAGAS

#### Setup

**Prerequisites:** `conversational-toolkit` and `backend` installed in editable mode. The vector store must already exist; run `feature0_baseline_rag.ipynb` Steps 1–2 first.

RAGAS uses OpenAI as its judge LLM by default, so `OPENAI_API_KEY` must be set.


In [None]:
import os
import pathlib
import warnings

from langchain_openai import ChatOpenAI, OpenAIEmbeddings as LangChainOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper  # type: ignore[import-untyped]
from ragas.metrics import (  # type: ignore[attr-defined]
    Faithfulness as RagasFaithfulness,
    AnswerRelevancy as RagasAnswerRelevancy,
)

from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.evaluation import Evaluator
from conversational_toolkit.evaluation.adapters import evaluate_with_ragas
from conversational_toolkit.vectorstores.chromadb import ChromaDBVectorStore

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    EMBEDDING_MODEL,
    VS_PATH,
    SYSTEM_PROMPT,
    build_llm,
    build_agent,
)

warnings.filterwarnings("ignore", category=DeprecationWarning)

_secret_path = pathlib.Path("/secrets/OPENAI_API_KEY")
if "OPENAI_API_KEY" not in os.environ and _secret_path.exists():
    os.environ["OPENAI_API_KEY"] = _secret_path.read_text().strip()

RETRIEVER_TOP_K = 5
BACKEND = "openai"  # "ollama"  or  "openai"
# Note: RAGAS uses OpenAI for its judge LLM regardless of BACKEND above.

if BACKEND not in ("ollama", "openai"):
    raise ValueError('Set BACKEND to "ollama" or "openai" before running.')

# RAG pipeline
embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
vs = ChromaDBVectorStore(db_path=str(VS_PATH))
llm = build_llm(backend=BACKEND)
agent = build_agent(
    vector_store=vs,
    embedding_model=embedding_model,
    llm=llm,
    top_k=RETRIEVER_TOP_K,
    system_prompt=SYSTEM_PROMPT,
    number_query_expansion=0,
)

# AnswerRelevancy internally calls embed_query() / embed_documents() to compare
# generated questions against the original query. langchain_openai.OpenAIEmbeddings
# implements this interface and is accepted directly by ragas.evaluate().
ragas_embeddings = LangChainOpenAIEmbeddings(model="text-embedding-3-small")

# RAGAS defaults to max_tokens=3072 for its judge LLM. Long answers with many
# atomic claims overflow this limit mid-JSON, causing "output is incomplete" errors.
# Wrap ChatOpenAI with a higher limit and pass it explicitly to evaluate_with_ragas().
ragas_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", max_completion_tokens=8192)
)

print(f"Embedding model : {EMBEDDING_MODEL}")
print(f"Vector store    : {VS_PATH}")
print(f"RAG agent LLM   : {BACKEND}")
print("RAGAS judge LLM : gpt-4o-mini (OpenAI)")
print("Setup complete.")

2026-02-24 00:53:33.228 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:__init__:57 - Sentence Transformer embeddings model loaded: sentence-transformers/all-MiniLM-L6-v2 with kwargs: {}
2026-02-24 00:53:33.240 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:build_llm:135 - LLM backend: OpenAI (gpt-4o-mini)
2026-02-24 00:53:33.263 | DEBUG    | conversational_toolkit.llms.openai:__init__:63 - OpenAI LLM loaded: gpt-4o-mini; temperature: 0.3; seed: 42; tools: None; tool_choice: None; response_format: {'type': 'text'}
2026-02-24 00:53:33.263 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:build_agent:332 - RAG agent ready (top_k=5  query_expansion=0)


Embedding model : sentence-transformers/all-MiniLM-L6-v2
Vector store    : /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/backend/data_vs.db
RAG agent LLM   : openai
RAGAS judge LLM : gpt-4o-mini (OpenAI)
Setup complete.


We start with the two metrics that need no ground truth:

- **Faithfulness**: are all claims in the answer supported by the retrieved context?
- **AnswerRelevancy**: does the answer directly address the question?

The `evaluate_with_ragas()` adapter converts our `EvaluationSample` objects to RAGAS format, calls the judge LLM, and returns an `EvaluationReport`.

*Takes ~2–3 minutes: RAGAS makes multiple judge LLM calls per sample.*
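
To make the conversion step concrete, here is a rough sketch of mapping one sample to a RAGAS-style row. It is not the toolkit's actual adapter code: `SampleSketch` and its `retrieved_chunks` and `ground_truth` fields are hypothetical stand-ins for `EvaluationSample`, and the keys `user_input`, `response`, `retrieved_contexts`, and `reference` follow the single-turn sample schema used by recent RAGAS versions as far as we know.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleSketch:
    """Hypothetical stand-in for the toolkit's EvaluationSample."""

    query: str
    answer: str
    retrieved_chunks: list[str] = field(default_factory=list)
    ground_truth: Optional[str] = None


def to_ragas_row(sample: SampleSketch) -> dict[str, object]:
    """Map one sample to the field names RAGAS expects for single-turn evaluation."""
    row: dict[str, object] = {
        "user_input": sample.query,
        "response": sample.answer,
        "retrieved_contexts": list(sample.retrieved_chunks),
    }
    # Only reference-based metrics (e.g. AnswerCorrectness) need a ground truth.
    if sample.ground_truth is not None:
        row["reference"] = sample.ground_truth
    return row


example = SampleSketch(
    query="Does PrimePack AG offer a product called the Lara Pallet?",
    answer="No, it is not listed in the portfolio.",  # illustrative answer text
    retrieved_chunks=["(retrieved chunk text would go here)"],
)
row = to_ragas_row(example)
print(sorted(row))
```

Because `Faithfulness` and `AnswerRelevancy` need no ground truth, the `reference` key can be omitted for the run below.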

In [None]:
queries = [
    "Does PrimePack AG offer a product called the Lara Pallet?",
    "Which products in the portfolio have a third-party verified EPD?",
    "Can the 68% CO2 reduction claim for tesapack ECO (product 50-102) be included in a customer sustainability response?",
    "Are any tape products confirmed to be PFAS-free?",
    "Which suppliers are not yet compliant with the EPD requirement by end of 2025?",
]

print(
    f"Building {len(queries)} evaluation samples (runs the RAG agent once per query)..."
)
samples = await Evaluator.build_samples_from_agent(agent=agent, queries=queries)
print(f"Done. {len(samples)} samples built.\n")


print("Running RAGAS: Faithfulness + AnswerRelevancy (~2-3 min)\n")

report_basic = evaluate_with_ragas(
    samples=samples,
    metrics=[
        RagasFaithfulness(),  # type: ignore[call-arg]
        RagasAnswerRelevancy(strictness=1),  # type: ignore[call-arg]
    ],
    llm=ragas_llm,
    embeddings=ragas_embeddings,
)

print("─" * 40)
print(f"Samples evaluated : {report_basic.num_samples}")
print("─" * 40)
for metric_name, score in report_basic.summary().items():
    print(f"{metric_name:<22}  {score:.3f}")
print("─" * 40)

Building 5 evaluation samples (runs the RAG agent once per query)...


2026-02-23 23:55:28.748 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-23 23:55:29.488 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-23 23:55:32.852 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-23 23:55:36.431 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-23 23:55:39.523 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


Done. 5 samples built.

Running RAGAS: Faithfulness + AnswerRelevancy (~2-3 min)




────────────────────────────────────────
Samples evaluated : 5
────────────────────────────────────────
faithfulness            0.982
answer_relevancy        0.380
────────────────────────────────────────


In [15]:
# Per-sample breakdown: find which queries score best / worst so we know where to focus improvement efforts.

import math

f_result = next(
    (r for r in report_basic.results if "faithfulness" in r.metric_name.lower()), None
)
a_result = next(
    (
        r
        for r in report_basic.results
        if "relevancy" in r.metric_name.lower() or "relevance" in r.metric_name.lower()
    ),
    None,
)

f_scores: list[float] = (f_result.per_sample_scores if f_result else None) or []
a_scores: list[float] = (a_result.per_sample_scores if a_result else None) or []


def fmt(v: float) -> str:
    return "  N/A" if math.isnan(v) else f"{v:>5.2f}"


print("Per-sample scores  (F = Faithfulness,  A = AnswerRelevancy)\n")
print(f"{'#':<3} {'F':>4} {'A':>4}      {'query':<40}  response")
print("─" * 110)
for i, (sample, f, a) in enumerate(zip(samples, f_scores, a_scores), 1):
    flag = " ◄" if (not math.isnan(f) and f < 0.7) else "  "
    q = sample.query[:40] + ".." if len(sample.query) > 40 else sample.query
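    # Truncate the answer at the same 40-character threshold used for the query,
    # and guard against a missing answer before slicing.
    answer = sample.answer or ""
    r = answer[:40] + ".." if len(answer) > 40 else answer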
    print(f"{i:<3} {fmt(f)} {fmt(a)}{flag}  {q:<40}  {r}")

Per-sample scores  (F = Faithfulness,  A = AnswerRelevancy)

#      F    A      query                                     response
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1    1.00  0.99    Does PrimePack AG offer a product called..  No, PrimePack AG does not offer a produc..
2    1.00  0.00    Which products in the portfolio have a t..  Based on the information provided, the p..
3    1.00  0.00    Can the 68% CO2 reduction claim for tesa..  Based on the information provided, the 6..
4    1.00  0.00    Are any tape products confirmed to be PF..  As of now, there are no confirmed tape p..
5    0.91  0.91    Which suppliers are not yet compliant wi..  As of January 2025, the following suppli..
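
Three samples score exactly 0.00 on AnswerRelevancy despite perfect faithfulness. One plausible explanation, worth verifying against the RAGAS source for your installed version: the judge generates questions from the answer and compares them to the original query by cosine similarity, and as far as we understand the implementation, the score is forced to zero whenever the judge flags the answer as noncommittal or evasive. The sketch below mimics that logic with made-up vectors; `answer_relevancy_sketch` is a hypothetical helper, not the RAGAS API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_sketch(
    question_vec: list[float],
    generated_question_vecs: list[list[float]],
    noncommittal: bool,
) -> float:
    """Mean similarity between the query and questions generated from the answer.

    Assumption: a noncommittal verdict from the judge zeroes the score.
    """
    if noncommittal:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Made-up 2-D embeddings: the generated question is close to the original query.
query_vec = [1.0, 0.0]
generated = [[0.8, 0.6]]

score_direct = answer_relevancy_sketch(query_vec, generated, noncommittal=False)
score_hedged = answer_relevancy_sketch(query_vec, generated, noncommittal=True)
print(f"direct answer: {score_direct:.2f}, flagged noncommittal: {score_hedged:.2f}")
```

If this is what happened, the zeros say more about hedged phrasing such as "Based on the information provided, ..." than about off-topic answers. Reading the full responses for samples 2 to 4 is the quickest way to check.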


## Brainstorming & Tasks
