# The Baseline RAG Pipeline

**RAG Prototyping Workshop**

---

## What You Will Learn

This notebook is the starting point for the workshop. It introduces the **key concepts** behind Retrieval-Augmented Generation (RAG) and walks through every step of the **baseline pipeline** that the later phases build upon.

After working through this notebook you will be able to:
- Explain why a standalone LLM is insufficient for grounded enterprise Q&A
- Describe the five stages of a RAG pipeline (chunk -> embed -> store -> retrieve -> generate)
- Run the full baseline pipeline against the PrimePack AG corpus
- Use the retrieval inspection step as the primary debugging tool
- Identify the three main failure modes this workshop addresses

**Workshop Phases at a Glance**
| Notebook | Focus |
|---|---|
| **Baseline (this notebook)** | Key concepts + end-to-end baseline |
| Feature Track 1 | Chunking strategies & document ingestion |
| Feature Track 2 | Evaluation metrics (retrieval + generation) |
| Feature Track 3 | Reliable & structured outputs |
| Feature Track 4 | Advanced retrieval |
| Feature Track 5 | Multi-step agent workflows |

---

## 1. Why RAG? The Problem with a Standalone LLM

### The Scenario
**PrimePack AG** buys packaging materials (pallets, cardboard boxes, tape) from multiple suppliers. Sustainability claims are increasingly scrutinised by customers and regulators. Employees need to answer questions like:
> *"What is the GWP of the Logypal 1 pallet, and is the figure verified?"*  
> *"Can we tell a customer that the tesa tape is PFAS-free?"*  
> *"Which of our suppliers have a certified EPD?"*

### Why Not Just Ask ChatGPT?
A general-purpose LLM has three fundamental problems for this task:

| Problem | Why It Matters |
|---|---|
| **No product knowledge** | LLMs know nothing about Logypal 1, PrimePack's specific portfolio, or the individual supplier documents. |
| **Hallucination** | When asked about unknown products the LLM invents plausible-sounding but false figures. |
| **No evidence trail** | Even when correct, a raw LLM answer cannot be traced back to a source document. |

### The RAG Solution
RAG adds a **retrieval step** between the user's question and the LLM:

```
 Documents ──► Chunker ──► Embedder ──► Vector DB
                                              │
 User query ─────────────────► Embedder ─────►  Retriever ──► Top-k Chunks
                                                                      │
                                                               LLM + Prompt
                                                                      │
                                                               Answer + Sources
```

The LLM only sees documents that are **actually in the corpus**. The answer can be traced to specific source chunks. If the corpus does not contain the answer, the LLM is instructed to say so.

### What RAG Does *Not* Fix
RAG shifts the problem from hallucination to **retrieval quality**. If the right chunk is not retrieved, the answer will still be wrong (or absent). The later phases of this workshop address exactly this: better chunking, better retrieval, and better output structure.

---

## 2. Core Concepts

### Chunks

A **chunk** is a short excerpt from a source document: a section of a PDF, one sheet of a spreadsheet, or one heading-delimited section of a Markdown file. Chunks are the unit of indexing and retrieval.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str           # unique identifier
    title: str        # e.g. section heading
    content: str      # the text that gets embedded
    metadata: dict    # source_file, page, ...
```

### Embeddings
An **embedding** converts text to a dense numeric vector (e.g. 384 dimensions). Semantically similar texts produce similar vectors. Here we use `all-MiniLM-L6-v2`, a compact local model that runs without an API key.

### Vector Store (ChromaDB)
A **vector store** persists chunk embeddings on disk and supports approximate nearest-neighbour search. Given a query embedding, it returns the `top_k` most similar chunks in milliseconds.
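
To make the idea concrete, here is a minimal in-memory sketch of what a vector store does. It uses exact (brute-force) search with NumPy rather than ChromaDB's approximate index, and it keeps nothing on disk. The class `TinyVectorStore` and the toy 3-dimensional vectors are illustrative assumptions, not part of the workshop code.

```python
import numpy as np


class TinyVectorStore:
    """Hypothetical in-memory store with exact nearest-neighbour search."""

    def __init__(self) -> None:
        self.ids: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self.ids.append(chunk_id)
        self.vectors.append(np.asarray(vector, dtype=float))

    def query(self, vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        # Squared L2 distance from the query to every stored vector.
        matrix = np.stack(self.vectors)
        dists = np.sum((matrix - np.asarray(vector, dtype=float)) ** 2, axis=1)
        # Smallest distance first, truncated to top_k.
        order = np.argsort(dists)[:top_k]
        return [(self.ids[i], float(dists[i])) for i in order]


store = TinyVectorStore()
store.add("pallet", [1.0, 0.0, 0.0])
store.add("tape", [0.0, 1.0, 0.0])
store.add("box", [0.0, 0.0, 1.0])

hits = store.query([0.9, 0.1, 0.0], top_k=2)
print(hits)
```

A real vector store replaces the linear scan with an index so that search stays fast as the corpus grows.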

### Retriever
A **retriever** wraps a vector store and exposes a single `retrieve(query)` method. The baseline uses a `VectorStoreRetriever` with `top_k=5`.

### RAG Agent
The **RAG agent** combines a retriever and an LLM. Its `answer()` method:
1. Embeds the query
2. Retrieves the top-k chunks
3. Formats chunks as XML `<source>` tags in the prompt
4. Calls the LLM and returns the answer + cited sources
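
Step 3 can be sketched as follows. The exact tag layout used by the toolkit is not shown in this notebook, so the function names `format_sources` and `build_prompt`, the `id` and `file` attributes, and the example chunk are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def format_sources(chunks: list[dict]) -> str:
    """Render retrieved chunks as <source> tags (hypothetical layout)."""
    blocks = []
    for chunk in chunks:
        blocks.append(
            f"<source id={quoteattr(chunk['id'])} file={quoteattr(chunk['source_file'])}>\n"
            f"{escape(chunk['content'])}\n"  # escape &, <, > so the XML stays well-formed
            "</source>"
        )
    return "<sources>\n" + "\n".join(blocks) + "\n</sources>"


def build_prompt(question: str, chunks: list[dict]) -> str:
    # Sources first, then the user question.
    return f"{format_sources(chunks)}\n\nQuestion: {question}"


# Made-up chunk, for illustration only.
example_chunks = [
    {"id": "c1", "source_file": "catalog.md", "content": "Pallets & boxes < 25 kg"},
]
prompt = build_prompt("Which products are listed?", example_chunks)
print(prompt)
```

Giving each source an identifier lets the model cite it, and lets the application map a citation back to the chunk's metadata.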

---

## 3. Setup

**Prerequisites:**
- `conversational-toolkit` installed in editable mode (`pip install -e conversational-toolkit/`) (already done on Renku)
- `backend` installed in editable mode (`pip install -e backend/`) (already done on Renku)
- For the **Ollama** backend (default): `ollama serve` running + `ollama pull mistral-nemo:12b`
- For the **OpenAI** and **Qwen** backends: set the required API keys (see `Renku_README.md`).

In [None]:
from pathlib import Path


from conversational_toolkit.agents.base import QueryWithContext
from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.retriever.vectorstore_retriever import VectorStoreRetriever

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    load_chunks,
    inspect_chunks,
    build_vector_store,
    inspect_retrieval,
    build_agent,
    build_llm,
    ask,
    DATA_DIR,
    VS_PATH,
    EMBEDDING_MODEL,
    RETRIEVER_TOP_K,
)

# ── Choose your LLM backend ─────────────────────────────────────────────────
# Set BACKEND to one of:
#   "ollama"  — local model, requires running `ollama serve` (see Renku_README.md)
#   "openai"  — cloud model, requires an OpenAI API key   (see Renku_README.md)
#   "qwen"    — SDSC cloud model, requires an SDSC token  (see Renku_README.md)
BACKEND = "qwen"  # set this before running

if not BACKEND:
    raise ValueError(
        'BACKEND is not set. Edit the line above and set it to "ollama", "openai", or "qwen".\n'
        "See Renku_README.md for setup instructions."
    )

ROOT = Path().resolve().parents[1]  # backend/notebooks/ → project root
print(f"Project root : {ROOT}")
print(f"Data dir     : {DATA_DIR}")
print(f"Vector store : {VS_PATH}")
print(f"LLM backend  : {BACKEND}")

Consider using the pymupdf_layout package for a greatly improved page layout analysis.
Project root : /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag
Data dir     : /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/data
Vector store : /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/backend/data_vs.db
LLM backend  : qwen


---

## Step 1: Load and Chunk Documents

The `load_chunks()` function walks `data/` and dispatches each file to the right chunker:

| Extension | Chunker | Strategy |
|---|---|---|
| `.pdf` | `PDFChunker` | Convert to Markdown via `pymupdf4llm`, split on `#` headings |
| `.xlsx`, `.xls` | `ExcelChunker` | One chunk per sheet, serialised as a Markdown table |
| `.md`, `.txt` | `MarkdownChunker` | Split on `#` headings |

The result is a flat `list[Chunk]`, the same structure regardless of the original format.
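
The heading-based strategy can be sketched in a few lines. This is a simplified illustration, not the toolkit's `MarkdownChunker`: the function `split_on_headings` is hypothetical, returns plain dicts instead of `Chunk` objects, and does not handle `#` characters inside fenced code blocks.

```python
import re


def split_on_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading (illustrative only)."""
    chunks: list[dict] = []
    title, lines = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # A new heading closes the chunk collected so far.
            if title or any(ln.strip() for ln in lines):
                chunks.append({"title": title, "content": "\n".join(lines).strip()})
            title, lines = line.strip(), [line]
        else:
            lines.append(line)
    # Flush the last chunk.
    if title or any(ln.strip() for ln in lines):
        chunks.append({"title": title, "content": "\n".join(lines).strip()})
    return chunks


doc = "# Catalog\n\nIntro text.\n\n## Pallets\n\nLogypal 1\n\n## Tape\n\nECO tape\n"
sections = split_on_headings(doc)
for s in sections:
    print(repr(s["title"]), len(s["content"]))
```

Because chunk boundaries follow headings, chunk length depends on how the author structured the document, which is why the size distribution printed in the next cell is worth checking.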

You can use `max_files=5` here for speed. Remove the limit (or set `None`) to load the full corpus.

> **Feature Track 1** explores alternative chunking strategies in depth.

In [None]:
chunks = load_chunks(max_files=None)
inspect_chunks(chunks)

# Quick size distribution
char_lengths = [len(c.content) for c in chunks]
over_limit = sum(1 for n in char_lengths if n > 1024)
print(f"\nChunks total       : {len(chunks)}")
print(f"Mean length (chars): {sum(char_lengths) // len(char_lengths)}")
print(f"Over 1024-char limit (≈256 tok embedding limit): {over_limit} / {len(chunks)}")

2026-02-21 15:30:13.369 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:196 - Chunking 35 files from /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/data
2026-02-21 15:30:13.371 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:205 -   ART_internal_procurement_policy.md: 12 chunks
2026-02-21 15:30:13.372 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:205 -   ART_logylight_incomplete_datasheet.md: 7 chunks
2026-02-21 15:30:13.373 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:205 -   ART_product_catalog.md: 7 chunks
2026-02-21 15:30:13.373 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:205 -   ART_relicyc_logypal1_old_datasheet_2021.md: 7 chunks
2026-02-21 15:30:13.374 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:205 -   ART_response_inquiry_frische_felder.md: 6 chunks
2026-02-21 15:30:13.374 | DEBUG    | 


### What a Chunk Looks Like

Each chunk carries a `title` (the heading), the raw text `content`, and a `metadata` dict
with the source file name. This metadata is returned alongside the answer so the user can
trace every claim back to its origin document.

In [None]:
# Print 3 representative chunks
for c in chunks[:3]:
    print(f"--- [{c.metadata.get('source_file', '?')}] ---")
    print(f"Title  : {c.title!r}")
    print(f"Length : {len(c.content)} chars")
    print(f"Preview: {c.content[:250].strip()!r}")
    print()

--- [ART_customer_inquiry_frische_felder.md] ---
Title  : '# Customer Inquiry — Sustainability Information Request'
Length : 320 chars
Preview: '# Customer Inquiry — Sustainability Information Request\n\n*Document type: Customer communication (incoming + internal draft response)* *Date received: 14 January 2025* *Customer: Frische Felder AG, Procurement Department* *Contact: Ms. Sabine Keller —'

--- [ART_customer_inquiry_frische_felder.md] ---
Title  : '## Incoming Customer Email'
Length : 1662 chars
Preview: '## Incoming Customer Email\n\n**Subject:** Request for Sustainability Documentation — Tape and Cardboard Products\n\nDear PrimePack AG Team,\n\nThank you for our ongoing partnership. As part of our updated supplier due diligence process in line with the EU'

--- [ART_customer_inquiry_frische_felder.md] ---
Title  : '## Internal Draft Response'
Length : 374 chars
Preview: '## Internal Draft Response\n\n*Status: DRAFT — not yet sent. Requires review and completion before sending.*


---

## Step 2: Embed Chunks and Build the Vector Store

`SentenceTransformerEmbeddings` converts every chunk's `content` to a 384-dimensional vector using `all-MiniLM-L6-v2`. The resulting matrix (shape `[n_chunks, 384]`) is inserted into a persistent `ChromaDBVectorStore`.

**On subsequent runs**, leave `reset=False` (the default) to skip re-embedding: it takes time, and the store on disk is already correct. Pass `reset=True` only when the corpus or chunking strategy changes.

> **Why 384 dimensions?** `all-MiniLM-L6-v2` is a distilled model: small enough to run on CPU in seconds but good enough for retrieval on short technical texts. OpenAI's `text-embedding-3-small` produces 1536-dimensional vectors, generally with higher quality, at the cost of sending every chunk to an external API.

In [None]:
embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
print(f"Embedding model: {EMBEDDING_MODEL}")

# Set reset=True to rebuild the store from scratch
vector_store = await build_vector_store(
    chunks, embedding_model, db_path=VS_PATH, reset=False
)
print("Vector store ready.")

2026-02-21 15:28:49.115 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:__init__:57 - Sentence Transformer embeddings model loaded: sentence-transformers/all-MiniLM-L6-v2 with kwargs: {}
2026-02-21 15:28:49.121 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:build_vector_store:253 - Vector store already contains 78 chunks — skipping embedding.


Embedding model: sentence-transformers/all-MiniLM-L6-v2
Vector store ready.



### Similarity in Embedding Space

Embeddings that are close in vector space share semantic meaning. The cell below embeds several sentences and measures their cosine similarity: a value between -1 (opposite) and 1 (identical). You can change the sentences to see the impact on cosine similarity.

In [None]:
import numpy as np

sentence1 = "carbon footprint of a pallet"
sentence2 = "GWP value for the Logypal 1"
sentence3 = "PFAS-free tape declaration"
sentence4 = "the annual report of a software firm"


async def cosine_similarity(a: str, b: str) -> float:
    vecs = await embedding_model.get_embeddings([a, b])
    return float(
        np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
    )


pairs = [
    (sentence1, sentence2),
    (sentence1, sentence3),
    (sentence1, sentence4),
]

print("Cosine similarities:")
for a, b in pairs:
    sim = await cosine_similarity(a, b)
    print(f"{sim:.3f}  -->  {a!r}  vs  {b!r}")


---

## Step 3: Inspect Retrieval (Before the LLM Sees Anything)

This is the **most important diagnostic step** in the whole pipeline:

> If the retrieved chunks are wrong, the final answer will be wrong regardless of how good the LLM is.

`inspect_retrieval()` runs the query through the embedding model, fetches the top-k most similar chunks from ChromaDB, and prints them with scores. Use this to:
- Verify that relevant documents are in the index
- Tune `top_k`
- Compare different query phrasings
- Identify retrieval gaps before blaming the LLM

The **similarity score** is an L2 distance: lower = more similar. ChromaDB uses L2 by default, and its `l2` space reports the *squared* Euclidean distance, which for unit-length vectors lies in the range [0, 4]. L2 is a safe general default because it is well defined for any vectors, normalised or not. Cosine similarity compares direction only and ignores magnitude, so the two metrics can rank results differently only when vector lengths vary. Since our embedding model produces unit-length vectors, the two are tied together by `squared L2 = 2 - 2 * cosine similarity`, and we get cosine-equivalent ranking. The score numbers look different, but the top-5 results would be identical either way.
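
The relationship between the two metrics can be checked numerically. The sketch below uses random unit-length vectors as stand-ins for real embeddings (an assumption made so the cell runs without the embedding model); it verifies that the squared L2 distance between unit vectors equals `2 - 2 * cosine` and that both metrics produce the same ranking.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors scaled to unit length, standing in for real embeddings.
vectors = rng.normal(size=(6, 384))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query, docs = vectors[0], vectors[1:]

cosine = docs @ query                             # cosine similarity (unit vectors)
squared_l2 = np.sum((docs - query) ** 2, axis=1)  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(squared_l2, 2 - 2 * cosine))

# Sorting by ascending distance equals sorting by descending similarity.
print(np.array_equal(np.argsort(squared_l2), np.argsort(-cosine)))
```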

In [None]:
QUERY = "What materials is the Logypal 1 pallet made from?"

results = await inspect_retrieval(QUERY, vector_store, embedding_model)


### Retrieval for a Product Outside the Portfolio

The PrimePack AG product catalog defines the portfolio boundary. The **Lara Pallet** is not in the catalog; it does not exist. Watch which chunks are returned and what scores they have. A **higher** minimum score (a larger L2 distance for the best match) signals a *weaker semantic match*.

In [None]:
QUERY_OOK = "What materials is the Lara pallet made from?"

results_ook = await inspect_retrieval(QUERY_OOK, vector_store, embedding_model)


> **Observation:** The retriever always returns the *closest* chunks it can find; it has no concept of "no match". For an unknown product the L2 distances are **higher** (the closest chunks are still about other pallets), but without a score-threshold guard the LLM receives those chunks anyway and may silently answer about the wrong product.
> **Phase 3** shows how to combat this issue.
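
One possible form of such a guard is a simple distance cut-off applied before the chunks reach the LLM. The sketch below is an assumption-laden illustration, not the approach taken later in the workshop: `ScoredChunk`, `filter_by_distance`, the threshold value, and all scores are made up.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    title: str
    score: float  # L2 distance: lower means more similar


def filter_by_distance(results: list[ScoredChunk], max_distance: float) -> list[ScoredChunk]:
    """Keep only chunks whose distance is at or below the threshold."""
    return [r for r in results if r.score <= max_distance]


# Made-up scores for illustration; calibrate the threshold on your own queries.
in_portfolio = [ScoredChunk("Logypal 1 materials", 0.62), ScoredChunk("Catalog", 0.95)]
out_of_portfolio = [ScoredChunk("Logypal 1 materials", 1.31), ScoredChunk("Catalog", 1.40)]

MAX_DISTANCE = 1.0  # hypothetical value
for name, results in [("known", in_portfolio), ("unknown", out_of_portfolio)]:
    kept = filter_by_distance(results, MAX_DISTANCE)
    print(name, "->", [r.title for r in kept] or "no sufficiently similar chunk")
```

A fixed threshold is brittle: a good value depends on the embedding model, the distance metric, and the corpus, so it should be chosen by inspecting scores for known in-portfolio and out-of-portfolio queries.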

---

## Step 4: Build the RAG Agent

`build_agent()` assembles the three components:

```
VectorStoreRetriever
    └─ ChromaDBVectorStore (on disk, persists across runs)
    └─ SentenceTransformerEmbeddings

RAG Agent
    ├─ LLM (Ollama / OpenAI / SDSC Qwen)
    ├─ Retriever
    └─ System prompt
```

### The System Prompt

The system prompt is a very powerful lever for controlling LLM behaviour:

```
You are a helpful AI assistant specialised in sustainability and product compliance. Answer questions using the provided sources. If the information is not in the sources, say so clearly.
```

The instruction *"If the information is not in the sources, say so clearly"* is meant to prevent hallucination about missing products and unverified claims.

In [None]:
llm = build_llm(backend=BACKEND)
agent = build_agent(
    vector_store=vector_store,
    embedding_model=embedding_model,
    llm=llm,
    top_k=RETRIEVER_TOP_K,
    number_query_expansion=0,  # 0 = no expansion; revisited in Feature Track 4
)
print("RAG agent assembled.")


---

## Step 5: Ask a Question

`ask()` sends the query to the agent and returns the answer string. The internal flow is:

1. Embed the query
2. Retrieve top-k chunks
3. Build the prompt: `<system>` + `<sources>` XML block + user question
4. Generate the answer with the LLM
5. Return the answer and a list of cited source chunks

In [None]:
QUERY = "What materials is the Logypal 1 pallet made from?"

answer = await ask(agent, QUERY)


---

## 4. Probing Failure Modes

The corpus was designed with several deliberate challenges. Run the queries below and observe the answers.

### 4a: Out-of-Portfolio Query

The **Lara Pallet** does not exist. A good RAG system should say so rather than describe a different pallet.

In [None]:
answer_ook = await ask(agent, "What materials is the Lara pallet made from?")


### 4b: Missing Data (LogyLight Pallet)

The LogyLight datasheet marks all LCA fields as *"not yet available"*. The correct answer is that we don't have the data, not a fabricated figure.

In [None]:
answer_gap = await ask(agent, "What is the GWP of the LogyLight pallet?")


### 4c: Conflicting Evidence (Relicyc GWP Figures)

The 2021 Relicyc datasheet reports **4.1 kg CO₂e** per pallet. The 2023 EPD (third-party verified) reports a different, more recent figure. The RAG should flag the conflict and prefer the verified, more recent source.

In [None]:
answer_conflict = await ask(
    agent, "What is the GWP of the Logypal 1 pallet, and how reliable is the figure?"
)


### 4d: Unverified Supplier Claim (Tesa ECO Tape)

The tesa supplier brochure claims **68% CO₂ reduction** compared to conventional tape. This is a self-declared marketing claim; there is no independent EPD. The RAG should report the claim but flag that it is unverified.

In [None]:
answer_claim = await ask(
    agent,
    "How much lower is the carbon footprint of tesa ECO tape compared to standard tape?",
)


> Can you think of and find **other failure modes**?

---

## 5. Multi-Turn Conversation

The `ask()` function accepts a `history` argument: a list of prior `LLMMessage` objects. When history is provided, the agent first **rewrites the query** to be self-contained (*"it"* becomes the actual product name) before retrieval.

This prevents the retriever from embedding vague pronouns that match nothing in the corpus.
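
The rewriting step is itself an LLM call. The toolkit's actual prompt is not shown in this notebook, so the following is only a sketch of what such a prompt might look like; `build_rewrite_prompt`, the prompt wording, and the placeholder assistant reply are assumptions.

```python
def build_rewrite_prompt(history: list[tuple[str, str]], query: str) -> str:
    """Build an LLM prompt asking for a self-contained version of `query`."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user question so that it can be understood without "
        "the conversation. Replace pronouns with the entities they refer to.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final question: {query}\n"
        "Rewritten question:"
    )


# Toy history as (role, text) pairs; the assistant reply is a placeholder.
toy_history = [
    ("user", "Which pallets in our portfolio have a third-party verified EPD?"),
    ("assistant", "<assistant reply from turn 1>"),
]
rewrite_prompt = build_rewrite_prompt(toy_history, "What is the GWP figure reported in it?")
print(rewrite_prompt)
```

The rewritten question, not the original follow-up, is what gets embedded and sent to the retriever.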

In [None]:
from conversational_toolkit.llms.base import LLMMessage, Roles

history: list[LLMMessage] = []


async def conversation_turn(query: str) -> str:
    global history
    answer = await agent.answer(QueryWithContext(query=query, history=history))
    history.append(LLMMessage(role=Roles.USER, content=query))
    history.append(LLMMessage(role=Roles.ASSISTANT, content=answer.content))
    return answer.content


# Turn 1: ask which pallets have a third-party verified EPD
reply1 = await conversation_turn(
    "Which pallets in our portfolio have a third-party verified EPD?"
)
print("User: Which pallets in our portfolio have a third-party verified EPD?")
print(f"Assistant: {reply1}\n")

# Turn 2: follow-up using a pronoun — the agent should resolve "it" before retrieval
reply2 = await conversation_turn(
    "What is the GWP figure reported in it for the Logypal 1?"
)
print("User: What is the GWP figure reported in it for the Logypal 1?")
print(f"Assistant: {reply2}")


---

## 6. Running the Full Pipeline in One Call

The `run_pipeline()` convenience function executes all five steps end-to-end. It is also
what the `__main__` entry point calls.

Use it for quick one-shot queries. Use the individual step functions above when you need
to inspect intermediate results or iterate on a specific stage.

In [None]:
from sme_kt_zh_collaboration_rag.feature0_baseline_rag import run_pipeline

answer = await run_pipeline(
    backend=BACKEND,
    query="What sustainability certifications do the pallets in the portfolio have?",
    reset_vs=False,
)
print(answer)


---

## 7. Switching LLM Backends

The pipeline abstracts the LLM behind a common interface. Only `build_llm()` needs to change.

| Backend | `BACKEND=` | Prerequisite |
|---|---|---|
| Ollama (local) | `"ollama"` | `ollama serve` + `ollama pull mistral-nemo:12b` |
| OpenAI | `"openai"` | `OPENAI_API_KEY` env variable |
| SDSC Qwen | `"qwen"` | `SDSC_QWEN3_32B_AWQ` env variable |

You can also override the model within a backend:

```python
llm = build_llm(backend="openai", model_name="gpt-4o") # stronger model
llm = build_llm(backend="ollama", model_name="llama3.2") # smaller local model
```

The RAG pipeline is **backend-agnostic**: the retrieval step is identical regardless of which LLM is used.
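
The idea of a common interface can be illustrated with a small structural type. This is a generic sketch, not the toolkit's actual LLM base class: `ChatModel`, `EchoModel`, `UpperModel`, and `answer_with` are hypothetical names, and the two toy "models" only transform the prompt string.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical minimal interface every backend would satisfy."""

    def generate(self, prompt: str) -> str: ...


class EchoModel:
    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class UpperModel:
    def generate(self, prompt: str) -> str:
        return prompt.upper()


def answer_with(question: str, retrieved: list[str], llm: ChatModel) -> str:
    # Retrieval output is identical for every backend; only `llm` varies.
    prompt = "\n".join(retrieved) + f"\n\nQuestion: {question}"
    return llm.generate(prompt)


retrieved = ["Source A", "Source B"]
for model in (EchoModel(), UpperModel()):
    print(answer_with("Which sources exist?", retrieved, model))
```

Any object with a matching `generate` method can be passed in, which is what makes swapping backends a one-line change.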

In [None]:
# Test openai

QUERY = "What materials is the Lara pallet made from?"

llm_openai = build_llm(backend="openai", model_name="gpt-4o-mini")
agent_openai = build_agent(vector_store, embedding_model, llm_openai)
answer_openai = await ask(agent_openai, QUERY)


---

## 8. Tasks

1. **Test set creation**: Go through the dataset and write questions together with the correct answers the RAG should give. Also include trick questions, such as asking for information that is not in the data or for which contradicting information exists.

2. **Retrieval inspection**: Call `inspect_retrieval()` with different queries and inspect which files are returned. What do the scores tell you about how well the corpus covers each topic?

3. **Top-k sensitivity**: Change `top_k` from 5 to 1. Do the answers to the questions change? What about 10? Is more always better?

4. **System prompt ablation**: In `feature0_baseline_rag.py`, locate `SYSTEM_PROMPT`. Change it, rebuild the agent, and re-run the Lara Pallet query. Does the answer change?

5. **Query phrasing**: The embedding model is sensitive to wording. Try `"CO₂ footprint Logypal 1"`, `"carbon emissions recycled pallet"`, and `"GWP A1-A3 EPD pallet"`. Do the top-1 chunk and score differ?

In [None]:
# Scratch cell — run your experiments here
async def quick_retrieve(query: str, top_k: int = 5):
    retriever = VectorStoreRetriever(embedding_model, vector_store, top_k=top_k)
    results = await retriever.retrieve(query)
    print(f"Query: {query!r}  (top_k={top_k})")
    for r in results:
        src = r.metadata.get("source_file", "?")
        print(f"  score={r.score:.4f}  {src}  {r.title!r}")


await quick_retrieve("PFAS-free tape declaration")


---

## Summary

| Step | Function | 
|---|---|
| 1. Load & chunk | `load_chunks(max_files)` |
| 2. Embed & index | `build_vector_store(chunks, emb, reset)` |
| 3. Inspect retrieval | `inspect_retrieval(query, vs, emb)` | 
| 4. Build agent | `build_agent(vs, emb, llm, top_k)` |
| 5. Generate answer | `ask(agent, query, history)` |

### Three Core Failure Modes (Addressed in Later Feature Tracks)
- Wrong entity
- Missing data presented as fact
- Low recall

**Next: Feature Track 1.** Explore chunking strategies and understand how chunk size affects retrieval quality.
