# The Baseline RAG Pipeline

---

This notebook is your **starting point**. The pipeline is already built and working: run it, explore its outputs, question it, and find its limits.

**How to use this notebook**

| | |
|---|---|
| 📖 **Read** | The explanations use plain language, no coding background needed |
| ▶️ **Run** | Execute cells top to bottom with Shift+Enter to see the pipeline in action |
| 💬 **Discuss** | Talk about the outputs with your peers, do they make sense? Would you trust them? |
| 🔧 **Experiment** | Modify queries, tweak parameters, break things on purpose |
| 🚀 **Extend** | The Tasks section points to what you can take further |

> You don't need to understand every line of code. Focus on what the system gets right and wrong and on thinking about how this would apply in your own context.

---

## What This Notebook Covers

**Retrieval-Augmented Generation (RAG)** combines an AI assistant with a search capability across a corpus of documents. Instead of the AI making things up from memory, it first searches your documents and then answers based on what it finds. The answer can always be traced back to a source.

```
Your question  ->  Search your documents  ->  AI answers using only those documents
```

| Notebook | Focus |
|---|---|
| **Baseline (this notebook)** | Working baseline prototype |
| Feature Track 1 | How to measure answer quality |
| Feature Track 2 | Reliable, structured outputs |
| Feature Track 3 | Better retrieval strategies |
| Feature Track 4 | Multi-step agent workflows |

---

## Why RAG? The Problem with a Standalone LLM

### The Scenario
**PrimePack AG** buys packaging materials (pallets, cardboard boxes, tape) from multiple suppliers. Sustainability claims are increasingly scrutinised by customers and regulators. Employees need to answer questions like:
> *"What is the GWP of the Logypal 1 pallet, and is the figure verified?"*  
> *"Can we tell a customer that the tesa tape is PFAS-free?"*  
> *"Which of our suppliers have a certified EPD?"*

### Why Not Just Ask ChatGPT?
A general-purpose LLM has three fundamental problems for this task:

| Problem | Why It Matters |
|---|---|
| **No internal knowledge** | LLMs don't know about internal company documents. |
| **Hallucination** | When asked about unknown products the LLM invents plausible-sounding but false figures. |
| **No evidence trail** | Even when correct, a raw LLM answer cannot be traced back to a source document. |

### The RAG Solution
RAG adds a **retrieval step** between the user's question and the LLM:

```
 Documents ──► Chunker ──► Embedder ──► Vector DB
                                            │
 User query ─────────────► Embedder ──► Retriever ──► Top-k Chunks
                                                           │
                                                      LLM + Prompt
                                                           │
                                                    Answer + Sources
```

The LLM only sees documents that are **actually in the corpus**. The answer can be traced to specific source chunks. If the corpus does not contain the answer, the LLM is instructed to say so.

### What RAG Does *Not* Fix
RAG shifts the problem from hallucination to **retrieval quality**. If the right chunk is not retrieved, the answer will still be wrong (or absent). The later feature tracks address exactly this: better chunking, better retrieval, and better output structure.

---

## Core Concepts

### Chunks

A **chunk** is a short excerpt from a source document: a section of a PDF, one sheet of a spreadsheet, or one heading-delimited paragraph of a Markdown file. Chunks are the unit of indexing and retrieval.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str           # unique identifier
    title: str        # e.g. section heading
    content: str      # the text that gets embedded
    metadata: dict    # source_file, page, ...
```

### Embeddings
An **embedding** is a dense numeric vector (e.g. 384 dimensions) that represents the meaning of a text. Semantically similar texts produce similar vectors. Here we use `all-MiniLM-L6-v2`, a compact local model that runs without an API key.

### Vector Store (ChromaDB)
A **vector store** persists chunk embeddings on disk and supports approximate nearest-neighbour search. Given a query embedding, it returns the `top_k` most similar chunks in milliseconds.
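
To make the idea concrete, here is a minimal sketch of exact (brute-force) top-k search by cosine similarity. The toy 3-dimensional vectors, the chunk ids, and the `top_k_chunks` helper are invented for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and a vector store such as ChromaDB uses an approximate index instead of this exhaustive scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float], index: dict, top_k: int = 2) -> list[str]:
    """Return the ids of the top_k most similar chunks, best match first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Toy "embeddings" (hypothetical values, 3 dimensions instead of 384).
toy_index = {
    "pallet_gwp": [0.9, 0.1, 0.0],
    "tape_pfas": [0.1, 0.9, 0.1],
    "pallet_materials": [0.7, 0.2, 0.3],
}
query_vector = [0.8, 0.1, 0.1]  # stands in for an embedded question about pallets

print(top_k_chunks(query_vector, toy_index, top_k=2))
```

The two pallet chunks point in nearly the same direction as the query vector, so they rank above the tape chunk.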

### Retriever
A **retriever** wraps a vector store and exposes a single `retrieve(query)` method. The baseline uses a `VectorStoreRetriever` with `top_k=5`.

### RAG Agent
The **RAG agent** combines a retriever and an LLM. Its `answer()` method:
1. Embeds the query
2. Retrieves the top-k chunks
3. Formats chunks as XML `<source>` tags in the prompt
4. Calls the LLM and returns the answer + cited sources
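
Step 3 could look roughly like the sketch below. The `build_prompt` helper, the tag attributes, and the instruction wording are assumptions made for illustration; the toolkit's actual prompt template is not shown in this notebook.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Chunk:
    # Stand-in mirroring the Chunk definition shown above.
    id: str
    title: str
    content: str
    metadata: dict = field(default_factory=dict)


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Wrap each retrieved chunk in a <source> tag and append the question."""
    sources = []
    for chunk in chunks:
        file_name = chunk.metadata.get("source_file", "unknown")
        sources.append(
            f"<source id={quoteattr(chunk.id)} file={quoteattr(file_name)}>\n"
            f"{escape(chunk.content)}\n"
            "</source>"
        )
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        + "\n".join(sources)
        + f"\n\nQuestion: {question}"
    )


retrieved = [
    Chunk(
        id="c1",
        title="Results",
        content="Placeholder text of a retrieved chunk.",
        metadata={"source_file": "EPD_pallet_relicyc_logypal1.pdf"},
    )
]
prompt = build_prompt("What is the GWP of the Logypal 1 pallet?", retrieved)
print(prompt)
```

Tagging each chunk with its id and file name is what lets the answer cite specific sources.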

---

## Setup

**Prerequisites:** `conversational-toolkit` and `backend` must be installed in editable mode (`pip install -e conversational-toolkit/ && pip install -e backend/`). For the **Ollama** backend, start `ollama serve` and pull the model (`ollama pull mistral-nemo:12b`). For **OpenAI**, set `OPENAI_API_KEY` in your environment.

In [None]:
from pathlib import Path

from conversational_toolkit.agents.base import QueryWithContext
from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.retriever.vectorstore_retriever import VectorStoreRetriever

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    load_chunks,
    inspect_chunks,
    build_vector_store,
    inspect_retrieval,
    build_agent,
    build_llm,
    ask,
    DATA_DIR,
    VS_PATH,
    EMBEDDING_MODEL,
    RETRIEVER_TOP_K,
)

# Choose your LLM backend: "ollama" (local, requires `ollama serve`) or "openai" (requires OPENAI_API_KEY)
BACKEND = "ollama"  # set this before running

if not BACKEND:
    raise ValueError(
        'BACKEND is not set. Edit the line above and set it to "ollama", or "openai".\n'
        "See Renku_README.md for setup instructions."
    )

ROOT = Path().resolve().parents[1]
print(f"Project root : {ROOT}")
print(f"Data dir     : {DATA_DIR}")
print(f"Vector store : {VS_PATH}")
print(f"LLM backend  : {BACKEND}")

---

## Before the RAG Pipeline: The LLM on Its Own

A **large language model (LLM)** is a neural network trained on billions of words of text. It can summarise documents, answer questions, and generate structured output, but only from knowledge baked into its weights during training. It has no direct access to your internal documents.

Before building the RAG pipeline, let's interact with the LLM directly to understand what it can and cannot do on its own.

In [None]:
from conversational_toolkit.llms.base import LLMMessage, Roles

# Reuse the backend you chose in the Setup cell above
llm_standalone = build_llm(backend=BACKEND)

# A question the LLM can answer from general training data
general_question = "What does GWP stand for, and what unit is it typically measured in?"

response_general = await llm_standalone.generate(
    [LLMMessage(role=Roles.USER, content=general_question)]
)
print("---------------------------")
print(f"Q: {general_question}\n")
print("---------------------------")
print(f"A: {response_general.content}")

The LLM should answer this correctly: GWP is a well-known concept covered in its training data.

Now ask something specific to PrimePack AG's product portfolio:

In [None]:
# A product-specific question the LLM has never seen in training
primepack_question = "What is the Global Warming Potential (GWP) of the Logypal 1 pallet sold by PrimePack AG, and is the figure third-party verified? Provide the link to PrimePack AG's official website."

response_pp = await llm_standalone.generate(
    [LLMMessage(role=Roles.USER, content=primepack_question)]
)
print("---------------------------")
print(f"Q: {primepack_question}\n")
print("---------------------------")
print(f"A: {response_pp.content}")

PrimePack AG is a fictional company, so no training data exists for this product:
- If the model gave a specific figure or website link: it is hallucinated.
- If the model said 'I don't know': that is honest, but still not useful.

Either way, the LLM cannot provide the actual figure with a verifiable source.

> **Task: Compare LLM Backends**
> 1. **Switch backends.** Change `BACKEND` to `"openai"` (or to `"ollama"` if you started with OpenAI) and re-run the two question cells above. Does one model hallucinate a specific figure or website link while the other declines? How confident does each answer sound?
> 2. **Change the question.** Replace the PrimePack question with something you could imagine being asked in a real supplier audit. Does the standalone LLM give you an answer you would trust?
> 3. **Note the pattern.** Regardless of whether the model hallucinates or says "I don't know", ask: could you send this response to a customer? What is missing?

### Why do different models behave differently?

**OpenAI models** (GPT-4o, GPT-4o-mini) are extensively trained with Reinforcement Learning from Human Feedback (RLHF), which encourages them to decline when they lack reliable information. For a fictional company like PrimePack AG, with no public web presence, such a model is more likely to say "I don't know" than to confabulate a specific figure.

**Smaller local models** (Mistral, LLaMA 7–13 B) are typically less safety-fine-tuned. Without the reinforcement signal that penalises confident wrong answers, they are more likely to generate a plausible-sounding but fabricated answer.

**The problem in either case:** "I don't know" and a hallucinated answer are equally useless to an employee who needs to respond to a CSRD audit. The correct response, *"The verified GWP is X kg CO₂e according to the 2023 EPD (source: EPD_pallet_relicyc_logypal1.pdf)"*, requires access to the actual document.

---

### Choosing a Backend: OpenAI API vs Local Models

Two LLM backends are available in this workshop. Both expose the same interface, so switching requires changing a single variable.

#### Comparison

| | **OpenAI API** (`gpt-4o-mini`, `gpt-4o`, …) | **Ollama -> local** (`mistral-nemo:12b`, `llama3.2`, …) |
|---|---|---|
| **Data security** | Queries and document chunks are sent to OpenAI's servers. You can request zero-data-retention. | Everything stays on-premise. Nothing leaves the machine. Suitable for confidential documents without any external data agreements. |
| **Model capability** | State-of-the-art. Follows complex instructions reliably, structures output well, handles edge cases. `gpt-4o-mini` is the default OpenAI model for this workshop: it is much cheaper than `gpt-4o` and retains most of its capability for RAG tasks. | Smaller models (7–13 B parameters) are weaker on complex reasoning and strict rule-following. For straightforward retrieval and summarisation tasks the quality gap narrows considerably. |
| **Cost** | Per-token billing. A typical RAG query costs a fraction of a cent. See the cost estimation section below. | No API cost: you pay for hardware (CPU/GPU) and electricity. |
| **Setup** | One API key, no local hardware required | `ollama serve` + model download |

#### Self-Hosting Larger Models

The quality gap between a 12 B local model and GPT-4o can be substantially closed at larger model sizes:

- **LLaMA 3.1 70 B, Mistral Large 2, Qwen 2.5 72 B:** Require server-class GPUs. Quality can approach GPT-4o on structured tasks like RAG.
- **Quantised models (GGUF / GPTQ):** Reduce memory requirements with a modest quality trade-off, making larger models accessible on smaller hardware.
- **Production stacks:** `vLLM` and `llama.cpp` server provide OpenAI-compatible APIs with batching and much higher throughput than `ollama` alone. 

**For this workshop** `gpt-4o-mini` (OpenAI) and `mistral-nemo:12b` (Ollama) are both sufficient to demonstrate the full RAG pipeline.

---

### Cost Estimation

API costs scale with the number of tokens processed. **Input tokens** (your system prompt and retrieved document chunks) are cheaper than **output tokens** (the model's generated answer).

OpenAI API pricing for all models can be found [here](https://developers.openai.com/api/docs/pricing/). As an example, prices for `gpt-4o-mini` are:

| Token type | Price |
|---|---|
| Input | $0.15 / 1 M tokens |
| Output | $0.60 / 1 M tokens |

A rough rule of thumb: **1 token ≈ 4 characters** of English text.

In [None]:
from sme_kt_zh_collaboration_rag.feature0_ingestion import estimate_tokens

# Cost estimation for gpt-4o-mini
INPUT_PRICE_PER_TOKEN = 0.15 / 1_000_000  # USD
OUTPUT_PRICE_PER_TOKEN = 0.60 / 1_000_000  # USD


def estimate_cost(input_text: str, output_text: str) -> dict:
    input_tok = estimate_tokens(input_text)
    output_tok = estimate_tokens(output_text)
    cost = input_tok * INPUT_PRICE_PER_TOKEN + output_tok * OUTPUT_PRICE_PER_TOKEN
    return {"input_tokens": input_tok, "output_tokens": output_tok, "cost_usd": cost}


# Simulate a typical RAG query: system prompt + 5 retrieved chunks + user question --> short generated answer
example_input = (
    "You are a helpful AI assistant specialised in sustainability for PrimePack AG. "
    "Answer only using the provided document excerpts. Cite your sources.\n\n"
    "Source: EPD_pallet_relicyc_logypal1.pdf\n"
    "The Logypal 1 pallet has a declared GWP of 3.2 kg CO\u2082e per functional unit (A1\u2013A3), "
    "verified by an independent third-party auditor under ISO\u202014044.\n\n"
    "Source: ART_product_catalog.md\n"
    "The Logypal 1 (Product ID: 20-100) is a recycled-plastic pallet supplied by Relicyc. "
    "It is listed as the primary pallet for heavy-duty use in the PrimePack AG portfolio.\n\n"
    "[... 3 more retrieved chunks ...]\n\n"
    "Q: What is the GWP of the Logypal 1 pallet, and is it verified?"
)
example_output = (
    "The Logypal 1 pallet has a GWP of 3.2\u202fkg\u202fCO\u2082e per functional unit (A1\u2013A3), "
    "according to the third-party verified EPD (EPD_pallet_relicyc_logypal1.pdf). "
    "The figure has been independently audited under ISO\u202014044."
)

info = estimate_cost(example_input, example_output)
print(f"Input  : ~{info['input_tokens']:>5,} tokens (prompt + chunks + question)")
print(f"Output : ~{info['output_tokens']:>5,} tokens (generated answer)")
print(f"Cost   : ${info['cost_usd']:.6f} per query")
print()
print(f"At 1,000 queries / day     ->  ${info['cost_usd'] * 1_000:>8.4f} / day")
print(f"At 10,000 queries / day    ->  ${info['cost_usd'] * 10_000:>8.4f} / day")
print(f"At 1,000,000 queries / day ->  ${info['cost_usd'] * 1_000_000:>8.2f} / day")

The largest cost driver is context length: more retrieved chunks mean more input tokens.
`top_k=5` (~2,000 input tokens) is a reasonable starting point for RAG.
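
To see how the cost scales with `top_k`, here is a small worked sketch using the `gpt-4o-mini` prices quoted above. The figures of roughly 400 input tokens per retrieved chunk (2,000 / 5, including each chunk's share of the prompt) and 150 output tokens per answer are assumptions for illustration, not measurements from this corpus.

```python
# gpt-4o-mini prices quoted above (USD per token).
INPUT_PRICE_PER_TOKEN = 0.15 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 0.60 / 1_000_000

# Assumptions for illustration only.
TOKENS_PER_CHUNK = 400  # input tokens per retrieved chunk, incl. prompt share
OUTPUT_TOKENS = 150     # length of a short generated answer


def cost_per_query(top_k: int) -> float:
    """Estimated USD cost of one RAG query that retrieves top_k chunks."""
    input_tokens = top_k * TOKENS_PER_CHUNK
    return input_tokens * INPUT_PRICE_PER_TOKEN + OUTPUT_TOKENS * OUTPUT_PRICE_PER_TOKEN


for k in (3, 5, 10):
    print(
        f"top_k={k:>2} -> ~{k * TOKENS_PER_CHUNK:,} input tokens, "
        f"${cost_per_query(k):.6f} per query"
    )
```

Under these assumptions, `top_k=5` costs about $0.00039 per query (2,000 × $0.15 / 1 M + 150 × $0.60 / 1 M), and doubling `top_k` roughly doubles the input share of that cost.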

> **Consider:** For your use case, how many queries would the system handle per day? At what volume does the per-query cost become meaningful? Would data-security requirements push you towards a local model even at lower throughput?

---
# RAG Pipeline

Now that we have seen what an LLM can and cannot do on its own, we are ready to build the retrieval layer that makes it genuinely useful. The following five steps walk through the full RAG pipeline end-to-end, from loading the documents all the way to a sourced answer.

```
 Documents ──► Chunker ──► Embedder ──► Vector DB
                                            │
 User query ─────────────► Embedder ──► Retriever ──► Top-k Chunks
                                                           │
                                                      LLM + Prompt
                                                           │
                                                    Answer + Sources
```

---
## Step 1: Load and Chunk Documents

The `load_chunks()` function walks `data/` and dispatches each file to the right chunker:

| Extension | Chunker | Strategy |
|---|---|---|
| `.pdf` | `PDFChunker` | Convert to Markdown via `pymupdf4llm`, split on `#` headings |
| `.xlsx`, `.xls` | `ExcelChunker` | One chunk per sheet, serialised as a Markdown table |
| `.md`, `.txt` | `MarkdownChunker` | Split on `#` headings |

The result is a flat `list[Chunk]`, the same structure regardless of the original format.

You can use `max_files=5` here for speed. Remove the limit (or set `None`) to load the full corpus.

> **Feature Track 1** explores the importance of ingestion and alternative chunking strategies in more depth.

In [None]:
# Load documents from DATA_DIR and split them into chunks.
chunks = load_chunks(max_files=None)
# Print a statistical summary and sampled content for visual inspection.
inspect_chunks(chunks)

# Print size distribution
char_lengths = [len(c.content) for c in chunks]
over_limit = sum(1 for n in char_lengths if n > 1024)
print(f"\nChunks total       : {len(chunks)}")
print(f"Mean length (chars): {sum(char_lengths) // len(char_lengths)}")
print(f"Over 1024-char limit (≈256 tok embedding limit): {over_limit} / {len(chunks)}")
print("\nSuccessfully loaded and chunked the documents!")

### What a Chunk Looks Like

Each chunk carries a `title` (the heading), the raw text `content`, and a `metadata` dict
with the source file name. This metadata is returned alongside the answer so the user can
trace every claim back to its origin document.

In [None]:
# Print 3 representative chunks
for c in chunks[:3]:
    print(f"--- [{c.metadata.get('source_file', '?')}] ---")
    print(f"Title  : {c.title!r}")
    print(f"Length : {len(c.content)} chars")
    print(f"Preview: {c.content[:200].strip()!r}")
    print()

In [None]:
import re
import os
import numpy as np
import pymupdf4llm  # type: ignore[import-untyped]

from conversational_toolkit.chunking.excel_chunker import ExcelChunker

from sme_kt_zh_collaboration_rag.feature0_ingestion import (
    header_based_chunks,
    fixed_size_chunks,
    paragraph_aware_chunks,
    analyze_chunks,
    compare_strategies,
    print_comparison_table,
    char_histogram,
)

# We analyse a single document in detail: the Relicyc Logypal 1 EPD.
# `chunks_header` is separate from `chunks` (the full corpus loaded above).
sample_pdf = str(DATA_DIR / "EPD_pallet_relicyc_logypal1.pdf")
chunks_header = header_based_chunks(sample_pdf)
print(
    f"Header-based chunking: {len(chunks_header)} chunks from {Path(sample_pdf).name}"
)
print(f"Titles: {[c.title for c in chunks_header[:4]]}")

---

### Ingestion & Chunking: Under the Hood

The chunks you loaded above are the **output** of the ingestion pipeline. This section opens up the pipeline and works through the challenges practitioners face:

1. **Parser choice**: which tool converts your PDF to text, and what gets lost
2. **Layout quality**: heading detection, multi-column text, dropped content
3. **Chunk size**: why it matters for retrieval and comprehension
4. **Chunking strategies**: three approaches compared side by side
5. **Tables**: in PDFs and spreadsheets
6. **Images**: what happens to figures and diagrams
7. **Custom chunkers**: extending the pipeline for your own document types
8. **The embedding token limit**: one constraint to be aware of

---

### 1. PDF Parser Choice

Before any chunking, a PDF must be converted to text. The choice of parser determines what survives (tables, images, headings, multi-column layouts) and how fast the pipeline runs.

| Tool | What it is | Strength | Weakness | Tables (as) | Images (extraction) | Positions (bbox/coords) |
|---|---|---|---|---|---|---|
| **`PyMuPDF`** | Fast low-level PDF parser | Speed + layout coords | Raw output, DIY chunking | Text (heuristics)  | Native raw extraction | Yes (strong) |
| **`PyMuPDF4LLM`** ✓ | PyMuPDF -> Markdown chunks | LLM-ready, easy | Less flexible | Markdown | Limited / mostly references | Yes |
| **`Docling`** | Structured document AI parser | Best layout & tables | Heavy, slower | Structured -> Markdown/JSON | Advanced extraction & understanding | Yes |
| **`Unstructured`** | Multi-format ingestion pipeline | Many file types | Heavy deps, slower   | Element types (table-ish) | Raw extraction via elements   | Parser-dependent  |
| **`PyPDF`** | Lightweight pure-Python reader | Simple & light | Poor layout quality | Text | Raw extraction only | Limited |

**Why `pymupdf4llm`?** It is fast enough for interactive use, outputs clean Markdown with headings and tables already formatted, and requires no GPU. For production pipelines where table accuracy is critical, it makes sense to evaluate other tools.

---

### 2. Layout Quality

Converting a PDF to text is lossy. Understanding *what* gets lost and *why* is essential for building a reliable ingestion pipeline.

**Common layout quality challenges:**

| Challenge | What happens | Effect on retrieval |
|---|---|---|
| **Content before the first heading** | Everything before `# Heading 1` is silently dropped | Cover page metadata, product IDs, version numbers can disappear |
| **Heading detection errors** | Lines that look like headings (e.g. short bold lines in a table) may become false heading boundaries | Chunks split at the wrong place; small noise chunks created |
| **Multi-column layouts** | Column text is linearised left-to-right across the full page width | Sentences from different columns are merged, producing incoherent text |
| **Tables** | Rendered as Markdown pipe tables; column headers may not appear in each row chunk | Query about a specific row may not match because the product name is only in the header |
| **Images and figures** | Default: silently dropped | Figures, diagrams, and chart data are invisible to a text-only retriever |
| **Headers/footers** | Page numbers and document titles may appear in the body text | Noise chunks of 2–3 words are created |

The cell below inspects the raw Markdown for the Relicyc EPD. Read through the output carefully: some of the quality issues above should be visible.

In [None]:
raw_markdown: str = pymupdf4llm.to_markdown(sample_pdf)  # type: ignore[assignment]

heading_pattern = re.compile(r"^(#{1,6}\s.*)$", re.MULTILINE)
heading_matches = list(heading_pattern.finditer(raw_markdown))

# Summary
print(f"Total characters : {len(raw_markdown)}")
print(f"Total lines      : {raw_markdown.count(chr(10))}")
print(f"Headings found   : {len(heading_matches)}")

# Heading map
print("\nHeading map:")
for m in heading_matches[:12]:
    level = m.group(1).count("#", 0, m.group(1).find(" "))
    indent = "  " * (level - 1)
    print(f"  char {m.start():5d}  {indent}{m.group(1).strip()}")
if len(heading_matches) > 12:
    print(f"  ... ({len(heading_matches) - 12} more)")

# Dropped region
first_heading_pos = heading_matches[0].start() if heading_matches else len(raw_markdown)
dropped = raw_markdown[:first_heading_pos]
print(f"\nDropped (before first heading): {len(dropped)} chars")
if dropped.strip():
    print(dropped[:1500])
    if len(dropped) > 1500:
        print(f"... [{len(dropped) - 1500} more chars]")
else:
    print("(nothing dropped: document starts with a heading)")

> 💬 **Discuss:** Scroll through the heading map and dropped region above.
> 1. Does the dropped content contain anything a user might query?
> 2. Are there any headings that look like false positives (not real section titles)?
> 3. For your own organisation's documents, which layout quality challenge concerns you most?

---

### 3. Chunk Size: Why It Matters

Chunk size affects retrieval quality in two opposite directions:

| Chunk too **small** | Chunk too **large** |
|---|---|
| Loses surrounding context -> the chunk makes no sense in isolation | Mixes multiple topics -> the embedding averages over too many ideas |
| May not mention the product name or entity the data is about | LLM receives noisy context -> relevant sentence is buried |
| Many chunks -> slower index build, higher storage | Risk of exceeding the embedding model's token limit (more on this in the token-limit section below) |

The histogram below shows the character-length distribution for the current chunking strategy. **~4 characters ≈ 1 token** for English technical text.

In [None]:
stats = analyze_chunks(chunks_header, "header_based")
print(stats)
print()
print("Character length distribution (1 bar ≈ 1 bucket of lengths):")
print(char_histogram(chunks_header))

print("\nPrinting the first 3 chunks as examples:")
for chunk in chunks_header[:3]:
    print(f"--- {chunk.title or '(no title)'} ---")
    print(chunk.content[:300])
    print()

**Reading the histogram:** Each bucket spans a range of character lengths. A bar in the `[100–524)` bucket means that many chunks land in that range.

A good target for `all-MiniLM-L6-v2` (the local embedding model used here) is **under ~1000 characters, so under ~250 tokens**. Longer chunks risk truncation; the token-limit section below explains why.

> **Step 2 preview:** Embedding models have a maximum input length. For `all-MiniLM-L6-v2` this is 256 tokens. Step 2 covers embeddings in more detail. Chunk size is one lever; embedding model choice is another.

---

### 4. Chunking Strategies

Header-based chunking, used above, is one of three strategies. The cells below try two alternatives, then run all three on the same PDF and print a side-by-side comparison table.

**Fixed-Size** cuts every `chunk_size` characters with an `overlap` for boundary context. Guarantees an upper bound on chunk length but may split mid-sentence or mid-table row.

In [None]:
chunks_fixed = fixed_size_chunks(sample_pdf, chunk_size=800, overlap=100)
print(f"Fixed-size: {len(chunks_fixed)} chunks")
print(analyze_chunks(chunks_fixed, "fixed_size_800"))
print()
print(char_histogram(chunks_fixed))

print("\nPrinting the first 3 chunks as examples:")
for chunk in chunks_fixed[:3]:
    print(f"--- {chunk.title or '(no title)'} ---")
    print(chunk.content[:300])
    print()
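
Conceptually, fixed-size chunking is a sliding window over the text. The sketch below is a simplified stand-in that operates on a plain string; it is not the implementation of `fixed_size_chunks`, which takes a PDF path and returns `Chunk` objects. The helper name `sliding_window` is hypothetical.

```python
def sliding_window(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Cut text into pieces of at most chunk_size chars; neighbours share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window(sample, chunk_size=10, overlap=3)
print(pieces)
# -> ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each piece repeats the last three characters of its predecessor. That overlap gives a sentence cut at a boundary a chance to appear whole in at least one chunk, but the window still ignores sentence and table-row boundaries.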

**Paragraph-Aware** respects double-newline boundaries and merges consecutive paragraphs until a `target_chars` ceiling is reached. Never splits mid-sentence.

In [None]:
chunks_para = paragraph_aware_chunks(sample_pdf, target_chars=600)
print(f"Paragraph-aware: {len(chunks_para)} chunks")
print(analyze_chunks(chunks_para, "paragraph_600"))
print()
print(char_histogram(chunks_para))

print("\nPrinting the first 3 chunks as examples:")
for chunk in chunks_para[:3]:
    print(f"--- {chunk.title or '(no title)'} ---")
    print(chunk.content[:300])
    print()
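
The merging logic can be sketched as a greedy loop over paragraphs. This is a simplified illustration on a plain string, not the code behind `paragraph_aware_chunks`; the helper name `merge_paragraphs` is hypothetical.

```python
def merge_paragraphs(text: str, target_chars: int = 600) -> list[str]:
    """Greedily merge paragraphs (split on blank lines) up to target_chars per chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    merged: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > target_chars:
            merged.append(current)  # ceiling reached: close the current chunk
            current = para
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


sample_text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph, a bit longer."
merged = merge_paragraphs(sample_text, target_chars=40)
print(merged)
# -> ['First paragraph.\n\nSecond paragraph.', 'Third paragraph, a bit longer.']
```

In this sketch a single paragraph longer than `target_chars` is kept whole rather than split, so the ceiling is a target rather than a hard limit.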

In [None]:
results = compare_strategies(sample_pdf)
print_comparison_table(results)

> When would you use which chunking strategy?

---

### 5. Tables: PDFs and Spreadsheets

**Tables in PDFs** are rendered as Markdown pipe tables. The problem: a table chunk full of numbers may not mention key information like the product name, weakening the semantic match for queries like *"GWP of the Logypal 1"*. If the heading chunk above the table does contain the product name, the retriever may return the heading chunk but miss the actual data.

**Spreadsheets (`.xlsx`)** are chunked differently: `ExcelChunker` creates one chunk per sheet, serialising the sheet as a Markdown-like table. This preserves column structure but produces large chunks for wide sheets.

In [None]:
# PDF tables
table_chunks = [c for c in chunks_header if c.content.count("|") >= 8]
print(
    f"Chunks with tables (pipe heuristic): {len(table_chunks)} of {len(chunks_header)}"
)
print()
for tc in table_chunks[1:2]:
    product_mentioned = any(
        t in tc.content.lower() for t in ["logypal", "relicyc", "32-103"]
    )
    print(f"Title              : {tc.title!r}")
    print(f"Product name in chunk: {product_mentioned}")
    print("Content preview:")
    print(tc.content)
    print()

**The table content was silently dropped by the parser**

The chunk does contain a table, but look closely at the extracted content:

|Col1|Col2|Col3|Col4|Col5|Col6|Col7|
|---|---|---|---|---|---|---|
||||||||
||||||||
||||||||
||||||||
||||||||
||||||||
|||||||15|

`pymupdf4llm` detected the table structure but could not read the cell content. The result is a skeleton of empty rows with auto-generated column names (`Col1`...`Col7`) and a single stray value (`15`). The actual header labels and data are gone. The table caption (`_Table 3: Content declaration of pallet Logypal 1_`) survived only because it is regular paragraph text, not part of the table element.

**Why does this happen?**  
PDF tables have no universal standard. Some PDFs encode tables as proper table structures; others draw them as lines and position text independently. When `pymupdf4llm` cannot reliably map text to cells, for example because of merged cells or text positioned outside the detected grid, it falls back to empty placeholders rather than guessing wrong values.

In [None]:
# Excel / spreadsheets
xlsx_file = str(DATA_DIR / "product_overview.xlsx")
xl_chunks = ExcelChunker().make_chunks(xlsx_file)
print(f"Excel chunks: {len(xl_chunks)} (one per sheet)")
for c in xl_chunks:
    print(f"  Sheet '{c.title}': {len(c.content)} chars")
    print(f"  Preview: {c.content.strip()!r}")
    print()

> Inspect the table obtained from `product_overview.xlsx` with the `ExcelChunker`. Is it correct?

|  | Product Category | ID | Product Name | Supplier | EPD |
| --- | --- | --- | --- | --- | --- |
|  | Tape | 50-100 | Pressure-Sensitive Hot Melt Carton Sealing Tape | ipg | yes |
|  |  | 50-101 | Water-Activated Tape | ipg | yes |
|  |  | 50-102 | tesapack ECO & ULTRA STRONG ecoLogo | tesa | no |
|  | Pallets | 32-100 | Noé pallet | CPR System | yes |
|  |  | 32-101 | Wooden pallet | CPR System | no |
|  |  | 32-102 | Plastic pallet | CPR System | no |
|  |  | 32-103 | Logypal 1 | Relicyc | yes |
|  |  | 32-104 | LogyLight | Relicyc | no |
|  |  | 32-105 | Plastic pallet EP 08 ® | StabilPlastik | yes |
|  | Cardboard boxes | 11-100 | Cartonpallet CMP Roserio | redbox | yes |
|  |  | 11-101 | Corrugated cardboard packaging | Grupak | yes |

---

### 6. Images in PDFs

By default, `PDFChunker` uses `write_images=False`, so **images are silently dropped**. The text-only Markdown output contains no trace of any figure or diagram. For sustainability documents this can be significant: LCA system boundaries, process flow diagrams, and certification labels are often communicated only as images.

With `write_images=True`, images are extracted to disk and a reference `![image_name](path)` is injected into the Markdown at the correct position. However, **the current embedding model cannot see images**: it embeds only the reference string `"![Figure 3](figure_3.png)"`, which carries essentially no semantic content.

| Possible approach | What the retriever sees |
|---|---|
| `write_images=False` (default) | Nothing, images dropped | 
| `write_images=True` | Image filename as text reference | 
| Vision LLM caption -> embed caption | Rich text description of the image |
| Multimodal embedding model | Image vector + text vector fused |

In [None]:
# Count image references in the default (write_images=False) raw Markdown
img_refs = re.findall(r"!\[.*?\]\(.*?\)", raw_markdown)
print(f"Image references in raw Markdown (write_images=False): {len(img_refs)}")
if img_refs:
    print("References found:")
    for r in img_refs[:5]:
        print(f"  {r}")
else:
    print("No image references -> all figures are dropped by default.")

---

### 7. Custom Chunkers

Every chunker inherits from `Chunker`, a simple abstract base class with a single method:

```python
from abc import ABC, abstractmethod


class Chunker(ABC):
    @abstractmethod
    def make_chunks(self, *args, **kwargs) -> list[Chunk]:
        ...
```

If none of the built-in chunkers fit your document type, e.g. supplier data in a CSV, you can implement your own in a few lines.
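
As an illustration, a custom chunker for CSV data might look like the sketch below. `CSVRowChunker`, its parameters, and the file name `suppliers.csv` are hypothetical, and the `Chunk` and `Chunker` definitions are local stand-ins that mirror the ones shown in this notebook so that the example runs on its own. For simplicity it takes the CSV text directly rather than a file path.

```python
import csv
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Stand-ins so the sketch is self-contained. In the notebook, use the
# toolkit's own Chunk and Chunker classes instead.
@dataclass
class Chunk:
    id: str
    title: str
    content: str
    metadata: dict = field(default_factory=dict)


class Chunker(ABC):
    @abstractmethod
    def make_chunks(self, *args, **kwargs) -> list[Chunk]: ...


class CSVRowChunker(Chunker):
    """Hypothetical chunker: one chunk per CSV row, rendered as 'column: value' lines."""

    def make_chunks(self, csv_text: str, source_file: str = "unknown.csv") -> list[Chunk]:
        rows = csv.DictReader(io.StringIO(csv_text))
        chunks = []
        for i, row in enumerate(rows):
            # Repeating the column names in every chunk keeps each row self-describing.
            content = "\n".join(f"{col}: {val}" for col, val in row.items())
            chunks.append(
                Chunk(
                    id=f"{source_file}#row{i}",
                    title=next(iter(row.values()), ""),
                    content=content,
                    metadata={"source_file": source_file, "row": i},
                )
            )
        return chunks


csv_text = "Product Name,Supplier,EPD\nLogypal 1,Relicyc,yes\nLogyLight,Relicyc,no\n"
csv_chunks = CSVRowChunker().make_chunks(csv_text, source_file="suppliers.csv")
for c in csv_chunks:
    print(c.id, "|", c.content.replace("\n", "; "))
```

Because every chunk carries its own column names and product name, a query such as "Does Logypal 1 have an EPD?" has matching text in the chunk itself, which addresses the table problem described in section 5.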

---

### 8. The Embedding Token Limit

`all-MiniLM-L6-v2` truncates its input at **256 tokens** (~1,024 characters). Text beyond this is silently dropped *before* the embedding is computed, so the retriever never sees the tail of long chunks.
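
A rough way to gauge the impact is to estimate, per chunk, how much text falls beyond the limit. The sketch below relies on the 4-characters-per-token rule of thumb, so its numbers are approximations; the helper name `truncation_report` is hypothetical, and exact counts would require the model's own tokenizer.

```python
CHARS_PER_TOKEN = 4  # rule of thumb from this notebook, not an exact tokenizer count
MAX_TOKENS = 256     # input limit of all-MiniLM-L6-v2
MAX_CHARS = MAX_TOKENS * CHARS_PER_TOKEN  # ~1,024 characters


def truncation_report(text: str) -> dict:
    """Estimate how much of a chunk's tail the embedding model would ignore."""
    ignored_chars = max(0, len(text) - MAX_CHARS)
    return {
        "chars": len(text),
        "approx_tokens": len(text) // CHARS_PER_TOKEN,
        "ignored_chars": ignored_chars,
        "ignored_share": ignored_chars / len(text) if text else 0.0,
    }


short_chunk = "x" * 800   # fits within the limit
long_chunk = "x" * 4096   # roughly four times the limit

print(truncation_report(short_chunk))
print(truncation_report(long_chunk))
```

By this estimate, about three quarters of a 4,096-character chunk would not contribute to its embedding, so a query that matches only the tail of that chunk is unlikely to retrieve it.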

> **Quick fix: use an embedding model with a higher token limit.** For example, OpenAI's `text-embedding-3-small` accepts up to 8,191 tokens, eliminating the truncation problem entirely for this corpus. The `OpenAIEmbeddings` class in the toolkit is a drop-in replacement. The tradeoff is an API key, a per-call cost, and a network dependency. For a workshop on a laptop, the local model is fine; for a production system handling documents with large sections, switching may be worthwhile.
>
> Step 2 covers embedding models in more detail.
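
To get a feeling for how much text is at risk, the sketch below splits a chunk at the estimated cut-off. It assumes the rough ratio of 4 characters per token implied by the figures above; the real cut-off depends on the model's tokenizer, so treat the result as an estimate. `estimate_truncation` and the sample text are hypothetical.

```python
def estimate_truncation(
    text: str, max_tokens: int = 256, chars_per_token: int = 4
) -> tuple[str, str]:
    """Split text into the part the model likely embeds and the part it likely drops.

    Character-based estimate only; the exact boundary depends on the tokenizer.
    """
    limit = max_tokens * chars_per_token
    return text[:limit], text[limit:]


# Made-up chunk: long boilerplate followed by the one sentence a user would ask about.
sample_chunk = "Product overview. " * 60 + "GWP: see EPD section."
seen, dropped = estimate_truncation(sample_chunk)

print(f"chunk length : {len(sample_chunk)} chars")
print(f"likely seen  : {len(seen)} chars")
print(f"likely lost  : {len(dropped)} chars, ending with {dropped[-21:]!r}")
```

In this toy example the sentence mentioning GWP sits entirely in the dropped tail, which is exactly the kind of content a query would be looking for.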

---
After this deep-dive into parsing and chunking, let's go to the next step in the RAG pipeline:

## Step 2: Embed Chunks and Build the Vector Store

`SentenceTransformerEmbeddings` converts every chunk's `content` to a 384-dimensional vector using `all-MiniLM-L6-v2`. The resulting matrix (shape `[n_chunks, 384]`) is inserted into a persistent `ChromaDBVectorStore`.

**On subsequent runs**, leave `reset=False` (the default) to skip re-embedding: it takes time, and the store on disk is already correct. Pass `reset=True` only when the corpus or chunking strategy changes.

---

### Embedding Models
Two model families are implemented in the toolkit. The choice affects retrieval quality, cost, context limit, and data security.

| | `all-MiniLM-L6-v2` (default) | Other SentenceTransformer models | `text-embedding-3-small` (OpenAI) |
|---|---|---|---|
| **Dimensions** | 384 | 384–1024 (model-dependent) | 1024 (toolkit setting) |
| **Context limit** | 256 tokens | 256–8,192 tokens (model-dependent) | 8,191 tokens |
| **Cost** | Free, local | Free, local | ~$0.02 / 1 M tokens |
| **Data security** | Fully local | Fully local | Sent to OpenAI |
| **Quality** | Good for short, focused text | Varies; some match OpenAI | State-of-the-art |
| **Setup** | No API key | No API key | `OPENAI_API_KEY` required |

**SentenceTransformer: free alternatives from HuggingFace**
Any model from HuggingFace that is compatible with the `sentence-transformers` library works with our `SentenceTransformerEmbeddings` by passing a different `model_name`. Browse quality rankings on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

Hardware matters: larger models are slower on CPU. A useful rule of thumb: models under ~150 MB run comfortably on CPU; larger models benefit from a GPU.

| Model | Size | CPU speed | Max input tokens (truncation limit) | Notes |
|---|---|---|---|---|
| `all-MiniLM-L6-v2` | 90 MB | Very fast | 256 | Default; good for short technical text |
| `sentence-transformers/all-mpnet-base-v2` | 420 MB | Moderate | 384 | Better English quality; 384-token limit |
| `BAAI/bge-m3` | 2.3 GB | Very slow, GPU recommended | 8192 | Best multilingual quality; 8192-token limit |
| `thenlper/gte-large` | 1.3 GB | Slow, GPU recommended | 512 | Strong English quality |

> On Renku (CPU-only sessions), the top two rows are practical choices. 

**OpenAI embeddings**
`OpenAIEmbeddings` from `conversational_toolkit.embeddings.openai` calls the OpenAI API. The toolkit requests 1024 dimensions using OpenAI's Matryoshka dimension reduction, a technique that allows truncating full embeddings to a smaller size with minimal quality loss. Requires `OPENAI_API_KEY`.

```python
from conversational_toolkit.embeddings.openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(model_name="text-embedding-3-small")
```
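
The idea behind the dimension reduction can be illustrated without calling the API: keep the first components of the vector and rescale to unit length. The sketch below uses a toy vector and a hypothetical helper, `shorten_embedding`; it shows the principle only and is not how the toolkit or OpenAI implement it.

```python
import math


def shorten_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in head]


# Toy 6-dimensional vector (made up); real embeddings have hundreds of dimensions.
full = [0.6, 0.0, 0.8, 0.1, -0.2, 0.05]
short = shorten_embedding(full, dim=3)

print(f"dimensions: {len(full)} -> {len(short)}")
print(f"length after rescaling: {math.sqrt(sum(x * x for x in short)):.3f}")
```

Rescaling matters because the vector store's distance ranking only matches cosine similarity when all vectors have unit length.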

---

### Vector Store

A vector store persists chunk embeddings on disk and provides approximate nearest-neighbour search. Two implementations are in the toolkit.

**`ChromaDBVectorStore`** (used in this notebook)
- Embedded database: no separate server process, data stored as files on disk at `VS_PATH`.
- Survives session restarts, which is why `reset=False` is safe by default.
- Uses L2 distance for search. For unit-length vectors (which both embedding models produce) this gives the same ranking as cosine similarity: the numbers differ, but the top-k order is identical.
- Well-suited for corpora up to ~100 k chunks. Not designed for concurrent writes or multi-user access.

**`PGVectorStore`** (also in the toolkit)
- PostgreSQL with the `pgvector` extension. Requires a running Postgres instance.
- Uses cosine similarity natively (other distance metrics are also supported).
- Supports rich metadata filtering, concurrent reads and writes, and standard SQL queries alongside vector search.
- The right choice when you already have a Postgres infrastructure, need concurrent access, or want to combine vector search with relational data.

In [None]:
embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
print(f"Embedding model: {EMBEDDING_MODEL}")

# Set reset=True to rebuild the store from scratch
vector_store = await build_vector_store(
    chunks, embedding_model, db_path=VS_PATH, reset=False
)
print("Vector store ready.")

### Similarity in Embedding Space

Embeddings that are close in vector space share semantic meaning (for an interactive visualisation of embeddings in vector space, see the [TensorFlow Embedding Projector](https://projector.tensorflow.org)). The cell below embeds several sentences and measures their cosine similarity: a value between -1 (opposite) and 1 (identical). You can change the sentences to see the impact on cosine similarity.

In [None]:
sentence1 = "carbon footprint of a pallet"
sentence2 = "GWP value for the Logypal 1"
sentence3 = "PFAS-free tape declaration"
sentence4 = "the annual report of a software firm"


async def cosine_similarity(a: str, b: str) -> float:
    vecs = await embedding_model.get_embeddings([a, b])
    return float(
        np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
    )


pairs = [
    (sentence1, sentence2),
    (sentence1, sentence3),
    (sentence1, sentence4),
]

print("Cosine similarities:")
for a, b in pairs:
    sim = await cosine_similarity(a, b)
    print(f"{sim:.3f}  -->  {a!r}  vs  {b!r}")

### Comparing Embedding Models on our Documents

A quick way to compare models without running a full evaluation: pick a query, a relevant chunk, and an irrelevant chunk, then measure the **cosine similarity gap**, that is, how much more similar the relevant chunk is to the query than the irrelevant one. A larger gap means the model separates useful results from noise more clearly, which tends to translate into higher retrieval precision.

The cell below runs this for three CPU-friendly models (and OpenAI if the key is set). The HuggingFace models are downloaded on first use (~400 MB each, takes a minute).

In [None]:
from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.embeddings.openai import OpenAIEmbeddings


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Test query and two chunks from the corpus
query = "What is the carbon footprint of the Logypal 1 pallet?"

chunk_relevant = "Logypal 1 - GWP: 3.2 kg CO2e per functional unit (A1-A3). Figure verified by independent third-party auditor under ISO 14044."
chunk_irrelevant = "PrimePack AG Supplier Code of Conduct. All suppliers must comply with applicable environmental regulations and report annually on progress."

# Models to compare (CPU-friendly by default)
models_to_compare: dict = {
    "all-MiniLM-L6-v2 (90 MB, 384-dim)": SentenceTransformerEmbeddings(
        "all-MiniLM-L6-v2"
    ),
    "all-mpnet-base-v2 (420 MB, 768-dim)": SentenceTransformerEmbeddings(
        "sentence-transformers/all-mpnet-base-v2"
    ),
    "multilingual-MiniLM-L12 (470 MB, 384-dim)": SentenceTransformerEmbeddings(
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    ),
}
# Add OpenAI if key is available
if os.getenv("OPENAI_API_KEY"):
    models_to_compare["text-embedding-3-small  (API, 1024-dim)"] = OpenAIEmbeddings(
        "text-embedding-3-small"
    )

# Run comparison
print("---------------------------")
print(f"Query     : {query!r}")
print(f"Relevant  : {chunk_relevant[:70]!r}...")
print(f"Irrelevant: {chunk_irrelevant[:70]!r}...")
print()
print(f"{'Model':<44}  {'sim(relevant)':>13}  {'sim(irrelevant)':>15}  {'gap':>6}")
print("-" * 84)

for name, model in models_to_compare.items():
    vecs = await model.get_embeddings([query, chunk_relevant, chunk_irrelevant])
    sim_rel = cosine_sim(vecs[0], vecs[1])
    sim_irr = cosine_sim(vecs[0], vecs[2])
    print(f"{name:<44}  {sim_rel:>13.3f}  {sim_irr:>15.3f}  {sim_rel - sim_irr:>6.3f}")

print("\nGap = sim(relevant) - sim(irrelevant). Larger gap -> better discrimination.")

**Disclaimer**: This is a quick example, not a rigorous evaluation. A single (query, chunk) pair can be misleading: one model may score higher here and worse on a different example. Reliable model selection requires testing across many diverse queries and aggregating a metric like MRR or NDCG. Feature Track 1 shows how to do this systematically.
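
For orientation, Mean Reciprocal Rank (MRR) can be computed in a few lines. The sketch below makes a simplifying assumption: each query has exactly one relevant chunk. The chunk ids and rankings are placeholders, not results from this corpus, and `mean_reciprocal_rank` is a hypothetical helper rather than a toolkit function.

```python
def mean_reciprocal_rank(ranked_ids: list[list[str]], relevant_ids: list[str]) -> float:
    """Average of 1/rank of the relevant chunk per query (0 if it was not retrieved)."""
    if not ranked_ids:
        raise ValueError("need at least one query")
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need exactly one relevant id per query")
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if relevant in ranking:
            total += 1.0 / (ranking.index(relevant) + 1)
    return total / len(ranked_ids)


# Made-up retrieval results for three queries (chunk ids are placeholders).
rankings = [
    ["c1", "c7", "c3"],  # relevant chunk c1 found at rank 1
    ["c4", "c2", "c9"],  # relevant chunk c2 found at rank 2
    ["c5", "c6", "c8"],  # relevant chunk c0 not retrieved
]
relevant = ["c1", "c2", "c0"]

mrr = mean_reciprocal_rank(rankings, relevant)
print(f"MRR = {mrr:.3f}")
```

Running each candidate embedding model over the same set of queries and comparing the resulting MRR values gives a much more stable picture than a single similarity gap.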

> Task: Probe the comparison with your own examples
> 1. Replace query, chunk_relevant, and chunk_irrelevant with other combinations 
> 2. Try a query where your phrasing differs (e.g. "carbon emissions" instead of "GWP"). Does any model bridge the gap better?
> 3. Try a query in German. Does paraphrase-multilingual-MiniLM-L12-v2 show a larger gap than the English-only default?

---

## Step 3: Inspect Retrieval (Before the LLM Sees Anything)

This is the **most important diagnostic step** in the whole pipeline:

> If the retrieved chunks are wrong, the final answer will be wrong regardless of how good the LLM is.

`inspect_retrieval()` runs the query through the embedding model, fetches the top-k most similar chunks from ChromaDB, and prints them with scores. Use this to verify that relevant documents are in the index, tune `top_k`, compare different query phrasings, and identify retrieval gaps before blaming the LLM.

The **similarity score** is the L2 distance reported by ChromaDB: lower = more similar. The range [0, 4] corresponds to the squared L2 distance between unit-length vectors. L2 distance is used because it is ChromaDB's default and is well defined for any vectors, normalised or not. Cosine similarity compares direction only and ignores magnitude, so the two metrics can rank results differently when vector lengths vary. When the embedding model produces unit-length vectors, the two are tied by `d² = 2 - 2·cos(θ)`, so the ranking is identical. The score numbers look different, but the top-5 results are the same either way.
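
The relationship between the two metrics can be checked numerically. The sketch below uses three made-up vectors, normalises them to unit length, and compares the ranking by squared L2 distance with the ranking by cosine similarity. It uses only the standard library and does not touch the vector store.

```python
import math


def normalise(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, normalised to unit length.
query_vec = normalise([1.0, 2.0, 0.5])
chunk_vecs = {
    "chunk_a": normalise([1.0, 1.8, 0.7]),
    "chunk_b": normalise([-1.0, 0.5, 2.0]),
    "chunk_c": normalise([0.2, 2.0, -1.0]),
}

# Rank by ascending distance and by descending similarity.
by_l2 = sorted(chunk_vecs, key=lambda k: squared_l2(query_vec, chunk_vecs[k]))
by_cos = sorted(chunk_vecs, key=lambda k: dot(query_vec, chunk_vecs[k]), reverse=True)

for name, vec in chunk_vecs.items():
    d2, cos = squared_l2(query_vec, vec), dot(query_vec, vec)
    print(f"{name}: d2={d2:.4f}   2 - 2*cos={2 - 2 * cos:.4f}")
print("same order:", by_l2 == by_cos)
```

Removing the `normalise` calls is a quick way to see that the two rankings are no longer guaranteed to agree once vector lengths differ.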

In [None]:
QUERY = "What materials is the Logypal 1 pallet made from?"

results = await inspect_retrieval(
    QUERY, vector_store, embedding_model, top_k=RETRIEVER_TOP_K
)

> Task: Inspect the retrieved chunks
> 1. Are all 5 chunks about the Logypal 1, or do other products appear? If so, what does that tell you about how the vector store handles similar-sounding queries?
> 2. Change the query to a synonym or a different phrasing (e.g. "composition of the Logypal 1" or "what is the Logypal 1 made of"). Do the retrieved chunks change? Do the scores shift?
> 3. Try reducing `top_k` to 2 or increasing it to 10 by passing a different `top_k` to `inspect_retrieval`. Is more context always better?

### Retrieval for a Product Outside the Portfolio

The PrimePack AG product catalog defines the portfolio boundary. The **Lara Pallet** is not in the catalog; it does not exist. Watch which chunks are returned and what scores they have. A **higher** minimum score (a larger L2 distance for the best match) signals a *weaker semantic match*.

In [None]:
QUERY_OOK = "What materials is the Lara pallet made from?"

results_ook = await inspect_retrieval(
    QUERY_OOK, vector_store, embedding_model, top_k=RETRIEVER_TOP_K
)

> **Observation:** The retriever always returns the *closest* chunks it can find; it has no concept of "no match". For an unknown product the L2 distances are **higher** (the closest chunks are still about other pallets), but without a guard the LLM receives those chunks anyway and may silently answer about the wrong product.
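
One possible guard is a distance threshold applied before the chunks reach the LLM. The sketch below is an assumption-laden illustration: `RetrievedChunk` is a stand-in for the toolkit's result objects (only `title` and `score` are modelled), the scores are made up, and `MAX_DISTANCE` is an arbitrary value that would need to be calibrated on real queries.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Stand-in for a retrieval result; only the fields used here."""

    title: str
    score: float  # L2 distance: lower means more similar


def filter_by_distance(
    results: list[RetrievedChunk], max_distance: float
) -> list[RetrievedChunk]:
    """Keep only chunks whose distance is at or below the threshold."""
    return [r for r in results if r.score <= max_distance]


# Made-up scores and an arbitrary threshold, for illustration only.
MAX_DISTANCE = 1.0
retrieved = [
    RetrievedChunk("Pallet datasheet", 1.21),
    RetrievedChunk("Product catalog", 1.34),
]

kept = filter_by_distance(retrieved, MAX_DISTANCE)
if not kept:
    print("No sufficiently similar chunks: report 'not found' instead of calling the LLM.")
```

A fixed threshold is a blunt instrument: set too low it blocks legitimate queries, set too high it lets near misses through, so it is best combined with an explicit instruction in the system prompt.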

---

## Step 4: Build the RAG Agent

`build_agent()` assembles the three components:

```
VectorStoreRetriever
    ├─ ChromaDBVectorStore (on disk, persists across runs)
    └─ SentenceTransformerEmbeddings

RAG Agent
    ├─ LLM (Ollama / OpenAI)
    ├─ Retriever
    └─ System prompt
```

### The System Prompt

The system prompt is a key lever for controlling LLM behaviour. It is prepended to every conversation and defines the rules the model must follow:

```
You are a helpful AI assistant specialised in sustainability and product compliance
for PrimePack AG. 

You will receive document excerpts relevant to the user's question. Produce the best possible answer using only the information in those excerpts.
```

In [None]:
llm = build_llm(backend=BACKEND)

SYSTEM_PROMPT = (
    "You are a helpful AI assistant specialised in sustainability and product compliance for PrimePack AG.\n\n"
    "You will receive document excerpts relevant to the user's question. Produce the best possible answer using only the information in those excerpts."
)
agent = build_agent(
    vector_store=vector_store,
    embedding_model=embedding_model,
    llm=llm,
    top_k=RETRIEVER_TOP_K,
    system_prompt=SYSTEM_PROMPT,
    number_query_expansion=0,  # 0 = no expansion; see Feature Track 3 for more
)
print("RAG agent assembled.")

---

## Step 5: Ask a Question

`ask()` sends the query to the agent and returns the answer string. The internal flow is:

1. Embed the query
2. Retrieve top-k chunks
3. Build the prompt: `<system>` + `<sources>` XML block + user question
4. Generate the answer with the LLM
5. Return the answer and a list of cited source chunks
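
Step 3 can be pictured as plain string assembly. The sketch below is a guess at the general shape, not the toolkit's actual template: the tag attributes, the dictionary keys, and the helper name `build_prompt` are all assumptions, and the excerpt is placeholder text.

```python
from xml.sax.saxutils import escape, quoteattr


def build_prompt(system_prompt: str, chunks: list[dict], question: str) -> str:
    """Assemble a system block, an XML sources block, and the user question."""
    sources = "\n".join(
        f'  <source id="{i}" file={quoteattr(c["source_file"])}>'
        f'{escape(c["content"])}</source>'
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"<system>\n{system_prompt}\n</system>\n"
        f"<sources>\n{sources}\n</sources>\n"
        f"{question}"
    )


# Placeholder chunk; in the pipeline these come from the retriever.
example_chunks = [
    {"source_file": "example_datasheet.pdf", "content": "Example excerpt A & B"},
]
prompt = build_prompt(
    "Answer using only the sources.", example_chunks, "What is it made from?"
)
print(prompt)
```

Escaping the chunk text keeps characters such as `&` or `<` in a document from breaking the XML structure that the LLM relies on to tell sources apart.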

In [None]:
QUERY = "What materials is the Logypal 1 pallet made from?"

print("---------------------------")
print(f"Query: {QUERY!r}")
print("---------------------------")
answer = await ask(agent, QUERY)

---

## Probing Failure Modes

The dataset was designed with four deliberate challenges. Run the queries below and observe the answers.

### a) Out-of-Portfolio Query

The **Lara Pallet** does not exist. A good RAG must say so instead of describing a different pallet.

In [None]:
QUERY = "What materials is the Lara pallet made from?"
print("---------------------------")
print(f"Query: {QUERY!r}")
print("---------------------------")
answer_ook = await ask(agent, QUERY)

### b) Missing Data (LogyLight Pallet)

The LogyLight datasheet marks all LCA fields as *"not yet available"*. The correct answer is that the data is not yet available: not a fabricated figure, and not a claim that there is no information on the product.

In [None]:
QUERY = "What is the GWP of the LogyLight pallet?"
print("---------------------------")
print(f"Query: {QUERY!r}")
print("---------------------------")
answer_gap = await ask(agent, QUERY)

### c) Conflicting Evidence (Relicyc GWP Figures)

The 2021 Relicyc datasheet reports **4.1 kg CO₂e** per pallet. The 2023 EPD (third-party verified) reports a different, more recent figure. The RAG should flag the conflict and prefer the verified, more recent source.

In [None]:
QUERY = "What is the GWP of the Logypal 1 pallet, and how reliable is the figure?"
print("---------------------------")
print(f"Query: {QUERY!r}")
print("---------------------------")
answer_conflict = await ask(agent, QUERY)

### d) Unverified Supplier Claim (Tesa ECO Tape)

The tesa supplier brochure claims a **68% CO₂ reduction** compared to conventional tape. This is a self-declared marketing claim; there is no independent EPD. The RAG should report the claim but flag that it is unverified.

In [None]:
QUERY = (
    "How much lower is the carbon footprint of tesa ECO tape compared to standard tape?"
)
print("---------------------------")
print(f"Query: {QUERY!r}")
print("---------------------------")
answer_claim = await ask(agent, QUERY)

> 💬 **Discuss with your peers:** Look back at the four queries you just ran:
> 1. Which failure modes did the RAG handle correctly, and which did it not?
> 2. Did your LLM backend matter? If you used a different backend (OpenAI vs Ollama), compare answers. Did one backend refuse to speculate more often? Did one add caveats the other didn't?
> 3. Which failure mode concerns you most in a real deployment?
> 4. What is the downstream consequence? Think concretely: CSRD reporting errors, a false marketing claim to a customer, a supplier selected on unverified data.
>
> Can you think of other ways the system might fail that aren't shown here?

---

## Multi-Turn Conversation

All previous queries were standalone: one question, one answer. Real usage looks different: a user asks a follow-up question that references the previous answer ("What about its end-of-life?", "Is that figure verified?").

The `ask()` function accepts a `history` argument, a list of prior `LLMMessage` objects, to support this. There is one subtlety: the retriever only sees the current query, not the conversation history. A follow-up like "What about its recycled content?" would embed the pronoun "its", which matches nothing in the corpus.

To prevent this, when `history` is provided the agent first rewrites the query into a self-contained form before retrieval. For example, "What about its recycled content?" becomes "What is the recycled content of the Logypal 1 pallet?", and only then does the agent embed and retrieve.
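
The rewriting step is itself an LLM call. A rough sketch of the instruction it might receive is shown below; the wording, the helper name `build_rewrite_prompt`, and the `(role, content)` tuple format are assumptions for illustration and are not taken from the toolkit.

```python
def build_rewrite_prompt(history: list[tuple[str, str]], query: str) -> str:
    """Format prior turns plus the follow-up into a rewriting instruction."""
    transcript = "\n".join(f"{role}: {content}" for role, content in history)
    return (
        "Rewrite the final user question so that it can be understood without "
        "the conversation. Replace pronouns with the entities they refer to. "
        "Return only the rewritten question.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final user question: {query}"
    )


# Example conversation; the assistant reply is a placeholder.
example_history = [
    ("user", "What materials is the Logypal 1 pallet made from?"),
    ("assistant", "(previous answer about the Logypal 1)"),
]
rewrite_prompt = build_rewrite_prompt(example_history, "What about its recycled content?")
print(rewrite_prompt)
```

The LLM's reply to this prompt, rather than the original follow-up, is what gets embedded and sent to the retriever.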

In [None]:
from conversational_toolkit.llms.base import LLMMessage, Roles

history: list[LLMMessage] = []


async def conversation_turn(query: str) -> str:
    global history
    answer = await agent.answer(QueryWithContext(query=query, history=history))
    history.append(LLMMessage(role=Roles.USER, content=query))
    history.append(LLMMessage(role=Roles.ASSISTANT, content=answer.content))
    return answer.content


QUERY1 = "Which pallets in our portfolio have a third-party verified EPD?"
QUERY2 = "What is the GWP figure reported in it for the Logypal 1?"

# Turn 1: standalone question that sets the context
reply1 = await conversation_turn(QUERY1)
print("---------------------------")
print(f"User: {QUERY1}")
print("---------------------------")
print(f"Assistant: {reply1}\n")

# Turn 2: follow-up using a pronoun; the agent should resolve "it" before retrieval
reply2 = await conversation_turn(QUERY2)
print("---------------------------")
print(f"User: {QUERY2}")
print("---------------------------")
print(f"Assistant: {reply2}")

---

## Running the Full Pipeline in One Call

We have now gone through the pipeline step by step. For convenience, the `run_pipeline()` function executes all five steps end-to-end. It is also what the `__main__` entry point calls.

Use it for quick one-shot queries. Use the individual step functions above when you need
to inspect intermediate results or iterate on a specific stage.

In [None]:
from sme_kt_zh_collaboration_rag.feature0_baseline_rag import run_pipeline

answer = await run_pipeline(
    backend=BACKEND,
    query="What sustainability certifications do the pallets in the portfolio have?",
    reset_vs=False,
)
print(answer)

---

## Tasks & Discussion

Work through these in small groups. You don't need to do them all, pick what interests you or what matches your background.

---

### 💬 Explore & Discuss

**A. Build a test set together**
Go through the data files (run the scratch cell below to see what's there) and write down questions where the answer *is* in the documents, but also questions about products that don't exist or data that isn't there, and questions where you suspect conflicting information. Example questions can be found in the EVALUATION_qa_ground_truth.md file. You can also create new files if you think this will help to test the RAG. You can then also run the test questions through the system to evaluate the current RAG. 

**B. Evaluate the outputs: would you trust them?**
Look back at the failure mode answers in the *Probing Failure Modes* section. For each one: would you have trusted this answer without knowing it might be wrong? What would a *good* response look like? What would need to change in the system to prevent this failure?

**C. Think about your own context**
If you were to deploy this in your organisation, what documents would go into the knowledge base? What questions should it answer well? Who would use it, and what are the consequences of a wrong answer?

---

### 🔧 Code Experiments

**1. Layout quality audit on a different PDF**
Re-run the raw Markdown extraction and heading inspection from the ingestion section on a different PDF in the corpus. Are heading detection failures consistent across documents, or document-specific? How much content is dropped before the first heading?

**2. Try a different PDF parser**
Switch from the default `pymupdf4llm` engine to `markitdown` and compare the output for the same PDF:
```python
from conversational_toolkit.chunking import PDFChunker, MarkdownConverterEngine
chunks_md = PDFChunker().make_chunks(sample_pdf, engine=MarkdownConverterEngine.MARKITDOWN)
```
Are the headings and table structures preserved differently? Which parser gives cleaner chunks for this document? You can also implement a new PDF parser.

**3. Chunking strategy comparison**
Throughout this notebook we used `header_based_chunks`, even though we saw that many chunks exceed the 256-token limit of `all-MiniLM-L6-v2`. Try one of these fixes: switch to `paragraph_aware_chunks(target_chars=600)` to keep chunks within the limit, or keep `header_based_chunks` but switch to the OpenAI embedding model (`text-embedding-3-small`). Rebuild the vector store and re-run the failure mode queries. Do the retrieved chunks change? Do the answers improve?

**4. Images: what are we losing?**
The ingestion section showed that image references are zero by default. Enable extraction and inspect the results:
```python
img_chunks = PDFChunker().make_chunks(sample_pdf, write_images=True, image_path="tmp_images/")
```
Open the extracted files. Do they contain diagrams or figures a user might query? What would it take to make them searchable?

**5. Retrieval inspection**
Use `quick_retrieve()` in the scratch cell below with several different queries, both in-corpus and out-of-corpus. At roughly what L2 score do retrieved chunks stop being relevant? Could this threshold serve as an automatic "no relevant data" flag?

**6. Top-k sensitivity**
Change `top_k` from 5 to 1 and re-run one of the failure mode queries from the *Probing Failure Modes* section. Does the answer change? Try 10. Does more context improve the answer, or does irrelevant noise creep in?

**7. System prompt ablation**
Edit `SYSTEM_PROMPT` in the agent setup cell to address one of the failure modes: for example, add a rule about flagging conflicting figures or refusing to answer for out-of-portfolio products. Rebuild the agent and re-run the relevant query. Does the behaviour change?

**8. Query phrasing**
Run the same underlying question in different phrasings through `quick_retrieve()`: for example `"CO₂ footprint Logypal 1"`, `"carbon emissions recycled pallet"`, and `"GWP A1-A3 EPD pallet"`. Do the retrieved chunks differ? Which phrasing scores highest? Try a query in German or French.


In [None]:
# Scratch cell, run your experiments here
async def quick_retrieve(query: str, top_k: int = 5):
    retriever = VectorStoreRetriever(embedding_model, vector_store, top_k=top_k)
    results = await retriever.retrieve(query)
    print(f"Query: {query!r}  (top_k={top_k})")
    for r in results:
        src = r.metadata.get("source_file", "?")
        print(f"  score={r.score:.4f}  {src}  {r.title!r}")


await quick_retrieve("PFAS-free tape declaration")

---

## Summary

| Step | Function |
|---|---|
| 1. Load & chunk | `load_chunks(max_files)` |
| 2. Embed & index | `build_vector_store(chunks, emb, reset)` |
| 3. Inspect retrieval | `inspect_retrieval(query, vs, emb, top_k)` |
| 4. Build agent | `build_agent(vs, emb, llm, top_k, system_prompt, number_query_expansion)` |
| 5. Generate answer | `ask(agent, query, history)` |