# Feature Track 2: Reliable & Structured Outputs

Building a RAG system that returns text is easy; building one that returns **predictable, machine-readable data** is what makes it production-ready.

In this notebook, we move beyond free-form chat. We want our RAG pipeline to extract specific attributes, such as global warming potential (GWP) values, certification dates, and supplier names, into a strict schema that other software can consume or a database can store without manual parsing.

### The Problem: The "Wall of Text"
In the previous notebooks, the LLM often gave the correct answer but buried it in prose. A compliance dashboard or an automated verification system does not need a paragraph; it needs specific fields, such as product IDs and numeric values, extracted accurately.

### Goals for this Notebook
1.  **Schema Definition:** Describe the fields we need, and their types, as a JSON Schema.
2.  **Constrained Generation:** Force the LLM to adhere to the schema using "Structured Outputs".
3.  **Data Validation:** Implement runtime checks to ensure extracted numbers and dates are within logical bounds.
4.  **Failure Handling:** Gracefully manage cases where the required information is missing from the retrieved context.
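
Goals 3 and 4 can be illustrated without calling an LLM. The block below is a minimal sketch, not the notebook's implementation: the helper name `validate_product`, the bounds (a non-negative CO2 value, a confidence between 0 and 1), and the three sample records with made-up values are all assumptions chosen for the example. Only the field names are taken from the schema defined later in this notebook. A value that is absent from the context is reported as missing rather than guessed.

```python
from typing import Any


def validate_product(record: dict[str, Any]) -> list[str]:
    """Return the problems found in one extracted product record.

    Hypothetical helper: the bounds below are illustrative assumptions.
    """
    problems: list[str] = []

    name = record.get("product_name")
    if not isinstance(name, str) or not name.strip():
        problems.append("product_name is missing or empty")

    co2 = record.get("co2_kg_co2e")
    if co2 is None:
        # Goal 4: the context held no value, so flag it instead of guessing.
        problems.append("co2_kg_co2e not found in context")
    elif isinstance(co2, bool) or not isinstance(co2, (int, float)) or co2 < 0:
        problems.append(f"co2_kg_co2e out of bounds: {co2!r}")

    confidence = record.get("confidence_score")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not (0 <= confidence <= 1):
        problems.append(f"confidence_score out of bounds: {confidence!r}")

    return problems


# Made-up records, for illustration only.
good = {"product_name": "32-100", "co2_kg_co2e": 12.5, "confidence_score": 0.9}
missing = {"product_name": "32-101", "co2_kg_co2e": None, "confidence_score": 0.4}
bad = {"product_name": "", "co2_kg_co2e": -3, "confidence_score": 1.7}

for rec in (good, missing, bad):
    print(rec["product_name"] or "<empty>", "->", validate_product(rec) or "ok")
```

Returning a list of problems, rather than raising on the first one, lets a caller decide whether a missing value is acceptable for a given use case.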

| Notebook | Focus |
|---|---|
| Feature 0 | Working baseline prototype |
| Feature 1 | Quantitative evaluation (RAGAS) |
| **Feature Track 2 (this notebook)** | **Reliable, structured outputs** |
| Feature Track 3 | Better retrieval strategies |
| Feature Track 4 | Multi-step agent workflows |

In [11]:
import os
import pathlib
import warnings

from dotenv import load_dotenv

from conversational_toolkit.agents.base import QueryWithContext
from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.vectorstores.chromadb import ChromaDBVectorStore

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    EMBEDDING_MODEL,
    VS_PATH,
    SYSTEM_PROMPT,
    build_llm,
    build_agent,
)

warnings.filterwarnings("ignore", category=DeprecationWarning)

load_dotenv(dotenv_path="../../.env.local")

_secret_path = pathlib.Path("/secrets/OPENAI_API_KEY")
if "OPENAI_API_KEY" not in os.environ and _secret_path.exists():
    os.environ["OPENAI_API_KEY"] = _secret_path.read_text().strip()

RETRIEVER_TOP_K = 5
BACKEND = "openai"  # "ollama" or "openai"

# Fail early on an empty or misspelled backend name.
if BACKEND not in {"ollama", "openai"}:
    raise ValueError('Set BACKEND to "ollama" or "openai" before running.')

# RAG pipeline
embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
vs = ChromaDBVectorStore(db_path=str(VS_PATH))
llm = build_llm(backend=BACKEND)
agent = build_agent(
    vector_store=vs,
    embedding_model=embedding_model,
    llm=llm,
    top_k=RETRIEVER_TOP_K,
    system_prompt=SYSTEM_PROMPT,
    number_query_expansion=0,
)

print(f"Embedding model : {EMBEDDING_MODEL}")
print(f"Vector store    : {VS_PATH}")
print(f"RAG agent LLM   : {BACKEND}")
print("RAGAS judge LLM : gpt-4o-mini (OpenAI)")
print("Setup complete.")

2026-02-25 10:22:29.775 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:__init__:57 - Sentence Transformer embeddings model loaded: sentence-transformers/all-MiniLM-L6-v2 with kwargs: {}
2026-02-25 10:22:29.791 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:build_llm:137 - LLM backend: OpenAI (gpt-4o-mini)
2026-02-25 10:22:29.810 | DEBUG    | conversational_toolkit.llms.openai:__init__:63 - OpenAI LLM loaded: gpt-4o-mini; temperature: 0.3; seed: 42; tools: None; tool_choice: None; response_format: {'type': 'text'}
2026-02-25 10:22:29.811 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:build_agent:336 - RAG agent ready (top_k=5  query_expansion=0)


Embedding model : sentence-transformers/all-MiniLM-L6-v2
Vector store    : /Users/tloiseau/Documents/SDSC Projects/sme-kt-zh-collaboration-rag/backend/data_vs.db
RAG agent LLM   : openai
RAGAS judge LLM : gpt-4o-mini (OpenAI)
Setup complete.


In [12]:
query = "Does PrimePack AG offer a product called the Lara Pallet?"

response = await agent.answer(QueryWithContext(query=query, history=[]))

print(f"Query: {query}")
print("-" * 20)
print(response.content)

2026-02-25 10:22:31.633 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


Query: Does PrimePack AG offer a product called the Lara Pallet?
--------------------
Based on the provided excerpts, there is no mention of a product called the "Lara Pallet" being offered by PrimePack AG. The company currently offers three product categories: pallets, cardboard boxes, and tape, but the specific products within these categories are listed in a document called _product_overview.xlsx_, which is not provided here. 

Additionally, there is a clear statement that any product not listed in _product_overview.xlsx_ is not currently offered by PrimePack AG (source: 72b651a6-fdb7-40bc-8b3e-8cf45c2f53e1). Therefore, I cannot confirm the existence of the Lara Pallet in their portfolio.

If you need more detailed information about their specific products, I recommend checking the _product_overview.xlsx_ document directly.


In [13]:
response.sources

[ChunkMatch(title='## Portfolio Scope', content='## Portfolio Scope\n\n PrimePack AG currently offers three product categories: **pallets**, **cardboard boxes**, and **tape** . All active products and their suppliers are listed in _product_overview.xlsx_ .\n\n', mime_type='text/markdown', metadata={'title': '## Portfolio Scope', 'source_file': 'ART_product_catalog.pdf', 'chapters': '["# Product Portfolio Policy & Supplier Catalog", "## Portfolio Scope"]', 'source': 'ART_product_catalog.pdf', 'mime_type': 'text/markdown'}, id='7ca63a83-ec60-48af-ab25-307193aff6e9', embedding=[], score=0.673692524433136),
 ChunkMatch(title='### Products NOT in Our Portfolio', content='### Products NOT in Our Portfolio\n\n The following product types are **not** currently offered by PrimePack AG:\n\n●\u200b Any product from a supplier not listed in `product_overview.xlsx` ●\u200b Single-use bubble wrap or foam packaging ●\u200b Biodegradable tape products ●\u200b Compostable packaging of any kind\n\n ', m

In [7]:
# Redefining the System Prompt to explicitly request Source IDs
SYSTEM_PROMPT_WITH_SOURCES = (
    "You are a helpful AI assistant specialised in sustainability and product compliance for PrimePack AG.\n\n"
    "Each context chunk is prefixed with [ID: ...].\n"
    "Rules:\n"
    "- Use only the provided excerpts.\n"
    "- For every fact, include the ID in parentheses, e.g., (ID: 123).\n"
    "- List all unique IDs in a 'Sources' section at the end.\n"
    "- If unknown, say so clearly."
)

# Re-building the agent with the updated prompt
agent = build_agent(
    vector_store=vs,
    embedding_model=embedding_model,
    llm=llm,
    top_k=RETRIEVER_TOP_K,
    system_prompt=SYSTEM_PROMPT_WITH_SOURCES,
    number_query_expansion=0,
)

# Test query
query = "Which products have a third-party verified EPD?"
response = await agent.answer(QueryWithContext(query=query, history=[]))
print(response.content)

2026-02-25 10:04:46.499 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:build_agent:336 - RAG agent ready (top_k=5  query_expansion=0)
2026-02-25 10:04:46.577 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


The products that have a third-party verified Environmental Product Declaration (EPD) are as follows:

1. **IPG** - Products 50-100, 50-101 (Compliant, EPDs on file) (ID: b26e7862-2871-446d-a14f-1ed48762b15b).
2. **CPR System** - Product 32-100 (Compliant, EPD on file) (ID: b26e7862-2871-446d-a14f-1ed48762b15b).
3. **Relicyc** - Product 32-103 (Compliant, EPD on file) (ID: b26e7862-2871-446d-a14f-1ed48762b15b).
4. **StabilPlastik** - Product 32-105 (Compliant, EPD on file) (ID: b26e7862-2871-446d-a14f-1ed48762b15b).
5. **Redbox** - Product 11-100 (Compliant, EPD on file) (ID: b26e7862-2871-446d-a14f-1ed48762b15b).
6. **Grupak** - Product 11-101 (Compliant, EPD on file) (ID: b26e7862-2871-446d-a14f-1ed48762b15b).

The other products mentioned either do not have a third-party verified EPD or are in progress towards obtaining one.

### Sources
- (ID: b26e7862-2871-446d-a14f-1ed48762b15b)
- (ID: 6b0e32bd-9379-47d4-a06c-2b1d55e17a67)
- (ID: 8b0ff64d-9fa9-4650-b116-170ab702c7ab)
- (ID: eedf4
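
The prose answer above does carry its source IDs, but downstream code has to recover them by pattern matching. The block below is a minimal sketch of that step, assuming the model follows the `(ID: ...)` convention from the system prompt exactly; `ID_PATTERN` and `extract_cited_ids` are hypothetical names, and `answer` is a shortened excerpt of the output shown above.

```python
import re

# Matches "(ID: <36-character UUID>)", the format requested in the system prompt.
ID_PATTERN = re.compile(r"\(ID:\s*([0-9a-fA-F-]{36})\)")

# Shortened excerpt of the prose answer printed above.
answer = (
    "1. **IPG** - Products 50-100, 50-101 (Compliant, EPDs on file) "
    "(ID: b26e7862-2871-446d-a14f-1ed48762b15b).\n"
    "### Sources\n"
    "- (ID: b26e7862-2871-446d-a14f-1ed48762b15b)\n"
    "- (ID: 6b0e32bd-9379-47d4-a06c-2b1d55e17a67)\n"
)


def extract_cited_ids(text: str) -> list[str]:
    """Return the cited chunk IDs in first-seen order, without duplicates."""
    return list(dict.fromkeys(ID_PATTERN.findall(text)))


print(extract_cited_ids(answer))
```

If the model drifts to another format, for example square brackets or a missing closing parenthesis, the pattern silently returns fewer IDs. That fragility is the motivation for requesting the IDs as a dedicated field in a schema.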

In [15]:
import json

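# JSON Schema for the structured answer. Every field is listed in "required";
# a value that may be absent (co2_kg_co2e) is typed as number-or-null instead.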
compliance_schema = {
    "name": "ProductCompliance",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "product_name": {"type": "string"},
                        "has_verified_epd": {"type": "boolean"},
                        "co2_kg_co2e": {"type": ["number", "null"]},
                        "confidence_score": {"type": "number"},
                    },
                    "required": [
                        "product_name",
                        "has_verified_epd",
                        "co2_kg_co2e",
                        "confidence_score",
                    ],
                    "additionalProperties": False,
                },
            },
            "referenced_ids": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["products", "referenced_ids"],
        "additionalProperties": False,
    },
}

# Configure the LLM with the native response_format
structured_llm = build_llm(
    backend=BACKEND,
    response_format={"type": "json_schema", "json_schema": compliance_schema},
)


# Rebuild the agent so that it uses the schema-constrained LLM

agent = build_agent(
    vector_store=vs,
    embedding_model=embedding_model,
    llm=structured_llm,
    top_k=RETRIEVER_TOP_K,
    system_prompt=SYSTEM_PROMPT_WITH_SOURCES,
    number_query_expansion=0,
)

query = "Which products have a third-party verified EPD?"
response = await agent.answer(QueryWithContext(query=query, history=[]))
print(response.content)

2026-02-25 10:25:11.789 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:build_llm:137 - LLM backend: OpenAI (gpt-4o-mini)
2026-02-25 10:25:11.815 | DEBUG    | conversational_toolkit.llms.openai:__init__:63 - OpenAI LLM loaded: gpt-4o-mini; temperature: 0.3; seed: 42; tools: None; tool_choice: None; response_format: {'type': 'json_schema', 'json_schema': {'name': 'ProductCompliance', 'strict': True, 'schema': {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {'type': 'object', 'properties': {'product_name': {'type': 'string'}, 'has_verified_epd': {'type': 'boolean'}, 'co2_kg_co2e': {'type': ['number', 'null']}, 'confidence_score': {'type': 'number'}}, 'required': ['product_name', 'has_verified_epd', 'co2_kg_co2e', 'confidence_score'], 'additionalProperties': False}}, 'referenced_ids': {'type': 'array', 'items': {'type': 'string'}}}, 'required': ['products', 'referenced_ids'], 'additionalProperties': False}}}
2026-02-25 10:25:11.816 | INFO     | sme_

{"products":[{"product_name":"50-100","has_verified_epd":true,"co2_kg_co2e":null,"confidence_score":1},{"product_name":"50-101","has_verified_epd":true,"co2_kg_co2e":null,"confidence_score":1},{"product_name":"32-100","has_verified_epd":true,"co2_kg_co2e":null,"confidence_score":1},{"product_name":"32-103","has_verified_epd":true,"co2_kg_co2e":null,"confidence_score":1},{"product_name":"32-105","has_verified_epd":true,"co2_kg_co2e":null,"confidence_score":1},{"product_name":"11-100","has_verified_epd":true,"co2_kg_co2e":null,"confidence_score":1},{"product_name":"11-101","has_verified_epd":true,"co2_kg_co2e":null,"confidence_score":1}],"referenced_ids":["b26e7862-2871-446d-a14f-1ed48762b15b","6b0e32bd-9379-47d4-a06c-2b1d55e17a67","8b0ff64d-9fa9-4650-b116-170ab702c7ab"]}
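
The schema in the cell above lists every property under `required` and sets `additionalProperties` to `False` on each object. As far as we know, OpenAI's strict mode expects both, which is why the optional CO2 value is modelled as `["number", "null"]` rather than left out of `required`. The block below is a small sketch of a local check for these two rules only; `strict_schema_issues` is a hypothetical helper, and `example_schema` is a reduced copy of the notebook's schema with one deliberate mistake.

```python
from typing import Any


def strict_schema_issues(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Recursively collect object schemas that break the two rules above.

    Hypothetical lint helper; it does not cover any other schema constraint.
    """
    issues: list[str] = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            issues.append(f"{path}: additionalProperties must be False")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            issues.append(f"{path}: not listed in required: {missing}")
        for name, sub in props.items():
            issues.extend(strict_schema_issues(sub, f"{path}.{name}"))
    elif schema.get("type") == "array":
        issues.extend(strict_schema_issues(schema.get("items", {}), f"{path}[]"))
    return issues


# Reduced copy of the notebook's schema with one deliberate mistake:
# "co2_kg_co2e" is left out of "required".
example_schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "co2_kg_co2e": {"type": ["number", "null"]},
                },
                "required": ["product_name"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["products"],
    "additionalProperties": False,
}

print(strict_schema_issues(example_schema))
```

In the notebook, `strict_schema_issues(compliance_schema["schema"])` should return an empty list, since that schema already follows both rules.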


## Filtering the sources

In [16]:
used_sources = json.loads(response.content)["referenced_ids"]
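
The schema guarantees that `referenced_ids` is a list of strings, but not that each string is the ID of a chunk that was actually retrieved. The block below is a minimal sketch of a check for that gap. `Chunk` is a stand-in for the retriever's `ChunkMatch` (only its `id` attribute is used), `split_referenced_ids` is a hypothetical helper, and the IDs are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for the retriever's ChunkMatch; only `id` is used here."""

    id: str


def split_referenced_ids(
    referenced_ids: list[str], retrieved: list[Chunk]
) -> tuple[list[str], list[str]]:
    """Split model-reported IDs into those that match a retrieved chunk and those that do not."""
    retrieved_ids = {chunk.id for chunk in retrieved}
    known = [i for i in referenced_ids if i in retrieved_ids]
    unknown = [i for i in referenced_ids if i not in retrieved_ids]
    return known, unknown


# Placeholder IDs, for illustration only.
retrieved = [Chunk("chunk-a"), Chunk("chunk-b")]
known, unknown = split_referenced_ids(["chunk-a", "chunk-z"], retrieved)
print("known:", known)
print("unknown:", unknown)
```

A non-empty `unknown` list is worth logging: it means the model cited an ID that does not correspond to any chunk it was given.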

In [20]:
# Keep only the retrieved chunks that the model reported using
final_sources = [s for s in response.sources if s.id in used_sources]
final_sources

[ChunkMatch(title='### 3.2 Environmental Product Declarations (EPD)', content='### 3.2 Environmental Product Declarations (EPD)\n\n By **31 December 2025**, all tier-1 suppliers must provide, for each product they supply:\n\nA.\u200b A valid third-party verified EPD conforming to ISO 14025 and the applicable Product\n\nCategory Rules (PCR), **or** B.\u200b A signed commitment letter with a credible EPD delivery schedule and an interim\n\nself-declared LCA study following ISO 14044.\n\nCompliance status (January 2025):\n\n|Supplier|Product(s)|Status| |---|---|---| |IPG|50-100, 50-101|Compliant, EPDs on file| |CPR System|32-100|Compliant, EPD on file| |CPR System|32-101, 32-102|Non-compliant, no EPD;<br>internal calculation only| |Relicyc|32-103|Compliant, EPD on file| |Relicyc|32-104 (LogyLight)|In progress, LCA<br>commissioned; EPD<br>expected Q2 2025| |StabilPlastik|32-105|Compliant, EPD on file| |Redbox|11-100|Compliant, EPD on file| |Grupak|11-101|Compliant, EPD on file| |Tesa SE|50