# Private Swiss Rental Law — RAG Pipeline (ChromaDB + Ollama)
**Module**: DSPRO1 \
**Authors**: Kiana Kiser, Theodora Haimoff

This notebook builds a **retrieval-augmented generation (RAG)** pipeline around your PDF sources (e.g., OR Mietrecht, VMWG, StGB excerpt).
It converts PDFs → text → chunks → embeddings → **ChromaDB**. At query time, we retrieve the most relevant chunks and pass them to **Ollama** for context-aware answers.

## 1. Install dependencies

In [1]:
%pip install --upgrade pip
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## 2. Imports & Configuration

In [2]:
import os
# Chroma reads ANONYMIZED_TELEMETRY; set it before importing chromadb.
os.environ["ANONYMIZED_TELEMETRY"] = "False"
os.environ["POSTHOG_DISABLED"] = "true"

import re
import subprocess
from pathlib import Path

import fitz
import chromadb

from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

PDF_DIR = Path("./Data")
CHROMA_DIR = Path("./Chroma")
CHROMA_COLLECTION = "swiss_private_rental_law"
_CHROMA_CLIENT = None
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
OLLAMA_MODEL = "llama3:8b"
TOP_K = 5



In [3]:
pdfs = sorted(PDF_DIR.rglob("*.pdf"))
print("Found PDFs:", len(pdfs))
for pdf in pdfs:
    try:
        print(f"Processing {pdf} size={pdf.stat().st_size} bytes")
    except FileNotFoundError:
        print(f"PDF {pdf} not found")

Found PDFs: 2
Processing Data/OR.pdf size=86218 bytes
Processing Data/STGB.pdf size=31708 bytes


## 3. Helper functions

In [4]:
# Text extraction
def extract_pages_text(pdf_path: Path):
    try:
        with fitz.open(pdf_path) as doc:
            for i, page in enumerate(doc):
                # get_text("text") is the default; "blocks" sometimes helps, but simpler is fine for law PDFs
                text = page.get_text()
                yield i + 1, text
    except Exception as e:
        # Chain the original exception so the underlying traceback is preserved
        raise RuntimeError(f"Failed to extract pages from {pdf_path}: {e}") from e

def clean_text(t: str) -> str:
    t = t.replace('\x0c', ' ').replace('\u00ad', '') # form feeds, soft hyphens
    t = re.sub(r'\s+', ' ', t) # collapse whitespace
    return t.strip()

# Chunking
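def chunk_text(text: str, chunk_size=900, chunk_overlap=200):
    # Note: clean_text() collapses all whitespace to single spaces, so on cleaned
    # text only the "? ", "! " and " " separators below can actually match.
    splitter = RecursiveCharacterTextSplitter(separators=[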
        "\n\n", # paragraphs
        ".\n", "? ", "! ", # sentence ends
        "\n", # single newline
        " " # fallback
    ], chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_text(text)

# Embeddings
_embedder = None
def get_embedder(name: str = None):
    global _embedder
    if _embedder is None:
        model_name = name or EMBED_MODEL_NAME
        _embedder = SentenceTransformer(model_name)
    return _embedder

def embed_texts(texts):
    model = get_embedder()
    return model.encode(texts, show_progress_bar=True, normalize_embeddings=True).tolist()

# Chroma setup
def get_chroma_client(persist_dir_str: str) -> chromadb.PersistentClient:
    """Return a single shared PersistentClient instance for this process."""
    global _CHROMA_CLIENT
    if _CHROMA_CLIENT is None:
        _CHROMA_CLIENT = chromadb.PersistentClient(path=persist_dir_str)
    return _CHROMA_CLIENT

def get_chroma_collection(persist_dir: Path, collection_name: str):
    """Create or get a persistent collection using the shared client."""
    client = get_chroma_client(str(persist_dir))
    # get_or_create_collection avoids masking unrelated errors behind a bare except
    return client.get_or_create_collection(collection_name)

def count_collection(col):
    try:
        return col.count()
    except Exception as e:
        print("Count failed:", e)
        return None
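
To make the `chunk_size` and `chunk_overlap` parameters of `chunk_text` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to split on the configured separators, and the helper name `sliding_window_chunks` is hypothetical, not part of the notebook.

```python
def sliding_window_chunks(text: str, chunk_size: int = 900, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Small numbers so the overlap is visible by eye
sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo)
```

Each chunk repeats the last four characters of the previous one, which is the same idea as the 200-character overlap used above: text cut at a chunk boundary still appears intact in at least one chunk.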

## 4. Ingest PDFs → ChromaDB

In [5]:
def ingest_pdf_folder(pdf_dir: Path, persist_dir: Path, collection_name: str, source_tag: str = None):
    pdf_paths = sorted(pdf_dir.rglob("*.pdf"))
    if not pdf_paths:
        print(f"No pdf files found in {pdf_dir}. Check the directory and try again.")
        return None

    collection = get_chroma_collection(persist_dir=persist_dir, collection_name=collection_name)
    print(f"Collection: {collection_name} | existing docs: {count_collection(collection)}\n")

    get_embedder()  # load the embedding model once, before the page loop starts

    ids_batch, docs_batch, metas_batch, embeds_batch = [], [], [], []
    batch_size = 64
    doc_counter = count_collection(collection) or 0

    for pdf in pdf_paths:
        print(f"Processing {pdf.name}...")
        for page_no, page_text in tqdm(extract_pages_text(pdf), total=None):
            page_text = clean_text(page_text)
            if not page_text:
                continue
            chunks = chunk_text(page_text)
            if not chunks:
                continue

            # metadata
            for ch in chunks:
                ids_batch.append(f"{pdf.stem}-p{page_no}-{doc_counter}")
                docs_batch.append(ch)
                metas_batch.append({
                    "source": pdf.name,
                    "path": str(pdf.resolve()),
                    "page": page_no,
                })
                doc_counter += 1

                if len(ids_batch) >= batch_size:
                    embeds_batch = embed_texts(docs_batch)
                    collection.add(ids=ids_batch, documents=docs_batch, metadatas=metas_batch, embeddings=embeds_batch)
                    ids_batch, docs_batch, metas_batch, embeds_batch = [], [], [], []

    if docs_batch:
        embeds_batch = embed_texts(docs_batch)
        collection.add(ids=ids_batch, documents=docs_batch, metadatas=metas_batch, embeddings=embeds_batch)

    print(f"Finished processing {len(pdf_paths)} PDFs. Collection now has {count_collection(collection)} chunks.")
    return collection

# Run ingestion
_ = ingest_pdf_folder(PDF_DIR, CHROMA_DIR, CHROMA_COLLECTION)
collection = get_chroma_collection(CHROMA_DIR, CHROMA_COLLECTION)
print("Chunks: ", collection.count())

Collection: swiss_private_rental_law | existing docs: 0

Processing OR.pdf...


12it [00:00, 223.43it/s]


Processing STGB.pdf...


1it [00:00, 238.08it/s]


Finished processing 2 PDFs. Collection now has 59 chunks.
Chunks:  59
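
The IDs above depend on a running counter that starts at the current collection size, so ingesting the same folder twice stores every chunk again under new IDs. One possible alternative, sketched under the assumption that file name, page, position on the page, and text together identify a chunk, is to derive the ID from the content. The helper `stable_chunk_id` is hypothetical and not part of the notebook, and the sample strings are illustrative only.

```python
import hashlib


def stable_chunk_id(source: str, page: int, chunk_index: int, text: str) -> str:
    """Build a deterministic ID from the chunk's origin and content."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{source}-p{page}-c{chunk_index}-{digest}"


# The same chunk yields the same ID on every run; a different chunk does not.
id_first_run = stable_chunk_id("OR.pdf", 9, 0, "Der Mieter kann die Erstreckung verlangen")
id_second_run = stable_chunk_id("OR.pdf", 9, 0, "Der Mieter kann die Erstreckung verlangen")
id_other = stable_chunk_id("OR.pdf", 9, 1, "Bei der Interessenabwägung")
print(id_first_run == id_second_run, id_first_run != id_other)
```

With IDs like these, calling `collection.upsert(...)` instead of `collection.add(...)` should let a re-run overwrite existing chunks rather than duplicate them.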


## 5. Retrieval helper

In [6]:
def retrieve(query: str, k: int = TOP_K, collection_name: str = CHROMA_COLLECTION):
    collection = get_chroma_collection(CHROMA_DIR, collection_name)
    q_emb = embed_texts([query])[0]
    res = collection.query(query_embeddings=[q_emb], n_results=k, include=['documents', 'metadatas', 'distances'])

    # Flatten first result list
    docs = res.get('documents', [[]])[0]
    metas = res.get('metadatas', [[]])[0]
    dists = res.get('distances', [[]])[0]
    return list(zip(docs, metas, dists))

# Quick test:
retrieve("What are the tenant protections against retaliatory termination?")

[('des Mietverhältnisses, Anspruch des Mieters 1 Der Mieter kann die Erstreckung eines befristeten oder unbefristeten Mietverhältnisses verlangen, wenn die Beendigung der Miete für ihn oder seine Familie eine Härte zur Folge hätte, die durch die Interessen des Vermieters nicht zu rechtfertigen wäre. 2 Bei der Interessenabwägung berücksichtigt die zuständige Behörde insbesondere: a. die Umstände des Vertragsabschlusses und den Inhalt des Vertrags; b. die Dauer des Mietverhältnisses; c. die persönlichen, familiären und wirtschaftlichen Verhältnisse der Parteien und deren Verhalten; d. einen allfälligen Eigenbedarf des Vermieters für sich, nahe Verwandte oder Verschwägerte sowie die Dringlichkeit dieses Bedarfs; e. die Verhältnisse auf dem örtlichen Markt für Wohn- und Geschäftsräume. 3 Verlangt der Mieter eine zweite Erstreckung, so berücksichtigt die zuständige Behörde auch, ob er zur',
  {'page': 9,
   'path': '/home/theodora/PycharmProjects/HSLU_HS25_DSPRO1/Data/OR.pdf',
   'source': 
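
Because `embed_texts` normalizes embeddings to unit length, the distances returned by `retrieve` can be read as cosine similarity. The notebook does not set a distance metric, so the sketch below assumes Chroma's default squared-L2 space, in which a distance `d` between unit vectors corresponds to `cos = 1 - d / 2`. Toy 2-D vectors stand in for real embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # For unit vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 1.0])
doc = normalize([1.0, 0.0])

d = squared_l2(query, doc)
cos_from_distance = 1.0 - d / 2.0
print(round(d, 4), round(cos_from_distance, 4), round(cosine(query, doc), 4))
```

Under that assumption, smaller distances mean more similar chunks, and a distance of 0 means identical direction.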

## 6. Generate answer with Ollama (context-packed)

In [7]:
def pack_context(retrieved, max_chars=8000):
    """Join top documents into a single context string with source markers, bounded by max_chars."""
    ctx = []
    total = 0
    for i, (doc, meta, dist) in enumerate(retrieved, 1):
        stamp = f"[source: {meta.get('source', '')} p.{meta.get('page')}]\n"
        block = f"{stamp}\n{doc.strip()}\n\n"
        if total + len(block) > max_chars:
            break
        ctx.append(block)
        total += len(block)
    return "".join(ctx)

def answer_with_ollama(question: str, k: int = TOP_K, model: str = OLLAMA_MODEL, max_ctx_chars: int = 8000):
    hits = retrieve(question, k=k)
    context = pack_context(hits, max_chars=max_ctx_chars)
    # Explicit newlines keep the instructions, context, and question clearly separated.
    prompt = (
        "You are a Swiss rental-law assistant. Answer in plain, simple English. "
        "Use ONLY the provided context. Be concise and practical. Format STRICTLY like this:\n"
        "1) One-sentence answer.\n"
        "2) Short numbered list of options/steps.\n"
        "3) Forms required (exact names, if present in context).\n"
        "4) Articles to read next (e.g., Art. 269, Art. 270 OR; Art. 14 VMWG).\n"
        "Then 'References:' with [filename p.X] for each distinct source used.\n"
        "If context is insufficient, say so.\n\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer: "
    )

    # Call ollama
    try:
        proc = subprocess.run(
            ["ollama", "run", model],
            input=prompt.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=False
        )
        out = proc.stdout.decode("utf-8", errors="ignore")
        err = proc.stderr.decode("utf-8", errors="ignore")
        return out, err, hits
    except FileNotFoundError:
        return "[Ollama not found. Install from https://ollama.com and ensure it's on PATH]", "", hits

# Example:
out, err, hits = answer_with_ollama("How do I contest a rent increase? What form is required?", k=6)
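# Fall back to stderr so a failed Ollama call is not silently printed as an empty answer
print(out if out.strip() else f"[No answer returned] {err.strip()}")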
hits[:2]



[('Auszug aus dem Strafgesetzbuch (StgB) Art. 325bis Widerhandlungen gegen die Bestimmungen zum Schutz der Mieter von Wohn- und Geschäftsräumen. Wer den Mieter unter Androhung von Nachteilen, insbesondere der späteren Kündigung des Mietverhältnisses, davon abhält oder abzuhalten versucht, Mietzinse oder sonstige Forderungen des Vermieters anzufechten, wer dem Mieter kündigt, weil dieser die ihm nach dem Obligationenrecht zustehenden Rechte wahrnimmt oder wahrnehmen will, wer Mietzinse oder sonstige Forderungen nach einem gescheiterten Einigungsversuch oder nach einem richterlichen Entscheid in unzulässiger Weise durchsetzt oder durchzusetzen versucht, wird auf Antrag des Mieters mit Busse bestraft. Art. 326bis Werden die im Artikel 325bis unter Strafe gestellten Handlungen beim Besorgen der Angelegenheiten einer juristischen Person, Kollektiv- oder Kommanditgesellschaft oder Einzelfirma',
  {'page': 1,
   'path': '/home/theodora/PycharmProjects/HSLU_HS25_DSPRO1/Data/STGB.pdf',
   'sour
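
As an alternative to spawning `ollama run` via `subprocess`, a locally running Ollama server can be queried over HTTP. The sketch below assumes the server listens on its default address `http://localhost:11434` and uses the `/api/generate` endpoint with streaming disabled. The function names are hypothetical, and this is not what `answer_with_ollama` above does.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_generate_request(prompt: str, model: str = "llama3:8b") -> urllib.request.Request:
    """Prepare (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3:8b", timeout: float = 120.0) -> str:
    """Send the request and return the generated text; requires a running server."""
    with urllib.request.urlopen(build_generate_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body.get("response", "")


# Building the request needs no server; only generate() performs network I/O.
req = build_generate_request("Context: ... Question: ... Answer:")
print(req.get_method(), req.full_url)
```

Compared with the subprocess call, an HTTP error or timeout surfaces as an exception instead of an empty string on stdout.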

## 7. Maintenance & utilities

In [8]:
def wipe_collection(persist_dir: Path = CHROMA_DIR, collection_name: str = CHROMA_COLLECTION):
    client = get_chroma_client(str(persist_dir))
    try:
        client.delete_collection(collection_name)
        print(f"Deleted collection {collection_name}")
    except Exception as e:
        print("Failed to delete collection:", e)

def list_collections(persist_dir: Path = CHROMA_DIR):
    client = get_chroma_client(str(persist_dir))
    return client.list_collections()

In [9]:
list_collections()

[Collection(id=ef25eb2b-2f9d-4e6b-9363-e1f40750a733, name=swiss_private_rental_law)]

In [10]:
wipe_collection()

Deleted collection swiss_private_rental_law
