# Indexing & Retrieval (JSON → ChromaDB)

**Goal:**  
Turn our per-article JSON files into a *persistent* ChromaDB index with semantic search.

**What we do here:**
1) Load all article-level JSONs (from `data/json/LAW/LAW_Art_*.json`)  
2) Embed each article using `sentence-transformers`  
3) Store vectors + rich metadata in **ChromaDB (PersistentClient)**  
4) Test retrieval (KNN) and inspect the hits for correctness

**Why this matters:**  
A clean, persistent index lets us (a) query instantly, (b) cite exact articles, and (c) add new sources later without redoing everything.


⚙️ Imports & Paths

In [1]:
import hashlib
import json
import logging
import os
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# Paths (keep consistent with Notebook 1)
DATA_JSON = Path("../data/json")
CHROMA_DIR = Path("../store")
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

# Collection name
CHROMA_COLLECTION = "swiss_private_rental_law"

# Embedding model (fast + solid)
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# Retrieval knobs
TOP_K  = 5     # final results returned
PRE_K  = 20    # prefetch for (optional) re-ranking

logging.getLogger("chromadb").setLevel(logging.ERROR)




## Design Choices

- **Per-article documents**: Each vector represents exactly one legal article (header + body).  
- **Metadata**: We store `law`, `article`, `source`, `path`. This enables citations like **[OR Art. 269d – OR.pdf]**.  
- **Persistent index**: We use `chromadb.PersistentClient` so the index survives kernel restarts.  
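- **Normalised embeddings**: All vectors are scaled to unit length, so the dot product equals cosine similarity, and Chroma's default squared-L2 distance ranks hits in the same order as cosine distance would.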
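
To see why normalisation matters here, the following is a minimal sketch in plain Python with no Chroma or embedding model involved. The vectors and the helper names `normalise`, `cosine`, and `sq_l2` are made up for illustration. It checks that for unit-length vectors the squared L2 distance equals 2 − 2·cos, so ranking by either measure gives the same order.

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two arbitrary vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sq_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors, purely illustrative
query = normalise([0.2, 0.9, 0.1])
docs = [normalise(v) for v in ([0.1, 1.0, 0.0], [1.0, 0.1, 0.3], [0.5, 0.5, 0.5])]

for d in docs:
    # For unit vectors: ||q - d||^2 = 2 - 2 * cos(q, d)
    assert math.isclose(sq_l2(query, d), 2 - 2 * cosine(query, d), abs_tol=1e-12)

# Rank the documents both ways and compare the resulting orders
by_distance = sorted(range(len(docs)), key=lambda i: sq_l2(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_distance == by_cosine)
```

Without normalisation the identity no longer holds, and long documents with large vector norms could be ranked differently by the two measures.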


🧱 Chroma helpers

In [2]:
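# Disable analytics / telemetry.
# ANONYMIZED_TELEMETRY is the variable Chroma documents for opting out; the other two are kept as before.
os.environ["ANONYMIZED_TELEMETRY"] = "False"
os.environ["CHROMA_TELEMETRY_ENABLED"] = "false"
os.environ["POSTHOG_DISABLED"] = "true"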

def get_client():
    return chromadb.PersistentClient(path=str(CHROMA_DIR))

def get_collection(client=None, name=CHROMA_COLLECTION):
    # get_or_create_collection avoids hiding unrelated errors behind a broad except
    client = client or get_client()
    return client.get_or_create_collection(name)

def list_collections():
    client = get_client()
    return client.list_collections()

def wipe_collection(name=CHROMA_COLLECTION):
    client = get_client()
    try:
        client.delete_collection(name)
        print(f"Deleted collection: {name}")
    except Exception as e:
        print("Delete failed:", e)

🧠 Embedder init

In [3]:
_embedder = None

def embedder():
    global _embedder
    if _embedder is None:
        _embedder = SentenceTransformer(EMBED_MODEL_NAME)
    return _embedder


📥 Load JSON files

In [4]:
def load_article_jsons(root: Path = DATA_JSON):
    files = sorted(root.rglob("*.json"))  # recursively loads all JSONs under all subfolders
    items = []
    for fp in files:
        try:
            data = json.loads(fp.read_text(encoding="utf-8"))
            doc_text = f"{data.get('header','')}\n{data.get('text','')}".strip()
            if len(doc_text) < 50:
                continue
            items.append({
                "id": hashlib.md5(fp.as_posix().encode("utf-8")).hexdigest()[:16],
                "text": doc_text,
                "meta": {
                    "source": data.get("source"),
                    "law": data.get("law"),
                    "article": data.get("article"),
                    "path": fp.as_posix()
                }
            })
        except Exception as e:
            print("Skip", fp, "→", e)
    return items

articles = load_article_jsons()
print("Found", len(articles), "articles.")
if articles:
    print("Example:", articles[0]["meta"])

Found 118 articles.
Example: {'source': 'OR.pdf', 'law': 'OR', 'article': '253', 'path': '../data/json/OR/OR_Art_253.json'}
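
For reference, the loader above relies on only five keys per file: `header`, `text`, `source`, `law`, and `article`. The sketch below writes one such file to a temporary directory and derives its ID the same way `load_article_jsons` does. The sample content and the `make_id` helper name are illustrative assumptions, not taken from the real dataset.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def make_id(fp: Path) -> str:
    """First 16 hex characters of the MD5 of the POSIX-style path (hypothetical helper)."""
    return hashlib.md5(fp.as_posix().encode("utf-8")).hexdigest()[:16]

# Placeholder record; real files come from Notebook 1
record = {
    "header": "Art. 000 Placeholder header",
    "text": "Placeholder body text that is long enough to pass the 50-character filter.",
    "source": "EXAMPLE.pdf",
    "law": "EXAMPLE",
    "article": "000",
}

with tempfile.TemporaryDirectory() as tmp:
    fp = Path(tmp) / "EXAMPLE" / "EXAMPLE_Art_000.json"
    fp.parent.mkdir(parents=True)
    fp.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")

    # Same steps as the loader: read, join header and body, hash the path
    data = json.loads(fp.read_text(encoding="utf-8"))
    doc_text = f"{data.get('header', '')}\n{data.get('text', '')}".strip()
    doc_id = make_id(fp)
    same_id_again = make_id(fp)

print(len(doc_id), doc_id == same_id_again, len(doc_text) >= 50)
```

Because the ID is derived from the path, moving or renaming a file gives it a new ID, and the entry stored under the old ID stays in the collection until it is deleted.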


## Build / Update the Index

We’ll:
- batch-embed the articles,
- add them to a persistent collection,
- print counts to confirm.

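> Re-running is safe: each ID is the MD5 hash of the file path (truncated to 16 hex characters), so re-indexing the same file reuses its ID instead of creating a duplicate entry.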
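
The difference between skipping and overwriting existing IDs can be illustrated without Chroma. The sketch below uses a plain dictionary as a stand-in for a collection. `ToyCollection`, `add_if_missing`, and `upsert` are hypothetical names chosen for illustration and do not reproduce Chroma's implementation.

```python
class ToyCollection:
    """Dictionary-backed stand-in for a vector store, keyed by document ID."""

    def __init__(self):
        self.docs = {}

    def add_if_missing(self, ids, documents):
        # Insert only IDs that are not stored yet; existing entries stay as they are
        for i, d in zip(ids, documents):
            self.docs.setdefault(i, d)

    def upsert(self, ids, documents):
        # Insert new IDs and overwrite existing ones
        for i, d in zip(ids, documents):
            self.docs[i] = d

    def count(self):
        return len(self.docs)

col = ToyCollection()
col.add_if_missing(["a1", "a2"], ["first version", "other article"])
col.add_if_missing(["a1", "a2"], ["first version", "other article"])  # simulated re-run
count_after_rerun = col.count()

col.add_if_missing(["a1"], ["edited version"])  # the edit is ignored
text_after_add = col.docs["a1"]
col.upsert(["a1"], ["edited version"])          # the edit is applied
text_after_upsert = col.docs["a1"]

print(count_after_rerun, text_after_add, text_after_upsert)
```

Both strategies keep the count stable across re-runs. Only the overwrite variant picks up changes to an article whose file path, and therefore ID, is unchanged.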


🏗️ Build/Update index

In [5]:
def build_index(items, batch_size=64):
    client = get_client()
    col = get_collection(client)
    print("Collection:", CHROMA_COLLECTION, "| existing docs:", col.count())

    ids, docs, metas = [], [], []
    model = embedder()

    for it in tqdm(items, desc="Indexing"):
        ids.append(it["id"])
        docs.append(it["text"])
        metas.append(it["meta"])

        if len(ids) >= batch_size:
            embs = model.encode(docs, show_progress_bar=False, normalize_embeddings=True).tolist()
            # upsert overwrites entries whose ID already exists, so edited articles
            # are refreshed on re-runs; add does not update existing IDs.
            col.upsert(ids=ids, documents=docs, metadatas=metas, embeddings=embs)
            ids, docs, metas = [], [], []

    # Flush the final, partially filled batch
    if ids:
        embs = model.encode(docs, show_progress_bar=False, normalize_embeddings=True).tolist()
        col.upsert(ids=ids, documents=docs, metadatas=metas, embeddings=embs)

    print("Done. Chunks in collection:", col.count())
    return col

collection = build_index(articles)


Collection: swiss_private_rental_law | existing docs: 118


Indexing: 100%|██████████| 118/118 [00:00<00:00, 266.18it/s]

Done. Chunks in collection: 118





## Retrieval Helpers

- `retrieve(query, k, k_pre)`: embeds the query, fetches the `k_pre` nearest neighbours from Chroma, re-ranks them with a cross-encoder when one can be loaded, and returns the top `k`.  
- `pack_context(...)`: formats retrieved docs for readability and later prompting.
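
The two-stage pattern behind `retrieve` can be sketched without any model: take the `k_pre` closest candidates by distance, re-score them with a more expensive function, and keep the best `k`. In the sketch below, the candidate list and the `toy_score` function (simple word overlap) are invented stand-ins for the Chroma query result and the cross-encoder.

```python
def two_stage(query, candidates, score_fn, k=2, k_pre=4):
    """candidates: list of (doc, distance). Returns the top-k docs after re-ranking."""
    # Stage 1: cheap prefilter, smaller distance = closer
    prelim = sorted(candidates, key=lambda c: c[1])[:k_pre]
    # Stage 2: re-score the shortlist, larger score = more relevant
    rescored = sorted(prelim, key=lambda c: score_fn(query, c[0]), reverse=True)
    return [doc for doc, _ in rescored[:k]]

def toy_score(query, doc):
    """Stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Invented candidates with invented distances
candidates = [
    ("rent increase must use the official form", 0.90),
    ("defects in the rented flat", 0.70),
    ("contesting a rent increase", 0.95),
    ("termination during conciliation", 1.20),
    ("parking rules", 1.50),
]

top = two_stage("how to contest a rent increase", candidates, toy_score)
print(top)
```

The candidate with the smallest distance does not survive the second stage here, which is the same effect visible whenever re-ranked hits are not listed in ascending distance order.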


🧰 Retrieve & (optional) Re-rank

In [6]:
_reranker = None

def reranker():
    # Load the cross-encoder once instead of on every query
    global _reranker
    if _reranker is None:
        from sentence_transformers import CrossEncoder
        _reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    return _reranker

def retrieve(query: str, k: int = TOP_K, k_pre: int = PRE_K, collection_name: str = CHROMA_COLLECTION):
    col = get_collection(name=collection_name)  # honour the collection_name argument
    q_emb = embedder().encode([query], normalize_embeddings=True).tolist()[0]
    # Prefetch at least k candidates so the final slice can always be filled
    res = col.query(query_embeddings=[q_emb], n_results=max(k, k_pre), include=['documents','metadatas','distances'])

    docs  = res.get('documents', [[]])[0]
    metas = res.get('metadatas', [[]])[0]
    dists = res.get('distances', [[]])[0]
    prelim = list(zip(docs, metas, dists))
    if not prelim:
        return []

    # Cross-encoder re-rank: runs automatically whenever the model can be loaded
    try:
        scores = reranker().predict([(query, d) for d,_,_ in prelim]).tolist()
        prelim = [p for p,_ in sorted(zip(prelim, scores), key=lambda x: x[1], reverse=True)]
    except Exception:
        # Fallback: sort by distance asc (smaller = closer)
        prelim = sorted(prelim, key=lambda x: x[2])

    return prelim[:k]

def pack_context(retrieved, max_chars=8000, per_source_cap=3):
    ctx, total, seen = [], 0, {}
    for doc, meta, dist in retrieved:
        key = (meta.get("law"), meta.get("article"))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > per_source_cap:
            continue
        stamp = f"[{meta.get('law','?')} Art.{meta.get('article','?')} – {meta.get('source')}]"
        block = f"{stamp}\n{doc.strip()}\n\n"
        if total + len(block) > max_chars:
            break
        ctx.append(block)
        total += len(block)
    return "".join(ctx)


## Quick Tests

We try a few canonical questions to verify that:
- the right laws show up (OR / VMWG / StGB),
- the retrieved articles look relevant,
- metadata is present for citations.
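
One way to make these checks repeatable is to assert that an expected article appears among the hits and that every hit carries the citation fields. The sketch below is an assumption about how such a check could look: `contains_expected` and `missing_metadata` are hypothetical helpers, and `sample_hits` mimics the `(doc, meta, dist)` tuples that `retrieve` returns, using labels and distances from the first output listing in this notebook.

```python
def contains_expected(hits, law, article):
    """True if any hit carries the given law/article pair in its metadata."""
    return any(m.get("law") == law and m.get("article") == article for _, m, _ in hits)

def missing_metadata(hits, required=("law", "article", "source")):
    """Return indices of hits that lack one of the fields needed for a citation."""
    return [i for i, (_, m, _) in enumerate(hits) if any(not m.get(f) for f in required)]

# Hand-written sample in the shape retrieve() returns; document text shortened
sample_hits = [
    ("Art. 269d ...", {"law": "OR", "article": "269d", "source": "OR.pdf"}, 0.895),
    ("Art. 19 ...", {"law": "VMWG", "article": "19", "source": "VMWG.pdf"}, 0.834),
    ("Art. 270b ...", {"law": "OR", "article": "270b", "source": "OR.pdf"}, 0.942),
]

found = contains_expected(sample_hits, "OR", "269d")
not_found = contains_expected(sample_hits, "StGB", "1")
gaps = missing_metadata(sample_hits)
print(found, not_found, gaps)
```

In the notebook, `sample_hits` would be replaced by the result of `retrieve(q)` for each test question, paired with the article that question is expected to surface.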


In [7]:
queries = [
    "Wie fechte ich eine Mietzinserhöhung an? Welches Formular ist nötig?",
    "Welche Rechte habe ich bei Mängeln in der Wohnung?",
    "Ist eine Kündigung während eines laufenden Schlichtungsverfahrens zulässig?",
]

for q in queries:
    print("Q:", q)
    hits = retrieve(q, k=5)
    for i, (doc, meta, dist) in enumerate(hits, 1):
        print(f"  {i}. [{meta.get('law')} Art.{meta.get('article')}] {meta.get('source')}  dist={dist:.3f}")
    print()


Q: Wie fechte ich eine Mietzinserhöhung an? Welches Formular ist nötig?
  1. [OR Art.269d] OR.pdf  dist=0.895
  2. [VMWG Art.19] VMWG.pdf  dist=0.834
  3. [OR Art.270b] OR.pdf  dist=0.942
  4. [OR Art.270c] OR.pdf  dist=0.700
  5. [VMWG Art.17] VMWG.pdf  dist=1.082

Q: Welche Rechte habe ich bei Mängeln in der Wohnung?
  1. [OR Art.258] OR.pdf  dist=1.235
  2. [OR Art.259a] OR.pdf  dist=1.374
  3. [OR Art.267a] OR.pdf  dist=1.191
  4. [OR Art.259b] OR.pdf  dist=1.034
  5. [OR Art.259] OR.pdf  dist=0.930

Q: Ist eine Kündigung während eines laufenden Schlichtungsverfahrens zulässig?
  1. [OR Art.270e] OR.pdf  dist=1.212
  2. [OR Art.274g] OR.pdf  dist=1.238
  3. [OR Art.271a] OR.pdf  dist=1.074
  4. [OR Art.274e] OR.pdf  dist=0.916
  5. [VMWG Art.21] VMWG.pdf  dist=1.202



👀  Inspect one context block

In [8]:
sample_q = "Wie fechte ich eine Mietzinserhöhung an? Welches Formular ist nötig?"
hits = retrieve(sample_q, k=6)
ctx = pack_context(hits, max_chars=3000)
print(ctx[:1500])


[OR Art.269d – OR.pdf]
Art. 269d, Mietzinserhöhungen und andere einseitige Vertragsänderungen durch den Vermieter
1 Der Vermieter kann den Mietzins jederzeit auf den nächstmöglichen Kündigungstermin erhöhen.
Er muss dem Mieter die Mietzinserhöhung mindestens zehn Tage vor Beginn der Kündigungsfrist
auf einem vom Kanton genehmigten Formular mitteilen und begründen.
2 Die Mietzinserhöhung ist nichtig, wenn der Vermieter:
a. sie nicht mit dem vorgeschriebenen Formular mitteilt;
b. sie nicht begründet;
c. mit der Mitteilung die Kündigung androht oder ausspricht.
3 Die Absätze 1 und 2 gelten auch, wenn der Vermieter beabsichtigt, sonstwie den Mietvertrag
einseitig zu Lasten des Mieters zu ändern, namentlich seine bisherigen Leistungen zu vermindern
oder neue Nebenkosten einzuführen.

[VMWG Art.19 – VMWG.pdf]
Art. 19 Formular zur Mitteilung von Mietzinserhöhungen und anderen einseitigen
Vertragsänderungen
(Art. 269d OR)
1 Das Formular für die Mitteilung von Mietzinserhöhungen und anderen ein

# ✅ Summary

- We built a **persistent ChromaDB index** from per-article JSONs.  
- Retrieval returns focused legal articles with clean metadata for citations.  
- Cross-encoder re-ranking is wired in and applies automatically when the model can be loaded; otherwise hits fall back to distance order.

**Next:** `3_Answering_and_Evaluation.ipynb`  
We will:
- assemble prompts,
- answer via **Ollama HTTP** or **OpenAI API**,
- enforce a strict output format (1-sentence answer, steps, forms, references),
- run a small evaluation set (sanity checks, error cases).
