# Indexing & Retrieval (JSON → ChromaDB)

**Goal:**  
Turn our per-article JSON files into a *persistent* ChromaDB index with semantic search.

**What we do here:**
1) Load all article-level JSONs (from `data/json/LAW/LAW_Art_*.json`)  
2) Embed each article with OpenAI's `text-embedding-3-small` model  
3) Store vectors + rich metadata in **ChromaDB (PersistentClient)**  
4) Test retrieval (KNN) and inspect the hits for correctness

**Why this matters:**  
A clean, persistent index lets us (a) query instantly, (b) cite exact articles, and (c) add new sources later without redoing everything.


## ⚙️ Imports & Paths

### Imports

In [2]:
import os, time, json, hashlib
from pathlib import Path
from typing import List

import chromadb, logging
from chromadb.config import Settings
from openai import OpenAI
from tqdm import tqdm

### Path Helper Function

In [None]:
def get_base_dir() -> Path:
    """
    Returns the project base directory that works both:
    - in normal scripts (via __file__)
    - in notebooks (via current working directory)
    """
    try:
        return Path(__file__).resolve().parent
    except NameError:
        # __file__ not defined (e.g., in Jupyter or interactive)
        return Path(os.getcwd()).resolve()

### Paths & Variables

In [None]:
BASE_DIR = get_base_dir()
STORE_DIR = (BASE_DIR.parent / "store").resolve()
STORE_DIR.mkdir(parents=True, exist_ok=True)
DATA_JSON = (BASE_DIR.parent / "data/json").resolve()

MODEL = "text-embedding-3-small"
EXPECTED_DIM = 1536

# Version tag (use a fixed one when releasing)
VERSION = time.strftime("v%Y-%m-%d")
COLLECTION = f"swiss_private_rental_law_oai_{VERSION}"
VERSION_DIR = f"{VERSION}_{MODEL}_{EXPECTED_DIM}"
TARGET_DIR = STORE_DIR / VERSION_DIR

CHROMA_SETTINGS = Settings(anonymized_telemetry=False, allow_reset=True)

# Retrieval knobs
TOP_K  = 5     # final results returned
PRE_K  = 20    # prefetch for (optional) re-ranking

logging.getLogger("chromadb").setLevel(logging.ERROR)
os.environ["CHROMA_TELEMETRY_ENABLED"] = "false"
os.environ["POSTHOG_DISABLED"] = "true"
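
Because `VERSION` is derived from today's date, re-running the notebook on a later day points at a new, empty collection. One way to pin the tag for a release is sketched below. The `INDEX_VERSION` environment variable and the `resolve_version` helper are assumptions made for illustration; they are not part of the project.

```python
import os
import time
from typing import Mapping, Optional


def resolve_version(env: Optional[Mapping[str, str]] = None) -> str:
    """Return a pinned version tag if one is set, else today's date tag.

    INDEX_VERSION is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    pinned = env.get("INDEX_VERSION", "").strip()
    return pinned or time.strftime("v%Y-%m-%d")


# A pinned tag wins over the date-based default.
print(resolve_version({"INDEX_VERSION": "v2025-11-07"}))
```

With a helper like this, `VERSION = resolve_version()` would keep the date-based behaviour during development and a stable collection name once the variable is set.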

## Initialise OpenAI

In [3]:
try:
    import tomllib  # Python ≥3.11
except ModuleNotFoundError:
    import tomli as tomllib

def load_oai_token() -> str:
    """
    Loads the OpenAI API token from:
    1) streamlit.secrets (if available)
    2) .streamlit/secrets.toml (searched from cwd upwards)
    3) Environment variables (OAI_TOKEN / OPENAI_API_KEY)
    Works in both notebooks and Streamlit apps.
    """
    # --- 1) Try Streamlit secrets ---
    try:
        import streamlit as st
        token = dict(st.secrets).get("env", {}).get("OAI_TOKEN")
        if token:
            return token
    except Exception:
        pass

    # --- 2) Try loading secrets.toml manually ---
    # In Jupyter, we don't have __file__, so we start from cwd.
    cwd = Path.cwd()
    candidates = [
        cwd / ".streamlit" / "secrets.toml",
        cwd.parent / ".streamlit" / "secrets.toml",
        cwd.parent.parent / ".streamlit" / "secrets.toml",
    ]

    for p in candidates:
        if p.exists():
            try:
                with p.open("rb") as f:
                    data = tomllib.load(f)
                token = data.get("env", {}).get("OAI_TOKEN")
                if token:
                    return token
            except Exception:
                pass

    # --- 3) Fallback to environment vars ---
    token = os.getenv("OAI_TOKEN") or os.getenv("OPENAI_API_KEY") or ""
    return token

✅ OpenAI token loaded: sk-s…QVsA


In [None]:
def mask(t: str) -> str:
    return t[:4] + "…" + t[-4:] if t and len(t) > 12 else "(unset)"

OAI_TOKEN = load_oai_token()
if not OAI_TOKEN:
    raise EnvironmentError(
        "OpenAI key not found. Put it in `.streamlit/secrets.toml` under [env].OAI_TOKEN "
        "or set OAI_TOKEN/OPENAI_API_KEY in your environment."
    )

print("✅ OpenAI token loaded:", mask(OAI_TOKEN))

OAI = OpenAI(api_key=OAI_TOKEN)
EMBED_MODEL_NAME = MODEL  # reuse the model name defined under "Paths & Variables"
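
The docstring of `load_oai_token` describes the `secrets.toml` lookup as "searched from cwd upwards", while the candidate list covers the current directory and its two nearest parents. If a deeper search is wanted, a sketch along these lines would walk every parent. The `find_upwards` helper is hypothetical and only illustrates the idea.

```python
from pathlib import Path
from typing import Optional


def find_upwards(start: Path, relative: str = ".streamlit/secrets.toml") -> Optional[Path]:
    """Return the first existing `folder/relative`, checking `start` and then each parent."""
    start = start.resolve()
    for folder in (start, *start.parents):
        candidate = folder / relative
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_upwards(Path.cwd())` returns the nearest matching file, or `None` when no ancestor directory contains one.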

## 🧱 Chroma helpers

In [4]:
def get_client_for_target():
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    return chromadb.PersistentClient(path=str(TARGET_DIR), settings=CHROMA_SETTINGS)

def get_collection(client=None):
    client = client or get_client_for_target()
    return client.get_or_create_collection(COLLECTION)

## 🧠 Embedder init

In [5]:
def embed_batch(texts: List[str], *, model: str = EMBED_MODEL_NAME, retries: int = 5) -> List[List[float]]:
    """
    Embed a batch of texts with OpenAI, with basic retries on 429/5xx.
    Hard-fail on 401 (bad/missing key).
    """
    delay = 1.0
    for attempt in range(retries):
        try:
            resp = OAI.embeddings.create(model=model, input=texts)
            return [d.embedding for d in resp.data]
        except Exception as e:
            # Inspect common API errors
            msg = str(e)
            if "401" in msg or "AuthenticationError" in msg:
                raise  # bad/missing key: don't retry
            if any(code in msg for code in ("429", "500", "502", "503", "504")) and attempt < retries - 1:
                time.sleep(delay)
                delay = min(delay * 2, 10)
                continue
            # Not retriable or out of retries
            raise

def embed_query(text: str) -> List[float]:
    return embed_batch([text])[0]
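
The waiting schedule inside `embed_batch` is easier to reason about when written out on its own. The sketch below reproduces the same doubling-with-cap rule as a small generator; `backoff_delays` is a hypothetical helper name, and the defaults mirror the values used above (5 attempts, 1 s start, 10 s cap).

```python
from typing import Iterator


def backoff_delays(retries: int = 5, base: float = 1.0, cap: float = 10.0) -> Iterator[float]:
    """Yield the sleep before each retry: doubles from `base`, never exceeds `cap`.

    With `retries` attempts in total there are at most `retries - 1` sleeps,
    because the final failure is re-raised instead of retried.
    """
    delay = base
    for _ in range(max(retries - 1, 0)):
        yield delay
        delay = min(delay * 2, cap)


print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0]
print(sum(backoff_delays()))   # 15.0
```

So with the default settings a persistently failing batch waits about 15 seconds in total before the exception propagates.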


## 📥 Load JSON files

In [6]:
def load_article_jsons(root: Path = DATA_JSON):
    files = sorted(root.rglob("*.json"))  # recursively loads all JSONs under all subfolders
    items = []
    for fp in files:
        try:
            data = json.loads(fp.read_text(encoding="utf-8"))
            doc_text = f"{data.get('header','')}\n{data.get('text','')}".strip()
            if len(doc_text) < 50:
                continue
            items.append({
                "id": hashlib.md5(fp.as_posix().encode("utf-8")).hexdigest()[:16],
                "text": doc_text,
                "meta": {
                    "source": data.get("source"),
                    "law": data.get("law"),
                    "title": data.get("header"),
                    "article": data.get("article"),
                    "path": fp.as_posix()
                }
            })
        except Exception as e:
            print("Skip", fp, "→", e)
    return items

articles = load_article_jsons()
print("Found", len(articles), "articles.")
if articles:
    print("Example:", articles[0]["meta"])

Found 118 articles.
Example: {'source': 'OR.pdf', 'law': 'OR', 'title': 'Art. 253, Begriff und Geltungsbereich, Begriff', 'article': '253', 'path': '/home/theodora/PycharmProjects/HSLU_HS25_DSPRO1/data/json/OR/OR_Art_253.json'}
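
For orientation, the loader expects each file to carry the keys it reads with `data.get(...)`: `source`, `law`, `article`, `header` and `text`. The sketch below writes one such file to a temporary folder and rebuilds the document text the same way. The metadata values are taken from the example output above, while the `text` value is a placeholder, not real article content.

```python
import json
import tempfile
from pathlib import Path

# Field names mirror the keys read by load_article_jsons; "text" is a placeholder.
sample = {
    "source": "OR.pdf",
    "law": "OR",
    "article": "253",
    "header": "Art. 253, Begriff und Geltungsbereich, Begriff",
    "text": "<full article text goes here>",
}

with tempfile.TemporaryDirectory() as tmp:
    fp = Path(tmp) / "OR" / "OR_Art_253.json"
    fp.parent.mkdir(parents=True)
    fp.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

    data = json.loads(fp.read_text(encoding="utf-8"))
    # Same rule as the loader: header, newline, text; skipped if shorter than 50 chars.
    doc_text = f"{data.get('header', '')}\n{data.get('text', '')}".strip()

print(len(doc_text) >= 50, sorted(data))
```

Files missing `header` or `text` still load, but may fall under the 50-character threshold and be skipped silently.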


## Build / Update the Index

We'll:
- batch-embed the articles,
- add them to a persistent collection,
- print counts to confirm.

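> Re-running is safe: `upsert` overwrites any entry whose ID already exists, and each ID is derived from the file path (the first 16 hex characters of its MD5), so no duplicates are created.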
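
The ID scheme can be checked in isolation. The sketch below repeats the hashing step from `load_article_jsons` on plain path strings; the `article_id` helper and the `archive/...` path are made up for the demonstration.

```python
import hashlib
from pathlib import PurePosixPath


def article_id(path: PurePosixPath) -> str:
    """First 16 hex characters of the MD5 of the POSIX path, as in load_article_jsons."""
    return hashlib.md5(path.as_posix().encode("utf-8")).hexdigest()[:16]


a = article_id(PurePosixPath("data/json/OR/OR_Art_253.json"))
b = article_id(PurePosixPath("data/json/OR/OR_Art_253.json"))
c = article_id(PurePosixPath("archive/json/OR/OR_Art_253.json"))

print(a == b)  # True: same path, same ID, so upsert overwrites
print(a == c)  # False: a moved file gets a new ID and would be indexed twice
```

Since the loader hashes the full path as found on disk, moving or renaming the project folder changes every ID. Hashing a stable key such as law plus article number would avoid that, at the cost of changing existing IDs.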


### 🏗️ Build/Update index

In [7]:
# Optional: wipe the collection before rebuilding (disabled by default).
# def wipe_collection(name=COLLECTION):
#     get_client_for_target().delete_collection(name)
#
# wipe_collection()
def build_index(items, batch_size=96, sleep_s=0.0):
    """
    - Batches texts, calls OpenAI embeddings
    - Upserts (ids, documents, metadatas, embeddings) into Chroma
    """
    client = get_client_for_target()
    col = get_collection(client)
    print("Collection:", COLLECTION, "| existing docs:", col.count())

    ids_buf, docs_buf, metas_buf = [], [], []

    for it in tqdm(items, desc="Indexing"):
        ids_buf.append(it["id"])
        docs_buf.append(it["text"])
        metas_buf.append(it["meta"])

        if len(ids_buf) >= batch_size:
            embs = embed_batch(docs_buf)
            col.upsert(ids=ids_buf, documents=docs_buf, metadatas=metas_buf, embeddings=embs)
            ids_buf, docs_buf, metas_buf = [], [], []
            if sleep_s:
                time.sleep(sleep_s)

    if ids_buf:
        embs = embed_batch(docs_buf)
        col.upsert(ids=ids_buf, documents=docs_buf, metadatas=metas_buf, embeddings=embs)

    print("Done. Chunks in collection:", col.count())
    return col

collection = build_index(articles)


Collection: swiss_private_rental_law_oai_v2025-11-07 | existing docs: 0


Indexing: 100%|██████████| 118/118 [00:02<00:00, 51.17it/s]


Done. Chunks in collection: 118
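
The buffer-and-flush loop in `build_index` amounts to slicing the article list into fixed-size batches. A compact way to see how many embedding requests a run makes is sketched here; `batched` is a hypothetical helper written for this illustration (Python 3.12 ships a similar `itertools.batched`).

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items; the last one may be shorter."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


sizes = [len(batch) for batch in batched(list(range(118)), 96)]
print(sizes)  # [96, 22]
```

With 118 articles and `batch_size=96`, indexing therefore takes two embedding calls: one full batch and one remainder.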


In [8]:
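def assert_collection_dim(col, expected_dim=EXPECTED_DIM):
    """Raise if the stored embeddings do not have the expected dimension."""
    peek = col.get(limit=1, include=['embeddings'])
    embs = peek.get('embeddings')
    # Check length rather than truthiness: embeddings may come back as an array.
    if embs is None or len(embs) == 0:
        return  # empty collection is fine
    dim = len(embs[0])
    if dim != expected_dim:
        # Raised outside any try/except so the mismatch is not silently swallowed.
        raise RuntimeError(
            f"Chroma collection has dim={dim}, expected {expected_dim}. "
            "Delete or point to the correct collection."
        )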

col = get_collection()
assert_collection_dim(col)

In [9]:
def write_manifest():
    mf = {
        "index_version": VERSION,
        "model": MODEL,
        "dim": EXPECTED_DIM,
        "dir": VERSION_DIR,
        "collection": COLLECTION,
    }
    (STORE_DIR / "manifest.json").write_text(json.dumps(mf, indent=2), encoding="utf-8")
    print("Wrote manifest:", mf)

write_manifest()

Wrote manifest: {'index_version': 'v2025-11-07', 'model': 'text-embedding-3-small', 'dim': 1536, 'dir': 'v2025-11-07_text-embedding-3-small_1536', 'collection': 'swiss_private_rental_law_oai_v2025-11-07'}
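
The manifest is what lets a downstream notebook or app find the right index without recomputing the date-based names. A minimal reader could look like the sketch below; `read_manifest` is a hypothetical helper, and the demo writes the manifest values printed above into a throwaway folder.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


def read_manifest(store_dir: Path) -> Tuple[Path, str]:
    """Return (index directory, collection name) as recorded in manifest.json."""
    mf = json.loads((store_dir / "manifest.json").read_text(encoding="utf-8"))
    return store_dir / mf["dir"], mf["collection"]


# Demo against a temporary directory instead of the real store.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    (store / "manifest.json").write_text(json.dumps({
        "index_version": "v2025-11-07",
        "model": "text-embedding-3-small",
        "dim": 1536,
        "dir": "v2025-11-07_text-embedding-3-small_1536",
        "collection": "swiss_private_rental_law_oai_v2025-11-07",
    }), encoding="utf-8")
    index_dir, collection_name = read_manifest(store)

print(index_dir.name, collection_name)
```

A consumer could then open the index with `chromadb.PersistentClient(path=str(index_dir))` and fetch the collection by `collection_name`.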


## Retrieval Helpers

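- `retrieve(query, k, k_pre)`: embeds the query, fetches the `k_pre` nearest neighbours from Chroma, and returns the `k` closest by distance; the larger prefetch leaves room for an optional re-ranker.  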
- `pack_context(...)`: formats retrieved docs for readability and later prompting.


### 🧰 Retrieve & (optional) Re-rank

In [10]:
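def retrieve(query: str, k: int = TOP_K, k_pre: int = PRE_K, collection_name: str = COLLECTION):
    # Honour collection_name instead of always opening the default collection.
    col = get_client_for_target().get_or_create_collection(collection_name)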
    q_emb = embed_query(query)
    res = col.query(
        query_embeddings=[q_emb],
        n_results=k_pre,
        include=['documents','metadatas','distances']
    )

    docs  = res.get('documents', [[]])[0]
    metas = res.get('metadatas', [[]])[0]
    dists = res.get('distances', [[]])[0]
    prelim = list(zip(docs, metas, dists))

    # distance ascending (smaller = closer)
    prelim = sorted(prelim, key=lambda x: x[2])
    return prelim[:k]

def pack_context(retrieved, max_chars=8000, per_source_cap=3):
    ctx, total, seen = [], 0, {}
    for doc, meta, dist in retrieved:
        key = (meta.get("law"), meta.get("article"))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > per_source_cap:
            continue
        stamp = f"[{meta.get('law','?')} {meta.get('title','?')} – {meta.get('source')}]"
        block = f"{stamp}\n{doc.strip()}\n\n"
        if total + len(block) > max_chars:
            break
        ctx.append(block)
        total += len(block)
    return "".join(ctx)
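
As written, `retrieve` prefetches `k_pre` candidates and keeps the `k` closest by distance; no separate re-ranking step runs. The sketch below shows where one could slot in. The scorer is a deliberately simple token-overlap stand-in, not a cross-encoder, and `rerank`, `overlap_score` and the toy hits are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

# (document, metadata, distance), the tuple shape returned by retrieve
Hit = Tuple[str, Dict[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query tokens that also occur in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query: str, hits: List[Hit], k: int,
           score_fn: Callable[[str, str], float] = overlap_score) -> List[Hit]:
    """Order hits by descending score, break ties by ascending distance, keep top k."""
    return sorted(hits, key=lambda h: (-score_fn(query, h[0]), h[2]))[:k]


# Toy candidates: the closest hit by distance is not the best lexical match.
hits = [
    ("Kündigung durch den Vermieter", {"article": "A"}, 0.40),
    ("Formular für die Mietzinserhöhung", {"article": "B"}, 0.55),
    ("Mietzinserhöhung und Formular", {"article": "C"}, 0.60),
]
top = rerank("formular mietzinserhöhung", hits, k=2)
print([meta["article"] for _, meta, _ in top])  # ['B', 'C']
```

Swapping `score_fn` for a learned relevance model would keep the same interface: score each prefetched candidate against the query, then cut to `k`.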


## Quick Tests

We try a few canonical questions to verify that:
- the right laws show up (OR / VMWG / StGB),
- the retrieved articles look relevant,
- metadata is present for citations.


In [11]:
queries = [
    "Wie fechte ich eine Mietzinserhöhung an? Welches Formular ist nötig?",
    "Welche Rechte habe ich bei Mängeln in der Wohnung?",
    "Ist eine Kündigung während eines laufenden Schlichtungsverfahrens zulässig?",
]

for q in queries:
    print("Q:", q)
    hits = retrieve(q, k=5)
    for i, (doc, meta, dist) in enumerate(hits, 1):
        print(f"  {i}. [{meta.get('law')} {meta.get('title')}] {meta.get('source')}  dist={dist:.3f}")
    print()


Q: Wie fechte ich eine Mietzinserhöhung an? Welches Formular ist nötig?
  1. [OR Art. 269d, Mietzinserhöhungen und andere einseitige Vertragsänderungen durch den Vermieter] OR.pdf  dist=0.582
  2. [VMWG Art. 19 Formular zur Mitteilung von Mietzinserhöhungen und anderen einseitigen] VMWG.pdf  dist=0.664
  3. [VMWG Art. 20 Begründungspflicht des Vermieters] VMWG.pdf  dist=0.706
  4. [OR Art. 270a, Während der Mietdauer] OR.pdf  dist=0.726
  5. [OR Art. 270, Anfechtung des Mietzinses, Herabsetzungsbegehren, Anfangsmietzins] OR.pdf  dist=0.736

Q: Welche Rechte habe ich bei Mängeln in der Wohnung?
  1. [OR Art. 259a, Rechte des Mieters, Im allgemeinen] OR.pdf  dist=0.580
  2. [OR Art. 259, Mängel während der Mietdauer, Pflicht des Mieters zu kleinen Reinigungen u.] OR.pdf  dist=0.751
  3. [OR Art. 267a, Prüfung der Sache und Meldung an den Mieter] OR.pdf  dist=0.753
  4. [OR Art. 257g, Meldepflicht] OR.pdf  dist=0.753
  5. [OR Art. 259b, Beseitigung des Mangels, Grundsatz] OR.pd
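
The `dist` values are easier to interpret once converted to cosine similarity. The sketch below rests on two assumptions that should be verified for this setup: that the collection uses Chroma's default squared-L2 space (no `hnsw:space` is set when it is created), and that the embedding vectors have unit length. Under those assumptions, $d = \lVert a - b \rVert^2 = 2 - 2\cos(a, b)$, so $\cos = 1 - d/2$.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def distance_to_cosine(dist: float) -> float:
    """cos = 1 - d/2; valid only for unit vectors under squared L2."""
    return 1.0 - dist / 2.0


# Two unit vectors 60 degrees apart: both routes should agree on cos = 0.5.
u = (1.0, 0.0)
v = (math.cos(math.pi / 3), math.sin(math.pi / 3))
print(round(distance_to_cosine(squared_l2(u, v)), 6), round(cosine(u, v), 6))

# Applied to the top hit of the first query (dist=0.582).
print(round(distance_to_cosine(0.582), 3))  # 0.709
```

If the assumptions hold, the top hits above sit at a cosine similarity of roughly 0.71, and the fifth-ranked ones at roughly 0.62 to 0.63.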

### 👀 Inspect one context block

In [12]:
sample_q = "Wie fechte ich eine Mietzinserhöhung an? Welches Formular ist nötig?"
hits = retrieve(sample_q, k=6)
ctx = pack_context(hits, max_chars=3000)
print(ctx[:1500])


[OR Art. 269d, Mietzinserhöhungen und andere einseitige Vertragsänderungen durch den Vermieter – OR.pdf]
Art. 269d, Mietzinserhöhungen und andere einseitige Vertragsänderungen durch den Vermieter
1 Der Vermieter kann den Mietzins jederzeit auf den nächstmöglichen Kündigungstermin erhöhen.
Er muss dem Mieter die Mietzinserhöhung mindestens zehn Tage vor Beginn der Kündigungsfrist
auf einem vom Kanton genehmigten Formular mitteilen und begründen.
2 Die Mietzinserhöhung ist nichtig, wenn der Vermieter:
a. sie nicht mit dem vorgeschriebenen Formular mitteilt;
b. sie nicht begründet;
c. mit der Mitteilung die Kündigung androht oder ausspricht.
3 Die Absätze 1 und 2 gelten auch, wenn der Vermieter beabsichtigt, sonstwie den Mietvertrag
einseitig zu Lasten des Mieters zu ändern, namentlich seine bisherigen Leistungen zu vermindern
oder neue Nebenkosten einzuführen.

[VMWG Art. 19 Formular zur Mitteilung von Mietzinserhöhungen und anderen einseitigen – VMWG.pdf]
Art. 19 F

# ✅ Summary

- We built a **persistent ChromaDB index** from per-article JSONs.  
- Retrieval returns focused legal articles with clean metadata for citations.  
- `retrieve` prefetches `PRE_K` candidates so that a cross-encoder re-ranker can be added later; none is wired in yet.

**Next:** `3_Answering_and_Evaluation.ipynb`  
We will:
- assemble prompts,
- answer via **Ollama HTTP** or **OpenAI API**,
- enforce a strict output format (1-sentence answer, steps, forms, references),
- run a small evaluation set (sanity checks, error cases).
