# Data Preparation (PDF → JSON)

**Goal:**  
Convert all Swiss rental-law PDFs (OR, VMWG, StGB, SRL) into clean, structured JSON files,  
where **each JSON file holds exactly one legal article**.

Article-level records make later retrieval and citation easier and more precise.

**Approach:**  
- Split at *article-level granularity* instead of page chunks.
- Add metadata (law name, article number, source).
- Keep the data pipeline clean and reproducible.

## ⚙️ Imports and Setup

In [1]:
```python
import json
import re
from pathlib import Path

import pymupdf
from tqdm import tqdm

# Paths
DATA_RAW = Path("../data/raw")     # PDFs go here
DATA_JSON = Path("../data/json")   # Will hold one JSON per article
DATA_JSON.mkdir(parents=True, exist_ok=True)
```


### Explanation
We'll use:
- **PyMuPDF** (imported as `pymupdf`; `fitz` is the legacy import name) to extract text page by page.  
- **Regex** to detect "Art. XXX" headers.  
- **tqdm** for progress bars.  

We'll store the results as JSON so each file can be embedded directly later.


### 🧩 Helper Functions

In [2]:
```python
# --- Cleaning & Splitting ---
# A header is a full line that starts with "Art." followed by a number
# and an optional letter suffix (e.g. "Art. 269d", "Art. 325bis").
ART_HEADER = re.compile(r"(?m)^\s*(Art\.\s*\d+[a-zA-Z]*\b[^\n]*)\s*$")

def clean_text(t: str) -> str:
    """Normalize whitespace and remove artifacts."""
    # Drop form feeds (page breaks) and soft hyphens left by the PDF layout.
    t = t.replace("\x0c", " ").replace("\u00ad", "")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\s+\n", "\n", t)
    t = re.sub(r"\n\s+", "\n", t)
    return t.strip()

def read_pdf_text(pdf_path: Path) -> str:
    """Extract all text from a PDF using PyMuPDF."""
    pages = []
    with pymupdf.open(pdf_path) as doc:
        for p in doc:
            pages.append(p.get_text("text"))
    return clean_text("\n".join(pages))

def split_articles(full_text: str):
    """Split a document into (header, body) per article."""
    headers = list(ART_HEADER.finditer(full_text))
    articles = []
    for i, m in enumerate(headers):
        # An article runs from its header to the start of the next header.
        start = m.start()
        end = headers[i+1].start() if i+1 < len(headers) else len(full_text)
        block = full_text[start:end].strip()
        header_line = m.group(1).strip()
        body = block[len(header_line):].strip()
        articles.append((header_line, body))
    return articles

def parse_article_number(header_line: str):
    """Return the article number (e.g. "269d") from a header line, or None."""
    m = re.search(r"Art\.\s*(\d+[a-zA-Z]*)", header_line)
    return m.group(1) if m else None
```


### Explanation
Each article in Swiss laws starts with `Art.` followed by a number and an optional letter suffix.  
The regex matches those header lines, and each article runs from its header to the next one.  
We also extract the article number (e.g. `269d`, `325bis`) for metadata.
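
To make the splitting rule concrete, here is a minimal, self-contained sketch that applies the same `ART_HEADER` pattern to a made-up two-article text. The sample text and the `split_demo` helper are illustrative assumptions, not part of the pipeline. The helper slices the body from the end of the header match, which gives the same result as the `block[len(header_line):]` approach used in `split_articles`.

```python
import re

# Same pattern as ART_HEADER in the helper cell.
ART_HEADER = re.compile(r"(?m)^\s*(Art\.\s*\d+[a-zA-Z]*\b[^\n]*)\s*$")

# Made-up sample text (not taken from the real PDFs).
SAMPLE = (
    "Art. 1 First heading\n"
    "1 First paragraph.\n"
    "Art. 2a Second heading\n"
    "1 Another paragraph mentioning Art. 1 inline.\n"
)

def split_demo(text: str) -> list[tuple[str, str, str]]:
    """Return (article_number, header, body) for each detected article."""
    matches = list(ART_HEADER.finditer(text))
    out = []
    for i, m in enumerate(matches):
        # The body ends where the next header starts (or at the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        header = m.group(1).strip()
        body = text[m.end():end].strip()
        number = re.search(r"Art\.\s*(\d+[a-zA-Z]*)", header).group(1)
        out.append((number, header, body))
    return out

articles = split_demo(SAMPLE)
for number, header, body in articles:
    print(number, "|", header, "|", body)
```

Because the pattern is anchored to the start of a line, the inline mention of `Art. 1` in the second body does not start a new article.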


## 📄 Process PDFs → Save JSON

In [3]:
```python
def detect_law_tag(stem: str) -> str:
    """Map a PDF file stem to a law tag; fall back to the stem itself."""
    s = stem.upper()
    # Check the more specific tags first: a stem such as "VMWG_Verordnung"
    # also contains "OR" and would otherwise be tagged as OR.
    if "VMWG" in s: return "VMWG"
    if "STG" in s: return "StGB"   # also covers "STGB"
    if "SRL" in s: return "SRL"
    if "OR" in s: return "OR"
    return stem

def ingest_pdf(pdf_path: Path):
    """Split one PDF into articles and write one JSON file per article."""
    law = detect_law_tag(pdf_path.stem)
    text = read_pdf_text(pdf_path)
    articles = split_articles(text)

    out_dir = DATA_JSON / law
    out_dir.mkdir(parents=True, exist_ok=True)

    for header, body in articles:
        art_nr = parse_article_number(header) or "NA"
        payload = {
            "law": law,
            "article": art_nr,
            "header": header,
            "text": body,
            "source": pdf_path.name
        }
        # Note: a repeated article number overwrites the earlier file.
        out_fp = out_dir / f"{law}_Art_{art_nr}.json"
        out_fp.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    return len(articles)
```


### ▶️ Run Conversion

In [4]:
```python
pdfs = sorted(DATA_RAW.glob("*.pdf"))
print("Found PDFs:", [p.name for p in pdfs])

total_articles = 0
for pdf in tqdm(pdfs, desc="Processing PDFs"):
    total_articles += ingest_pdf(pdf)

print(f"✅ Done! Created ~{total_articles} article JSON files.")
```


```text
Found PDFs: ['OR.pdf', 'SRL.pdf', 'STGB.pdf', 'VMWG.pdf']
Processing PDFs: 100%|██████████| 4/4 [00:00<00:00, 62.17it/s]
✅ Done! Created ~119 article JSON files.
```





### Explanation
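Each PDF is read, cleaned, and split into articles.
Every JSON file represents **exactly one article** (e.g. `OR_Art_269d.json`). Because the file name is built from the article number, two headers with the same number in one law write to the same file, so the number of files on disk can be lower than the reported article count.
We'll use these files later to build embeddings and store them in ChromaDB.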
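
Since `ingest_pdf` names each file `{law}_Art_{art_nr}.json`, it can help to check for repeated article numbers before trusting the file count. The following is a minimal sketch, not part of the original pipeline. The function name `find_filename_collisions` and the example article numbers are hypothetical.

```python
from collections import Counter

def find_filename_collisions(law: str, article_numbers: list[str]) -> dict[str, int]:
    """Return {filename: count} for every filename that would be written more than once."""
    names = [f"{law}_Art_{nr}.json" for nr in article_numbers]
    return {name: n for name, n in Counter(names).items() if n > 1}

# Hypothetical article numbers, e.g. a header that appears once in a table of
# contents and once in the body, plus two headers whose number could not be parsed.
collisions = find_filename_collisions("OR", ["253", "253a", "253", "NA", "NA"])
print(collisions)
```

An empty result means every article ended up in its own file. Any entry points at headers worth inspecting by hand.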


## 🗒️ Process Lucerne forms

In [5]:
```python
FORMS_DIR = DATA_JSON / "LU_FORMS"
FORMS_DIR.mkdir(parents=True, exist_ok=True)

FORMS = [
    {
        "id": "lu_mietvertragsaenderung",
        "name": "Formular Mietvertragsänderung",
        "description": (
            "Offizielles Luzerner Formular für Mietzinserhöhungen und andere einseitige Vertragsänderungen des Vermieters."
        ),
        "used_when": (
            "PERS: landlord | "
            "POS: mietzinserhöhung, mietzins erhöhen, erhöhung, mietzinsanpassung, energetische verbesserung, vertragsänderung, vermieter ändert vertrag | "
            "NEG: anfangsmietzins, neuabschluss, kündigung"
        ),
        "download_url": "https://gerichte.lu.ch/-/media/Gerichte/Dokumente/rechtsgebiete/formulare/Mietrecht/Mietvertragsaenderung_pdf.pdf",
    },
    {
        "id": "lu_anfangsmietzins",
        "name": "Mitteilung des Anfangsmietzinses",
        "description": (
            "Formular nach Art. 269d OR zur Mitteilung des Anfangsmietzinses, wenn die Formularpflicht im Kanton Luzern gilt."
        ),
        "used_when": (
            "PERS: landlord | "
            "POS: anfangsmietzins, neuvermietung, neuer mietvertrag, erstvermietung, vermieter teilt mietzins mit | "
            "NEG: mietzinserhöhung, kündigung"
        ),
        "download_url": "https://gerichte.lu.ch/-/media/Gerichte/Dokumente/rechtsgebiete/formulare/Mietrecht/Mitteilung_des_Anfangsmietzinses.pdf",
    },
    {
        "id": "lu_kuendigung",
        "name": "Kündigungsformular Wohn- und Geschäftsräume",
        "description": (
            "Offizielles Kündigungsformular für Wohn- und Geschäftsräume im Kanton Luzern."
        ),
        "used_when": (
            "PERS: both | "
            "POS: kündigung, mietvertrag beenden, auszug, beendigung des vertrags | "
            "NEG: anfangsmietzins, mietzinserhöhung"
        ),
        "download_url": "https://gerichte.lu.ch/-/media/Gerichte/Dokumente/rechtsgebiete/formulare/Mietvertrag_Kuendigung_pdf.pdf",
    },
    {
        "id": "lu_schlichtungsgesuch",
        "name": "Schlichtungsgesuch Miete/Pacht",
        "description": (
            "Gesuch an die Schlichtungsbehörde Miete/Pacht Luzern zur Einleitung eines Verfahrens bei Streitigkeiten (z.B. Mietzinserhöhung, Kündigung)."
        ),
        "used_when": (
            "PERS: both | "
            "POS: schlichtung, streitigkeit, anfechtung, konflikt, "
            "mietzinserhöhung anfechten, kündigung anfechten, mietstreit | "
            "NEG: rein formale kündigung ohne streit"
        ),
        "download_url": "https://gerichte.lu.ch/-/media/Gerichte/Dokumente/rechtsgebiete/formulare/schlichtungsgesuch_sbm_pdf.pdf",
    },
    {
        "id": "lu_vollmacht_schlichtung",
        "name": "Vollmacht für Schlichtungsverhandlung",
        "description": (
            "Vollmacht gemäss Art. 204 Abs. 3 lit. d ZPO für Vertretung in der Schlichtungsverhandlung."
        ),
        "used_when": (
            "PERS: both | "
            "POS: vollmacht, vertretung, anwalt vertritt, dritte person vertritt, "
            "schlichtungsverhandlung | "
            "NEG: keine vertretung nötig"
        ),
        "download_url": "https://gerichte.lu.ch/-/media/Gerichte/Dokumente/Organisation/Fromular_Gesuch_um_Entbindung_von_der_persnlichen_Erscheinungspflicht.docx",
    },
]

for f in FORMS:
    payload = {
        # keep the existing schema so the embedding script still works
        "law": "LU_FORMS",
        "article": f["id"],
        "header": f["name"],
        # used_when is already a single string; str.join on it would put
        # "; " between every character, so append it as-is.
        "text": f["description"] + "\n\nVerwendet wenn: " + f["used_when"],
        "source": f["download_url"],

        # extra fields your app can use later
        "form_name": f["name"],
        "form_url": f["download_url"],
        "used_when": f["used_when"],
        "jurisdiction": "Kanton Luzern",
    }

    out_fp = FORMS_DIR / f"{f['id']}.json"
    out_fp.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

print(f"✅ Added {len(FORMS)} Lucerne form metadata files.")
```

```text
✅ Added 5 Lucerne form metadata files.
```


### Explanation

This step adds the Lucerne rental forms as small metadata JSON files. Each file represents **one form** with its name, purpose, and download link.
They follow the same schema as the law articles above, so they can be indexed alongside them and the app can show the matching form with a download button when needed.
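
The `used_when` strings follow a `PERS: ... | POS: ... | NEG: ...` layout. How the app consumes them is not shown in this notebook, so the following is only a sketch of one possible parser. The function name `parse_used_when` and the returned dictionary shape are assumptions.

```python
def parse_used_when(raw: str) -> dict[str, list[str]]:
    """Split 'KEY: a, b | KEY: c' into {'KEY': ['a', 'b'], ...}."""
    fields: dict[str, list[str]] = {}
    for part in raw.split("|"):
        # Everything before the first colon is the key, the rest is a comma list.
        key, _, value = part.partition(":")
        fields[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return fields

example = (
    "PERS: both | "
    "POS: kündigung, mietvertrag beenden, auszug | "
    "NEG: anfangsmietzins, mietzinserhöhung"
)
parsed = parse_used_when(example)
print(parsed["PERS"], parsed["POS"], parsed["NEG"])
```

A structured form like this would let a caller match positive and negative keywords separately instead of searching the raw string.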

## 👀 Quick Inspection

In [6]:
```python
samples = list((DATA_JSON / "OR").glob("*.json"))[:3]
for s in samples:
    print("File:", s.name)
    data = json.loads(s.read_text(encoding="utf-8"))
    print(f"Header: {data['header']}")
    print(f"Excerpt: {data['text'][:250]}...\n")
```


```text
File: OR_Art_270b.json
Header: Art. 270b, Anfechtung von Mietzinserhöhungen und andern einseitigen Vertragsänderungen
Excerpt: 1 Der Mieter kann eine Mietzinserhöhung innert 30 Tagen, nachdem sie ihm mitgeteilt worden
ist, bei der Schlichtungsbehörde als missbräuchlich im Sinne der Artikel 269 und 269a anfechten.
2 Absatz 1 gilt auch, wenn der Vermieter sonstwie den Mietvert...

File: OR_Art_268b.json
Header: Art. 268b, Geltendmachung
Excerpt: 1 Will der Mieter wegziehen oder die in den gemieteten Räumen befindlichen Sachen fortschaffen,
so kann der Vermieter mit Hilfe der zuständigen Amtsstelle so viele Gegenstände
zurückhalten, als zur Deckung seiner Forderung notwendig sind.
2 Heimlich ...

File: OR_Art_266.json
Header: Art. 266, Beendigung des Mietverhältnisses, Ablauf der vereinbarten Dauer
Excerpt: 1 Haben die Parteien eine bestimmte Dauer ausdrücklich oder stillschweigend vereinbart, so
endet das Mietverhältnis ohne Kündigung mit Ablauf dieser Dauer.
2 Setzen die Pa
```

### Explanation
We quickly check that:
- Articles are correctly separated.  
- Text doesn't include the next article.  
- Metadata (law, article number) is stored correctly.
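
The same three checks can also be run automatically over every article file. Below is a minimal sketch under stated assumptions: the helper name `check_record` is hypothetical, the example records are made up, and the filename rule only applies to law articles (the Lucerne form files are named after their `id` instead).

```python
import re

# Same header pattern as in the helper cell.
ART_HEADER = re.compile(r"(?m)^\s*(Art\.\s*\d+[a-zA-Z]*\b[^\n]*)\s*$")

def check_record(record: dict, filename: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    # Metadata: every schema field must be present and non-empty.
    for key in ("law", "article", "header", "text", "source"):
        if not record.get(key):
            problems.append(f"missing or empty field: {key}")
    # Metadata: the filename must agree with the law tag and article number.
    expected = f"{record.get('law')}_Art_{record.get('article')}.json"
    if filename != expected:
        problems.append(f"filename {filename!r} does not match {expected!r}")
    # Separation: the body must not contain another article header line.
    if ART_HEADER.search(record.get("text") or ""):
        problems.append("text contains another article header")
    return problems

good = {"law": "OR", "article": "266", "header": "Art. 266 Example",
        "text": "1 Example body.", "source": "OR.pdf"}
bad = dict(good, text="1 Example body.\nArt. 267 Next header\n1 More text.")

print(check_record(good, "OR_Art_266.json"))
print(check_record(bad, "OR_Art_266.json"))
```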


# ✅ Summary
We now have a clean, article-level dataset ready for indexing.

**Next notebook: `2_Indexing_and_Retrieval.ipynb`**
We'll:
- Load all JSONs,
- Embed them with Sentence Transformers,
- Store them in a persistent ChromaDB collection for fast semantic search.
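
The next notebook is not shown here, so the following is only a sketch of its first step, loading all JSON files into parallel lists of IDs, documents, and metadata. The function name `load_records`, the `law:article` ID format, and the choice to embed header plus body are assumptions. The embedding and ChromaDB calls are deliberately left out.

```python
import json
import tempfile
from pathlib import Path

def load_records(json_root: Path):
    """Read every JSON file under json_root into (ids, documents, metadatas)."""
    ids, documents, metadatas = [], [], []
    for fp in sorted(json_root.rglob("*.json")):
        rec = json.loads(fp.read_text(encoding="utf-8"))
        ids.append(f"{rec['law']}:{rec['article']}")          # assumed ID format
        documents.append(f"{rec['header']}\n{rec['text']}")   # text to embed
        metadatas.append({"law": rec["law"], "article": rec["article"], "source": rec["source"]})
    return ids, documents, metadatas

# Demo on a temporary directory with one made-up record.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "OR").mkdir()
    sample = {"law": "OR", "article": "266", "header": "Art. 266 Example",
              "text": "1 Example body.", "source": "OR.pdf"}
    (root / "OR" / "OR_Art_266.json").write_text(
        json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    ids, documents, metadatas = load_records(root)

print(ids, metadatas)
```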

**Benefits of this structure**
- Easier to debug and explain
- Article-level granularity (one legal article per data point)
- New laws can be added, or existing ones updated, by placing the PDF in `data/raw` and re-running this notebook
