# Docling Pipelines: Enrichment + Contextual Chunking

This notebook shows how to:
- Run the existing scripts in `scripts/` from the notebook.
- Enable picture description and formula enrichment in one conversion pass, then run contextual hybrid chunking on the result.
- Inspect outputs in `output/`.

Inputs go in `source/`. Outputs are written to `output/`.

## 0) Environment setup

In [1]:
# Install deps if needed (safe to re-run)
!pip install -q -r requirements.txt




In [None]:
from pathlib import Path

project_root = Path().resolve()
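src_dir = project_root / 'source'
out_dir = project_root / 'output'
# Create both directories so later cells can glob and write without errors
src_dir.mkdir(parents=True, exist_ok=True)
out_dir.mkdir(parents=True, exist_ok=True)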

print('Project root:', project_root)
print('Source dir:', src_dir)
print('Output dir:', out_dir)
print('Source files:', list(src_dir.glob('*')))
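
After installing, it can help to confirm that the key packages are importable before running the heavier cells. The check below is a minimal sketch; the package names in `required` are an assumption, since the contents of `requirements.txt` are not shown here.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Assumed package list; adjust it to match requirements.txt
required = ['docling', 'docling_core']
missing = missing_packages(required)
print('Missing packages:', missing or 'none')
```

If anything is reported as missing, re-running the install cell in the same kernel is usually the first thing to try.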

## 1) Run existing scripts from the notebook
Place files in `source/` before running.

In [None]:
# Contextual hybrid chunking (.md, .pdf, .docx, .html)
%run scripts/contextual_hybrid_chunking.py

In [None]:
# PDF formula enrichment (.md and .html outputs)
%run scripts/enrich_formula_understanding.py

In [None]:
# PDF picture description enrichment (.md output)
%run scripts/enrich_picture_description.py
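
The `%run` magic only works inside IPython. As a rough alternative, the sketch below runs the same scripts with the standard-library `runpy` module and records which ones succeed, so one failing script does not stop the rest. The helper name `run_scripts` is hypothetical and not part of this project.

```python
import runpy
from pathlib import Path


def run_scripts(script_paths):
    """Run each script as __main__ and return {path: status string}."""
    results = {}
    for script in map(Path, script_paths):
        key = str(script)
        if not script.is_file():
            results[key] = 'missing'
            continue
        try:
            runpy.run_path(key, run_name='__main__')
            results[key] = 'ok'
        except SystemExit as exc:
            # sys.exit() with code 0 or None signals a normal exit
            results[key] = 'ok' if exc.code in (0, None) else f'exit code {exc.code}'
        except Exception as exc:
            results[key] = f'{type(exc).__name__}: {exc}'
    return results


statuses = run_scripts([
    'scripts/contextual_hybrid_chunking.py',
    'scripts/enrich_formula_understanding.py',
    'scripts/enrich_picture_description.py',
])
for path, status in statuses.items():
    print(f'{path}: {status}')
```

Unlike `%run`, this does not copy the scripts' variables into the notebook namespace, which keeps the runs isolated from each other.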

## 2) One-pass combination: picture description + formula enrichment + contextual chunking
This cell enables both enrichments in the converter's PDF pipeline, passes the converted document to `HybridChunker`, and writes each chunk's raw and contextualized text to a `.txt` file and a `.jsonl` file in `output/`.
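
To make the difference between a raw and a contextualized chunk concrete, here is a small stand-in that prepends a chunk's section headings to its text. It is an illustration only, not docling's implementation: the function name and the newline delimiter are assumptions, and the real `contextualize` may include other metadata such as captions.

```python
def contextualize_sketch(headings, text, delim='\n'):
    """Join non-empty headings and the chunk text into one string."""
    parts = [h for h in (headings or []) if h] + [text]
    return delim.join(parts)


raw = 'We minimize the cross-entropy loss.'
print(contextualize_sketch(['Methods', 'Loss function'], raw))
```

The point of the extra context is that a chunk stays interpretable when it is embedded or retrieved on its own, away from the surrounding document.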

In [None]:
from datetime import datetime
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.chunking import HybridChunker

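# Pick the first PDF in source/ (sorted by name so the choice is deterministic)
pdf_path = next(iter(sorted(src_dir.glob('*.pdf'))), None)
if pdf_path is None:
    raise FileNotFoundError(f'No PDF found in {src_dir}; put a PDF into source/')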

# Enable both enrichments in the pipeline
pipe = PdfPipelineOptions()
pipe.do_picture_description = True
pipe.do_formula_enrichment = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipe)}
)

dl_doc = converter.convert(str(pdf_path)).document

# Chunk and contextualize
chunker = HybridChunker()
chunks = list(chunker.chunk(dl_doc=dl_doc))

ts = datetime.now().strftime('%Y%m%d_%H%M%S')
base = f'{pdf_path.stem}__combo__{ts}'
txt_path = out_dir / f'{base}.txt'
jsonl_path = out_dir / f'{base}.jsonl'

import json
with txt_path.open('w', encoding='utf-8') as f_txt, jsonl_path.open('w', encoding='utf-8') as f_jsonl:
    for i, ch in enumerate(chunks):
        raw = ch.text or ''
        enriched = chunker.contextualize(chunk=ch)
        # Write TXT
        f_txt.write(f'=== {i} ===\n')
        f_txt.write('-- raw --\n')
        f_txt.write(raw + '\n')
        f_txt.write('-- enriched --\n')
        f_txt.write(enriched + '\n\n')
        # Write JSONL
        f_jsonl.write(json.dumps({
            'index': i,
            'raw': raw,
            'enriched': enriched,
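            # 'path' and 'id' are read defensively and may be None for chunks
            # that do not define them; headings are read from chunk metadata
            # (assumes a `meta.headings` field, falling back to None).
            'path': getattr(ch, 'path', None),
            'id': getattr(ch, 'id', None),
            'headings': getattr(getattr(ch, 'meta', None), 'headings', None),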
        }, ensure_ascii=False) + '\n')

print('Wrote:', txt_path)
print('Wrote:', jsonl_path)
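
Once the JSONL file exists, it can be read back for a quick sanity check. The sketch below relies only on the `raw` and `enriched` keys written by the cell above; the helper names `load_chunks` and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def load_chunks(jsonl_path):
    """Read one JSON object per non-empty line."""
    records = []
    with Path(jsonl_path).open(encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Report the chunk count and the average raw and added character counts."""
    n = len(records)
    if n == 0:
        return {'chunks': 0, 'avg_raw_chars': 0.0, 'avg_added_chars': 0.0}
    raw_total = sum(len(r['raw']) for r in records)
    added_total = sum(len(r['enriched']) - len(r['raw']) for r in records)
    return {
        'chunks': n,
        'avg_raw_chars': raw_total / n,
        'avg_added_chars': added_total / n,
    }
```

In the notebook, `summarize(load_chunks(jsonl_path))` gives a rough sense of how much context each chunk gained.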

## 3) Inspect outputs

In [None]:
# List output files
for p in sorted(out_dir.glob('*')):
    print(p.name)
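
A slightly richer listing can show file sizes and recover the run timestamp from the combined-cell file names, which follow the pattern `<stem>__combo__<YYYYmmdd_HHMMSS>`. This is a sketch with hypothetical helper names; files produced by the standalone scripts may use other naming schemes and are then reported without a timestamp.

```python
from datetime import datetime
from pathlib import Path


def parse_combo_name(filename):
    """Return (stem, datetime) for a combined-cell output name, else None."""
    stem, sep, ts = Path(filename).stem.rpartition('__combo__')
    if not sep:
        return None
    try:
        return stem, datetime.strptime(ts, '%Y%m%d_%H%M%S')
    except ValueError:
        return None


def describe_outputs(out_dir):
    """Return (name, size_in_bytes) for each regular file, sorted by name."""
    root = Path(out_dir)
    if not root.is_dir():
        return []
    return [(p.name, p.stat().st_size) for p in sorted(root.glob('*')) if p.is_file()]


for name, size in describe_outputs('output'):
    print(f'{name}  {size} bytes  {parse_combo_name(name)}')
```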

## 4) Notes on modalities
- To chunk Markdown, DOCX, or HTML files, place them in `source/` and re-run the contextual chunking cell in section 1.
- Enrichment options are set through `PdfPipelineOptions`, so they apply to PDF inputs (see the combined cell in section 2).
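
The two notes above amount to a simple routing rule, sketched below: PDFs go through enrichment and chunking, the other supported formats go through chunking only. The extension sets mirror the ones named in this notebook; the function name and the plan keys are hypothetical.

```python
from pathlib import Path

# Extensions named in this notebook; adjust them if your scripts accept more
CHUNKABLE = {'.md', '.pdf', '.docx', '.html'}
ENRICHABLE = {'.pdf'}


def plan_processing(paths):
    """Group file names by the processing route their extension allows."""
    plan = {'enrich_and_chunk': [], 'chunk_only': [], 'skipped': []}
    for p in sorted(map(Path, paths)):
        ext = p.suffix.lower()
        if ext in ENRICHABLE:
            plan['enrich_and_chunk'].append(p.name)
        elif ext in CHUNKABLE:
            plan['chunk_only'].append(p.name)
        else:
            plan['skipped'].append(p.name)
    return plan


print(plan_processing(Path('source').glob('*')))
```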