# Docling Pipelines: All use-cases + Combined flow

This notebook runs each script in `scripts/`, with a brief explanation of what it does, and then runs a one-pass combined pipeline (Picture Description + Formula Enrichment + Contextual Hybrid Chunking).

Inputs go in `source/`. Outputs are written to `output/`.

In [None]:
# Ensure dependencies (safe to re-run).
# %pip installs into the environment of the running kernel, which `!pip` does not guarantee.
%pip install -q -r requirements.txt

In [None]:
from pathlib import Path

project_root = Path().resolve()
src_dir = project_root / 'source'
out_dir = project_root / 'output'
out_dir.mkdir(parents=True, exist_ok=True)

print('Project root:', project_root)
print('Source dir:', src_dir)
print('Output dir:', out_dir)
print('Source files:', list(src_dir.glob('*')))

## 1) general_convert.py
Converts PDFs or URLs into Markdown and JSON using Docling's basic conversion.
- Input: files in `source/` (e.g., PDFs) or URLs (edit the script if needed).
- Output: `.md` and `.json` under `output/`.
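
A minimal sketch of what a basic conversion of this kind could look like, assuming the default `DocumentConverter` options. The helper names `output_paths` and `convert_basic` are hypothetical and are not taken from `scripts/general_convert.py`, which may be organized differently.

```python
import json
from pathlib import Path


def output_paths(source: str, out_dir: Path) -> tuple[Path, Path]:
    """Derive the .md and .json output paths from a file path or URL (hypothetical helper)."""
    # Use the last path segment without its extension; fall back to a fixed name.
    stem = Path(source.rstrip('/').split('/')[-1]).stem or 'document'
    return out_dir / f'{stem}.md', out_dir / f'{stem}.json'


def convert_basic(source: str, out_dir: Path) -> tuple[Path, Path]:
    """Convert one file or URL with Docling's default pipeline and write both exports."""
    # Imported lazily so the path helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    md_path, json_path = output_paths(source, out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path.write_text(doc.export_to_markdown(), encoding='utf-8')
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False), encoding='utf-8'
    )
    return md_path, json_path
```

Calling `convert_basic('source/report.pdf', Path('output'))` would write `output/report.md` and `output/report.json`; the file name here is only an example.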

In [None]:
%run scripts/general_convert.py

## 2) vlm_image_understanding.py
Uses the VLM pipeline (SmolDocling) to better understand image-heavy PDFs (figures, charts).
- Input: image-heavy PDFs in `source/`.
- Output: `.md` in `output/` (with enhanced figure understanding).

In [None]:
%run scripts/vlm_image_understanding.py

## 3) maths_processing.py
Converts documents, then heuristically extracts math snippets and equations from the result.
- Input: math-containing files in `source/` (e.g., MD or PDF).
- Output: Markdown and extracted math artifacts in `output/`.
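
The script's actual heuristics are not shown in this notebook. As an illustration of what heuristic extraction can mean, the sketch below assumes the converted Markdown marks math with `$...$` and `$$...$$` delimiters and pulls those spans out with regular expressions. The function name `extract_math` is hypothetical.

```python
import re

# Display math ($$...$$) may span several lines.
DISPLAY_MATH = re.compile(r'\$\$(.+?)\$\$', re.DOTALL)
# Inline math ($...$): skip escaped dollars and require non-space characters
# just inside both delimiters, which filters out most currency amounts.
INLINE_MATH = re.compile(r'(?<![\\$])\$(?!\s)([^$\n]+?)(?<!\s)\$(?!\$)')


def extract_math(markdown: str) -> dict[str, list[str]]:
    """Return display and inline math snippets found in a Markdown string."""
    display = [m.strip() for m in DISPLAY_MATH.findall(markdown)]
    # Blank out display blocks so their contents are not matched again as inline math.
    remainder = DISPLAY_MATH.sub(' ', markdown)
    inline = [m.strip() for m in INLINE_MATH.findall(remainder)]
    return {'display': display, 'inline': inline}
```

A regex pass like this is cheap but approximate: it will miss math written without dollar delimiters and can still misfire on unusual text, so the results are best treated as candidates.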

In [None]:
%run scripts/maths_processing.py

## 4) contextual_hybrid_chunking.py
Creates chunks using `HybridChunker` and computes a context-enriched string for each chunk.
- Input: `.md`, `.pdf`, `.docx`, `.html` in `source/`.
- Output: `.txt` and `.jsonl` with raw and enriched chunk text in `output/`.
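
To make "context-enriched" concrete: one common form of enrichment prepends the headings above a chunk to its text, so the chunk stays interpretable when read or embedded on its own. The sketch below illustrates that idea in plain Python. `SimpleChunk` and `contextualize` are hypothetical stand-ins, not Docling classes, and the string `HybridChunker` actually produces may differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk: body text plus the headings above it."""
    text: str
    headings: list[str] = field(default_factory=list)


def contextualize(chunk: SimpleChunk, delimiter: str = '\n') -> str:
    """Prefix the chunk text with its heading trail, skipping empty parts."""
    parts = [h for h in chunk.headings if h] + [chunk.text]
    return delimiter.join(p for p in parts if p)


chunk = SimpleChunk(
    text='Revenue grew 12% year over year.',
    headings=['Annual Report', 'Financial Highlights'],
)
print(contextualize(chunk))
```

The raw text alone does not say which report or section the sentence belongs to; the enriched string carries that information along with it.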

In [None]:
%run scripts/contextual_hybrid_chunking.py

## 5) enrich_formula_understanding.py
Enables Formula Understanding (`PdfPipelineOptions.do_formula_enrichment = True`).
- Adds LaTeX extraction and improved formula rendering (MathML in HTML export).
- Input: PDFs in `source/`.
- Output: `.md` and `.html` in `output/`.

In [None]:
%run scripts/enrich_formula_understanding.py

## 6) enrich_picture_description.py
Enables Picture Description (`PdfPipelineOptions.do_picture_description = True`) to caption figures and images.
- Optionally select VLM presets in the script (e.g., SmolVLM or Granite).
- Input: PDFs in `source/`.
- Output: `.md` in `output/`.

In [None]:
%run scripts/enrich_picture_description.py

## 7) One-pass combination: picture description + formula enrichment + contextual chunking
This cell performs both enrichments in the converter, then passes the resulting document to `HybridChunker` and writes raw + contextualized chunks.

In [None]:
from datetime import datetime
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.chunking import HybridChunker

# Pick the first PDF in source/ (sorted so the choice is deterministic)
pdf_path = next(iter(sorted(src_dir.glob('*.pdf'))), None)
if pdf_path is None:
    raise FileNotFoundError(f'No PDF found in {src_dir}; put a PDF into source/')

# Enable both enrichments in the pipeline
pipe = PdfPipelineOptions()
pipe.do_picture_description = True
pipe.do_formula_enrichment = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipe)}
)

dl_doc = converter.convert(str(pdf_path)).document

# Chunk and contextualize
chunker = HybridChunker()
chunks = list(chunker.chunk(dl_doc=dl_doc))

ts = datetime.now().strftime('%Y%m%d_%H%M%S')
base = f'{pdf_path.stem}__combo__{ts}'
txt_path = out_dir / f'{base}.txt'
jsonl_path = out_dir / f'{base}.jsonl'

import json
with txt_path.open('w', encoding='utf-8') as f_txt, jsonl_path.open('w', encoding='utf-8') as f_jsonl:
    for i, ch in enumerate(chunks):
        raw = ch.text or ''
        enriched = chunker.contextualize(chunk=ch)
        # Write TXT
        f_txt.write(f'=== {i} ===\n')
        f_txt.write('-- raw --\n')
        f_txt.write(raw + '\n')
        f_txt.write('-- enriched --\n')
        f_txt.write(enriched + '\n\n')
        # Write JSONL
        f_jsonl.write(json.dumps({
            'index': i,
            'raw': raw,
            'enriched': enriched,
            # 'path' and 'id' fall back to None when the chunk object does not define them
            'path': getattr(ch, 'path', None),
            'id': getattr(ch, 'id', None),
        }, ensure_ascii=False) + '\n')

print('Wrote:', txt_path)
print('Wrote:', jsonl_path)

## 8) Inspect outputs
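
The final cell lists the files written to `output/`. To look inside a chunk file, a small reader along the following lines may help. It assumes the JSONL layout written in section 7 (one object per line with `raw` and `enriched` keys); `summarize_chunks` is a hypothetical helper, not part of the scripts.

```python
import json
from pathlib import Path


def summarize_chunks(jsonl_path: Path) -> dict:
    """Count chunks and measure how much text contextualization added."""
    count = raw_chars = enriched_chars = 0
    with jsonl_path.open(encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            raw_chars += len(record.get('raw', ''))
            enriched_chars += len(record.get('enriched', ''))
    return {
        'chunks': count,
        'raw_chars': raw_chars,
        'added_chars': enriched_chars - raw_chars,
    }
```

For example, `summarize_chunks(jsonl_path)` can be run right after section 7, where `jsonl_path` is defined. An `added_chars` value of zero would suggest the enrichment added no context for that document.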

In [None]:
for p in sorted(out_dir.glob('*')):
    print(p.name)