# Docling Pipelines: All use-cases + LLM Enrichment + Toggles

This notebook lets you:
- Run all scripts in `scripts/` (with short descriptions).
- Toggle Formula Understanding and Picture Description before conversion.
- Select the LLM provider through the `PROVIDER` environment variable (`gemini` or `gpt`) and enrich chunks with structured outputs.

Inputs go in `source/`. Outputs are written to `output/`.

In [1]:
# Install dependencies (safe to re-run)
!pip install -q -r requirements.txt




In [1]:
# Load environment and configure provider
import os
from pathlib import Path
from dotenv import load_dotenv

# Expose only the first GPU to PyTorch (must be set before CUDA is initialized)
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Load .env if present. You can copy env.example to .env and fill in the keys.
if Path('.env').exists():
    load_dotenv('.env')
else:
    load_dotenv()  # no local .env: let python-dotenv search parent directories

# Provider: 'gemini' or 'gpt' (from env PROVIDER)
PROVIDER = os.getenv('PROVIDER', 'gemini').strip().lower()
OPENAI_MODEL = os.getenv('OPENAI_MODEL', 'gpt-4o-mini')
GEMINI_MODEL = os.getenv('GEMINI_MODEL', 'gemini-2.5-flash')

print('Provider:', PROVIDER)
print('OPENAI_MODEL:', OPENAI_MODEL)
print('GEMINI_MODEL:', GEMINI_MODEL)

Provider: gemini
OPENAI_MODEL: gpt-4o-mini
GEMINI_MODEL: gemini-2.5-flash
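The provider cell above accepts any string for `PROVIDER` and only fails later, when the LLM client is created. A minimal sketch of an early check follows. The helper `check_provider` and the key variable names `GEMINI_API_KEY` and `OPENAI_API_KEY` are assumptions for illustration; use whatever names your `env.example` defines.

```python
import os

# Hypothetical mapping from provider name to the env var holding its API key.
# These variable names are assumptions; match them to your env.example.
API_KEY_VARS = {"gemini": "GEMINI_API_KEY", "gpt": "OPENAI_API_KEY"}


def check_provider(provider, env=None):
    """Return the normalized provider name, or raise if it cannot be used."""
    env = os.environ if env is None else env
    name = provider.strip().lower()
    if name not in API_KEY_VARS:
        raise ValueError(f"Unsupported PROVIDER: {provider!r}")
    key_var = API_KEY_VARS[name]
    if not env.get(key_var):
        raise RuntimeError(f"{key_var} is not set; add it to .env")
    return name
```

Calling `check_provider(PROVIDER)` right after `load_dotenv` would surface a typo or a missing key before any conversion work starts.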


In [2]:
# Paths
project_root = Path().resolve()
src_dir = project_root / 'source'
out_dir = project_root / 'output'
out_dir.mkdir(parents=True, exist_ok=True)
print('Project root:', project_root)
print('Source dir:', src_dir)
print('Output dir:', out_dir)
print('Source files:', list(src_dir.glob('*')))

Project root: D:\GIT\docling-pipelines
Source dir: D:\GIT\docling-pipelines\source
Output dir: D:\GIT\docling-pipelines\output
Source files: [WindowsPath('D:/GIT/docling-pipelines/source/.gitkeep'), WindowsPath('D:/GIT/docling-pipelines/source/CHEMISTRY GRADE 9 - REVIEW 2023 2 (1).pdf')]
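The listing above includes `.gitkeep` because `glob('*')` matches every entry. A small sketch for listing only PDFs, in a stable order, is shown here; `list_pdfs` is a hypothetical helper, not part of the project scripts.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The case-insensitive suffix check also picks up files named `*.PDF`, which a plain `glob('*.pdf')` can miss on case-sensitive filesystems.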


In [3]:
# Verify CUDA setup
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA version: {torch.version.cuda}')
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB')
    # Test tensor on GPU
    test_tensor = torch.ones(2, 2).cuda()
    print(f'Test tensor created on: {test_tensor.device}')
else:
    print('⚠️ CUDA not available - Docling will use CPU')

PyTorch version: 2.5.1+cu121
CUDA available: True
CUDA version: 12.1
GPU: NVIDIA GeForce RTX 4070 Laptop GPU
GPU Memory: 8.00 GB
Test tensor created on: cuda:0


## 1) Scripts quick-run (what each does)

- `general_convert.py`: basic PDF/URL conversion to Markdown/JSON.

In [None]:
%run scripts/general_convert.py

- `vlm_image_understanding.py`: VLM (SmolDocling) for image-heavy PDFs.

In [None]:
%run scripts/vlm_image_understanding.py

- `maths_processing.py`: converts and extracts math snippets heuristically.

In [None]:
%run scripts/maths_processing.py

- `contextual_hybrid_chunking.py`: HybridChunker raw + contextualized chunks.

In [None]:
%run scripts/contextual_hybrid_chunking.py

- `enrich_formula_understanding.py`: Formula Understanding enrichment (LaTeX/MathML).

In [None]:
%run scripts/enrich_formula_understanding.py

- `enrich_picture_description.py`: Picture Description enrichment (VLM captions).

In [5]:
%run scripts/enrich_picture_description.py

Processing file: Mathematics G9 (Main)- WEB 2023_3.pdf


KeyboardInterrupt: 

- `enrich_formula_table_picture.py`: Formula + Picture + Table enrichment.

In [1]:
%run scripts/enrich_formula_table_picture.py

Processing file: Bharata Natyam Grade 9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Bharata Natyam Grade 9__formula__20250820_221429.md
Wrote: D:\GIT\Docling-SmolDoc\output\Bharata Natyam Grade 9__formula__20250820_221429.html
Moved Bharata Natyam Grade 9.pdf to archive
Finished processing: Bharata Natyam Grade 9.pdf
Processing file: BIOLOGY GRADE 9 - REVIEW 2023 3.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\BIOLOGY GRADE 9 - REVIEW 2023 3__formula__20250820_223232.md
Wrote: D:\GIT\Docling-SmolDoc\output\BIOLOGY GRADE 9 - REVIEW 2023 3__formula__20250820_223232.html
Moved BIOLOGY GRADE 9 - REVIEW 2023 3.pdf to archive
Finished processing: BIOLOGY GRADE 9 - REVIEW 2023 3.pdf
Processing file: Book Kathak Grade 9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Book Kathak Grade 9__formula__20250820_224941.md
Wrote: D:\GIT\Docling-SmolDoc\output\Book Kathak Grade 9__formula__20250820_224941.html
Moved Book Kathak Grade 9.pdf to archive
Finished processing: Book Kathak Grade 9.pdf
Processing file: Cambridge-International-AS-and-A-Level-Accounting-by-Ian-Harrison.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Cambridge-International-AS-and-A-Level-Accounting-by-Ian-Harrison__formula__20250820_234543.md
Wrote: D:\GIT\Docling-SmolDoc\output\Cambridge-International-AS-and-A-Level-Accounting-by-Ian-Harrison__formula__20250820_234543.html
Moved Cambridge-International-AS-and-A-Level-Accounting-by-Ian-Harrison.pdf to archive
Finished processing: Cambridge-International-AS-and-A-Level-Accounting-by-Ian-Harrison.pdf
Processing file: Design and Technology Grade 9 (2023) 4.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Design and Technology Grade 9 (2023) 4__formula__20250821_000353.md
Wrote: D:\GIT\Docling-SmolDoc\output\Design and Technology Grade 9 (2023) 4__formula__20250821_000353.html
Moved Design and Technology Grade 9 (2023) 4.pdf to archive
Finished processing: Design and Technology Grade 9 (2023) 4.pdf
Processing file: Drama _ Theatre 2021 - Grade 9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Drama _ Theatre 2021 - Grade 9__formula__20250821_000751.md
Wrote: D:\GIT\Docling-SmolDoc\output\Drama _ Theatre 2021 - Grade 9__formula__20250821_000751.html
Moved Drama _ Theatre 2021 - Grade 9.pdf to archive
Finished processing: Drama _ Theatre 2021 - Grade 9.pdf
Processing file: English Grade 9 - REVIEW 2021 (WEB).pdf




Wrote: D:\GIT\Docling-SmolDoc\output\English Grade 9 - REVIEW 2021 (WEB)__formula__20250821_003311.md
Wrote: D:\GIT\Docling-SmolDoc\output\English Grade 9 - REVIEW 2021 (WEB)__formula__20250821_003311.html
Moved English Grade 9 - REVIEW 2021 (WEB).pdf to archive
Finished processing: English Grade 9 - REVIEW 2021 (WEB).pdf
Processing file: Food _ Textiles Studies Grade 9 (2023) 6.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Food _ Textiles Studies Grade 9 (2023) 6__formula__20250821_005017.md
Wrote: D:\GIT\Docling-SmolDoc\output\Food _ Textiles Studies Grade 9 (2023) 6__formula__20250821_005017.html
Moved Food _ Textiles Studies Grade 9 (2023) 6.pdf to archive
Finished processing: Food _ Textiles Studies Grade 9 (2023) 6.pdf
Processing file: French Grade 9 (2021).pdf




Wrote: D:\GIT\Docling-SmolDoc\output\French Grade 9 (2021)__formula__20250821_010430.md
Wrote: D:\GIT\Docling-SmolDoc\output\French Grade 9 (2021)__formula__20250821_010430.html
Moved French Grade 9 (2021).pdf to archive
Finished processing: French Grade 9 (2021).pdf
Processing file: French Literature (2021).pdf




Wrote: D:\GIT\Docling-SmolDoc\output\French Literature (2021)__formula__20250821_010837.md
Wrote: D:\GIT\Docling-SmolDoc\output\French Literature (2021)__formula__20250821_010837.html
Moved French Literature (2021).pdf to archive
Finished processing: French Literature (2021).pdf
Processing file: ICT G9 Main_Reprint 2021.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\ICT G9 Main_Reprint 2021__formula__20250821_014954.md
Wrote: D:\GIT\Docling-SmolDoc\output\ICT G9 Main_Reprint 2021__formula__20250821_014954.html
Moved ICT G9 Main_Reprint 2021.pdf to archive
Finished processing: ICT G9 Main_Reprint 2021.pdf
Processing file: KreolMorisien_G9_Mainstream.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\KreolMorisien_G9_Mainstream__formula__20250821_020500.md
Wrote: D:\GIT\Docling-SmolDoc\output\KreolMorisien_G9_Mainstream__formula__20250821_020500.html
Moved KreolMorisien_G9_Mainstream.pdf to archive
Finished processing: KreolMorisien_G9_Mainstream.pdf
Processing file: Kuchipudi Grade 9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Kuchipudi Grade 9__formula__20250821_023122.md
Wrote: D:\GIT\Docling-SmolDoc\output\Kuchipudi Grade 9__formula__20250821_023122.html
Moved Kuchipudi Grade 9.pdf to archive
Finished processing: Kuchipudi Grade 9.pdf
Processing file: LIFESKILLS G9-Educator's Book-REPRINT_2021.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\LIFESKILLS G9-Educator's Book-REPRINT_2021__formula__20250821_023751.md
Wrote: D:\GIT\Docling-SmolDoc\output\LIFESKILLS G9-Educator's Book-REPRINT_2021__formula__20250821_023751.html
Moved LIFESKILLS G9-Educator's Book-REPRINT_2021.pdf to archive
Finished processing: LIFESKILLS G9-Educator's Book-REPRINT_2021.pdf
Processing file: Literature in English - GRADE 9 (2021).pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Literature in English - GRADE 9 (2021)__formula__20250821_024302.md
Wrote: D:\GIT\Docling-SmolDoc\output\Literature in English - GRADE 9 (2021)__formula__20250821_024302.html
Moved Literature in English - GRADE 9 (2021).pdf to archive
Finished processing: Literature in English - GRADE 9 (2021).pdf
Processing file: PE Main G9-REPRINT 2021.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\PE Main G9-REPRINT 2021__formula__20250821_033553.md
Wrote: D:\GIT\Docling-SmolDoc\output\PE Main G9-REPRINT 2021__formula__20250821_033553.html
Moved PE Main G9-REPRINT 2021.pdf to archive
Finished processing: PE Main G9-REPRINT 2021.pdf
Processing file: PHYSICS Grade 9 Mainstream WEB-5.11.24.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\PHYSICS Grade 9 Mainstream WEB-5.11.24__formula__20250821_035648.md
Wrote: D:\GIT\Docling-SmolDoc\output\PHYSICS Grade 9 Mainstream WEB-5.11.24__formula__20250821_035648.html
Moved PHYSICS Grade 9 Mainstream WEB-5.11.24.pdf to archive
Finished processing: PHYSICS Grade 9 Mainstream WEB-5.11.24.pdf
Processing file: Sexuality_G9_webversion[1].pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Sexuality_G9_webversion[1]__formula__20250821_040254.md
Wrote: D:\GIT\Docling-SmolDoc\output\Sexuality_G9_webversion[1]__formula__20250821_040254.html
Moved Sexuality_G9_webversion[1].pdf to archive
Finished processing: Sexuality_G9_webversion[1].pdf
Processing file: Sitar Grade 9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Sitar Grade 9__formula__20250821_042423.md
Wrote: D:\GIT\Docling-SmolDoc\output\Sitar Grade 9__formula__20250821_042423.html
Moved Sitar Grade 9.pdf to archive
Finished processing: Sitar Grade 9.pdf
Processing file: SMS Grade 9 Suggested answers post Workshop Feb 2020_HD.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\SMS Grade 9 Suggested answers post Workshop Feb 2020_HD__formula__20250821_042456.md
Wrote: D:\GIT\Docling-SmolDoc\output\SMS Grade 9 Suggested answers post Workshop Feb 2020_HD__formula__20250821_042456.html
Moved SMS Grade 9 Suggested answers post Workshop Feb 2020_HD.pdf to archive
Finished processing: SMS Grade 9 Suggested answers post Workshop Feb 2020_HD.pdf
Processing file: SMS Grade 9 Suggested answers post Workshop Feb 2020_HD_2.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\SMS Grade 9 Suggested answers post Workshop Feb 2020_HD_2__formula__20250821_042522.md
Wrote: D:\GIT\Docling-SmolDoc\output\SMS Grade 9 Suggested answers post Workshop Feb 2020_HD_2__formula__20250821_042522.html
Moved SMS Grade 9 Suggested answers post Workshop Feb 2020_HD_2.pdf to archive
Finished processing: SMS Grade 9 Suggested answers post Workshop Feb 2020_HD_2.pdf
Processing file: SMS_G9_rep2022.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\SMS_G9_rep2022__formula__20250821_045515.md
Wrote: D:\GIT\Docling-SmolDoc\output\SMS_G9_rep2022__formula__20250821_045515.html
Moved SMS_G9_rep2022.pdf to archive
Finished processing: SMS_G9_rep2022.pdf
Processing file: Tabla Grade 9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Tabla Grade 9__formula__20250821_054746.md
Wrote: D:\GIT\Docling-SmolDoc\output\Tabla Grade 9__formula__20250821_054746.html
Moved Tabla Grade 9.pdf to archive
Finished processing: Tabla Grade 9.pdf
Processing file: Vocal Hindustani Grade 9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\Vocal Hindustani Grade 9__formula__20250821_061149.md
Wrote: D:\GIT\Docling-SmolDoc\output\Vocal Hindustani Grade 9__formula__20250821_061149.html
Moved Vocal Hindustani Grade 9.pdf to archive
Finished processing: Vocal Hindustani Grade 9.pdf
Processing file: WEB REPRINT BEE G9.pdf




Wrote: D:\GIT\Docling-SmolDoc\output\WEB REPRINT BEE G9__formula__20250821_063351.md
Wrote: D:\GIT\Docling-SmolDoc\output\WEB REPRINT BEE G9__formula__20250821_063351.html
Moved WEB REPRINT BEE G9.pdf to archive
Finished processing: WEB REPRINT BEE G9.pdf
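Each `Wrote:` line in the log above ends with a `YYYYMMDD_HHMMSS` stamp, so the gap between consecutive `.md` outputs gives a rough per-document duration. The sketch below parses those stamps; the function names are hypothetical, and it assumes the stamp is generated when the output is written, which the log suggests but does not prove.

```python
import re
from datetime import datetime

# Matches the trailing timestamp of a Markdown output name, e.g. __20250820_221429.md
STAMP = re.compile(r"__(\d{8}_\d{6})\.md$")


def md_write_times(log_lines):
    """Return the datetimes embedded in 'Wrote: ... .md' log lines, in order."""
    times = []
    for line in log_lines:
        line = line.strip()
        match = STAMP.search(line)
        if line.startswith("Wrote:") and match:
            times.append(datetime.strptime(match.group(1), "%Y%m%d_%H%M%S"))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between each pair of consecutive times."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


log = [
    r"Wrote: D:\GIT\Docling-SmolDoc\output\Bharata Natyam Grade 9__formula__20250820_221429.md",
    r"Wrote: D:\GIT\Docling-SmolDoc\output\BIOLOGY GRADE 9 - REVIEW 2023 3__formula__20250820_223232.md",
]
print(gaps_in_seconds(md_write_times(log)))  # [1083]
```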


## 2) One-pass combination: picture description + formula enrichment + contextual chunking + optional LLM enrichment
- Toggle Formula/Picture in the cell below.
- Select provider via PROVIDER in .env (gemini/gpt).
- Produces TXT/JSONL of enriched chunks (structured).
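The toggles in the cell below are plain Python booleans. If you would rather drive them from the environment, a minimal sketch could look like this. The helper `env_flag` and the idea of reusing `DO_FORMULA`, `DO_PICTURE`, and `DO_LLM` as environment variable names are assumptions; the notebook itself does not read them from `.env`.

```python
import os

# Strings treated as "on"; anything else that is non-empty counts as "off".
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default, env=None):
    """Read a boolean toggle from the environment, falling back to `default`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() in TRUE_VALUES
```

For example, `DO_LLM = env_flag('DO_LLM', False)` keeps the notebook default unless the variable is set explicitly.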

In [6]:
# Toggles (set True/False or override via env if you prefer)
DO_FORMULA = True   # <- set False to disable Formula Understanding
DO_PICTURE = True   # <- set False to disable Picture Description
DO_LLM = False       # <- set True to enable LLM enrichment

print('DO_FORMULA:', DO_FORMULA, '| DO_PICTURE:', DO_PICTURE, '| DO_LLM:', DO_LLM)

# Check CUDA availability (`device` is informational; it is not passed to Docling below)
import torch
if torch.cuda.is_available():
    device = 'cuda'
    print(f'CUDA is available! Using GPU: {torch.cuda.get_device_name(0)}')
    print(f'CUDA version: {torch.version.cuda}')
    # Set CUDA device for PyTorch
    torch.cuda.set_device(0)
else:
    device = 'cpu'
    print('CUDA not available, using CPU')

from datetime import datetime
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.chunking import HybridChunker

# Configure Docling pipeline based on toggles
pdf_opts = PdfPipelineOptions()
pdf_opts.do_formula_enrichment = bool(DO_FORMULA)
pdf_opts.do_picture_description = bool(DO_PICTURE)
# Tip: you could choose a picture description preset here if desired
# from docling.datamodel.pipeline_options import smolvlm_picture_description
# pdf_opts.picture_description_options = smolvlm_picture_description

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_opts)}
)

# Pick the first PDF by name (glob order is not guaranteed, so sort for a stable choice)
pdf_path = next(iter(sorted(src_dir.glob('*.pdf'))), None)
assert pdf_path is not None, 'Put a PDF into source/'

dl_doc = converter.convert(str(pdf_path)).document

# Chunk
chunker = HybridChunker()
chunks = list(chunker.chunk(dl_doc=dl_doc))

# LLM enrichment (conditional on DO_LLM)
if DO_LLM:
    # Create selected LLM client
    if PROVIDER == 'gpt':
        from clients.openai_client import OpenAIClient, EnrichedChunk
        llm = OpenAIClient(model=OPENAI_MODEL)
    elif PROVIDER == 'gemini':
        from clients.gemini_client import GeminiClient, EnrichedChunk
        llm = GeminiClient(model=GEMINI_MODEL)
    else:
        raise ValueError(f'Unsupported PROVIDER: {PROVIDER}')

# Write outputs
ts = datetime.now().strftime('%Y%m%d_%H%M%S')
base = f'{pdf_path.stem}__combo_llm__{PROVIDER}__{ts}' if DO_LLM else f'{pdf_path.stem}__combo_basic__{ts}'
txt_path = out_dir / f'{base}.txt'
jsonl_path = out_dir / f'{base}.jsonl'

import json
with txt_path.open('w', encoding='utf-8') as f_txt, jsonl_path.open('w', encoding='utf-8') as f_jsonl:
    for i, ch in enumerate(chunks):
        raw = ch.text or ''
        structural = chunker.contextualize(chunk=ch)
        
        if DO_LLM:
            enriched = llm.enrich_chunk(raw, context=structural)  # Pydantic validated
            # TXT with LLM enrichment
            f_txt.write(f'=== {i} ===\n')
            f_txt.write('-- title --\n' + (enriched.title or '') + '\n')
            f_txt.write('-- summary --\n' + enriched.summary + '\n')
            f_txt.write('-- key_points --\n' + '\n'.join('- ' + kp for kp in enriched.key_points) + '\n')
            f_txt.write('-- enriched_text --\n' + enriched.enriched_text + '\n\n')
            # JSONL with LLM enrichment
            f_jsonl.write(json.dumps({
                'index': i,
                'title': enriched.title,
                'summary': enriched.summary,
                'key_points': enriched.key_points,
                'enriched_text': enriched.enriched_text,
                'path': getattr(ch, 'path', None),
                'id': getattr(ch, 'id', None),
            }, ensure_ascii=False) + '\n')
        else:
            # Basic output without LLM enrichment
            f_txt.write(f'=== {i} ===\n')
            f_txt.write('-- raw_text --\n' + raw + '\n')
            f_txt.write('-- structural_context --\n' + structural + '\n\n')
            # JSONL without LLM enrichment
            f_jsonl.write(json.dumps({
                'index': i,
                'raw_text': raw,
                'structural_context': structural,
                'path': getattr(ch, 'path', None),
                'id': getattr(ch, 'id', None),
            }, ensure_ascii=False) + '\n')

print('Wrote:', txt_path)
print('Wrote:', jsonl_path)

DO_FORMULA: True | DO_PICTURE: True | DO_LLM: False
CUDA is available! Using GPU: NVIDIA GeForce RTX 4050 Laptop GPU
CUDA version: 12.4


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

KeyboardInterrupt: 
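The cell above reads four fields from the object returned by `llm.enrich_chunk`: `title`, `summary`, `key_points`, and `enriched_text`. The `clients` module is not shown in this notebook, so the following is only a standard-library stand-in for what a validated result could look like. `EnrichedChunkSketch` is a hypothetical name; the real `EnrichedChunk` is a Pydantic model and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnrichedChunkSketch:
    """Stand-in for the structured LLM output used by the writer loop."""
    summary: str
    enriched_text: str
    title: Optional[str] = None
    key_points: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data):
        """Build an instance from parsed JSON, rejecting malformed fields."""
        for name in ("summary", "enriched_text"):
            if not isinstance(data.get(name), str):
                raise ValueError(f"missing or non-string field: {name}")
        points = data.get("key_points") or []
        if not all(isinstance(p, str) for p in points):
            raise ValueError("key_points must be a list of strings")
        return cls(
            summary=data["summary"],
            enriched_text=data["enriched_text"],
            title=data.get("title"),
            key_points=list(points),
        )
```

Validating before writing means a malformed model response fails on the chunk that caused it instead of producing a broken JSONL row.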

## 3) Inspect outputs

In [None]:
for p in sorted(out_dir.glob('*')):
    print(p.name)
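Listing file names confirms that something was written, but not what. A short sketch for reading one of the JSONL outputs back is given here; `read_jsonl` and `preview` are hypothetical helpers, and the field names are the ones the writer cell in section 2 emits.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line from a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def preview(records, width=60):
    """Return one short line per record: its index and the start of its text."""
    lines = []
    for rec in records:
        # LLM runs store 'enriched_text'; basic runs store 'raw_text'.
        text = rec.get("enriched_text") or rec.get("raw_text") or ""
        lines.append(f"{rec.get('index')}: {text[:width]}".replace("\n", " "))
    return lines
```

A typical use would be `print('\n'.join(preview(read_jsonl(jsonl_path))))` after the section 2 cell has finished.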