
# Docling Semantic Extraction — Sample Notebook

This notebook shows a **practical Docling pipeline** to extract **semantic groups** from PDFs (headings, paragraphs, lists, tables, figures/captions, etc.), and to export to Markdown/HTML/JSON for downstream use.

> **Requirements**: Python 3.10+ recommended. Run the first cell to install packages in your own environment.


In [None]:

# If running on your own machine, uncomment and run this cell.
# It installs Docling and optional OCR backends.
# Note: Internet access is required for installation.
# !pip install -U pip
# !pip install docling
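# Optional OCR choices (choose one or more). Extra names vary by release,
# so check the installation docs for your Docling version:
# !pip install "docling[easyocr]"    # EasyOCR (bundled by default in some releases)
# !pip install "docling[tesserocr]"  # Tesseract via tesserocr (requires Tesseract installed in OS)
# !pip install "docling[rapidocr]"   # RapidOCR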


In [None]:

from pathlib import Path
import json

# Docling imports
try:
    # Core converter
    from docling.document_converter import DocumentConverter
    # Document model returned by the converter (defined in the docling-core package)
    from docling_core.types.doc import DoclingDocument
except ImportError:
    # Import fails if Docling has not been installed yet
    print("Reminder: Install docling first. See the first cell.")
    raise
# (Optional) PDF configuration helpers for custom pipelines:
# see https://docling-project.github.io/docling/examples/custom_convert/

# Folder setup
INPUT_DIR = Path("input_docs")
OUTPUT_DIR = Path("outputs")
INPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Input folder: {INPUT_DIR.resolve()}")
print(f"Output folder: {OUTPUT_DIR.resolve()}")



## 1) Quickstart — Convert a PDF

Supply a **local file path** or a **URL** to `DocumentConverter().convert(...)`.

The resulting `result.document` is a `DoclingDocument` you can export or traverse.


In [None]:

# Path to your PDF. Examples:
# source = 'https://arxiv.org/pdf/2408.09869'  # URL (requires internet when running locally)
# source = str(INPUT_DIR / 'sample.pdf')       # Local file
source = str(INPUT_DIR / 'sample.pdf')  # <- replace with your PDF path

converter = DocumentConverter()
result = converter.convert(source)
doc = result.document

print(type(doc))
print("Pages:", len(doc.pages))
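
Before converting, it can help to confirm that a local source actually exists, since a missing file otherwise surfaces as an error from inside the converter. A minimal sketch, assuming a hypothetical helper named `check_source` (not part of Docling) that passes http(s) URLs through unchecked:

```python
from pathlib import Path
from urllib.parse import urlparse

def check_source(source):
    """Return the source unchanged if it is an http(s) URL or an existing file."""
    if urlparse(str(source)).scheme in ("http", "https"):
        return str(source)  # URLs are not verified here; the converter fetches them
    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return str(path)

print(check_source("https://arxiv.org/pdf/2408.09869"))
```

In this notebook you could call `converter.convert(check_source(source))` to fail early with a clear message.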



## 2) Export to Markdown / HTML / JSON

Docling can export to common formats while preserving structure and reading order.


In [None]:

md_path = OUTPUT_DIR / 'document.md'
html_path = OUTPUT_DIR / 'document.html'
json_path = OUTPUT_DIR / 'document.json'

# Markdown
md_text = doc.export_to_markdown()
md_path.write_text(md_text, encoding='utf-8')

# HTML
html_text = doc.export_to_html()
html_path.write_text(html_text, encoding='utf-8')

# JSON (rich structured output)
doc_json = doc.export_to_dict()
json_path.write_text(json.dumps(doc_json, ensure_ascii=False, indent=2), encoding='utf-8')

print("Wrote:", md_path, html_path, json_path, sep="\n- ")
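
A quick way to sanity-check the JSON export is to count items per label. The sketch below assumes the exported dictionary has top-level `texts`, `tables`, and `pictures` lists whose entries carry a `label` key; verify this against your own `document.json`. `count_labels` is a hypothetical helper, and the sample is hand-written, not real Docling output.

```python
from collections import Counter

def count_labels(doc_dict):
    """Count items per label across the texts, tables, and pictures lists."""
    counts = Counter()
    for key in ("texts", "tables", "pictures"):
        for item in doc_dict.get(key, []):
            counts[item.get("label", "unknown")] += 1
    return counts

# Hand-written stand-in for an exported document (assumed layout)
sample = {
    "texts": [
        {"label": "section_header", "text": "Introduction"},
        {"label": "text", "text": "First paragraph."},
        {"label": "text", "text": "Second paragraph."},
    ],
    "tables": [{"label": "table"}],
    "pictures": [],
}

print(dict(count_labels(sample)))
```

With a real export you would pass the dictionary loaded from `document.json` instead of `sample`.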



## 3) Traverse Semantic Blocks

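A `DoclingDocument` stores its content as a tree of **items** rather than per-page blocks. Each item carries a `label` (e.g., `title`, `section_header`, `text`, `list_item`, `table`, `picture`, `caption`, `footnote`) and provenance that records its page number.  
Use these to build your own exporters or analytics.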


In [None]:

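# Simple traversal and preview of the first N items per page.
# iterate_items() yields (item, level) pairs in reading order.
from collections import Counter

N = 8
shown = Counter()  # number of items printed so far, keyed by page number
for item, _level in doc.iterate_items():
    page_no = item.prov[0].page_no if getattr(item, "prov", None) else None
    if shown[page_no] >= N:
        continue
    if shown[page_no] == 0:
        print(f"\n=== Page {page_no} ===")
    shown[page_no] += 1
    label = getattr(item.label, "value", item.label)
    text = getattr(item, "text", "")  # tables and pictures have no text attribute
    # Shorten for display
    text_snippet = (text[:140] + "…") if len(text) > 140 else text
    print(f"- [{label}] {text_snippet}")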



### 3a) Headings and Lists

You can filter items by label to extract headings for your ToC or list items for structured output.


In [None]:

headings = []
list_items = []

for item, _level in doc.iterate_items():
    label = getattr(item.label, "value", item.label)
    if label in ("title", "section_header"):
        headings.append(item.text)
    elif label == "list_item":
        list_items.append(item.text)

print("Headings found:", len(headings))
print("First 10 headings:", headings[:10])
print("\nList items found:", len(list_items))
print("First 10 list items:", list_items[:10])
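
For a table of contents, heading text is more useful together with its nesting depth. The sketch below assumes you have collected `(level, text)` pairs yourself. Section header items in recent Docling versions appear to expose a `level` attribute, but treat that as an assumption to verify. `render_toc` is a hypothetical helper, shown here on hand-written data.

```python
def render_toc(headings, indent="  "):
    """Render (level, text) pairs as an indented bullet outline."""
    lines = []
    for level, text in headings:
        depth = max(level, 1) - 1  # level 1 is the outermost heading
        lines.append(f"{indent * depth}- {text}")
    return "\n".join(lines)

# Hand-written example input
sample_headings = [
    (1, "Introduction"),
    (2, "Background"),
    (2, "Contributions"),
    (1, "Method"),
]

print(render_toc(sample_headings))
```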



## 4) Tables — Preview and Export to CSV

Docling parses **tables** (including cells). Below is a lightweight exporter that writes each table to CSV.


In [None]:

import csv

table_count = 0
for table_idx, table in enumerate(doc.tables, start=1):
    table_count += 1
    page_no = table.prov[0].page_no if table.prov else 0
    csv_path = OUTPUT_DIR / f"table_p{page_no}_{table_idx}.csv"
    # table.data.grid is a list of rows; each row is a list of cells with a .text field
    rows = [[cell.text for cell in row] for row in table.data.grid]

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    print("Wrote table:", csv_path)

print("Total tables exported:", table_count)
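
If the first row of an exported table is a header, the rows can be turned into records for easier downstream use. A sketch on plain lists of strings, assuming a hypothetical helper `rows_to_records`; it pads short rows with empty strings and ignores cells beyond the header width. Duplicate header names would overwrite each other, so check the header first.

```python
def rows_to_records(rows):
    """Convert [header, row1, row2, ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, *body = rows
    records = []
    for row in body:
        # Pad short rows so every header column gets a value
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records

# Hand-written example rows, as they might be read back from a CSV
sample_rows = [["name", "qty"], ["apple", "3"], ["pear"]]
print(rows_to_records(sample_rows))
```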



## 5) Figures & Captions

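Locate figures and pair each with its caption. Image bytes are only available when the converter is configured to keep them, so this cell records positions and captions only.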


In [None]:

# Collect each picture together with its caption.
# Image data is only kept when the PDF pipeline is configured with
# generate_picture_images=True; here we only record where figures occur
# and rely on exported HTML/Markdown for images.
captions = []

IMG_DIR = OUTPUT_DIR / "figures"
IMG_DIR.mkdir(exist_ok=True)

fig_count = 0
for pic_idx, pic in enumerate(doc.pictures, start=1):
    fig_count += 1
    page_no = pic.prov[0].page_no if pic.prov else None
    caption = pic.caption_text(doc)  # empty string when no caption is attached
    if caption:
        captions.append(caption)
    print(f"Figure {pic_idx} found on page {page_no}")

print("Figures found:", fig_count)
print("Captions found:", len(captions))
print("First 5 captions:", captions[:5])



## 6) Handling Scanned PDFs (OCR)

To process scanned PDFs, enable an OCR backend in the converter’s PDF pipeline options.  
The cell below shows a minimal pattern; see **Docling’s custom conversion docs** for full options.


In [None]:

# Example: enabling OCR through PDF pipeline options.
# Left commented out because it needs an OCR backend installed; check the docs for your version.
# from docling.datamodel.base_models import InputFormat
# from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
# from docling.document_converter import PdfFormatOption

# pipeline_options = PdfPipelineOptions()
# pipeline_options.do_ocr = True
# pipeline_options.ocr_options = EasyOcrOptions()  # or TesseractOcrOptions(), RapidOcrOptions()

# converter_ocr = DocumentConverter(
#     format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
# )
# doc_ocr = converter_ocr.convert(source).document
# print("OCR-enabled pages:", len(doc_ocr.pages))



## 7) Batch Conversion (Folder of PDFs)

A small helper that converts all PDFs in a folder and writes Markdown/HTML/JSON outputs.


In [None]:

def convert_one(source_path: Path, out_dir: Path):
    result = converter.convert(str(source_path))
    d = result.document

    md = d.export_to_markdown()
    html = d.export_to_html()
    js = d.export_to_dict()

    stem = source_path.stem
    (out_dir / f"{stem}.md").write_text(md, encoding="utf-8")
    (out_dir / f"{stem}.html").write_text(html, encoding="utf-8")
    (out_dir / f"{stem}.json").write_text(json.dumps(js, ensure_ascii=False, indent=2), encoding="utf-8")

    return d

# Convert all PDFs in INPUT_DIR
for pdf_path in sorted(INPUT_DIR.glob("*.pdf")):
    print("Converting:", pdf_path.name)
    _ = convert_one(pdf_path, OUTPUT_DIR)
print("Done.")
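
In a larger batch, one corrupt PDF should not abort the whole run. A minimal sketch of failure collection, assuming a hypothetical helper `convert_all` that accepts any conversion callable; the stand-in `fake_convert` only exists so the example runs without Docling.

```python
from pathlib import Path

def convert_all(paths, convert_fn):
    """Apply convert_fn to each path and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for path in paths:
        try:
            convert_fn(path)
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            failed.append((path, repr(exc)))
        else:
            succeeded.append(path)
    return succeeded, failed

# Stand-in converter for demonstration only
def fake_convert(path):
    if path.suffix != ".pdf":
        raise ValueError("not a PDF")

ok, bad = convert_all([Path("a.pdf"), Path("b.txt")], fake_convert)
print("Converted:", [p.name for p in ok])
print("Failed:", [(p.name, err) for p, err in bad])
```

In this notebook the call could look like `convert_all(sorted(INPUT_DIR.glob("*.pdf")), lambda p: convert_one(p, OUTPUT_DIR))`.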



## 8) Minimal Semantic Chunking for RAG

Create chunks by grouping adjacent items until a character budget is reached, starting a new chunk at every heading.


In [None]:

def iter_blocks(d):
    # Yield (label, text) for every document item, in reading order
    for item, _level in d.iterate_items():
        label = getattr(item.label, "value", item.label)
        text = getattr(item, "text", "")  # tables and pictures carry no text
        yield label, text

HEADING_LABELS = {"title", "section_header"}

def semantic_chunks(d, max_chars=1200):
    chunks = []
    buf = []
    size = 0
    for btype, text in iter_blocks(d):
        text = text.strip()
        if not text:
            continue
        # Start a new chunk at headings, or when the budget would be exceeded
        if buf and (btype in HEADING_LABELS or size + len(text) > max_chars):
            chunks.append("\n".join(buf))
            buf, size = [], 0
        buf.append(text)
        size += len(text) + 1  # +1 accounts for the joining newline
    if buf:
        chunks.append("\n".join(buf))
    return chunks

chunks = semantic_chunks(doc, max_chars=1500)
print(f"Total chunks: {len(chunks)}")
if chunks:
    print("\nSample chunk:\n", chunks[0][:500], "…")
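
A single item longer than the budget still ends up as one oversized chunk in the loop above. One possible remedy is to split long text on whitespace before chunking. The sketch below assumes a hypothetical helper `split_long_text`; a single word longer than `max_chars` is kept whole rather than cut.

```python
def split_long_text(text, max_chars=1200):
    """Split text on whitespace into pieces of at most max_chars characters."""
    pieces, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            pieces.append(current)  # budget reached: close the current piece
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

print(split_long_text("aaa bbb ccc ddd", max_chars=7))
```

Note that runs of whitespace are collapsed to single spaces, which is usually acceptable for retrieval text.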



---

### Notes
- API names can change between Docling versions. If an attribute or method is missing in your installed version (e.g., the cell accessors on tables), check the official docs and adapt accordingly.
- For advanced pipelines (OCR, page range, GPU, backends), see the **Custom Conversion** guide.
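
Since behaviour depends on the installed release, it is worth printing the versions when reporting or debugging a problem. A small sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper, and the distribution names `docling` and `docling-core` are the ones assumed to be published on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

for name in ("docling", "docling-core"):
    print(name, installed_version(name) or "not installed")
```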

Happy parsing!
