# Document Ingestion with Docling

This notebook explores document ingestion using the Docling library.

In [None]:
# Install dependencies if needed
# !pip install docling

In [6]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
import os

In [2]:
# Set the PDF path
pdf_path = "../docs/DocLayNet.pdf"
print(f"Processing: {pdf_path}")
print(f"File exists: {os.path.exists(pdf_path)}")

Processing: ../docs/DocLayNet.pdf
File exists: True


In [25]:
# Configure pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True  # Enable OCR for scanned or image-only pages (this does not control picture extraction)
pipeline_options.do_table_structure = True  # Enable table structure detection

# Create converter
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(
        pipeline_options=pipeline_options,
        backend=PyPdfiumDocumentBackend
    )}
)
print("Converter created")

Converter created


In [28]:
# Probe for a VLM model class (the module path below is a guess; the import fails with the installed docling)
try:
    from docling.models.granite_vlm_model import GraniteDoclingModel
    vlm_model = GraniteDoclingModel()
    print("VLM model loaded")
except ImportError as e:
    print(f"VLM model not available: {e}")
    vlm_model = None

VLM model not available: No module named 'docling.models.granite_vlm_model'


In [26]:
# Convert the document
result = converter.convert(pdf_path)
print(f"Conversion successful: {result.status}")
document = result.document
print(f"Document type: {type(document)}")
print(f"Number of pages: {len(document.pages)}")

[INFO] 2026-02-07 22:36:35,469 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-02-07 22:36:35,474 [RapidOCR] download_file.py:60: File exists and is valid: /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-02-07 22:36:35,474 [RapidOCR] main.py:53: Using /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-02-07 22:36:35,514 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-02-07 22:36:35,516 [RapidOCR] download_file.py:60: File exists and is valid: /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_infer.onnx
[INFO] 2026-02-07 22:36:35,517 [RapidOCR] main.py:53: Using /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_infer.onnx
[INFO] 2026-02-07 22:36:35,546 [RapidOCR] base.py:22: Using eng

Conversion successful: success
Document type: <class 'docling_core.types.doc.document.DoclingDocument'>
Number of pages: 9


In [9]:
# Extract text
full_text = document.export_to_markdown()
print("Full text (first 1000 characters):")
print(full_text[:1000])

Full text (first 1000 characters):
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com

Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com

Ahmed S. Nassar IBM Research

Rueschlikon, Switzerland ahn@zurich.ibm.com

Michele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com

Peter Staar IBM Research Rueschlikon, Switzerland taa@zurich.ibm.com

Figure 1: Four examples of complex page layouts across different document categories

<!-- image -->

## KEYWORDS

PDF document conversion, layout segmentation, object-detection, data set, Machine Learning

## ACM Reference Format:

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 2022. DocLayNet: A Large Human-Annotated Dataset for DocumentLayout Analysis. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14–18, 2022, Washingto

In [10]:
# Explore document structure
print(f"Document body has {len(document.body.children)} children")
for i, child in enumerate(document.body.children[:5]):  # First 5
    print(f"Child {i+1}: {type(child)} - {str(child)[:100]}...")

Document body has 110 children
Child 1: <class 'docling_core.types.doc.document.RefItem'> - cref='#/texts/0'...
Child 2: <class 'docling_core.types.doc.document.RefItem'> - cref='#/texts/1'...
Child 3: <class 'docling_core.types.doc.document.RefItem'> - cref='#/texts/2'...
Child 4: <class 'docling_core.types.doc.document.RefItem'> - cref='#/texts/3'...
Child 5: <class 'docling_core.types.doc.document.RefItem'> - cref='#/texts/4'...
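
Each child printed above is a `RefItem` whose `cref` reads like a JSON-pointer-style path into the document: `#/texts/0` points at the first entry of `texts`. The sketch below illustrates that lookup on a plain dictionary. `resolve_cref` and `sample_doc` are hypothetical stand-ins, not Docling API; with a real `DoclingDocument`, my understanding is that `child.resolve(document)` or `document.iterate_items()` is the usual way to reach the underlying items.

```python
def resolve_cref(doc: dict, cref: str):
    """Follow a '#/key/0'-style reference through nested dicts and lists."""
    if not cref.startswith("#/"):
        raise ValueError(f"unsupported reference: {cref!r}")
    node = doc
    for part in cref[2:].split("/"):
        # List levels are indexed by position; dict levels are looked up by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Hypothetical stand-in for a document: a flat list of texts plus a body
# whose children only hold references into that list.
sample_doc = {
    "texts": [
        {"label": "section_header", "text": "Title"},
        {"label": "text", "text": "First paragraph"},
    ],
    "body": {"children": [{"cref": "#/texts/0"}, {"cref": "#/texts/1"}]},
}

resolved = [
    resolve_cref(sample_doc, child["cref"])
    for child in sample_doc["body"]["children"]
]
print([item["text"] for item in resolved])
```

The body stores only references, which is why printing a child shows a path rather than the text itself.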


In [11]:
# Extract tables
tables = document.tables
print(f"Number of tables: {len(tables)}")
if tables:
    first_table = tables[0]
    print("First table:")
    print(first_table.export_to_markdown())

Usage of TableItem.export_to_markdown() without `doc` argument is deprecated.


Number of tables: 5
First table:
|                |         | % of Total         | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   |
|----------------|---------|--------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
| class label    | Count   | Train Test Val     |                                             | All Fin Man Sci Law Pat Ten                 |                                             |                                    
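
The deprecation warning in the output above comes from calling `export_to_markdown()` on a table without its parent document. A minimal sketch of the non-deprecated call, assuming the `doc` keyword named in the warning; `tables_to_markdown` is a hypothetical helper name.

```python
def tables_to_markdown(document, limit=None):
    """Return one Markdown string per table in the document."""
    tables = document.tables if limit is None else document.tables[:limit]
    # Passing doc= gives each table access to its parent document,
    # which is what the deprecation warning asks for.
    return [table.export_to_markdown(doc=document) for table in tables]
```

With the converted document from the earlier cell, `print(tables_to_markdown(document, limit=1)[0])` should print the first table without the warning.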

In [27]:
# Extract images (if any)
import base64
import io

from IPython.display import HTML, display

images = document.pictures
print(f"Number of images: {len(images)}")
if images:
    print("Image info:")
    for i, img in enumerate(images[:3]):
        uri = img.image.uri if img.image else "No URI"
        caption = img.caption_text(document)
        print(f"Image {i+1}: {uri} - {caption}")
        # get_image() returns None when no bitmap is attached to the picture
        pil_img = img.get_image(document)
        if pil_img:
            # Encode the picture as a base64 PNG
            buffer = io.BytesIO()
            pil_img.save(buffer, format='PNG')
            img_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
            print(f"  Base64: data:image/png;base64,{img_base64[:50]}...")  # Show first 50 chars
            # Display in notebook
            display(HTML(f'<img src="data:image/png;base64,{img_base64}" width="200"/>'))
        else:
            print("  No image data available - images may be vector graphics or text-based figures not extractable as bitmaps")
else:
    print("No images found in the document.")

Number of images: 6
Image info:
Image 1: No URI - Figure 1: Four examples of complex page layouts across different document categories
  No image data available - images may be vector graphics or text-based figures not extractable as bitmaps
Image 2: No URI - Figure 2: Distribution of DocLayNet pages across document categories.
  No image data available - images may be vector graphics or text-based figures not extractable as bitmaps
Image 3: No URI - Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
  No image data available - images may be vector graphics or text-based figures not extractable as bitmaps
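
Every picture reports `No URI`, which suggests that no bitmap was attached during conversion. A likely fix, as far as I know, is to enable `generate_picture_images` on `PdfPipelineOptions` before converting, with `images_scale` setting the render resolution. The sketch below assumes those two options behave that way in the installed version; `build_picture_converter` and `save_pictures` are hypothetical helper names, and this notebook has not run them.

```python
from pathlib import Path


def build_picture_converter(images_scale=2.0):
    """Build a converter that keeps a cropped bitmap for each detected picture."""
    # Imported lazily so the helper can be defined before docling is installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.images_scale = images_scale     # higher values give larger renders
    options.generate_picture_images = True  # attach a crop to each picture item
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


def save_pictures(document, out_dir):
    """Write every available picture to out_dir as PNG and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, picture in enumerate(document.pictures, start=1):
        image = picture.get_image(document)
        if image is None:  # no bitmap was generated for this picture
            continue
        target = out / f"picture-{index}.png"
        image.save(target, format="PNG")
        saved.append(target)
    return saved
```

A possible usage, with `figures` as an arbitrary output folder: `doc = build_picture_converter().convert(pdf_path).document`, then `save_pictures(doc, "figures")`.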


In [29]:
# Try exporting to markdown with embedded images
try:
    # This module path is a guess; the import fails with the installed docling-core (see output)
    from docling_core.transforms.export.markdown import MarkdownExportConfig

    # Create export config with embedded images
    export_config = MarkdownExportConfig()
    export_config.image_mode = "embedded"  # Intended to embed images as base64

    # Export to markdown
    markdown_output = document.export_to_markdown(export_config=export_config)
    print("Markdown export with embedded images:")
    print(markdown_output[:2000])  # Show first 2000 chars

except Exception as e:
    print(f"Error exporting to markdown: {e}")

Error exporting to markdown: No module named 'docling_core.transforms.export'
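
The import fails because `docling_core.transforms.export` is not a module in the installed docling-core. To my knowledge, the supported way to embed images is the `image_mode` argument of `export_to_markdown`, using the `ImageRefMode` enum from `docling_core.types.doc`. The sketch below assumes that API and assumes embedded pictures appear as `data:image/...` URIs in Markdown image links; both helper names are hypothetical. Pictures can only be embedded if the document was converted with picture images generated (for example with `generate_picture_images = True`).

```python
def export_markdown_with_images(document):
    """Export Markdown with pictures inlined as base64 data URIs."""
    # Imported lazily so the helper can be defined without docling-core.
    from docling_core.types.doc import ImageRefMode

    return document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)


def count_embedded_images(markdown: str) -> int:
    """Count Markdown image links whose target is an inline data URI."""
    return markdown.count("(data:image/")
```

A count of zero after exporting indicates that the document still carries no picture bitmaps, so only `<!-- image -->` placeholders were written.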


In [30]:
# Try with default converter (no custom pipeline options)
import base64
import io

from IPython.display import HTML, display

try:
    default_converter = DocumentConverter()
    default_result = default_converter.convert(pdf_path)
    default_document = default_result.document
    print("Default conversion successful")

    # Check images with default settings
    default_images = default_document.pictures
    print(f"Default conversion - Number of images: {len(default_images)}")
    if default_images:
        for i, img in enumerate(default_images[:3]):
            caption = img.caption_text(default_document)
            pil_img = img.get_image(default_document)
            if pil_img:
                print(f"Image {i+1} extracted successfully: {caption[:50]}...")
                # Encode the picture as a base64 PNG
                buffer = io.BytesIO()
                pil_img.save(buffer, format='PNG')
                img_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
                print(f"  Base64 length: {len(img_base64)} chars")
                # Display in notebook
                display(HTML(f'<img src="data:image/png;base64,{img_base64}" width="200"/>'))
            else:
                print(f"Image {i+1} not extracted: {caption[:50]}...")

except Exception as e:
    print(f"Error with default converter: {e}")

[INFO] 2026-02-07 22:39:18,013 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-02-07 22:39:18,018 [RapidOCR] download_file.py:60: File exists and is valid: /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-02-07 22:39:18,019 [RapidOCR] main.py:53: Using /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-02-07 22:39:18,057 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-02-07 22:39:18,059 [RapidOCR] download_file.py:60: File exists and is valid: /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_infer.onnx
[INFO] 2026-02-07 22:39:18,060 [RapidOCR] main.py:53: Using /home/zord/learn-rag/venv/lib/python3.10/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_infer.onnx
[INFO] 2026-02-07 22:39:18,086 [RapidOCR] base.py:22: Using eng

Default conversion successful
Default conversion - Number of images: 6
Image 1 not extracted: Figure 1: Four examples of complex page layouts ac...
Image 2 not extracted: Figure 2: Distribution of DocLayNet pages across d...
Image 3 not extracted: Figure 3: Corpus Conversion Service annotation use...


## Notes on Image Extraction

Docling detected 6 pictures in the PDF and recovered their captions, but `get_image()` returned `None` for each of the three pictures checked, with both the custom and the default converter. Every picture also reported `No URI`, which means no bitmap was attached to the picture items during conversion.

- **Why this happens**: The message printed above blames vector graphics, but this notebook never tests that guess. A more likely cause is configuration: neither converter sets `generate_picture_images = True` on `PdfPipelineOptions`, and to my knowledge Docling only stores picture crops when that option is enabled, with `images_scale` controlling the resolution. Those crops are rendered from the page, so vector figures should be covered as well.
- **Why the CLI might work**: The `docling` CLI may embed images because of its export settings rather than because of a Vision-Language Model (VLM); this notebook verifies neither explanation. The VLM attempt above failed because `docling.models.granite_vlm_model` does not exist in the installed version, so `GraniteDoclingModel`, `do_vlm`, and `vlm_model` should be treated as unverified names, not as Docling API. If VLM processing is needed, the route I am aware of is the separate `VlmPipeline` class (`docling.pipeline.vlm_pipeline`), passed to `PdfFormatOption` as `pipeline_cls`. It typically needs extra dependencies such as `transformers` and `torch`; confirm the details against the documentation for the installed version.
- **Viewing the figures**: You can open the PDF file directly (`../docs/DocLayNet.pdf`) to view the figures.
- **Alternative extraction**: If page renders are enough, libraries such as PyMuPDF (`fitz`) or pdfplumber can rasterize entire pages, which can then be cropped to the figure regions.
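
The page-rendering alternative can be sketched with PyMuPDF as follows. This assumes PyMuPDF is installed (`pip install pymupdf`) and a version recent enough that `get_pixmap` accepts a `dpi` argument; `render_pages_to_png` and the file naming scheme are hypothetical choices, and this notebook has not run the code.

```python
from pathlib import Path


def render_pages_to_png(pdf_path, out_dir, dpi=150):
    """Render each PDF page to a PNG file and return the written paths."""
    import fitz  # PyMuPDF; imported lazily so the helper can be defined without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(str(pdf_path)) as pdf:
        for index, page in enumerate(pdf, start=1):
            pixmap = page.get_pixmap(dpi=dpi)  # rasterize the whole page
            target = out / f"page-{index:03d}.png"
            pixmap.save(str(target))
            written.append(target)
    return written
```

This renders whole pages rather than individual figures, so each figure would still need to be cropped out, for example using the bounding boxes Docling reports for each picture.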

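Text and table extraction both completed: the conversion reported `success` for 9 pages, with 110 top-level body items and 5 tables. The output still deserves a spot check, since one author block is split across two paragraphs and the first table's header merges several columns (`Train Test Val`).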