<a href="https://colab.research.google.com/github/vagenas/llama_index/blob/add-docling-reader/docs/docs/examples/data_connectors/DoclingPDFReaderDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Docling PDF Reader

## Overview

[Docling](https://github.com/DS4SD/docling) extracts PDF content into a rich representation (including layout, tables, etc.), which it can export to Markdown or JSON.

In the sections below, we show how to use `DoclingPDFReader` to bring PDF files into your RAG pipeline quickly and easily, while also leveraging Docling's rich format for advanced, document-native grounding.

## Setup

In [None]:
%pip install -q llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus python-dotenv rich

Note: you may need to restart the kernel to use updated packages.


In [None]:
import os
from pathlib import Path
from tempfile import mkdtemp
from warnings import filterwarnings

import rich
from dotenv import load_dotenv
from rich.pretty import pprint

# Load variables (e.g. HF_TOKEN) from a local .env file, if present.
load_dotenv()
# Silence noisy warnings raised by dependencies.
filterwarnings(action="ignore", category=UserWarning, module="pydantic")
filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
# Disable tokenizer parallelism to avoid fork-related warnings.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


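def get_env_var(key, default=None):
    """Return `key` from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass  # secret not set in Colab; fall back to the environment
    except ImportError:
        pass  # not running in Colab
    return os.environ.get(key, default)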

In [None]:
HF_TOKEN = get_env_var("HF_TOKEN")
MILVUS_URI = get_env_var("MILVUS_URI", str(Path(mkdtemp()) / "docling.db"))
FILE_PATH = "https://arxiv.org/pdf/2408.09869"  # Docling Technical Report
QUERY = "Which are the main AI models in Docling?"

## Using Markdown export

By default, `DoclingPDFReader` exports to Markdown. Basic usage looks like this:

In [None]:
from llama_index.readers.docling import DoclingPDFReader

reader = DoclingPDFReader()
docs = reader.load_data(file_path=FILE_PATH)

Let's inspect a doc snippet:

In [None]:
md_snippet = docs[0].text[408:1200]
rich.print(f"...{md_snippet}...")

Next, we extract the chunks using a standard Markdown node parser:

In [None]:
from llama_index.core.node_parser import MarkdownNodeParser

node_parser = MarkdownNodeParser()
nodes = node_parser.get_nodes_from_documents(documents=docs)
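
To give an intuition for what heading-based Markdown chunking does, here is a minimal, standard-library-only sketch. It is a simplified illustration rather than the `MarkdownNodeParser` implementation: the function `split_markdown_by_headings` and the sample text are hypothetical, and the sketch ignores details such as `#` characters inside fenced code blocks.

```python
import re


def split_markdown_by_headings(text: str) -> list[dict]:
    """Split Markdown text into chunks, starting a new chunk at each ATX heading."""

    def has_content(chunk: dict) -> bool:
        # Keep a chunk if it has a heading or any non-blank line.
        return chunk["heading"] is not None or any(
            line.strip() for line in chunk["lines"]
        )

    chunks = []
    current = {"heading": None, "lines": []}
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            # A heading closes the running chunk and opens a new one.
            if has_content(current):
                chunks.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


# Hypothetical sample document, made up for illustration.
sample_md = (
    "Intro text.\n\n# Title\n\nFirst paragraph.\n\n"
    "## Section A\n\nDetails about A.\n"
)
chunks = split_markdown_by_headings(sample_md)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk keeps its heading next to its text, which is the kind of context a retriever can later use to match a query against the right section.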

Let's preview an example chunk:

In [None]:
pprint(nodes[7], max_length=2, max_string=250, max_depth=2)

We now put together a simple ingestion pipeline:

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core import Settings, StorageContext, VectorStoreIndex

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
Settings.llm = HuggingFaceInferenceAPI(
    token=HF_TOKEN,
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)


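def ingest(nodes):
    """Embed `nodes`, store them in Milvus, and return the resulting index."""
    vector_store = MilvusVectorStore(
        uri=MILVUS_URI,
        # Embed a probe string once to determine the vector dimension.
        dim=len(Settings.embed_model.get_text_embedding("hi")),
        # Start from a clean collection on every run.
        overwrite=True,
    )
    index = VectorStoreIndex(
        nodes=nodes,
        storage_context=StorageContext.from_defaults(
            vector_store=vector_store
        ),
        show_progress=True,
    )
    return index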

With all pieces in place, we are now ready to ask questions about our document content:

In [None]:
index = ingest(nodes)
query_res = index.as_query_engine().query(QUERY)
pprint(query_res, max_length=5, max_string=170, max_depth=4)
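
Conceptually, the query step embeds the question, retrieves the most similar chunks from the vector store, and hands them to the LLM as context. The retrieval part can be sketched with plain Python as follows. This is a toy illustration, not how Milvus or LlamaIndex implement it: the helper names (`cosine_similarity`, `top_k`) and the 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunks with made-up 3-dimensional embeddings.
toy_store = [
    ("layout analysis model", [0.9, 0.1, 0.0]),
    ("table structure model", [0.8, 0.3, 0.1]),
    ("installation instructions", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]  # stand-in for the embedded question

results = top_k(toy_query, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A real embedding model produces vectors with hundreds of dimensions, and a vector store uses index structures to avoid scoring every chunk, but the ranking idea is the same.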

## Using Docling's native format

To leverage Docling's rich native format, this time we set `export_type` to `"json"`.

In [None]:
reader = DoclingPDFReader(export_type="json")
docs = reader.load_data(file_path=FILE_PATH)

Previewing a snippet of the doc content, we see it is indeed a JSON string:

In [None]:
pprint(f"{docs[0].text[:50]}...")
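
Beyond eyeballing the first characters, one could also check programmatically whether a text payload parses as JSON. The sketch below uses only the standard library. The helper `describe_json_text` and the toy strings are hypothetical, and the keys shown are not Docling's actual schema.

```python
import json


def describe_json_text(text: str) -> str:
    """Report whether `text` parses as JSON and, for objects, list the keys."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if isinstance(parsed, dict):
        return "JSON object with keys: " + ", ".join(sorted(parsed))
    return f"JSON value of type {type(parsed).__name__}"


# Toy payloads for illustration; with the reader output one would pass docs[0].text.
toy_text = '{"name": "example", "items": []}'
summary = describe_json_text(toy_text)
print(summary)
print(describe_json_text("# A Markdown heading"))
```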

To parse and chunk this Docling-native format properly, we use a `DoclingNodeParser`:

In [None]:
from llama_index.node_parser.docling import DoclingNodeParser

node_parser = DoclingNodeParser()
nodes = node_parser.get_nodes_from_documents(documents=docs)

Inspecting a node, we see that it contains not only the text but also various metadata:

In [None]:
pprint(nodes[5], max_length=5, max_string=200, max_depth=2)

Let's now repeat the pipeline with these nodes.

As shown below, besides the response itself, we also get document-native grounding, including page number and bounding box information:

In [None]:
index = ingest(nodes)
query_res = index.as_query_engine().query(QUERY)
pprint(query_res, max_length=5, max_string=170, max_depth=5)
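
Grounding metadata of this kind can be turned into a human-readable citation. The following is a minimal sketch under stated assumptions: the metadata keys `"page"` and `"bbox"` are hypothetical placeholders (inspect a node's metadata, as done above, to see the keys your Docling version actually emits), and `format_grounding` is an illustrative helper, not part of any library.

```python
def format_grounding(metadata: dict) -> str:
    """Build a short citation string from a node's metadata dictionary."""
    # NOTE: "page" and "bbox" are assumed key names used for illustration only.
    page = metadata.get("page")
    bbox = metadata.get("bbox")
    parts = []
    if page is not None:
        parts.append(f"page {page}")
    if bbox is not None:
        coords = ", ".join(f"{value:.1f}" for value in bbox)
        parts.append(f"bbox [{coords}]")
    return "; ".join(parts) if parts else "no grounding available"


# Made-up metadata for illustration.
toy_metadata = {"page": 3, "bbox": [107.0, 406.5, 504.2, 330.1]}
citation = format_grounding(toy_metadata)
print(citation)
```

With a real query result, one would typically loop over `query_res.source_nodes` and pass each `source.node.metadata` to such a helper, after adapting the key names.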
