<a href="https://colab.research.google.com/github/vagenas/llama_index/blob/add-docling-reader/docs/docs/examples/data_connectors/DoclingPDFReaderDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Docling PDF Reader

## Overview

[Docling](https://github.com/DS4SD/docling) extracts PDF content into a rich representation (including layout, tables, etc.), which it can export to Markdown or JSON.

The `DoclingPDFReader` integrates Docling into LlamaIndex, enabling you to:
- use PDF files in your LLM applications, and
- leverage Docling's rich format for document-native grounding, such as page numbers and bounding boxes.

## Example setup

In [None]:
%pip install -q llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [None]:
from warnings import filterwarnings
import os
from dotenv import load_dotenv

load_dotenv()
filterwarnings(action="ignore", category=UserWarning, module="pydantic")
filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
# silence the tokenizers parallelism warning, see https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Using Markdown export

To create a simple RAG pipeline, we:
- define a `DoclingPDFReader`, which by default exports to Markdown, and
- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`.

In [None]:
from llama_index.readers.docling import DoclingPDFReader
from llama_index.core.node_parser import MarkdownNodeParser

reader = DoclingPDFReader()
node_parser = MarkdownNodeParser()


We now define the relevant parameters as well as a couple of useful helper functions for this example:

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.readers.base import BaseReader
from llama_index.core.node_parser import NodeParser
from llama_index.core.base.response.schema import RESPONSE_TYPE
from tempfile import mkdtemp
from pathlib import Path


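def get_env_var(key, default=None):
    """Return `key` from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass  # secret not set in Colab; fall back to the environment
    except ImportError:
        pass  # not running in Colab
    return os.environ.get(key, default)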


HF_TOKEN = get_env_var("HF_TOKEN")  # optional env var, for higher quota
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")  # temp path for this example
FILE_PATH = "https://arxiv.org/pdf/2408.09869"  # Docling Technical Report
QUERY = "Which are the main AI models in Docling?"
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
GEN_MODEL = HuggingFaceInferenceAPI(
    token=HF_TOKEN,
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)


def ingest(reader: BaseReader, node_parser: NodeParser) -> VectorStoreIndex:
    """Load FILE_PATH, split it into nodes, and index them in a Milvus store."""
    vector_store = MilvusVectorStore(
        uri=MILVUS_URI,
        # embed a throwaway string once to read off the vector dimension
        dim=len(EMBED_MODEL.get_text_embedding("hi")),
        overwrite=True,  # start from an empty collection on every call
    )
    docs = reader.load_data(file_path=FILE_PATH)
    nodes = node_parser.get_nodes_from_documents(documents=docs)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(
        nodes=nodes,
        embed_model=EMBED_MODEL,
        storage_context=storage_context,
        show_progress=True,
    )
    return index


def print_qa(query_res: RESPONSE_TYPE):
    """Print the question, the answer, and each source node with its metadata."""
    print(f"Question:\n{QUERY}\n\nAnswer:\n{query_res.response.strip()}")
    for i, res in enumerate(query_res.source_nodes):
        print()
        print(f"Source {i+1}:")
        print(f"  text: {repr(res.text.strip())}")
        for key in res.metadata:
            print(f"  {key}: {res.metadata.get(key)}")

With all pieces in place, we are now ready to ingest into the vector store and ask questions about our document content:

In [None]:
index = ingest(reader=reader, node_parser=node_parser)
query_res = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print_qa(query_res=query_res)

Generating embeddings:   0%|          | 0/33 [00:00<?, ?it/s]

Question:
Which are the main AI models in Docling?

Answer:
The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.

Source 1:
  text: '3.2 AI models\n\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.'
  dl_doc_hash: 556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa
  Header_2: 3.2 AI models
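
The `Header_2` entry in the source metadata above suggests that the Markdown-based nodes are cut at section headings. The following is a minimal, standard-library-only sketch of that idea, not the actual `MarkdownNodeParser` implementation: the function name `split_markdown_by_headers`, the `HEADER_RE` pattern, and the sample text are illustrative assumptions, and the `Header_<level>` key naming simply mirrors the output shown above.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical pattern for ATX-style Markdown headers ("#" to "######").
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_markdown_by_headers(markdown: str) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown into (metadata, text) sections, one per header.

    Illustrative only: it does not special-case '#' lines inside fenced
    code blocks, and it is not how LlamaIndex implements its parser.
    """
    sections: List[Tuple[Dict[str, str], str]] = []
    headers: Dict[int, str] = {}  # header level -> title currently in effect
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far, tagged with the active headers.
        text = "\n".join(body).strip()
        if text:
            metadata = {
                f"Header_{level}": title for level, title in sorted(headers.items())
            }
            sections.append((metadata, text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the section that ran up to this header
            level, title = len(match.group(1)), match.group(2)
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in headers if lvl >= level]:
                del headers[stale]
            headers[level] = title
        else:
            body.append(line)
    flush()  # emit the final section
    return sections


# Made-up sample input, loosely modelled on the section title seen above.
sample = "# Docling\n\nIntro.\n\n## 3.2 AI models\n\nTwo models.\n\n## 3.3 Other\n\nMore."
for metadata, text in split_markdown_by_headers(sample):
    print(metadata, repr(text))
```

With this kind of splitting, each chunk only knows the headings it sits under. In the output above, the Markdown-based source likewise carries just a document hash and a heading, with no page or layout information.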



## Using native Docling format

To leverage Docling's rich native format, we:
- create a `DoclingPDFReader` with `export_type="json"`, and
- employ a `DoclingNodeParser` in order to appropriately parse that native Docling format.

In [None]:
from llama_index.node_parser.docling import DoclingNodeParser

reader = DoclingPDFReader(export_type="json")
node_parser = DoclingNodeParser()


Let's repeat the pipeline with this setup. Notice how, besides the response itself, we now also get document-native grounding, including page number and bounding box information:

In [None]:
index = ingest(reader=reader, node_parser=node_parser)
query_res = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print_qa(query_res=query_res)

Generating embeddings:   0%|          | 0/83 [00:00<?, ?it/s]

Question:
Which are the main AI models in Docling?

Answer:
The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.

Source 1:
  text: 'As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.'
  dl_doc_hash: 556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa
  path: #/main-text/36
  heading: 3.2 AI mod