<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/DoclingReaderDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Docling Reader

## Overview

[Docling](https://github.com/DS4SD/docling) converts PDF documents into a rich representation (including layout, tables, etc.), which it can export to Markdown or JSON.

The `DoclingReader` seamlessly integrates Docling into LlamaIndex, enabling you to:
- use PDF documents in your LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.

## Notebook setup

> 👉 For best conversion speed, use GPU acceleration whenever available (e.g. if running on Colab, use a GPU-enabled runtime).

In [None]:
%pip install -q llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-readers-file python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
import os
from dotenv import load_dotenv

load_dotenv()
source = "https://arxiv.org/pdf/2408.09869"  # Docling Technical Report
query = "Which are the main AI models in Docling?"
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
gen_model = HuggingFaceInferenceAPI(
    token=os.getenv("HF_TOKEN"),
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
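
The setup above reads the Hugging Face token with `os.getenv("HF_TOKEN")`, which silently returns `None` if the variable is missing. As a minimal sketch, a small helper can fail early with a readable message instead. Note that `require_env` is a hypothetical helper introduced here for illustration; it is not part of LlamaIndex or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to your .env file or export it before running the notebook."
        )
    return value


# Possible usage (after load_dotenv() has run):
# token = require_env("HF_TOKEN")
```

The returned value could then be passed as the `token` argument when constructing the generation model.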

## Using Markdown export

To create a simple RAG pipeline, we can:
- define a `DoclingReader`, which by default exports to Markdown, and
- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`.

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.readers.docling import DoclingReader

reader = DoclingReader()
node_parser = MarkdownNodeParser()

index = VectorStoreIndex.from_documents(
    documents=reader.load_data(source),
    transformations=[node_parser],
    embed_model=embed_model,
)
result = index.as_query_engine(llm=gen_model).query(query)
print(f"Q: {query}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Q: Which are the main AI models in Docling?
A: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.

Sources:


[('3.2 AI models\n\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',
  {'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',
   'Header_2': '3.2 AI models'}),
 ("5 Applications\n\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, p
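
The source nodes above each start at a Markdown heading, which is the general idea behind header-based splitting. The following is a deliberately simplified, standard-library-only sketch of that idea. It is not the actual `MarkdownNodeParser` implementation, and `split_by_headers` is a hypothetical name used only for illustration; among other simplifications, it does not treat `#` inside fenced code blocks specially.

```python
import re


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown text into sections, one per heading line (simplified)."""
    sections = []
    header, lines = None, []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            # Close the previous section if it has a header or any non-blank text.
            if header is not None or any(l.strip() for l in lines):
                sections.append({"header": header, "text": "\n".join(lines).strip()})
            header, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if header is not None or any(l.strip() for l in lines):
        sections.append({"header": header, "text": "\n".join(lines).strip()})
    return sections


sample = "## 3.2 AI models\n\nTwo models.\n\n## 5 Applications\n\nSearch.\n"
for section in split_by_headers(sample):
    print(section["header"], "->", section["text"])
```

Each resulting section keeps its heading alongside its text, similar in spirit to the `Header_2` metadata entry shown in the output above.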

## Using Docling format

To leverage Docling's rich native format, we:
- create a `DoclingReader` with JSON export type, and
- employ a `DoclingNodeParser` in order to appropriately parse that Docling format.

Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):

In [None]:
from llama_index.node_parser.docling import DoclingNodeParser

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
node_parser = DoclingNodeParser()

index = VectorStoreIndex.from_documents(
    documents=reader.load_data(source),
    transformations=[node_parser],
    embed_model=embed_model,
)
result = index.as_query_engine(llm=gen_model).query(query)
print(f"Q: {query}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Q: Which are the main AI models in Docling?
A: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.

Sources:


[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',
  {'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',
   'path': '#/main-text/37',
   'heading': '3.2 AI models',
   'page': 3,
   'bbox': [107.36903381347656,
    330.07513427734375,
    506.29705810546875,
    407.3725280761719]}),
 ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, spe
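
One way to put this grounding to use is to render it as a short citation string next to each answer source. The sketch below assumes the metadata keys seen in the output above (`heading`, `page`, `bbox`); these keys may differ in other versions of the integration, and `format_grounding` is a hypothetical helper, not part of LlamaIndex or Docling.

```python
def format_grounding(metadata: dict) -> str:
    """Build a human-readable citation from whichever grounding keys are present."""
    parts = []
    if "heading" in metadata:
        parts.append(f"section '{metadata['heading']}'")
    if "page" in metadata:
        parts.append(f"page {metadata['page']}")
    bbox = metadata.get("bbox")
    if bbox and len(bbox) == 4:
        # Round coordinates for display only; the raw values stay in the metadata.
        coords = ", ".join(f"{value:.1f}" for value in bbox)
        parts.append(f"bbox [{coords}]")
    return ", ".join(parts) if parts else "no grounding available"


# Metadata copied from the first source node in the output above.
example_metadata = {
    "heading": "3.2 AI models",
    "page": 3,
    "bbox": [
        107.36903381347656,
        330.07513427734375,
        506.29705810546875,
        407.3725280761719,
    ],
}
print(format_grounding(example_metadata))
# section '3.2 AI models', page 3, bbox [107.4, 330.1, 506.3, 407.4]
```

In the notebook, this could be applied to `n.metadata` for each `n` in `result.source_nodes`.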

## With Simple Directory Reader

To demonstrate usage with `SimpleDirectoryReader`, we first set up a test document directory.

In [None]:
from pathlib import Path
from tempfile import mkdtemp
import requests

tmp_dir_path = Path(mkdtemp())
r = requests.get(source)
r.raise_for_status()  # fail early if the download did not succeed
with open(tmp_dir_path / f"{Path(source).name}.pdf", "wb") as out_file:
    out_file.write(r.content)

Using the `reader` and `node_parser` definitions from any of the above variants, usage with `SimpleDirectoryReader` then looks as follows:

In [None]:
from llama_index.core import SimpleDirectoryReader

dir_reader = SimpleDirectoryReader(
    input_dir=tmp_dir_path,
    file_extractor={".pdf": reader},
)

index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(show_progress=True),
    transformations=[node_parser],
    embed_model=embed_model,
)
result = index.as_query_engine(llm=gen_model).query(query)
print(f"Q: {query}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

Loading files: 100%|██████████| 1/1 [00:10<00:00, 10.29s/file]


Q: Which are the main AI models in Docling?
A: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.

Sources:


[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',
  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmpgrhz7355/2408.09869.pdf',
   'file_name': '2408.09869.pdf',
   'file_type': 'application/pdf',
   'file_size': 5566574,
   'creation_date': '2024-10-07',
   'last_modified_date': '2024-10-07',
   'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',
   'path': '#/main-text/37',
   'headi