# Query engine for Docling-parsed Markdown files

This notebook demonstrates how to use `DoclingMdQueryEngine` for retrieval-augmented question answering over documents. It shows how to set up the engine with Docling-parsed Markdown files and run natural-language queries against the indexed data.

The `DoclingMdQueryEngine` integrates persistent ChromaDB vector storage with LlamaIndex for efficient document retrieval.
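
To make the retrieval step concrete, here is a toy, standard-library-only sketch of the idea behind retrieval-augmented question answering: score stored text chunks against the question and pass the best match to the model as context. It is not how `DoclingMdQueryEngine` is implemented. Word counts stand in for the learned embeddings a real vector store would use, and the names `embed`, `cosine`, `retrieve`, and the sample chunks are all made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real engine uses a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Made-up chunks, for illustration only
chunks = [
    "Research and development expenses increased during the fiscal year.",
    "The company leases office space in several countries.",
]
context = retrieve("How much was spent on research and development?", chunks)

# The retrieved chunk would be placed in the prompt sent to the LLM
prompt = f"Context: {context[0]}\nQuestion: How much was spent on research and development?"
print(prompt)
```

In the real engine, the scoring and storage are delegated to ChromaDB and LlamaIndex; the sketch only shows why indexing documents first lets a question be answered from their content.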

In [None]:
%pip install llama-index-vector-stores-chroma==0.4.1
%pip install llama-index==0.12.16

In [1]:
import os

import autogen

# Load the LLM configurations from a JSON file or an environment variable
config_list = autogen.config_list_from_json(env_or_file="../OAI_CONFIG_LIST")

assert len(config_list) > 0
print("models to use: ", [cfg["model"] for cfg in config_list])

# Put the OpenAI API key into the environment
os.environ["OPENAI_API_KEY"] = config_list[0]["api_key"]

models to use:  ['gpt-4o', '<your Azure OpenAI deployment name>']
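
The cell above reads the `model` and `api_key` fields from each entry of the configuration list. As a rough sketch of the shape that implies, the snippet below parses a hypothetical `OAI_CONFIG_LIST` payload with the standard `json` module. The placeholder key and the single entry are assumptions for illustration; a real file may hold more entries and more fields.

```python
import json

# Hypothetical file contents; replace the placeholder with a real key
example_config = """
[
    {"model": "gpt-4o", "api_key": "<your OpenAI API key>"}
]
"""

config_list_example = json.loads(example_config)

# Same extraction as in the cell above, applied to the example payload
models = [cfg["model"] for cfg in config_list_example]
print("models to use: ", models)
```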


In [2]:
from autogen.agentchat.contrib.rag.docling_query_engine import DoclingMdQueryEngine

# Persist the Chroma vector store under ./tmp/chroma
query_engine = DoclingMdQueryEngine(db_path="./tmp/chroma")

INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.


In [3]:
# Build the index from the parsed documents in a directory
input_dir = "/workspaces/ag2/test/agentchat/contrib/rag/pdf_parsed/"
query_engine.init_db(input_dir=input_dir)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Collection None was created in the database.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading docs from directory: /workspaces/ag2/test/agentchat/contrib/rag/pdf_parsed/
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents are loaded successfully.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:VectorDB index was created with input documents
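
Before indexing a directory, it can help to check which Markdown files it actually contains. The helper below is a small sketch using only `pathlib`; `list_markdown_files` is a hypothetical name, and whether the engine itself filters by extension or descends into subdirectories is not shown in this notebook, so treat this purely as a pre-check.

```python
import tempfile
from pathlib import Path


def list_markdown_files(input_dir: str) -> list[str]:
    """Return sorted paths of the .md files directly inside input_dir."""
    return sorted(str(p) for p in Path(input_dir).glob("*.md") if p.is_file())


# Demonstrate on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.md").write_text("# A")
    Path(tmp, "notes.txt").write_text("not Markdown")
    found_names = [Path(p).name for p in list_markdown_files(tmp)]

print(found_names)
```

Calling `list_markdown_files(input_dir)` with the directory used above would list the parsed files available for indexing.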


In [4]:
print(query_engine.get_collection_name())

docling-parsed-docs


In [5]:
question = "How much money did Nvidia spend on research and development?"
answer = query_engine.query(question)
print(answer)

NVIDIA has invested over $45.3 billion in research and development since its inception.


In [6]:
# Alternatively, initialize the database from an explicit list of files
input_docs = ["/workspaces/ag2/test/agentchat/contrib/rag/pdf_parsed/nvidia_10k_2024.md"]
query_engine.init_db(input_doc_paths=input_docs)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Collection None was created in the database.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading input doc: /workspaces/ag2/test/agentchat/contrib/rag/pdf_parsed/nvidia_10k_2024.md
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents are loaded successfully.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:VectorDB index was created with input documents


In [15]:
question = "How much money did Nvidia spend on research and development?"
answer = query_engine.query(question)
print(answer)

NVIDIA has invested over $45.3 billion in research and development since its inception.


In [17]:
new_docs = ["/workspaces/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md"]
# add_docs() requires `new_doc_dir` in this version; passing None assumes the
# engine skips directory loading when only file paths are given.
query_engine.add_docs(new_doc_dir=None, new_doc_paths=new_docs)

In [None]:
question = "How much money did Toast earn in 2024?"
answer = query_engine.query(question)
print(answer)

In 2024, Toast reported a net income of $56 million for the three months ended September 30, and a net loss of $13 million for the nine months ended September 30.
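
The cells above repeat the same query-and-print pattern for each question. A small helper can run several questions in one go. The sketch below assumes only that the engine exposes a `query(question)` method whose result can be converted to a string, as the cells above suggest; `ask_all` and the stand-in `EchoEngine` are hypothetical names used to keep the example runnable without a database.

```python
from typing import Protocol


class SupportsQuery(Protocol):
    """Anything with a query(question) method, such as the engine above."""

    def query(self, question: str): ...


def ask_all(engine: SupportsQuery, questions: list[str]) -> dict[str, str]:
    """Run each question through engine.query and collect the answers as strings."""
    return {q: str(engine.query(q)) for q in questions}


class EchoEngine:
    """Stand-in used only to make this sketch runnable without a database."""

    def query(self, question: str) -> str:
        return f"(answer to: {question})"


answers = ask_all(EchoEngine(), ["first question", "second question"])
for q, a in answers.items():
    print(q, "->", a)
```

With the real engine, `ask_all(query_engine, [...])` would return a mapping from each question to its answer.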
