# Basic RAG with Haystack

## Overview

This example uses the [Docling](https://github.com/DS4SD/docling) converter
integration for [Haystack](https://github.com/deepset-ai/haystack/), together with
an in-memory document store and retriever.

The `DoclingConverter` component enables you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich format for document-native grounding, such as the section
  headings, page numbers, and bounding boxes of each retrieved passage.

`DoclingConverter` supports two export modes:
- `ExportType.MARKDOWN`: captures each input document as a single, separate
  Haystack document, or
- `ExportType.DOC_CHUNKS` (default): chunks each input document and captures
  each individual chunk as a separate Haystack document downstream.

This example lets you explore both modes via the `EXPORT_TYPE` parameter; depending on
the value set, the ingestion and RAG pipelines are set up accordingly.
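
To make the difference between the two modes concrete, here is a toy, library-free sketch. Everything in it is a hypothetical stand-in: `ExportMode`, `to_documents`, and the blank-line splitting are illustrative assumptions, not the `docling-haystack` API or Docling's actual chunker.

```python
from enum import Enum


class ExportMode(Enum):
    # Hypothetical stand-in for the converter's export type; not the real enum.
    MARKDOWN = "markdown"
    DOC_CHUNKS = "doc_chunks"


def to_documents(sources: dict[str, str], mode: ExportMode) -> list[dict]:
    """Turn {filename: text} into a list of document-like dicts."""
    documents = []
    for name, text in sources.items():
        if mode is ExportMode.MARKDOWN:
            # One output document per input document.
            documents.append({"content": text, "meta": {"source": name}})
        else:
            # One output document per chunk. Splitting on blank lines is an
            # assumption that stands in for a real, structure-aware chunker.
            chunks = [part.strip() for part in text.split("\n\n") if part.strip()]
            for i, chunk in enumerate(chunks):
                documents.append(
                    {"content": chunk, "meta": {"source": name, "chunk": i}}
                )
    return documents


sources = {
    "report.md": "# Intro\n\nDocling parses PDFs.\n\n# Models\n\nLayout model and TableFormer."
}
whole = to_documents(sources, ExportMode.MARKDOWN)
chunked = to_documents(sources, ExportMode.DOC_CHUNKS)
print(len(whole), len(chunked))  # prints "1 4"
```

In the whole-document case, any splitting needed for retrieval would have to happen in a later pipeline step; in the chunked case, each stored document is already retrieval-sized.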

## Setup

In [1]:
# TODO: uncomment when package available on PyPI:
# %pip install -q --progress-bar off docling-haystack haystack-ai docling python-dotenv

%pip install -q --progress-bar off haystack-ai docling python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [2]:
from dotenv import load_dotenv

_ = load_dotenv()

In [3]:
import os

HF_TOKEN = os.getenv("HF_API_KEY", "")
PATHS = [
    "https://arxiv.org/pdf/2408.09869",  # Docling Technical Report
    # ... additional docs can be listed here
]
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3

## Indexing pipeline

In [4]:
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from docling_haystack.converter import DoclingConverter

document_store = InMemoryDocumentStore()

idx_pipe = Pipeline()
idx_pipe.add_component("converter", DoclingConverter())
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
idx_pipe.connect("converter", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})

Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors


{'writer': {'documents_written': 54}}

## RAG pipeline

In [5]:
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.utils import Secret

prompt_template = """
    Given these documents, answer the question.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    Question: {{query}}
    Answer:
    """

rag_pipe = Pipeline()
rag_pipe.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
    "llm",
    HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": GENERATION_MODEL_ID},
        token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
    ),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")
rag_res = rag_pipe.run(
    {
        "retriever": {"query": QUESTION},
        "prompt_builder": {"query": QUESTION},
        "answer_builder": {"query": QUESTION},
    }
)
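
The retriever in the pipeline above ranks stored documents by BM25 keyword relevance and passes the top `TOP_K` to the prompt. For intuition, here is a minimal pure-Python sketch of the classic Okapi BM25 formula. It is not Haystack's implementation: the `tokenize` and `bm25_rank` helpers, the regex tokenization, and the `k1`/`b` defaults are assumptions made for illustration, and Haystack's in-memory store uses its own tokenization and scoring variant, so real scores will differ.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep alphanumeric runs (a deliberate simplification).
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(
    query: str,
    docs: list[str],
    top_k: int = 3,
    k1: float = 1.5,
    b: float = 0.75,
) -> list[tuple[int, float]]:
    """Return (doc_index, score) pairs for the top_k highest-scoring docs."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


docs = [
    "Docling uses a layout analysis model and TableFormer.",
    "Haystack pipelines connect components.",
    "The weather is sunny today.",
]
ranked = bm25_rank("Which AI models does Docling use?", docs, top_k=2)
print(ranked[0][0])  # index of the best match; here the first document
```

Note that only the exact token `docling` matches here: `models` does not match `model`, and `use` does not match `uses`. Purely lexical retrieval of this kind depends on word overlap between the question and the stored chunks.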



In [6]:
from docling.chunking import DocChunk

print(f"Question:\n{QUESTION}\n")
print(f"Answer:\n{rag_res['answer_builder']['answers'][0].data.strip()}\n")
print("Sources:")
sources = rag_res["answer_builder"]["answers"][0].documents
for source in sources:
    doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
    print(f"- text: {repr(doc_chunk.text)}")
    if doc_chunk.meta.origin:
        print(f"  file: {doc_chunk.meta.origin.filename}")
    if doc_chunk.meta.headings:
        print(f"  section: {' / '.join(doc_chunk.meta.headings)}")
    bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
    print(
        f"  page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
        f"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]"
    )

Question:
Which are the main AI models in Docling?

Answer:
The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, while TableFormer is a state-of-the-art table structure recognition model. These models are provided with pre-trained weights and a separate package for the inference code as docling-ibm-models. They are also used in the open-access deepsearch-experience, a cloud-native service for knowledge exploration tasks.

Sources:
- text: 'As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code 