### References:
* [docling_rag repository by shashanka300](https://github.com/shashanka300/docling_rag?utm_source=pocket_shared)
* [Multimodal RAG system with Docling and Granite](https://www.ibm.com/think/tutorials/build-multimodal-rag-langchain-with-docling-granite?utm_source=pocket_saves)
* [LlamaIndex example of a RAG workflow with reranking](https://docs.llamaindex.ai/en/stable/examples/workflow/rag/)

In [1]:
import converter

In [2]:
processor = converter.DocumentProcessor(summarizer_model='gemma3:4b')

In [3]:
conversions = processor.convert('/Users/rauldemaio/Projects Local/agent_rag/data/fattura.pdf')

100%|██████████| 1/1 [00:11<00:00, 11.05s/it]


In [4]:
nodes, mapping = processor.chunk_documents(conversions)

Usage of TableItem.export_to_markdown() without `doc` argument is deprecated.


In [5]:
nodes = processor.add_embeddings(nodes, method='summary')

100%|██████████| 9/9 [01:02<00:00,  6.95s/it]


In [6]:
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core import VectorStoreIndex
#vectorstore = SimpleVectorStore()
#vectorstore.add(nodes)

index = VectorStoreIndex(nodes, embed_model=processor.embed_model)
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="refine",
    llm=processor.summarizer,
)

In [7]:
# response_mode only affects response synthesis, so it is not passed to the retriever
index.as_retriever(similarity_top_k=5).retrieve(
    "Which items have been purchased?"
)

[NodeWithScore(node=IndexNode(id_='0df9ac65-dcd0-4dcb-9d28-81877ef679e9', embedding=None, metadata={'source': '/Users/rauldemaio/Projects Local/agent_rag/data/fattura.pdf', 'ref': '#/tables/1', 'page_info': 2, 'content_type': 'TABLE', 'headings': ['Tipo documento: Documento commerciale'], 'doc_id': 0}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.PARENT: '4'>: RelatedNodeInfo(node_id='text_5', node_type=None, metadata={}, hash=None)}, metadata_template='{key}: {value}', metadata_separator='\n', text='Here’s a summary of the invoice data:\n\nThis invoice (Documento di trasporto: 245237849) was shipped on September 23, 2024, by TWS. It includes one item: STOKKE YOYO Car Seat Adapters (Codice articolo: 1283521), with a quantity of 1 EA. The invoiced amount is €57.37, referencing order number 2207610837 and using the VAT code V22.', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{met

In [8]:
from RAGWorkflow import RAGWorkflow
w = RAGWorkflow()

In [9]:
result = await w.run(
    query="What is the total cost, included VAT, for the purchased items?",
    index=index
)

async for chunk in result.async_response_gen():
    print(chunk, end="", flush=True)

Query the database with: What is the total cost, included VAT, for the purchased items?
Retrieved 5 nodes.


WorkflowRuntimeError: Error in step 'rerank': 'coroutine' object has no attribute 'summarizer'
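
An error of the form `'coroutine' object has no attribute ...` usually means that an `async` function was called without `await`, so the code holds a coroutine object instead of its result. The `RAGWorkflow` source is not shown here, so this is only a guess at the cause. The sketch below reproduces the same message with hypothetical names (`Processor`, `load_processor`, `rerank_buggy`, `rerank_fixed`), none of which come from the workflow itself.

```python
import asyncio


class Processor:
    """Hypothetical stand-in for an object that exposes `summarizer`."""

    summarizer = "gemma3:4b"


async def load_processor() -> Processor:
    # Pretend some asynchronous setup happens here.
    await asyncio.sleep(0)
    return Processor()


async def rerank_buggy() -> str:
    processor = load_processor()  # missing `await`: this is a coroutine object
    try:
        return processor.summarizer
    except AttributeError as exc:
        return str(exc)
    finally:
        processor.close()  # close the unused coroutine to avoid a "never awaited" warning


async def rerank_fixed() -> str:
    processor = await load_processor()  # awaiting yields the Processor instance
    return processor.summarizer


buggy_message = asyncio.run(rerank_buggy())
fixed_value = asyncio.run(rerank_fixed())
print(buggy_message)
print(fixed_value)
```

Inside a notebook cell, where an event loop is already running, the two calls would be written as `await rerank_buggy()` and `await rerank_fixed()` rather than going through `asyncio.run`.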

In [26]:
result = await w.run(
    query="Which items have been purchased? Answer in italian.",
    index=index
)

async for chunk in result.async_response_gen():
    print(chunk, end="", flush=True)

Query the database with: Which items have been purchased? Answer in italian.
Retrieved 5 nodes.
Which items have been purchased? Answer in italian.
Reranked nodes to 2
Sono stati acquistati adattatori per seggiolino Stokke YOYO (codice articolo 1283521) per un totale di 1 EA.

In [27]:
result = await w.run(
    query="Who is the owner of the invoice? Answer in italian.",
    index=index
)

async for chunk in result.async_response_gen():
    print(chunk, end="", flush=True)

Query the database with: Who is the owner of the invoice? Answer in italian.
Retrieved 5 nodes.
Who is the owner of the invoice? Answer in italian.
Reranked nodes to 2
L'intestatario della fattura è indicato come CSE003.

In [28]:
result = await w.run(
    query="Which is the shipping or delivery address? What is the delivery and invoice dates? Are they different? Answer in italian.",
    index=index
)

async for chunk in result.async_response_gen():
    print(chunk, end="", flush=True)

Query the database with: Which is the shipping or delivery address? What is the delivery and invoice dates? Are they different? Answer in italian.
Retrieved 5 nodes.
Which is the shipping or delivery address? What is the delivery and invoice dates? Are they different? Answer in italian.
Reranked nodes to 2
L’indirizzo di consegna è Via vestricio spurinna 57 roma, 00175 ITA. Non sono indicate date di fattura o di consegna.

In [29]:
result = await w.run(
    query="What is the delivery and invoice dates? Are they different? Answer in italian.",
    index=index
)

async for chunk in result.async_response_gen():
    print(chunk, end="", flush=True)

Query the database with: What is the delivery and invoice dates? Are they different? Answer in italian.
Retrieved 5 nodes.
What is the delivery and invoice dates? Are they different? Answer in italian.
Reranked nodes to 2
La data di fattura è 23/09/2024. La data di spedizione è il 23 settembre 2024. Sono uguali.

---

Alternative:
1. Read the document
2. Extract the Markdown
    * Use a Markdown node parser to obtain the text nodes
3. Extract the tables
    * Build the index object

Otherwise, use Docling's hierarchical chunker, although it offers little control over how the chunks are built.
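
The split between text and tables in the alternative above can be sketched with the standard library alone. This is not the LlamaIndex Markdown node parser or the Docling chunker: `split_markdown` is a hypothetical helper, the rule "a line starting with `|` is a table row" is a simplifying assumption, and the sample Markdown is made up for illustration.

```python
from typing import Dict, List


def split_markdown(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into consecutive TEXT and TABLE segments.

    Assumption: a line belongs to a table when it starts with '|'
    once leading whitespace is stripped.
    """
    segments: List[Dict[str, str]] = []
    buffer: List[str] = []
    current = "TEXT"

    def flush() -> None:
        # Store the buffered lines as one segment, skipping empty ones.
        text = "\n".join(buffer).strip()
        if text:
            segments.append({"content_type": current, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        kind = "TABLE" if line.strip().startswith("|") else "TEXT"
        if kind != current:
            flush()
            current = kind
        buffer.append(line)
    flush()
    return segments


# Made-up sample, only to exercise the helper.
sample = """# Invoice

Shipped by courier.

| Item | Qty |
| --- | --- |
| Adapter | 1 |

Thank you.
"""

segments = split_markdown(sample)
for segment in segments:
    print(segment["content_type"], "->", segment["text"].splitlines()[0])
```

Each segment could then become a node with its own `content_type` metadata, in the same spirit as the `TextNode` objects built later in this notebook.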


Once the index and the retriever are available, one option is an [agent with llama_index](https://docs.llamaindex.ai/en/stable/use_cases/agents/).

----

First version, with the code inline rather than in a .py file

In [None]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
    TableFormerMode
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from sentence_transformers import SentenceTransformer

from typing import List, Dict, Any
from tqdm.notebook import tqdm

In [None]:
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options.lang = ["en","it"]
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.MPS
)

converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options
                )
            }
        )

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding('sentence-transformers/all-MiniLM-L6-v2')

# embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
sources = ['/Users/rauldemaio/Projects Local/agent_rag/data/fattura.pdf']
conversions = {}

for source in tqdm(sources):
    # Convert the document to a Document object
    conversions[source] = converter.convert(source=source).document


In [None]:
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.document import TableItem

doc_id = 0



In [None]:
def extract_chunk_metadata(chunk) -> Dict[str, Any]:
    """Extract essential metadata from a chunk"""
    metadata = {
        "text": chunk.text,
        "headings": [],
        "page_info": None,
        "content_type": None
    }
        
    if hasattr(chunk, 'meta'):
        # Extract headings
        if hasattr(chunk.meta, 'headings') and chunk.meta.headings:
            metadata["headings"] = chunk.meta.headings
            
        # Extract page information and content type
        if hasattr(chunk.meta, 'doc_items'):
            for item in chunk.meta.doc_items:
                if hasattr(item, 'label'):
                    metadata["content_type"] = str(item.label)
                    
                if hasattr(item, 'prov') and item.prov:
                    for prov in item.prov:
                        if hasattr(prov, 'page_no'):
                            metadata["page_info"] = prov.page_no
        
    return metadata

In [None]:
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo

nodes = []
# NOTE: ideally the chunker's tokenizer matches the embedding model
# (all-MiniLM-L6-v2 above), so that token limits line up.
chunker = HybridChunker(tokenizer="jinaai/jina-embeddings-v3")

for source, docling_document in conversions.items():
    for chunk in chunker.chunk(docling_document):
        items = chunk.meta.doc_items

        # Skip table-only chunks: tables are processed in a later cell
        if len(items) == 1 and isinstance(items[0], TableItem):
            continue

        # Space-separated references of the document items behind this chunk
        refs = " ".join(item.get_ref().cref for item in items)
        print(refs)

        chunk_metadata = extract_chunk_metadata(chunk)

        node = TextNode(
            id_=str(doc_id + 1),
            text=chunk.text,
            # embedding=embed_model.encode(text),
            metadata={
                'source': source,
                'ref': refs,
                "page_info": chunk_metadata["page_info"],
                "content_type": chunk_metadata["content_type"],
                "headings": chunk_metadata["headings"],
            },
        )
        nodes.append(node)
        doc_id += 1
        # TODO: add relationships between nodes
        



In [None]:
from docling_core.types.doc.labels import DocItemLabel

# Continue numbering after the text nodes
doc_id = len(nodes)

tables = []

for source, docling_document in conversions.items():
    for table in docling_document.tables:
        # Only genuine tables become table nodes
        if table.label != DocItemLabel.TABLE:
            continue

        ref = table.get_ref().cref
        print(ref)

        # Pass the parent document: export_to_markdown() without `doc` is deprecated
        text = table.export_to_markdown(doc=docling_document)

        node = TextNode(
            id_=str(doc_id + 1),
            text=text,
            # embedding=embed_model.encode(text),
            metadata={
                'source': source,
                'ref': ref,  # reference of this table within the document
                "content_type": 'TABLE',
                # 'data': table.export_to_dataframe()
            },
        )

        # Append inside the loop body so that only TABLE items are stored
        tables.append(node)
        doc_id += 1

            

In [None]:
import itertools

for document in itertools.chain(nodes, tables):

    print(f"Document ID: {document.id_}")

    print(f"Source: {document.metadata['source']}")

    print(f"Content Type: {document.metadata['content_type']}")

    print(f"Content:\n{document.text[:50]}...")

    print("=" * 80) # Separator for clarity



In [None]:
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="gemma3:4b", request_timeout=60.0)

In [None]:
from llama_index.core.vector_stores import SimpleVectorStore

vectorstore = SimpleVectorStore()

In [None]:
for node in nodes:

    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode='all')
        )

In [None]:
for table in tables:
    table.embedding = embed_model.get_text_embedding(
        table.get_content()
        )

In [None]:
vectorstore.add(list(itertools.chain(nodes, tables)))

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(list(itertools.chain(nodes, tables)), embed_model=embed_model)

In [None]:
query_engine = index.as_query_engine(similarity_top_k=3, response_mode="tree_summarize")

In [None]:
response = query_engine.query("Quale è la data della spedizione?")
display(response)
for node in response.source_nodes:
    print(f"Node ID: {node.id_}")
    print(f"Source: {node.metadata['source']}")
    print(f"Content Type: {node.metadata['content_type']}")
    print(f"Content:\n{node.text}.")
    print(f"Score: {node.score}")
    print("=" * 80)  # Separator for clarity

In [None]:
# add a summarizer for the tables
# add reranking
# adopt an agent tool or a workflow
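
For the reranking item in the to-do list above, a first experiment could look like the sketch below. It is a deliberately simple stand-in: the token-overlap score is an assumption chosen for illustration and not the LLM-based reranker of the referenced LlamaIndex workflow. The function name `rerank` and the sample texts are hypothetical.

```python
from typing import List, Tuple


def rerank(query: str, texts: List[str], top_n: int = 2) -> List[Tuple[float, str]]:
    """Keep the top_n texts, scored by the fraction of query tokens they contain."""
    query_tokens = set(query.lower().split())
    scored: List[Tuple[float, str]] = []
    for text in texts:
        text_tokens = set(text.lower().split())
        overlap = len(query_tokens & text_tokens) / max(len(query_tokens), 1)
        scored.append((overlap, text))
    # Highest score first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Made-up retrieved texts, only to exercise the function.
retrieved = [
    "shipping date is 23/09/2024",
    "the item is a car seat adapter",
    "vat code V22",
]
top = rerank("what is the shipping date", retrieved, top_n=2)
for score, text in top:
    print(f"{score:.2f}  {text}")
```

In the real pipeline the scoring function would be replaced by a call to the summarizer LLM or a cross-encoder, while the "retrieve many, keep few" shape stays the same.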
