The goal of this notebook is to find a chunking strategy that makes proper use of documents converted with Docling.
A quick test with DoclingReader in llama-index suggests that some of the information held in a DoclingDocument is lost when going through llama-index. This makes it necessary to experiment with both approaches (plain Docling and DoclingReader in llama-index) to find the most adaptable solution.

In [1]:
from pathlib import Path

all_files_gen = Path("./data/").rglob("*")
all_files = [f.resolve() for f in all_files_gen]
all_pdf_files = [f for f in all_files if f.suffix.lower() == ".pdf"]
len(all_pdf_files)

3

In [3]:
from docling.document_converter import DocumentConverter

# convert() returns a ConversionResult; the DoclingDocument itself is in its .document attribute
doc = DocumentConverter().convert(all_pdf_files[0])
doc

ConversionResult(input=InputDocument(file=PosixPath('/Users/rauldemaio/Projects Local/agent_rag/data/bonifico.pdf'), document_hash='4ba901717f73b2ec7617e4fd676460c0d4ee2aabe75d2d4317453c39f0528bf7', valid=True, limits=DocumentLimits(max_num_pages=9223372036854775807, max_file_size=9223372036854775807, page_range=(1, 9223372036854775807)), format=<InputFormat.PDF: 'pdf'>, filesize=7680, page_count=2), status=<ConversionStatus.SUCCESS: 'success'>, errors=[], pages=[Page(page_no=0, size=Size(width=595.280029296875, height=841.8800048828125), cells=[Cell(id=0, text='Eseguito Bonifico Europeo Unico in data 20.11.2024', bbox=BoundingBox(l=72.08000354743776, t=135.5440007861405, r=337.436016607009, b=146.34400084877927, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=1, text='20.11.2024 21:38', bbox=BoundingBox(l=72.08000354743776, t=149.42400086664293, r=157.07600773054014, b=160.22400092928183, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>)), Cell(id=2, text='Internet Banking', bbo

Let's start by analysing the result of DocumentConverter.

In [5]:
for key,value in doc.model_dump().items():
    print(f'Component {key} - Value:  {value}')
    print('##############################################')

Component input - Value:  {'file': PosixPath('/Users/rauldemaio/Projects Local/agent_rag/data/bonifico.pdf'), 'document_hash': '4ba901717f73b2ec7617e4fd676460c0d4ee2aabe75d2d4317453c39f0528bf7', 'valid': True, 'limits': {'max_num_pages': 9223372036854775807, 'max_file_size': 9223372036854775807, 'page_range': (1, 9223372036854775807)}, 'format': <InputFormat.PDF: 'pdf'>, 'filesize': 7680, 'page_count': 2}
##############################################
Component status - Value:  ConversionStatus.SUCCESS
##############################################
Component errors - Value:  []
##############################################
Component pages - Value:  [{'page_no': 0, 'size': {'width': 595.280029296875, 'height': 841.8800048828125}, 'cells': [{'id': 0, 'text': 'Eseguito Bonifico Europeo Unico in data 20.11.2024', 'bbox': {'l': 72.08000354743776, 't': 135.5440007861405, 'r': 337.436016607009, 'b': 146.34400084877927, 'coord_origin': <CoordOrigin.TOPLEFT: 'TOPLEFT'>}}, {'id': 1, 'text': '20

* input: metadata about the input file (path, hash, format, size, page count)
* status: the conversion status
* errors: empty in this case; presumably the list of errors encountered during conversion
* pages: the list of all converted pages
* assembled: the interesting one; it seems to be the list of all extracted components after assembly
* document: the DoclingDocument object itself

In [12]:
for key,value in doc.assembled.model_dump().items():
    print(f'Component {key} - Value:  {value}')
    print('##############################################')

Component elements - Value:  [{'label': <DocItemLabel.KEY_VALUE_REGION: 'key_value_region'>, 'id': 59, 'page_no': 0, 'cluster': {'id': 59, 'label': <DocItemLabel.KEY_VALUE_REGION: 'key_value_region'>, 'bbox': {'l': 72.08000354743776, 't': 135.5440007861405, 'r': 518.7880255322993, 'b': 758.5740043996468, 'coord_origin': <CoordOrigin.TOPLEFT: 'TOPLEFT'>}, 'confidence': 0.5649534463882446, 'cells': [{'id': 0, 'text': 'Eseguito Bonifico Europeo Unico in data 20.11.2024', 'bbox': {'l': 72.08000354743776, 't': 135.5440007861405, 'r': 337.436016607009, 'b': 146.34400084877927, 'coord_origin': <CoordOrigin.TOPLEFT: 'TOPLEFT'>}}, {'id': 1, 'text': '20.11.2024 21:38', 'bbox': {'l': 72.08000354743776, 't': 149.42400086664293, 'r': 157.07600773054014, 'b': 160.22400092928183, 'coord_origin': <CoordOrigin.TOPLEFT: 'TOPLEFT'>}}, {'id': 2, 'text': 'Internet Banking', 'bbox': {'l': 72.08000354743776, 't': 180.7740010484696, 'r': 153.06800753328528, 'b': 191.57400111110826, 'coord_origin': <CoordOrigi

In [16]:
from docling.chunking import BaseChunker, HybridChunker, HierarchicalChunker

In [45]:
chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc.document)

Token indices sequence length is longer than the specified maximum sequence length for this model (557 > 512). Running this sequence through the model will result in indexing errors


In [46]:
for i, chunk in enumerate(chunk_iter):
    print(f"=== {i} ===")
    print(f"chunk.text:\n{repr(f'{chunk.text[:300]}…')}")

    enriched_text = chunker.serialize(chunk=chunk)
    print(f"chunker.serialize(chunk):\n{repr(f'{enriched_text[:300]}…')}")
    print()


=== 0 ===
chunk.text:
"Internet Banking\nVi confermiamo il Vostro ordine di Bonifico Europeo Unico in data 20.11.2024\nNumero ordine\nINTER20241120BOSBE350192748\nOrdinante De Maio Raul - Pantaleo Rossella Filiale ROMA-TUSCOLANA\nN. C/C\n1000/00014233\nDati dell'operazione\nBeneficiario\nEDIL FIORINI SNC\nIndirizzo\nLocalit\n-\nPaese\n…"
chunker.serialize(chunk):
'Eseguito Bonifico Europeo Unico in data 20.11.2024 20.11.2024 21:38\nInternet Banking\nVi confermiamo il Vostro ordine di Bonifico Europeo Unico in data 20.11.2024\nNumero ordine\nINTER20241120BOSBE350192748\nOrdinante De Maio Raul - Pantaleo Rossella Filiale ROMA-TUSCOLANA\nN. C/C\n1000/00014233\nDati dell…'

=== 1 ===
chunk.text:
'operazione" secondo le modalita\' concordate in sede di stipula del contratto di conto corrente e/o di successive variazioni concordate, nel quale potra\' trovare ogni dettaglio in proposito. In sede di liquidazione periodica di queste spese potra\' verificare il dettaglio dei conteggi, che viene 

In [44]:
# Inspect the structure of the first chunk only
first_chunk = next(iter(chunker.chunk(dl_doc=doc.document)))
dumped = first_chunk.model_dump()
print(dumped.keys())
print(dumped['meta'].keys())
print(dumped['meta'])

dict_keys(['text', 'meta'])
dict_keys(['schema_name', 'version', 'doc_items', 'headings', 'captions', 'origin'])
{'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/1', 'parent': {'cref': '#/groups/0'}, 'children': [], 'content_layer': <ContentLayer.BODY: 'body'>, 'label': <DocItemLabel.TEXT: 'text'>, 'prov': [{'page_no': 1, 'bbox': {'l': 72.08000354743776, 't': 661.1060038343429, 'r': 153.06800753328528, 'b': 650.3060037717042, 'coord_origin': <CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>}, 'charspan': (0, 16)}]}, {'self_ref': '#/texts/2', 'parent': {'cref': '#/groups/0'}, 'children': [], 'content_layer': <ContentLayer.BODY: 'body'>, 'label': <DocItemLabel.TEXT: 'text'>, 'prov': [{'page_no': 1, 'bbox': {'l': 72.08000354743776, 't': 644.6860037391087, 'r': 455.7080224277992, 'b': 633.88600367647, 'coord_origin': <CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>}, 'charspan': (0, 76)}]}, {'self_ref': '#/texts/3', 'parent': {'cref': '#/groups/
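
The `meta` dump above suggests that every entry in `doc_items` carries a `prov` list with a `page_no` and a `bbox`. A minimal sketch, assuming exactly the dictionary layout printed above, of how that provenance could be reduced to page-level metadata for a node. The helper name `chunk_pages` and the sample dictionary are hypothetical and written by hand; they are not part of Docling.

```python
from typing import Any


def chunk_pages(meta: dict[str, Any]) -> list[int]:
    """Return the sorted, de-duplicated page numbers a chunk was built from.

    Assumes the layout printed above:
    meta["doc_items"][i]["prov"][j]["page_no"].
    """
    pages = {
        prov["page_no"]
        for item in meta.get("doc_items", [])
        for prov in item.get("prov", [])
    }
    return sorted(pages)


# Hand-written sample mimicking the structure of chunk.model_dump()["meta"].
sample_meta = {
    "doc_items": [
        {"self_ref": "#/texts/1", "prov": [{"page_no": 1, "charspan": (0, 16)}]},
        {"self_ref": "#/texts/2", "prov": [{"page_no": 1, "charspan": (0, 76)}]},
        {"self_ref": "#/texts/9", "prov": [{"page_no": 2, "charspan": (0, 40)}]},
    ]
}

print(chunk_pages(sample_meta))
```

With a real chunk, the input would be `chunk.model_dump()["meta"]`, as in the cell above.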

In [None]:
chunker = HierarchicalChunker()


In [58]:
chunk_iter = chunker.chunk(dl_doc=doc.document)

for i, chunk in enumerate(chunk_iter):
    print(f"=== {i} ===")
    print(f"chunk.text:\n{repr(f'{chunk.text}…')}")

    enriched_text = chunker.serialize(chunk=chunk)
    print(f"chunker.serialize(chunk):\n{repr(f'{enriched_text}…')}")
    print()


=== 0 ===
chunk.text:
'Internet Banking…'
chunker.serialize(chunk):
'Eseguito Bonifico Europeo Unico in data 20.11.2024 20.11.2024 21:38\nInternet Banking…'

=== 1 ===
chunk.text:
'Vi confermiamo il Vostro ordine di Bonifico Europeo Unico in data 20.11.2024…'
chunker.serialize(chunk):
'Eseguito Bonifico Europeo Unico in data 20.11.2024 20.11.2024 21:38\nVi confermiamo il Vostro ordine di Bonifico Europeo Unico in data 20.11.2024…'

=== 2 ===
chunk.text:
'Numero ordine…'
chunker.serialize(chunk):
'Eseguito Bonifico Europeo Unico in data 20.11.2024 20.11.2024 21:38\nNumero ordine…'

=== 3 ===
chunk.text:
'INTER20241120BOSBE350192748…'
chunker.serialize(chunk):
'Eseguito Bonifico Europeo Unico in data 20.11.2024 20.11.2024 21:38\nINTER20241120BOSBE350192748…'

=== 4 ===
chunk.text:
'Ordinante De Maio Raul - Pantaleo Rossella Filiale ROMA-TUSCOLANA…'
chunker.serialize(chunk):
'Eseguito Bonifico Europeo Unico in data 20.11.2024 20.11.2024 21:38\nOrdinante De Maio Raul - Pantaleo Rossella Fi

---

Some references:

* [Recursive Retriever in Llama-index](https://docs.llamaindex.ai/en/stable/examples/query_engine/pdf_tables/recursive_retriever/?utm_source=pocket_shared): The concept of recursive retrieval is that we not only explore the directly most relevant nodes, but also explore node relationships to additional retrievers/query engines and execute them. For instance, a node may represent a concise summary of a structured table, and link to a SQL/Pandas query engine over that structured table. Then if the node is retrieved, we want to also query the underlying query engine for the answer.
* [Recursive Retriever and Document Agents](https://docs.llamaindex.ai/en/stable/examples/query_engine/recursive_retriever_agents/): This guide shows how to combine recursive retrieval and "document agents" for advanced decision making over heterogeneous documents.
* [Joint Tabular/Semantic QA over Tesla 10K](https://docs.llamaindex.ai/en/stable/examples/query_engine/sec_tables/tesla_10q_table/): In this example, we show how to ask questions over 10K with understanding of both the unstructured text as well as embedded tables.
* [Example of RAG with texts and tables](https://howaibuildthis.substack.com/p/a-guide-to-processing-tables-in-rag?utm_source=pocket_shared)
* [The same approach, explained better, but with LangChain](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb?ref=blog.langchain.dev&utm_source=pocket_saves)

Some notes:
* with Docling we have full control over PDF parsing
* if the Docling integration in llama-index (DoclingReader and DoclingNodeParser) does not satisfy our requirements for tables, we can build our own custom nodes:
    * differentiate between table nodes and text nodes
    * add relationships between nodes
    * integrate metadata from the Docling metadata
* as a second step, we may look into integrating summaries and recursive retrievers
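
The custom-node idea in the notes above can be sketched without committing to any library. This is only an illustration under stated assumptions: `SimpleNode` and `build_nodes` are hypothetical names, the input items are hand-written dictionaries rather than real Docling output, and none of this is the llama-index node API. It shows the three ingredients listed: a text/table distinction, previous/next relationships, and metadata carried over from the parsed item.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleNode:
    """Hypothetical stand-in for an index node; not a llama-index class."""
    node_id: str
    kind: str  # "text" or "table"
    text: str
    metadata: dict = field(default_factory=dict)
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def build_nodes(items: list[dict]) -> list[SimpleNode]:
    """Turn parsed items into nodes, in the order given, and link neighbours."""
    nodes = []
    for i, item in enumerate(items):
        kind = "table" if item["label"] == "table" else "text"
        nodes.append(
            SimpleNode(
                node_id=f"node-{i}",
                kind=kind,
                text=item["text"],
                metadata={"page_no": item["page_no"], "label": item["label"]},
            )
        )
    # Add previous/next relationships between consecutive nodes.
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id = nxt.node_id
        nxt.prev_id = prev.node_id
    return nodes


# Made-up items, only to exercise the sketch.
items = [
    {"label": "text", "text": "paragraph before the table", "page_no": 1},
    {"label": "table", "text": "col_a | col_b\n1 | 2", "page_no": 1},
    {"label": "text", "text": "paragraph after the table", "page_no": 1},
]
nodes = build_nodes(items)
for n in nodes:
    print(n.node_id, n.kind, n.prev_id, n.next_id)
```

Keeping `kind` as a separate field, rather than only in the metadata, would make it easy to route table nodes to a different retriever or summariser later.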

There are two key points that we need to understand and solve:
* how to combine text nodes based on their position
* how to link text nodes and table nodes so that page order is preserved
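
One possible way to approach both points, as a rough sketch rather than a settled design: sort items by page and vertical position, then merge consecutive text items while leaving tables where they fall. The `Item` class, `reading_order` and `merge_text_runs` are hypothetical names, the sample values are made up, and the sketch assumes a single-column layout and a bottom-left coordinate origin (as in the `BOTTOMLEFT` bounding boxes seen in the chunk metadata), so a larger `top` means higher on the page.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical flattened item: label, text, page and top edge of its bbox."""
    label: str  # "text" or "table"
    text: str
    page_no: int
    top: float  # bbox 't' with a bottom-left origin: larger means higher up


def reading_order(items: list[Item]) -> list[Item]:
    """Sort by page, then top-to-bottom (descending 'top' for a bottom-left origin)."""
    return sorted(items, key=lambda it: (it.page_no, -it.top))


def merge_text_runs(items: list[Item]) -> list[Item]:
    """Merge consecutive text items on the same page; tables are kept as-is."""
    merged: list[Item] = []
    for it in reading_order(items):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.label == "text"
            and it.label == "text"
            and last.page_no == it.page_no
        ):
            # Extend the current text run, keeping the position of its first item.
            merged[-1] = Item("text", last.text + "\n" + it.text, last.page_no, last.top)
        else:
            merged.append(it)
    return merged


# Made-up items, deliberately out of order.
items = [
    Item("text", "second paragraph", 1, 600.0),
    Item("table", "<table>", 1, 500.0),
    Item("text", "first paragraph", 1, 661.1),
    Item("text", "after the table", 1, 300.0),
    Item("text", "page two text", 2, 700.0),
]
result = merge_text_runs(items)
for it in result:
    print(it.page_no, it.label, repr(it.text))
```

Because a table always ends the current text run, the merged list alternates text blocks and tables in page order, which is what a previous/next link between nodes would need. Multi-column pages would require a smarter ordering than a sort on `top`.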