# Attempt 3

In [1]:

print("PDF CONVERSION CALL FLOW:")
print("=" * 50)

print("\n1. User calls DocumentConverter.convert()")
print("   → DocumentConverter._get_pipeline() selects PDF pipeline")
print("   → Returns StandardPdfPipeline or ThreadedStandardPdfPipeline")

print("\n2. Pipeline.execute() processes the document")
print("   → Per-page processing stages:")
print("     - PagePreprocessingModel: image scaling, preparation")
print("     - OcrModel: text extraction from images")
print("     - LayoutModel: detect text blocks, tables, figures")
print("     - TableStructureModel: analyze table structure")
print("     - PageAssembleModel: create page elements")

print("\n3. After all pages processed, pipeline calls _assemble_document()")
print("   → Collects all page elements into conv_res.assembled")
print("   → **CRITICAL POINT**: conv_res.document = self.reading_order_model(conv_res)")

print("\n4. ReadingOrderModel.__call__(conv_res) is invoked")
print("   → predict_reading_order(): sorts elements by reading order")
print("   → predict_to_captions(): links captions to figures/tables")
print("   → predict_to_footnotes(): links footnotes to text")
print("   → predict_merges(): merges related elements")
print("   → **KEY METHOD**: _readingorder_elements_to_docling_doc()")

print("\n5. _readingorder_elements_to_docling_doc() builds final document")
print("   → Creates empty DoclingDocument")
print("   → Iterates through sorted elements")
print("   → For each element:")
print("     - _handle_text_element() ← FILTER HERE!")
print("     - _add_child_elements() ← FILTER HERE!")
print("   → Returns populated DoclingDocument")

print("\n6. Document assembly methods where filtering should happen:")
print("   → _handle_text_element(): checks element.label before adding")
print("   → _add_child_elements(): checks child.label before adding")
print("   → These call doc.add_text(), doc.add_heading(), etc.")

PDF CONVERSION CALL FLOW:

1. User calls DocumentConverter.convert()
   → DocumentConverter._get_pipeline() selects PDF pipeline
   → Returns StandardPdfPipeline or ThreadedStandardPdfPipeline

2. Pipeline.execute() processes the document
   → Per-page processing stages:
     - PagePreprocessingModel: image scaling, preparation
     - OcrModel: text extraction from images
     - LayoutModel: detect text blocks, tables, figures
     - TableStructureModel: analyze table structure
     - PageAssembleModel: create page elements

3. After all pages processed, pipeline calls _assemble_document()
   → Collects all page elements into conv_res.assembled
   → **CRITICAL POINT**: conv_res.document = self.reading_order_model(conv_res)

4. ReadingOrderModel.__call__(conv_res) is invoked
   → predict_reading_order(): sorts elements by reading order
   → predict_to_captions(): links captions to figures/tables
   → predict_to_footnotes(): links footnotes to text
   → predict_merges(): merges related e
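
Steps 4 and 5 of the flow above can be pictured with a toy model. This is only a sketch: `ToyElement` and `assemble` are hypothetical names and plain strings stand in for `DocItemLabel` values; none of it is docling's actual code. It shows why children listed in the caption, footnote, and merge mappings are skipped at the top level, and where a label filter would sit.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ToyElement:
    """Hypothetical stand-in for a page element: an id, a label, and text."""
    cid: int
    label: str
    text: str


def assemble(
    ordered: List[ToyElement],
    captions: Dict[int, List[int]],
    footnotes: Dict[int, List[int]],
    merges: Dict[int, List[int]],
    keep_labels: Set[str],
) -> List[str]:
    # Children referenced by any mapping belong to a parent element,
    # so the top-level loop must not emit them a second time.
    skippable = {
        cid
        for mapping in (captions, footnotes, merges)
        for children in mapping.values()
        for cid in children
    }
    by_cid = {el.cid: el for el in ordered}
    out: List[str] = []
    for el in ordered:
        # The label filter sits here, next to the skippable check.
        if el.cid in skippable or el.label not in keep_labels:
            continue
        text = el.text
        # Fold merged continuations into the parent's text.
        for child_cid in merges.get(el.cid, []):
            text += " " + by_cid[child_cid].text
        out.append(text)
    return out


elements = [
    ToyElement(0, "section_header", "Abstract"),
    ToyElement(1, "text", "First half"),
    ToyElement(2, "picture", "<figure>"),
    ToyElement(3, "caption", "Figure 1"),
    ToyElement(4, "text", "second half."),
]
result = assemble(
    elements,
    captions={2: [3]},
    footnotes={},
    merges={1: [4]},
    keep_labels={"text", "section_header"},
)
print(result)  # ['Abstract', 'First half second half.']
```

The picture is dropped by the label filter, its caption is dropped because it is a mapped child, and the merged continuation is folded into its parent.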

In [1]:
from pathlib import Path
from typing import Dict, List

from docling_core.types.doc import (
    DocItemLabel,
    DoclingDocument,
    DocumentOrigin,
    GroupLabel,
    NodeItem,
    ProvenanceItem,
    RefItem,
    TableData,
)
from docling_core.types.doc.document import ContentLayer
from docling_ibm_models.list_item_normalizer.list_marker_processor import (
    ListItemMarkerProcessor,
)
from docling_ibm_models.reading_order.reading_order_rb import (
    PageElement as ReadingOrderPageElement,
    ReadingOrderPredictor,
)
from pydantic import BaseModel, ConfigDict

from docling.datamodel.base_models import (
    BasePageElement,
    Cluster,
    ContainerElement,
    FigureElement,
    Table,
    TextElement,
)
from docling.datamodel.document import ConversionResult
from docling.utils.profiling import ProfilingScope, TimeRecorder

from docling.models.readingorder_model import ReadingOrderModel

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Import torch only after setting the variables above; they must be in place before CUDA initializes.
import torch
print(f"Number of visible CUDA devices: {torch.cuda.device_count()}")

Number of visible CUDA devices: 1


In [4]:
import time
from pathlib import Path
from typing import Dict, Optional, Set

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice
from docling.datamodel.base_models import InputFormat
from docling.utils.layout_postprocessor import LayoutPostprocessor
from docling_core.types.doc import DocItemLabel

def patch_latyout_postprocessor():
    """Override the layout confidence thresholds for the labels we keep."""
    thresholds = {
        DocItemLabel.TEXT: 0.9,
        DocItemLabel.SECTION_HEADER: 0.8,
    }
    # CONFIDENCE_THRESHOLDS is a class-level dict, so this change is global to the process.
    LayoutPostprocessor.CONFIDENCE_THRESHOLDS.update(thresholds)

def patch_reading_order_model_with_filter(keep_labels: Optional[Set[DocItemLabel]] = None):
    """
    Monkey patch ReadingOrderModel so that only elements whose label is in
    keep_labels are added during document assembly.
    Use this to change the behavior of existing conversion pipelines.
    """
    if keep_labels is None:
        keep_labels = {DocItemLabel.TEXT, DocItemLabel.SECTION_HEADER}
    
    def custom_readingorder_elements_to_docling_doc(  # noqa: C901
        self,
        conv_res: ConversionResult,
        ro_elements: List[ReadingOrderPageElement],
        el_to_captions_mapping: Dict[int, List[int]],
        el_to_footnotes_mapping: Dict[int, List[int]],
        el_merges_mapping: Dict[int, List[int]],
    ) -> DoclingDocument:
        id_to_elem = {
            RefItem(cref=f"#/{elem.page_no}/{elem.cluster.id}").cref: elem
            for elem in conv_res.assembled.elements
        }
        cid_to_rels = {rel.cid: rel for rel in ro_elements}

        origin = DocumentOrigin(
            mimetype="application/pdf",
            filename=conv_res.input.file.name,
            binary_hash=conv_res.input.document_hash,
        )
        doc_name = Path(origin.filename).stem
        out_doc: DoclingDocument = DoclingDocument(name=doc_name, origin=origin)

        for page in conv_res.pages:
            page_no = page.page_no + 1
            size = page.size

            assert size is not None, "Page size is not initialized."

            out_doc.add_page(page_no=page_no, size=size)

        current_list = None
        skippable_cids = {
            cid
            for mapping in (
                el_to_captions_mapping,
                el_to_footnotes_mapping,
                el_merges_mapping,
            )
            for lst in mapping.values()
            for cid in lst
        }

        page_no_to_pages = {p.page_no: p for p in conv_res.pages}

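        for rel in ro_elements:
            # Children of the caption, footnote, and merge mappings are not emitted as top-level items.
            if rel.cid in skippable_cids:
                continue
            element = id_to_elem[rel.ref.cref]

            # Drop every element whose label is not in keep_labels.
            # Kept elements all go through _handle_text_element, so keep_labels
            # should contain only text-like labels.
            if element.label not in keep_labels:
                continue

            page_height = page_no_to_pages[element.page_no].size.height  # type: ignore

            new_item, current_list = self._handle_text_element(
                element, out_doc, current_list, page_height
            )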

            if rel.cid in el_merges_mapping.keys():
                for merged_cid in el_merges_mapping[rel.cid]:
                    merged_elem = id_to_elem[cid_to_rels[merged_cid].ref.cref]

                    self._merge_elements(
                        element, merged_elem, new_item, page_height
                    )

        return out_doc
    
    # Apply the patch. Only the document-assembly method is replaced;
    # _handle_text_element and _add_child_elements stay untouched.
    ReadingOrderModel._readingorder_elements_to_docling_doc = custom_readingorder_elements_to_docling_doc
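
The patch above replaces the method for the rest of the session. If the original behavior is needed again without restarting the kernel, one option is to make the patch reversible. The following is a minimal sketch of that idea, assuming a hypothetical helper `patched_method` and a toy class `ToyModel` in place of `ReadingOrderModel`; it is not part of docling.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def patched_method(cls: type, name: str, replacement: Callable[..., Any]) -> Iterator[None]:
    """Temporarily replace cls.<name>, restoring the original on exit."""
    original = getattr(cls, name)
    setattr(cls, name, replacement)
    try:
        yield
    finally:
        # Restore even if the body raised.
        setattr(cls, name, original)


class ToyModel:
    """Hypothetical stand-in for the class being patched."""

    def build(self, labels):
        return list(labels)


def filtered_build(self, labels):
    # Replacement method: keep only the labels of interest.
    keep = {"text", "section_header"}
    return [label for label in labels if label in keep]


model = ToyModel()
labels = ["text", "table", "section_header"]

with patched_method(ToyModel, "build", filtered_build):
    inside = model.build(labels)   # filtered while the patch is active
outside = model.build(labels)      # original behavior after the block

print(inside)   # ['text', 'section_header']
print(outside)  # ['text', 'table', 'section_header']
```

The same helper could wrap a conversion call, so the filter applies to that call only.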

In [11]:
def create_converter(num_threads: int = 32) -> DocumentConverter:
    """Build a PDF converter with OCR and table structure enabled."""
    # Configure the PDF pipeline options
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.lang = ["en"]
    pipeline_options.layout_options.create_orphan_clusters = False
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=num_threads, 
        device=AcceleratorDevice.AUTO
    )
    
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

In [7]:
# Apply the patch once
patch_reading_order_model_with_filter({DocItemLabel.TEXT, DocItemLabel.SECTION_HEADER})
patch_latyout_postprocessor()

In [7]:
# ReadingOrderModel._readingorder_elements_to_docling_doc??

In [12]:
converter = create_converter()

In [13]:
doc = converter.convert("../testdata/05.pdf").document

2025-09-16 06:56:55,076 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-09-16 06:56:55,113 - INFO - Going to convert document batch...
2025-09-16 06:56:55,114 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 3c084c1449dda55ebb0219f601cf7b5a
2025-09-16 06:56:55,115 - INFO - Accelerator device: 'cuda:0'
2025-09-16 06:56:57,250 - INFO - Accelerator device: 'cuda:0'
2025-09-16 06:56:59,264 - INFO - Accelerator device: 'cuda:0'
2025-09-16 06:56:59,753 - INFO - Processing document 05.pdf
2025-09-16 06:57:22,005 - INFO - Finished converting document 05.pdf in 26.93 sec.


In [14]:
print(doc.export_to_markdown())

## GSTZ 1 -1 Deficiency Activates NRF 2 /IGF 1 R Axis in HCC via Accumulation of Oncometabolite Succinylacetone

Fan Yang 1 , 2 , † , Jingjing Li 1 , † , Haijun Deng 1 , † , Yihao Wang 3 , Chong Lei 1 , Qiujie Wang 1 , Jin Xiang 1 , Li Liang 1 , Jie Xia 1 , Xuanming Pan 1 , Xiaosong Li 2 , Quanxin Long 1 , Lei Chang 3 , Ping Xu 3 , Ailong Huang 1 ,* , Kai Wang 1 ,** &amp; Ni Tang 1 ,***

## Abstract

The IGF 1 R signaling is important in the malignant progression of cancer. However, overexpression of IGF 1 R has not been properly assessed in HCC. Here, we revealed that GSTZ 1 -1 , the enzyme in phenylalanine/tyrosine catabolism, is downregulated in HCC, and its expression was negatively correlated with IGF 1 R. Mechanistically, GSTZ 1 -1 deficiency led to succinylacetone accumulation, alkylation modification of KEAP 1 , and NRF 2 activation, thus promoting IGF 1 R transcription by recruiting SP 1 to its promoter. Moreover, inhibition of IGF 1 R or NRF 2 significantly inhibited tumor-pr
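
One way to check that the filter took effect is to count the labels left in the converted document. The sketch below assumes the items arrive as `(item, level)` pairs, which is what `doc.iterate_items()` is expected to yield; `label_histogram` is a hypothetical helper, and stand-in objects are used so the example runs without docling.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable, Tuple


def label_histogram(items: Iterable[Tuple[Any, int]]) -> Counter:
    """Count items per label; each entry is an (item, level) pair."""
    return Counter(getattr(item, "label", "unknown") for item, _level in items)


# Stand-in data so the sketch runs without docling installed.
fake_items = [
    (SimpleNamespace(label="section_header"), 1),
    (SimpleNamespace(label="text"), 2),
    (SimpleNamespace(label="text"), 2),
]
histogram = label_histogram(fake_items)
print(histogram)
```

With the real document, the call would be `label_histogram(doc.iterate_items())`; if the patch worked, only labels from `keep_labels` should appear.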

# Something