Text rects overlap with tables and images that should be excluded

I originally opened this as a discussion, but after digging into the code, it appears to be a bug that affects the extraction of not only tables but also images with text on them.

The problem is that bboxes that are supposed to be avoided during text box detection (images and tables) still end up inside the final joined text bboxes. As a result, the table's text is extracted in place as raw text, and the formatted table is shifted to the bottom of the merged text bbox.
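
To make the symptom easy to check, here is a minimal sketch that reports every text box overlapping an avoided box. It uses plain `(x0, y0, x1, y1)` tuples rather than `pymupdf.Rect`, and `overlaps` and `find_leaks` are hypothetical helper names that do not exist in the library; the coordinates are made up to mimic the mock case.

```python
from typing import List, Tuple

# Boxes are plain (x0, y0, x1, y1) tuples here, not pymupdf.Rect objects.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """Return True if the two boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def find_leaks(text_boxes: List[Box], avoid_boxes: List[Box]) -> List[Tuple[Box, Box]]:
    """List every (text box, avoided box) pair that overlaps."""
    return [(t, a) for t in text_boxes for a in avoid_boxes if overlaps(t, a)]


# Mock page: a paragraph, then a table, then another paragraph.
table = (50, 200, 550, 400)
expected_boxes = [(50, 50, 550, 180), (50, 420, 550, 600)]
merged_boxes = [(50, 50, 550, 600)]  # one text box that swallowed the table

print(find_leaks(expected_boxes, [table]))  # nothing overlaps the table
print(find_leaks(merged_boxes, [table]))    # the merged box overlaps the table
```

Running a check like this on the final text boxes of a page would flag the situation described above, where a single merged text box covers the table area.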

Here is a PDF file presenting a simple mock case, along with the markdown that PyMuPDF4LLM outputs and the expected output:
[table_sample.pdf](https://github.com/user-attachments/files/17441781/table_sample.pdf)
[table_sample.md](https://github.com/user-attachments/files/17441901/table_sample.md)
[table_expected.md](https://github.com/user-attachments/files/17441779/table_expected.md)

The issue originates in `column_boxes()`. The rects passed in the `avoid` param can get re-included because the new block (`temp`) is never checked for intersection with them at these calls:
`check = can_extend(temp, nbb, nblocks, vert_bboxes) # Lines [417, 427] multi_column.py`
Including the `img_bboxes` in these checks seems to fix the issue at this stage.
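
The kind of guard I have in mind can be sketched as follows. This is a simplified stand-in using plain tuples, not the actual `can_extend()` code; `extend_block`, `union` and `overlaps` are hypothetical names, and the coordinates are invented for illustration.

```python
from typing import List, Optional, Tuple

# Boxes are plain (x0, y0, x1, y1) tuples here, not pymupdf.Rect objects.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """Return True if the two boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def union(a: Box, b: Box) -> Box:
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def extend_block(block: Box, candidate: Box, avoid_boxes: List[Box]) -> Optional[Box]:
    """Return the extended block, or None if the extension would cover an avoided box."""
    temp = union(block, candidate)
    if any(overlaps(temp, a) for a in avoid_boxes):
        return None  # extension rejected: it would swallow a table or image
    return temp


table = (50, 200, 550, 400)
line1 = (50, 50, 550, 100)    # first text line above the table
line2 = (50, 105, 550, 180)   # second text line above the table
below = (50, 420, 550, 600)   # text below the table

print(extend_block(line1, line2, [table]))  # extension stays above the table
print(extend_block(line1, below, [table]))  # extension would span the table
```

With such a check, two lines above the table still join, while joining text from above and below the table is refused.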

Afterwards, the call to `join_rects_phase3() # Line [440] multi_column.py` re-includes the excluded rects once again, because it merges rects without checking whether the merged rect intersects a rect that should be avoided:
```
# Lines [245 - 250] multi_column.py:
temp = prect0 | prect1
test = set(
    [tuple(b) for b in prects + new_rects if b.intersects(temp)]
)
if test == set((tuple(prect0), tuple(prect1))):
    prect0 |= prect1
```
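
A possible shape for the missing check, as a rough sketch only: the merge condition from the snippet above is kept, and an extra test against the avoided boxes is added. It uses plain tuples instead of `pymupdf.Rect`, and `merge_allowed`, `union` and `overlaps` are hypothetical names, not functions from `multi_column.py`.

```python
from typing import List, Tuple

# Boxes are plain (x0, y0, x1, y1) tuples here, not pymupdf.Rect objects.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """Return True if the two boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def union(a: Box, b: Box) -> Box:
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_allowed(r0: Box, r1: Box, rects: List[Box], avoid_boxes: List[Box]) -> bool:
    """Allow a merge only if the union touches no other text rect and no avoided box."""
    temp = union(r0, r1)
    # Same idea as the quoted snippet: the union may only touch r0 and r1.
    neighbours = {r for r in rects if overlaps(r, temp)}
    if neighbours != {r0, r1}:
        return False
    # Additional check (assumption): the union must not cover an avoided box.
    return not any(overlaps(temp, a) for a in avoid_boxes)


table = (50, 200, 550, 400)
above = (50, 50, 550, 180)
below = (50, 420, 550, 600)
rects = [above, below]

print(merge_allowed(above, below, rects, []))       # no avoided boxes: merge goes through
print(merge_allowed(above, below, rects, [table]))  # table in between: merge is refused
```

The first call mirrors the current behaviour, where the two paragraphs are merged across the table; the second shows the merge being refused once the avoided rects are taken into account.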


### Discussed in https://github.com/pymupdf/RAG/discussions/168

<div type='discussions-op-text'>

<sup>Originally posted by **Meaveryway** October 13, 2024</sup>
Hello there, 

Thanks for the wonderful work! This outperforms even most commercial solutions out there!
I have a question regarding table extraction: when converting a PDF page that contains a table to markdown, it seems that the table's raw text is first extracted and put in place of the table, and the formatted table is then placed at the bottom of the page.

Is this the desired output? Why? </div>
