Get image inside table's cell #3587

vinniec2 · 2024-06-16T11:23:25Z

vinniec2
Jun 16, 2024

Hi, I am just starting out with PyMuPDF. I want to extract the contents of some tables, and it mostly works, but one column sometimes contains images, and for those cells the tab.extract() method of the Table class returns empty strings.
The only approach I could think of is to go through the list of images and check whether any image's bbox lies inside a cell's bbox.
Is there a simpler solution?
Thanks :)

Answered by JorjMcKie

JorjMcKie · 2024-06-16T12:39:56Z

JorjMcKie
Jun 16, 2024
Maintainer

Converting this from "Issues" to "Discussions".

Thanks for your interest in PyMuPDF!

Your idea is the only way to match this information. You can extract a list of images via page.get_image_info(). This delivers metadata for all images - without extracting the image binaries themselves.

It is not really difficult to do:

import pymupdf

# page is a pymupdf Page, tab is one of the tables found on it
imglist = page.get_image_info()

# copy of the table's text content:
tab_text = tab.extract()[:]
# the table's cell bboxes as Rect objects:
tab_cells = [[pymupdf.Rect(c) for c in r.cells] for r in tab.rows]

Both of the above are lists of lists with matching (row, col) indices.

for img_idx, img in enumerate(imglist):
    if img["bbox"] not in pymupdf.Rect(tab.bbox):
        continue  # avoid expensive loop over all the cells
    for r in range(tab.row_count):
        for c in range(tab.col_count):
            if img["bbox"] in tab_cells[r][c]:
                tab_text[r][c] = f"image {img_idx}"

The question is what you really want to do once you have a match between an img["bbox"] and a table cell.

4 replies

vinniec2 Jun 17, 2024
Author

Okay, I tried the code you gave me and it works fine!
I was stuck for a moment because I hadn't realized that you had overloaded the in operator for the Rect class. Super nice! So it simply checks whether the coordinates are inside the rectangle.
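
For readers who also stumble over the in operator: with a rectangle-like operand on the left, the check appears to boil down to a coordinate comparison. A minimal sketch in plain Python, assuming bboxes are (x0, y0, x1, y1) tuples; rect_contains is a hypothetical helper written for illustration, not part of PyMuPDF:

```python
def rect_contains(outer, inner):
    """Return True if `inner` lies entirely within `outer`.

    Both arguments are (x0, y0, x1, y1) tuples. Hypothetical helper that
    mimics what `inner in pymupdf.Rect(outer)` is assumed to compute.
    """
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 <= ix1 <= ox1 and oy0 <= iy0 <= iy1 <= oy1


cell = (100, 100, 200, 150)
print(rect_contains(cell, (110, 105, 190, 145)))  # True: fully inside the cell
print(rect_contains(cell, (110, 105, 210, 145)))  # False: sticks out on the right
```

An image that overlaps a cell border even slightly fails this test, so a small tolerance may be needed for real documents.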

> The question is what you really want to do once you have a match between an img["bbox"] and a table cell.

For now I just want to fill a dictionary to index things, and then I will see how to use the content.
I found some code that writes the images to disk, so I think I will do that and export a JSON file with references to the images.

Now I just need to handle multiple pages (okay, just do one page at a time) and handle the text outside the tables (but I think you already gave me the useful hint with Rect, and I also found this other issue: #2908).
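
The dictionary-plus-JSON idea above could look roughly like the following. This is only a sketch under assumptions: bboxes are plain (x0, y0, x1, y1) tuples (as taken from the "bbox" entries of page.get_image_info() and from the rows' cells), a cell may be None when the table has no cell at that position, and build_image_index is a hypothetical name. The keys are "row,col" strings because JSON object keys must be strings.

```python
import json


def build_image_index(image_bboxes, cell_bboxes):
    """Map "row,col" keys to the indices of the images inside that cell.

    image_bboxes: list of (x0, y0, x1, y1) tuples, one per image.
    cell_bboxes: list of rows, each a list of (x0, y0, x1, y1) tuples or None.
    """
    index = {}
    for img_idx, (ix0, iy0, ix1, iy1) in enumerate(image_bboxes):
        for r, row in enumerate(cell_bboxes):
            for c, cell in enumerate(row):
                if cell is None:  # assumed placeholder for a missing cell
                    continue
                cx0, cy0, cx1, cy1 = cell
                # the image must lie completely inside the cell
                if cx0 <= ix0 and cy0 <= iy0 and ix1 <= cx1 and iy1 <= cy1:
                    index.setdefault(f"{r},{c}", []).append(img_idx)
    return index


cells = [
    [(0, 0, 50, 20), (50, 0, 100, 20)],
    [(0, 20, 50, 40), (50, 20, 100, 40)],
]
images = [(55, 2, 95, 18), (5, 22, 45, 38)]
index = build_image_index(images, cells)
print(json.dumps(index, indent=2))
```

The image indices can later be replaced by file names once the images have been written to disk.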

vinniec2 Jun 17, 2024
Author

OK, I'm sorry, I already have another problem.

I tried two ways:
1) removing the contents of the tables' bounding boxes with the apply_redactions() method and then calling get_text() to get the text.
2) using get_textbox() on a rectangle covering the area outside the tables.
Both methods work but return the text in reverse order, from the bottom row up!

If I instead use get_text() without apply_redactions(), the text is returned in the right order, but it includes the unwanted table text.

Do I have to reorder the text myself with splitlines() and reverse()?

JorjMcKie Jun 17, 2024
Maintainer

Please open another Discussions thread for new questions.
It's easier to follow for others.

vinniec2 Jun 22, 2024
Author

Sorry for the slow reply. After some time working on it, I managed to get what I needed with PyMuPDF: all tables and the text between the tables parsed and indexed!
I wanted to share the piece of code that allowed me to separate the text from the tables, given that in my case the sections all alternate vertically (no text or tables placed side by side horizontally).
I know it's trivial for the person who wrote this library, but I'm posting it partly to thank him and acknowledge his help, and partly because someone else as inexperienced as me with the same problem might find it useful.

It is an experiment: I generalize by creating a Rect representing the page and a list of Rects standing in for the tables that pymupdf's page.find_tables() returns.

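#!/usr/bin/env python3
from pymupdf import Rect

# stand-in for the page rectangle
page = Rect((0,0), (50,100))

# stand-ins for the table bboxes; the commented variants cover edge cases
# (tables touching the page edges or each other)
tabs = [ Rect((0,20),(50,40)), Rect((0,60),(50,80)) ]
#tabs = [ Rect((0,0),(50,40)), Rect((0,60),(50,80)) ]
#tabs = [ Rect((0,20),(50,40)), Rect((0,40),(50,80)) ]
#tabs = [ Rect((0,0),(50,40)), Rect((0,40),(50,80)) ]
#tabs = [ Rect((0,0),(50,40)), Rect((0,40),(50,100)) ]

# start with the whole page and cut out one table at a time
squares = [page]
for tab in tabs:
    new_squares = []
    for sqr in squares:
        if tab in sqr:
            # split the area into the parts above and below the table
            sq1 = Rect((sqr.x0, sqr.y0), (sqr.x1, tab.y0))
            sq2 = Rect((sqr.x0, tab.y1), (sqr.x1, sqr.y1))
            # skip empty parts (table touches the top or bottom edge)
            if sq1.height > 0:
                new_squares.append(sq1)
            if sq2.height > 0:
                new_squares.append(sq2)
        else:
            new_squares.append(sqr)
    squares = new_squares

# text areas and tables together, sorted top to bottom by y0
ordine = sorted(squares + tabs, key=lambda r: r.y0)

#print(squares)
print(ordine)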
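
The same idea can be wrapped in a function that also labels each strip, so later code knows whether to read it as text or as a table. The following is a sketch, not the code used above: split_page is a hypothetical name, and plain (x0, y0, x1, y1) tuples replace Rect objects so that it runs without PyMuPDF installed. It keeps the assumption that tables and text only alternate vertically.

```python
def split_page(page, tabs):
    """Split `page` into table and non-table strips, ordered top to bottom.

    page and each entry of tabs are (x0, y0, x1, y1) tuples.
    Returns a list of (kind, bbox) pairs, kind being "text" or "table".
    """
    strips = [page]
    for tx0, ty0, tx1, ty1 in tabs:
        new_strips = []
        for sx0, sy0, sx1, sy1 in strips:
            inside = sx0 <= tx0 and sy0 <= ty0 and tx1 <= sx1 and ty1 <= sy1
            if not inside:
                new_strips.append((sx0, sy0, sx1, sy1))
                continue
            if ty0 > sy0:  # non-empty strip above the table
                new_strips.append((sx0, sy0, sx1, ty0))
            if sy1 > ty1:  # non-empty strip below the table
                new_strips.append((sx0, ty1, sx1, sy1))
        strips = new_strips
    labelled = [("text", s) for s in strips] + [("table", t) for t in tabs]
    return sorted(labelled, key=lambda item: item[1][1])  # sort by y0


for kind, bbox in split_page((0, 0, 50, 100), [(0, 20, 50, 40), (0, 60, 50, 80)]):
    print(kind, bbox)
```

With the sample values this yields text, table, text, table, text from top to bottom; each "text" bbox could then be passed to something like page.get_textbox().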

JorjMcKie · 2024-06-16T12:42:32Z

JorjMcKie
Jun 16, 2024
Maintainer

A similar answer could be given / developed for another question we have heard: How can hyperlink information be stored back into table cells ...
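
One way such a hyperlink variant might look, as a sketch only: it assumes each link is a dict with a "from" rectangle and a "uri" entry (the shape page.get_links() is understood to return for URI links), uses plain (x0, y0, x1, y1) tuples, and introduces the hypothetical helper links_by_cell. Unlike the image case, it matches on the center point of the link rectangle, a design choice made here on the assumption that link areas may slightly overlap cell borders.

```python
def links_by_cell(links, cell_bboxes):
    """Return {(row, col): [uri, ...]} for links whose center lies in a cell.

    links: list of dicts with a "from" bbox tuple and an optional "uri".
    cell_bboxes: list of rows, each a list of (x0, y0, x1, y1) tuples or None.
    """
    result = {}
    for link in links:
        x0, y0, x1, y1 = link["from"]
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # center of the link rectangle
        for r, row in enumerate(cell_bboxes):
            for c, cell in enumerate(row):
                if cell is None:  # assumed placeholder for a missing cell
                    continue
                if cell[0] <= cx <= cell[2] and cell[1] <= cy <= cell[3]:
                    result.setdefault((r, c), []).append(link.get("uri"))
    return result


cells = [
    [(0, 0, 50, 20), (50, 0, 100, 20)],
    [(0, 20, 50, 40), (50, 20, 100, 40)],
]
links = [{"from": (52, 4, 98, 16), "uri": "https://example.com"}]
print(links_by_cell(links, cells))
```

A center that falls exactly on a shared border would match both neighboring cells, so real code may want to stop at the first hit.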

0 replies
