Hi, I am just trying out PyMuPDF. I want to extract the contents of some tables, and it seems to work. However, one column sometimes contains images, and in that case the tab.extract() method of the Table class returns empty strings for those cells.
Converting this from "Issues" to "Discussions".

Thanks for your interest in PyMuPDF!

Your idea is the only way to match this information. You can extract a list of images via page.get_image_info(). This delivers metadata for all images without extracting the image binaries themselves. It is not really difficult to do:

    imglist = page.get_image_info()
    # copy of the table's text content:
    tab_text = tab.extract()[:]
    # the table's cell bboxes as Rect objects:
    tab_cells = [[pymupdf.Rect(c) for c in r.cells] for r in tab.rows]

Both of the above are lists of lists with matching (row, col) indices.

    for img_idx, img in enumerate(imglist):
        if img["bbox"] not in pymupdf.Rect(tab.bbox):
            continue  # avoid expensive loop over all the cells
        for r in range(tab.row_count):
            for c in range(tab.col_count):
                if img["bbox"] in tab_cells[r][c]:
                    tab_text[r][c] = f"image {img_idx}"

The question is what you really want to do once you have a match between an ...
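For reference, the matching described above can also be sketched as plain functions that work on bbox tuples, which makes the logic easy to test without a PDF at hand. This is only a minimal sketch: the helper names (contains, label_image_cells, label_images_in_table) are hypothetical, and the None check assumes that a missing cell (for example in a row with merged cells) may be reported as None rather than as a bbox.

```python
def contains(outer, inner):
    """True if bbox `inner` lies fully inside bbox `outer` (x0, y0, x1, y1)."""
    return (outer[0] <= inner[0] <= inner[2] <= outer[2]
            and outer[1] <= inner[1] <= inner[3] <= outer[3])


def label_image_cells(cell_text, cell_boxes, image_infos):
    """Return a copy of cell_text with "image N" written into cells that contain image N."""
    result = [list(row) for row in cell_text]  # copy the rows as well
    for img_idx, img in enumerate(image_infos):
        for r, row in enumerate(cell_boxes):
            for c, box in enumerate(row):
                # assumption: a missing cell may show up as None
                if box is not None and contains(box, img["bbox"]):
                    result[r][c] = f"image {img_idx}"
    return result


def label_images_in_table(page, tab):
    """Glue code (hypothetical helper) for a PyMuPDF page and one table found on it."""
    boxes = [list(row.cells) for row in tab.rows]
    return label_image_cells(tab.extract(), boxes, page.get_image_info())
```

With a page and a table from page.find_tables(), calling label_images_in_table(page, tab) should give the same kind of result as the loop above, but it leaves the list returned by tab.extract() untouched.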
A similar answer could be given or developed for another question we have heard: how can hyperlink information be stored back into table cells ...
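One possible shape of such an answer, as a rough sketch only: page.get_links() returns one dictionary per link, with the link rectangle under "from" and, for external links, the target under "uri". The helper below (links_per_cell is a hypothetical name) assigns each URI to the cell that contains the center of the link rectangle. Using the center rather than full containment is an assumption made here, because a link rectangle may stick out slightly beyond the cell borders.

```python
def links_per_cell(cell_boxes, links):
    """Map (row, col) to the list of URIs whose link rectangle is centered in that cell."""
    found = {}
    for link in links:
        uri = link.get("uri")
        if uri is None:  # skip internal jumps and other non-URI links
            continue
        x0, y0, x1, y1 = link["from"]
        mid_x, mid_y = (x0 + x1) / 2, (y0 + y1) / 2
        for r, row in enumerate(cell_boxes):
            for c, box in enumerate(row):
                if box is None:  # assumption: a missing cell may show up as None
                    continue
                # half-open test, so a point on a shared border matches one cell only
                if box[0] <= mid_x < box[2] and box[1] <= mid_y < box[3]:
                    found.setdefault((r, c), []).append(uri)
    return found
```

For a real table this could be called as links_per_cell([list(r.cells) for r in tab.rows], page.get_links()), and the resulting dictionary can then be merged into the text from tab.extract() in whatever form is needed.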