Question / Comment: Error when get bbox with page.getImageBbox() #699

dothinking · 2020-10-22T08:07:21Z

Hi JorjMcKie,

I ran into a problem when extracting an image bbox with page.getImageBbox() from this PDF.

File "D:\89_Program_Files\Python368\lib\site-packages\fitz\fitz.py", line 4791, in getImageBbox
    raise ValueError("unsupported image item")
ValueError: unsupported image item

What I did:

import fitz

doc = fitz.open("test.pdf")
page = doc[0]

# I can get the image from page
items = page.getImageList(full=True)
print(items)
# [(8, 0, 1600, 939, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode', 7)]

# but failed to get the bbox
bbox = page.getImageBbox(items[0])
print(bbox)
# ValueError: unsupported image item

Check fitz.py line 4791:

if item[-1] != 0:
    raise ValueError("unsupported image item")

In this case, it seems the last entry of the item, item[-1] == 7 (an xref), is non-zero, which caused the error.

So, how can I get the bbox of an image when item[-1] != 0? Thanks in advance.


dothinking · 2020-10-22T08:11:47Z

I tried to cross-check the bbox with page.getText(), but unfortunately I can't get any image blocks. It seems this is because the image is partly outside the page.

print(page.getText('rawdict'))
# {'width': 612.0, 'height': 792.0, 'blocks': []}

JorjMcKie · 2020-10-22T09:53:32Z

As your analysis correctly showed:
Image bboxes can only be determined if the page directly displays the image. In your case, a so-called "Form XObject" is invoked (xref 7), which in turn displays an image.
You can only find out the bbox occupied by the XObject via doc.getPageXObjectList(page.number), whose entries include a tuple for the bbox.
The main reason for this restriction is that the respective MuPDF algorithm is not bug-free. It is indeed a highly complex topic, because XObjects can be nested inside each other at arbitrary levels, each with its own transformation matrix, bbox and whatnot.
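
The distinction above can be sketched in plain Python. This assumes, as the getImageList(full=True) output earlier in this thread suggests, that the last entry of each item is 0 when the page displays the image directly, and otherwise the xref of the invoking Form XObject. The helper name split_by_invoker is hypothetical and not part of PyMuPDF.

```python
def split_by_invoker(items):
    """Split image items into (direct, via_xobject) lists.

    Assumption: each item is a tuple as returned by
    page.getImageList(full=True), and its last entry is 0 for images
    displayed directly by the page, else the xref of the invoking
    Form XObject.
    """
    direct, via_xobject = [], []
    for item in items:
        (direct if item[-1] == 0 else via_xobject).append(item)
    return direct, via_xobject


# The item reported in this issue: invoked by the Form XObject with xref 7.
items = [(8, 0, 1600, 939, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode', 7)]
direct, via_xobject = split_by_invoker(items)
print(len(direct), len(via_xobject))  # prints: 0 1
```

Only the items in the first list pass the item[-1] != 0 check quoted from fitz.py in the original post.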

dothinking · 2020-10-22T14:39:39Z

Thanks for the explanation.

You can only find out the bbox occupied by the XObject doc.getPageXObjectList(page.number)

I still can't get the right bbox with getPageXObjectList, even when also taking page.transformationMatrix into account, as documented in the docs. Indeed, "It is indeed a highly complex topic".

Then I tried bypassing the item[-1] != 0 check to see what MuPDF would return. Luckily, I got a result. Maybe the "MuPDF algorithm is not bug-free", but at least it works in this case. So the workaround for my case is to bypass item[-1] != 0 and give MuPDF a chance, wrapped in a try/except clause to be safe, of course.

import fitz

doc = fitz.open("test.pdf")
page = doc[0]
items = page.getImageList(full=True)

# bypass `item[-1] != 0`
item = list(items[0])
item[-1] = 0

bbox = page.getImageBbox(item)
print(bbox)
# Rect(57.900001525878906, 129.5078125, 688.8927612304688, 582.1100463867188)
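
The try/except wrapping mentioned above is not shown in the snippet, so here is a minimal sketch of it. The helper name image_bbox_or_none is hypothetical. Catching ValueError follows the traceback in the original post; also catching RuntimeError is an assumption about how MuPDF-level failures might surface.

```python
def image_bbox_or_none(page, item):
    """Call page.getImageBbox() with the invoker xref reset to 0.

    Returns the bbox, or None if the call raises. Note that a returned
    bbox is not guaranteed to be correct for images inside XObjects.
    """
    patched = list(item)   # copy, so the caller's item is left untouched
    patched[-1] = 0        # bypass the `item[-1] != 0` check
    try:
        return page.getImageBbox(patched)
    except (ValueError, RuntimeError):  # RuntimeError is an assumption
        return None
```

With the document from this thread, image_bbox_or_none(page, items[0]) would take the place of the direct page.getImageBbox(item) call.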

JorjMcKie · 2020-10-24T16:05:54Z

Maybe I should follow your implicit recommendation and no longer refuse to do it like that.
The only thing I do not want to see happening is having to deal with issues when it didn't work correctly ... 😎.
BTW, the risk of seeing an exception is minor, which is unfortunate in this case: you will get an incorrect bbox without necessarily noticing it.
Maybe I will let it go through and warn if I detect that the image is embedded in an XObject.

dothinking added the question label Oct 22, 2020

dothinking assigned JorjMcKie Oct 22, 2020

dothinking mentioned this issue Oct 22, 2020

Processing images ArtifexSoftware/pdf2docx#57

Closed

JorjMcKie closed this as completed Oct 24, 2020

JorjMcKie added the resolved fixed / implemented / answered label Nov 11, 2020

jmac105 mentioned this issue Dec 2, 2020

Question / Comment: getting dimensions of an image as it appears in the PDF #743

Closed
