Return read data instead of throwing "Unexpected EOD in RunLengthDecode/ASCIIHexDecode"? #2303

Closed
stefan6419846 opened this issue Nov 20, 2023 · 2 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
workflow-images: From a user's perspective, image handling is the affected feature/workflow

Comments

@stefan6419846
Collaborator

I am currently running into "Unexpected EOD in RunLengthDecode" errors when extracting images from some PDF files. Is there a reason to raise a hard exception there instead of calling logger_warning and returning the data read so far?

In one specific case, PDFBox Debugger, MuPDF and Evince are all able to extract the image correctly. Replacing the exception in pypdf.filters.RunLengthDecode.decode with a return of the decoded data seems to produce an image whose only apparent defect is wrong colors.
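To make the proposal concrete, here is a minimal standalone sketch of a lenient decoder. It is not pypdf's code: the function name `run_length_decode_lenient` and the use of the standard `logging` module in place of `logger_warning` are assumptions for illustration. It follows the usual RunLengthDecode rules (length byte 0-127: copy the next length + 1 bytes; 129-255: repeat the next byte 257 - length times; 128: end of data) and warns instead of raising when the marker is missing.

```python
import logging

logger = logging.getLogger(__name__)

EOD_MARKER = 128  # length byte that signals end of data


def run_length_decode_lenient(data: bytes) -> bytes:
    """Decode RunLengthDecode data, returning what was read if EOD is missing.

    Hypothetical helper for illustration only; not part of pypdf.
    """
    out = bytearray()
    pos = 0
    while pos < len(data):
        length = data[pos]
        pos += 1
        if length == EOD_MARKER:
            # Regular end of data: everything after the marker is ignored.
            return bytes(out)
        if length < EOD_MARKER:
            # Literal run: copy the next length + 1 bytes unchanged.
            # Slicing silently yields fewer bytes if the stream is truncated.
            out += data[pos:pos + length + 1]
            pos += length + 1
        else:
            # Repeat run: the next byte is repeated 257 - length times.
            if pos >= len(data):
                break  # the byte to repeat is missing
            out += data[pos:pos + 1] * (257 - length)
            pos += 1
    # The input ended without an EOD marker: warn instead of raising.
    logger.warning(
        "Unexpected EOD in RunLengthDecode, returning %d decoded bytes", len(out)
    )
    return bytes(out)
```

With this behavior, a stream that merely lacks the final marker decodes to the same bytes as a well-formed one, and a truncated stream yields a partial result that the caller can still try to render.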

Environment

$ python -m platform
Linux-5.14.21-150400.24.97-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.18.0'), PIL=10.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

for index, page in enumerate(PdfReader('out1.pdf').pages):
    print(index, page)
    for key in page.images.keys():
        print(key)
        print(page.images[key].indirect_reference)

out1.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/home/stefan/temp/run.py", line 7, in <module>
    print(page.images[key].indirect_reference)
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 2704, in __getitem__
    return self.get_function(index)
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 547, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/filters.py", line 747, in _xobj_to_image
    data = x_object_obj.get_data()  # type: ignore
  File "/home/stefan/pdf/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 969, in get_data
    decoded.set_data(b_(decode_stream_data(self)))
  File "/home/stefan/pdf/venv/lib/python3.9/site-packages/pypdf/filters.py", line 686, in decode_stream_data
    data = RunLengthDecode.decode(data)
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/filters.py", line 343, in decode
    raise PdfStreamError("Unexpected EOD in RunLengthDecode")
pypdf.errors.PdfStreamError: Unexpected EOD in RunLengthDecode
@stefan6419846
Collaborator Author

After some more research, there seem to be two separate issues in this case:

  1. pypdf fails to read/decode the actual image data, while other programs apparently just stop reading at that point and use what they have.

  2. pypdf.filters._xobj_to_image does not seem to handle the palette from the color space definition correctly:

    ['/Indexed', '/DeviceRGB', 164, b'\xed\x1c$\xed\x1e&\xed \'\xed (\xed"*\xee#+\xee$,\xee&-\xee\'.\xee(/\xee(0\xee*2\xee,3\xee-4\xee.5\xee/6\xef07\xef29\xef3:\xef4;\xef5<\xef6=\xef7>\xef8?\xef9@\xef:A\xef;B\xf0<C\xf0=D\xf0>E\xf0@F\xf0AH\xf0BI\xf0CJ\xf0EL\xf0FL\xf0HN\xf1IP\xf1JQ\xf1KR\xf1LR\xf1NT\xf1PV\xf1RX\xf1TZ\xf2U[\xf2V\\\xf2X^\xf2Z`\xf2\\b\xf2^d\xf2af\xf3bh\xf3ej\xf3fk\xf3gl\xf3hm\xf3in\xf3jp\xf3lq\xf3ns\xf4ot\xf4pu\xf4rw\xf4sx\xf4uz\xf4v{\xf4w|\xf4x|\xf4y~\xf4z\x7f\xf5{\x80\xf5~\x82\xf5\x80\x84\xf5\x81\x86\xf5\x82\x87\xf5\x84\x88\xf5\x86\x8a\xf6\x88\x8c\xf6\x89\x8e\xf6\x8a\x8e\xf6\x8c\x90\xf6\x8e\x92\xf6\x90\x94\xf6\x92\x95\xf6\x92\x96\xf7\x94\x98\xf7\x97\x9b\xf7\x99\x9c\xf7\x9a\x9d\xf7\x9a\x9e\xf7\x9c\xa0\xf7\x9e\xa1\xf7\x9f\xa2\xf8\xa1\xa4\xf8\xa2\xa5\xf8\xa3\xa7\xf8\xa4\xa7\xf8\xa5\xa8\xf8\xa6\xa9\xf8\xa8\xab\xf8\xa9\xac\xf8\xaa\xad\xf8\xab\xae\xf8\xac\xaf\xf9\xae\xb1\xf9\xb0\xb3\xf9\xb2\xb4\xf9\xb3\xb6\xf9\xb4\xb7\xf9\xb5\xb8\xf9\xb8\xba\xf9\xba\xbc\xfa\xba\xbd\xfa\xbd\xbf\xfa\xbe\xc0\xfa\xc0\xc2\xfa\xc2\xc5\xfa\xc4\xc6\xfb\xc7\xc9\xfb\xc8\xca\xfb\xca\xcb\xfb\xca\xcc\xfb\xcc\xce\xfb\xce\xd0\xfb\xd0\xd1\xfb\xd1\xd2\xfb\xd2\xd3\xfb\xd2\xd4\xfc\xd3\xd5\xfc\xd4\xd6\xfc\xd6\xd7\xfc\xd7\xd8\xfc\xd8\xd9\xfc\xd9\xda\xfc\xdb\xdd\xfc\xdc\xde\xfc\xde\xdf\xfc\xdf\xe0\xfd\xe0\xe1\xfd\xe1\xe2\xfd\xe3\xe4\xfd\xe4\xe5\xfd\xe5\xe6\xfd\xe6\xe7\xfd\xe7\xe8\xfd\xe8\xe9\xfd\xe9\xea\xfd\xea\xea\xfd\xeb\xec\xfe\xec\xed\xfe\xed\xee\xfe\xee\xef\xfe\xef\xf0\xfe\xf1\xf1\xfe\xf3\xf3\xfe\xf3\xf4\xfe\xf4\xf5\xfe\xf5\xf6\xfe\xf6\xf6\xfe\xf8\xf8\xff\xfa\xfa\xff\xfb\xfc\xff\xfc\xfc\xff\xfe\xfe']
    

    For this reason, once (1) is replaced with a warning that returns the data read so far, the generated image is gray instead of red. A call like img.putpalette(color_space[-1]) seems to be missing, although I am not sure where it belongs or under which conditions it should be applied.
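To illustrate what the palette is supposed to do, here is a small pure-Python sketch of an /Indexed lookup. It is not pypdf's or Pillow's implementation: the helper name `expand_indexed` is made up, and it assumes one index per byte (8 bits per sample) and a /DeviceRGB base with three bytes per palette entry. Each sample selects an entry from the lookup string, and out-of-range samples are clamped to the highest valid index.

```python
def expand_indexed(
    indices: bytes, lookup: bytes, hival: int, components: int = 3
) -> bytes:
    """Map palette indices to base color space samples.

    Hypothetical helper; assumes one index per byte and `components`
    bytes per palette entry (3 for /DeviceRGB).
    """
    out = bytearray()
    for index in indices:
        # Clamp samples above hival to the highest valid palette index.
        index = min(index, hival)
        start = index * components
        entry = lookup[start:start + components]
        # Pad with zeros if the lookup string is shorter than expected.
        out += entry.ljust(components, b"\x00")
    return bytes(out)


# The first two entries of the palette quoted above.
lookup = b"\xed\x1c$\xed\x1e&"
rgb = expand_indexed(b"\x00\x01\x00", lookup, hival=1)
print(rgb.hex())  # ed1c24ed1e26ed1c24
```

Index 0 maps to (0xED, 0x1C, 0x24), which is a red tone. Interpreting the raw indices as gray levels instead skips this lookup, which would be consistent with the gray output described above.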

@stefan6419846
Collaborator Author

I just stumbled upon a similar EOD case for ASCIIHexDecode, where simply returning the data decoded so far would produce the correct image as well. I therefore assume that pypdf is rather strict (too strict?) here.
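The same lenient approach could look roughly like this for ASCIIHexDecode. Again, this is a standalone sketch and not pypdf's code: the function name `ascii_hex_decode_lenient`, the whitespace set and the use of the standard `logging` module are assumptions. It skips whitespace, stops at the `>` marker, pads an odd trailing digit with `0`, and only warns when the marker is missing.

```python
import logging

logger = logging.getLogger(__name__)

# Whitespace bytes skipped between hex digits (assumed set: NUL, tab,
# line feed, form feed, carriage return, space).
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
EOD_MARKER = ord(">")


def ascii_hex_decode_lenient(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data, tolerating a missing '>' marker.

    Hypothetical helper for illustration only; not part of pypdf.
    """
    digits = bytearray()
    found_eod = False
    for byte in data:
        if byte == EOD_MARKER:
            found_eod = True
            break
        if byte in PDF_WHITESPACE:
            continue
        digits.append(byte)
    if not found_eod:
        # Warn instead of raising and keep what has been read so far.
        logger.warning("Unexpected EOD in ASCIIHexDecode, returning read data")
    if len(digits) % 2:
        # An odd number of digits behaves as if a 0 followed the last one.
        digits += b"0"
    # Raises an exception if a non-hexadecimal character is present.
    return bytes.fromhex(digits.decode("ascii"))
```

In this sketch, invalid characters still raise an exception; only the missing end-of-data marker is downgraded to a warning.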

@stefan6419846 stefan6419846 changed the title Return read data instead of throwing "Unexpected EOD in RunLengthDecode"? Return read data instead of throwing "Unexpected EOD in RunLengthDecode/ASCIIHexDecode"? Nov 24, 2023
@stefan6419846 stefan6419846 added workflow-images From a users perspective, image handling is the affected feature/workflow Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Feb 15, 2024