less "conventional" Indexed 4 bit RGB colour format not handled correctly. #2660

Closed
andreaskagedal opened this issue May 21, 2024 · 3 comments · Fixed by #2675
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-merge: From a user's perspective, merging is the affected feature/workflow

Comments

@andreaskagedal

andreaskagedal commented May 21, 2024

When merging PDFs containing images (one per page), some images were altered in the resulting merged file.

The issue was discussed on Stack Overflow here:
https://stackoverflow.com/questions/78508800/pypdf-does-not-give-me-the-right-image
where it was suggested that I report it here.

The conclusion of the Stack Overflow discussion was (copying from the answer by K.J.):

The image pixels have not changed, but the colour index has been incorrectly interpreted and rewritten in the merge.

The Source colours are written as an Octal literal string.

14 0 obj
[/Indexed
/DeviceGray
15
(\377\376\000\375\001\002\374\373\003\000\000\000\000\000\000\000)]endobj

This has been mis-interpreted as a UTF-16 string and replaced with:

12 0 obj
[ /Indexed /DeviceGray 15 (þÿý\000\002\001ûü\000\003\000\000\000\000\000\000) ]
endobj

The net effect is all the colours are now bit rotten. Many, but not all, will look reversed from dark to light.
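
To make the effect on the palette concrete, the two lookup strings quoted above can be compared byte by byte. This is only an illustrative sketch: it hard-codes the 16 bytes from the two objects shown above (the variable names are hypothetical) and checks that the merged table is the original with the two bytes of every pair swapped.

```python
# Palette bytes copied from the two objects quoted above
# (DeviceGray base, so one gray level per palette index).
original = bytes.fromhex("FFFE00FD0102FCFB0300000000000000")  # 14 0 obj, octal escapes decoded
merged = bytes.fromhex("FEFFFD000201FBFC0003000000000000")    # 12 0 obj, as rewritten

# Swapping the two bytes of every pair (i ^ 1 maps 0<->1, 2<->3, ...)
# turns the original table into the merged one.
pair_swapped = bytes(original[i ^ 1] for i in range(len(original)))
print(pair_swapped == merged)

# Gray level per palette index, before and after the merge.
changes = [
    (index, before, after)
    for index, (before, after) in enumerate(zip(original, merged))
    if before != after
]
for index, before, after in changes:
    print(f"index {index}: {before} -> {after}")
```

Index 2 goes from 0 (black) to 253 (near white) and index 3 goes the other way, which fits the "reversed from dark to light" observation, while indices such as 0 and 1 only shift by one gray level.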

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-4.15.0-213-lowlatency-x86_64-with-glibc2.27

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader, PdfWriter

all_pages = PdfReader('original_before_merge.pdf').pages
writer = PdfWriter()
for page in all_pages:
    writer.add_page(page)
writer.write("after_merge.pdf")

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

Note that it is the second page which is the problem.

original_before_merge.pdf

And the result I get from the above code is this:

after_merge.pdf

stefan6419846 added the is-bug and workflow-merge labels May 21, 2024
@pubpub-zz
Collaborator

The issue comes from the color table: it starts with \xFF\xFE, which is interpreted as a UTF-16LE BOM, and this induces byte swapping because of #1884. Under analysis.
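
The byte swapping can be reproduced outside pypdf with the standard library alone. The sketch below is a plausible reconstruction under stated assumptions, not pypdf's actual code path: it assumes the lookup bytes are decoded as BOM-prefixed UTF-16 text and then written back as big-endian UTF-16 with a BOM.

```python
import codecs

# Lookup table from "14 0 obj", with the octal escapes decoded to raw bytes.
raw = bytes.fromhex("FFFE00FD0102FCFB0300000000000000")

# Assumption: the bytes are treated as text. Python's generic "utf-16" codec
# consumes the FF FE BOM and decodes the remaining bytes as little-endian.
as_text = raw.decode("utf-16")

# Assumption: the text is then serialised as big-endian UTF-16 with a BOM.
rewritten = codecs.BOM_UTF16_BE + as_text.encode("utf-16-be")

print(rewritten.hex().upper())
```

Under these assumptions the result is the same 16 bytes as the rewritten lookup string in the merged file, so a palette is only safe if it is carried through as raw bytes and never decoded as text.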

@andreaskagedal
Author

andreaskagedal commented May 22, 2024

Note that there are more comments in the Stack Overflow answer. I should have copied it all here. Sorry for that. Below is the rest of the info (including the observation about the UTF-16LE BOM).


(continued from the Stack Overflow answer by K.J.)
...
The replacement will not always happen, as normally we would expect the "Index" to be in <Hexadecimal Pairs> rather than octets, and the fact that the first two equate to a UTF-16 BOM (\377\376) = <FFFE> is likely the cause of this odd reversed interpretation.

So perhaps this was a bug, "waiting to happen", only in such a rare combination!

In order to correct for the bug, the colours should have been read as hex.

14 0 obj
[/Indexed/DeviceGray 15<FFFE00FD0102FCFB0300000000000000>]
endobj

Then that would be transferred in the merge.
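
For reference, the hex form above can be derived mechanically from the octal literal. The helper below is a minimal sketch with a hypothetical name (`literal_octal_to_hex`); it only handles literal strings made up entirely of `\ddd` octal escapes, which is the case for this lookup table, and is not a general PDF string parser.

```python
import re


def literal_octal_to_hex(literal: str) -> str:
    """Convert a PDF literal string consisting only of \\ddd escapes to a hex string."""
    body = literal.strip()
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]
    # Refuse anything that is not purely octal escapes (a simplification of this sketch).
    if not re.fullmatch(r"(?:\\[0-7]{1,3})*", body):
        raise ValueError("only \\ddd octal escapes are supported in this sketch")
    codes = re.findall(r"\\([0-7]{1,3})", body)
    # Each escape is one byte; high-order overflow is dropped.
    raw = bytes(int(code, 8) & 0xFF for code in codes)
    return "<" + raw.hex().upper() + ">"


source = r"(\377\376\000\375\001\002\374\373\003\000\000\000\000\000\000\000)"
print(literal_octal_to_hex(source))
```

Because the hex form contains no raw FF FE byte sequence in the file, it sidesteps the BOM misdetection in the written output, although the reader still has to treat the decoded bytes as bytes.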

Answer
One way to fix the output is to patch over the lookup string with the shorter hex version.
Another would be to export the image to a larger, conventional 8-bit format and replace it with that much larger image. You may even be able to change the greyscale compression and end up with a smaller file.
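
The second option, expanding the 4-bit indexed samples to plain 8-bit grey, could look roughly like the sketch below. It is an illustration only: the function name is hypothetical, the sample data is assumed to be already decompressed, and the base colour space is assumed to be DeviceGray with one palette byte per index, as in this file.

```python
def expand_indexed4_to_gray8(data: bytes, width: int, height: int, palette: bytes) -> bytes:
    """Map 4-bit palette indices to 8-bit gray samples, one output byte per pixel."""
    # Two pixels per byte, high nibble first; each row is padded to a whole byte.
    row_bytes = (width + 1) // 2
    out = bytearray()
    for y in range(height):
        row = data[y * row_bytes:(y + 1) * row_bytes]
        for x in range(width):
            packed = row[x // 2]
            index = packed >> 4 if x % 2 == 0 else packed & 0x0F
            out.append(palette[index])
    return bytes(out)


# Palette from "14 0 obj" and a made-up 2x2 image using indices 2, 3, 4, 5.
palette = bytes.fromhex("FFFE00FD0102FCFB0300000000000000")
pixels = expand_indexed4_to_gray8(bytes([0x23, 0x45]), width=2, height=2, palette=palette)
print(list(pixels))
```

The output doubles the sample data size before compression, which is the trade-off mentioned above, but it removes the lookup string entirely and so avoids the misinterpretation.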

@pubpub-zz
Collaborator

I think I have the fix. I've also detected an issue in the image quality
test image:
Page1 Image0
