less "conventional" Indexed 4 bit RGB colour format not handled correctly. #2660

andreaskagedal · 2024-05-21T06:47:41Z

When merging a PDF containing images (one per page), some images were altered in the resulting merged file.

The issue was discussed on Stack Overflow here:
https://stackoverflow.com/questions/78508800/pypdf-does-not-give-me-the-right-image
where it was suggested that I report it here.

The conclusion of the Stack Overflow discussion was (copying from the answer by K.J.):

The image pixels have not changed, but the colour index has been incorrectly interpreted and rewritten in the merge.

The source colours are written as a literal string with octal escapes.

14 0 obj
[/Indexed
/DeviceGray
15
(\377\376\000\375\001\002\374\373\003\000\000\000\000\000\000\000)]endobj

This has been mis-interpreted as a UTF-16 string and replaced with:

12 0 obj
[ /Indexed /DeviceGray 15 (þÿý\000\002\001ûü\000\003\000\000\000\000\000\000) ]
endobj

The net effect is that the palette bytes are swapped pairwise, so the colours are corrupted. Many, but not all, will look reversed from dark to light.
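The byte swap described above can be reproduced with Python's built-in codecs. This is only an illustration of the effect, not pypdf's actual code path; the decode-then-encode sequence is an assumption chosen because it yields exactly the bytes seen in the merged file.

```python
# Palette bytes from object 14 of the source PDF (octal escapes decoded).
original = bytes.fromhex("FFFE00FD0102FCFB0300000000000000")

# Assumption: because the string starts with FF FE (the UTF-16LE BOM),
# it is decoded as text and later written back as UTF-16BE (BOM FE FF).
text = original[2:].decode("utf-16-le")
rewritten = b"\xfe\xff" + text.encode("utf-16-be")

print(rewritten.hex().upper())  # FEFFFD000201FBFC0003000000000000

# Each palette entry is one gray level (0 = black, 255 = white).
for index, (before, after) in enumerate(zip(original, rewritten)):
    if before != after:
        print(f"index {index}: gray {before} -> {after}")
```

Index 2, for instance, goes from 0 (black) to 253 (almost white), which matches the dark-to-light reversal seen on the second page.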

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-4.15.0-213-lowlatency-x86_64-with-glibc2.27

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader, PdfWriter

all_pages = PdfReader('original_before_merge.pdf').pages
writer = PdfWriter()
for page in all_pages:
    writer.add_page(page)
writer.write("after_merge.pdf")

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

Note that it is the second page which is the problem.

original_before_merge.pdf

And the result I get from the above code is this:

after_merge.pdf


pubpub-zz · 2024-05-21T21:52:47Z

The issue comes from the color table: it starts with \xFF\xFE, which is interpreted as a UTF-16LE BOM and induces byte swapping because of #1884. Under analysis.

andreaskagedal · 2024-05-22T07:05:44Z

Note that there are more comments in the Stack Overflow answer. I should have copied it all here. Sorry for that. Below is the rest of the info (including the observation about the UTF-16LE BOM).

(continued from the Stack Overflow answer by K.J.)
...
The replacement will not always happen, as normally we would expect the "Index" to be written as <hexadecimal pairs> rather than octal escapes; the fact that the first two bytes equate to a UTF-16 BOM (\377\376) = <FFFE> is likely the cause of this odd reversed interpretation.

So perhaps this was a bug, "waiting to happen", only in such a rare combination!

In order to correct for the bug, the colours should have been read as hex.

14 0 obj
[/Indexed/DeviceGray 15<FFFE00FD0102FCFB0300000000000000>]
endobj
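As a quick check that the hex form carries the same lookup table as the octal literal, the two can be decoded and compared. `decode_octal_escapes` is a hypothetical helper written for this comparison; it only handles three-digit octal escapes and is not a full PDF string parser.

```python
import re


def decode_octal_escapes(body: str) -> bytes:
    """Decode a literal-string body made only of three-digit octal escapes."""
    return bytes(int(digits, 8) for digits in re.findall(r"\\([0-7]{3})", body))


# Body of the literal string in the original object 14.
literal_body = r"\377\376\000\375\001\002\374\373\003\000\000\000\000\000\000\000"
# Body of the proposed hex string.
hex_body = "FFFE00FD0102FCFB0300000000000000"

from_literal = decode_octal_escapes(literal_body)
from_hex = bytes.fromhex(hex_body)

print(from_literal == from_hex)  # True
print(len(from_hex))             # 16 entries, i.e. hival 15 plus one
```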

That would then be transferred unchanged in the merge.

Answer
One way to fix the output is to patch over the lookup table with the shorter hex version.
Another would be to export the image to a more conventional 8-bit format and replace it with the much larger image data. You may actually be able to alter the greyscale compression and end up with a smaller file.
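The 8-bit route mentioned in the answer amounts to resolving every 4-bit index through the palette. A minimal sketch follows, assuming the image data has already been decompressed, that samples are packed high nibble first with each row padded to a whole byte, and that the base colour space is DeviceGray (one byte per palette entry). `expand_indexed_4bit` is a hypothetical helper, not part of pypdf.

```python
def expand_indexed_4bit(packed: bytes, palette: bytes, width: int, height: int) -> bytes:
    """Turn packed 4-bit palette indices into one 8-bit gray byte per pixel."""
    row_bytes = (width + 1) // 2  # each row is padded to a byte boundary
    out = bytearray()
    for row in range(height):
        line = packed[row * row_bytes:(row + 1) * row_bytes]
        indices = []
        for byte in line:
            indices.append(byte >> 4)    # high nibble is the first pixel
            indices.append(byte & 0x0F)  # low nibble is the next pixel
        # Drop the padding nibble, if any, and look up each gray level.
        out.extend(palette[i] for i in indices[:width])
    return bytes(out)


palette = bytes.fromhex("FFFE00FD0102FCFB0300000000000000")
packed = bytes([0x02, 0x30, 0x12, 0x30])  # made-up 3x2 image for illustration
gray = expand_indexed_4bit(packed, palette, width=3, height=2)
print(list(gray))  # [255, 0, 253, 254, 0, 253]
```

The result doubles the raw sample size, which is why the answer expects a larger image unless the compression is changed as well.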

pubpub-zz · 2024-05-22T21:06:06Z

I think I have the fix. I've also detected an issue in the image quality
test image:


stefan6419846 added the labels is-bug (From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF) and workflow-merge (From a user's perspective, merging is the affected feature/workflow) on May 21, 2024

pubpub-zz mentioned this issue May 21, 2024

BUG: Support UTF-16-LE Strings #1884

Merged

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 23, 2024

BUG: Fix images issue 4 bits encoding and LUT starting with UTF16_BOM

b2baf38

closes py-pdf#2660

pubpub-zz mentioned this issue May 23, 2024

BUG: Fix images issue 4 bits encoding and LUT starting with UTF16_BOM #2675

Merged

stefan6419846 closed this as completed in #2675 May 27, 2024

stefan6419846 pushed a commit that referenced this issue May 27, 2024

BUG: Fix images issue 4 bits encoding and LUT starting with UTF16_BOM (…

7481f36

…#2675) Closes #2660.
