Mostly broken watermarking behaviour #2112

stefan6419846 · 2023-08-24T11:27:43Z

I am trying to add a watermark to a PDF file based on the official docs, which seems to create issues with different PDF viewers.

By default, the watermark is upside down and shifted somehow. Introducing page.transfer_rotation_to_content() (which is currently hidden away in the API docs of the PageObject) will fix this issue.

Nevertheless, it seems like there is some additional stuff going wrong as well, as only the Firefox browser (and plain pdf.js, the PDF viewer implementation used by Firefox) are able to view the watermark, while during my tests, Chromium, Evince, Edge, MuPDF and Adobe Acrobat only show a blank page (or the non-watermarked version if the input file would not have had a blank page).

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.81-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.15.2

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader, PdfWriter


watermark = PdfReader('watermark.pdf').pages[0]
pdf_file = PdfWriter(clone_from='out1.pdf')

for page in pdf_file.pages:
    if page.rotation != 0:
        page.transfer_rotation_to_content()
    page.merge_page(watermark, over=True)
pdf_file.write('marked.pdf')

The plain test files are: watermark.pdf out1.pdf marked.pdf

Traceback

There is no traceback.

The text was updated successfully, but these errors were encountered:

stefan6419846 · 2023-08-25T10:39:39Z

Doing some further testing, it seems like this is (mostly) unrelated to the actual PDF file and can be reproduced in even more cases. Running Ghostscript on the corresponding file beforehand seems to reliably fix this behavior and allows displaying/printing with the watermark being visible, but imposes some unnecessary overhead and rewrites the complete PDF file (maybe changing the PDF quality itself as well).

stefan6419846 · 2023-08-25T12:33:57Z

I further investigated this issue and came to the conclusion that watermarking in PyPDF is somehow completely broken. Why? Because looking at the result of the official test, the generated PDF file shows the same behavior:

pypdf/tests/test_writer.py

Lines 1519 to 1535 in f16f434

    
           @pytest.mark.enable_socket() 
        
           def test_watermark(): 
        
               url = "https://github.com/py-pdf/pypdf/files/11985889/bg.pdf" 
        
               name = "bgwatermark.pdf" 
        
               reader = PdfReader(BytesIO(get_data_from_url(url, name=name))) 
        
               url = "https://github.com/py-pdf/pypdf/files/11985888/source.pdf" 
        
               name = "srcwatermark.pdf" 
        
               writer = PdfWriter(clone_from=BytesIO(get_data_from_url(url, name=name))) 
        
               for p in writer.pages: 
        
                   p.merge_page(reader.pages[0], over=False) 
        
               assert isinstance(p["/Contents"], ArrayObject) 
        
               assert isinstance(p["/Contents"][0], IndirectObject) 
        
               b = BytesIO() 
        
               writer.write(b) 
        
               assert len(b.getvalue()) < 2.1 * 1024 * 1024

I could reproduce this on multiple Python 3.9 installations as well as on a completely different setup with a Python 3.10.12 installation.

In my opinion, this behavior cannot be tested with a simple file size check, but needs proper visual testing as far as it is possible. For the current problem, just running Ghostscript (which is what ImageMagick uses) on the PDF file will make it look valid again, but throw away the watermark. Additionally, there are lots of warnings emitted which seem to be related to the issue:

[...]
Page 120
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

For now, doing an image comparison based upon ImageMagick/Ghostscript should at least have discovered the current issue.

stefan6419846 · 2023-08-29T10:16:52Z

This apparently has been fixed by #2086 without being aware of that, but a corresponding integration test should still be added.

MartinThoma · 2023-08-29T12:48:35Z

Do you want to attempt that?

stefan6419846 · 2023-08-29T14:14:00Z

You mentioned something about an existing private test, so I assumed that you would have a look at this.If desired, I could submit a PR tomorrow.

My current (rough) testing code:

pytest.mark.enable_socket()
def test_watermark_rendering(tmp_path):
    url = "https://github.com/py-pdf/pypdf/files/11985889/bg.pdf"
    name = "bgwatermark.pdf"
    watermark = PdfReader(BytesIO(get_data_from_url(url, name=name))).pages[0]
    url = "https://github.com/py-pdf/pypdf/files/11985888/source.pdf"
    name = "srcwatermark.pdf"
    page = PdfReader(BytesIO(get_data_from_url(url, name=name))).pages[0]
    writer = PdfWriter()
    page.merge_page(watermark, over=False)
    writer.add_page(page)

    target_png_path = tmp_path / "target.png"
    url = "https://github.com/stefan6419846/pypdf/assets/96178532/2646f89f-f7a9-494d-8bdf-5331c58ed668"
    name = "dstwatermark.png"
    target_png_path.write_bytes(get_data_from_url(url, name=name))

    pdf_path = tmp_path / "out.pdf"
    png_path = tmp_path / "out.png"
    writer.write(pdf_path)
    ghostscript_binary = shutil.which("gs")
    assert ghostscript_binary is not None
    # https://github.com/PyCQA/bandit/issues/333
    subprocess.run([ghostscript_binary, "-sDEVICE=pngalpha", "-o", png_path, pdf_path])  # noqa: S603
    assert png_path.is_file()
    image_similarity(png_path, target_png_path)

As long as Ghostscript versions stay consistent, there should be no real issue (I generated the file with version 9.52, CI uses 9.50, but larger version differences might lead to rendering differences).

pubpub-zz · 2023-08-29T20:48:37Z

@stefan6419846
can't you just store the output of gs processing within the issue as a reference?

stefan6419846 · 2023-08-30T06:32:17Z

can't you just store the output of gs processing within the issue as a reference?

This is what my testing code is basically doing, although being in my fork for the time of testing to not spam the issues here.

stefan6419846 · 2023-08-30T06:45:57Z

Reference file for testing purposes: https://github.com/py-pdf/pypdf/assets/96178532/d5c72d0e-7047-4504-bbf6-bc591c80d7c0

This adds a test for the correct rendering of watermarked files. Closes #2112

stefan6419846 mentioned this issue Aug 24, 2023

Cope with broken page geometry during insertion #2106

Closed

stefan6419846 changed the title ~~Unstable watermarking behaviour~~ Mostly broken watermarking behaviour Aug 25, 2023

stefan6419846 mentioned this issue Aug 27, 2023

PI: Making pypdf as fast as pdfrw #2086

Merged

stefan6419846 mentioned this issue Aug 30, 2023

TST: Add test for correct rendering of watermarks #2130

Merged

MartinThoma closed this as completed in #2130 Sep 3, 2023

MartinThoma pushed a commit that referenced this issue Sep 3, 2023

TST: Add test for correct rendering of watermarks (#2130)

b8d3bea

This adds a test for the correct rendering of watermarked files. Closes #2112

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mostly broken watermarking behaviour #2112

Mostly broken watermarking behaviour #2112

stefan6419846 commented Aug 24, 2023 •

edited

Loading

stefan6419846 commented Aug 25, 2023 •

edited

Loading

stefan6419846 commented Aug 25, 2023

stefan6419846 commented Aug 29, 2023

MartinThoma commented Aug 29, 2023

stefan6419846 commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

stefan6419846 commented Aug 30, 2023

stefan6419846 commented Aug 30, 2023

Mostly broken watermarking behaviour #2112

Mostly broken watermarking behaviour #2112

Comments

stefan6419846 commented Aug 24, 2023 • edited Loading

Environment

Code + PDF

Traceback

stefan6419846 commented Aug 25, 2023 • edited Loading

stefan6419846 commented Aug 25, 2023

stefan6419846 commented Aug 29, 2023

MartinThoma commented Aug 29, 2023

stefan6419846 commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

stefan6419846 commented Aug 30, 2023

stefan6419846 commented Aug 30, 2023

stefan6419846 commented Aug 24, 2023 •

edited

Loading

stefan6419846 commented Aug 25, 2023 •

edited

Loading