Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mostly broken watermarking behaviour #2112

Closed
stefan6419846 opened this issue Aug 24, 2023 · 8 comments · Fixed by #2130
Closed

Mostly broken watermarking behaviour #2112

stefan6419846 opened this issue Aug 24, 2023 · 8 comments · Fixed by #2130

Comments

@stefan6419846
Copy link
Collaborator

stefan6419846 commented Aug 24, 2023

I am trying to add a watermark to a PDF file based on the official docs, which seems to create issues with different PDF viewers.

By default, the watermark is upside down and shifted somehow. Introducing page.transfer_rotation_to_content() (which is currently hidden away in the API docs of the PageObject) will fix this issue.

Nevertheless, it seems like there is some additional stuff going wrong as well, as only the Firefox browser (and plain pdf.js, the PDF viewer implementation used by Firefox) are able to view the watermark, while during my tests, Chromium, Evince, Edge, MuPDF and Adobe Acrobat only show a blank page (or the non-watermarked version if the input file would not have had a blank page).

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.81-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.15.2

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader, PdfWriter


watermark = PdfReader('watermark.pdf').pages[0]
pdf_file = PdfWriter(clone_from='out1.pdf')

for page in pdf_file.pages:
    if page.rotation != 0:
        page.transfer_rotation_to_content()
    page.merge_page(watermark, over=True)
pdf_file.write('marked.pdf')

The plain test files are: watermark.pdf out1.pdf marked.pdf

Traceback

There is no traceback.

@stefan6419846
Copy link
Collaborator Author

stefan6419846 commented Aug 25, 2023

Doing some further testing, it seems like this is (mostly) unrelated to the actual PDF file and can be reproduced in even more cases. Running Ghostscript on the corresponding file beforehand seems to reliably fix this behavior and allows displaying/printing with the watermark being visible, but imposes some unnecessary overhead and rewrites the complete PDF file (maybe changing the PDF quality itself as well).

@stefan6419846 stefan6419846 changed the title Unstable watermarking behaviour Mostly broken watermarking behaviour Aug 25, 2023
@stefan6419846
Copy link
Collaborator Author

I further investigated this issue and came to the conclusion that watermarking in PyPDF is somehow completely broken. Why? Because looking at the result of the official test, the generated PDF file shows the same behavior:

pypdf/tests/test_writer.py

Lines 1519 to 1535 in f16f434

@pytest.mark.enable_socket()
def test_watermark():
url = "https://github.com/py-pdf/pypdf/files/11985889/bg.pdf"
name = "bgwatermark.pdf"
reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))
url = "https://github.com/py-pdf/pypdf/files/11985888/source.pdf"
name = "srcwatermark.pdf"
writer = PdfWriter(clone_from=BytesIO(get_data_from_url(url, name=name)))
for p in writer.pages:
p.merge_page(reader.pages[0], over=False)
assert isinstance(p["/Contents"], ArrayObject)
assert isinstance(p["/Contents"][0], IndirectObject)
b = BytesIO()
writer.write(b)
assert len(b.getvalue()) < 2.1 * 1024 * 1024

I could reproduce this on multiple Python 3.9 installations as well as on a completely different setup with a Python 3.10.12 installation.

In my opinion, this behavior cannot be tested with a simple file size check, but needs proper visual testing as far as it is possible. For the current problem, just running Ghostscript (which is what ImageMagick uses) on the PDF file will make it look valid again, but throw away the watermark. Additionally, there are lots of warnings emitted which seem to be related to the issue:

[...]
Page 120
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

For now, doing an image comparison based upon ImageMagick/Ghostscript should at least have discovered the current issue.

@stefan6419846
Copy link
Collaborator Author

This apparently has been fixed by #2086 without being aware of that, but a corresponding integration test should still be added.

@MartinThoma
Copy link
Member

Do you want to attempt that?

@stefan6419846
Copy link
Collaborator Author

You mentioned something about an existing private test, so I assumed that you would have a look at this.If desired, I could submit a PR tomorrow.

My current (rough) testing code:

pytest.mark.enable_socket()
def test_watermark_rendering(tmp_path):
    url = "https://github.com/py-pdf/pypdf/files/11985889/bg.pdf"
    name = "bgwatermark.pdf"
    watermark = PdfReader(BytesIO(get_data_from_url(url, name=name))).pages[0]
    url = "https://github.com/py-pdf/pypdf/files/11985888/source.pdf"
    name = "srcwatermark.pdf"
    page = PdfReader(BytesIO(get_data_from_url(url, name=name))).pages[0]
    writer = PdfWriter()
    page.merge_page(watermark, over=False)
    writer.add_page(page)

    target_png_path = tmp_path / "target.png"
    url = "https://github.com/stefan6419846/pypdf/assets/96178532/2646f89f-f7a9-494d-8bdf-5331c58ed668"
    name = "dstwatermark.png"
    target_png_path.write_bytes(get_data_from_url(url, name=name))

    pdf_path = tmp_path / "out.pdf"
    png_path = tmp_path / "out.png"
    writer.write(pdf_path)
    ghostscript_binary = shutil.which("gs")
    assert ghostscript_binary is not None
    # https://github.com/PyCQA/bandit/issues/333
    subprocess.run([ghostscript_binary, "-sDEVICE=pngalpha", "-o", png_path, pdf_path])  # noqa: S603
    assert png_path.is_file()
    image_similarity(png_path, target_png_path)

As long as Ghostscript versions stay consistent, there should be no real issue (I generated the file with version 9.52, CI uses 9.50, but larger version differences might lead to rendering differences).

@pubpub-zz
Copy link
Collaborator

@stefan6419846
can't you just store the output of gs processing within the issue as a reference?

@stefan6419846
Copy link
Collaborator Author

can't you just store the output of gs processing within the issue as a reference?

This is what my testing code is basically doing, although being in my fork for the time of testing to not spam the issues here.

@stefan6419846
Copy link
Collaborator Author

Reference file for testing purposes: https://github.com/py-pdf/pypdf/assets/96178532/d5c72d0e-7047-4504-bbf6-bc591c80d7c0

MartinThoma pushed a commit that referenced this issue Sep 3, 2023
This adds a test for the correct rendering of watermarked files.

Closes #2112
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants