Merging PDFs with content streams ending in Q causes error message in Adobe Reader #2587

Closed
rfotino opened this issue Apr 7, 2024 · 0 comments · Fixed by #2588
Labels
generic: The generic submodule is affected
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF

rfotino commented Apr 7, 2024

tl;dr: Calling PageObject.merge_page when one of the pages' content streams ends in Q produces a PDF that triggers this error message when opened in Adobe Reader:
An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem.

For search engines: An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem.

Example "PDF with content stream ending in Q": blank_portrait.pdf. It's a single-page, blank PDF. Opening it in a text editor, you can see that its content is a 3-byte stream consisting of q Q.

I use PyPDF along with ReportLab to generate PDFs, and I fill in forms by using PageObject.merge_page to merge a page of text onto a blank form. I normally use Chrome's PDF viewer or macOS's Preview to look at PDFs, so I didn't see this error for quite a while, until I got feedback from people using Adobe Reader. On a guess that it might show more information about the error, I ran pdftotext from poppler-utils on an affected PDF, and it printed something like Syntax Error (415): Unknown operator 'QQ'.

It turns out that the base PDF (the blank form) had a content stream ending in Q; that is, it had the format q <rest-of-content-stream> Q. After much digging I finally found this line https://github.com/py-pdf/pypdf/blob/main/pypdf/generic/_data_structures.py#L1259, which was appending an additional Q to the end of the page's content stream when merging. With no whitespace between the existing Q and the appended one, the two are read as a single unknown operator, QQ.

I think the solution is just to move the newline before the Q in that line of code. I'll have a PR up shortly, and I have a very basic repro script that should showcase the problem.
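The effect of moving the newline can be sketched with plain bytes, without pypdf. This is only an illustration: the two helper functions are hypothetical names, and the "before" version assumes the original line appended the Q followed by a newline, which is what "move the newline before the Q" implies. The "after" version mirrors the patched line in the repro script in this issue.

```python
def isolate_before_fix(data: bytes) -> bytes:
    # Assumed pre-fix behaviour: the closing Q is appended directly after
    # the existing data, with the newline coming after it.
    return b"q\n" + data + b"Q\n"


def isolate_after_fix(data: bytes) -> bytes:
    # Proposed behaviour: the newline separates the existing data
    # from the closing Q.
    return b"q\n" + data + b"\nQ"


# The 3-byte content stream described for blank_portrait.pdf.
blank_page_stream = b"q Q"

broken = isolate_before_fix(blank_page_stream)
fixed = isolate_after_fix(blank_page_stream)

print(broken.split())  # [b'q', b'q', b'QQ']
print(fixed.split())   # [b'q', b'q', b'Q', b'Q']
```

Splitting on whitespace is a rough stand-in for how a PDF tokenizer separates operators, but it is enough to show where the QQ token comes from.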

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.4.1-arm64-arm-64bit
# also Alpine Linux

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('cryptography', '42.0.5'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO
from pypdf import PdfWriter, PdfReader
from reportlab.pdfgen import canvas

def patch_isolate_graphics_state():
	# Monkeypatch ContentStream.isolate_graphics_state so that the closing Q
	# is preceded by a newline and cannot fuse with a Q that already ends
	# the stream.
	from pypdf._utils import b_
	from pypdf.generic._data_structures import ContentStream
	def _isolate_graphics_state(self):
		if self._operations:
			self._operations.insert(0, ([], "q"))
			self._operations.append(([], "Q"))
		elif self._data:
			self._data = b"q\n" + b_(self._data) + b"\nQ"
	ContentStream.isolate_graphics_state = _isolate_graphics_state

# Uncomment this to get a fixed PDF
# patch_isolate_graphics_state()

# generate top page with text via reportlab
packet = BytesIO()
can = canvas.Canvas(packet)
for y in range(50, 750, 15):
	can.drawString(250, y, "Testing, testing, 1, 2, 3...")
can.save()
top_page = PdfReader(packet).pages[0]

# merge
blank_pdf = PdfReader(open("blank_portrait.pdf", "rb"))
bottom_page = blank_pdf.pages[0]
bottom_page.merge_page(top_page)

# write out
output = PdfWriter()
output.add_page(bottom_page)
with open("output.pdf", mode='wb') as outputStream:
	output.write(outputStream)

Input PDF file: blank_portrait.pdf
Output PDF file using latest PyPDF release: output.pdf
Output PDF file after fix: output.pdf

For some reason, Adobe Reader only complains about an error on the page once the page contains more than a certain amount of text, hence the many can.drawString calls. The QQ operator is in the output regardless, even if the top page has no text, so the stream is still malformed even though Adobe Reader doesn't pop up an error for a truly minimal example.
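Since Adobe Reader stays silent on small files, a direct check of the content stream bytes is a more reliable signal than opening the PDF. Below is a minimal sketch of such a check. The function name find_fused_restore_ops is hypothetical, and the tokenizer is deliberately naive: it only splits on PDF whitespace and does not understand string literals or inline image data, so treat a hit as a hint rather than proof.

```python
import re

# PDF whitespace characters: NUL, tab, line feed, form feed,
# carriage return, and space.
_PDF_WHITESPACE = re.compile(rb"[\x00\t\n\x0c\r ]+")


def find_fused_restore_ops(stream: bytes) -> list:
    """Return tokens made up only of two or more 'Q' characters."""
    tokens = [t for t in _PDF_WHITESPACE.split(stream) if t]
    return [t for t in tokens if len(t) > 1 and set(t) == {ord("Q")}]


# Streams shaped like the merged output before and after the fix.
print(find_fused_restore_ops(b"q\nq QQ\n"))  # [b'QQ']
print(find_fused_restore_ops(b"q\nq Q\nQ"))  # []
```

The bytes to feed into such a check would come from the merged page's content stream in output.pdf.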

This very basic blank PDF was produced with macOS's Preview in 2021, but I have a few others in the wild, printed from Microsoft Word on macOS in 2024, that have the same issue; I just can't share them here without quite a bit of cleanup. I've also seen the issue with forms generated by printing to PDF from Chrome on macOS.

I'll add the blank PDF here as a test case in my PR.

stefan6419846 added the is-bug and generic labels on Apr 7, 2024