Merging PDFs with content streams ending in Q causes error message in Adobe Reader #2587

rfotino · 2024-04-07T05:28:35Z

tl;dr: Using PageObject.merge_page when one of the pages has a content stream ending in Q results in this error message when the resulting PDF is opened in Adobe Reader:

For search engines: An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem.

Example "PDF with content stream ending in Q": blank_portrait.pdf. It's a single page, blank PDF. Opening in a text editor you can see the contents are a 3-byte stream consisting of q Q.

I use pypdf along with ReportLab to generate PDFs, and I fill in forms by using PageObject.merge_page to merge a page with text onto a blank form. I normally use Chrome's PDF viewer or macOS's Preview to look at PDFs, so I didn't see this error for quite a while, until I got feedback from people using Adobe Reader. On a guess, I ran poppler-utils's pdftotext to see if it would show more info about the error. Running it on an affected PDF printed something like Syntax Error (415): Unknown operator 'QQ'.

It turns out that the base PDF (the blank form) had a content stream ending in Q, i.e. it was of the form q <rest-of-content-stream> Q. After much digging I finally found this line https://github.com/py-pdf/pypdf/blob/main/pypdf/generic/_data_structures.py#L1259, which appends an additional Q directly onto the end of the page's content stream when merging. With no whitespace between the existing Q and the appended one, the two operators fuse into the unknown operator QQ.

I think the solution is just to move the newline before the Q in that line of code. I'll have a PR up shortly, and I have a very basic repro script that should showcase the problem.
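To make the concatenation problem concrete, here is a minimal sketch in plain Python. The two helper functions are hypothetical stand-ins, not pypdf's actual code: they assume the old wrapping appended Q followed by a newline, and that the proposed wrapping puts the newline before the Q instead.

```python
def isolate_before(data: bytes) -> bytes:
    # Assumed shape of the old wrapping: the closing Q is glued
    # directly onto whatever byte the stream happens to end with.
    return b"q\n" + data + b"Q\n"


def isolate_after(data: bytes) -> bytes:
    # Proposed wrapping: a newline separates the stream from the closing Q.
    return b"q\n" + data + b"\nQ"


stream = b"q Q"  # the 3-byte content stream from blank_portrait.pdf

print(isolate_before(stream))  # b'q\nq QQ\n'  -> contains the bogus QQ operator
print(isolate_after(stream))   # b'q\nq Q\nQ'  -> two separate Q operators
```

A stream that already ends in whitespace would not show the problem with either version, which may be why it only shows up with some input PDFs.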

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.4.1-arm64-arm-64bit
# also Alpine Linux

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('cryptography', '42.0.5'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO
from pypdf import PdfWriter, PdfReader
from reportlab.pdfgen import canvas

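def patch_isolate_graphics_state():
	# Monkeypatch ContentStream.isolate_graphics_state with the proposed fix.
	from pypdf._utils import b_
	from pypdf.generic._data_structures import ContentStream
	def _isolate_graphics_state(self):
		if self._operations:
			# Parsed operations: wrap them in a q ... Q pair.
			self._operations.insert(0, ([], "q"))
			self._operations.append(([], "Q"))
		elif self._data:
			# Raw bytes: the newline goes before the Q so that it
			# cannot fuse with a trailing Q already in the stream.
			self._data = b"q\n" + b_(self._data) + b"\nQ"
	ContentStream.isolate_graphics_state = _isolate_graphics_state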

# Uncomment this to get a fixed PDF
# patch_isolate_graphics_state()

# generate top page with text via reportlab
packet = BytesIO()
can = canvas.Canvas(packet)
for y in range(50, 750, 15):
	can.drawString(250, y, "Testing, testing, 1, 2, 3...")
can.save()
top_page = PdfReader(packet).pages[0]

# merge
blank_pdf = PdfReader("blank_portrait.pdf")
bottom_page = blank_pdf.pages[0]
bottom_page.merge_page(top_page)

# write out
output = PdfWriter()
output.add_page(bottom_page)
with open("output.pdf", mode='wb') as outputStream:
	output.write(outputStream)

Input PDF file: blank_portrait.pdf
Output PDF file using latest PyPDF release: output.pdf
Output PDF file after fix: output.pdf

For some reason, Adobe Reader only complains about an error on the page once there is more than a certain amount of text on the PDF, hence the many can.drawString calls. The QQ operator is in the output regardless, even if there is no text on the top PDF. So the stream is still malformed even in a super minimal example where Adobe Reader doesn't pop up an error.
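Since Adobe Reader doesn't flag the smallest cases, a rough check on the raw stream bytes can help confirm whether the fused operator is present. This is only a heuristic sketch: suspicious_tokens is a hypothetical helper, it expects the already-decoded content stream bytes, and its handling of literal strings ignores nested parentheses.

```python
import re


def suspicious_tokens(stream: bytes) -> list[bytes]:
    """Return tokens such as b"QQ" that look like fused q/Q operators."""
    # Blank out literal strings "( ... )" so that drawn text is not
    # mistaken for operators. Nested parentheses are not handled.
    without_strings = re.sub(rb"\((?:\\.|[^\\()])*\)", b" ", stream)
    # A valid save/restore operator is a single q or Q, so any longer
    # token made up only of those letters is suspect.
    return [
        tok
        for tok in without_strings.split()
        if re.fullmatch(rb"[qQ]{2,}", tok)
    ]


broken = b"q\nq QQ\nq BT (QQ) Tj ET Q\n"
fixed = b"q\nq Q\nQ\nq BT (QQ) Tj ET Q\n"

print(suspicious_tokens(broken))  # [b'QQ']
print(suspicious_tokens(fixed))   # []
```

The (QQ) text string in both samples is there to show that drawn text is skipped and only the operator outside the string is reported.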

This very basic blank PDF example was produced with macOS's Preview in 2021, but I have a few others in the wild that were printed from Microsoft Word on macOS in 2024 and have the same issue. I just can't share them here without quite a bit of cleanup. I've also seen the same issue with forms generated by printing to PDF with Chrome on macOS.

I'll add the blank PDF here as a test case in my PR.

This was referenced Apr 7, 2024

ENH: Add blank file with q Q content stream py-pdf/sample-files#29

Closed

BUG: Fix merge_page sometimes generating unknown operator 'QQ' #2588

Merged

stefan6419846 added the is-bug and generic labels Apr 7, 2024

stefan6419846 closed this as completed in #2588 Apr 7, 2024

stefan6419846 pushed a commit that referenced this issue Apr 7, 2024

BUG: Fix merge_page sometimes generating unknown operator 'QQ' (#2588)

561b1b0

Fixes #2587
