Make selenium-generated PDF readable #321

akolpakov · 2017-02-06T08:31:35Z

Try to find the “%%EOF” marker in the last 1 MB of the file.
In some PDF files, the “%%EOF” marker can be far from the end of the document.
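
The idea can be sketched as follows. This is only an illustration, not the code in this PR: the function name `find_eof_marker` and its `window` parameter are hypothetical, and the sketch assumes a seekable binary stream.

```python
import io
import os

EOF_MARKER = b"%%EOF"


def find_eof_marker(stream, window=1024 * 1024):
    """Return the absolute offset of the last %%EOF found within the
    final `window` bytes of `stream`, or -1 if it is not there."""
    stream.seek(0, os.SEEK_END)
    size = stream.tell()
    start = max(0, size - window)      # never seek before the file start
    stream.seek(start)
    tail = stream.read(size - start)   # at most `window` bytes
    pos = tail.rfind(EOF_MARKER)       # search backwards from the end
    return -1 if pos == -1 else start + pos


# A marker followed by 5000 padding bytes (made-up sample data).
sample = b"%PDF-1.4\n...\n%%EOF\n" + b"\x00" * 5000
offset_1mb = find_eof_marker(io.BytesIO(sample))        # 1 MB window: found
offset_1kb = find_eof_marker(io.BytesIO(sample), 1024)  # 1 KB window: -1
```

With the 1 KB window the marker lies outside the bytes that are read, so the search fails; the 1 MB window covers the whole sample.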

brettlangdon · 2017-05-18T13:16:58Z

@akolpakov thank you! This fixed the problem I was having.

togakangaroo · 2018-12-22T02:56:45Z

Can this get merged? I'm seeing this right now in the most recent pip release

B-Stefan · 2019-02-05T14:49:44Z

Please merge this PR. We ran into the same issue.

reportgunner · 2019-02-08T10:08:54Z

thank you @akolpakov !

mccorkle · 2019-03-03T19:12:48Z

Maybe this will help someone else -- but I ran into this issue when an xls file was included in my PDF list of files. After removing that xls file from the group to be merged, the issue no longer happens in my environment. Check all your

markdoliner · 2020-04-03T18:38:32Z

This is a great change and should be merged in as-is. It attempts to address these issues, which I believe are duplicates of each other: #177 #442 #480

Is this change safe?

I'm not very familiar with the PDF standard, but from what I've read this change is safe. I searched two PDF specifications on Adobe's website for references to %%EOF: https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf (linked from https://www.adobe.com/devnet/pdf/pdf_reference.html with the description "This document is an ISO approved copy of the ISO 32000-1 Standards document") and https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf. It doesn't appear that anything meaningful comes after %%EOF. I've heard that some PDF generators might use this space to store comments or other metadata. Personally, I've seen a PDF that had a bunch of null bytes after %%EOF, and I have no idea why.

The second document that I mentioned above has this interesting note, "Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file." That document is dated 2006. I didn't attempt to ascertain whether that is currently the case.
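
To see whether a given file would pass a 1024-byte check, one can measure how much data follows the last %%EOF. A small sketch, where the helper name `bytes_after_eof` is made up for illustration and the sample data is synthetic:

```python
from typing import Optional

EOF_MARKER = b"%%EOF"


def bytes_after_eof(data: bytes) -> Optional[int]:
    """Return the number of bytes following the last %%EOF marker,
    or None if the marker does not occur at all."""
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        return None
    return len(data) - (pos + len(EOF_MARKER))


# Synthetic example: 2000 null bytes appended after the marker.
padded = b"%PDF-1.4\n%%EOF" + b"\x00" * 2000
trailing = bytes_after_eof(padded)

# The whole marker sits inside the last 1024 bytes only if this holds.
passes_1024_rule = trailing is not None and trailing + len(EOF_MARKER) <= 1024
```

For the padded sample the check fails, which matches the kind of file described above: valid content followed by a long run of trailing bytes.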

I looked at two other open source PDF libraries:

Poppler previously looked for %%EOF in the last 1024 bytes ("we look in the last 1024 chars because Adobe does the same"), however that check was disabled in 2007 (https://gitlab.freedesktop.org/poppler/poppler/-/commit/be1b5a0196cdfc78f74e08a023b477cac16eb0f3) with the comment "Adobe does not seem to enforce %%EOF, so we do the same."

PDFBox is considerably more complicated. The "1024" amount appears to be configurable. It also looks like there's a "lenient" mode where %%EOF is not required.
Source: https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java?view=markup
The commit that made %%EOF optional when using lenient mode is https://svn.apache.org/viewvc?view=revision&revision=1635584
The Issue related to the lenient change is https://issues.apache.org/jira/browse/PDFBOX-2455

Are there other changes that should be made?

Yeah, probably. Why limit to 1MB? Why not look in the entire file? Why require it at all? If it's not there then look for startxref starting at the end of the file (I believe that's what Poppler does).
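
A rough sketch of the startxref fallback mentioned above. It is not taken from Poppler or PyPDF2; the function name `find_startxref` is hypothetical, and it assumes the whole file is already in memory as bytes.

```python
import re
from typing import Optional

# "startxref", whitespace, then the decimal offset of the xref section.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)")


def find_startxref(data: bytes) -> Optional[int]:
    """Return the offset that follows the last `startxref` keyword,
    ignoring whether %%EOF is present, or None if it cannot be found."""
    pos = data.rfind(b"startxref")  # search backwards from the end
    if pos == -1:
        return None
    match = STARTXREF_RE.match(data[pos:pos + 64])
    return int(match.group(1)) if match else None


# Synthetic sample: no %%EOF at all, plus trailing padding.
sample = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\nxref\n...\nstartxref\n29\n" + b"\x00" * 4096
xref_offset = find_startxref(sample)
```

Because the search keys on `startxref` rather than %%EOF, neither a missing marker nor trailing padding of any length prevents the offset from being found.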

Also it's definitely a good idea to add a unit test for this and it should be trivial.

Also the test coverage from the existing tests (https://github.com/mstamy2/PyPDF2/blob/master/Tests/tests.py) is worryingly small.

Other notes

The looping behavior was changed in PR #75. It's possible that change is slightly wrong and caused this problem, but I haven't looked at this code closely enough to be convinced of that.

This GitHub issue is related, and could possibly be resolved by refactoring this loop: #361

MartinThoma · 2022-04-09T06:22:12Z

@markdoliner Thank you for the summary!

MartinThoma · 2022-04-09T06:22:37Z

Does anybody have a small PDF created by selenium that has this issue? I would like to add it to the test suite.

markdoliner · 2022-04-09T12:45:50Z

Hi, @MartinThoma. Sorry I don't think I can share the PDF where I saw this problem since it contains private information :-(

MartinThoma · 2022-04-09T12:48:28Z

Are you maybe able to generate a minimal example with another website? If all selenium-generated PDFs are affected, this should be rather easy. You could https://pypi.org/project/PyPDF2/

guillaume-uH57J9 · 2022-04-14T19:53:51Z

I added a comment there with a link to a test PDF and instructions for generating this kind of test PDF.
#177 (comment)

Bug Fixes (BUG):
- Use 1MB as offset for readNextEndLine (#321)
- 'PdfFileWriter' object has no attribute 'stream' (#787)

Robustness (ROB):
- Invalid float object; use 0 as fallback (#782)

Documentation (DOC):
- Robustness (#785)

Full Changelog: 1.27.7...1.27.8

Try to find “%%EOF” in last 1Mb of file. This fixes the issue with reading Selenium-generated PDF files. Closes py-pdf#177 Closes py-pdf#442 Closes py-pdf#480

Fix py-pdf#177

9c36f77

Try to find “%%EOF” in last 1Mb of file.

beruic mentioned this pull request Feb 14, 2018

PdfReadError: EOF marker not found error when opening pdf files generated from selenium snapshot #177

Closed

shivstha mentioned this pull request May 16, 2020

extract_text works for some PDF files, but not the others #437

Closed

MartinThoma added the is-bug (From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) and Tiny (Pull requests that make a tiny change and thus should be easy to merge) labels Apr 6, 2022

MartinThoma changed the title ~~Fix #177~~ Make selenium-generated PDF readable Apr 9, 2022

MartinThoma added this to the Last PyPDF2 version 1.X release milestone Apr 9, 2022

Merge branch 'master' into issue_177

ded37da

MartinThoma added the needs-test A test should be added before this PR is merged. label Apr 9, 2022

Merge branch 'main' into issue_177

3e3b492

MartinThoma merged commit db1e458 into py-pdf:main Apr 21, 2022
