Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make selenium-generated PDF readable #321

Merged
merged 3 commits into from
Apr 21, 2022
Merged

Conversation

akolpakov
Copy link
Contributor

Try to find “%%EOF” in last 1Mb of file.
In some PDF files, “%%EOF” sign can be far away from the end of document

Try to find “%%EOF” in last 1Mb of file.
@brettlangdon
Copy link

@akolpakov thank you! This fixed the problem I was having.

@togakangaroo
Copy link

Can this get merged? I'm seeing this right now in the most recent pip release

@B-Stefan
Copy link

B-Stefan commented Feb 5, 2019

Please merge this PR. We ran into the same issue.

@reportgunner
Copy link

thank you @akolpakov !

@mccorkle
Copy link

mccorkle commented Mar 3, 2019

Maybe this will help someone else -- but I ran into this issue when an xls file was included in my PDF list of files. After removing that xls file from the group to be merged, the issue no longer happens in my environment. Check all your

@markdoliner
Copy link

This is a great change and should be merged in as-is. It attempts to address these issues, which I believe are duplicates of each other: #177 #442 #480

Is this change safe?

I'm not very familiar with the PDF standard, but from what I've read this change is safe. I've looked through a PDF specification listed on Adobe's website (https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf which is linked to from https://www.adobe.com/devnet/pdf/pdf_reference.html with the description "This document is an ISO approved copy of the ISO 32000-1 Standards document" and https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf) for references of %%EOF and it doesn't appear that there is anything meaningful after %%EOF. I've heard that some PDF generators might use this to store comments or other metadata. Personally I've seen a PDF that had a bunch of null bytes after %%EOF and I have no idea why.

The second document that I mentioned above has this interesting note, "Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file." That document is dated 2006. I didn't attempt to ascertain whether that is currently the case.

I looked at two other open source PDF libraries:

Poppler previously looked for %%EOF in the last 1024 bytes ("we look in the last 1024 chars because Adobe does the same"), however that check was disabled in 2007 (https://gitlab.freedesktop.org/poppler/poppler/-/commit/be1b5a0196cdfc78f74e08a023b477cac16eb0f3) with the comment "Adobe does not seem to enforce %%EOF, so we do the same."

PDFBox is considerably more complicated. The "1024" amount appears to be configurable. It also looks like there's a "lenient" mode where %%EOF is not required.
Source: https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java?view=markup
The commit that made %%EOF optional when using lenient mode is https://svn.apache.org/viewvc?view=revision&revision=1635584
The Issue related to the lenient change is https://issues.apache.org/jira/browse/PDFBOX-2455

Are there other changes that should be made?

Yeah, probably. Why limit to 1MB? Why not look in the entire file? Why require it at all? If it's not there then look for startxref starting at the end of the file (I believe that's what Poppler does).

Also it's definitely a good idea to add a unit test for this and it should be trivial.

Also the test coverage from the existing tests (https://github.com/mstamy2/PyPDF2/blob/master/Tests/tests.py) is worryingly small.

Other notes

The looping behavior was changed in PR #75. It's possible that change is slightly wrong and caused this problem... but I haven't tried to look at this code close enough to be convinced of that.

This GitHub issue is related, and could possibly be resolved by refactoring this loop: #361

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF Tiny Pull requests that make a tiny change - and thus should be easy to merge labels Apr 6, 2022
@MartinThoma MartinThoma changed the title Fix #177 Make selenium-generated PDF readable Apr 9, 2022
@MartinThoma
Copy link
Member

@markdoliner Thank you for the summary!

@MartinThoma
Copy link
Member

Does anybody have a small PDF created by selenium that has this issue? I would like to add it to the test suite.

@MartinThoma MartinThoma added the needs-test A test should be added before this PR is merged. label Apr 9, 2022
@markdoliner
Copy link

Hi, @MartinThoma. Sorry I don't think I can share the PDF where I saw this problem since it contains private information :-(

@MartinThoma
Copy link
Member

Are you maybe able to generate a minimal example with another website? If all selenium-generated PDFs are affected, this should be rather easy. You could https://pypi.org/project/PyPDF2/

@guillaume-uH57J9
Copy link

I added a comment there with a link to a test PDF, and instruction to generate this kind of test PDF.
#177 (comment)

@MartinThoma MartinThoma merged commit db1e458 into py-pdf:main Apr 21, 2022
MartinThoma added a commit that referenced this pull request Apr 21, 2022
Bug Fixes (BUG):
-  Use 1MB as offset for readNextEndLine (#321)
-  'PdfFileWriter' object has no attribute 'stream' (#787)

Robustness (ROB):
-  Invalid float object; use 0 as fallback (#782)

Documentation (DOC):
-  Robustness (#785)

Full Changelog: 1.27.7...1.27.8
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
Try to find “%%EOF” in last 1Mb of file.

This fixes the issue with reading Selenium-generated PDF files.

Closes py-pdf#177
Closes py-pdf#442
Closes py-pdf#480
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
Bug Fixes (BUG):
-  Use 1MB as offset for readNextEndLine (py-pdf#321)
-  'PdfFileWriter' object has no attribute 'stream' (py-pdf#787)

Robustness (ROB):
-  Invalid float object; use 0 as fallback (py-pdf#782)

Documentation (DOC):
-  Robustness (py-pdf#785)

Full Changelog: py-pdf/pypdf@1.27.7...1.27.8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF needs-test A test should be added before this PR is merged. Tiny Pull requests that make a tiny change - and thus should be easy to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants