Endless Loop When Processing Certain PDF file with PdfFileReader #361

Closed
eghlima opened this issue Jul 25, 2017 · 3 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
PdfReader: The PdfReader component is affected

Comments

@eghlima

eghlima commented Jul 25, 2017

I have attached an invalid PDF file.
When I tried to open it, reading took a very long time (more than 1 hour).
This seems to be a bug in PdfFileReader.
Here is my test to reproduce the bug with PyPDF2==1.26.0 and Python 3.6:

MCVE: Code + PDF

Example PDF: file1.pdf

from PyPDF2 import PdfFileReader

p = PdfFileReader("file1.pdf")  # this line hangs for a very long time
@jerr0328

Sorry for digging this up, but we saw similar results, even with a "PDF" that was a 5MB file with all zeroes.

I'm not sure what exactly the code is doing, but it seems to get stuck in a loop looking for an EOF marker that doesn't exist. I'm not sure if this will ever be fixed in PyPDF2 given the current situation, but it would really be helpful to have PyPDF2 fail fast on these types of files.
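As a stopgap on the caller's side, one way to fail fast is to reject obviously non-PDF input before handing it to the library. The sketch below is not part of PyPDF2: `looks_like_pdf` is a hypothetical helper, and the 1024-byte windows for the `%PDF-` header and the `%%EOF` marker are assumptions chosen for illustration.

```python
import io


def looks_like_pdf(stream, window=1024):
    """Cheap sanity check on a seekable binary stream.

    Hypothetical helper (not a PyPDF2 API): returns True only if a
    "%PDF-" header appears in the first `window` bytes and a "%%EOF"
    marker appears in the last `window` bytes.
    """
    stream.seek(0)
    head = stream.read(window)
    if b"%PDF-" not in head:
        return False

    # Find the file size, then read only the tail of the file.
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - window))
    tail = stream.read()
    return b"%%EOF" in tail


if __name__ == "__main__":
    zeros = io.BytesIO(b"\x00" * (5 * 1024 * 1024))  # the "all zeroes" case
    if not looks_like_pdf(zeros):
        print("rejected: not a PDF")
```

A file that passes this check can still be malformed, so it only filters out the cheapest cases such as the all-zero file described above.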

@ztravis
Contributor

ztravis commented Nov 16, 2021

The bug here is that this library has a method that reads a line byte by byte - effectively doing:

def read_next_line(stream):
    curr = b''
    while in_same_line(stream):
        curr = curr + next_byte(stream)
    return curr

Actually, the real function reads a line backwards, but it's the same idea... Each concatenation copies everything read so far, so this ends up being quadratic in the total length of the line, which is particularly noticeable when the line gets to be very long (e.g. 5MB of null bytes, which, since it has no \r or \n characters, is treated as a single line). I'll open an issue with a fix as well as a way to monkey-patch PdfFileReader to work around it - I don't know if PyPDF2 is still being maintained but at least it'll be out there.
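To make the difference concrete, here is a minimal sketch of a backwards line read done both ways. It is not the library's actual code: both function names are hypothetical, and the stream is assumed to be seekable and opened in binary mode. The first version prepends to a `bytes` object on every step; the second collects the bytes in a list and joins once.

```python
import io


def read_line_backwards_quadratic(stream):
    """Read backwards to the previous \\r or \\n, one byte at a time.

    Prepending to an immutable bytes object copies the whole line so far
    on every step, so the total work grows quadratically.
    """
    line = b""
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)
        byte = stream.read(1)
        stream.seek(-1, io.SEEK_CUR)  # step back over the byte just read
        if byte in (b"\r", b"\n"):
            break
        line = byte + line  # O(len(line)) copy on each iteration
    return line


def read_line_backwards_linear(stream):
    """Same traversal, but bytes are appended to a list and joined once."""
    chunks = []
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)
        byte = stream.read(1)
        stream.seek(-1, io.SEEK_CUR)
        if byte in (b"\r", b"\n"):
            break
        chunks.append(byte)  # amortized O(1)
    chunks.reverse()
    return b"".join(chunks)


if __name__ == "__main__":
    s = io.BytesIO(b"first line\nsecond line")
    s.seek(0, io.SEEK_END)
    print(read_line_backwards_linear(s))
```

Both functions return the same bytes; only the cost differs. Reading in larger blocks instead of single bytes would reduce the per-byte overhead further, but the list-and-join change alone removes the quadratic term.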

@MartinThoma added the is-bug and PdfReader labels Apr 8, 2022
@MartinThoma added the Has MCVE label Jun 27, 2022
@MartinThoma
Member

The file above now fails quickly, and the quadratic reading time has been fixed. For this reason, I'll close this issue.

Please let me know if you encounter it again!
