Endless Loop When Processing Certain PDF file with PdfFileReader #361

eghlima · 2017-07-25T07:24:00Z

I have attached an invalid PDF file.
When I tried to open it,
reading the file took a very long time (more than 1 hour).
This seems to be a bug in PdfFileReader.
Here is my test to reproduce the bug on PyPDF2==1.26.0 and Python 3.6:

MCVE: Code + PDF

Example PDF: file1.pdf

from PyPDF2 import PdfFileReader

p = PdfFileReader("file1.pdf")  # this call takes a very long time to return

jerr0328 · 2019-02-26T16:38:46Z

Sorry for digging this up, but we saw similar results, even with a "PDF" that was a 5MB file with all zeroes.

I'm not sure what exactly the code is doing, but it seems to get stuck in a loop looking for an EOF marker that doesn't exist. I'm not sure if this will ever be fixed in PyPDF2 given the current situation, but it would really be helpful to have PyPDF2 fail fast on these types of files.
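As a stopgap on the caller's side, a cheap pre-check can reject such files before they ever reach the parser. This is only a rough sketch and not part of PyPDF2: `looks_like_pdf` and `tail_size` are made-up names, and the 1024-byte search windows are an assumption about where the `%PDF-` header and `%%EOF` marker normally sit.

```python
def looks_like_pdf(data: bytes, tail_size: int = 1024) -> bool:
    """Heuristic pre-check (hypothetical helper, not a PyPDF2 API).

    Returns True only if a "%PDF-" header appears near the start of the
    data and a "%%EOF" marker appears near the end. The window sizes are
    assumptions chosen for illustration.
    """
    has_header = b"%PDF-" in data[:1024]
    has_eof = b"%%EOF" in data[-tail_size:]
    return has_header and has_eof


# A buffer of null bytes is rejected without any parsing.
all_zeroes = b"\x00" * 100_000
print(looks_like_pdf(all_zeroes))
```

A file that fails this check can be rejected immediately. Passing it does not mean the file is valid, only that it is worth handing to the reader.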

ztravis · 2021-11-16T00:50:41Z

The bug here is that this library has a method that reads a line byte by byte - effectively doing:

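def read_next_line(stream):
    curr = b''
    while True:
        byte = stream.read(1)  # one byte per iteration
        if byte in (b'', b'\r', b'\n'):  # end of stream or end of line
            break
        # Each concatenation copies everything read so far: O(n^2) overall.
        curr = curr + byte
    return curr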

Actually, the real function reads a line backwards, but it's the same idea... This ends up being quadratic in the total length of the line, which is particularly noticeable when the line gets to be very long (e.g. 5MB of null bytes, which, since it has no \r or \n characters, is treated as a single line). I'll open an issue with a fix as well as a way to monkey-patch PdfFileReader to work around it - I don't know if PyPDF2 is still being maintained but at least it'll be out there.
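For illustration, here is a minimal sketch of a linear-time way to read a line backwards. It is not the code from PyPDF2 or from any actual fix: `read_line_backwards` and `chunk_size` are names invented for this example. The idea is to read fixed-size chunks, collect them in a list, and join once at the end instead of concatenating byte by byte.

```python
import io


def read_line_backwards(stream, chunk_size=4096):
    """Return the bytes between the previous line break and the current position.

    Hypothetical helper for illustration. Afterwards the stream is positioned
    at the start of the returned line (or at offset 0 if no line break exists).
    """
    pos = stream.tell()
    chunks = []  # collected in reverse order, joined once at the end
    while pos > 0:
        start = max(0, pos - chunk_size)
        stream.seek(start)
        chunk = stream.read(pos - start)
        # Position of the last \n or \r in this chunk, or -1 if there is none.
        cut = max(chunk.rfind(b"\n"), chunk.rfind(b"\r"))
        if cut != -1:
            chunks.append(chunk[cut + 1:])
            pos = start + cut + 1
            break
        chunks.append(chunk)
        pos = start
    stream.seek(pos)
    return b"".join(reversed(chunks))


# Usage: read the last line of an in-memory file.
buf = io.BytesIO(b"abc\ndef")
buf.seek(0, io.SEEK_END)
last_line = read_line_backwards(buf)
```

With this approach each byte is read and copied a constant number of times, so a multi-megabyte "line" with no \r or \n costs time proportional to its length rather than to its square.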

MartinThoma · 2022-06-27T19:26:01Z

The file above fails pretty soon and the quadratic reading time was fixed. For this reason, I'll close this issue.

Please let me know if you encounter it again!

markdoliner mentioned this issue Apr 3, 2020

Make selenium-generated PDF readable #321

Merged

MartinThoma added the is-bug and PdfReader labels Apr 8, 2022

MartinThoma added the Has MCVE label Jun 27, 2022

MartinThoma closed this as completed Jun 27, 2022
