Will hang on invalid PDFs #77

Closed
wolever opened this issue Mar 7, 2014 · 5 comments
Labels
is-bug: From a user's perspective, this is a bug, a violation of the expected behavior with a compliant PDF
needs-pdf: The issue needs a PDF file to show the problem

Comments

@wolever
Contributor

wolever commented Mar 7, 2014

While doing some testing, I noticed that PyPDF2 will hang when it encounters an invalid PDF. For example, the skipOverComment function:

def skipOverComment(stream):
    tok = stream.read(1)
    stream.seek(-1, 1)
    if tok == b_('%'):
        while tok not in (b_('\n'), b_('\r')):
            tok = stream.read(1)

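This will hang indefinitely if the stream ends before a newline is reached: at end of file, stream.read(1) keeps returning an empty string, which never matches b_('\n') or b_('\r').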
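One possible way to make that loop terminate is to treat an empty read as end of input. The following is only a sketch, not PyPDF2's actual fix: the name `skip_over_comment` and the early return on an empty stream are assumptions made for illustration, and plain bytes literals stand in for the `b_()` helper.

```python
import io


def skip_over_comment(stream):
    """Skip a '%' comment, stopping at a line ending or at end of stream."""
    tok = stream.read(1)
    if not tok:
        # Nothing left to read, so there is nothing to skip or rewind.
        return
    stream.seek(-1, 1)
    if tok == b"%":
        while tok not in (b"\n", b"\r"):
            tok = stream.read(1)
            if not tok:
                # Empty read means EOF: stop instead of looping forever.
                break


# A comment with no trailing newline no longer hangs.
stream = io.BytesIO(b"% truncated comment")
skip_over_comment(stream)
```

The only behavioral change is the empty-read check; a well-formed comment is skipped exactly as before.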

I would propose three courses of action:

  1. Wrap the stream in an object that raises an exception after a certain number of consecutive empty reads, e.g.:
class SafeStream(object):
    """Wrap a stream and raise after too many consecutive empty reads."""

    def __init__(self, stream):
        self.stream = stream
        self.seek = stream.seek
        self.tell = stream.tell
        self._empty_reads = 0

    def read(self, *args):
        res = self.stream.read(*args)
        if not res:  # matches both "" and b"" (end of stream)
            self._empty_reads += 1
            if self._empty_reads > 1000:
                raise Exception("too many empty reads")
        else:
            self._empty_reads = 0
        return res
  2. Add a script for automating fuzz testing to the repo

  3. Fix the bugs as the script from step (2) finds them
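As a rough idea of what the fuzzing script in step (2) could look like, here is a minimal sketch using only the standard library. The names `mutate` and `fuzz` are hypothetical, and `parse` is a placeholder for whatever callable reads a PDF stream; nothing here is part of PyPDF2. Note that this harness only collects inputs that raise. To catch hangs as well, it would need to be combined with the empty-read guard from step (1) or with some kind of timeout.

```python
import io
import random


def mutate(data, rng, max_flips=8):
    """Return a copy of data with a few random bytes overwritten,
    and sometimes with the tail cut off."""
    buf = bytearray(data)
    if not buf:
        return bytes(buf)
    for _ in range(rng.randint(1, max_flips)):
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    if rng.random() < 0.5:
        # Truncate at a random position to simulate incomplete files.
        del buf[rng.randrange(len(buf)):]
    return bytes(buf)


def fuzz(parse, seed_data, iterations=100, seed=0):
    """Call parse(stream) on mutated inputs; return (input, exception) pairs."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = []
    for _ in range(iterations):
        sample = mutate(seed_data, rng)
        try:
            parse(io.BytesIO(sample))
        except Exception as exc:
            failures.append((sample, exc))
    return failures
```

Seeding the generator means a failing input can be regenerated from the seed and iteration count alone, which makes the bugs found in step (3) easier to reproduce.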

What do you think? Would you be open to patches for those?

@mstamy2
Collaborator

mstamy2 commented Mar 10, 2014

Certainly we would be open to such patches. PyPDF2 will often hang on unexpected input (such as variations in syntax), where it should instead throw a proper exception and/or handle the data in the best way possible.

Your proposed plan would definitely help to improve PyPDF2.

@wolever
Contributor Author

wolever commented Mar 10, 2014

:D awesome.

I've got a bit of a crunch at work coming up, but I'll take a look afterwards.

@whitemice

whitemice commented Aug 5, 2015

I usually wrap the processing of a document with PyPDF in some type of timer.

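import signal

from PyPDF2 import PdfFileReader, PdfFileWriter

class TimeOutAlarm(Exception):
    pass

def timeout_alarm_handler(signum, frame):
    raise TimeOutAlarm

class RotationTimeOutException(Exception):
    pass

# SIGALRM is only available on Unix-like systems.
signal.signal(signal.SIGALRM, timeout_alarm_handler)
signal.alarm(15)  # guard the initial parse of the file

# rfile, wfile, counter_clockwise and rotate_degrees are defined by the caller.
reader = PdfFileReader(rfile, strict=False)
writer = PdfFileWriter()

for pagenum in range(reader.numPages):
    try:
        signal.alarm(15)  # restart the 15-second timer for each page
        page = reader.getPage(pagenum)
        if not counter_clockwise:
            page.rotateClockwise(rotate_degrees)
        else:
            page.rotateCounterClockwise(rotate_degrees)
        page.compressContentStreams()
        writer.addPage(page)
        signal.alarm(0)  # page finished in time: cancel the timer
        page = None
    except TimeOutAlarm:
        raise RotationTimeOutException

signal.alarm(0)  # cancel the initial timer if there were no pages
writer.write(wfile)  # write once, after all pages have been processed
writer = None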
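The same timer idea can be packaged as a reusable context manager. This is a sketch under a few assumptions: the name `time_limit` is hypothetical, it relies on `SIGALRM` and so only works on Unix-like systems, and it must be used from the main thread because Python only allows signal handlers to be installed there.

```python
import signal
from contextlib import contextmanager


class TimeOutAlarm(Exception):
    pass


@contextmanager
def time_limit(seconds):
    """Raise TimeOutAlarm if the enclosed block runs longer than `seconds`."""
    def handler(signum, frame):
        raise TimeOutAlarm("operation exceeded %s seconds" % seconds)

    # Install our handler and remember the previous one so it can be restored.
    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

Each page could then be processed inside `with time_limit(15):`, with the `finally` clause guaranteeing that the alarm is cleared even when the page raises some other exception.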

polyglot-jones pushed a commit to polyglot-jones/PyPDF2 that referenced this issue Aug 11, 2020
py-pdf deleted a comment from dhudson1 on Jun 10, 2022
MartinThoma added the needs-pdf label and removed the Parsing label on Jun 26, 2022
@MartinThoma
Member

Does anybody have a PDF showing this issue?

@MartinThoma
Member

If anybody runs into this issue again, please let me know. I'm closing this for the moment.
