Will hang on invalid PDFs #77

wolever · 2014-03-07T21:48:08Z

Doing some testing, I noticed that PyPDF2 will hang if it encounters an invalid PDF… for example, the skipOverComment function:

def skipOverComment(stream):
    tok = stream.read(1)
    stream.seek(-1, 1)
    if tok == b_('%'):
        while tok not in (b_('\n'), b_('\r')):
            tok = stream.read(1)

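This will hang indefinitely if the stream ends before a newline is found: at end of file, stream.read(1) returns an empty string, which never matches '\n' or '\r', so the loop never exits.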
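For illustration, a minimal sketch of an EOF-aware variant, assuming a binary stream. The name skip_over_comment_safe is hypothetical and not part of PyPDF2, and plain bytes literals stand in for the b_() helper:

```python
import io


def skip_over_comment_safe(stream):
    """Skip a '%' comment, stopping at end of line or at end of stream."""
    tok = stream.read(1)
    if not tok:
        return  # already at EOF, nothing to skip and nothing to seek back over
    stream.seek(-1, 1)
    if tok == b"%":
        while tok not in (b"\n", b"\r"):
            tok = stream.read(1)
            if not tok:
                break  # EOF inside the comment: stop instead of looping forever


# A comment with no trailing newline no longer hangs.
stream = io.BytesIO(b"% comment without newline")
skip_over_comment_safe(stream)
```

Whether to return silently or raise an exception at EOF is a design choice; this sketch returns, since a comment that runs to the end of the file is harmless.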

I would propose three courses of action:

1. Wrap the stream in a class which raises an exception after a certain number of consecutive empty reads; e.g.:

class SafeStream(object):
    """Stream wrapper that raises after too many consecutive empty reads."""

    def __init__(self, stream):
        self.stream = stream
        self.seek = stream.seek
        self.tell = stream.tell
        self._empty_reads = 0

    def read(self, *args):
        res = self.stream.read(*args)
        if not res:  # matches both "" and b"" (end of stream)
            self._empty_reads += 1
            if self._empty_reads > 1000:
                raise Exception("too many empty reads")
        else:
            self._empty_reads = 0
        return res

2. Add a script for automating fuzz testing to the repo
3. Fix the bugs as the script from step (2) finds them
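The fuzz testing step could look roughly like the sketch below. It is only an outline under assumptions: the helper names (mutate, run_case, fuzz) are hypothetical, the mutation strategy is plain random byte overwriting, and PARSE_SNIPPET assumes a PyPDF2 version that exposes PdfFileReader, numPages and getPage. Each case is parsed in a child process so that a hang can be detected and killed with a timeout.

```python
import os
import random
import subprocess
import sys
import tempfile

# Assumed driver: reads every page of the file given as the first argument.
PARSE_SNIPPET = (
    "import sys\n"
    "from PyPDF2 import PdfFileReader\n"
    "reader = PdfFileReader(open(sys.argv[1], 'rb'), strict=False)\n"
    "for i in range(reader.numPages):\n"
    "    reader.getPage(i)\n"
)


def mutate(data, rng, n_flips=8):
    """Return a copy of data with up to n_flips random bytes overwritten."""
    buf = bytearray(data)
    for _ in range(n_flips):
        if not buf:
            break
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def run_case(data, snippet=PARSE_SNIPPET, timeout=10.0):
    """Run snippet on data in a child process; return 'ok', 'error' or 'hang'."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "case.pdf")
        with open(path, "wb") as fh:
            fh.write(data)
        try:
            proc = subprocess.run(
                [sys.executable, "-c", snippet, path],
                timeout=timeout,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            return "hang"  # the child was killed after the timeout
    return "ok" if proc.returncode == 0 else "error"


def fuzz(seed_data, iterations=100, seed=0, snippet=PARSE_SNIPPET, timeout=10.0):
    """Mutate seed_data repeatedly and collect the inputs that hang the parser."""
    rng = random.Random(seed)
    hangs = []
    for _ in range(iterations):
        case = mutate(seed_data, rng)
        if run_case(case, snippet, timeout) == "hang":
            hangs.append(case)
    return hangs
```

An 'error' result means the parser raised an exception, which is the desired behaviour for invalid input; only the 'hang' cases would need fixing. Calling fuzz() with the bytes of a known-good PDF returns the mutated inputs that timed out, which can then be saved as regression files.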

What do you think? Would you be open to patches for those?


mstamy2 · 2014-03-10T19:43:29Z

Certainly we would be open to such patches. PyPDF2 will often hang on unexpected input (such as variations in syntax), where it should instead throw a proper exception and/or handle the data in the best way possible.

Your proposed plan would definitely help to improve PyPDF2.

wolever · 2014-03-10T20:07:47Z

:D awesome.

I've got a bit of a crunch at work coming up, but I'll take a look afterwards.

whitemice · 2015-08-05T20:36:26Z

I usually wrap the processing of a document with PyPDF in some type of timer.

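import signal

from PyPDF2 import PdfFileReader, PdfFileWriter

class TimeOutAlarm(Exception):
    pass

def timeout_alarm_handler(signum, frame):
    raise TimeOutAlarm

class RotationTimeOutException(Exception):
    pass

# rfile, wfile, counter_clockwise and rotate_degrees come from the caller.
# SIGALRM is only available on Unix.
signal.signal(signal.SIGALRM, timeout_alarm_handler)
signal.alarm(15)

reader = PdfFileReader(rfile, strict=False)
writer = PdfFileWriter()

for pagenum in range(reader.numPages):
    try:
        signal.alarm(15)  # restart the 15 second limit for this page
        page = reader.getPage(pagenum)
        if not counter_clockwise:
            page.rotateClockwise(rotate_degrees)
        else:
            page.rotateCounterClockwise(rotate_degrees)
        page.compressContentStreams()
        writer.addPage(page)
        signal.alarm(0)  # page done, cancel the alarm
        page = None
    except TimeOutAlarm:
        raise RotationTimeOutException

signal.alarm(0)  # make sure no alarm is left pending
writer.write(wfile)  # write once, after all pages have been added
writer = None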
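The same alarm pattern can be packaged as a reusable context manager. This is a sketch only: time_limit and TimeLimitExceeded are hypothetical names, and it relies on SIGALRM, so it assumes a Unix system and code running in the main thread.

```python
import signal
from contextlib import contextmanager


class TimeLimitExceeded(Exception):
    """Raised when the guarded block runs longer than allowed."""


@contextmanager
def time_limit(seconds):
    """Raise TimeLimitExceeded if the with-block runs longer than seconds."""

    def handler(signum, frame):
        raise TimeLimitExceeded("exceeded %s seconds" % seconds)

    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # accepts fractional seconds
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

Each per-page step could then be wrapped as "with time_limit(15):" around the reader.getPage(...) call, with the timer cancelled and the previous handler restored even when the block raises.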


MartinThoma · 2022-06-27T21:21:52Z

Does anybody have a PDF showing this issue?

MartinThoma · 2022-07-10T05:21:12Z

If anybody runs into this issue again, please let me know. I'm closing this for the moment.

mstamy2 added the Bug label on Jun 5, 2014

polyglot-jones pushed a commit to polyglot-jones/PyPDF2 that referenced this issue Aug 11, 2020

Revert "Multiple Fix and enhancement (py-pdf#75)" (py-pdf#77)

a3deffb

This reverts commit f72745e.

py-pdf deleted a comment from dhudson1 Jun 10, 2022

MartinThoma added the needs-pdf label (The issue needs a PDF file to show the problem) and removed the Parsing label on Jun 26, 2022

MartinThoma closed this as completed Jul 10, 2022
