BUG: UnboundLocalError when iterating on pages of malformed pdf (with strict=True) #2617

Closed
farjasju opened this issue May 2, 2024 · 12 comments · Fixed by #2619
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness

Comments

@farjasju
Contributor

farjasju commented May 2, 2024

An UnboundLocalError: local variable 'generation' referenced before assignment is raised when iterating over the pages of a malformed PDF (for example with len(PdfReader.pages)) when strict=True.

Environment

$ python -m platform
Linux-5.4.0-173-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('pycryptodome', '3.20.0'), PIL=10.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
with open('malformed_pdf.pdf', 'rb') as f:
    doc = PdfReader(f, strict=True)
    len(doc.pages)

The malformed pdf (coming from https://www.columbia.edu/~aw2951/Nations.pdf):
malformed_pdf.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/jules/.local/lib/python3.10/site-packages/pypdf/_page.py", line 2208, in __len__
    return self.length_function()
  File "/home/jules/.local/lib/python3.10/site-packages/pypdf/_doc_common.py", line 353, in get_num_pages
    self._flatten()
  File "/home/jules/.local/lib/python3.10/site-packages/pypdf/_doc_common.py", line 1122, in _flatten
    self._flatten(obj, inherit, **addt)
  File "/home/jules/.local/lib/python3.10/site-packages/pypdf/_doc_common.py", line 1119, in _flatten
    obj = page.get_object()
  File "/home/jules/.local/lib/python3.10/site-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/home/jules/.local/lib/python3.10/site-packages/pypdf/_reader.py", line 416, in get_object
    f"({idnum} {generation})."
UnboundLocalError: local variable 'generation' referenced before assignment
@stefan6419846
Collaborator

Thanks for the report. Do you want to submit a corresponding PR which initializes this value with a default to allow proper error reporting?

@stefan6419846 stefan6419846 added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness labels May 2, 2024
@pubpub-zz
Collaborator

@farjasju as advice: the fix is to add the following after line 402 in _reader.py:
generation = -1

It would be great to add a test using your file to improve coverage: you might be the one who gets us over 95% code coverage 😀
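To illustrate why a default value fixes the crash, here is a minimal sketch of the failure pattern. It is not pypdf's actual code: `parse_header`, `check_header`, the `fixed` flag and the local `PdfReadError` class are hypothetical stand-ins, assuming the real function sets only `idnum` in its error path and then formats both values into the error message.

```python
class PdfReadError(Exception):
    """Local stand-in for pypdf.errors.PdfReadError so the sketch runs without pypdf."""


def parse_header(raw):
    """Hypothetical helper: parse b'21 0 obj' into (idnum, generation)."""
    idnum, generation, keyword = raw.split()  # ValueError if not exactly 3 tokens
    if keyword != b"obj":
        raise ValueError("not an object header")
    return int(idnum), int(generation)


def check_header(raw, expected, fixed=True):
    """Compare a parsed header with the expected (idnum, generation) pair."""
    try:
        idnum, generation = parse_header(raw)
    except ValueError:
        idnum = -1  # sentinel so that the mismatch check below raises
        if fixed:
            generation = -1  # the suggested fix: give generation a default too
    # Without the default, 'generation' is unbound here after a parse failure.
    if (idnum, generation) != expected:
        raise PdfReadError(
            f"Expected object ID ({expected[0]} {expected[1]}) "
            f"does not match actual ({idnum} {generation})."
        )
    return idnum, generation
```

Calling `check_header(b"garbage", (21, 0), fixed=False)` raises UnboundLocalError, while the same call with `fixed=True` raises the intended PdfReadError.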

@farjasju
Contributor Author

farjasju commented May 2, 2024

Thanks for the suggestion! I was trying to understand what this generation corresponds to, and what the best default value would be (I'm totally new to both the codebase and the PDF structure). I'll make a PR.

@farjasju
Contributor Author

farjasju commented May 2, 2024

@pubpub-zz so you confirm that the expected exception when iterating over the pages is the following?

Traceback (most recent call last):
  File "/home/jules/dev/tools/pypdf/test.py", line 4, in <module>
    len(doc.pages)
  File "/home/jules/dev/tools/pypdf/pypdf/_page.py", line 2208, in __len__
    return self.length_function()
  File "/home/jules/dev/tools/pypdf/pypdf/_doc_common.py", line 353, in get_num_pages
    self._flatten()
  File "/home/jules/dev/tools/pypdf/pypdf/_doc_common.py", line 1122, in _flatten
    self._flatten(obj, inherit, **addt)
  File "/home/jules/dev/tools/pypdf/pypdf/_doc_common.py", line 1119, in _flatten
    obj = page.get_object()
  File "/home/jules/dev/tools/pypdf/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/home/jules/dev/tools/pypdf/pypdf/_reader.py", line 414, in get_object
    raise PdfReadError(
pypdf.errors.PdfReadError: Expected object ID (21 0) does not match actual (-1 -1).

@pubpub-zz
Collaborator

Correct! Your PDF is damaged and object 21 cannot be found properly in the file (you can confirm that by reading the file with strict=False).

@farjasju
Contributor Author

farjasju commented May 2, 2024

Thanks! Should I add the file to resources or sample-files?

@pubpub-zz
Collaborator

> Thanks for the suggestion! I was trying to understand what was this generation corresponding to, and what would be the best default value (I'm totally new to both the codebase and the PDF structure). I'll make a PR.

Objects are identified by an ID and a generation (version) number. The generation number makes it possible to detect that an ID has been reused.
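As a rough illustration of the ID/generation pair, the sketch below scans raw bytes for object headers of the form `<id> <generation> obj` and keys each one by both numbers. The `ObjectKey` type, the `find_objects` helper and the sample bytes are made up for this example; real PDF parsing (and pypdf's own implementation) relies on the cross-reference table rather than a plain regex scan.

```python
import re
from typing import Dict, NamedTuple


class ObjectKey(NamedTuple):
    """Hypothetical key type: an object is addressed by both numbers."""
    idnum: int
    generation: int


# Matches headers such as b"21 0 obj"
HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def find_objects(data: bytes) -> Dict[ObjectKey, int]:
    """Map each (idnum, generation) pair to the byte offset of its header."""
    return {
        ObjectKey(int(m.group(1)), int(m.group(2))): m.start()
        for m in HEADER.finditer(data)
    }


# Made-up sample: ID 21 appears twice, with generations 0 and 1.
sample = (
    b"21 0 obj\n<< /Type /Page >>\nendobj\n"
    b"21 1 obj\n<< /Type /Font >>\nendobj\n"
)
objects = find_objects(sample)
```

Here `ObjectKey(21, 0)` and `ObjectKey(21, 1)` are separate entries, which is why a reference such as (21 0) has to match on both numbers.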

@stefan6419846
Collaborator

As long as you do not own any copyright on the file, please have the test download it from the GitHub URL the example was uploaded to, i.e. https://github.com/py-pdf/pypdf/files/15186107/malformed_pdf.pdf

@farjasju
Contributor Author

farjasju commented May 2, 2024

The PDF seems to be a truncated version of this article. I personally do not own any rights to it, so I don't know if it is OK to upload it.

@stefan6419846
Collaborator

As long as you are unsure, please do not use resources and/or sample-files, but a real download URL instead as in my previous comment. As external websites might go offline or introduce rate limits, the GitHub link is preferred.

@farjasju
Contributor Author

farjasju commented May 2, 2024

Sorry for being such a GitHub noob, but I have to push my branch before creating the PR, right? I get a 403 when trying to push it:

$ git push --set-upstream origin unboundlocalerror-with-malformed-file
remote: Permission to py-pdf/pypdf.git denied to farjasju.
fatal: unable to access 'https://github.com/py-pdf/pypdf.git/': The requested URL returned error: 403

EDIT: Okay, maybe it's better to fork the repo first instead of creating the branch on the cloned repo itself

@stefan6419846
Collaborator

Yes, you need to create a fork and push to your fork.

farjasju pushed a commit to farjasju/pypdf-fix-malformed-pdf that referenced this issue May 2, 2024
stefan6419846 pushed a commit that referenced this issue May 2, 2024
Closes #2617

Co-authored-by: jules <jules@harfanglab.fr>