ROB: ignore_eof everywhere for read_until_regex #1521

Merged (1 commit) on Jan 21, 2023


rraval
Contributor

@rraval rraval commented Dec 29, 2022

This was initially motivated by NumberObject.read_from_stream, which was calling read_until_regex with the default value of ignore_eof=False and thus raising exceptions like:

PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

431ba70 demonstrates a similar fix for NameObject.read_from_stream.

From discussion in #1505, it was realized that the change to NumberObject.read_from_stream had now made ALL callers of read_until_regex pass ignore_eof=True. It's cleaner to remove the parameter entirely and change the default behaviour.
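To make the `ignore_eof` distinction concrete, here is a minimal sketch of a chunked "read until the pattern matches" helper. It is a simplified illustration and not the library's actual code: `StreamEndedError` and `NOT_A_NUMBER_BYTE` are hypothetical names introduced for this example, and the 16-byte chunk size is an arbitrary choice.

```python
import io
import re


class StreamEndedError(Exception):
    """Hypothetical stand-in for the library's stream error."""


def read_until_regex(stream, regex, ignore_eof=False):
    """Read bytes until `regex` matches, leaving the stream at the match."""
    collected = b""
    while True:
        chunk = stream.read(16)
        if not chunk:
            # End of stream reached before the delimiter pattern matched.
            if ignore_eof:
                return collected
            raise StreamEndedError("Stream has ended unexpectedly")
        match = regex.search(chunk)
        if match is not None:
            collected += chunk[: match.start()]
            # Rewind so the delimiter itself is left unread.
            stream.seek(match.start() - len(chunk), io.SEEK_CUR)
            return collected
        collected += chunk


# Illustrative pattern: the first byte that cannot belong to a number token.
NOT_A_NUMBER_BYTE = re.compile(rb"[^+\-.0-9]")

# A number followed by a delimiter parses in either mode.
terminated = read_until_regex(io.BytesIO(b"42 "), NOT_A_NUMBER_BYTE)

# A number that is the very last thing in the stream only parses when
# end-of-stream is tolerated.
unterminated = read_until_regex(
    io.BytesIO(b"42"), NOT_A_NUMBER_BYTE, ignore_eof=True
)
```

Under this sketch, a stream that ends immediately after a number raises with `ignore_eof=False` and returns the collected bytes with `ignore_eof=True`, which is the behaviour this PR proposes to make unconditional.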

@codecov

codecov bot commented Dec 29, 2022

Codecov Report

Base: 91.79% // Head: 91.79% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (8417a83) compared to base (e7e4ffc).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1521      +/-   ##
==========================================
- Coverage   91.79%   91.79%   -0.01%     
==========================================
  Files          33       33              
  Lines        6093     6091       -2     
  Branches     1200     1199       -1     
==========================================
- Hits         5593     5591       -2     
  Misses        323      323              
  Partials      177      177              
Impacted Files                       Coverage Δ
pypdf/_utils.py                      97.43% <100.00%> (-0.03%) ⬇️
pypdf/generic/_base.py               99.64% <100.00%> (ø)
pypdf/generic/_data_structures.py    89.98% <100.00%> (ø)


@pubpub-zz
Collaborator

@rraval
For my understanding, do you have a PDF file that shows a failure without this fix?

@MartinThoma
Member

That would also help me. At the moment, I don't know what to do with those PRs.

@rraval
Contributor Author

rraval commented Jan 9, 2023

@pubpub-zz yes, I have code that looks like this:

from PyPDF2 import PdfMerger
from pathlib import Path

bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")

merger = PdfMerger(strict=True)
merger.append(str(bad_pdf))
merger.write(str(output_pdf))

Which produces a traceback like this:

Invalid stream (index 0) within object 50 0: Stream has ended unexpectedly
Traceback (most recent call last):
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1136, in _get_object_from_stream
    obj = read_object(stream_data, self)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 870, in read_object
    return NumberObject.read_from_stream(stream)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/generic/_base.py", line 292, in read_from_stream
    num = read_until_regex(stream, NumberObject.NumberPattern)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_utils.py", line 159, in read_until_regex
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rraval/encircle/pdf-repro/repro.py", line 9, in <module>
    merger.write(str(output_pdf))
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_merger.py", line 312, in write
    my_file, ret_fileobj = self.output.write(fileobj)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
    self.write_stream(stream)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
    self._sweep_indirect_references(self._root)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1167, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1148, in _get_object_from_stream
    raise PdfReadError(f"Can't read object stream: {exc}")
PyPDF2.errors.PdfReadError: Can't read object stream: Stream has ended unexpectedly

Unfortunately I cannot provide bad.pdf, it contains private information. Visual inspection reveals it to be 6 pages of images (i.e. a scanned document) with a Docusign envelope, so this specific file has probably been through a gamut of PDF processors before arriving in this state.


Interestingly, if I run with PdfMerger(strict=False) like so:

from PyPDF2 import PdfMerger
from pathlib import Path

bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")

merger = PdfMerger(strict=False)  # <-- this line changed
merger.append(str(bad_pdf))
merger.write(str(output_pdf))

... the exceptions are turned into warnings:

Invalid stream (index 0) within object 50 0: Stream has ended unexpectedly
Invalid stream (index 0) within object 49 0: Stream has ended unexpectedly
Invalid stream (index 0) within object 48 0: Stream has ended unexpectedly
Invalid stream (index 0) within object 47 0: Stream has ended unexpectedly

... and the resulting PDF is valid with the following quirks:

  • Page 1 appears to be visually the same as the original
  • Pages 2, 3, 4, and 5 are visually blank
  • Page 6 appears to be visually the same as the original

With this patch applied, I get no warnings and no blank pages (all 6 pages are visually equivalent to the original).

@MartinThoma
Member

Does https://pypdf.readthedocs.io/en/latest/user/robustness.html help you?

The warnings exist to let the user know that the PDF is broken. You can mute them in the user code, but I would not want to remove them (unless they are actually wrong). And strict=False is the default precisely because most users probably don't want strict behavior.
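For readers wondering what muting the warnings in user code could look like, here is a minimal sketch using only the standard `logging` module. It assumes the library emits its warnings through loggers whose names start with `pypdf` (for the older package name seen in the traceback above, that would be `PyPDF2`); the child logger name `pypdf._reader` is used purely for illustration.

```python
import logging

# Assumption: warnings are emitted via standard logging under "pypdf.*".
# Raising the parent logger's level hides WARNING records but keeps errors.
logging.getLogger("pypdf").setLevel(logging.ERROR)

# Child loggers inherit the effective level from their parent,
# so this WARNING record is dropped instead of being printed.
reader_logger = logging.getLogger("pypdf._reader")
reader_logger.warning("Invalid stream (index 0) within object 50 0")
```

This only silences the messages. It does not change how a broken file is parsed, so the blank pages described above would still be produced without the patch.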

@rraval
Contributor Author

rraval commented Jan 10, 2023

> Does https://pypdf.readthedocs.io/en/latest/user/robustness.html help you?

Not directly, no. But it's good to have a clear definition of the semantics that pypdf intends. Quoting from that page:

> Choosing strict=True means that pypdf will raise an exception if a PDF does not follow the specification.

I do not know the PDF spec well (... or at all really), but does a stream termination in this fashion explicitly violate the PDF specification? Or is the pypdf implementation being overly strict here?

If this is indeed a PDF specification violation, then it strikes me as odd that 431ba70 for NameObject is classified as a bug but applying the same ignore_eof semantics for NumberObject is considered a misfeature (but again... I'm willing to accept that PDF is byzantine and inconsistent).

For what it's worth, I have not been able to find a PDF viewer or processor that fails on the motivating PDF file. I've tried processing it via gs and qpdf, and viewing it with firefox and okular. But it is plausible that all of those are general-purpose tools that are overly lax and that pypdf is right to stick to its guns here.

@pubpub-zz
Collaborator

@rraval
would you agree to share the file privately with @MartinThoma (info@martin-thoma.de)?

@rraval
Contributor Author

rraval commented Jan 11, 2023

> @rraval would you agree to share the file privately with @MartinThoma (info@martin-thoma.de)?

I've done as requested. I apologize for not sharing the file here but the file contains sensitive information and I do not have the rights to publish it broadly.

@MartinThoma
Member

Thank you @rraval for trusting us.

Having example files helps us (well, @pubpub-zz to be honest - I'm mostly just doing sanity checks / releases / answering community questions 😄 )

@pubpub-zz
Collaborator

@MartinThoma
Ready for merging

@MartinThoma MartinThoma merged commit 53645ef into py-pdf:main Jan 21, 2023
@MartinThoma
Member

Thank you for your contributions @rraval 🙏 This change will be part of the release tomorrow

@pubpub-zz Thank you for helping me with this one again ❤️

MartinThoma added a commit that referenced this pull request Jan 22, 2023
New Features (ENH):
-  Add page label support to PdfWriter (#1558)
-  Accept inline images with space before EI (#1552)
-  Add circle annotation support (#1556)
-  Add polygon annotation support (#1557)
-  Make merging pages produce a deterministic PDF (#1542, #1543)

Bug Fixes (BUG):
-  Fix error in cmap extraction (#1544)
-  Remove erroneous assertion check (#1564)
-  Fix dictionary access of optional page label keys (#1562)

Robustness (ROB):
-  Set ignore_eof=True for read_until_regex (#1521)

Documentation (DOC):
-  Paper size (#1550)

Developer Experience (DEV):
-  Fix broken combination of dependencies of docs.txt
-  Annotate tests appropriately (#1551)

Full Changelog: 3.2.1...3.3.0