ROB: ignore_eof everywhere for read_until_regex #1521

rraval · 2022-12-29T16:27:00Z

This was initially motivated by NumberObject.read_from_stream, which was calling read_until_regex with the default value of ignore_eof=False and thus raising exceptions like:

PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

431ba70 demonstrates a similar fix for NameObject.read_from_stream.

From discussion in #1505, it was realized that the change to NumberObject.read_from_stream had now made ALL callers of read_until_regex pass ignore_eof=True. It's cleaner to remove the parameter entirely and change the default behaviour.
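To make the proposed default concrete, here is a minimal sketch of a reader that treats end-of-stream as a terminator instead of raising. It is a simplified illustration, not the library's code: the `NUMBER_END` pattern and the 16-byte chunk size are assumptions chosen for the example.

```python
import io
import re

# Hypothetical delimiter pattern: any byte that cannot be part of a number.
NUMBER_END = re.compile(rb"[^+\-.0-9]")


def read_until_regex(stream, regex):
    """Read bytes until regex matches; at EOF, return what was read so far."""
    collected = b""
    while True:
        chunk = stream.read(16)
        if not chunk:
            # End of stream: treat it as a terminator instead of raising.
            return collected
        match = regex.search(chunk)
        if match is not None:
            collected += chunk[: match.start()]
            # Rewind so the delimiter itself is left unread.
            stream.seek(match.start() - len(chunk), io.SEEK_CUR)
            return collected
        collected += chunk


# A number that runs right up to the end of the stream is returned as-is.
print(read_until_regex(io.BytesIO(b"12345"), NUMBER_END))  # b'12345'
```

With a raising variant, the last call would fail because no delimiter ever follows the digits.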

codecov · 2022-12-29T16:35:33Z

Codecov Report

Base: 91.79% // Head: 91.79% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (8417a83) compared to base (e7e4ffc).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1521      +/-   ##
==========================================
- Coverage   91.79%   91.79%   -0.01%     
==========================================
  Files          33       33              
  Lines        6093     6091       -2     
  Branches     1200     1199       -1     
==========================================
- Hits         5593     5591       -2     
  Misses        323      323              
  Partials      177      177

Impacted Files	Coverage Δ
pypdf/_utils.py	`97.43% <100.00%> (-0.03%)`	⬇️
pypdf/generic/_base.py	`99.64% <100.00%> (ø)`
pypdf/generic/_data_structures.py	`89.98% <100.00%> (ø)`


pubpub-zz · 2023-01-09T21:16:56Z

@rraval
For my understanding, do you have a PDF file that fails without this fix?

MartinThoma · 2023-01-09T21:40:54Z

That would also help me. At the moment, I don't know what to do with those PRs.

rraval · 2023-01-09T22:28:48Z

@pubpub-zz yes, I have code that looks like this:

from PyPDF2 import PdfMerger
from pathlib import Path

bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")

merger = PdfMerger(strict=True)
merger.append(str(bad_pdf))
merger.write(str(output_pdf))

Which produces a traceback like this:

Invalid stream (index 0) within object 50 0: Stream has ended unexpectedly
Traceback (most recent call last):
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1136, in _get_object_from_stream
    obj = read_object(stream_data, self)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 870, in read_object
    return NumberObject.read_from_stream(stream)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/generic/_base.py", line 292, in read_from_stream
    num = read_until_regex(stream, NumberObject.NumberPattern)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_utils.py", line 159, in read_until_regex
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rraval/encircle/pdf-repro/repro.py", line 9, in <module>
    merger.write(str(output_pdf))
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_merger.py", line 312, in write
    my_file, ret_fileobj = self.output.write(fileobj)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
    self.write_stream(stream)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
    self._sweep_indirect_references(self._root)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1167, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "/home/rraval/encircle/.local/poetry-envs/encircle-0tR0fh4Y-py3.9/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1148, in _get_object_from_stream
    raise PdfReadError(f"Can't read object stream: {exc}")
PyPDF2.errors.PdfReadError: Can't read object stream: Stream has ended unexpectedly

Unfortunately, I cannot provide bad.pdf because it contains private information. Visual inspection reveals it to be 6 pages of images (i.e. a scanned document) with a DocuSign envelope, so this specific file has probably been through a whole series of PDF processors before arriving in this state.

Interestingly, if I run with PdfMerger(strict=False) like so:

from PyPDF2 import PdfMerger
from pathlib import Path

bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")

merger = PdfMerger(strict=False)  # <-- this line changed
merger.append(str(bad_pdf))
merger.write(str(output_pdf))

... the exceptions are turned into warnings:

Invalid stream (index 0) within object 50 0: Stream has ended unexpectedly
Invalid stream (index 0) within object 49 0: Stream has ended unexpectedly
Invalid stream (index 0) within object 48 0: Stream has ended unexpectedly
Invalid stream (index 0) within object 47 0: Stream has ended unexpectedly

... and the resulting PDF is valid with the following quirks:

Page 1 appears to be visually the same as the original
Pages 2, 3, 4, and 5 are visually blank
Page 6 appears to be visually the same as the original

With this patch applied, I get no warnings and no blank pages (all 6 pages are visually equivalent to the original).
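The strict versus non-strict difference described above can be sketched as follows. This is a hypothetical illustration of the general pattern, not PyPDF2's code: `StreamError`, `load_object`, and the logger name are stand-ins invented for the example. Returning a placeholder in lenient mode is plausibly why some pages come out blank.

```python
import logging

logger = logging.getLogger("strict_sketch")  # hypothetical logger name


class StreamError(Exception):
    """Stand-in for a parse error; not the PyPDF2 exception class."""


def load_object(parse, strict):
    """Run parse(); in strict mode re-raise, otherwise warn and return None."""
    try:
        return parse()
    except StreamError as exc:
        if strict:
            raise
        logger.warning("Invalid stream: %s", exc)
        return None  # the caller sees a missing object instead of an error


def truncated():
    """Simulate a parser that hits the end of the stream too early."""
    raise StreamError("Stream has ended unexpectedly")
```

Calling `load_object(truncated, strict=False)` logs a warning and yields `None`, while `strict=True` propagates the exception.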

MartinThoma · 2023-01-10T10:23:05Z

Does https://pypdf.readthedocs.io/en/latest/user/robustness.html help you?

The warnings exist to let the user know that the PDF is broken. You can mute them in the user code, but I would not want to remove them (unless they are actually wrong). strict=False is the default precisely because most users probably don't want strict behavior.
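For readers who want to mute the warnings in their own code, a minimal sketch using the standard `logging` module follows. It assumes the library's modules log under names such as "PyPDF2._reader" (an assumption for the PyPDF2 package shown in the traceback above; newer releases use the "pypdf" name).

```python
import logging

# Assumption: library modules log under names like "PyPDF2._reader", so
# raising the level on the parent "PyPDF2" logger covers all of them.
logging.getLogger("PyPDF2").setLevel(logging.ERROR)
```

Warnings are then suppressed while errors are still reported.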

rraval · 2023-01-10T15:21:37Z

> Does https://pypdf.readthedocs.io/en/latest/user/robustness.html help you?

Not directly, no. But it's good to have a clear definition of the semantics that pypdf intends. Quoting from that page:

> Choosing strict=True means that pypdf will raise an exception if a PDF does not follow the specification.

I do not know the PDF spec well (... or at all really), but does a stream termination in this fashion explicitly violate the PDF specification? Or is the pypdf implementation being overly strict here?

If this is indeed a PDF specification violation, then it strikes me as odd that 431ba70 for NameObject is classified as a bug but applying the same ignore_eof semantics for NumberObject is considered a misfeature (but again... I'm willing to accept that PDF is byzantine and inconsistent).

For what it's worth, I have not been able to find a PDF viewer or processor that fails on the motivating PDF file: I've tried processing it with gs and qpdf, and viewing it with Firefox and Okular. It is plausible, though, that all of those are general-purpose tools that are overly lax and that pypdf is right to stick to its guns here.

pubpub-zz · 2023-01-10T17:50:58Z

@rraval
would you agree to share the file privately with @MartinThoma (info@martin-thoma.de)?

rraval · 2023-01-11T02:38:59Z

> @rraval would you agree to share the file privately with @MartinThoma (info@martin-thoma.de)?

I've done as requested. I apologize for not sharing the file here but the file contains sensitive information and I do not have the rights to publish it broadly.

MartinThoma · 2023-01-11T16:23:21Z

Thank you @rraval for trusting us.

Having example files helps us (well, @pubpub-zz to be honest - I'm mostly just doing sanity checks / releases / answering community questions 😄 )

pubpub-zz · 2023-01-21T12:07:39Z

@MartinThoma
Ready for merging

MartinThoma · 2023-01-21T12:27:01Z

Thank you for your contributions @rraval 🙏 This change will be part of the release tomorrow

@pubpub-zz Thank you for helping me with this one again ❤️

New Features (ENH):
- Add page label support to PdfWriter (#1558)
- Accept inline images with space before EI (#1552)
- Add circle annotation support (#1556)
- Add polygon annotation support (#1557)
- Make merging pages produce a deterministic PDF (#1542, #1543)

Bug Fixes (BUG):
- Fix error in cmap extraction (#1544)
- Remove erroneous assertion check (#1564)
- Fix dictionary access of optional page label keys (#1562)

Robustness (ROB):
- Set ignore_eof=True for read_until_regex (#1521)

Documentation (DOC):
- Paper size (#1550)

Developer Experience (DEV):
- Fix broken combination of dependencies of docs.txt
- Annotate tests appropriately (#1551)

Full Changelog: 3.2.1...3.3.0

rraval mentioned this pull request Dec 29, 2022

ROB: Ignore EOF in NumberObject.read_from_stream #1505

Closed

rraval force-pushed the ignore-eof-everywhere branch from cd70bae to 8417a83 Compare January 9, 2023 19:20

MartinThoma merged commit 53645ef into py-pdf:main Jan 21, 2023
