Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix image look-up table in EncodedStreamObject #2128

Merged
merged 3 commits into from
Sep 3, 2023

Conversation

pubpub-zz
Copy link
Collaborator

closes #2124
closes #2110

@codecov
Copy link

codecov bot commented Aug 28, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (fe2dfaf) 94.10% compared to head (245ee3c) 94.10%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2128   +/-   ##
=======================================
  Coverage   94.10%   94.10%           
=======================================
  Files          41       41           
  Lines        7460     7460           
  Branches     1474     1474           
=======================================
  Hits         7020     7020           
  Misses        274      274           
  Partials      166      166           
Files Changed Coverage Δ
pypdf/filters.py 94.52% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma all yours

@stefan6419846
Copy link
Collaborator

Won't this still raise "strange" errors for invalid lookup tables due to assuming that None is a list?

@pubpub-zz
Copy link
Collaborator Author

the issue was due to an improper extraction of the list (this was a bug): I'm not expecting the error to appear anymore.

@stefan6419846
Copy link
Collaborator

You mean that when line 905 is executed (id est we have an invalid lookup table), neither the L nor the CMYK color mode will be present and thus never trigger the failing lines 908 and 915-916? This sounds unreasonable to me without digging to deep to be honest.

pypdf/pypdf/filters.py

Lines 900 to 926 in 89eb626

img = img.convert(conv)
if len(lookup) != (hival + 1) * nb:
logger_warning(
f"Invalid Lookup Table in {obj_as_text}", __name__
)
lookup = None
if mode == "L":
# gray lookup does not work : it is converted to a similar RGB lookup
lookup = b"".join([bytes([b, b, b]) for b in lookup])
mode = "RGB"
# TODO : cf https://github.com/py-pdf/pypdf/pull/2039
# this is a work around until PIL is able to process CMYK images
elif mode == "CMYK":
_rgb = []
for _c, _m, _y, _k in (
lookup[n : n + 4]
for n in range(0, 4 * (len(lookup) // 4), 4)
):
_r = int(255 * (1 - _c / 255) * (1 - _k / 255))
_g = int(255 * (1 - _m / 255) * (1 - _k / 255))
_b = int(255 * (1 - _y / 255) * (1 - _k / 255))
_rgb.append(bytes((_r, _g, _b)))
lookup = b"".join(_rgb)
mode = "RGB"
if lookup is not None:
img.putpalette(lookup, rawmode=mode)
img = img.convert("L" if base == ColorSpaces.DEVICE_GRAY else "RGB")

@pubpub-zz
Copy link
Collaborator Author

You mean that when line 905 is executed (id est we have an invalid lookup table), neither the L nor the CMYK color mode will be present and thus never trigger the failing lines 908 and 915-916? This sounds unreasonable to me without digging to deep to be honest.

I missed it: you are right!

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma stdby

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Aug 29, 2023

I had issues with test_watermarking_speed going into timeout : is the duration just on the edge ?
at least now it's ok
@MartinThoma all yours

@pubpub-zz pubpub-zz closed this Aug 29, 2023
@pubpub-zz pubpub-zz reopened this Aug 29, 2023
@stefan6419846
Copy link
Collaborator

I had issues with test_watermarking_speed going into timeout : is the duration just on the edge ?

Running only the three watermarking tests (including the not yet added one) I could observe a total runtime of four seconds, which should not really be an issue - although download speed might play into this as well, but the corresponding files should be cached in theory.

@pubpub-zz
Copy link
Collaborator Author

Running only the three watermarking tests (including the not yet added one) I could observe a total runtime of four seconds, which should not really be an issue - although download speed might play into this as well, but the corresponding files should be cached in theory.

@stefan6419846 / @MartinThoma
I would expect pytest.mark.timeout to be used to prevent infinite looping. What about modifying test with a time measurement such as in test_reader.py / test_do_not_get_stuck_on_large_files_without_start_xref:

pypdf/tests/test_reader.py

Lines 643 to 656 in 89eb626

def test_do_not_get_stuck_on_large_files_without_start_xref():
"""
Tests for the absence of a DoS bug, where a large file without an
startxref mark would cause the library to hang for minutes to hours.
"""
start_time = time.time()
broken_stream = BytesIO(b"\0" * 5 * 1000 * 1000)
with pytest.raises(PdfReadError):
PdfReader(broken_stream)
parse_duration = time.time() - start_time
# parsing is expected take less than a second on a modern cpu, but include
# a large tolerance to account for busy or slow systems
assert parse_duration < 60

@MartinThoma
Copy link
Member

I would expect pytest.mark.timeout to be used to prevent infinite looping

Yes, but not only that. I think we should have a general timeout for every test that is well above what we would expect. Something like 30s.

Some tests should be able to increase that if we know they are slow.

some tests should reduce it if we know they should be super fast and have potential for a regression from linear runtime to quadratic runtime or similar.

I want to avoid that CI becomes flaky just because the CI machines are slow and the test is right at the edge.

$ pytest tests/test_reader.py::test_do_not_get_stuck_on_large_files_without_start_xref --durations 0

================================================== test session starts ==================================================
platform linux -- Python 3.11.1, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/moose/Github/py-pdf/pypdf
configfile: pyproject.toml
plugins: benchmark-4.0.0, typeguard-3.0.2, socket-0.6.0, cov-4.1.0, timeout-2.1.0
collected 1 item                                                                                                        

tests/test_reader.py .                                                                                            [100%]

=================================================== slowest durations ===================================================
0.33s call     tests/test_reader.py::test_do_not_get_stuck_on_large_files_without_start_xref

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
=================================================== 1 passed in 0.44s ===================================================

That looks pretty fine to me. We might want to use @pytest.mark.timeout(60) instead of the assertion within the test, but the timeout is for sure not too low. We could even reduce it to ~10s.

@MartinThoma MartinThoma changed the title BUG: fix image look-up table in EncodedStreamObject BUG: Fix image look-up table in EncodedStreamObject Sep 3, 2023
@MartinThoma MartinThoma merged commit af41173 into py-pdf:main Sep 3, 2023
25 of 26 checks passed
@MartinThoma
Copy link
Member

Very nice @pubpub-zz 🎉

I hope I find a couple of free weekends during the colder months to get below 30 open issues 🤞

MartinThoma added a commit that referenced this pull request Sep 3, 2023
## What's new

### Bug Fixes (BUG)
-  Cope with missing /I in articles (#2134)
-  Fix image look-up table in EncodedStreamObject (#2128)
-  remove_images not operating in sub level forms (#2133)

### Robustness (ROB)
-  Cope with damaged PDF (#2129)

### Documentation (DOC)
-  How to take ownership (#2123)

### Developer Experience (DEV)
-  Download PDFs before executing the tests to not run into timeouts (#2143)
-  Add workflow_dispatch to CI (#2145)
-  Automatically create release message / tag message (#2127)
-  Ensure the REL commit message is consistently created (#2126)

### Testing (TST)
-  Add test for correct rendering of watermarks (#2130)

[Full Changelog](3.15.4...3.15.5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug when getting images from a pdf Invalid image lookup tables not handled correctly
3 participants