BUG: Fix image look-up table in EncodedStreamObject #2128

pubpub-zz · 2023-08-28T20:47:51Z

closes #2124
closes #2110

codecov · 2023-08-28T21:00:49Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (fe2dfaf) 94.10% compared to head (245ee3c) 94.10%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2128   +/-   ##
=======================================
  Coverage   94.10%   94.10%           
=======================================
  Files          41       41           
  Lines        7460     7460           
  Branches     1474     1474           
=======================================
  Hits         7020     7020           
  Misses        274      274           
  Partials      166      166

Files Changed	Coverage Δ
pypdf/filters.py	`94.52% <100.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pubpub-zz · 2023-08-28T21:03:55Z

@MartinThoma all yours

stefan6419846 · 2023-08-29T10:14:06Z

Won't this still raise "strange" errors for invalid lookup tables due to assuming that None is a list?

pubpub-zz · 2023-08-29T17:33:20Z

the issue was due to an improper extraction of the list (this was a bug): I'm not expecting the error to appear anymore.

stefan6419846 · 2023-08-29T17:42:36Z

You mean that when line 905 is executed (id est we have an invalid lookup table), neither the L nor the CMYK color mode will be present and thus never trigger the failing lines 908 and 915-916? This sounds unreasonable to me without digging to deep to be honest.

pypdf/pypdf/filters.py

Lines 900 to 926 in 89eb626

    
               img = img.convert(conv) 
        
               if len(lookup) != (hival + 1) * nb: 
        
                   logger_warning( 
        
                       f"Invalid Lookup Table in {obj_as_text}", __name__ 
        
                   ) 
        
                   lookup = None 
        
               if mode == "L": 
        
                   # gray lookup does not work : it is converted to a similar RGB lookup 
        
                   lookup = b"".join([bytes([b, b, b]) for b in lookup]) 
        
                   mode = "RGB" 
        
               # TODO : cf https://github.com/py-pdf/pypdf/pull/2039 
        
               # this is a work around until PIL is able to process CMYK images 
        
               elif mode == "CMYK": 
        
                   _rgb = [] 
        
                   for _c, _m, _y, _k in ( 
        
                       lookup[n : n + 4] 
        
                       for n in range(0, 4 * (len(lookup) // 4), 4) 
        
                   ): 
        
                       _r = int(255 * (1 - _c / 255) * (1 - _k / 255)) 
        
                       _g = int(255 * (1 - _m / 255) * (1 - _k / 255)) 
        
                       _b = int(255 * (1 - _y / 255) * (1 - _k / 255)) 
        
                       _rgb.append(bytes((_r, _g, _b))) 
        
                   lookup = b"".join(_rgb) 
        
                   mode = "RGB" 
        
               if lookup is not None: 
        
                   img.putpalette(lookup, rawmode=mode) 
        
           img = img.convert("L" if base == ColorSpaces.DEVICE_GRAY else "RGB")

pubpub-zz · 2023-08-29T19:29:04Z

You mean that when line 905 is executed (id est we have an invalid lookup table), neither the L nor the CMYK color mode will be present and thus never trigger the failing lines 908 and 915-916? This sounds unreasonable to me without digging to deep to be honest.

I missed it: you are right!

pubpub-zz · 2023-08-29T20:27:07Z

@MartinThoma stdby

pubpub-zz · 2023-08-29T20:45:25Z

I had issues with test_watermarking_speed going into timeout : is the duration just on the edge ?
at least now it's ok
@MartinThoma all yours

stefan6419846 · 2023-08-30T06:35:58Z

I had issues with test_watermarking_speed going into timeout : is the duration just on the edge ?

Running only the three watermarking tests (including the not yet added one) I could observe a total runtime of four seconds, which should not really be an issue - although download speed might play into this as well, but the corresponding files should be cached in theory.

pubpub-zz · 2023-08-30T17:12:03Z

Running only the three watermarking tests (including the not yet added one) I could observe a total runtime of four seconds, which should not really be an issue - although download speed might play into this as well, but the corresponding files should be cached in theory.

@stefan6419846 / @MartinThoma
I would expect pytest.mark.timeout to be used to prevent infinite looping. What about modifying test with a time measurement such as in test_reader.py / test_do_not_get_stuck_on_large_files_without_start_xref:

pypdf/tests/test_reader.py

Lines 643 to 656 in 89eb626

    
           def test_do_not_get_stuck_on_large_files_without_start_xref(): 
        
               """ 
        
               Tests for the absence of a DoS bug, where a large file without an 
        
               startxref mark would cause the library to hang for minutes to hours. 
        
               """ 
        
               start_time = time.time() 
        
               broken_stream = BytesIO(b"\0" * 5 * 1000 * 1000) 
        
               with pytest.raises(PdfReadError): 
        
                   PdfReader(broken_stream) 
        
               parse_duration = time.time() - start_time 
        
               # parsing is expected take less than a second on a modern cpu, but include 
        
               # a large tolerance to account for busy or slow systems 
        
               assert parse_duration < 60

MartinThoma · 2023-09-02T08:15:55Z

I would expect pytest.mark.timeout to be used to prevent infinite looping

Yes, but not only that. I think we should have a general timeout for every test that is well above what we would expect. Something like 30s.

Some tests should be able to increase that if we know they are slow.

some tests should reduce it if we know they should be super fast and have potential for a regression from linear runtime to quadratic runtime or similar.

I want to avoid that CI becomes flaky just because the CI machines are slow and the test is right at the edge.

$ pytest tests/test_reader.py::test_do_not_get_stuck_on_large_files_without_start_xref --durations 0

================================================== test session starts ==================================================
platform linux -- Python 3.11.1, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/moose/Github/py-pdf/pypdf
configfile: pyproject.toml
plugins: benchmark-4.0.0, typeguard-3.0.2, socket-0.6.0, cov-4.1.0, timeout-2.1.0
collected 1 item                                                                                                        

tests/test_reader.py .                                                                                            [100%]

=================================================== slowest durations ===================================================
0.33s call     tests/test_reader.py::test_do_not_get_stuck_on_large_files_without_start_xref

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
=================================================== 1 passed in 0.44s ===================================================

That looks pretty fine to me. We might want to use @pytest.mark.timeout(60) instead of the assertion within the test, but the timeout is for sure not too low. We could even reduce it to ~10s.

MartinThoma · 2023-09-03T20:55:41Z

Very nice @pubpub-zz 🎉

I hope I find a couple of free weekends during the colder months to get below 30 open issues 🤞

## What's new ### Bug Fixes (BUG) - Cope with missing /I in articles (#2134) - Fix image look-up table in EncodedStreamObject (#2128) - remove_images not operating in sub level forms (#2133) ### Robustness (ROB) - Cope with damaged PDF (#2129) ### Documentation (DOC) - How to take ownership (#2123) ### Developer Experience (DEV) - Download PDFs before executing the tests to not run into timeouts (#2143) - Add workflow_dispatch to CI (#2145) - Automatically create release message / tag message (#2127) - Ensure the REL commit message is consistently created (#2126) ### Testing (TST) - Add test for correct rendering of watermarks (#2130) [Full Changelog](3.15.4...3.15.5)

pubpub-zz added 2 commits August 28, 2023 22:33

BUG: fix image look-up table in EncodedStreamObject

9188157

closes py-pdf#2124 closes py-pdf#2110

Merge branch 'main' into iss_2124

f1e07bb

prevent secondary failure

245ee3c

pubpub-zz closed this Aug 29, 2023

pubpub-zz reopened this Aug 29, 2023

MartinThoma changed the title ~~BUG: fix image look-up table in EncodedStreamObject~~ BUG: Fix image look-up table in EncodedStreamObject Sep 3, 2023

MartinThoma merged commit af41173 into py-pdf:main Sep 3, 2023
25 of 26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix image look-up table in EncodedStreamObject #2128

BUG: Fix image look-up table in EncodedStreamObject #2128

pubpub-zz commented Aug 28, 2023

codecov bot commented Aug 28, 2023 •

edited

Loading

pubpub-zz commented Aug 28, 2023

stefan6419846 commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

stefan6419846 commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023 •

edited

Loading

stefan6419846 commented Aug 30, 2023

pubpub-zz commented Aug 30, 2023

MartinThoma commented Sep 2, 2023

MartinThoma commented Sep 3, 2023

BUG: Fix image look-up table in EncodedStreamObject #2128

BUG: Fix image look-up table in EncodedStreamObject #2128

Conversation

pubpub-zz commented Aug 28, 2023

codecov bot commented Aug 28, 2023 • edited Loading

Codecov Report

pubpub-zz commented Aug 28, 2023

stefan6419846 commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

stefan6419846 commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023

pubpub-zz commented Aug 29, 2023 • edited Loading

stefan6419846 commented Aug 30, 2023

pubpub-zz commented Aug 30, 2023

MartinThoma commented Sep 2, 2023

MartinThoma commented Sep 3, 2023

codecov bot commented Aug 28, 2023 •

edited

Loading

pubpub-zz commented Aug 29, 2023 •

edited

Loading