Performance Bug: Reading large compressed images takes huge time to process #2641

snanda85 · 2024-05-14T08:57:56Z

Reading a compressed image takes more than 10 minutes if the image is large-ish (>250 KB).

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('pycryptodome', '3.19.0'), PIL=9.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

import sys
from datetime import datetime

from pypdf import PdfReader

def log(msg):
    print(f"[{datetime.now()}] {msg}\n")

file = sys.argv[1]

log("Reading File PyPDF..")
images = []
pypdf_reader = PdfReader(file)
for pidx, page in enumerate(pypdf_reader.pages, start=1):
    log(f"Reading page {pidx}")
    for iidx, image in enumerate(page.images, start=1):
        log(f"Processing Image {iidx}")
        images.append(image.image)
    log(f"Competed page {pidx}")
log("Completed Reading File PyPDF!")

Attached is a sample PDF I created that can reproduce this error.
file_with_image.pdf

The PDF can be added to the tests as well.

Output

This is the complete output I see.
Note the time difference between Image 6 and Image 7: it is close to 12 minutes.

[2024-05-14 14:06:52.927018] Reading File PyPDF..
[2024-05-14 14:06:52.929346] Reading page 1
[2024-05-14 14:06:52.971993] Processing Image 1
[2024-05-14 14:06:52.993297] Processing Image 2
[2024-05-14 14:06:53.007304] Processing Image 3
[2024-05-14 14:06:53.021166] Processing Image 4
[2024-05-14 14:06:53.049529] Processing Image 5
[2024-05-14 14:06:53.051842] Processing Image 6

[2024-05-14 14:18:46.906472] Processing Image 7
[2024-05-14 14:18:47.088749] Processing Image 8
[2024-05-14 14:18:47.092159] Processing Image 9
[2024-05-14 14:18:47.099422] Processing Image 10
[2024-05-14 14:18:47.099513] Competed page 1
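
For reference, the gap can be computed directly from the two timestamps in the log above. This is only a small sketch; the variable names are illustrative. Since each "Processing Image N" line is printed before that image is decoded, the gap is the time spent on Image 6.

```python
from datetime import datetime

# Timestamps copied from the "Processing Image 6" and "Processing Image 7" log lines.
FMT = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.strptime("2024-05-14 14:06:53.051842", FMT)
end = datetime.strptime("2024-05-14 14:18:46.906472", FMT)

gap = end - start
print(gap)                  # 0:11:53.854630
print(gap.total_seconds())  # roughly 713.85 seconds
```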


snanda85 · 2024-05-14T09:25:12Z

Added a trimmed version of the test file with a single page.
file_with_large_compressed_image.pdf

stefan6419846 · 2024-05-14T09:44:08Z

Thanks for the report. The issue seems to be that

pypdf/pypdf/filters.py

Lines 85 to 89 in 6226d66

    
for b in [data[i : i + 1] for i in range(len(data))]:
    try:
        result_str += d.decompress(b)
    except zlib.error:
        pass

iterates over the input one byte at a time so that invalid bytes can be skipped. Calling return d.decompress(data) on the whole input is much faster in this case, and could be tried first as a fast path, with the byte-by-byte loop only used when that fails, to improve 354f8ce.
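
A minimal sketch of that idea, assuming plain zlib streams: try decompressing the whole buffer first, and only fall back to the tolerant byte-by-byte loop if zlib raises an error. The function name decompress_with_fallback and the use of a bytearray are assumptions made for illustration; this is not the code that was merged in #2644.

```python
import zlib


def decompress_with_fallback(data: bytes) -> bytes:
    """Hypothetical helper: fast path first, tolerant slow path second."""
    try:
        # Fast path: hand the whole buffer to zlib in a single call.
        return zlib.decompress(data)
    except zlib.error:
        pass

    # Slow path: feed one byte at a time and skip bytes zlib rejects,
    # mirroring the loop quoted above.
    d = zlib.decompressobj()
    result = bytearray()  # mutable buffer, so appending does not copy prior output
    for i in range(len(data)):
        try:
            result += d.decompress(data[i : i + 1])
        except zlib.error:
            pass
    return bytes(result)


if __name__ == "__main__":
    original = b"sample image data " * 1000
    compressed = zlib.compress(original)
    assert decompress_with_fallback(compressed) == original
```

With this shape, well-formed streams never enter the per-byte loop, so only damaged streams pay the slow-path cost.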

snanda85 mentioned this issue May 14, 2024

BUG: Reading large compressed images takes huge time to process #2644

Merged

stefan6419846 added the workflow-images (From a user's perspective, image handling is the affected feature/workflow) and nf-performance (Non-functional change: Performance) labels May 14, 2024

stefan6419846 closed this as completed in #2644 May 14, 2024
