Performance Bug: Reading large compressed images takes huge time to process #2641

Closed
snanda85 opened this issue May 14, 2024 · 2 comments · Fixed by #2644
Labels
nf-performance: Non-functional change: Performance
workflow-images: From a users perspective, image handling is the affected feature/workflow

Comments

@snanda85
Contributor

Reading a compressed image takes more than 10 minutes if the image is large-ish (>250 KB).

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('pycryptodome', '3.19.0'), PIL=9.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

import sys
from datetime import datetime

from pypdf import PdfReader

def log(msg):
    print(f"[{datetime.now()}] {msg}\n")

file = sys.argv[1]

log("Reading File PyPDF..")
images = []
pypdf_reader = PdfReader(file)
for pidx, page in enumerate(pypdf_reader.pages, start=1):
    log(f"Reading page {pidx}")
    for iidx, image in enumerate(page.images, start=1):
        log(f"Processing Image {iidx}")
        images.append(image.image)
    log(f"Competed page {pidx}")
log("Completed Reading File PyPDF!")

Attached is a sample PDF I created that can reproduce this error.
file_with_image.pdf

The PDF can be added to the tests as well.

Output

This is the complete output I see. Look at the time difference between Image 6 and Image 7: it is close to 12 minutes.

[2024-05-14 14:06:52.927018] Reading File PyPDF..
[2024-05-14 14:06:52.929346] Reading page 1
[2024-05-14 14:06:52.971993] Processing Image 1
[2024-05-14 14:06:52.993297] Processing Image 2
[2024-05-14 14:06:53.007304] Processing Image 3
[2024-05-14 14:06:53.021166] Processing Image 4
[2024-05-14 14:06:53.049529] Processing Image 5
[2024-05-14 14:06:53.051842] Processing Image 6

[2024-05-14 14:18:46.906472] Processing Image 7
[2024-05-14 14:18:47.088749] Processing Image 8
[2024-05-14 14:18:47.092159] Processing Image 9
[2024-05-14 14:18:47.099422] Processing Image 10
[2024-05-14 14:18:47.099513] Competed page 1
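
The size of the gap can be checked directly from the two timestamps logged for Image 6 and Image 7. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from datetime import datetime

# Timestamp format produced by printing datetime.now() in the script above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the "Processing Image 6" and "Processing Image 7" lines.
start = datetime.strptime("2024-05-14 14:06:53.051842", FMT)
end = datetime.strptime("2024-05-14 14:18:46.906472", FMT)

gap = end - start
print(gap)  # 0:11:53.854630
```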
@snanda85
Contributor Author

Added a trimmed version of the test file with a single page.
file_with_large_compressed_image.pdf

@stefan6419846
Collaborator

Thanks for the report. The issue seems to be that the following loop in pypdf/pypdf/filters.py (lines 85 to 89 in 6226d66) iterates over each input byte manually to be able to skip invalid bytes:

for b in [data[i : i + 1] for i in range(len(data))]:
    try:
        result_str += d.decompress(b)
    except zlib.error:
        pass

Passing the whole buffer at once with return d.decompress(data) is faster in this case, and might be implemented as a fast path to try first, to improve 354f8ce.
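
The idea of trying the whole buffer first and keeping the byte-by-byte loop only as a fallback could look roughly like the sketch below. This is not the change merged in #2644; `decompress_tolerant` is a hypothetical name, and the `zlib.MAX_WBITS | 32` setting (automatic zlib/gzip header detection) is an assumption about how `d` is created.

```python
import zlib


def decompress_tolerant(data: bytes) -> bytes:
    """Hypothetical helper: fast path first, lenient per-byte fallback second."""
    # Fast path: hand the whole buffer to zlib in a single call.
    try:
        d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # assumed wbits setting
        return d.decompress(data)
    except zlib.error:
        pass

    # Slow path: feed one byte at a time so that invalid bytes can be skipped.
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)
    chunks = []
    for i in range(len(data)):
        try:
            chunks.append(d.decompress(data[i : i + 1]))
        except zlib.error:
            pass
    # Join once at the end instead of repeatedly concatenating bytes objects.
    return b"".join(chunks)


original = b"hello world " * 1000
assert decompress_tolerant(zlib.compress(original)) == original
```

With this ordering, well-formed streams never enter the per-byte loop, so its cost is only paid for data that the single call rejects.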

@stefan6419846 stefan6419846 added workflow-images From a users perspective, image handling is the affected feature/workflow nf-performance Non-functional change: Performance labels May 14, 2024