Adding transformation to a page can increase the file size by a factor of >2 #2436

Closed
lschlesinger opened this issue Feb 2, 2024 · 3 comments

lschlesinger commented Feb 2, 2024

When adding a transformation t to a page, e.g. to scale the content (t.scale(...)) or to move it around the page (t.translate(...)), we observe a large increase in file size. For a 798 KB test file, the resulting PDF is 1.6 MB. For larger files, the size can increase by a factor of about 5, e.g. ~11 MB before and ~53 MB after.
The increase occurs even when no actual transformation is applied and we only call page.add_transformation(Transformation()).
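
To put the reported numbers in perspective, here is a minimal sketch that computes the growth factor from the sizes quoted above. The helper names `growth_factor` and `file_growth` are hypothetical, and the figures assume decimal units (1.6 MB taken as 1600 KB), so the results are approximate.

```python
import os


def growth_factor(size_before: float, size_after: float) -> float:
    """Return how many times larger the output is than the input."""
    if size_before <= 0:
        raise ValueError("size_before must be positive")
    return size_after / size_before


def file_growth(before_path: str, after_path: str) -> float:
    """Same ratio, measured from two files on disk (sizes in bytes)."""
    return growth_factor(os.path.getsize(before_path), os.path.getsize(after_path))


# Sizes quoted in the report; units cancel out as long as both match.
small_case = growth_factor(798, 1600)  # KB before, KB after
large_case = growth_factor(11, 53)     # MB before, MB after
print(f"small file: {small_case:.1f}x, large file: {large_case:.1f}x")
```

With the two attached files, `file_growth("test-file.pdf", "result-file.pdf")` would give the same ratio directly from disk.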

Environment

Reproducible on Python 3.7+ with pypdf 3.16.2; also tested with 4.0.1 (the latest release), which shows the same behavior.

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter, Transformation

def transform_pdf(
    pdf_path: str,
    output_pdf: str,
):
    writer = PdfWriter(clone_from=pdf_path)
    for page in writer.pages:
        page.add_transformation(Transformation())
    writer.write(output_pdf)


if __name__ == '__main__':
    transform_pdf("test-file.pdf", "result-file.pdf")

result-file.pdf
test-file.pdf

@MartinThoma
Member

Please have a look at https://pypdf.readthedocs.io/en/latest/user/file-size.html

Adding the block

    for page in writer.pages:
        # ⚠️ This has to be done on the writer, not the reader!
        page.compress_content_streams(level=9)  # This is CPU intensive!

will result in a 715 KB file.

My guess is that we need to decompress the content to handle the transformation, but we don't re-compress it afterwards. It makes sense not to do that by default, as compression takes time.
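
The guess above can be illustrated without pypdf, using only Python's `zlib` module. This is a rough sketch, not pypdf's actual code path: the content stream is a made-up sequence of repetitive drawing operators, and the prepended identity `cm` matrix is a simplified stand-in for whatever a transformation adds.

```python
import zlib

# Hypothetical stand-in for a page content stream: many similar drawing operators.
original = b"".join(
    b"q 1 0 0 1 %d %d cm 0 0 10 10 re f Q\n" % (x, y)
    for x in range(0, 500, 10)
    for y in range(0, 500, 10)
)

# How the stream might be stored in the input file (Flate-compressed).
stored = zlib.compress(original, 9)

# Editing needs the decompressed bytes. An identity matrix is prepended here
# as a simplified stand-in for the transformation.
edited = b"1 0 0 1 0 0 cm\n" + zlib.decompress(stored)

# Writing `edited` as-is keeps it at its full uncompressed size.
# Re-compressing it brings the size back down.
recompressed = zlib.compress(edited, 9)

print("stored:", len(stored), "bytes")
print("edited, uncompressed:", len(edited), "bytes")
print("edited, re-compressed:", len(recompressed), "bytes")
```

If the guess is right, the extra bytes come almost entirely from the missing re-compression step rather than from the transformation itself, which would also explain why an empty Transformation() triggers the same growth.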

@MartinThoma
Member

If this solves your issue, please close it :-)

If it doesn't, please let me know why.

@lschlesinger
Author

lschlesinger commented Feb 5, 2024

Thanks for the quick response @MartinThoma, it seems to solve the issue 👍
