Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Rename resources deterministically in merge_page #1543

Merged

Conversation

huonw
Copy link
Contributor

@huonw huonw commented Jan 9, 2023

This fixes #1532 by adjusting the procedure used for renaming resources in merge_page, so that resources that have the same name (but different contents) are renamed in a deterministic/reproducible way.

When merging pages that both have a resource of the same name (say, /Example) but different values, the resource from the second page will now be renamed to /Example-0. Previously, it would be renamed with a random UUIDv4 (e.g. /Exampledbbbe7cb-5f34-4061-b863-41919b326b49) which would be different on every run, even if the inputs were identical.

The renamed name may already exist if a PDF is carefully/maliciously crafted, in which case /Example-1, /Example-2, etc. are tried until an appropriate name is found. If any of these options have the same value, that name/value is reused.

The /Example-0 pattern is short and sweet, but can easily be changed, e.g. some options off the top of my head:

  • /Example-pypdf-merged-0 for explanation about where it comes from
  • /Example-9c4beb01-0 where the 9c4beb01 is some marker hardcoded into pypdf's code, to reduce the chance of a "normal" PDF requiring idx > 0 (only truly maliciously crafted ones)

@huonw huonw force-pushed the bugfix/1532-merge-page-resources-overlap branch from 2108a7d to 97515fd Compare January 9, 2023 03:15
@codecov
Copy link

codecov bot commented Jan 9, 2023

Codecov Report

Base: 91.81% // Head: 91.86% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (ec2bb43) compared to base (3b1b9d4).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1543      +/-   ##
==========================================
+ Coverage   91.81%   91.86%   +0.05%     
==========================================
  Files          33       33              
  Lines        6204     6207       +3     
  Branches     1229     1229              
==========================================
+ Hits         5696     5702       +6     
  Misses        326      326              
+ Partials      182      179       -3     
Impacted Files Coverage Δ
pypdf/_page.py 89.85% <100.00%> (+0.48%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

pypdf/_page.py Outdated Show resolved Hide resolved
huonw and others added 2 commits January 9, 2023 16:11
Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>
@huonw huonw changed the title Rename resources deterministically in merge_page ENH: Rename resources deterministically in merge_page Jan 17, 2023
MartinThoma added a commit that referenced this pull request Jan 21, 2023
MartinThoma added a commit that referenced this pull request Jan 22, 2023
@MartinThoma MartinThoma merged commit b6b6a66 into py-pdf:main Jan 22, 2023
@MartinThoma
Copy link
Member

Thank you @huonw for this PR and for adding good tests 🙏 It's now in main and will be released to PyPI today :-)

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@huonw huonw deleted the bugfix/1532-merge-page-resources-overlap branch January 22, 2023 10:33
MartinThoma added a commit that referenced this pull request Jan 22, 2023
New Features (ENH):
-  Add page label support to PdfWriter (#1558)
-  Accept inline images with space before EI (#1552)
-  Add circle annotation support (#1556)
-  Add polygon annotation support (#1557)
-  Make merging pages produce a deterministic PDF (#1542, #1543)

Bug Fixes (BUG):
-  Fix error in cmap extraction (#1544)
-  Remove erroneous assertion check (#1564)
-  Fix dictionary access of optional page label keys (#1562)

Robustness (ROB):
-  Set ignore_eof=True for read_until_regex (#1521)

Documentation (DOC):
-  Paper size (#1550)

Developer Experience (DEV):
-  Fix broken combination of dependencies of docs.txt
-  Annotate tests appropriately (#1551)

[Full Changelog](3.2.1...3.3.0)
@huonw
Copy link
Contributor Author

huonw commented Jan 22, 2023

Feel free to add me, thanks 😄

I've upgraded to 3.3.0 in our code, it works well. Thanks for merging and releasing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

merge_page isn't reproducible due to resource renaming
3 participants