-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Rename resources deterministically in merge_page #1543
ENH: Rename resources deterministically in merge_page #1543
Conversation
2108a7d
to
97515fd
Compare
Codecov ReportBase: 91.81% // Head: 91.86% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1543 +/- ##
==========================================
+ Coverage 91.81% 91.86% +0.05%
==========================================
Files 33 33
Lines 6204 6207 +3
Branches 1229 1229
==========================================
+ Hits 5696 5702 +6
Misses 326 326
+ Partials 182 179 -3
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Thank you @huonw for this PR and for adding good tests 🙏 It's now in If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html |
New Features (ENH): - Add page label support to PdfWriter (#1558) - Accept inline images with space before EI (#1552) - Add circle annotation support (#1556) - Add polygon annotation support (#1557) - Make merging pages produce a deterministic PDF (#1542, #1543) Bug Fixes (BUG): - Fix error in cmap extraction (#1544) - Remove erroneous assertion check (#1564) - Fix dictionary access of optional page label keys (#1562) Robustness (ROB): - Set ignore_eof=True for read_until_regex (#1521) Documentation (DOC): - Paper size (#1550) Developer Experience (DEV): - Fix broken combination of dependencies of docs.txt - Annotate tests appropriately (#1551) [Full Changelog](3.2.1...3.3.0)
Feel free to add me, thanks 😄 I've upgraded to 3.3.0 in our code, it works well. Thanks for merging and releasing! |
This fixes #1532 by adjusting the procedure used for renaming resources in
merge_page
, so that resources that have the same name (but different contents) are renamed in a deterministic/reproducible way.When merging pages that both have a resource of the same name (say,
/Example
) but different values, the resource from the second page will now be renamed to/Example-0
. Previously, it would be renamed with a random UUIDv4 (e.g./Exampledbbbe7cb-5f34-4061-b863-41919b326b49
) which would be different on every run, even if the inputs were identical.The renamed name may already exist if a PDF is carefully/maliciously crafted, in which case
/Example-1
,/Example-2
, etc. are tried until an appropriate name is found. If any of these options have the same value, that name/value is reused.The
/Example-0
pattern is short and sweet, but can easily be changed, e.g. some options off the top of my head:/Example-pypdf-merged-0
for explanation about where it comes from/Example-9c4beb01-0
where the9c4beb01
is some marker hardcoded into pypdf's code, to reduce the chance of a "normal" PDF requiringidx > 0
(only truly maliciously crafted ones)