ENH: Rename resources deterministically in merge_page #1543

huonw · 2023-01-09T03:14:11Z

This fixes #1532 by adjusting the procedure used for renaming resources in merge_page, so that resources that have the same name (but different contents) are renamed in a deterministic/reproducible way.

When merging pages that both have a resource of the same name (say, /Example) but different values, the resource from the second page will now be renamed to /Example-0. Previously, it would be renamed with a random UUIDv4 (e.g. /Exampledbbbe7cb-5f34-4061-b863-41919b326b49) which would be different on every run, even if the inputs were identical.

The renamed name may already exist if a PDF is carefully/maliciously crafted, in which case /Example-1, /Example-2, etc. are tried until an appropriate name is found. If any of these options have the same value, that name/value is reused.

The /Example-0 pattern is short and sweet, but can easily be changed, e.g. some options off the top of my head:

/Example-pypdf-merged-0 for explanation about where it comes from
/Example-9c4beb01-0 where the 9c4beb01 is some marker hardcoded into pypdf's code, to reduce the chance of a "normal" PDF requiring idx > 0 (only truly maliciously crafted ones)

codecov · 2023-01-09T03:33:12Z

Codecov Report

Base: 91.81% // Head: 91.86% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (ec2bb43) compared to base (3b1b9d4).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1543      +/-   ##
==========================================
+ Coverage   91.81%   91.86%   +0.05%     
==========================================
  Files          33       33              
  Lines        6204     6207       +3     
  Branches     1229     1229              
==========================================
+ Hits         5696     5702       +6     
  Misses        326      326              
+ Partials      182      179       -3

Impacted Files	Coverage Δ
pypdf/_page.py	`89.85% <100.00%> (+0.48%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

pypdf/_page.py

Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>

MartinThoma · 2023-01-22T09:40:43Z

Thank you @huonw for this PR and for adding good tests 🙏 It's now in main and will be released to PyPI today :-)

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

New Features (ENH): - Add page label support to PdfWriter (#1558) - Accept inline images with space before EI (#1552) - Add circle annotation support (#1556) - Add polygon annotation support (#1557) - Make merging pages produce a deterministic PDF (#1542, #1543) Bug Fixes (BUG): - Fix error in cmap extraction (#1544) - Remove erroneous assertion check (#1564) - Fix dictionary access of optional page label keys (#1562) Robustness (ROB): - Set ignore_eof=True for read_until_regex (#1521) Documentation (DOC): - Paper size (#1550) Developer Experience (DEV): - Fix broken combination of dependencies of docs.txt - Annotate tests appropriately (#1551) [Full Changelog](3.2.1...3.3.0)

huonw · 2023-01-22T22:26:31Z

Feel free to add me, thanks 😄

I've upgraded to 3.3.0 in our code, it works well. Thanks for merging and releasing!

Rename resources deterministically in merge_page

97515fd

huonw force-pushed the bugfix/1532-merge-page-resources-overlap branch from 2108a7d to 97515fd Compare January 9, 2023 03:15

Type indexing

c7b2be6

MasterOdin reviewed Jan 9, 2023

View reviewed changes

pypdf/_page.py Outdated Show resolved Hide resolved

huonw and others added 2 commits January 9, 2023 16:11

f-strings

3a73155

Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>

Refer to the correct variable, oops

9a792c8

huonw changed the title ~~Rename resources deterministically in merge_page~~ ENH: Rename resources deterministically in merge_page Jan 17, 2023

MartinThoma added a commit that referenced this pull request Jan 21, 2023

STY: Minor refactoring to simplify #1543

bc49495

MartinThoma added a commit that referenced this pull request Jan 22, 2023

STY: Minor refactoring to simplify #1543 (#1568)

3b1b9d4

Merge branch 'main' into bugfix/1532-merge-page-resources-overlap

ec2bb43

MartinThoma merged commit b6b6a66 into py-pdf:main Jan 22, 2023

huonw deleted the bugfix/1532-merge-page-resources-overlap branch January 22, 2023 10:33

MartinThoma mentioned this pull request Feb 4, 2023

ENH: Restore merge_transformed_page & co + add 'over' parameter #1567

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Rename resources deterministically in merge_page #1543

ENH: Rename resources deterministically in merge_page #1543

huonw commented Jan 9, 2023 •

edited

codecov bot commented Jan 9, 2023 •

edited

MartinThoma commented Jan 22, 2023

huonw commented Jan 22, 2023

ENH: Rename resources deterministically in merge_page #1543

ENH: Rename resources deterministically in merge_page #1543

Conversation

huonw commented Jan 9, 2023 • edited

codecov bot commented Jan 9, 2023 • edited

Codecov Report

MartinThoma commented Jan 22, 2023

huonw commented Jan 22, 2023

huonw commented Jan 9, 2023 •

edited

codecov bot commented Jan 9, 2023 •

edited