Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Sort computed /ProcSet in a merged page, for reproducibility #1542

Merged
merged 2 commits into from
Jan 22, 2023

Conversation

huonw
Copy link
Contributor

@huonw huonw commented Jan 9, 2023

This fixes #1531 by sorting the /ProcSet array when merging two pages.

Inferring from the name and the existing code's use of frozenset, I believe /ProcSet is a set of NameObjects, where the order doesn't matter. This means that it is should be fine for merge_page to always choose a single consistent order for reproducibility, and the easiest order is sorting.

Another option would be reproducing the order in the original pages as best as possible (e.g. if page 1 had /B, /A and page 2 had /C, /B, /D, then it could compute /B, /A, /C, /D, preserving the order for page 1 and page 2, except for duplicating /B), but this seems like unnecessary complexity.

@codecov
Copy link

codecov bot commented Jan 9, 2023

Codecov Report

Base: 91.86% // Head: 91.86% // No change to project coverage 👍

Coverage data is based on head (616a99e) compared to base (b6b6a66).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1542   +/-   ##
=======================================
  Coverage   91.86%   91.86%           
=======================================
  Files          33       33           
  Lines        6207     6207           
  Branches     1229     1229           
=======================================
  Hits         5702     5702           
  Misses        326      326           
  Partials      179      179           
Impacted Files Coverage Δ
pypdf/_page.py 89.85% <ø> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@huonw huonw changed the title Sort computed /ProcSet in a merged page, for reproducibility ENH: Sort computed /ProcSet in a merged page, for reproducibility Jan 17, 2023
@MartinThoma MartinThoma merged commit d1233df into py-pdf:main Jan 22, 2023
@MartinThoma
Copy link
Member

I really hope that there is no hidden meaning of the ProcSet order. I couldn't find anything in the specs, so it should be fine.

Thank you for the PR! It will be in the release today :-)

MartinThoma added a commit that referenced this pull request Jan 22, 2023
New Features (ENH):
-  Add page label support to PdfWriter (#1558)
-  Accept inline images with space before EI (#1552)
-  Add circle annotation support (#1556)
-  Add polygon annotation support (#1557)
-  Make merging pages produce a deterministic PDF (#1542, #1543)

Bug Fixes (BUG):
-  Fix error in cmap extraction (#1544)
-  Remove erroneous assertion check (#1564)
-  Fix dictionary access of optional page label keys (#1562)

Robustness (ROB):
-  Set ignore_eof=True for read_until_regex (#1521)

Documentation (DOC):
-  Paper size (#1550)

Developer Experience (DEV):
-  Fix broken combination of dependencies of docs.txt
-  Annotate tests appropriately (#1551)

[Full Changelog](3.2.1...3.3.0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

merge_page isn't reproducible due to /ProcSet order
2 participants