ENH: Add page label support to PdfWriter #1558

lorenzomanini · 2023-01-15T23:30:58Z

Introduce a new PdfWriter interface set_page_labels that sets page labels to a page interval (as mentioned in #1554).
It should support setting multiple page lables for different page intervals.
I've done some local testing and it seems to be working.
Pending: tests to be added

It's my first time contributing to an open-source project and I have little experience with python so my code will certainly need corrections, so I would like to know what you think about it.
Also, I've no idea how to set up correctly tests for my code but I'll look into it as soon as I will have some spare time.

MartinThoma · 2023-01-16T04:59:33Z

I've done some local testing and it seems to be working.

How did you test it? Please add the test script

pypdf/_writer.py

lorenzomanini · 2023-01-16T09:15:32Z

Thank you a lot! This week I'm pretty busy, but I will try to address everything as soon as possible. At the latest, I think, next week I will be able to work on it.

lorenzomanini · 2023-01-17T18:22:36Z

I have added the tests and changed a bit the behavior to manage correctly when ranges of page labels overlap. The three _nums functions should be moved to _page_labels.py I think, what do you think?
Also, the inputs of set_page_label (page_number_from and page_number_to) must be 1 based indexes. It seemed more natural and practical to me for this use case but I'm not sure.
Docs and mypy are still missing.
Let me know what you think! :)

pypdf/_writer.py

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

lorenzomanini · 2023-01-17T22:07:23Z

Thank you for the mypy suggestions!
The label parameter should already be renamed style, did I mess something up with Git?
I agree about creating a class in constants for the possible styles.

pubpub-zz · 2023-01-17T22:25:18Z

Thank you for the mypy suggestions! The label parameter should already be renamed style, did I mess something up with Git? I agree about creating a class in constants for the possible styles.
oups old comment... All good.

pypdf/_writer.py

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

codecov · 2023-01-17T23:46:25Z

Codecov Report

Base: 91.87% // Head: 91.84% // Decreases project coverage by -0.04% ⚠️

Coverage data is based on head (daad866) compared to base (3560550).
Patch coverage: 89.04% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1558      +/-   ##
==========================================
- Coverage   91.87%   91.84%   -0.04%     
==========================================
  Files          33       33              
  Lines        6130     6202      +72     
  Branches     1210     1228      +18     
==========================================
+ Hits         5632     5696      +64     
- Misses        322      326       +4     
- Partials      176      180       +4

Impacted Files	Coverage Δ
pypdf/_page_labels.py	`76.74% <71.42%> (-2.92%)`	⬇️
pypdf/_writer.py	`84.37% <100.00%> (+0.61%)`	⬆️
pypdf/constants.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

pypdf/constants.py

lorenzomanini

I think we are nearly there.
I wanted your opinion about the following: @MartinThoma @pubpub-zz

pypdf/_page_labels.py

lorenzomanini · 2023-01-18T18:26:24Z

pypdf/_writer.py

+    def set_page_label(
+        self,
+        page_number_from: int,
+        page_number_to: int,


What do you think about page ranges indexes.
Now set_page_label requires page numbers starting with 1 while _set_page_label requires them starting with 0 (the parameters name change accordingly from page_number to page_index)
Also in both cases extremes are included.
I did it this way in the public interface because I think it is more natural this way for the user that probably is setting these watching a pdf (where page numbers start from 1), but I understand if you don't agree, especially considering that the private interface is 0 based.

I'd prefer to use page_index everywhere and hence starting from 0.
Don't forget that pypdf users are Python developers. Starting a list of pages with index 0 is a lot more natural than starting with 1.

My suggestion would be to rename the parameters to page_index_from and page_index_to and letting it start with 0.

@pubpub-zz / @MasterOdin What's your opinion?

Agree with your approach

hehehe 1 based indexing is never the answer. The more I think about it the more I agree with you. I'll wait a bit for MasterOdin answer and then I'll change that. Unfortunately, I will have to change all the indexes in the test but I knew what I was risking when I did it:sweat_smile:

I would agree with making it 0-based indexed for the simple reason that all other functions that refer to pages uses a 0-based index.

I agree with @MartinThoma that using page_index is the more "proper" way to refer to this as people may conflate page_number with a 1-based index scheme, whereas page_index, in my opinion, really only refers to 0-based index schemes in the context of Python.

Thank you! I'll do that

The last commit changes everything to 0-based indexing. Are you convinced by including both extremes or should we do lower included and upper excluded?

pypdf/_page_labels.py

MartinThoma · 2023-01-18T18:40:09Z

I still need to go over a lot of details, but I wanted to give you a small heads-up: Excellent work 🎉 ❤️
Thank you for contributing to pypdf 🤗

pypdf/_page_labels.py

pypdf/_writer.py

MartinThoma · 2023-01-18T18:51:19Z

Overall it looks good to me. There are some minor docstring improvements I suggested (the first line should always be a summary of the function) + the page_number/page_index question needs to be clarified before we merge.

I'm confident that this can be merged + released until Sunday :-)

@lorenzomanini If you want, you can add yourself to the https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html (docs/meta/CONTRIBUTORS.md) within this PR :-)

Co-authored-by: Martin Thoma <info@martin-thoma.de>

lorenzomanini · 2023-01-18T21:00:36Z

I still need to go over a lot of details, but I wanted to give you a small heads-up: Excellent work 🎉 ❤️ Thank you for contributing to pypdf 🤗

Thank you!:blush: I'm having a lot of fun. And thank you for being so welcoming!:heart:

lorenzomanini · 2023-01-18T21:23:41Z

I'm confident that this can be merged + released until Sunday :-)

@lorenzomanini If you want, you can add yourself to the https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html (docs/meta/CONTRIBUTORS.md) within this PR :-)

Cool! Thank you!

pypdf/_writer.py

MartinThoma · 2023-01-19T18:35:16Z

@lorenzomanini Looks good from my side - let me know when I can merge :-)

Co-authored-by: Martin Thoma <info@martin-thoma.de>

lorenzomanini · 2023-01-19T19:21:44Z

Good, I think it is ready! Thank you @MartinThoma and @pubpub-zz for the help and for being so nice.
This was a great first experience contributing to an open-source project. I hope to collab again! ❤️ 😊

MartinThoma · 2023-01-19T19:41:02Z

I'm happy to hear that you feel welcome 🤗

Your contribution is now in the main branch and will be on PyPI with version pypdf==3.3.0 on Sunday.

Really good work!

New Features (ENH): - Add page label support to PdfWriter (#1558) - Accept inline images with space before EI (#1552) - Add circle annotation support (#1556) - Add polygon annotation support (#1557) - Make merging pages produce a deterministic PDF (#1542, #1543) Bug Fixes (BUG): - Fix error in cmap extraction (#1544) - Remove erroneous assertion check (#1564) - Fix dictionary access of optional page label keys (#1562) Robustness (ROB): - Set ignore_eof=True for read_until_regex (#1521) Documentation (DOC): - Paper size (#1550) Developer Experience (DEV): - Fix broken combination of dependencies of docs.txt - Annotate tests appropriately (#1551) [Full Changelog](3.2.1...3.3.0)

lorenzomanini added 2 commits January 15, 2023 23:22

ENH: Add page label support to PdfWriter

b110225

Merge branch 'main' of https://github.com/py-pdf/pypdf

c5fa998