Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add page label support to PdfWriter #1558

Merged
merged 23 commits into from
Jan 19, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
b110225
ENH: Add page label support to PdfWriter
lorenzomanini Jan 15, 2023
c5fa998
Merge branch 'main' of https://github.com/py-pdf/pypdf
lorenzomanini Jan 15, 2023
39c6423
fix import, fix overlapping ranges behaviour
lorenzomanini Jan 17, 2023
4a48c29
add tests
lorenzomanini Jan 17, 2023
474b211
Merge remote-tracking branch 'upstream/main'
lorenzomanini Jan 17, 2023
9c97450
Merge branch 'main' of https://github.com/lorenzomanini/pypdf
lorenzomanini Jan 17, 2023
55ed6c8
Apply suggestions from code review: fix mypy
lorenzomanini Jan 17, 2023
dbd6bce
Apply suggestions from code review: fix mypy
lorenzomanini Jan 17, 2023
717c882
Fix mypy
lorenzomanini Jan 17, 2023
e298494
fix flake
lorenzomanini Jan 17, 2023
6977af6
Moved nums functions to _page_labels.py
lorenzomanini Jan 18, 2023
251390f
Added tests
lorenzomanini Jan 18, 2023
d56a63a
Added Docs
lorenzomanini Jan 18, 2023
64beecb
add PageLabelStyle constants
lorenzomanini Jan 18, 2023
739d117
More explicit PageLabelStyle constants
lorenzomanini Jan 18, 2023
b24b2ce
Apply suggestions from code review: fix Doc
lorenzomanini Jan 18, 2023
75fa103
fix flake8
lorenzomanini Jan 18, 2023
8c0a82e
add lorenzomanini to CONTRIBUTORS.md
lorenzomanini Jan 18, 2023
679b8f8
Merge branch 'main' into main
lorenzomanini Jan 18, 2023
cd0b81e
change set_page_label page indexing to 0 based
lorenzomanini Jan 18, 2023
c1c1357
Apply suggestions from code review: fix Doc
lorenzomanini Jan 19, 2023
ef2a6e8
Merge remote-tracking branch 'upstream/main'
lorenzomanini Jan 19, 2023
daad866
fix test
lorenzomanini Jan 19, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 115 additions & 0 deletions pypdf/_writer.py
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,14 @@
ZoomArgType,
)

try:
from typing import Literal # type: ignore[attr-defined]
except ImportError:
# PEP 586 introduced typing.Literal with Python 3.8
# For older Python versions, the backport typing_extensions is necessary:
from typing_extensions import Literal # type: ignore[misc]


logger = logging.getLogger(__name__)


Expand Down Expand Up @@ -2874,6 +2882,113 @@ def reset_translation(
else:
raise Exception("invalid parameter {reader}")

def set_page_label(
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
self,
page_number_from: int,
page_number_to: int,
Copy link
Contributor Author

@lorenzomanini lorenzomanini Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about page ranges indexes.
Now set_page_label requires page numbers starting with 1 while _set_page_label requires them starting with 0 (the parameters name change accordingly from page_number to page_index)
Also in both cases extremes are included.
I did it this way in the public interface because I think it is more natural this way for the user that probably is setting these watching a pdf (where page numbers start from 1), but I understand if you don't agree, especially considering that the private interface is 0 based.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use page_index everywhere and hence starting from 0.
Don't forget that pypdf users are Python developers. Starting a list of pages with index 0 is a lot more natural than starting with 1.

My suggestion would be to rename the parameters to page_index_from and page_index_to and letting it start with 0.

@pubpub-zz / @MasterOdin What's your opinion?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with your approach

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehehe 1 based indexing is never the answer. The more I think about it the more I agree with you. I'll wait a bit for MasterOdin answer and then I'll change that. Unfortunately, I will have to change all the indexes in the test but I knew what I was risking when I did it:sweat_smile:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree with making it 0-based indexed for the simple reason that all other functions that refer to pages uses a 0-based index.

I agree with @MartinThoma that using page_index is the more "proper" way to refer to this as people may conflate page_number with a 1-based index scheme, whereas page_index, in my opinion, really only refers to 0-based index schemes in the context of Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'll do that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last commit changes everything to 0-based indexing. Are you convinced by including both extremes or should we do lower included and upper excluded?

style: Optional[Literal["/D", "/R", "/r", "/A", "/a"]] = None,
prefix: Optional[str] = None,
start: Optional[int] = 0,
MartinThoma marked this conversation as resolved.
Show resolved Hide resolved
):
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
if style is None and prefix is None:
raise ValueError("at least one between style and prefix must be given")
if page_number_from < 1:
raise ValueError("page_index_from must be equal or greater then 1")
if page_number_to < page_number_from:
raise ValueError(
"page_index_to must be equal or greater then page_index_from"
)
if page_number_to > len(self.pages):
raise ValueError("page_index_to exceeds number of pages")
if start != 0 and start < 1:
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
raise ValueError("start must be equal or greater than one")

self._set_page_label(
page_number_from - 1, page_number_to - 1, style, prefix, start
)

def _set_page_label(
self,
page_index_from: int,
page_index_to: int,
style: Optional[Literal["/D", "/R", "/r", "/A", "/a"]] = None,
prefix: Optional[str] = None,
start: Optional[int] = 0,
):
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
default_page_label = DictionaryObject()
default_page_label[NameObject("/S")] = NameObject("/D")

new_page_label = DictionaryObject()
if style is not None:
new_page_label[NameObject("/S")] = NameObject(style)
if prefix is not None:
new_page_label[NameObject("/P")] = TextStringObject(prefix)
if start != 0:
new_page_label[NameObject("/St")] = NumberObject(start)
MartinThoma marked this conversation as resolved.
Show resolved Hide resolved

if not NameObject(CatalogDictionary.PAGE_LABELS) in self._root_object:
nums = ArrayObject()
self._nums_insert(NumberObject(0), default_page_label, nums)
page_labels = TreeObject()
page_labels[NameObject("/Nums")] = nums
self._root_object[NameObject(CatalogDictionary.PAGE_LABELS)] = page_labels

page_labels = self._root_object[NameObject(CatalogDictionary.PAGE_LABELS)]
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
nums = page_labels[NameObject("/Nums")]
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved

self._nums_insert(NumberObject(page_index_from), new_page_label, nums)
self._nums_clear_range(NumberObject(page_index_from), page_index_to, nums)
next_label_pos, *_ = self._nums_next(NumberObject(page_index_from), nums)
if next_label_pos != page_index_to + 1 and page_index_to + 1 < len(self.pages):
self._nums_insert(NumberObject(page_index_to + 1), default_page_label, nums)

page_labels[NameObject("/Nums")] = nums
self._root_object[NameObject(CatalogDictionary.PAGE_LABELS)] = page_labels

def _nums_insert(
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
self,
key: NumberObject,
value: DictionaryObject,
nums: ArrayObject,
):
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
if len(nums) % 2 != 0:
raise ValueError("nums must contain an even number of elements")

i = len(nums)
while i != 0 and key <= nums[i - 2]:
i = i - 2

if i < len(nums) and key == nums[i]:
nums[i + 1] = value
else:
nums.insert(i, key)
nums.insert(i + 1, value)

def _nums_clear_range(
self,
key: NumberObject,
page_index_to: int,
nums: ArrayObject,
):
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
if page_index_to < key:
raise ValueError("page_index_to must be greater or equal than key")

i = nums.index(key) + 2
while i < len(nums) and nums[i] <= page_index_to:
nums.pop(i)
nums.pop(i)

def _nums_next(
self,
key: NumberObject,
nums: ArrayObject,
):
lorenzomanini marked this conversation as resolved.
Show resolved Hide resolved
i = nums.index(key) + 2
if i < len(nums):
return (nums[i], nums[i + 1])
else:
return (None, None)


def _pdf_objectify(obj: Union[Dict[str, Any], str, int, List[Any]]) -> PdfObject:
if isinstance(obj, PdfObject):
Expand Down
95 changes: 95 additions & 0 deletions tests/test_writer.py
Original file line number Diff line number Diff line change
Expand Up @@ -1008,3 +1008,98 @@ def test_append_multiple():
pages = writer._root_object["/Pages"]["/Kids"]
assert pages[0] not in pages[1:] # page not repeated
assert pages[-1] not in pages[0:-1] # page not repeated


@pytest.mark.samples
def test_set_page_labels():
src = RESOURCE_ROOT / "GeoBase_NHNC1_Data_Model_UML_EN.pdf" # File without labels
target = "pypdf-output.pdf"
reader = PdfReader(src)

expected = [
"i",
"ii",
"1",
"2",
"A",
"B",
"1",
"2",
"3",
"4",
"A",
"i",
"I",
"II",
"1",
"2",
"3",
"I",
"II",
]

# Tests full lenght with labels assigned at first and last elements
# Tests different labels assigned to consecutive ranges
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.set_page_label(1, 2, "/r")
writer.set_page_label(5, 6, "/A")
writer.set_page_label(11, 11, "/A")
writer.set_page_label(12, 12, "/r")
writer.set_page_label(13, 14, "/R")
writer.set_page_label(18, 19, "/R")
writer.write(target)
assert PdfReader(target).page_labels == expected

writer = PdfWriter() # Same labels, different set order
writer.clone_document_from_reader(reader)
writer.set_page_label(18, 19, "/R")
writer.set_page_label(5, 6, "/A")
writer.set_page_label(11, 11, "/A")
writer.set_page_label(1, 2, "/r")
writer.set_page_label(13, 14, "/R")
writer.set_page_label(12, 12, "/r")
writer.write(target)
assert PdfReader(target).page_labels == expected

# Tests labels assigned only in the middle
# Tests label assigned to a range already containing labled ranges
expected = ["1", "2", "i", "ii", "iii", "iv", "v", "1"]
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.set_page_label(4, 5, "/a")
writer.set_page_label(6, 6, "/A")
writer.set_page_label(3, 7, "/r")
writer.write(target)
assert PdfReader(target).page_labels[: len(expected)] == expected

# Tests labels assigned inside a previously existing range
expected = ["1", "2", "i", "a", "b", "A", "1", "1", "2"]
# Ones repeat because user didnt cover the entire original range
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.set_page_label(3, 7, "/r")
writer.set_page_label(4, 5, "/a")
writer.set_page_label(6, 6, "/A")
writer.write(target)
assert PdfReader(target).page_labels[: len(expected)] == expected

src = SAMPLE_ROOT / "009-pdflatex-geotopo/GeoTopo.pdf" # File with pre existing labels
target = "pypdf-output.pdf"
reader = PdfReader(src)

# Tests adding labels to existing ones
expected = ["i", "ii", "A", "B", "1"]
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.set_page_label(3, 4, "/A")
writer.write(target)
assert PdfReader(target).page_labels[: len(expected)] == expected

# Tests replacing existing lables
expected = ["A", "B", "1", "1", "2"]
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.set_page_label(1, 2, "/A")
writer.write(target)
assert PdfReader(target).page_labels[: len(expected)] == expected