BUG: missing error on name without leading / #2387

Rak424 · 2024-01-02T16:06:05Z

The leading slash is not part of the Name value; it is only required in the binary representation. It should therefore not be needed to create a Name object, and the binary representation should always start with b"/".

Description in spec:

When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules:
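To make the distinction between the name value and its written form concrete, here is a minimal sketch. The helpers `name_to_pdf_bytes` and `pdf_bytes_to_name` are hypothetical and not part of pypdf; they only handle the SOLIDUS prefix and deliberately leave out the remaining rules of the spec (such as escaping).

```python
SOLIDUS = b"/"  # 2Fh, introduces a name in the file


def name_to_pdf_bytes(name_value: str) -> bytes:
    """Hypothetical helper: write a name value with its SOLIDUS prefix.

    Escaping of special characters is intentionally not covered here.
    """
    return SOLIDUS + name_value.encode("ascii")


def pdf_bytes_to_name(token: bytes) -> str:
    """Hypothetical helper: strip the SOLIDUS to recover the name value."""
    if not token.startswith(SOLIDUS):
        raise ValueError(f"not a name token: {token!r}")
    return token[len(SOLIDUS):].decode("ascii")


print(name_to_pdf_bytes("Type"))       # b'/Type'
print(pdf_bytes_to_name(b"/Catalog"))  # Catalog
```

In this model the value is "Type" and only the serialized token carries the "/". pypdf follows a different convention, where the "/" is part of the string passed to `NameObject`.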

codecov · 2024-01-02T16:13:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.44%. Comparing base (2b3051b) to head (c777d9f).
Report is 1 commits behind head on main.

❗ Current head c777d9f differs from pull request most recent head 7c75ab5. Consider uploading reports for the commit 7c75ab5 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2387   +/-   ##
=======================================
  Coverage   94.44%   94.44%           
=======================================
  Files          49       49           
  Lines        8027     8027           
  Branches     1618     1618           
=======================================
  Hits         7581     7581           
  Misses        276      276           
  Partials      170      170


stefan6419846 · 2024-01-02T17:00:25Z

Thanks for the PR. Did you previously get any errors regarding this from a specific PDF file?

Rak424 · 2024-01-02T19:54:38Z

Here is a quick example where it produces a corrupted pdf file:

from io import BytesIO
from pypdf import PdfWriter
from pypdf.generic import NameObject

s = BytesIO()
f = PdfWriter()
f._root_object[NameObject("foo")] = NameObject("bar")
f.write_stream(s)
print(s.getvalue().decode("latin1"))

...
3 0 obj
<<
/Type /Catalog
/Pages 1 0 R
foo bar
>>
endobj
...

stefan6419846 · 2024-01-03T06:44:36Z

I am still having trouble understanding how this affects real-world usages - the above code will not generate a valid PDF anyway due to the missing page tree.

Previously, f.write_stream(s) would emit two warnings for your example:

Incorrect first char in NameObject:(foo)
Incorrect first char in NameObject:(bar)

This will not change with your patch. Is this correct?
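The behavior described here, a warning at write time that does not stop the invalid token from being written, can be sketched with the standard library alone. `write_name` and `ListHandler` are hypothetical names for illustration and are not pypdf's implementation; only the message text is taken from the warnings quoted above.

```python
import io
import logging

logger = logging.getLogger("name_sketch")
logger.propagate = False  # keep the demo output out of the root logger


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list for inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def write_name(stream: io.BytesIO, name: str) -> None:
    """Hypothetical writer: warn about a missing "/" but write anyway."""
    if not name.startswith("/"):
        logger.warning("Incorrect first char in NameObject:(%s)", name)
    stream.write(name.encode("latin1"))


handler = ListHandler()
logger.addHandler(handler)

buf = io.BytesIO()
write_name(buf, "foo")
buf.write(b" ")
write_name(buf, "bar")

print(buf.getvalue())    # b'foo bar'
print(handler.messages)
```

Both warnings are recorded, yet the stream still contains `foo bar` without any "/", which matches the corrupted dictionary entry in the example output.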

pubpub-zz · 2024-01-04T08:41:06Z

> I am still having trouble understanding how this affects real-world usages - the above code will not generate a valid PDF anyway due to the missing page tree.
>
> Previously, f.write_stream(s) would emit two warnings for your example:
>
> Incorrect first char in NameObject:(foo)
> Incorrect first char in NameObject:(bar)
>
> This will not change with your patch. Is this correct?

I agree with @stefan6419846.
If a change is to be made, I would add a constructor that checks whether the input parameter starts with "/".

Rak424 · 2024-01-09T12:49:36Z

The problem is that the binary representation of a name object always starts with "/", and an object like b"foo" is not a valid PDF object, which means the generated file cannot be parsed correctly. If the "/" is forgotten, nothing stops the file from being created.

Another thing is that b"/" is not part of the name value, but is only used in the binary representation in the PDF file. Other libraries like https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfname.py don't need it for object creation, and for me that is the correct way to implement it. Requiring it is confusing and leads to corrupted files, especially when you use external libraries built on top of it. That's a problem I encountered more than a year ago, and the solution was to not use it.

I've not changed the current way it's implemented, because it's a huge change, but it could be done in a major release.

stefan6419846 · 2024-01-09T16:40:38Z

> I've not changed the current way it's implemented, because it's a huge change, but it could be done in a major release.

We are currently preparing a major release, but we would need a corresponding deprecation process which delays the final hard change for some more time as per our deprecation policy: https://pypdf.readthedocs.io/en/latest/dev/deprecations.html

pubpub-zz · 2024-01-09T18:45:50Z

> The problem is that the binary representation of a name object always starts with "/", and an object like b"foo" is not a valid PDF object, which means the generated file cannot be parsed correctly. If the "/" is forgotten, nothing stops the file from being created.

I understand that your issue is that when the "/" is forgotten, nothing prevents the file from being written. The exception should be raised within the constructor. An alternative could be to add the forgotten "/" within the constructor. Doing this during file writing would hurt performance too much.

> Another thing is that b"/" is not part of the name value, but is only used in the binary representation in the PDF file.

pypdf has always considered the name to include the "/" as a convention. Changing this would break nearly all existing programs.

> Other libraries like https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfname.py don't need it for object creation, and for me that is the correct way to implement it. Requiring it is confusing and leads to corrupted files, especially when you use external libraries built on top of it. That's a problem I encountered more than a year ago, and the solution was to not use it.

I don't think that changing the convention is a good idea: so many people are downloading and using pypdf as it is (https://piptrends.com/compare/pypdf-vs-pdfrw).

> I've not changed the current way it's implemented, because it's a huge change, but it could be done in a major release.

Rak424 · 2024-01-20T21:52:32Z

OK, I've changed it to require the leading /, and corrected the errors found.

Rak424 · 2024-01-28T20:13:27Z

I have not found a correct way to override the string constructor, so instead I raise an error, because a name without a leading slash produces a corrupted PDF in every case.
While doing this, I found some related bugs surfaced by the tests, which I have corrected.
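For reference, the two constructor-level options raised in this thread (raise an error, or add the forgotten "/") could look roughly like the sketch below. `StrictName` and its `add_missing_slash` parameter are hypothetical and are not the code merged in this PR; the sketch assumes that overriding `__new__` on a `str` subclass is acceptable for the real class.

```python
class StrictName(str):
    """Hypothetical stand-in for a name type; not pypdf's NameObject."""

    def __new__(cls, value: str, *, add_missing_slash: bool = False) -> "StrictName":
        if not value.startswith("/"):
            if not add_missing_slash:
                # Option 1: refuse to build a name that would corrupt the file.
                raise ValueError(f"name must start with '/': {value!r}")
            # Option 2: silently repair the value instead.
            value = "/" + value
        # str is immutable, so the final value must be fixed in __new__.
        return super().__new__(cls, value)


print(StrictName("/foo"))                         # /foo
print(StrictName("bar", add_missing_slash=True))  # /bar

try:
    StrictName("bar")
except ValueError as exc:
    print("rejected:", exc)
```

Because the check runs once at construction, nothing has to be validated again while the file is written, which is the performance concern mentioned earlier in the discussion.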

stefan6419846 · 2024-02-27T09:31:14Z

@pubpub-zz Are you okay with the current implementation or do you prefer/propose an alternative one?

pubpub-zz · 2024-02-27T18:06:09Z

My first merge 🎉

@pubpub-zz

## What's new

Generating name objects (`NameObject`) without a leading slash is considered deprecated now. Previously, just a plain warning would be logged, leading to possibly invalid PDF files. According to our deprecation policy, this will log a *DeprecationWarning* for now.

### New Features (ENH)
- Add get_pages_from_field (#2494) by @pubpub-zz
- Add reattach_fields function (#2480) by @pubpub-zz
- Automatic access to pointed object for IndirectObject (#2464) by @pubpub-zz

### Bug Fixes (BUG)
- Missing error on name without leading / (#2387) by @Rak424
- encode_pdfdocencoding() always returns bytes (#2440) by @sbourlon
- BI in text content identified as image tag (#2459) by @pubpub-zz

### Robustness (ROB)
- Missing basefont entry in type 3 font (#2469) by @pubpub-zz

### Documentation (DOC)
- Improve lossless compression example (#2488) by @j-t-1
- Amend robustness documentation (#2479) by @j-t-1

### Developer Experience (DEV)
- Fix changelog for UTF-8 characters (#2462) by @stefan6419846

### Maintenance (MAINT)
- Add _get_page_number_from_indirect in writer (#2493) by @pubpub-zz
- Remove user assignment for feature requests (#2483) by @stefan6419846
- Remove reference to old 2.0.0 branch (#2482) by @stefan6419846

### Testing (TST)
- Fix benchmark failures (#2481) by @stefan6419846
- Broken test due to expired test file URL (#2468) by @pubpub-zz
- Resolve file naming conflict in test_iss1767 (#2445) by @sbourlon

[Full Changelog](4.0.2...4.1.0)
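A deprecation step like the one described in these release notes can be sketched with the standard `warnings` module. `DeprecatingName` and its message text are hypothetical and only illustrate the mechanism; pypdf's actual class, helper functions, and wording may differ.

```python
import warnings


class DeprecatingName(str):
    """Hypothetical stand-in: accepts a missing "/" but warns about it."""

    def __new__(cls, value: str) -> "DeprecatingName":
        if not value.startswith("/"):
            warnings.warn(
                f"Name without a leading '/' is deprecated: {value!r}",
                DeprecationWarning,
                stacklevel=2,  # point at the caller, not at this constructor
            )
        return super().__new__(cls, value)


# Record warnings explicitly so the DeprecationWarning is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    name = DeprecatingName("foo")

print(name)                                    # foo
print([w.category.__name__ for w in caught])   # ['DeprecationWarning']
```

Callers who want to find such names early can turn the warning into an exception, for example with `warnings.simplefilter("error", DeprecationWarning)` in a test setup.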

handling name without leading /

7ef8b86

MartinThoma added the on-hold label (PR requests that need clarification before they can be merged. A comment must give details) Jan 6, 2024

Merge branch 'main' into slash

6400baa

Rak424 and others added 12 commits January 19, 2024 23:12

Merge branch 'py-pdf:main' into slash

e3dde18

init

6cf7075

type

15fbfca

optional

6a34b8d

exception

ae5ec5d

test

89f964a

test

8bedb27

test

9364c7e

missing /

1317068

wrong name value

000341f

wrong object

dcdb481

wrong object

d971211

pubpub-zz reviewed Jan 20, 2024

tests/test_generic.py (outdated, resolved)

Rak424 and others added 2 commits January 20, 2024 21:37

Update tests/test_generic.py

462cf1a

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

indent

f2674af

francois and others added 2 commits January 21, 2024 20:13

Merge branch 'main' into slash

0d71061

Merge branch 'main' into slash

5de1131

Rak424 changed the title from "BUG: handling name without leading /" to "BUG: missing error on name without leading /" Jan 28, 2024

stefan6419846 reviewed Feb 25, 2024

pypdf/annotations/_markup_annotations.py (resolved)

stefan6419846 reviewed Feb 25, 2024

pypdf/generic/_base.py (outdated, resolved)

Rak424 and others added 3 commits February 26, 2024 13:03

Merge branch 'main' into slash

0b56d7f

Merge branch 'main' into slash

a073eac

deprecate

f9a1c43

stefan6419846 reviewed Feb 26, 2024

tests/test_generic.py (outdated, resolved)

stefan6419846 reviewed Feb 26, 2024

pypdf/generic/_base.py (outdated, resolved)

francois and others added 4 commits February 26, 2024 14:16

ruff

da3a58d

deprecate_no_replacement

d176b2e

Merge branch 'main' into slash

5c344ab

cleaning

adc27a0

stefan6419846 reviewed Feb 27, 2024

pypdf/generic/_base.py (outdated, resolved)

fix formatting and version

c777d9f

stefan6419846 added the breaking-change label (A planned breaking change) and removed the on-hold label Feb 27, 2024

pubpub-zz approved these changes Feb 27, 2024

Merge branch 'main' into slash

7c75ab5

pubpub-zz merged commit 178014e into py-pdf:main Feb 27, 2024
12 checks passed

Rak424 deleted the slash branch March 7, 2024 19:06
