Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: missing error on name without leading / #2387

Merged
merged 27 commits into from
Feb 27, 2024
Merged

Conversation

Rak424
Copy link
Contributor

@Rak424 Rak424 commented Jan 2, 2024

Leading slash is not part of the Name value, but only required in binary representation, so it should not be needed for Name object creation, and binary representation should always start with b"/".

Description in spec:

When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules:

Copy link

codecov bot commented Jan 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.44%. Comparing base (2b3051b) to head (c777d9f).
Report is 1 commits behind head on main.

❗ Current head c777d9f differs from pull request most recent head 7c75ab5. Consider uploading reports for the commit 7c75ab5 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2387   +/-   ##
=======================================
  Coverage   94.44%   94.44%           
=======================================
  Files          49       49           
  Lines        8027     8027           
  Branches     1618     1618           
=======================================
  Hits         7581     7581           
  Misses        276      276           
  Partials      170      170           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stefan6419846
Copy link
Collaborator

Thanks for the PR. Did you previously get any errors regarding this from a specific PDF file?

@Rak424
Copy link
Contributor Author

Rak424 commented Jan 2, 2024

Here is a quick example where it produces a corrupted pdf file:

from io import BytesIO
from pypdf import PdfWriter
from pypdf.generic import NameObject

s = BytesIO()
f = PdfWriter()
f._root_object[NameObject("foo")] = NameObject("bar")
f.write_stream(s)
print(s.getvalue().decode("latin1"))
...
3 0 obj
<<
/Type /Catalog
/Pages 1 0 R
foo bar
>>
endobj
...

@stefan6419846
Copy link
Collaborator

I am still having trouble understanding how this affects real-world usages - the above code will not generate a valid PDF anyway due to the missing page tree.

Previously, f.write_stream(s) would emit two warnings for your example:

Incorrect first char in NameObject:(foo)
Incorrect first char in NameObject:(bar)

This will not change with your patch. Is this correct?

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Jan 4, 2024

I am still having trouble understanding how this affects real-world usages - the above code will not generate a valid PDF anyway due to the missing page tree.

Previously, f.write_stream(s) would emit two warnings for your example:

Incorrect first char in NameObject:(foo)
Incorrect first char in NameObject:(bar)

This will not change with your patch. Is this correct?

I agree with @stefan6419846
If a change should be added I would add a constructor that would check if the input parameter starts with "/"

@MartinThoma MartinThoma added the on-hold PR requests that need clarification before they can be merged.A comment must give details label Jan 6, 2024
@Rak424
Copy link
Contributor Author

Rak424 commented Jan 9, 2024

The problem is that binary representation of a name object always start with "/", and object like b"foo" is not a valid pdf object, it means that's not possible to parse the generated file correctly. If "/" is forgotten, nothing stop file creation.

An other thing is that b"/" is not part of the name value, but only used in binary representation in pdf file. Other libraries like https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfname.py don't need it for object creation, and for me it's the correct way to implement it. It's confusing to require it and lead to corrupted files., specially when you use external libraries built on top of it. That's a problem I've encountered more than a year ago and the solution was to not use it.

I've not changed the current way it's implemented, because it's a huge change, but it could be done in a major release.

@stefan6419846
Copy link
Collaborator

I've not changed the current way it's implemented, because it's a huge change, but it could be done in a major release.

We are currently preparing a major release, but we would need a corresponding deprecation process which delays the final hard change for some more time as per our deprecation policy: https://pypdf.readthedocs.io/en/latest/dev/deprecations.html

@pubpub-zz
Copy link
Collaborator

The problem is that binary representation of a name object always start with "/", and object like b"foo" is not a valid pdf object, it means that's not possible to parse the generated file correctly. If "/" is forgotten, nothing stop file creation.

I understand that your issue is when the "/" is forgotten nothing prevents writing the file. The exception should be raised within the constructor. An alternative could be to add the forgotten "/" within the constructor. doing this job during file writing will impact too much performances.

An other thing is that b"/" is not part of the name value, but only used in binary representation in pdf file.

pypdf has always considered the name to include the "/" as a convention. changing this will prevent nearly all existing programs to work in the future.

Other libraries like https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfname.py don't need it for object creation, and for me it's the correct way to implement it. It's confusing to require it and lead to corrupted files., specially when you use external libraries built on top of it. That's a problem I've encountered more than a year ago and the solution was to not use it.

I don't think that changing the convention is a good idea: so many people are dowloading and using pypdf as it is (https://piptrends.com/compare/pypdf-vs-pdfrw)

I've not changed the current way it's implemented, because it's a huge change, but it could be done in a major release.

tests/test_generic.py Outdated Show resolved Hide resolved
Rak424 and others added 2 commits January 20, 2024 21:37
Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@Rak424
Copy link
Contributor Author

Rak424 commented Jan 20, 2024

ok, I've changed it to oblige leading /, and corrected errors found.

@Rak424 Rak424 changed the title BUG: handling name without leading / BUG: missing error on name without leading / Jan 28, 2024
@Rak424
Copy link
Contributor Author

Rak424 commented Jan 28, 2024

I've not found a correct way to overriding string constructor, so instead I've raised an error because the result without leading slash produce a corrupted pdf in any cases.
Doing this, I've found some related bugs raised in tests that I've corrected.

@stefan6419846 stefan6419846 added breaking-change A planned breaking change and removed on-hold PR requests that need clarification before they can be merged.A comment must give details labels Feb 27, 2024
@stefan6419846
Copy link
Collaborator

@pubpub-zz Are you okay with the current implementation or do you prefer/propose an alternative one?

@pubpub-zz pubpub-zz merged commit 178014e into py-pdf:main Feb 27, 2024
12 checks passed
@pubpub-zz
Copy link
Collaborator

My first merge 🎉

stefan6419846 added a commit that referenced this pull request Mar 3, 2024
## What's new

Generating name objects (`NameObject`) without a leading slash
is considered deprecated now. Previously, just a plain warning
would be logged, leading to possibly invalid PDF files. According
to our deprecation policy, this will log a *DeprecationWarning*
for now.

### New Features (ENH)
- Add get_pages_from_field  (#2494) by @pubpub-zz
- Add reattach_fields function (#2480) by @pubpub-zz
- Automatic access to pointed object for IndirectObject (#2464) by @pubpub-zz

### Bug Fixes (BUG)
- Missing error on name without leading / (#2387) by @Rak424
- encode_pdfdocencoding() always returns bytes (#2440) by @sbourlon
- BI in text content identified as image tag (#2459) by @pubpub-zz

### Robustness (ROB)
- Missing basefont entry in type 3 font (#2469) by @pubpub-zz

### Documentation (DOC)
- Improve lossless compression example (#2488) by @j-t-1
- Amend robustness documentation (#2479) by @j-t-1

### Developer Experience (DEV)
- Fix changelog for UTF-8 characters (#2462) by @stefan6419846

### Maintenance (MAINT)
- Add _get_page_number_from_indirect in writer (#2493) by @pubpub-zz
- Remove user assignment for feature requests (#2483) by @stefan6419846
- Remove reference to old 2.0.0 branch (#2482) by @stefan6419846

### Testing (TST)
- Fix benchmark failures (#2481) by @stefan6419846
- Broken test due to expired test file URL (#2468) by @pubpub-zz
- Resolve file naming conflict in test_iss1767 (#2445) by @sbourlon

[Full Changelog](4.0.2...4.1.0)
stefan6419846 added a commit that referenced this pull request Mar 3, 2024
## What's new

Generating name objects (`NameObject`) without a leading slash
is considered deprecated now. Previously, just a plain warning
would be logged, leading to possibly invalid PDF files. According
to our deprecation policy, this will log a *DeprecationWarning*
for now.

### New Features (ENH)
- Add get_pages_from_field  (#2494) by @pubpub-zz
- Add reattach_fields function (#2480) by @pubpub-zz
- Automatic access to pointed object for IndirectObject (#2464) by @pubpub-zz

### Bug Fixes (BUG)
- Missing error on name without leading / (#2387) by @Rak424
- encode_pdfdocencoding() always returns bytes (#2440) by @sbourlon
- BI in text content identified as image tag (#2459) by @pubpub-zz

### Robustness (ROB)
- Missing basefont entry in type 3 font (#2469) by @pubpub-zz

### Documentation (DOC)
- Improve lossless compression example (#2488) by @j-t-1
- Amend robustness documentation (#2479) by @j-t-1

### Developer Experience (DEV)
- Fix changelog for UTF-8 characters (#2462) by @stefan6419846

### Maintenance (MAINT)
- Add _get_page_number_from_indirect in writer (#2493) by @pubpub-zz
- Remove user assignment for feature requests (#2483) by @stefan6419846
- Remove reference to old 2.0.0 branch (#2482) by @stefan6419846

### Testing (TST)
- Fix benchmark failures (#2481) by @stefan6419846
- Broken test due to expired test file URL (#2468) by @pubpub-zz
- Resolve file naming conflict in test_iss1767 (#2445) by @sbourlon

[Full Changelog](4.0.2...4.1.0)
@Rak424 Rak424 deleted the slash branch March 7, 2024 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change A planned breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants