nul byte difference between xmp and info dictionary #1017

Closed
beat2 opened this issue Feb 7, 2019 · 4 comments
Labels: bug (A product defect that needs fixing), P2 (Medium priority issues to be scheduled in a future release)

beat2 commented Feb 7, 2019

When validating a PDF/A-1b file, we encountered this issue:

If a document information dictionary does appear at a document, then all of its entries that have analogous properties in predefined XMP schemas, shall also be embedded in the file in XMP form with equivalent values

The /Producer entry of the document information dictionary ends with an extra NUL character (the final \000\000, which is U+0000 in UTF-16BE):

/Producer (\376\377\000A\000d\000o\000b\000e\000 \000P\000S\000L\000 \0001\000.\000
2\000e\000 \000f\000o\000r\000 \000C\000a\000n\000o\000n\000\000)
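To make the trailing character visible, here is a minimal sketch that decodes the string above. It assumes the line break in the pasted string is only a display wrap, that the string uses nothing but three-digit octal escapes and plain ASCII, and that the leading \376\377 is the UTF-16BE byte order mark. The class and method names are made up for illustration and are not part of veraPDF or PDFBox.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ProducerDecode {
    // The /Producer value from the report, with the two pasted lines joined.
    static final String RAW =
        "\\376\\377\\000A\\000d\\000o\\000b\\000e\\000 \\000P\\000S\\000L\\000 \\0001\\000.\\000"
        + "2\\000e\\000 \\000f\\000o\\000r\\000 \\000C\\000a\\000n\\000o\\000n\\000\\000";

    // Turns three-digit octal escapes (\ddd) into bytes; other PDF escape
    // forms are deliberately not handled in this sketch.
    static byte[] unescapeOctal(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                out.write(Integer.parseInt(s.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    // Skips the 2-byte BOM (FE FF) and decodes the rest as UTF-16BE.
    static String decode(String escaped) {
        byte[] bytes = unescapeOctal(escaped);
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String producer = decode(RAW);
        System.out.println(producer.length());                            // 25
        System.out.println((int) producer.charAt(producer.length() - 1)); // 0
        System.out.println(producer.substring(0, producer.length() - 1)); // Adobe PSL 1.2e for Canon
    }
}
```

The decoded value is the 24 visible characters followed by a single U+0000, which is the character that has no counterpart in the XMP value.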

The XMP (XML) value has no such trailing character, since NUL bytes are not allowed in XML. For a discussion of the same issue, please see the PDFBOX issue here: PDFBOX-2503.
Example files are attached there too.

I propose to add a trim within XMPChecker#checkCOSStringProperty.
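A minimal sketch of what such a trim could look like. The actual signature and internals of XMPChecker#checkCOSStringProperty are not shown in this thread, so the helper below is hypothetical and stands on its own. It removes only trailing U+0000 characters and leaves NULs elsewhere in the string untouched.

```java
final class TrailingNulTrim {
    // Hypothetical helper, not part of the veraPDF API:
    // strips U+0000 characters from the end of the value only.
    static String trimTrailingNul(String value) {
        int end = value.length();
        while (end > 0 && value.charAt(end - 1) == '\0') {
            end--;
        }
        return value.substring(0, end);
    }

    // Compares an info dictionary value with its XMP counterpart
    // after trimming trailing NULs from the info dictionary value.
    static boolean equivalent(String infoValue, String xmpValue) {
        return trimTrailingNul(infoValue).equals(xmpValue);
    }

    public static void main(String[] args) {
        // Trailing NUL only: treated as equivalent.
        System.out.println(equivalent("Adobe PSL 1.2e for Canon\0", "Adobe PSL 1.2e for Canon")); // true
        // NUL in the middle: still a mismatch.
        System.out.println(equivalent("Adobe\0PSL", "Adobe")); // false
    }
}
```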

Slightly related, the BFO library added a "workaround" for this too some years ago: javadoc

Contributor

bdoubrov commented Feb 8, 2019

Thanks for bringing this to our attention! I've double-checked that other PDF/A validators do indeed behave this way, so we'll fix our logic as well.


a20god commented Feb 11, 2019

Please do not try to make veraPDF bug-compatible with broken software. Apparently, some PDF/A validators use inadequate means (wcscmp()?) for comparing strings, stopping comparison at the first NUL character. This has nothing to do with some characters not being representable in XML.

Examples:
t3.pdf
t5.pdf
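The suspected behaviour can be illustrated with a small sketch. It emulates C string semantics, where everything from the first NUL onward is ignored, and contrasts that with a full comparison. The input strings are invented for illustration and are not taken from t3.pdf or t5.pdf. Whether any particular validator really works this way is the commenter's conjecture, not something this sketch can confirm.

```java
final class NulComparison {
    // Emulates how a C string function sees the value:
    // the first U+0000 terminates the string.
    static String asCString(String s) {
        int nul = s.indexOf('\0');
        return nul < 0 ? s : s.substring(0, nul);
    }

    // Comparison that stops at the first NUL, as wcscmp() would.
    static boolean cStyleEquals(String a, String b) {
        return asCString(a).equals(asCString(b));
    }

    public static void main(String[] args) {
        String xmp = "Canon";
        String info = "Canon\0A"; // hypothetical value: NUL followed by more text

        System.out.println(info.equals(xmp));        // false: full comparison sees the difference
        System.out.println(cStyleEquals(info, xmp)); // true: comparison stops at the first NUL
    }
}
```

A comparison of this kind accepts any suffix after the first NUL, not just a trailing NUL, which is the distinction the comment above is drawing.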

bdoubrov added the labels bug (A product defect that needs fixing) and P2 (Medium priority issues to be scheduled in a future release) on Feb 13, 2019
bdoubrov added this to the v1.14-m4 milestone on Feb 13, 2019
Author

beat2 commented Feb 14, 2019

@a20god please see the linked pdfbox issue - this is not about NUL between, but only at the end

quote:

msahyoun Maruan Sahyoun added a comment - 18/Nov/14 10:48 - edited

If the DocumentInformation metadata contains trailing NUL characters, everything is fine. For all others, the trailing characters as well as control characters within the string are taken into account by Adobe Preflight as well as others and validated against the XMP entry.

From these tests IMHO we should only trim trailing NUL


a20god commented Feb 14, 2019

@a20god please see the linked pdfbox issue - this is not about NUL between, but only at the end

A NUL at the end is just a special case of the general Adobe Acrobat breakage: Adobe Acrobat's Preflight thinks that my t3.pdf conforms to PDF/A-1b. However, the string in the Document Information Dictionary has two additional characters at the end, U+0000 and U+0041.

I propose to fix the PDF producer rather than breaking all existing PDF/A validators.

If you think that U+0000 is to be ignored at the end of strings in the Document Information Dictionary, please point to chapter and verse in any relevant standard.

If a string cannot be represented in XML, then that string cannot be used as the value of one of the entries in the Document Information Dictionary that must match the document metadata of conforming PDF/A-1b documents.
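The representability argument can be made concrete with a short sketch that tests a string against the Char production of XML 1.0, which excludes U+0000 along with most other C0 control characters. The class and method names are hypothetical, and the check covers XML 1.0 only.

```java
final class XmlCharCheck {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // True if every code point of the string may appear in an XML 1.0 document.
    static boolean representableInXml10(String s) {
        return s.codePoints().allMatch(XmlCharCheck::isXml10Char);
    }

    public static void main(String[] args) {
        System.out.println(representableInXml10("Adobe PSL 1.2e for Canon"));   // true
        System.out.println(representableInXml10("Adobe PSL 1.2e for Canon\0")); // false
    }
}
```

Under this reading, a /Producer value ending in U+0000 can never have an exactly equal XMP value, so the producer would have to drop the NUL for the two to match without any trimming.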
