arlington-pdf-model-checker fails if there are unknown dictionary keys #1349

Open
u-fischer opened this issue Jun 22, 2023 · 7 comments
Labels
arlington Issues related to veraPDF Arlington model implementation

Comments

@u-fischer

I'm not sure if this is an issue with arlington-pdf-model-checker or if it should be reported to the Arlington model.

If I add unknown keys to dictionaries, or keys that are not yet defined in the PDF version being used, I get a failure, e.g.:

Catalog shall not contain entries except ...

ActionGoTo shall not contain entries SD in PDF 1.7.

Now, Annex I.3 "Feature compatibility" of the PDF 2.0 specification says that

Likewise, adding entries not described in the PDF specification to dictionary objects does not affect the PDF processor’s behaviour.

So why is such an unknown key a "shall not"?

@ousia

ousia commented Jun 22, 2023

@u-fischer,

PDF-1.7 specification reads at the start of section I.3 (it used to be available at https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#G22.1086391):

When a new version of PDF is defined, many features are introduced simply by adding new entries to existing dictionaries. Earlier versions of conforming readers do not notice the existence of such entries and behave as if they were not there. Such new features are therefore both forward- and backward-compatible. Likewise, adding entries not described in the PDF specification to dictionary objects does not affect the conforming reader’s behaviour.

I’m not sure a new entry from a newer format version is also backward-compatible with an older format version.

I would say such an entry has to be just ignored (for compatibility, it should be recognized).

I’m aware that the quote describes earlier software versions and not earlier format versions.

But since almost everything in PDF is an entry in /Catalog, being able to add entries from newer format versions to earlier ones might end up rendering format versions almost irrelevant.

As an example, consider adding multimedia (section 13.2 of the PDF-1.7 spec) to a PDF-1.4 document. Media objects were added in version 1.5. If the reader can deal with them (from PDF-1.5, since they haven’t been deprecated), should the reader handle media objects in PDF-1.4?

If the answer to my previous question is affirmative, in that case format versions might be a mess.

Among many other things, this is because earlier format versions might incorporate entries from newer versions without following any deprecation of entries defined in those newer versions.

At least for consistency between format versions and format features, each format version should ignore entries from newer format versions.

I might be missing the whole point and I’m more than happy to be corrected.

Many thanks for your help.

@u-fischer
Author

> I would say such an entry has to be just ignored (for compatibility, it should be recognized).

That would be quite fine with me. But this is not happening here: they are not ignored but give errors.

(My use case is structure destinations. They are a PDF 2.0 feature, but if I add them to a PDF 1.7 document too, it becomes easier to reimport its annotations with the newpax package, and easier for ngpdf to create links.)

> As an example, consider adding multimedia (section 13.2 of the PDF-1.7 spec) to a PDF-1.4 document. Media objects were added in version 1.5. If the reader can deal with them (from PDF-1.5, since they haven’t been deprecated), should the reader handle media objects in PDF-1.4?

No, personally I think it shouldn't. If the PDF says it is 1.4, the reader should only handle keys that have a meaning in 1.4.

@ousia

ousia commented Jun 22, 2023

> > I would say such an entry has to be just ignored (for compatibility, it should be recognized).
>
> That would be quite fine with me. But this is not happening here: they are not ignored but give errors.

I have tested it before and I know it complains about entries undefined in a given format version.

As far as I know, Annex I.3 prevents such error messages.

> > As an example, consider adding multimedia (section 13.2 of the PDF-1.7 spec) to a PDF-1.4 document. Media objects were added in version 1.5. If the reader can deal with them (from PDF-1.5, since they haven’t been deprecated), should the reader handle media objects in PDF-1.4?
>
> No, personally I think it shouldn't. If the PDF says it is 1.4, the reader should only handle keys that have a meaning in 1.4.

The quote from my previous comment might be poorly worded, and the text may have been intended to prescribe exactly that behavior.

In any case, the original error message should be a warning about entries being ignored (because undefined) in the given format version.
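The behavior being asked for here (flag an entry as "defined only in a newer version" instead of treating it like any other unknown key) can be sketched roughly as follows. This is only an illustration, not how veraPDF or the Arlington model implement it: the `SINCE` table, `parse_version` and `check_keys` are hypothetical names, and apart from `SD` being a PDF 2.0 feature (as stated in this thread) the version numbers in the table are placeholder assumptions.

```python
# Hypothetical table: dictionary type -> {key: PDF version that introduced it}.
# Only "SD" -> "2.0" is taken from the discussion; the other values are
# placeholder assumptions for the sake of the example.
SINCE = {
    "ActionGoTo": {"Type": "1.1", "S": "1.1", "D": "1.1", "SD": "2.0"},
}


def parse_version(version):
    """Turn a string such as "1.7" into a comparable tuple such as (1, 7)."""
    major, minor = version.split(".")
    return (int(major), int(minor))


def check_keys(dict_type, keys, pdf_version, since_table=SINCE):
    """Return (key, kind) pairs for keys that are not plainly valid.

    kind is "newer-version" when the key is defined, but only in a later
    PDF version than the one declared by the file, and "unknown" when the
    key is not in the table at all.
    """
    known = since_table.get(dict_type, {})
    findings = []
    for key in keys:
        if key not in known:
            findings.append((key, "unknown"))
        elif parse_version(known[key]) > parse_version(pdf_version):
            findings.append((key, "newer-version"))
    return findings


if __name__ == "__main__":
    for key, kind in check_keys("ActionGoTo", ["S", "D", "SD"], "1.7"):
        print(f"warning: ActionGoTo entry {key} is {kind} for PDF 1.7")
```

Keeping the two kinds apart lets a checker word the message for `SD` in a PDF 1.7 file differently from the message for a key that no version of the specification defines.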

@petervwyatt

Please only refer to ISO 32000-2:2020 as well as https://pdf-issues.pdfa.org/ as these documents include 1000s of corrections and clarifications agreed upon by many experts in the vendor-neutral ISO forums. Previous core PDF specs do not have this benefit. ISO 32000-2:2020 incl. the soon-to-be-published Amd1 are available to everyone for no cost via https://www.pdfa-inc.org/product/iso-32000-2-pdf-2-0-bundle-sponsored-access/

I don't know how veraPDF engineers have codified the Arlington rules (they will need to reply), but PDF 2.0 is now very clear that non-standardized keys in most dictionaries generally need to be 2nd class names with registered developer prefixes to avoid conflict with future (1st class) changes. There are a few exceptions to this rule that are mostly noted - see also pdf-association/pdf-issues#229. 2nd and 3rd class names are easily detectable in software so reporting anything that is not standardized and not a 2nd or 3rd class name key is useful.

So if you are using 1st class names (or incorrectly constructed 2nd class names) in many places then that is officially wrong according to the latest spec.
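As a rough illustration of why 2nd and 3rd class names are "easily detectable in software", a classifier might look like the sketch below. The patterns are assumptions and not quoted from the specification: it assumes a 2nd class name is a four-character developer prefix followed by an underscore or colon, and that a 3rd class name starts with `XX`. The exact rules should be checked against ISO 32000-2:2020. `classify_key` and the pattern names are hypothetical.

```python
import re

# Assumption: 3rd class (private) names start with "XX".
THIRD_CLASS = re.compile(r"^XX.+$")
# Assumption: 2nd class names are a 4-character prefix, then "_" or ":",
# then a non-empty remainder.
SECOND_CLASS = re.compile(r"^[A-Za-z0-9]{4}[_:].+$")


def classify_key(key, standard_keys):
    """Classify a dictionary key (without the leading slash)."""
    if key in standard_keys:
        return "standard"
    # Test the private "XX" form first so that a name such as "XXab_foo"
    # is not mistaken for a registered-prefix name.
    if THIRD_CLASS.match(key):
        return "third-class"
    if SECOND_CLASS.match(key):
        return "second-class"
    return "unknown"


if __name__ == "__main__":
    standard = {"Type", "Pages"}  # tiny illustrative subset, not the full Catalog
    for key in ["Type", "ADBE_Foo", "XXBlub", "Blub"]:
        print(key, "->", classify_key(key, standard))
```

Under these assumptions only the last key, `Blub`, would be worth reporting as a non-standard 1st class name.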

I also cannot speak to how veraPDF engineers have codified the detection and reporting of officially deprecated features, but relying on deprecated features (by themselves, since some features have "modern" better alternatives) in a PDF is also generally not a good idea in the long run. In the same way, some PDF 1.0 and PDF 1.1 features no longer work today in 99.99% of implementations, and very old low-bit encryption cannot be expected to withstand today's attackers. So if PDF, as a "document of record", is deprecating something, then there is a very good reason for it.

@u-fischer
Author

@petervwyatt well, the veraPDF checker also reports an error if I add a third class name to the catalog:

 <errorMessage>Catalog contains entry(ies) XXBlub</errorMessage>

I would say that is clearly wrong.

In the case of my second error, "ActionGoTo shall not contain entries SD in PDF 1.7.": I'm not trying to use a deprecated feature in a 2.0 document, but rather a new feature in a PDF 1.7 document. I'm also not trying to use the SD key for private data; I use it in the way it is intended in PDF 2.0. So the question is whether the PDF 1.7 spec disallows this, and whether the checker should report an error or only a warning.

Side remark: I would love to simply force PDF 2.0 in all documents. But as long as support in readers and accessibility checkers regarding the tagging is sketchy and PDF/UA-2 is not released, we have to support older PDF versions. Structure destinations improve their accessibility too, even if they are not mentioned in the 1.7 spec, and it would be a pity to have to drop the SD key because of errors from an Arlington checker (warnings are fine ...).

@bdoubrov
Contributor

The veraPDF implementation of the Arlington model does not (yet) support 2nd and 3rd class names. It just follows the Arlington rules defining the permitted 1st class names and reports all other keys as deviations from the model.

@ousia Whether the presence of undefined 1st class keys is a warning or an error is a question of policy. I can imagine some workflows where having such keys is fine, and some others, where a more secure policy is preferred.
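One way to picture "a question of policy" is a small table that maps each kind of finding to a severity, chosen per workflow. The sketch below is purely hypothetical: the policy names, finding kinds and the `report` function are invented for illustration and do not correspond to any veraPDF configuration option.

```python
# Hypothetical policies: finding kind -> severity ("error", "warning", "ignore").
POLICIES = {
    "lenient": {"unknown_first_class": "warning", "private_name": "ignore"},
    "strict": {"unknown_first_class": "error", "private_name": "warning"},
}


def report(findings, policy_name):
    """Turn (key, kind) findings into messages according to the chosen policy."""
    policy = POLICIES[policy_name]
    messages = []
    for key, kind in findings:
        severity = policy[kind]
        if severity == "ignore":
            continue  # this policy tolerates the finding silently
        messages.append(f"{severity}: dictionary contains entry {key} ({kind})")
    return messages


if __name__ == "__main__":
    findings = [("Blub", "unknown_first_class"), ("XXBlub", "private_name")]
    for name in ("lenient", "strict"):
        print(name, report(findings, name))
```

The same set of findings then yields only warnings in a tolerant workflow and errors in a stricter one, without changing the detection logic.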

@u-fischer your case of the SD entry in GoTo actions in PDF 1.7 documents is even more special. As far as I know, most processors would simply ignore the PDF version in the header (or in the Version entry of the Catalog). So, if they support structure destinations for PDF 2.0, they most likely will support them in PDF 1.7.

@bdoubrov added the arlington label (Issues related to veraPDF Arlington model implementation) on Jul 4, 2023
@bdoubrov
Contributor

bdoubrov commented Jul 7, 2023

Support for 2nd and 3rd class names has been added to the latest dev build 1.25.14: https://software.verapdf.org/develop/arlington/1.25/verapdf-arlington-1.25.14-installer.zip

All other unknown keys are still reported as deviations from the standard.
