PDF/UA-2 test issues #1413

Closed
faceless2 opened this issue Feb 22, 2024 · 9 comments


faceless2 commented Feb 22, 2024

Some issues we've found which hopefully won't need much explanation, so I've lumped them into one.

  • Most of the tests have an open action which jumps to the first page, but page-based destinations are disallowed in PDF/UA according to section 8.8.8.
  • All the tests have an Info dictionary containing ModDate and CreationDate, which are deprecated and so disallowed under 6.2.
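The first bullet can be turned into a mechanical check. The sketch below is only an illustration: it models the document catalog as plain Python dicts and lists rather than using a real PDF library, and it assumes that an acceptable GoTo action is one that carries an `SD` (structure destination) entry as defined in ISO 32000-2. The function name `open_action_problem` is hypothetical.

```python
def open_action_problem(catalog):
    """Return a short description of the problem, or None if nothing is flagged.

    `catalog` is a plain dict standing in for a parsed document catalog
    (an assumption made for this sketch, not a real library object).
    """
    action = catalog.get("OpenAction")
    if action is None:
        return None
    # A bare array is an explicit destination, e.g. [page_ref, "Fit"].
    if isinstance(action, list):
        return "OpenAction is a page-based destination array"
    # A GoTo action is flagged unless it also names a structure destination.
    if action.get("S") == "GoTo" and "SD" not in action:
        return "GoTo action has no structure destination (SD)"
    return None
```

For example, `open_action_problem({"OpenAction": ["1 0 R", "Fit"]})` reports the page-based array, while a GoTo action that also has an `SD` entry is not flagged.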

And some specific issues

PDF_UA-2/8.2 Logical structure/8.2.2 Real content/8.2.2-t01-fail-a.pdf
Object 12.0, the StructElem for the Document, doesn't have a "P" pointer back to StructTreeRoot (required in table 355)

PDF_UA-2/8.4 Text representation for content/8.4.3 Replacements and alternatives for text/8.4.3-t03-fail-a.pdf
This has an invalid Alt tag on an MCID inside a Pattern. But MCIDs inside a pattern can never be made visible in the StructureTree, because of the requirement that every MCID has one parent - patterns may be reused everywhere. The same statement would apply to MCIDs in Type3 fonts and mask XObjects. Section 8.4 applies to "Text representation for content" - this isn't content.

PDF_UA-2/8.5 Real content without textual semantics/8.5.1 General/8.5.1-t01-fail-a.pdf
This one has me puzzled. "line art content is not marked by a Figure" - the only vector operations I can see in there are the setting of the clip rectangle (definitely not line art, as it doesn't mark the page) and the highlight in the highlight annotation. Is it the highlight? That's already tagged with /Annot, which is semantically appropriate. We even have note 5 under 8.2.2: "Unlike PDF/UA-1, this document clearly specifies that the use of images or vector-based drawings does not always require a Figure structure element".

PDF_UA-2/8.9 Annotations/8.9.2 Semantics and content/8.9.2.1 General/8.9.2.1-t01-fail-a.pdf
Annotation object 3.0 has StructParent 3, but item 3 in the StructTreeRoot.ParentTree is not an OBJR referencing that annotation.
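The consistency check behind this last point can be sketched as follows. This is a minimal illustration using plain Python dicts in place of parsed PDF objects; the helper name, the object-id strings and the dict layout are all assumptions, not veraPDF's or any library's API. It follows the ISO 32000-2 reading that the ParentTree entry for a StructParent key is the structure element holding the object reference (OBJR) to the annotation.

```python
def parent_tree_points_back(annot_id, annot, parent_tree, struct_elems):
    """Check that ParentTree[StructParent] leads back to the annotation.

    annot_id     -- hypothetical object id string such as "3 0"
    annot        -- dict for the annotation, with a "StructParent" key
    parent_tree  -- dict mapping integer keys to structure element ids
    struct_elems -- dict mapping structure element ids to their dicts
    """
    key = annot.get("StructParent")
    if key is None:
        return False
    elem = struct_elems.get(parent_tree.get(key))
    if elem is None:
        return False
    # The parent element must have an OBJR kid whose Obj is this annotation.
    return any(
        isinstance(kid, dict)
        and kid.get("Type") == "OBJR"
        and kid.get("Obj") == annot_id
        for kid in elem.get("K", [])
    )
```

A file like the one described above would fail this check because the element found under key 3 has no OBJR kid pointing at object 3.0.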

MaximPlusov self-assigned this Feb 22, 2024
@faceless2
Author

I have to follow myself up. Although the ISO14289-predis I have here still says this in 6.2:

A file shall not contain any feature that is deprecated in ISO 32000-2

I recall that we agreed this wasn't the intention and that Info is allowed. To quote an email exchange with Duff on this just yesterday:

The /Info IS “allowed”… “Deprecated” does not mean “not allowed”… see clause 3.15 in 32k-2.

So for the ModDate and CreationDate point above, it's our validator that needs to change, not your files.


DuffJohnson commented Feb 22, 2024 via email

@bdoubrov
Contributor

Thanks, @faceless2 ! Most of these were fixed and already merged. Two remaining questions / comments:

PDF_UA-2/8.4 Text representation for content/8.4.3 Replacements and alternatives for text/8.4.3-t03-fail-a.pdf
This has an invalid Alt tag on an MCID inside a Pattern. But MCIDs inside a pattern can never be made visible in the StructureTree, because of the requirement that every MCID has one parent - patterns may be reused everywhere. The same statement would apply to MCIDs in Type3 fonts and mask XObjects. Section 8.4 applies to "Text representation for content" - this isn't content.

I don't see any Pattern objects in this test. Alt tag occurs in the /Form MCID inside annotation appearance. I'd say, PUA would not be allowed here.

PDF_UA-2/8.5 Real content without textual semantics/8.5.1 General/8.5.1-t01-fail-a.pdf
This one has me puzzled. "line art content is not marked by a Figure" - the only vector operations I can see in there are the setting of the clip rectangle - definitely not line art, it doesn't mark the page - or the highlight in the highlight annotation. Is it the highlight? That's already tagged with /Annot, which is semantically appropriate. We even have note 5 under 8.2.2: "Unlike PDF/UA-1, this document clearly specifies that the use of images or vector-based drawings does not always require a Figure structure

Indeed, we have taken 8.5.1 as a machine requirement: any non-textual content shall be marked as a Figure or a Formula. But I see that the discussion on the PDF/UA TWG mailing list tends to agree that this is an author's choice, and so a human test rather than a machine one. I'll wait till the next PDF/UA TWG call to reconfirm this.

@faceless2
Author

(Whoops, I realise I should have filed this issue on veraPDF-corpus.)

I don't see any Pattern objects in this test. Alt tag occurs in the /Form MCID inside annotation appearance. I'd say, PUA would not be allowed here.

Sorry, my error - yes it's an annotation. But actually the situation is almost the same.

That MCID is never referenced from the StructureTree - it's within an annotation, and annotations are effectively "black boxes" to the StructureTree - their content does not add nodes to the tree. This applies to any item with "StructParent" rather than "StructParents", like that annotation, because with StructParent it IS a content item - it doesn't CONTAIN content items.

Quoting part of table 359:

An object may be either a content item in its entirety or a container for marked-content sequences that are content items, but not both

So as that MCID is not in a "container for marked content sequences" that is referenced from the Structure Tree, it doesn't count as content. This argument also applies to any XObject with a StructParent, rather than StructParents - it has to be this way, because such an XObject could be in the tree multiple times.
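The distinction drawn from table 359 can be sketched in a few lines. This is a hypothetical illustration that uses plain Python dicts in place of parsed PDF objects; the function names `structure_role` and `mcids_count_as_content` are made up for the example and do not come from any validator.

```python
def structure_role(obj):
    """Classify an object by which structure-access key its dict carries."""
    has_item = "StructParent" in obj        # content item in its entirety
    has_container = "StructParents" in obj  # container for marked-content sequences
    if has_item and has_container:
        # The table 359 text quoted above says "but not both".
        raise ValueError("object has both StructParent and StructParents")
    if has_item:
        return "content item"
    if has_container:
        return "container"
    return "unreferenced"


def mcids_count_as_content(obj):
    """MCIDs inside obj can be content items only if obj is a container."""
    return structure_role(obj) == "container"
```

Under this reading, `mcids_count_as_content({"Subtype": "Highlight", "StructParent": 3})` is `False`: the annotation is itself the content item, so any MCID inside its appearance stream is not reachable from the structure tree.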

@bdoubrov
Contributor

bdoubrov commented Feb 26, 2024

@faceless2 yes, indeed I see additionally in PDF/UA-2, 8.9.2.1

ISO 32000-2 enables substructure within annotation appearance streams via marked content references. Files in conformity with this document shall not use marked content references to substructure annotation appearance streams (see ISO 32000-2:2020, Table 357).
NOTE 4 The effect of the above clause is to require that annotations are included as whole objects in a single structure element.

However, I believe that if ActualText is specified on any marked content sequence, whether or not that sequence is included in the structure tree, it shall not use PUA as per 8.4.3. I'm less certain about the Alt entry in a similar case. So we'll modify the test to have PUA present in ActualText.

@faceless2
Author

Well, I have to say I still disagree :-) Again, the spec text, this time from 8.4.3:

In all cases, where real content maps to Unicode PUA values, an ActualText or Alt entry shall be present.

These requirements only apply to "real content": they wouldn't apply to an Artifact, and they also wouldn't apply to a pattern, the internals of a Type 3 font or an annotation. We can be certain those are not "real content" because if they were they would have to be reachable from the Structure Tree, and they're not.

If I still haven't convinced you then I think we'll need to bounce this to the PDF/UA TWG
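The reading of 8.4.3 argued for above can be sketched as a small check. This is only an illustration of that interpretation, not veraPDF's implementation: the function names are hypothetical, the caller is assumed to already know whether the text is real content, and the marked-content properties are modelled as a plain dict. The code point ranges are the three Unicode Private Use Areas.

```python
# Unicode Private Use Areas: BMP PUA, Supplementary PUA-A, Supplementary PUA-B.
PUA_RANGES = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))


def has_pua(text):
    """Return True if any character of text lies in a Private Use Area."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in PUA_RANGES)


def missing_replacement_text(text, is_real_content, props):
    """Return True when the quoted sentence from 8.4.3 would be violated.

    Text that is not real content (an artifact, or anything not reachable
    from the structure tree) is never flagged under this reading.
    """
    if not is_real_content or not has_pua(text):
        return False
    return "ActualText" not in props and "Alt" not in props
```

With this sketch, PUA text inside an annotation appearance stream would be passed with `is_real_content=False` and so would never be reported, which is the behaviour being argued for in this comment.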

@bdoubrov
Contributor

I'll post a message to the PDF/UA TWG

MaximPlusov added this to the 1.26 milestone Mar 19, 2024
@bdoubrov
Contributor

As per the discussion at the PDF/UA TWG, processing of the Alt and ActualText properties of marked content sequences within annotation appearances, patterns and Type3 font glyphs doesn't make any sense and is therefore disabled in veraPDF.

The corresponding test files are removed from the corpus to avoid confusion.

@MaximPlusov
Contributor

Included in release 1.26.
