Add content and processing requirements for machine-readable toc #371

mattgarrish · 2018-11-29T13:40:16Z

This PR contains an adaptation of the EPUB toc rules, but simplified in the following ways:

- either ol or ul can be used to construct the list of links, and can be used interchangeably throughout the toc
- the content model is described in terms of what elements get processed, so other tagging can be included to decorate or annotate the toc (it's just ignored when processing)
- the use of a span element is dropped - an a tag without an href attribute can be used for unlinked labels
- aria-label is required for links whose text content is not comprehensible due to images and/or embedded content (removes the more complex rules about title/alt attributes, etc.)
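
As a rough illustration of the simplified model above, the sketch below collects the links from a toc while ignoring decorative tagging. The sample markup, the `LinkCollector` class, and the file names are all hypothetical; this is not the algorithm from the PR, only a minimal reading of the listed rules.

```python
from html.parser import HTMLParser

# Hypothetical sample toc: mixes ol and ul, includes decorative tagging,
# and uses an a element without href as an unlinked label.
SAMPLE_TOC = """
<nav role="doc-toc">
  <h2>Contents</h2>
  <ol>
    <li><a href="c1.html">Chapter <em>One</em></a></li>
    <li><a>Part Two</a>
      <ul>
        <li><a href="c2.html">Chapter Two</a> <small>(new)</small></li>
      </ul>
    </li>
  </ol>
</nav>
"""


class LinkCollector(HTMLParser):
    """Collects (label, href) pairs from a elements; other tagging is ignored."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._href = None
        self._text = None  # None while outside an a element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")  # None for unlinked labels
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._text is not None:
            # Concatenate child text and normalize white space.
            label = " ".join("".join(self._text).split())
            self.entries.append((label, self._href))
            self._text = None


collector = LinkCollector()
collector.feed(SAMPLE_TOC)
```

Here `collector.entries` holds one pair per `a` element, with `None` as the href of the unlinked "Part Two" label; the heading and the `small` annotation are skipped because they sit outside any `a` element.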

Feedback is, as always, welcome from everyone on these changes.

…or a machine-readable table of contents - adds a reference to the new appendix from section 4.7.3. - removes the explicit WebIDL entry for TOC (and its acolytes), in line with the proposal in #338

iherman · 2018-11-29T13:48:26Z

Just a note: as recorded in the minutes of the 26th of Nov., the question of whether a TOC in JSON should be added to WP will be considered separately. This PR only covers the resolutions of that meeting.

llemeurfr · 2018-11-29T14:43:54Z

Relaxing the ol/ul rule is IMO the most useful change; we have heard about many EPUB files with this issue.

The other changes also seem OK, but I have an issue with backward compatibility with EPUB: the WP spec constrains processing a without href instead of a top heading (h*) and spans. This means that EPUB nav docs containing such headings will not be entirely processed (the content of h* and span will be skipped).

Re. details, in the ol/ul spec I see "in this order" -> li (one elt type only), which is not useful.

mattgarrish · 2018-11-29T15:08:09Z

> the WP spec constrains processing a without href instead of a top heading (h*) and spans

You can still provide a top-level heading. That's mentioned both in the content model and in the processing (see the first bullet about obtaining a title for the toc).

But spans are problematic to keep if we're loosening the content model and allowing any other tagging to be included. A span could appear for any number of reasons not related to the text label to use, while an a tag is always the link, or possible link.

It's also technically not incompatible with EPUB, as you can already use an a without an href instead of a span. So content can be back-translated without any issues if you strictly follow the content model. You're right that creating a wpub out of an EPUB 3 will require some changes, but that's true regardless. There's still a reasonable path forward: you just have to transform the span tags to a.
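
The span-to-a transform mentioned above could look roughly like the following sketch. It assumes a well-formed XML nav fragment without namespaces (real EPUB navigation documents are namespaced XHTML, so the tag names would need the XHTML namespace prefix); `spans_to_anchors` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical EPUB-style nav fragment using a span as an unlinked heading.
EPUB_NAV = (
    "<nav><ol>"
    "<li><span>Part One</span>"
    '<ol><li><a href="c1.xhtml">Chapter 1</a></li></ol>'
    "</li></ol></nav>"
)


def spans_to_anchors(nav_xml: str) -> str:
    """Rename span children of li elements to a elements without an href."""
    root = ET.fromstring(nav_xml)
    for li in root.iter("li"):
        for child in li:
            if child.tag == "span":
                child.tag = "a"  # content and attributes are left untouched
    return ET.tostring(root, encoding="unicode")


converted = spans_to_anchors(EPUB_NAV)
```

Only spans that are direct children of a list item are renamed, so spans used for styling inside a label would be left alone.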

mattgarrish · 2018-11-29T15:13:28Z

> Re. details, in the ol/ul spec I see "in this order" -> li (one elt type only), which is not useful.

Ya, that's odd. It's in the EPUB spec that way, but I can strip "in this order".

llemeurfr · 2018-11-29T15:20:03Z

@mattgarrish ok for the first issue (top-level heading), I missed it when reading.

For the second point, I agree that a simple transform (span to a) makes a round trip between EPUB and WP possible. It must be noted that from WP to EPUB 3, the extra / non-processable HTML information will have to be removed from the markup.

Therefore I'm ready to approve the PR.

rdeltour

I find the approach a bit too ambiguous: defining both the content model and the processing of it as normative makes the whole processing model quite unclear IMO.

I would prefer a more solid "User Agent Processing" section, ideally backed by well-defined terminology (the one used in DOM), followed by an "HTML Structure" section that is purely informative guidance for authors.

rdeltour · 2018-11-29T16:07:53Z

index.html

+						elements</a>), user agents can easily differentiate the information they need from any
+					peripheral content (asides) or stylistic tagging that has also been added. The table of contents can
+					consist of both active links (with an <code>href</code> attribute) and inactive links (excluding the
+						<code>href</code> attribute), providing additional flexibility in how the table of contents is


I don't think the terminology "active link"/"inactive link" is well-chosen; strictly speaking, for HTML, an a element without an href attribute does not represent a hyperlink.

Yes, but the html spec wasn't terribly helpful. All it calls them is "placeholders" for links.

rdeltour · 2018-11-29T16:11:01Z

index.html

+				<h3>HTML Structure</h3>
+
+				<p>To optimize the machine processing of an HTML table of contents by user agents, authors SHOULD adhere
+					to the following structuring guidelines.</p>


If the UA processing section is well defined, this entire section can be made informative, no? I understand the usefulness of author guidance, but do we really need conformance statements?

It's definitely duplicative.

rdeltour · 2018-11-29T16:13:51Z

index.html

+							<dt><a href="https://www.w3.org/TR/html/sections.html#the-nav-element"
+								><code>nav</code></a></dt>
+							<dd>
+								<p>A <code>role</code> attribute with the value <code>doc-toc</code> is REQUIRED</p>


(this comment only holds if we really want to put this as conformance statements)

what's the meaning of REQUIRED when this whole content model is a big SHOULD?

I don't think following the suggested practice being recommended has a bearing on there being requirements when you do follow it.

rdeltour · 2018-11-29T16:16:43Z

index.html

+								<ul class="nomark">
+									<li><a href="https://www.w3.org/TR/html/dom.html#heading-content"><code>HTML Heading
+												content</code></a>
+										<code>[0 or 1]</code></li>


what does [0 or 1] mean when we say that other elements or content are allowed? wouldn't it always implicitly be [0 or more], i.e. [optional]?

This is the "processed model". It's only what you have to include. That's why I put the note at the start that says any other content is allowed. The full content model is the nav element's content model, but that's entirely unhelpful for authoring.

rdeltour · 2018-11-29T16:20:48Z

index.html

+					<li>
+						<p>The <code>a</code> element MUST provide a non-zero length text label after concatenation of
+							all child content and application of white space normalization rules. If a meaningful label
+							cannot be constructed from the text content of the <code>a</code> element &#8212; for


"meaningful" isn't well defined, which makes this statement ambiguous. Besides, doesn't that contradict the MUST above? ("it MUST be non-empty, but when it is there MUST be an ARIA label")

Doesn't contradict, no, as you can have a non-zero label that is unintelligible if it depends on visuals, embedded content, etc. But I think we can merge the two: either a non-zero label or an aria-label, and also an aria-label when the simple concatenation of the text content would lead to an incomplete representation of the link.
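
One way to read the merged rule is sketched below: prefer a non-empty aria-label, otherwise fall back to the white-space-normalized text content, and report when neither yields a label. The function name `link_label` and the choice to return `None` for a missing label are assumptions, not something this thread settles.

```python
from typing import Optional


def link_label(text_content: str, aria_label: Optional[str] = None) -> Optional[str]:
    """Return the label for a toc link, or None when no usable label exists."""
    # Assumption: a non-empty aria-label takes precedence over the text content.
    if aria_label is not None and aria_label.strip():
        return " ".join(aria_label.split())
    # Otherwise normalize white space in the concatenated text content.
    normalized = " ".join(text_content.split())
    return normalized or None
```

For instance, `link_label("  Chapter \n 1 ")` gives `"Chapter 1"`, while a link containing only an image and carrying `aria-label="Cover image"` gets its label from the attribute.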

rdeltour · 2018-11-29T16:42:37Z

index.html

+					tree of links can be extracted from it. The root list within the <code>nav</code> element represents
+					the root of the tree, and each list item is either a branch (if it contains a sub-list) or a leaf
+					(if it contains only a link). The following algorithm describes how to construct this tree. (It does
+					not define how the extracted table of contents is presented to users.)</p>


I suggest we reuse DOM's tree definition and the Infra standard which clearly define all these terms (tree, descendant, ordered set, etc).

rdeltour · 2018-11-29T16:47:54Z

index.html

+						of <a href="https://www.w3.org/TR/html/dom.html#heading-content">heading content</a>.</li>
+					<li>Locate the first descendant list element (<code>ol</code> or <code>ul</code>) of the
+							<code>nav</code> element. This list is used to produce the hierarchical tree of links. Any
+						subsequent lists MUST be ignored.</li>


> This list is used to produce the hierarchical tree of links.

the algorithm language could be made a bit more explicit, using constructs like variables and assignments ( "let toc be an empty tree", "append current branch to the children of parent branch", etc, etc.)
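
To make the suggestion concrete, here is a minimal sketch of the extraction written with explicit variables and assignments. It assumes well-formed markup for the nav element as input, takes the first `a` in each list item as its label, and ignores every list after the first root list; the class name, the dict keys, and the sample markup are hypothetical and this is not the algorithm later pushed to the PR.

```python
from html.parser import HTMLParser


class TocTreeBuilder(HTMLParser):
    """Builds a tree of {'label', 'href', 'children'} dicts from the first list."""

    def __init__(self):
        super().__init__()
        self.toc = []          # let toc be an empty tree (list of root branches)
        self._branches = []    # stack of list items that are currently open
        self._list_depth = 0   # nesting depth inside the list being processed
        self._done = False     # set once the first root list has closed
        self._text = None      # text buffer while inside an a element

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag in ("ol", "ul"):
            self._list_depth += 1
        elif tag == "li" and self._list_depth:
            branch = {"label": None, "href": None, "children": []}
            # append current branch to the children of the parent branch
            parent = self._branches[-1]["children"] if self._branches else self.toc
            parent.append(branch)
            self._branches.append(branch)
        elif tag == "a" and self._branches and self._branches[-1]["label"] is None:
            self._branches[-1]["href"] = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "a" and self._text is not None:
            self._branches[-1]["label"] = " ".join("".join(self._text).split())
            self._text = None
        elif tag == "li" and self._branches:
            self._branches.pop()
        elif tag in ("ol", "ul") and self._list_depth:
            self._list_depth -= 1
            if self._list_depth == 0:
                self._done = True  # any subsequent lists are ignored


builder = TocTreeBuilder()
builder.feed(
    '<nav role="doc-toc"><h2>Contents</h2>'
    '<ol><li><a>Part One</a><ul><li><a href="c1.html">Chapter 1</a></li></ul></li>'
    '<li><a href="c2.html">Chapter 2</a></li></ol>'
    '<ol><li><a href="x.html">Ignored</a></li></ol></nav>'
)
```

After the call, `builder.toc` has two root branches: "Part One" (no href, one child) and "Chapter 2". The second `ol` contributes nothing.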

rdeltour · 2018-11-29T16:50:17Z

index.html

+										value of the attribute as the label.</li>
+									<li>Otherwise, use the <a href="https://www.w3.org/TR/domcore/#dom-node-textcontent"
+											>text content</a> [[!DOM4]] of the element as the label.</li>
+								</ul>


same comment as above: what if we have a link with a described image?

rdeltour · 2018-11-29T16:51:50Z

index.html

+							<li>If the <code>a</code> item has an <code>href</code> attribute, and the destination of
+								IRI contained in the attribute is a resource in the <a>default reading order</a> or
+									<code>resource list</code>, store the IRI. Otherwise, the node is not linkable and
+								no IRI is associated with it.</li>


we also need to cover deep links (the URL points to some location within a resource in the reading order)
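
A possible way to handle deep links is to strip the fragment before checking membership, as in the sketch below. The function name, the `known_resources` set (standing in for the default reading order plus resource list), and the example URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def resolve_toc_target(href, base_url, known_resources):
    """Return the absolute URL for a toc link, or None if it is not linkable.

    known_resources is assumed to be a set of absolute URLs taken from the
    default reading order and the resource list.
    """
    if href is None:
        return None  # a without href: no URL is associated with the entry
    absolute = urljoin(base_url, href)
    # Compare without the fragment so that deep links into a resource match.
    resource, _fragment = urldefrag(absolute)
    return absolute if resource in known_resources else None


BASE = "https://example.org/book/nav.html"
KNOWN = {"https://example.org/book/c1.html"}
deep = resolve_toc_target("c1.html#sec2", BASE, KNOWN)
outside = resolve_toc_target("https://other.example/x.html", BASE, KNOWN)
```

The stored value keeps the fragment, so a reader could still navigate to the location inside the resource.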

rdeltour · 2018-11-29T16:53:19Z

index.html

+				</ol>
+
+				<p>If the table of contents <code>nav</code> uses a different content model, processing of it to obtain
+					a hierarchy of links is OPTIONAL. In such cases, the user agent MAY simply extract all


I'm not a fan of defining this section based on the content model; it would be much clearer and less error-prone the other way around IMO. The above text basically says to UA "you should validate the content model", but the spec doesn't clearly define how.

mattgarrish · 2018-11-29T18:44:12Z

I'll see what I can do to clean up the processing @rdeltour. Then maybe we can get the group to decide on whether to keep the content model stuff or just reduce it to explanation and examples.

avneeshsingh · 2018-11-30T07:32:50Z

Curious to know the use case for providing anchor without meaningful href.

mattgarrish · 2018-11-30T12:36:43Z

> Curious to know the use case for providing anchor without meaningful href.

The two probably most relevant for publications:

- a grouping heading that is not explicitly in the content - could be linked to but would go to the same location as the first child link
- a preview with a complete table of contents - links to inaccessible content would not have an href

There are other cases that aren't particularly relevant to a machine-readable table of contents, like removing the href from the link to the page the user is currently on so that they aren't presented with a redundant link.

danielweck · 2018-12-05T16:17:44Z

The "Children's Literature" EPUB3 sample has such a Navigation Document, with some list items used as "containers" (each with a span "heading") for sub-lists of items:
https://github.com/IDPF/epub3-samples/blob/master/30/childrens-literature/EPUB/nav.xhtml#L24

TzviyaSiegman · 2018-12-05T17:12:30Z

@rdeltour made excellent suggestions. This looks really good.

mattgarrish · 2018-12-05T18:53:26Z

Sorry, I should have mentioned that we've been working on a complete overhaul to use a model like the one defined for the outline algorithm (not to compile a toc from headings or anything like that, but similar in terms of walking over the nodes of the toc nav to extract what is needed).

I have that plus a working model of the algorithm almost ready for review, so will hopefully be able to update this PR either tonight or tomorrow.

The revisions are ongoing in the toc-algo branch, for those interested, but will be merged into this branch: https://cdn.staticaly.com/gh/w3c/wpub/toc-algo/index.html?env=dev#app-toc-ua

mattgarrish · 2018-12-06T13:35:24Z

Further to my last comment, I've now pushed the new algorithm for extracting the table of contents:
https://cdn.staticaly.com/gh/w3c/wpub/machine-processable-toc/?env=dev#app-toc-ua

The working implementation of this model is at:
https://cdn.staticaly.com/gh/w3c/wpub/machine-processable-toc/experiments/toc_generator/
(The source is annotated to help match up the javascript to the specification.)

At this point, what we'd like to get is feedback on the technical approach, including if there are any obvious bugs in the algorithm. (Especially from anyone who has experience in this kind of extraction, if it needs saying. @danielweck?)

Content issues, like the structure of the table of contents, can be taken up later. This update is just a new implementation of the basic list structure parsing we've already agreed on.

iherman · 2018-12-06T14:03:54Z

(For some reason, the 'preview' of the original PR has not been updated with the new commit. Look at the date of the draft; it should say 6th of December!)

mattgarrish · 2018-12-06T14:22:09Z

Hm, looks like I committed to the wrong branch. I wonder what toc-updates was for?

… adapts the dom-walk approach used in the html outline algorithm

mattgarrish · 2018-12-06T14:33:53Z

Yup, definitely a case of pebkac. That was a stale branch with my pre-squashed work the first time around. Need better housekeeping.

Anyway, the preview looks like it's updated now, but the direct links are:

Specification: https://cdn.staticaly.com/gh/w3c/wpub/machine-processable-toc/?env=dev#app-toc-ua
Working model: https://cdn.staticaly.com/gh/w3c/wpub/machine-processable-toc/experiments/toc_generator/?env=dev

(I'll adjust the urls in the previous comment and get to deleting branches.)

iherman · 2018-12-06T20:47:33Z

A minor extension: I wonder if keeping, if available, the value of the rel and type attributes of the 'a' element in the generated object wouldn't be valuable, eg, if the target is a media object.

mattgarrish · 2018-12-06T21:14:39Z

> A minor extension: I wonder if keeping, if available, the value of the rel and type attributes of the 'a' element in the generated object wouldn't be valuable, eg, if the target is a media object.

Those are simple enough to add in, as it's just a couple of additional attributes to inspect on the a tags.
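
As a sketch of what inspecting those extra attributes might involve, the function below copies `rel` and `type` into a link object when they are present. The function name and the `name`/`url` keys are hypothetical; the actual object shape is defined by the specification text in this PR, not here.

```python
def link_object(label, attrs):
    """Build a toc link entry from a label and a dict of a-element attributes."""
    entry = {"name": label, "url": attrs.get("href")}
    if "rel" in attrs:
        # rel is a space-separated set of tokens in HTML
        entry["rel"] = attrs["rel"].split()
    if "type" in attrs:
        entry["type"] = attrs["type"]
    return entry


audio = link_object(
    "Audio", {"href": "a.mp3", "type": "audio/mpeg", "rel": "alternate  enclosure"}
)
```

Entries for links without these attributes simply omit the keys, so existing consumers of the object would not see any difference.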

rdeltour · 2018-12-07T07:56:05Z

Looks good to me!

There are still a few issues that are worth discussing further IMO:

- the elements to ignore in the toc (currently all sectioning content, sectioning root, and hidden elements)
- how to tell UA what to do when there is no label for a ToC item (null name property or placeholder name?)
- what to do with the "content model" section (keep it normative or make it purely informative?)

But all these can be treated in separate issues 🙂

iherman · 2018-12-07T08:01:52Z

> But all these can be treated in separate issues

In my view, s/can/should/ :-)

(It would be important to have a consistent version in the main branch, allowing further discussions...)

rdeltour · 2018-12-07T08:05:54Z

> In my view, s/can/should/ :-)

oh yes, I wasn't aware GitHub comments were also subject to RFC2119 conformance 😁

mattgarrish · 2018-12-07T18:49:04Z

In the interest of moving on to more specific issues, and as there hasn't been any new negative feedback, I'm going to merge this PR now.

- adds a new appendix detailing content and processing requirements f…

fba7e85

…or a machine-readable table of contents - adds a reference to the new appendix from section 4.7.3. - removes the explicit WebIDL entry for TOC (and its acolytes), in line with the proposal in #338

mattgarrish requested review from BigBlueHat, HadrienGardeur, deborahgu, iherman, rdeltour, TzviyaSiegman, dauwhe, GeorgeKerscher, GarthConboy, wareid, avneeshsingh and llemeurfr November 29, 2018 13:40

iherman approved these changes Nov 29, 2018

dauwhe approved these changes Nov 29, 2018

remove redundant ordering statement from toc ol/ul definition

29972cf

llemeurfr approved these changes Nov 29, 2018

rdeltour requested changes Nov 29, 2018

TzviyaSiegman approved these changes Dec 5, 2018

updated algorithm for extracting the table of contents from the nav -…

75fde2f

… adapts the dom-walk approach used in the html outline algorithm

cosmetic tweaks

59db155

add type and rel to toc link objects

86935ff

rdeltour approved these changes Dec 7, 2018

mattgarrish merged commit 3899acd into master Dec 7, 2018

mattgarrish deleted the machine-processable-toc branch December 7, 2018 18:52

This was referenced Dec 8, 2018

Simplifying the WebIDL #338

Closed

Do we need a more detailed definition for the HTML TOC format? #291

Closed
