Ambiguity regarding tab (U+0009) processing in significant whitespace. #235

Closed
skynavga opened this issue Apr 1, 2017 · 5 comments

@skynavga
Contributor

skynavga commented Apr 1, 2017

The processing of "significant" whitespace by an XML application requires [1] that all non-markup characters be passed to the application, and, further, that the xml:space attribute, if declared, may be used by an author to signal the application as to whether (1) default application whitespace processing applies or (2) that whitespace should be preserved (by the application, as defined by the application).

In TTML the attribute xml:space is "declared", and its semantics are mapped [2] to XSL-FO style properties [3], specifically: suppress-at-line-break, linefeed-treatment, white-space-collapse, and white-space-treatment. These properties are intended to reflect the semantics of the CSS2 white-space property [4], but at a finer level of functional granularity.

Now, in the course of TTML implementation activity, it has been asked what the behavior should be regarding an element to which xml:space="default" applies and whose content is, for example:

<span>&#x0009;X</span>

namely, a single HORIZONTAL TAB (U+0009) character followed by a single 'X' character.

The particular question is whether the HORIZONTAL TAB (U+0009) character should:

  1. be mapped to the SPACE (U+0020) character;
  2. not be mapped to the SPACE (U+0020) character, but in all other ways be treated as if it were a SPACE (U+0020) character;
  3. be neither mapped to nor treated as a SPACE (U+0020) character, e.g., be retained throughout presentation processing and eventually be mapped to a glyph in the applicable font, e.g., by using the font's CMAP (or equivalent) to map HORIZONTAL TAB (U+0009) to a glyph that specifies a definite width (advance)?

If the answer to this question is that it should be mapped to the SPACE (U+0020) character, then a secondary question arises as to when, i.e., during which processing step, should this mapping take place?

To untangle this subject, we will need to look at the original specification of CSS2 which defines the (default) initial normal value for the white-space property [5] as:

This value directs user agents to collapse sequences of whitespace, and break lines as necessary to fill line boxes.

and which, further, defines whitespace [6] as:

The token S in the grammar above stands for whitespace. Only the characters "space" (Unicode code 32), "tab" (9), "line feed" (10), "carriage return" (13), and "form feed" (12) can occur in whitespace. Other space-like characters, such as "em-space" (8195) and "ideographic space" (12288), are never part of whitespace.

Now, while whitespace is well defined here, and corresponds closely to the definition given in XML 1.1 [7] (whose S production does not include form feed), the phrase "collapse sequences of whitespace" is not well defined. CSS2.1 gives this phrase more substance by defining a whitespace processing model [8], an operational model that provides greater detail, including:

  1. If 'white-space' is set to 'normal', 'nowrap', or 'pre-line',
    (1) every tab (U+0009) is converted to a space (U+0020)
    (2) any space (U+0020) following another space (U+0020) — even a space before the inline, if that space also has 'white-space' set to 'normal', 'nowrap' or 'pre-line' — is removed.
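
As a rough illustration of the two quoted CSS2.1 steps, the following Python sketch converts tabs to spaces and then removes any space that follows another space. The function name and the `COLLAPSING_VALUES` set are illustrative only, and the sketch deliberately ignores the other steps of the CSS2.1 model (linefeed handling, line breaking, and spaces carried over from a preceding inline).

```python
import re

# 'white-space' values for which the quoted steps apply.
COLLAPSING_VALUES = {"normal", "nowrap", "pre-line"}


def css21_tab_and_space_steps(text: str, white_space: str = "normal") -> str:
    """Apply only the two quoted steps; all other CSS2.1 steps are omitted."""
    if white_space not in COLLAPSING_VALUES:
        return text
    # Step (1): every tab (U+0009) is converted to a space (U+0020).
    text = text.replace("\t", " ")
    # Step (2): any space following another space is removed.
    return re.sub(r" {2,}", " ", text)


if __name__ == "__main__":
    print(repr(css21_tab_and_space_steps("\tX")))
    print(repr(css21_tab_and_space_steps("a \t b")))
```

Under this reading, the tab in the example fragment becomes an ordinary space before any collapsing takes place.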

So, what is the problem with respect to TTML? TTML bases its definition of xml:space="default" semantics on XSL-FO 1.1, published in 2006, which is based on the original CSS2 that does not include the above clarifications found in CSS2.1. Furthermore, TTML bases the definition of xml:space="default" semantics on the XSL-FO definitions of the newly minted XSL-FO (but not CSS2) properties:

  • suppress-at-line-break="auto"
  • linefeed-treatment="treat-as-space"
  • white-space-collapse="true"
  • white-space-treatment="ignore-if-surrounding-linefeed"

where these values also happen to be the (default) initial values for these properties when they are not otherwise specified.

In contrast, XSL-FO defines white-space="normal" as

  • linefeed-treatment="treat-as-space"
  • white-space-collapse="true"
  • white-space-treatment="ignore-if-surrounding-linefeed"
  • wrap-option="wrap"

a definition which also happens to be implicitly dependent upon the suppress-at-line-break property, since the interpretation of white-space-treatment="ignore-if-surrounding-linefeed" depends upon the value of the suppress-at-line-break property.
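
To make the overlap between the two property sets concrete, here is a small Python sketch that compares them. The dictionary names are illustrative; the property names and values are the ones listed above.

```python
# Property values that TTML associates with xml:space="default".
XML_SPACE_DEFAULT = {
    "suppress-at-line-break": "auto",
    "linefeed-treatment": "treat-as-space",
    "white-space-collapse": "true",
    "white-space-treatment": "ignore-if-surrounding-linefeed",
}

# Property values that XSL-FO associates with white-space="normal".
WHITE_SPACE_NORMAL = {
    "linefeed-treatment": "treat-as-space",
    "white-space-collapse": "true",
    "white-space-treatment": "ignore-if-surrounding-linefeed",
    "wrap-option": "wrap",
}

# Properties present in both sets with identical values.
shared = {
    name: value
    for name, value in XML_SPACE_DEFAULT.items()
    if WHITE_SPACE_NORMAL.get(name) == value
}
only_in_default = set(XML_SPACE_DEFAULT) - set(WHITE_SPACE_NORMAL)
only_in_normal = set(WHITE_SPACE_NORMAL) - set(XML_SPACE_DEFAULT)

if __name__ == "__main__":
    print(sorted(shared))
    print(sorted(only_in_default), sorted(only_in_normal))
```

The three shared properties carry identical values; the sets differ only in suppress-at-line-break and wrap-option.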

Combining the default initial values of these properties with the definition of white-space="normal", we surmise that the default whitespace processing behavior for XSL-FO is intended to align with the default whitespace processing behavior of CSS2. However, a detailed reading of the semantics of this behavior raises a number of possible problems:

  1. the suppress-at-line-break property defines auto to suppress only the SPACE (U+0020) but not HORIZONTAL TAB (U+0009), and, further, explicitly states that all other characters are to be treated as if the value retain applies;
  2. nowhere in XSL-FO is there explicit language that would cause HORIZONTAL TAB (U+0009) to be mapped to SPACE (U+0020), although there is a vague reference to the possibility of such mapping taking place during refinement processing found under the definition of the white-space-collapse property where it is stated that:

after refinement, where some white space characters may have been discarded or turned into space characters, all remaining runs of two or more consecutive spaces are replaced by a single space, then any remaining space immediately adjacent to a remaining linefeed is also discarded.

To return to the example fragment of TTML cited above, absent a mapping from HORIZONTAL TAB (U+0009) to SPACE (U+0020), the whitespace processing behavior that applies to this fragment would seem to retain the HORIZONTAL TAB (U+0009) in

<span>&#x0009;X</span>

since, according to white-space-collapse="true", we have

  • &#x0009; is classified as white space in XML, and
  • &#x0009; is not &#x000A;, but
  • the immediately preceding flow object (before <fo:character character="&#x0009;"/>) is not a character flow object and the immediately following flow object is not a linefeed, i.e., <fo:character character="&#x000A;"/>

so the &#x0009; is not collapsed, i.e., it does generate a glyph area.
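
The reasoning above can be sketched in Python. This is only a model of the reading given in the bullets, not of the XSL-FO specification text: it assumes that a non-linefeed XML white space character is dropped when the preceding character is itself white space or the following character is a linefeed, and that no tab-to-space mapping has occurred. The function and constant names are hypothetical.

```python
# Characters classified as white space in XML.
XML_WHITE_SPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}
LINEFEED = "\u000A"


def collapse_without_tab_mapping(chars: str) -> str:
    """Model the bullets above; the exact collapse rule is an assumption."""
    kept = []
    for i, ch in enumerate(chars):
        prev = chars[i - 1] if i > 0 else None
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        collapsible = ch in XML_WHITE_SPACE and ch != LINEFEED
        # Assumed rule: drop the character if it follows white space
        # or immediately precedes a linefeed.
        if collapsible and (prev in XML_WHITE_SPACE or nxt == LINEFEED):
            continue
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    print(repr(collapse_without_tab_mapping("\tX")))
```

With the example content, the leading tab has no preceding character and is not followed by a linefeed, so it survives and would go on to generate a glyph area.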

But now, we have a problem since the (now elaborated) definition of normal whitespace processing behavior in CSS2.1 appears to call for every &#x0009; to be mapped to &#x0020; prior to performing white space collapsing behavior.

So, where does this leave us with respect to TTML? I believe we have two questions to resolve:

  1. Is &#x0009; mapped to &#x0020;? If so, then in what context and during which processing step?
  2. If there are contexts where &#x0009; is not mapped to &#x0020;, then what are the intended presentation semantics?

My answers to these questions are as follows:

  • When xml:space="default" applies, then &#x0009; is mapped to &#x0020; prior to performing any other white space processing. This mapping would ideally occur during or immediately after constructing the reduced XML infoset of a TTML abstract document instance.
  • When xml:space="preserve" applies, then &#x0009; is not mapped to &#x0020;, in which case the CSS2.1 semantics would apply, namely:

All tabs (U+0009) are rendered as a horizontal shift that lines up the start edge of the next glyph with the next tab stop. Tab stops occur at points that are multiples of 8 times the width of a space (U+0020) rendered in the block's font from the block's starting content edge.
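
A minimal Python sketch of the two proposed behaviors follows. The function names, the `pen_x` and `space_width` parameters, and the choice to advance to the next stop when the pen already sits on a stop are assumptions made for illustration; positions are measured from the block's starting content edge.

```python
import math


def map_tabs_for_xml_space(text: str, xml_space: str) -> str:
    """Proposed rule: map tab to space only when xml:space="default" applies."""
    if xml_space == "default":
        return text.replace("\t", " ")
    return text  # xml:space="preserve": tabs are retained


def next_tab_stop(pen_x: float, space_width: float, stops_every: int = 8) -> float:
    """Return the next tab stop after pen_x.

    Stops lie at multiples of 8 space widths, per the quoted CSS2.1 text.
    Assumption: a pen already on a stop advances to the following stop.
    """
    interval = stops_every * space_width
    return (math.floor(pen_x / interval) + 1) * interval


if __name__ == "__main__":
    print(repr(map_tabs_for_xml_space("\tX", "default")))
    print(repr(map_tabs_for_xml_space("\tX", "preserve")))
    print(next_tab_stop(0.0, 10.0))
```

For the example fragment under xml:space="preserve", with a space width of 10 units, the 'X' would start at 80 units from the content edge.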

Specification text that implements the above could easily be added to both TTML1 and TTML2, ideally under the definition of xml:space [9] (and its TTML2 counterpart).

I don't have a strong opinion about whether we should adopt the CSS2.1 presentation semantics for HORIZONTAL TAB in cases where xml:space="preserve" applies. Alternative semantics could be to ignore entirely (i.e., treat like ZERO WIDTH SPACE) or treat as SPACE.

[1] https://www.w3.org/TR/REC-xml/#sec-white-space
[2] https://www.w3.org/TR/ttml1/#content-attribute-space
[3] https://www.w3.org/TR/xsl/
[4] https://www.w3.org/TR/xsl/#d0e297
[5] https://www.w3.org/TR/1998/REC-CSS2-19980512/text.html#white-space-prop
[6] https://www.w3.org/TR/1998/REC-CSS2-19980512/syndata.html#whitespace
[7] https://www.w3.org/TR/REC-xml/#NT-S
[8] https://www.w3.org/TR/2011/REC-CSS2-20110607/text.html#white-space-model
[9] https://www.w3.org/TR/ttml1/#content-attribute-space

@skynavga
Contributor Author

I posted the following question to Steve Zilles and Tony Graham on 2017-03-31:

A question recently came up in the TTWG regarding XSL-FO whitespace handling semantics, upon which TTML is based. The specific question has to do with whether an HT (horizontal tab) character (U+0009) is mapped to SPACE (U+0020) during the XSL-FO refinement processing when xml:space="default" and, when it is not so mapped, i.e., when xml:space="preserve", then what are the intended formatting semantics.

Assuming that the following properties apply:

linefeed-treatment="treat-as-space"
suppress-at-line-break="auto"
white-space-collapse="true"
white-space-treatment="ignore-if-surrounding-linefeed"

and given an XSL-FO fragment

<fo:block><fo:inline>&#x0009;X</fo:inline></fo:block>

then

(1) is &#x0009; mapped to SPACE by the refinement process? if so, then where does XSL-FO specify this mapping occurs in refinement?

(2) if it is not mapped to SPACE, but the above properties apply, then is it retained or ignored before a line feed (line break)?

@skynavga skynavga self-assigned this Apr 20, 2017
@palemieux
Contributor

so the &#x0009; is not collapsed, i.e., it does generate a glyph area.

This is basically where TTML1 processing ends since Section 7.2.3 explicitly states "The semantics of the above four cited XSL-FO properties are defined by [XSL 1.1], § 7.17.3, 7.16.7, 7.16.12, and 7.16.8, respectively," and XSL 1.1 references CSS2.

Regardless of what is done in TTML2, it seems prudent to recommend that TTML1 authors not use TAB when xml:space="default" applies, given the ambiguity.

@skynavga
Contributor Author

@skynavga to open PR containing a note to resolve

@nigelmegitt
Contributor

nigelmegitt commented May 14, 2017

Noting that during the call on 2017-05-11 we tentatively agreed for TTML2 to map a tab character to a single space for presentation purposes, which does not match current CSS behaviour. The idea of mapping a tab to no presentation at all was rejected on security grounds since it could facilitate spoofing.

@skynavga skynavga modified the milestone: 3rd Ed Jul 6, 2017
@cconcolato
Contributor

Can someone clarify if this issue would have a different resolution in TTML2 and TTML1? Is there a TTML2 issue tracking it? And if TTML2's resolution is different, does this create an incompatibility between 1 and 2?
