Specify attribute term delimiter as post-normalized space #191

Closed
nigelmegitt opened this issue Sep 29, 2016 · 4 comments

Comments

@nigelmegitt
Contributor

See also #185 and #170 for background: the current use of <lwsp> permits white space even though XML attribute normalization would remove leading and trailing white space and replace intermediate strings of white space with a single #x20 character. My proposal for this was to replace <lwsp> with <nsp> where:

<nsp>: #x20 after applying the normalization rules in [1]

[1] https://www.w3.org/TR/REC-xml/#AVNormalize

Following the links from https://w3c.github.io/ttml2/spec/ttml2.html#reduced-infoset-attribute through the term definition and the reference into https://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.attribute , we already specify attribute values in terms of normalized values in the reduced infoset. The full range of <lwsp> is therefore rather difficult to achieve: anything other than a single #x20 character would have to be escaped. It is possible to escape those characters, but I do not know why that would be useful.
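
To make the escaping point concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree` (a non-validating parser built on expat). The element name `p` and attribute name `a` are made up for illustration. One assumption to keep in mind: with no DTD, the attribute is treated as CDATA, so the parser replaces each literal tab or line feed with #x20 but does not trim or collapse anything, and the same characters written as character references pass through unchanged.

```python
import xml.etree.ElementTree as ET


def attr_value(xml_text: str, name: str = "a") -> str:
    """Return the parser-normalized value of attribute `name` on the root element."""
    return ET.fromstring(xml_text).get(name)


# Literal tab and line feed inside the attribute value:
# attribute-value normalization replaces each one with #x20.
literal = attr_value('<p a="x\ty\nz"/>')

# The same characters written as character references:
# they are expanded but not replaced, so they reach the application.
escaped = attr_value('<p a="x&#9;y&#10;z"/>')

print(repr(literal))
print(repr(escaped))
```

Comparing the two printed values shows which white space characters a consumer of the normalized value still has to be prepared for.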

Some (non-mutually-exclusive) proposals to allow for simpler implementations:

  • Add an informative note that the processing of XML normalized attribute values may limit the type of character that could appear in linear white space.
  • Add feature designators to indicate that processors handle/do not handle escaped whitespace characters that pass through the normalization process, and that documents contain/do not contain such escaped whitespace characters.
  • Add an additional requirement to de-escape escaped whitespace characters prior to the XML attribute value normalization process so that the resulting information set never has leading or trailing whitespace and always has exactly one #x20 character between terms.
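
As a rough illustration of the outcome the third proposal is after, the sketch below collapses an attribute value so that terms are separated by exactly one #x20. Note that this is not the pre-normalization de-escaping step itself: it is a post-parse equivalent that an implementation could apply to the value it receives, and the helper names `split_terms` and `to_single_spaced` are hypothetical.

```python
import re
from typing import List

# The four XML white space characters: #x20, #x9, #xA, #xD.
_XML_WS = re.compile(r"[\x20\x09\x0A\x0D]+")


def split_terms(value: str) -> List[str]:
    """Split an attribute value into terms on runs of XML white space."""
    return [term for term in _XML_WS.split(value) if term]


def to_single_spaced(value: str) -> str:
    """Drop leading/trailing white space and join terms with one #x20."""
    return "\x20".join(split_terms(value))


print(repr(to_single_spaced("\t 10px \n 20px\r")))
```

The character class is spelled out instead of using `str.split()` with no arguments, because the latter also splits on Unicode white space that XML does not treat as white space.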
@skynavga
Collaborator

So, the facts are as follows:

  1. TTML is neutral to the concrete representation of a document instance, and merely recommends XML (in the absence of other requirements). Consequently, we can't say for certain that XML space normalization has occurred on attribute values prior to creating their counterpart in the reduced infoset.
  2. Even if XML is used, one can escape whitespace to avoid XML normalization.

In conclusion, we need to retain the current definition of <lwsp> and not refer to XML normalized space. Therefore, no action is required on this issue, so closing.

@nigelmegitt
Contributor Author

@skynavga this is a bit surprising. Firstly, we do require attribute value normalisation when constructing the XML infoset, independently of the concrete representation of the document instance, and secondly you seem not to have addressed the third proposal at the end of #191 (comment):

  • Add an additional requirement to de-escape escaped whitespace characters prior to the XML attribute value normalization process so that the resulting information set never has leading or trailing whitespace and always has exactly one #x20 character between terms.

This would ensure that implementations always get a consistent single #x20 between terms regardless of how the document is represented. In other words, we would have a processing model with less implementation complexity, because implementations would not need to handle a variety of white space scenarios. Is that not a good idea?

@nigelmegitt reopened this May 19, 2017
@skynavga
Collaborator

skynavga commented May 19, 2017

@nigelmegitt your proposal contradicts the algorithm specified in https://www.w3.org/TR/REC-xml/#AVNormalize

Furthermore, implementations do not assume that normalization applies to character references, which allow non-normalized whitespace to be inserted in attribute values. For example, TTV tests for whitespace padding around an attribute value and reports an error if it appears. Testing this verification process requires the ability to insert non-normalized whitespace in this context, which is done using character references. With your proposal, expanded character references would go through a second pass of normalization, which would prevent testing the padding detection.
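
The kind of padding check described above can be sketched as follows. This is not TTV's actual code, only a hypothetical Python illustration: `has_padding` is an invented helper, and the `begin` attribute on a bare `p` element is used purely as a sample. Character references are what put the padding into the parsed value.

```python
import xml.etree.ElementTree as ET

# The four XML white space characters: #x20, #x9, #xA, #xD.
XML_WS = "\x20\x09\x0A\x0D"


def has_padding(value: str) -> bool:
    """True if the value starts or ends with an XML white space character."""
    return bool(value) and (value[0] in XML_WS or value[-1] in XML_WS)


# Padding written as character references is not touched by
# attribute-value normalization, so the check can observe it.
padded = ET.fromstring('<p begin="&#9;1s&#10;"/>').get("begin")
clean = ET.fromstring('<p begin="1s"/>').get("begin")

for value in (padded, clean):
    status = "error: padded" if has_padding(value) else "ok"
    print(repr(value), status)
```

If character references were normalized a second time after expansion, the first test input could no longer be constructed, which is the concern raised here.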

I would suggest we limit changes to adding a note under B.3 [normalized value] that reminds the reader that XML normalization does not normalize character references and that, consequently, unnormalized XML whitespace characters &#9; (HT), &#10; (LF), &#13; (CR), and &#32; (SPACE) may appear in a [normalized value] item.

@nigelmegitt
Contributor Author

OK, I do not understand how a pre-processing algorithm can contradict a step that comes after it. On further reflection, though, most of what I am proposing here is about implementation optimisation, an area we don't need to define normatively.

Adding a note under B.3 as you suggest seems like the best way to go. I'll prepare a pull request.
