
(model) vague definition of character position for text position selector #206

Closed
mark-buer opened this issue Apr 28, 2016 · 19 comments

@mark-buer

The text position selector spec doesn't define the exact meaning of the term "character position".

There are many possible definitions. "character position" might be measured in units of UTF code points, UTF-8 code units, UTF-16 code units etc.

Interoperability issues will result if different implementations assume incompatible meanings.

@azaroth42
Collaborator

Related to this discussion:
https://lists.w3.org/Archives/Public/public-annotation/2015May/0003.html

We clearly need to be more explicit in the description.

@iherman
Member

iherman commented May 5, 2016

I wonder whether the section on normalization in the character model is relevant here (although that text is primarily aimed at matching). I.e., the content is supposed to be transformed into NFC, and the text position would then be understood against the result of that transformation.

Cc: @r12a @aphillips

@r12a

r12a commented May 5, 2016

I expect that the i18n WG will discuss this and provide a more formal answer. In the meantime, maybe this can help:
https://www.w3.org/International/techniques/developing-specs#char_string and
https://www.w3.org/International/techniques/developing-specs#char_indexing
(follow the 'more' links for additional information, where needed, for rationales and explanations).

The above links make the fundamental point that text pointers should use character boundaries, not bytes. Having said that, because of backwards compatibility requirements, Unicode often allows two canonically equivalent forms such as U+00E1 LATIN SMALL LETTER A WITH ACUTE vs. U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT. So there are cases where if you are matching text containing á you'd want to normalise the representation (usually to a precomposed form) to make the match work.

If you are simply pointing to a position in the text, however, I'm not sure that you need to normalise. On the other hand, you may want to take into account the fact that U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT is not something that ought to be split by a selection. For this, you need to consider the text as a series of grapheme clusters.
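The two points above can be illustrated with a short JavaScript sketch (the variable names are just for illustration): the precomposed and decomposed spellings of "á" are canonically equivalent but contain different code points, and the decomposed form is a single grapheme cluster spanning two code points.

```javascript
// Two canonically equivalent spellings of "á":
const precomposed = "\u00E1"; // U+00E1 LATIN SMALL LETTER A WITH ACUTE
const decomposed = "a\u0301"; // U+0061 + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === decomposed);                  // false: different code points
console.log(precomposed === decomposed.normalize("NFC")); // true: NFC folds them together

// The decomposed form is one grapheme cluster but two code points, so a
// position between its two code points would split the user-perceived character.
console.log([...decomposed].length); // 2
```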

@aphillips

@iherman Transforms such as NFC might actually interfere with the intention of web annotation, since the normalization of the text has the potential to change the number of code points/code units in the text, making it harder to identify the location intended. In addition, normalization might make some selections (in a purposefully de-normalized text) impossible.

It is usually best to define offsets in terms of Unicode code points. I see @r12a just provided a bunch of references while I was typing this, so I'll leave it at that.

@iherman
Member

iherman commented May 6, 2016

Discussed on the telco of 2016.05.06: follow @aphillips's and @r12a's advice and use code points.

The example in the comment will be reused in the document.

See http://www.w3.org/2016/05/06-annotation-irc#T15-43-39

@tkanai
Contributor

tkanai commented May 16, 2016

At this moment, HTML APIs and JavaScript do NOT support code-point-based string indexing at all. What I take from the recommendations, then, is that web annotation clients running in browsers need to walk through the text from the beginning for both indexing and text selection. Is my understanding correct?

Here is what I have pointed out in FindText API Issue #4.
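That walk might look like the following sketch (the helper name is illustrative, not a standard API): converting a UTF-16 code-unit offset, as used by `String.prototype.indexOf` and DOM Ranges, into a code-point offset.

```javascript
// Convert a UTF-16 code-unit offset into a Unicode code-point offset by
// walking the string from the beginning (illustrative helper, not a
// standard API).
function codeUnitToCodePointOffset(str, codeUnitOffset) {
  let codePoints = 0;
  let i = 0;
  while (i < codeUnitOffset) {
    const cp = str.codePointAt(i);
    i += cp > 0xFFFF ? 2 : 1; // supplementary characters occupy two code units
    codePoints += 1;
  }
  return codePoints;
}

// "😀" (U+1F600) is one code point but two UTF-16 code units:
const s = "😀ab";
console.log(s.length);                        // 4 code units
console.log(codeUnitToCodePointOffset(s, 2)); // 1 — "a" is at code-point index 1
```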

@iherman
Member

iherman commented May 17, 2016

Discussed it again at the F2F 17.05.16, and kept to the previous resolution.

http://www.w3.org/2016/05/17-annotation-irc#T14-54-34

@mark-buer
Author

I'd like to bring to this issue's attention a recent query against the EPUBCFI specification (which may be used for fragment identifiers within this spec), where it was (tentatively?) decided that character positions be defined in terms of UTF-16 code units.
It might be confusing if this spec chooses Unicode code points while the EPUBCFI spec chooses UTF-16 code units...

Is this specification in danger of being too difficult to implement atop the current state of deployed ECMAScript and browser DOM APIs (where UTF-16 code units are the lingua franca)?

Are changes to browser DOM APIs (for example allowing unicode code point indexing within DOM Ranges and others) imminent?

@iherman
Member

iherman commented May 18, 2016

@mark-buer :

Just trying to reproduce the discussion at the first F2F meeting: there are JS libraries available to handle code points properly. I.e., although all this may be difficult to implement on top of the current built-in JS in browsers, it can be implemented nevertheless (note that this is exactly the role of the Candidate Recommendation phase: to check whether the specification is implementable...). The feeling at the meeting was that we had better be forward-looking in this respect.

(When using environments other than browsers, the issue seems easy to handle, because other languages appear to be more advanced in this respect...)

@iherman
Member

iherman commented May 18, 2016

Just to forward additional information from the IDPF EPUB WG issue list, w3c/epub-specs#555 (comment) may be of interest here (this is the issue @mark-buer referred to concerning EPUBCFI).

(Not being an expert in the area, I am just forwarding the information...)

Cc: @nickstenning @r12a @fsasaki @azaroth42

@azaroth42
Collaborator

Interesting, but not a deal breaker for Text * Selector being defined in terms of code units. The equivalent plain text fragment URI spec is defined in terms of code points, so either way we would be at odds with one of them.

Applying Unicode terminology, this means that the length of a text/plain MIME entity is computed based on its "code points".

@andjc

andjc commented May 18, 2016

As @r12a mentioned, it is probably best to address grapheme cluster boundaries instead of character boundaries. If you go down the UTF-16 path, then implementations should use UTF-16 and not UCS-2. It's worth noting the problems that JavaScript has traditionally had with characters outside the BMP.
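Those traditional problems show up with any character outside the BMP; a small sketch:

```javascript
const gClef = "\u{1D11E}"; // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

console.log(gClef.length);                     // 2 — UTF-16 code units (a surrogate pair)
console.log(gClef.charCodeAt(0).toString(16)); // "d834" — the high surrogate, not a character
console.log(gClef.codePointAt(0).toString(16)); // "1d11e" — the actual code point

// ES2015 string iteration is code-point aware, so spreading gives the
// code-point count:
console.log([...gClef].length); // 1
```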

@duerst

duerst commented May 18, 2016

I agree that code points are the best choice. It is a bit of work in JavaScript (without a library), but it will make things much easier everywhere outside a browser.
Note that the question of what to do with Unicode normalization (NFC, ...) is orthogonal to the question of code points vs. code units. @andjc: I don't think @r12a suggested counting grapheme clusters; he just said it was a bad idea to put a text boundary in the middle of one.

@aphillips

(chair hat off)

Working in UTF-16 code units has certain advantages, particularly for JavaScript programmers.

Some downsides of defining things in UTF-16 code units should be kept in mind:

  • The actual wire format for content is usually UTF-8 and implementations in some programming languages use UTF-8 rather than UTF-16 internally. Designing offsets around a specific encoding scheme creates counting artifacts when working in the other encoding.
  • Files often contain escapes such as NCRs that add to the code unit count in the file differently from the code point count. Escape expansion must be taken into account when specifying offset.
  • Splitting a multi-code unit sequence in the middle (in UTF-8 or UTF-16) produces U+FFFDs in the output and is experienced as a bug. With the rapid and wide adoption of emoji, the frequency of supplementary characters/surrogate pair sequences in UTF-16 can no longer be considered a rare oddity or quirk.
  • The points about grapheme boundary selection are, as @duerst suggests, not about the low-level definition of the annotation format. However, there should be recommended language related to text boundary processing so that implementations consider the needs of customers whose languages use combining marks or other complex combining sequences that are possible in Unicode.
  • Past shortfalls in JS are slowly being fixed. The "problems" that JavaScript experienced mostly had to do with how regex interacted with text. For code point boundary detection, it is relatively simple (but still a couple lines of code to be sure) to ensure that the low and high surrogate stick together.

On the flip side, a number of other specifications do specify things in terms of UTF-16 and UTF-16 is JavaScript's native encoding internally. It may be that the additional implementation complexity of counting code points turns out not to be worth the overhead. If you do go with code units, be sure that it is clear that this does not extend to code units in various legacy (non-Unicode) character encodings that are still sometimes used for storing resources used on the Web.
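The "couple lines of code" keeping low and high surrogates together might look like this sketch (the helper name is illustrative, not a standard API): given a candidate code-unit index, nudge it so it never falls inside a surrogate pair.

```javascript
// Adjust a UTF-16 code-unit index so it never falls between the high and
// low surrogate of a pair (illustrative helper, not a standard API).
function snapToCodePointBoundary(str, index) {
  if (index > 0 && index < str.length) {
    const before = str.charCodeAt(index - 1);
    const at = str.charCodeAt(index);
    // High surrogate followed by low surrogate: we are mid-character.
    if (before >= 0xD800 && before <= 0xDBFF && at >= 0xDC00 && at <= 0xDFFF) {
      return index - 1; // snap back to the start of the pair
    }
  }
  return index;
}

const s = "a😀b"; // "😀" occupies code units 1 and 2
console.log(snapToCodePointBoundary(s, 2)); // 1 — index 2 would split the pair
console.log(snapToCodePointBoundary(s, 3)); // 3 — already on a boundary
```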

@kojiishi

I prefer code units. IIRC Maciej pointed out at the last TPAC that Find Text should be compatible with DOM Ranges. I can't agree more.

I understand that code points look safe, but when a position falls in the middle of a grapheme cluster, the problems look similar to those of a position in the middle of a code point. Assuming we all want graphemes to be handled properly, I see little benefit in code points.

The use of graphemes has benefits, but UAX #29 allows tailoring, which can make the pointer ambiguous.

Each spec should then require appropriate adjustment. Recently we fixed Range.getClientRects() to handle graphemes correctly. It's true that needing to fix all such APIs is troublesome, but code points don't save us from doing that anyway.
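For reference, modern engines expose UAX #29 default grapheme segmentation via `Intl.Segmenter` (tailoring is left to the implementation, which is exactly the ambiguity noted above). A sketch, assuming a reasonably recent runtime:

```javascript
// Default (untailored) grapheme cluster segmentation per UAX #29.
// Intl.Segmenter requires a fairly recent engine (e.g. Node.js 16+).
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const text = "e\u0301x"; // "é" (decomposed) followed by "x"

const clusters = [...seg.segment(text)].map(s => s.segment);
console.log(clusters);         // [ "é", "x" ] — 2 grapheme clusters
console.log([...text].length); // 3 code points
console.log(text.length);      // 3 code units
```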

@iherman iherman changed the title (model) vague definition of charactor position for text position selector (model) vague definition of character position for text position selector May 26, 2016
@iherman
Member

iherman commented May 26, 2016

Decision with the I18N WG (2016-05-26), on their advice, is to use code points.

The Anno WG accepted this. Additional text warning about the possible pitfalls will be provided by the I18N WG and added as a note in the document.

See: http://www.w3.org/2016/05/26-i18n-irc#T15-40-45 and http://www.w3.org/2016/05/26-i18n-irc#T15-50-25

@azaroth42 azaroth42 self-assigned this May 26, 2016
@azaroth42
Collaborator

Have added this text for now, and will replace with better text provided by i18n when available.

<p>The selection of the text MUST be in terms of Unicode code points (the "character number"), not in terms of code units (that number expressed using a selected data type). Selections SHOULD NOT start or end in the middle of a grapheme cluster. For more information about the character model of text used on the web, see [[charmod]].</p>
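Under that rule, the offsets of a TextPositionSelector would be counted in code points; a sketch (the sample text and selector construction are illustrative, field names follow the TextPositionSelector of the model):

```javascript
// Counting TextPositionSelector offsets in code points rather than code units.
const text = "I 💙 annotations";

// Select the word "annotations" by code-point offsets:
const codePoints = [...text];              // code-point-aware split
const start = codePoints.indexOf("a");     // 4 code points in
const end = codePoints.length;             // 15

const selector = { type: "TextPositionSelector", start, end };
console.log(selector); // { type: 'TextPositionSelector', start: 4, end: 15 }

// In code units the same selection would start at 5, because 💙 is two units:
console.log(text.indexOf("annotations")); // 5
```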

@iherman
Member

iherman commented May 30, 2016

This has been added, currently, to the Text Quote Selector. Shouldn't this be placed at the Text Position Selector?

@azaroth42
Collaborator

I put it as part of the normalization text, which is then referenced from the Text Position Selector.
I think it needs to be there; otherwise an implementation might quote based on what it can select via code units, rather than code points.


9 participants