
(model) vague definition of character position for text position selector #206

Closed
mark-buer opened this issue Apr 28, 2016 · 19 comments

@mark-buer

The text position selector spec doesn't define the exact meaning of the term "character position".

There are many possible definitions. "character position" might be measured in units of UTF code points, UTF-8 code units, UTF-16 code units etc.

Interoperability issues will result if different implementations assume incompatible meanings.

@azaroth42
Collaborator

Related to this discussion:
https://lists.w3.org/Archives/Public/public-annotation/2015May/0003.html

We clearly need to be more explicit in the description.

@iherman
Member

iherman commented May 5, 2016

I wonder whether the section on normalization in the character model is relevant here (although that text is primarily aimed at matching). I.e., the content is supposed to be transformed into NFC, and the text position would then be understood against the result of that transformation.

Cc: @r12a @aphillips

@r12a

r12a commented May 5, 2016

I expect that the i18n WG will discuss this and provide a more formal answer. In the meantime, maybe this can help:
https://www.w3.org/International/techniques/developing-specs#char_string and
https://www.w3.org/International/techniques/developing-specs#char_indexing
(follow the 'more' links for additional information, where needed, for rationales and explanations).

The above links make the fundamental point that text pointers should use character boundaries, not bytes. Having said that, because of backwards compatibility requirements, Unicode often allows two canonically equivalent forms such as U+00E1 LATIN SMALL LETTER A WITH ACUTE vs. U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT. So there are cases where if you are matching text containing á you'd want to normalise the representation (usually to a precomposed form) to make the match work.

If you are simply pointing to a position in the text, however, I'm not sure that you need to normalise. On the other hand, you may want to take into account the fact that U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT is not something that ought to be split by a selection. For this, you need to consider the text as a series of grapheme clusters.
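The two points above can be illustrated with a short JavaScript sketch (the variable names are just for illustration): the precomposed and decomposed spellings of "á" are canonically equivalent but contain different code points, and the decomposed form is a single grapheme cluster spanning two code points.

```javascript
// Two canonically equivalent spellings of "á":
const precomposed = "\u00E1"; // U+00E1 LATIN SMALL LETTER A WITH ACUTE
const decomposed = "a\u0301"; // U+0061 + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === decomposed);                  // false: different code points
console.log(precomposed === decomposed.normalize("NFC")); // true: NFC folds them together

// The decomposed form is one grapheme cluster but two code points, so a
// position between its two code points would split the user-perceived character.
console.log([...decomposed].length); // 2
```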

@aphillips

@iherman Transforms such as NFC might actually interfere with the intention of web annotation, since the normalization of the text has the potential to change the number of code points/code units in the text, making it harder to identify the location intended. In addition, normalization might make some selections (in a purposefully de-normalized text) impossible.

It is usually best to define offsets in terms of Unicode code points. I see @r12a just provided a bunch of references while I was typing this, so I'll leave it at that.

@iherman
Member

iherman commented May 6, 2016

Discussed on the telco of 2016.05.06: follow @aphillips's and @r12a's advice and use code points.

The example in the comment will be reused in the document.

See http://www.w3.org/2016/05/06-annotation-irc#T15-43-39

@tkanai
Contributor

tkanai commented May 16, 2016

At this moment, HTML APIs and JavaScript do NOT support code-point-based string indexing at all. What I take from the recommendations, then, is that web annotation clients running in browsers need to walk through the text from the beginning for both indexing and text selection. Is my understanding correct?

Here is what I have pointed out in FindText API Issue #4.
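That walk might look like the following sketch (the helper name is illustrative, not a standard API): converting a UTF-16 code-unit offset, as used by `String.prototype.indexOf` and DOM Ranges, into a code-point offset.

```javascript
// Convert a UTF-16 code-unit offset into a Unicode code-point offset by
// walking the string from the beginning (illustrative helper, not a
// standard API).
function codeUnitToCodePointOffset(str, codeUnitOffset) {
  let codePoints = 0;
  let i = 0;
  while (i < codeUnitOffset) {
    const cp = str.codePointAt(i);
    i += cp > 0xFFFF ? 2 : 1; // supplementary characters occupy two code units
    codePoints += 1;
  }
  return codePoints;
}

// "😀" (U+1F600) is one code point but two UTF-16 code units:
const s = "😀ab";
console.log(s.length);                        // 4 code units
console.log(codeUnitToCodePointOffset(s, 2)); // 1 — "a" is at code-point index 1
```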

@iherman
Member

iherman commented May 17, 2016

Discussed it again at the F2F 17.05.16, and kept to the previous resolution.

http://www.w3.org/2016/05/17-annotation-irc#T14-54-34

@mark-buer
Author

I'd like to bring to this issue's attention a recent query against the EPUBCFI specification (which may be used for fragment identifiers within this spec), where it was (tentatively?) decided that character positions be defined in terms of UTF-16 code units.
It might be confusing if this spec chooses Unicode code points while the EPUBCFI spec chooses UTF-16 code units...

Is this specification in danger of being too difficult to implement atop the current state of deployed ECMAScript and browser DOM APIs (where UTF-16 code units are the lingua franca)?

Are changes to browser DOM APIs (for example allowing unicode code point indexing within DOM Ranges and others) imminent?

@iherman
Member

iherman commented May 18, 2016

@mark-buer :

Just trying to reproduce the discussion at the first F2F meeting: there are JS libraries available to handle code points properly. I.e., although all this may be difficult to implement on top of the current built-in JS in browsers, it can be implemented nevertheless (note that this is exactly the role of the Candidate Recommendation phase: to check whether the specification is implementable...). The feeling at the meeting was that we had better be forward-looking in this respect.

(When using environments other than browsers, the issue seems easy to handle, because other languages appear to be more advanced in this respect...)

@iherman
Member

iherman commented May 18, 2016

Just to forward additional information from the IDPF EPUB WG issue list, w3c/epub-specs#555 (comment) may be of interest here (this is the issue @mark-buer referred to concerning EPUBCFI).

(Not being an expert in the area, I am just forwarding the information...)

Cc: @nickstenning @r12a @fsasaki @azaroth42

@azaroth42
Collaborator

Interesting, but not a deal breaker for Text * Selector being defined in terms of code units. The equivalent plain text fragment URI spec is defined in terms of code points, so either way we would be at odds with one of them.

Applying Unicode terminology, this means that the length of a text/plain MIME entity is computed based on its "code points".

@andjc

andjc commented May 18, 2016

As @r12a mentioned, it is probably best to address grapheme cluster boundaries instead of character boundaries. If you go down the UTF-16 path, then implementations should use UTF-16 and not UCS-2. It's worth noting the problems that JavaScript has traditionally had with characters outside the BMP.
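Those traditional problems show up with any character outside the BMP; a small sketch:

```javascript
const gClef = "\u{1D11E}"; // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

console.log(gClef.length);                     // 2 — UTF-16 code units (a surrogate pair)
console.log(gClef.charCodeAt(0).toString(16)); // "d834" — the high surrogate, not a character
console.log(gClef.codePointAt(0).toString(16)); // "1d11e" — the actual code point

// ES2015 string iteration is code-point aware, so spreading gives the
// code-point count:
console.log([...gClef].length); // 1
```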

@duerst

duerst commented May 18, 2016

I agree that code points are the best choice. It is a bit of work in JavaScript (without a library), but it will make things much easier everywhere outside a browser.
Note that the question of what to do with Unicode normalization (NFC, ...) is orthogonal to the question of code points vs. code units. @andjc: I don't think @r12a suggested counting grapheme clusters; he just said it was a bad idea to put a text boundary in the middle of one.

@aphillips

(chair hat off)

Working in UTF-16 code units has certain advantages, particularly for JavaScript programmers.

Some downsides of defining things in UTF-16 code units should be kept in mind:

  • The actual wire format for content is usually UTF-8 and implementations in some programming languages use UTF-8 rather than UTF-16 internally. Designing offsets around a specific encoding scheme creates counting artifacts when working in the other encoding.
  • Files often contain escapes such as NCRs that add to the code unit count in the file differently from the code point count. Escape expansion must be taken into account when specifying offset.
  • Splitting a multi-code unit sequence in the middle (in UTF-8 or UTF-16) produces U+FFFDs in the output and is experienced as a bug. With the rapid and wide adoption of emoji, the frequency of supplementary characters/surrogate pair sequences in UTF-16 can no longer be considered a rare oddity or quirk.
  • The points about grapheme boundary selection are, as @duerst suggests, not about the low-level definition of the annotation format. However, there should be recommended language related to text boundary processing so that implementations consider the needs of customers whose languages use combining marks or other complex combining sequences that are possible in Unicode.
  • Past shortfalls in JS are slowly being fixed. The "problems" that JavaScript experienced mostly had to do with how regex interacted with text. For code point boundary detection, it is relatively simple (but still a couple lines of code to be sure) to ensure that the low and high surrogate stick together.

On the flip side, a number of other specifications do specify things in terms of UTF-16 and UTF-16 is JavaScript's native encoding internally. It may be that the additional implementation complexity of counting code points turns out not to be worth the overhead. If you do go with code units, be sure that it is clear that this does not extend to code units in various legacy (non-Unicode) character encodings that are still sometimes used for storing resources used on the Web.
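The "couple lines of code" keeping low and high surrogates together might look like this sketch (the helper name is illustrative, not a standard API): given a candidate code-unit index, nudge it so it never falls inside a surrogate pair.

```javascript
// Adjust a UTF-16 code-unit index so it never falls between the high and
// low surrogate of a pair (illustrative helper, not a standard API).
function snapToCodePointBoundary(str, index) {
  if (index > 0 && index < str.length) {
    const before = str.charCodeAt(index - 1);
    const at = str.charCodeAt(index);
    // High surrogate followed by low surrogate: we are mid-character.
    if (before >= 0xD800 && before <= 0xDBFF && at >= 0xDC00 && at <= 0xDFFF) {
      return index - 1; // snap back to the start of the pair
    }
  }
  return index;
}

const s = "a😀b"; // "😀" occupies code units 1 and 2
console.log(snapToCodePointBoundary(s, 2)); // 1 — index 2 would split the pair
console.log(snapToCodePointBoundary(s, 3)); // 3 — already on a boundary
```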

@kojiishi

I prefer code units. IIRC Maciej pointed out at the last TPAC that Find Text should be compatible with DOM Ranges. I can't agree more.

I understand that code points look safe, but when a position falls in the middle of a grapheme cluster, the problems look similar to those of a position in the middle of a code point. Assuming we all want graphemes to be handled properly, I see little benefit in code points.

The use of graphemes has benefits, but UAX #29 allows tailoring, which can make the pointer ambiguous.

Each spec should then require appropriate adjustment. Recently we fixed Range.getClientRects() to handle graphemes correctly. It's true that needing to fix all such APIs is troublesome, but code points don't save us from doing that anyway.
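For reference, modern engines expose UAX #29 default grapheme segmentation via `Intl.Segmenter` (tailoring is left to the implementation, which is exactly the ambiguity noted above). A sketch, assuming a reasonably recent runtime:

```javascript
// Default (untailored) grapheme cluster segmentation per UAX #29.
// Intl.Segmenter requires a fairly recent engine (e.g. Node.js 16+).
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const text = "e\u0301x"; // "é" (decomposed) followed by "x"

const clusters = [...seg.segment(text)].map(s => s.segment);
console.log(clusters);         // [ "é", "x" ] — 2 grapheme clusters
console.log([...text].length); // 3 code points
console.log(text.length);      // 3 code units
```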

@iherman iherman changed the title (model) vague definition of charactor position for text position selector (model) vague definition of character position for text position selector May 26, 2016
@iherman
Member

iherman commented May 26, 2016

Decision with the I18N WG (2016-05-26), on their advice, is to use code points.

The Anno WG accepted this. Additional text warning about the possible pitfalls will be provided by the I18N WG and added as a note in the document.

See: http://www.w3.org/2016/05/26-i18n-irc#T15-40-45 and http://www.w3.org/2016/05/26-i18n-irc#T15-50-25

@azaroth42 azaroth42 self-assigned this May 26, 2016
@azaroth42
Collaborator

Have added this text for now, and will replace with better text provided by i18n when available.

<p>The selection of the text MUST be in terms of Unicode code points (the "character number"), not in terms of code units (that number expressed using a selected data type). Selections SHOULD NOT start or end in the middle of a grapheme cluster. For more information about the character model of text used on the web, see [[charmod]].</p>
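Under that rule, the offsets of a TextPositionSelector would be counted in code points; a sketch (the sample text and selector construction are illustrative, field names follow the TextPositionSelector of the model):

```javascript
// Counting TextPositionSelector offsets in code points rather than code units.
const text = "I 💙 annotations";

// Select the word "annotations" by code-point offsets:
const codePoints = [...text];              // code-point-aware split
const start = codePoints.indexOf("a");     // 4 code points in
const end = codePoints.length;             // 15

const selector = { type: "TextPositionSelector", start, end };
console.log(selector); // { type: 'TextPositionSelector', start: 4, end: 15 }

// In code units the same selection would start at 5, because 💙 is two units:
console.log(text.indexOf("annotations")); // 5
```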

@iherman
Member

iherman commented May 30, 2016

This has been added, currently, to the Text Quote Selector. Shouldn't this be placed at the Text Position Selector?

@azaroth42
Collaborator

I put it as part of the normalization text, which is then referenced from the Text Position Selector.
I think it needs to be there; otherwise an implementation might quote based on what it can select via code units, rather than code points.


9 participants