"character" is not defined #73

Closed
r12a opened this issue Sep 15, 2017 · 9 comments
Labels
i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on.

Comments

@r12a

r12a commented Sep 15, 2017

[from Addison Phillips]

https://w3c.github.io/input-events/#interface-InputEvent-Attributes

In section 5.1.2 there are multiple places where the term "character" is used without definition. It would be better to clearly define this to mean a Unicode code point.

@r12a r12a added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label Sep 15, 2017
@johanneswilm
Contributor

@r12a @aphillips sorry for the late reply, somehow I had missed this.

I am OK with defining the term character. But I cannot find any appropriate definition of the term in the W3C repositories that doesn't use the word "character" to explain what it is. Clearly we cannot link to that, because such a definition would be circular. The definition on Wikipedia makes the term code point even broader: "Many code points represent single characters but they can also have other meanings, such as for formatting." [1]

[1] https://en.wikipedia.org/wiki/Code_point

@xfq
Member

xfq commented Jun 25, 2018

FWIW, in Infra:

A code point is a Unicode code point and is represented as a four-to-six digit hexadecimal number, typically prefixed with "U+".
[...]
Code points are sometimes referred to as characters and in certain contexts are prefixed with "0x" rather than "U+".
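
A minimal sketch of the "U+" notation quoted above, in Python. The helper name `format_code_point` is hypothetical and used only for illustration; it is not taken from Infra or from the Input Events spec.

```python
def format_code_point(cp: int) -> str:
    """Render a Unicode code point in 'U+' notation (four to six hex digits)."""
    # Unicode code points range from U+0000 to U+10FFFF.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # Pad to at least four hex digits; larger values naturally use five or six.
    return f"U+{cp:04X}"


print(format_code_point(ord("a")))   # U+0061
print(format_code_point(0x1F600))    # U+1F600
```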

@johanneswilm
Contributor

Based on the meeting at TPAC, we are waiting for a suggestion from @r12a on how to adjust the explanatory note text.

@aphillips

Updating this issue as part of I18N's regular clean-up cycle. There is now a definition in the spec:

https://w3c.github.io/input-events/#definitions

This defines "character" as:

A character is an extended grapheme cluster. [UAX29]

I'm not sure that this is what is intended, given that some input events (backwards deletion, certain cursoring operations) may be on a code point basis. This needs a read-through to determine. In addition, it looks like we owe some text based on a meeting at TPAC. I'll update our tracking issue to needs attention and add it to our action list.

@johanneswilm
Contributor

@aphillips See also previous discussion here: #71 (comment) .

@aphillips

@johanneswilm I didn't really re-read Input Events this morning when making comments--relying on memory can thus be tricky. Cursoring/selection changes are something I know we've talked about somewhere, but perhaps not in input events :-)

For backward deletion without an IME, yes: generally speaking backwards deletion works on a code point basis. Try a sequence like U+0061 U+0300 (à). Even simple editors like Notepad will delete the accent separately from the base letter when using backspace (even though you cannot select them separately). This is, of course, only true for denormalized input. U+00E0 (à) deletes as a single code point.

Languages such as the Indic ones, which rely on or require combining marks, depend on this behavior so that users can correct typos. Of course, some of these also use IMEs.
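
The difference between the decomposed sequence U+0061 U+0300 and the precomposed U+00E0 can be sketched in Python, whose strings are indexed by code point. The `backspace` helper is a hypothetical stand-in for a code-point-based backward deletion; it is not how any particular editor or browser implements it.

```python
import unicodedata

decomposed = "a\u0300"   # U+0061 U+0300: base letter plus combining grave accent
precomposed = "\u00e0"   # U+00E0: the same letter as a single code point


def backspace(text: str) -> str:
    """Delete the last code point (assumed model of code-point-based backspace)."""
    return text[:-1]


# Both strings render as the same user-perceived character, but differ in length.
print(len(decomposed), len(precomposed))   # 2 1

# Backspace on the decomposed form removes only the accent and leaves "a".
print(backspace(decomposed) == "a")        # True

# Backspace on the precomposed form removes the whole letter.
print(backspace(precomposed) == "")        # True

# NFC normalization composes the sequence into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```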

@johanneswilm
Contributor

@aphillips You are right, but after rereading that discussion, I believe we were aware of this difference at the time we included the definition. We only use the definition of "character" for the "insertTranspose" input type, in which case it really is switching two characters and is never done on a code point basis.

But I might be wrong. At any rate, I think the last we officially heard was that we would receive a PR from @r12a so if we can get that now, that would be preferable.

@aphillips

@johanneswilm I'm working on getting that PR (or at least evaluating whether more work is needed) from I18N (probably @r12a or me), but I think it'll be at least a few days while we remind ourselves of where we left this. Transposition of characters should be done on a grapheme cluster basis for sure. Stay tuned.
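
A rough Python sketch of why transposition is better done per grapheme cluster than per code point. The clustering here is a deliberate simplification (a base code point plus any following combining marks with a non-zero canonical combining class); it is an assumption for illustration and does not implement the full extended grapheme cluster rules of UAX #29 (it ignores Hangul jamo, emoji ZWJ sequences, spacing marks, and so on). All function names are hypothetical.

```python
import unicodedata
from typing import List


def simple_clusters(text: str) -> List[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified approximation only, not the UAX #29 algorithm.
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


def transpose_last_two_clusters(text: str) -> str:
    """Swap the last two (approximate) grapheme clusters."""
    clusters = simple_clusters(text)
    if len(clusters) < 2:
        return text
    clusters[-2], clusters[-1] = clusters[-1], clusters[-2]
    return "".join(clusters)


def transpose_last_two_code_points(text: str) -> str:
    """Swap the last two code points, for comparison."""
    if len(text) < 2:
        return text
    return text[:-2] + text[-1] + text[-2]


s = "ba\u0300"   # "b" followed by "a" + combining grave accent

# Cluster-based: the accent stays attached to "a".
print(transpose_last_two_clusters(s) == "a\u0300b")      # True

# Code-point-based: the accent is separated from "a" and lands on "b".
print(transpose_last_two_code_points(s) == "b\u0300a")   # True
```

In production code a real UAX #29 segmenter would replace `simple_clusters`.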

@aphillips

Reviewing this today (2022-03-07) it appears we didn't put in a PR. I have reviewed the current WD: @johanneswilm's description is correct. The term character is only used once in the document, in the insertTranspose function.

The I18N WG is admittedly pedantic about character encoding jargon. In this case, the meaning of "character" is intended to be a "user-perceived character", aka a grapheme or grapheme cluster. I would suggest:

  1. Remove the definition of character from the Terminology section, since the term is used only once in the entire document. This will avoid future revisions accidentally using the term in a different way.

  2. Replace the term 'character' in insertTranspose with the term grapheme, linking to its definition in the [I18N-GLOSSARY]. (We created the I18N glossary since the last comments on this thread, and it can be referenced through specref.)

Would you prefer a PR for this?

@siusin siusin closed this as completed in 58ec6b3 Sep 14, 2023
siusin added a commit that referenced this issue Sep 14, 2023
link to grapheme definition in i18n-glossary, fixes #73