Use of term 'character' #36
Comments
@aphillips I see two ways to look at this: 1. Are there problems with the algorithm; 2. Are there problems with the way the spec is written. Your points are of course correct, however they need to be considered in the context of what the HRM is trying to do, which is to set an upper limit on complexity by static analysis of an IMSC document instance. With the specification defined as it is now, if there are grapheme clusters formed from more than one Unicode code point, that will serve to increase the computed value of the time needed to paint an ISD. That is probably not especially harmful. So my understanding is that no significant change to the algorithm is needed, but I think changes to the specification text, both in ordinary text and in notes, are needed to address your points and clarify what is intended:
@aphillips does this make sense to you, or do you think more changes are needed? I'm not sure which document you are referring to when you mention the i18n-glossary - are the definitions at https://www.w3.org/TR/charmod/ the appropriate ones to use? |
Thanks @nigelmegitt. This comment is about the specificity of the Unicode jargon and not so much about the algorithm. For implementers to understand the algorithm and implement it consistently, they'll need to know what "character" means in its different guises. We raised #37 to discuss the algorithm, so I'll save comments about it for that thread. The term character should be strictly defined, probably to mean Unicode code point. You will also need to use terms for grapheme clusters and glyphs. Your (2) is correct, although the actual relationship might be expressed as: code point : grapheme cluster (n:1) // one or more code points per grapheme cluster. This implies that many (code points)-to-many (glyphs) relationships are possible and, indeed, cases such as Arabic script with diacritics do exist. Although I have some experience working on text renderers, I am not an expert (certainly not for modern renderers), so I don't know how much these different features of different scripts affect the performance of the renderer. While it is probably directionally correct that a glyph representing 3 code points takes more time/effort than a glyph that represents a single code point, I don't know if it takes 3x (or 5x... or 1.5x...) the effort. We can ask some folks from that community to comment if that would help. For the text, I would suggest a close read and edit to ensure that the word "character" is replaced as needed with the right term. I would probably edit this note:
To read more like:
|
Oh... I forgot to mention that i18n-glossary is a new WG Note that we're maintaining that has many of our terms collected in it. See https://www.w3.org/TR/i18n-glossary (and it's in SpecRef so you can reference it using e.g. Respec) |
Thanks for those comments and the proposal @aphillips, which looks good to me. I noticed that you did not propose an edit to the phrase "typographical glyph", which isn't in the glossary: do you consider its meaning to be commonly understood? I'm not sure if it means anything other than "glyph" as defined in the i18n glossary. |
I left "typographical glyph" alone in my suggestion because your comment seemed to use it familiarly. The definition of glyph in i18n-glossary is probably what you mean, so I'd just use glyph here. |
@aphillips please take a look at #42, to see if it resolves this issue. @palemieux One thing I have considered doing, but not yet done, is to rename glyph as defined in §9 Paint Text to something more clearly a feature of the HRM and more distant from the pre-existing term glyph, to which it bears a passing resemblance only. For example "h-glyph". The idea is to resolve the textual ambiguity where the term glyph is sometimes used in its "ordinary" meaning, e.g. that defined by i18n-glossary, and sometimes in its specific §9 meaning as a tuple, and it would make more obvious the fact that the tuple is an HRM construct. Any thoughts on this change? |
@aphillips @nigelmegitt AFAIK the term character used in IMSC-HRM comes from TTML2: https://www.w3.org/TR/ttml2/#reduced-infoset-character |
@aphillips Sounds good. I am not excited by h_glyph, since it still sounds pretty close to glyph and h could mean horizontal. Other potential terms that come to mind: |
|
@palemieux Of these, I'd favour |
The definition in #reduced-infoset-character, when traced back through the various links I think means "Unicode code point" (aka Unicode Scalar Value). Which means that it doesn't mean glyph! I think inventing a new term here would be a bad thing. It's already hard to follow what your spec is trying to say. The closer we can get to standard terminology, the easier it will be on implementers. Here's how I draw my conclusion about what your
This is really hard to follow. It could mean code points (in a USVString), code units in the DOM, or maybe even glyphs? However, this is followed by a "see also character information item" which links to this definition:
This is clearly a Unicode code point (see https://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character). That means you're talking about code point strings being used to compose glyphs. |
@aphillips unfortunately, it is more than a Unicode code point. It also has two additional properties: a whitespace indication (boolean) and the parent element information item. Neither of these additional properties is at all desirable here. This is why I have suggested using just "Unicode code point". |
There's more discussion of this on the pull request, I've actually reverted to "character" and defined it as a term meaning the character code property of the XML Infoset "character information item". |
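As a rough illustration of the distinction discussed above, here is a minimal Python sketch of a "character information item" carrying the three properties mentioned in this thread (character code, whitespace indication, parent), and of reading only the character code from it. The class and function names are hypothetical, and the parent is modelled as a plain string stand-in; none of this is taken from the IMSC-HRM or TTML2 text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharacterInfoItem:
    """Hypothetical model of an XML Infoset character information item."""
    character_code: int               # the Unicode code point
    element_content_whitespace: bool  # whitespace indication
    parent: str                       # stand-in for the parent element information item


def character_codes(items: List[CharacterInfoItem]) -> List[int]:
    """Keep only the character code property, dropping the other two."""
    return [item.character_code for item in items]


# "e" followed by U+0301 COMBINING ACUTE ACCENT, both inside a hypothetical <span>.
items = [CharacterInfoItem(ord(c), False, "span") for c in "e\u0301"]
print([f"U+{cp:04X}" for cp in character_codes(items)])
```

Under this reading, "character" is just the code point value, so the whitespace flag and the parent never enter the count.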
Define "character" and "code point". Sort defined terms into alphabetical order. Improve note about mapping between code points and glyphs, and make reference to the i18n glossary for some terms. Format some technical phrases as code where appropriate. Improve note about how choice of font can increase rendering complexity. Closes #36.
Paint Text
(https://w3c.github.io/imsc-hrm/spec/imsc-hrm.html#paint-text)
Character is the wrong term here? There is also a note about the relationship of characters to glyphs later that only partially addresses the problem:
The problem here is that a glyph can be formed by more than one Unicode code point (i.e. "character"), not just that a user-perceived character can have more than one glyph associated with it (as in Arabic). See for example here.
The Unicode "term of art" for the "character" that relates to a glyph is "grapheme cluster" (or "grapheme" for short). A grapheme is a "user-perceived character" rather than a Unicode code point. The problem here is things like combining marks. These are especially to the fore in languages (such as Hindi) that use combining marks to encode parts of a composite character (the grapheme example linked above is the tip of the grapheme iceberg). I am not sure that introducing our term here will make your document clearer, but you might use the term "user-perceived character" and link to our definition in the i18n-glossary for help.
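To make the code point versus grapheme distinction concrete, here is a small Python sketch that counts both for a few strings. It is only an approximation: it attaches any code point whose general category is a mark (Mn, Mc, Me) to the preceding cluster, which is an assumption made for brevity and not the full grapheme cluster segmentation defined by Unicode. The function name `approx_clusters` is hypothetical.

```python
import unicodedata


def approx_clusters(text):
    """Group code points into approximate user-perceived characters.

    Simplification (an assumption of this sketch): a code point whose
    general category starts with "M" (a mark) joins the previous cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


samples = {
    "Latin e + combining acute": "e\u0301",
    "Devanagari ka + vowel sign i": "\u0915\u093f",
    "plain ASCII": "abc",
}
for label, s in samples.items():
    # len(s) counts code points; len(approx_clusters(s)) counts clusters.
    print(f"{label}: {len(s)} code points, {len(approx_clusters(s))} clusters")
```

In the first two samples, two code points collapse into one user-perceived character, which is the n:1 case a per-"character" count would need to name precisely.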