Use of term 'character' #36

Closed
aphillips opened this issue Feb 5, 2022 · 13 comments · Fixed by #42
Labels
for review i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on.

Comments

@aphillips

Paint Text
(https://w3c.github.io/imsc-hrm/spec/imsc-hrm.html#paint-text)

a glyph is a tuple consisting of (i) one character and (ii) the computed values of the following style properties

Character is the wrong term here? There is also a note about the relationship of characters to glyphs later that only partially addresses the problem:

While one-to-one mapping between characters and typographical glyphs is generally the rule in some scripts, e.g. latin script, it is the exception in others. For instance, in arabic script, a character can yield multiple glyphs depending on its position in a word. The Hypothetical Render Model always assumes a one-to-one mapping, but reduces the performance of the glyph buffer for scripts where one-to-one mapping is not the general rule (see GCpy below).

The problem here is that a glyph can be formed by more than one Unicode code point (i.e. "character"), not just that a user-perceived character can have more than one glyph associated with it (as in Arabic). See for example here.

The Unicode "term of art" for the "character" that relates to a glyph is "grapheme cluster" (or "grapheme" for short). A grapheme is a "user-perceived character" rather than a Unicode code point. The problem here is things like combining marks. These are especially to the fore in languages (such as Hindi) that use combining marks to encode parts of a composite character (the grapheme example linked above is the tip of the grapheme iceberg). I am not sure that introducing our term here will make your document clearer, but you might use the term "user-perceived character" and link to our definition in the i18n-glossary for help.

@aphillips aphillips added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label Feb 5, 2022
@nigelmegitt
Contributor

@aphillips I see two ways to look at this: 1. Are there problems with the algorithm; 2. Are there problems with the way the spec is written. Your points are of course correct; however, they need to be considered in the context of what the HRM is trying to do, which is to set an upper limit on complexity by static analysis of an IMSC document instance.

With the specification defined as it is now, if there are grapheme clusters formed from more than one Unicode code point, that will serve to increase the computed value of the time needed to paint an ISD. That is probably not especially harmful.

So my understanding is that no significant change to the algorithm is needed, but I think changes to the specification text, both in ordinary text and in notes, are needed to address your points and clarify what is intended:

  1. The term "character" is not defined, and we need to be accurate about what it means. This needs both a definition and editorial consistency. For example, in the abstract the phrase "sub-pixel character positioning" probably refers to glyphs (i.e. collections of pixels representing a grapheme cluster, also referred to as "typographical glyph" in the second Note in §9), whereas in the definition of the tuple defined to be a "glyph" in §9, character refers, I think, to Unicode code point.
  2. We should be clear that all three ratios of Unicode code point to "user-perceived character" exist:
    a. Many-to-one
    b. One-to-one
    c. One-to-many

@aphillips does this make sense to you, or do you think more changes are needed? I'm not sure which document you are referring to when you mention the i18n-glossary - are the definitions at https://www.w3.org/TR/charmod/ the appropriate ones to use?

@aphillips
Author

Thanks @nigelmegitt.

This comment is about the specificity of the Unicode jargon and not so much about the algorithm. For implementers to understand the algorithm and implement it consistently, they'll need to know what "character" means in its different guises. We raised #37 to discuss the algorithm, so I'll save comments about it for that thread.

The term character should be strictly defined, probably to mean Unicode code point. You will also need to use terms for grapheme clusters and glyphs.

Your (2) is correct, although the actual relationship might be expressed as:

code point : grapheme cluster (n:1) // one or more code points per grapheme cluster
grapheme cluster : glyph (1:n) // a grapheme cluster might have more than one glyph associated with it

This implies that many (code points)-to-many (glyphs) relationships are possible and, indeed, cases such as Arabic script with diacritics do exist.
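
As a rough illustration of the n:1 relationship between code points and grapheme clusters, here is a minimal Python sketch. It only approximates grapheme clusters: it attaches any code point with a nonzero canonical combining class (via `unicodedata.combining`) to the preceding base code point. That is an assumption made for brevity and is not the full Unicode segmentation algorithm (UAX #29), which has additional rules (for example for spacing marks and joiner sequences). The helper name `approx_clusters` is hypothetical, and the sketch says nothing about how many glyphs a font would use for each cluster.

```python
import unicodedata


def approx_clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified approximation only: real grapheme cluster segmentation
    (UAX #29) covers more cases than nonzero combining classes.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: extend the current cluster.
            clusters[-1] += ch
        else:
            # Base code point: start a new cluster.
            clusters.append(ch)
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = "e\u0301"
# Arabic beh (U+0628) followed by fatha (U+064E): base letter plus diacritic.
arabic = "\u0628\u064E"

for sample in (decomposed, arabic):
    print(len(sample), "code points ->", len(approx_clusters(sample)), "cluster(s)")
```

Counting with `len(text)` counts code points, so any static analysis that counts per code point will see more units than a reader perceives for text like the samples above.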

Although I have some experience working on text renderers, I am not an expert (certainly not for modern renderers), so I don't know how much these different features of different scripts affect the performance of the renderer. While it is probably directionally correct that a glyph representing 3 code points takes more time/effort than a glyph that represents a single code point, I don't know if it takes 3x (or 5x.. or 1.5x...) the effort. We can ask some folks from that community to comment if that would help.

For the text, I would suggest a close read and edit to ensure that the word "character" is replaced as needed with the right term. I would probably edit this note:

While one-to-one mapping between characters and typographical glyphs is generally the rule in some scripts, e.g. latin script, it is the exception in others. For instance, in arabic script, a character can yield multiple glyphs depending on its position in a word. The Hypothetical Render Model always assumes a one-to-one mapping, but reduces the performance of the glyph buffer for scripts where one-to-one mapping is not the general rule (see GCpy below).

To read more like:

While one-to-one mapping between characters (code points) and typographical glyphs is common in some scripts (such as the Latin script), the actual relationship is often more complex. Some scripts, such as Arabic, use different glyphs for a given character depending on its position in a word. Some scripts require combining marks or use a sequence of code points to form a glyph. Cases exist where multiple code points (graphemes) have multiple glyphs. The Hypothetical Render Model assumes that code points have a one-to-one mapping to a glyph, but accounts for the above complexity by reducing the performance of the glyph buffer for scripts where a one-to-one mapping is not the general rule (see GCpy below).

@aphillips
Author

Oh... I forgot to mention that i18n-glossary is a new WG Note that we're maintaining that has many of our terms collected in it. See https://www.w3.org/TR/i18n-glossary (and it's in SpecRef so you can reference it using e.g. Respec)

@nigelmegitt
Contributor

Thanks for those comments and the proposal @aphillips, which looks good to me.

I noticed that you did not propose an edit to the phrase "typographical glyph", which isn't in the glossary: do you consider its meaning to be commonly understood? I'm not sure if it means anything other than "glyph" as defined in the i18n glossary.

@aphillips
Author

I left "typographical glyph" alone in my suggestion because your comment seemed to use it familiarly. The definition of glyph in i18n-glossary is probably what you mean, so I'd just use glyph here.

@nigelmegitt
Contributor

@aphillips please take a look at #42, to see if it resolves this issue.

@palemieux One thing I have considered doing, but not yet done, is to rename glyph as defined in §9 Paint Text to something more clearly a feature of the HRM and more distant from the pre-existing term glyph, to which it bears a passing resemblance only. For example "h-glyph". The idea is to resolve the textual ambiguity where the term glyph is sometimes used in its "ordinary" meaning, e.g. that defined by i18n-glossary, and sometimes in its specific §9 meaning as a tuple, and it would make more obvious the fact that the tuple is an HRM construct. Any thoughts on this change?

@palemieux
Contributor

palemieux commented Feb 19, 2022

@aphillips @nigelmegitt AFAIK the term character used in IMSC-HRM comes from TTML2: https://www.w3.org/TR/ttml2/#reduced-infoset-character

@palemieux
Contributor

is to rename glyph as defined in §9 Paint Text to something more clearly a feature of the HRM

@aphillips Sounds good.

I am not excited by h_glyph, since it still sounds pretty close to glyph and h could mean horizontal.

Other potential terms that come to mind: rendered-character, character-unit, rendered-character-unit or hypothetical-glyph, hypothetical-rendered-glyph...

@nigelmegitt
Contributor

@aphillips @nigelmegitt AFAIK the term character used in IMSC-HRM comes from TTML2: https://www.w3.org/TR/ttml2/#reduced-infoset-character

#42 (comment)

@nigelmegitt
Contributor

Other potential terms that come to mind: rendered-character, character-unit, rendered-character-unit or hypothetical-glyph, hypothetical-rendered-glyph...

@palemieux Of these, I'd favour hypothetical-glyph

@aphillips
Author

The definition in #reduced-infoset-character, when traced back through the various links, I think means "Unicode code point" (aka Unicode Scalar Value). Which means that it doesn't mean glyph!

I think inventing a new term here would be a bad thing. It's already hard to follow what your spec is trying to say. The closer we can get to standard terminology, the easier it will be on implementers.


Here's how I draw my conclusion about what your character means. The text at the link says:

Contiguous character information items are not required to be represented distinctly, but may be aggregated (chunked) into a sequence of character codes (i.e., a character string).

This is really hard to follow. It could mean code points (in a USVString), code units in the DOM, or maybe even glyphs? However, this is followed by a "see also character information item" which links to this definition:

Each data character appearing in an XML document corresponds with a character information item as defined by [XML InfoSet], §2.6.

This is clearly a Unicode code point (see https://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character). That means you're talking about code point strings being used to compose glyphs.

@nigelmegitt
Contributor

nigelmegitt commented Feb 21, 2022

This is clearly a Unicode code point

@aphillips unfortunately, it is more than a Unicode code point. It also has two additional properties: a whitespace indication (boolean) and the parent element information item. Neither of these additional properties is at all desirable here.

This is why I have suggested using just "Unicode code point".

@nigelmegitt
Contributor

There's more discussion of this on the pull request; I've actually reverted to "character" and defined it as a term meaning the character code property of the XML Infoset "character information item".
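
To make the distinction concrete, here is a minimal Python sketch of the idea, not the specification text. It models a character information item with the three properties mentioned in this thread (a character code, a whitespace indication, and a parent element) and shows "character" as only the first of them. The class, field, and function names are hypothetical, and the parent is represented by a plain string as a stand-in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CharacterInformationItem:
    """Illustrative model of an XML Infoset character information item."""

    character_code: int  # the code of the character
    element_content_whitespace: Optional[bool]  # whitespace indication, may be unknown
    parent: str  # stand-in for the parent element information item


def hrm_character(item: CharacterInformationItem) -> int:
    """Return only the character code, ignoring the other two properties."""
    return item.character_code


# U+00E9 LATIN SMALL LETTER E WITH ACUTE inside a hypothetical <span>.
item = CharacterInformationItem(
    character_code=0x00E9,
    element_content_whitespace=False,
    parent="span",
)
print(hex(hrm_character(item)), chr(hrm_character(item)))
```

Under this reading, two items with the same character code but different parents or whitespace flags are the same "character".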

nigelmegitt added a commit that referenced this issue Mar 16, 2022
Define "character" and "code point".
Sort defined terms into alphabetical order.
Improve note about mapping between code points and glyphs, and make reference to the i18n glossary for some terms.
Format some technical phrases as code where appropriate.
Improve note about how choice of font can increase rendering complexity.

Closes #36.

3 participants