Devanagari: 3.7.1 Grapheme Clusters - incorrect layer for this requirement #34

Open
alolita opened this issue Oct 2, 2018 · 20 comments

Comments

@alolita
Member

alolita commented Oct 2, 2018

https://w3c.github.io/iip/gap-analysis/deva-gap.html#grapheme_clusters

This section on grapheme clusters for Indic scripts like Devanagari should be at the Unicode layer, not at the script grammar layer.

@alolita alolita added l:hi Hindi, Devanagari script drafting labels Oct 3, 2018
@alolita
Member Author

alolita commented Dec 5, 2018

@vivekpani can you please add specific examples re: grapheme clusters at the Unicode layer and why these don't map into script grammar behavior.

@vivekpani

The cluster appears to be central to display and editing functions on Indic scripts. Hence, the cluster definition is not at the character encoding layer. For example, for a word like हड्डियाँ, some rendering systems will stack the two डs and some others will make it linear with a halant after the first one. In such systems, the ि will join only the second ड. Thus, it is visually two independent units. If a cluster definition for display will get defined at the character encoding level, then this differentiation will not be possible. I think the cluster definition is being confused with script grammar, which defines the legal sequences of characters that can make up a consonant or vowel unit (Akshar).
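To make the example concrete, here is a minimal Python sketch that lists the code points behind हड्डियाँ using the standard `unicodedata` module. It assumes the word is encoded as the eight code points shown (written as escapes so the order is unambiguous); `WORD` and `describe` are names introduced here for illustration only.

```python
import unicodedata

# Assumed encoding of the word discussed above, written with escapes.
WORD = "\u0939\u0921\u094D\u0921\u093F\u092F\u093E\u0901"


def describe(text):
    """Return (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


# Print one line per code point.
for row in describe(WORD):
    print(*row)
```

The single DEVANAGARI SIGN VIRAMA between the two DDA letters is the halant mentioned above. The stored sequence is the same whether a renderer stacks the two letters or lays them out linearly.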

@lianghai

@alolita Unicode’s concepts like “extended grapheme cluster” are meant to provide some low-level, general segmentation, and are not going to be enough for an ideal end-user experience. Therefore there is indeed a gap and W3C should look into it.

@tiroj

tiroj commented Dec 17, 2018

> For example, for a word like हड्डियाँ, some rendering systems will stack the two डs and some others will make it linear with a halant after the first one. In such systems, the ि will join only the second ड. Thus, it is visually two independent units. If a cluster definition for display will get defined at the character encoding level, then this differentiation will not be possible.

The individual font has a lot of control over how this sequence is displayed. Not only is it possible to display the sequence in the two ways described by @vivekpani, but also it is possible to force an older display convention in which the ि reordering is not blocked by the presence of the explicit halant in linear layout but instead moves to before the first ड.

@vivekpani

vivekpani commented Dec 17, 2018 via email

@tiroj

tiroj commented Dec 17, 2018

I perhaps should have expanded on the observation. I'm not sure that it is possible to disentangle editing and display in the case of complex scripts.

When you wrote re. ड्डि that it can be displayed, linearly as 'visually two independent units', did you also mean to imply that editing behaviour would differ? Visually, the editing behaviour will certainly seem different for the user, because cursor movement and character selection is going to look different.

@asmusf

asmusf commented Dec 18, 2018

I certainly see the non-stacked layout with editing revealing two subunits. I presume that in the stacked model a single cursor movement would cover both. However, using my OS's native renderer, the cursor moves across the whole cluster ड्डि, despite the side-by-side display. The exception is backspace, which at least on that system picks off individual characters, with Undo restoring the whole cluster.

All this points to the idea that, unlike in atomic letter scripts, where you segment is rather context dependent. Line breaking may have different requirements from other operations, so for that reason alone, Unicode's "one-size-fits-all" default definition isn't going to solve much.

One thing that that definition includes is an implicit emergency mode, such that the segmentation will yield at least some results in the face of ill-formed input (e.g. incomplete syllables). While that is a practical necessity, it complicates any use case that needs to tease out well-formed clusters.

@vivekpani

> I perhaps should have expanded on the observation. I'm not sure that it is possible to disentangle editing and display in the case of complex scripts.
>
> When you wrote re. ड्डि that it can be displayed, linearly as 'visually two independent units', did you also mean to imply that editing behaviour would differ? Visually, the editing behaviour will certainly seem different for the user, because cursor movement and character selection is going to look different.

Yes @tiroj. The reason a cluster definition is needed is that characters can join to form units in which it is not possible to distinguish an independent visual unit for each character. So, editing must account for cursor movement, selection, delete, backspace, and similar behaviors. However, when there are visually separate units, a user will expect the behavior to take that into account. So, a stacked display will be a single unit whereas a non-stacked one will be multiple units. Visually and for editing purposes, an independent unit has no reason to participate in a cluster with another independent unit. Display "is" the most natural and intuitive feedback for a user, and all behavioral expectations will follow from it. The cluster definition is different from script grammar. I agree with @asmusf that a one-size-fits-all default definition isn't going to help, and OpenType offers no way to define anything beyond glyph properties.
This was solved in the earlier software and definitions that used the Indian national standard for Indic scripts. Unfortunately the consortium members, whose software was also the most widely used across the world, did not decide to conform to, or at least maintain, the minimum basic experience that had already been created for native languages in the geographies where they decided to support "Unicode only". IMAO, this was quite an imposition.

@khaledhosny

Editing and selection are usually based on grapheme clusters and are not based on the visual rendering (and when they are, it is usually a bug rather than a conscious choice by the implementer). This means that whether the sequence is rendered stacked or linear, the editing behavior will be the same; if it is a single grapheme cluster then it will be treated as single unit during editing.

Many implementations make cursor movement and the delete key work on grapheme clusters, while the backspace key works on individual Unicode code points (as already observed here). Some implementations further refine this by having a special mode that always works on individual code points (e.g. by pressing Ctrl+arrow keys).
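A rough Python sketch of the split described above, where forward delete removes a whole cluster and backspace removes one code point. The `clusters` function is a deliberately simplified stand-in (a base character plus any following combining marks) and is not the UAX #29 algorithm; all function names here are hypothetical.

```python
import unicodedata


def clusters(text):
    """Approximate clusters: a base character plus any following marks (Mn, Mc, Me)."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the mark to the preceding base
        else:
            out.append(ch)  # start a new cluster
    return out


def forward_delete(text, pos):
    """Delete the whole cluster that starts at code point index pos."""
    offset = 0
    for cl in clusters(text):
        if offset == pos:
            return text[:pos] + text[pos + len(cl):]
        offset += len(cl)
    return text  # pos is at the end or not on a cluster boundary


def backspace(text, pos):
    """Delete the single code point immediately before index pos."""
    return text[:pos - 1] + text[pos:] if pos > 0 else text


# Assumed encoding of the Devanagari syllable "di": DDA + VOWEL SIGN I.
syllable = "\u0921\u093F"
print(len(forward_delete(syllable, 0)))  # whole cluster removed
print(len(backspace(syllable, 2)))       # only the vowel sign removed
```

In this sketch the caret index is counted in code points; a real editor would also have to map between cluster boundaries and its own internal string offsets.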

I’m not sure if the objection here is that the Unicode grapheme clusters model is insufficient, or that it is described in the wrong section of the document.

@vivekpani

@khaledhosny I think you are right, and it would be desirable to have uniform editing behavior regardless of how the text is rendered. However, I will first need to understand what a grapheme cluster is.

@r12a
Contributor

r12a commented Mar 13, 2019

I've been trying to understand the ramifications of different renderings for consonant clusters in indic (and other) scripts for some time now (see also #18), but i'm finding it difficult to pin down the facts wrt what users actually expect to happen.

I echo the point already made that different editorial operations may segment differently, and i don't know what the precise rules are for, say, selection. But what follows will apply to some kind of operation, and for me gets at the real unknown here: what are the possible alternatives?

> This means that whether the sequence is rendered stacked or linear, the editing behavior will be the same; if it is a single grapheme cluster then it will be treated as single unit during editing.

The argument from indic experts seems to be that the Tamil word இந்த should segment as V+Cv+C (rather than V+CvC), but the word ஆக்ஷன் should segment as V+CvC+Cv. The distinguishing feature in this case is that the latter word uses one of the two conjuncts that appear in modern Tamil text.

Similarly, if a word such as हड्डियाँ in hindi is written with a font that has a conjunct form it should be segmented as C+CvCV+CVM, but if it doesn't have a conjunct form (stacking, half-forms, ligatures), then it should be segmented as C+Cv+CV+CVM. The difference comes from what the font is capable of, rather than from the sequence of code points.
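The font-dependent segmentation described above can be sketched as follows. This is an illustration under stated assumptions, not how any shaping engine or editor is known to work: `font_conjuncts` is a hypothetical set of consonant pairs that a given font renders as a conjunct, and `base_clusters` is a simplified base-plus-marks grouping rather than the UAX #29 rules.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
# Assumed encoding of the Hindi word from the example above.
WORD = "\u0939\u0921\u094D\u0921\u093F\u092F\u093E\u0901"


def base_clusters(text):
    """Simplified grouping: a base character plus any following marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def segment(text, font_conjuncts):
    """Merge C+virama with the next cluster when the (hypothetical) font joins the pair."""
    out = []
    for cl in base_clusters(text):
        prev = out[-1] if out else ""
        joins = (
            len(prev) >= 2
            and prev.endswith(VIRAMA)
            and (prev[-2], cl[0]) in font_conjuncts
        )
        if joins:
            out[-1] += cl
        else:
            out.append(cl)
    return out


# Font with a DDA+DDA conjunct: three segments (C + CvCV + CVM).
print(len(segment(WORD, {("\u0921", "\u0921")})))
# Font without that conjunct: four segments (C + Cv + CV + CVM).
print(len(segment(WORD, set())))
```

The same input string yields different segmentations purely because the second argument differs, which is the point being made: the deciding information lives in the font, not in the code point sequence.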

It may be plausible to deal with the Tamil cases (for modern Tamil at least) by recognising the specific code points in use and tailoring the grapheme cluster definition (for Tamil), but that doesn't help with the Devanagari example at all.

> This was solved in the earlier software and definitions that used the Indian national standard for Indic scripts.

I'd really like to know how they solved that? Presumably such a solution could be equally well applied to Unicode text.

> The individual font has a lot of control over how this sequence is displayed. Not only is it possible to display the sequence in the two ways described by @vivekpani, but also it is possible to force an older display convention in which the ि reordering is not blocked by the presence of the explicit halant in linear layout but instead moves to before the first ड.

See my comments and questions related to this at #51

@kojiishi

@vivekpani @lianghai Unicode is not about character encoding nor about display, it is actually about editing, so your argument that this is not a Unicode issue because it is about editing doesn't sound reasonable to me. I know some software does it differently; WebKit and Blink have/had a separate rule for editing, but it's been a while since Blink stopped doing it, and I'd like to solve it too.

Let me put the question the opposite way. What would be the problem if Unicode changed the grapheme cluster definition, if doing so does not affect rendering at all?

@asmusf

asmusf commented Mar 14, 2019

Unicode is at heart a character encoding only. However, it comes with some auxiliary specifications. Some, like Bidi, are needed so authors can correctly encode text (not just individual letters). It's for good reason that conformance to them is required.

Others, like line breaking and grapheme clusters, are less universal; they describe a default, but software can and will deviate from (or tailor) the standard definition. They are still strongly associated with the character encoding for two practical reasons.

One: they rely on properties which must be extended every time a character is added.
Two: there are special characters that are encoded primarily or solely for their effect on segmentation (e.g. on line, word or cluster breaking). Because of that, the algorithm serves as a description of their intended function - and therefore describes the identity of what is being encoded.

The fact that these specifications are strongly associated with encoding in the Unicode Standard is not a good predictor of how best to fit them into the layers of a software architecture, especially as tailoring may be language, region or font specific.

@kojiishi

Thank you for the explanation. Maybe my English problem but I still don't understand the problem when we changed grapheme cluster. IIUC you explained it's possible to tailor, and I agree with that, but didn't explain why we can't change. Did I understand correctly?

@r12a
Contributor

r12a commented Mar 14, 2019

> Maybe my English problem but I still don't understand the problem when we changed grapheme cluster.

@kojiishi Don't the use cases in #34 (comment) answer your question?

@kojiishi

@r12a

> > Maybe my English problem but I still don't understand the problem when we changed grapheme cluster.
>
> @kojiishi Don't the use cases in #34 (comment) answer your question?

It has two cases, can you tell me which one you're talking about?

I have to admit I have little knowledge on these languages, so sorry if I'm missing the point, but the former (the Tamil word...) looks like a problem in the segmentation algorithm. I still don't see it be the reason not to change grapheme cluster. We can just add a rule to UAX.

The latter (Similarly, if a word such as...) looks like a different problem to me: IIUC, you want to segment the exact same sequence of code points differently when the fonts in use have different capabilities. If this is a requirement, I agree we need to handle this in a higher layer than Unicode. I think such algorithm should be defined in OpenType spec then, and we will need to refer to it.

Did I understand your examples correctly?

@kojiishi

> I think such algorithm should be defined in OpenType spec then

That reminded me that OpenType does have specs for its own clusters today. Can you check the "Script development specs" of OpenType (looks like the direct link is gone, but you can go to the OpenType spec at Microsoft and check "Script development specs" in the left navigation bar) and see if they satisfy the requirements?

@r12a
Contributor

r12a commented Mar 15, 2019

> I have to admit I have little knowledge on these languages, so sorry if I'm missing the point, but the former (the Tamil word...) looks like a problem in the segmentation algorithm. I still don't see it be the reason not to change grapheme cluster. We can just add a rule to UAX.

@kojiishi In the Tamil example, the character sequences use the same basic code point types for the first 4 characters: இந்த VCvC, and ஆக்ஷன் VCvC(Cv). However, the first word is segmented as V+Cv+C, whereas the second is segmented as V+CvC+... What drives that difference is the use of a க்ஷ conjunct (ligated sequence of glyphs) rather than க்‌ஷ in the second word. Conjuncts are not normally split during segmentation. However, there is nothing in the code point sequence to indicate that we're using a conjunct rather than 2 consonants with visible virama.

In modern Tamil, it just so happens that there are only two consonant clusters that form conjuncts, rather than showing the virama, so in theory one could identify the specific code point sequences (eg. k͓‌ʂ) to tailor the segmentation. (Actually 3, because of ambiguous writing practises for ks.)
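A minimal Python sketch of the code-point-based tailoring idea for modern Tamil. It assumes the two words are encoded as shown in the escapes, and it lists only the ka + virama + ssa sequence from the example; `TAILORED_CONJUNCTS`, `base_clusters` and `segment` are hypothetical names, and the base grouping is a simplification rather than the UAX #29 rules.

```python
import unicodedata

TAMIL_VIRAMA = "\u0BCD"  # TAMIL SIGN VIRAMA (pulli)

# Illustrative tailoring list: only KA + VIRAMA + SSA from the example above.
TAILORED_CONJUNCTS = {"\u0B95\u0BCD\u0BB7"}

# Assumed encodings of the two Tamil words from the example.
INDA = "\u0B87\u0BA8\u0BCD\u0BA4"                 # V C v C
ACTION = "\u0B86\u0B95\u0BCD\u0BB7\u0BA9\u0BCD"   # V C v C C v


def base_clusters(text):
    """Simplified grouping: a base character plus any following marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def segment(text, tailored=TAILORED_CONJUNCTS):
    """Keep a listed C+virama+C sequence together; split everything else."""
    out = []
    for cl in base_clusters(text):
        prev = out[-1] if out else ""
        if prev.endswith(TAMIL_VIRAMA) and prev[-2:] + cl[0] in tailored:
            out[-1] += cl  # listed conjunct: do not split after the virama
        else:
            out.append(cl)
    return out


print(len(segment(INDA)))    # V + Cv + C
print(len(segment(ACTION)))  # V + CvC + Cv
```

Because the decision is driven by a fixed list of code point sequences, it can only be as complete as that list.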

However, a font used to represent older Tamil writing will contain an unspecified number of additional conjuncts which should be treated like k͓‌ʂ, even though the code point sequences are identical to non-conjunct forms when using a modern font. For that use case, the tailored segmentation rules will fail.

It comes down to the fact that the final rendered state of the text is what influences the segmentation, rather than the sequence of code points used.

This is very much more the case for north indic scripts such as Devanagari, Bengali, etc, which have many more conjunct forms and much more variation in how consonant cluster sequences are rendered. This is what i mention in the second case i described above.

Does that clarify a little?

@kojiishi

Not much unfortunately, you said:

> It may be plausible to deal with the Tamil cases (for modern Tamil at least) by recognising the specific code points in use and tailoring the grapheme cluster definition (for Tamil), but that doesn't help with the Devanagari example at all.

Now you seem to say it's not possible to distinguish the Tamil case from the code points alone?

But that part is probably ok. Your 2nd example says it depends on font features (correct?) and whether the 1st example is the same as 2nd or not doesn't seem to matter much at this point. So let's focus on the 2nd example.

For Tamil, this is the OpenType Script development spec for Tamil. The first step is to analyze the input text and break it into syllable clusters. It then applies font features, computes ligatures, and combines marks.

For Devanagari, this is the spec for Devanagari. The spec looks identical to the Tamil spec, though I haven't really compared them.

If these rules provide the expected behavior, we can say "caret movement should use OpenType clusters rather than Unicode grapheme clusters". Some questions remain, such as whether mid-word breaks (such as word-break: break-all) should use Unicode grapheme clusters or OpenType clusters, but we can probably figure this out later.

The OpenType spec is basically what apps do when rendering, so using it seems to suffice when you said:

> It comes down to the fact that the final rendered state of the text is what influences the segmentation, rather than the sequence of code points used.

but if the requirement is different from rendering, I'd like to know what are the differences.

@litherum

litherum commented Mar 15, 2019

There are three distinct concepts:

  1. Performing editing commands based on Unicode segmentation rules applied to the source string.
  2. Performing editing commands based on the OpenType segmentation rules applied to the source string.
  3. Performing editing commands based on the glyph sequences and placement that end up getting rendered on the screen.

Option 3 is particularly problematic from an implementation point-of-view. First, it requires editing commands to be font-specific. Second, glyph placements have no semantic meaning; glyphs are an implementation detail of a font. Different fonts will render the same text visually similarly, but use a wildly different number of glyphs to do it. Trying to reverse-engineer the effects of a GPOS table in order to determine how editing commands work means introducing fragile heuristics into text engines.

@alolita alolita changed the title Devanagari: 2.8.1 Grapheme Clusters - incorrect layer for this requirement Devanagari: 3.7.1 Grapheme Clusters - incorrect layer for this requirement Apr 6, 2019
9 participants