Canonical Ordering of Marks in Thai Script #18

Open
r12a opened this issue Aug 29, 2018 · 8 comments
Labels
i:encoding Characters & encoding s:thai

Comments

@r12a
Contributor

r12a commented Aug 29, 2018

I'm raising this issue to bring attention to a document by Peter Constable which is going through the Unicode committees. It's certainly something I have thought about before, and something that may well apply to other scripts too.

https://www.unicode.org/L2/L2018/18216-thai-order.pdf

It set me thinking that any time you have a sequence that looks the same but is not re-ordered during normalisation you have a problem for matching strings and possibly also security. I suspect that it would be useful to further constrain ordering in fonts so that anything that doesn't follow the rule becomes visually evident to the user. I also suspect that it might be appropriate to have similar rules for other SE Asian scripts.

I noticed the following interesting behaviour across 3 fonts, two of which are from the same stable. In each case the order of characters is:
U+0E01 THAI CHARACTER KO KAI
U+0E34 THAI CHARACTER SARA I
U+0E38 THAI CHARACTER SARA U

U+0E01 THAI CHARACTER KO KAI
U+0E35 THAI CHARACTER SARA II
U+0E48 THAI CHARACTER MAI EK

U+0E01 THAI CHARACTER KO KAI
U+0E48 THAI CHARACTER MAI EK
U+0E35 THAI CHARACTER SARA II

Noto Serif Thai
screen shot 2018-08-29 at 16 58 17

A webfont created from an older version of Noto Sans Thai
screen shot 2018-08-29 at 16 59 38

Ayuthaya
screen shot 2018-08-29 at 17 00 23

Some interesting variations on the theme there, in some cases preventing you from seeing that there's a different underlying order of code points, in others preventing you from seeing that you've done something that's not 'normal'.
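To make the normalisation point concrete, here is a minimal sketch using Python's standard `unicodedata` module. It prints the canonical combining class of each mark in the sequences listed above and checks whether NFC treats two orderings as the same string. The helper names (`ccc_table`, `same_after_nfc`) are made up for this illustration, and the exact values depend on the Unicode data shipped with your Python version.

```python
import unicodedata

KO_KAI, SARA_I, SARA_II = "\u0e01", "\u0e34", "\u0e35"
SARA_U, MAI_EK = "\u0e38", "\u0e48"

def ccc_table(text):
    """Return (code point, name, canonical combining class) per character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.combining(c))
            for c in text]

def same_after_nfc(a, b):
    """True if the two strings are identical once both are NFC-normalised."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Above vowel and tone mark, in both orders (the second and third sequences).
vowel_then_tone = KO_KAI + SARA_II + MAI_EK
tone_then_vowel = KO_KAI + MAI_EK + SARA_II

# Below vowel and tone mark, in both orders (added here for comparison).
below_then_tone = KO_KAI + SARA_U + MAI_EK
tone_then_below = KO_KAI + MAI_EK + SARA_U

for row in ccc_table(KO_KAI + SARA_I + SARA_II + SARA_U + MAI_EK):
    print(row)

print(same_after_nfc(vowel_then_tone, tone_then_vowel))
print(same_after_nfc(below_then_tone, tone_then_below))
```

As far as I can tell, SARA I and SARA II have combining class 0, so normalisation never moves a tone mark across them and the two orderings stay distinct strings. SARA U and MAI EK both have non-zero classes, so that pair is reordered into a single form.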

@r12a r12a added the s:thai label Aug 29, 2018
@r12a
Contributor Author

r12a commented Aug 30, 2018

The upshot of this, for me, is that perhaps we should have more clearly defined, language/script-specific rules about the order in which items should be typed and stored, and make it more obvious to the user when that order is not followed. I don't think we should absolutely prohibit people from, say, putting a tone mark before a nikhahit, but we shouldn't just silently fix the problem in the font either: the font should show that something unexpected happened.

There's a similar discussion over in the indic layout issue list at the moment, since it's possible to create things that look like a single character by combining other characters that Unicode doesn't expect, or in some cases warns against. (Read more at w3c/iip#21 (comment) if you're interested.)

I think these kinds of rules have the potential (and it still needs further thought and discussion) to reduce problems for searching text, or matching text, by helping users type characters in an expected order.
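As a rough illustration of what such a rule could look like in software, here is a small sketch that flags places where a tone mark is directly followed by a mark that would normally be expected before it. The rule itself is only an assumption for this example, built from the cases mentioned in this thread (a tone mark before an above or below vowel, or before a nikhahit); it is not taken from any specification. The names `TONE_MARKS`, `MARKS_BEFORE_TONE` and `unexpected_order` are hypothetical.

```python
# Tone marks U+0E48..U+0E4B (MAI EK, MAI THO, MAI TRI, MAI CHATTAWA).
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}

# Assumption for this sketch: marks expected to come before a tone mark.
# MAI HAN-AKAT, SARA I..SARA UU, and NIKHAHIT.
MARKS_BEFORE_TONE = {
    "\u0e31", "\u0e34", "\u0e35", "\u0e36", "\u0e37",
    "\u0e38", "\u0e39", "\u0e4d",
}

def unexpected_order(text):
    """Return the indexes of tone marks that directly precede a mark
    which this (hypothetical) rule expects to come first."""
    return [
        i for i in range(len(text) - 1)
        if text[i] in TONE_MARKS and text[i + 1] in MARKS_BEFORE_TONE
    ]

print(unexpected_order("\u0e01\u0e48\u0e35"))  # KO KAI, MAI EK, SARA II
print(unexpected_order("\u0e01\u0e35\u0e48"))  # KO KAI, SARA II, MAI EK
```

A check like this only reports the position; whether to warn, highlight, or offer a correction is left to whatever sits on top of it, which fits the idea of making the problem visible rather than silently fixing it.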

@bact
Contributor

bact commented Jan 30, 2020

Additional cases that may relate to the ability to combine characters, or to how they should be rendered.

  • A) Consonants with a below "island"
    • A.1) ญ 0E0D + ุ 0E38 = ญุ
    • A.2) ฐ 0E10 + ุ 0E38 = ฐุ
  • The following below vowel is replacing the consonant's below island.

Screenshot 2020-01-30 19 48 35

--

  • B) Consonants with below "tail"
    • B.1) ฎ 0E0E + ุ 0E38 = ฎุ
    • B.2) ฏ 0E0F + ุ 0E38 = ฏุ
  • The following below vowel is placed further under the consonant's below tail.
  • Some fonts, like Cordia New, will not show the below vowel for this kind of sequence.

Screenshot 2020-01-30 19 48 42

--

  • C) Base vowels that look like a consonant and have a below "tail"
    • C.1) ฤ 0E24 + ุ 0E38 = ฤุ
    • C.2) ฦ 0E26 + ุ 0E38 = ฦุ
  • The following below vowel is placed further under the base vowel's below tail.
  • Microsoft Word input method doesn't allow this sequence to be typed in (as it is considered an invalid sequence). But it can be copied from elsewhere and pasted into the program.
  • Some fonts, like Cordia New, will not show the below vowel for this kind of sequence.

Screenshot 2020-01-30 19 48 48

--

Case (C) is probably the most relevant to the discussion here. If it is considered an invalid sequence (it should NOT be possible for a base vowel to be modified by a below vowel), what is the sensible way to display it? Should we prevent the two vowels from combining and display them separately instead, like this?
Screenshot 2020-01-30 20 00 29

@ohbendy

ohbendy commented Jan 30, 2020

C is likely to be a font issue, isn't it? Those combinations can't occur in any language I'm aware of, so font makers probably don't include ru and lu in the GPOS mark feature. I certainly wouldn't go to the bother of putting anchors on those bases.

@bact
Contributor

bact commented Jan 31, 2020

I think the sequence in C should be better prevented at the input method level.

But if we have a sequence like that and let the font alone decide how to display it (which can give different visual results), I think we come back to the point about cases "preventing you from seeing that you've done something that's not 'normal'", as @r12a puts it.

@r12a r12a added the i:encoding Characters & encoding label Feb 20, 2020
@PeterConstable

@bact Why prevent people from entering the sequences in C? You might not have a use for it, but someone might.

We shouldn't limit entry of sequences because we don't think they're useful. As long as there's a sensible way the sequence can be displayed, consistent with the general behaviours of the script, then I think it should be supported in implementations. Thai script is more lenient in its structure and encoding than most Brahmi-derived scripts (e.g., the pre-consonantal vowels aren't encoded as combining marks, as they are for most Indic scripts, so there's no structural or encoding-architectural hindrance to handling sequences that contain several of them). Some might argue that my example with sara ii over mai ek is contrary to the structural behaviours of Thai script, but as long as the marks aren't connected to the base (as, e.g., in Devanagari), I don't think we need to limit the possibilities. Except to the extent that Unicode canonical combining classes get in the way, just assume a general mark behaviour for Thai in which above and below marks stack outward.

My thoughts, anyway.

@mcdurdin

Input methods can be an appropriate place to limit sequences. However they do not provide an answer to security issues where invalid sequences can be manually constructed. It is also important to consider that script usage may vary from language to language -- for example in Thai, certain sequences are used by other languages that use the Thai script that would be considered illegal by Thai language orthography. Similarly, ancient texts often require sequences not found in modern texts. The best answer here is probably to have separate input methods for each of the languages and/or use cases rather than trying to cater to all cases in one general input method. That provides the most benefit to the average end user without constraining the users who need to do funky things.

In my opinion, confusables -- where two sequences produce identical or near identical output -- are currently a significant problem for SE Asian and South Asian scripts. See @MakaraSok's Khmer Character Specification and related paper for how big a problem this is for Khmer script.

It is interesting to see that these issues are present to some degree in Thai rendering, even though Thai renderers are typically far more straightforward than Khmer renderers.

Ideally, no two sequences should be confusable at render time. Is this a problem that could be addressed with mathematical proofs?

Side note: the worst place to implement restrictions is at the application level. This results in an explosion of inconsistent restrictions.

@asmusf

asmusf commented Mar 12, 2020

This discussion tacitly assumes that the different sequences are not canonically equivalent to each other. As soon as they are equivalent, as happens in some scripts, any restriction would have to accommodate all correctly normalized sequences; otherwise, normalized data could not be represented (which is, in fact, an issue in some scripts today).
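A small sketch of how one might test a proposed restriction against this concern: run each sample through NFC and see whether the rule still accepts the result. The rule below (`tone_before_below_vowel`) is a deliberately bad, hypothetical restriction invented for this example, and `rejects_normalized_form` is likewise just an illustrative helper.

```python
import unicodedata

def rejects_normalized_form(rule, samples):
    """Return the samples whose NFC form is rejected by the rule."""
    return [s for s in samples
            if not rule(unicodedata.normalize("NFC", s))]

def tone_before_below_vowel(text):
    """Hypothetical (and deliberately bad) restriction:
    MAI EK must never come directly after SARA U."""
    return "\u0e38\u0e48" not in text

samples = [
    "\u0e01\u0e48\u0e38",  # KO KAI, MAI EK, SARA U
    "\u0e01\u0e38\u0e48",  # KO KAI, SARA U, MAI EK
]

for s in rejects_normalized_form(tone_before_below_vowel, samples):
    print([f"U+{ord(c):04X}" for c in s])
```

Because SARA U and MAI EK have different non-zero combining classes, normalization puts SARA U first, so a rule that forbids that order ends up rejecting every normalized form of the pair. Any restriction that passes this kind of check for its target script at least avoids that particular trap.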

@PeterConstable

At the start of this thread, a UTC doc I wrote was cited. I suggest that should be read as background to the thread. It goes into detail on canonical (non-)equivalence for Thai combining sequences.
