Not all precomposed characters are reachable by NFC #190

r12a · 2019-01-17T17:39:30Z

https://w3c.github.io/charmod-norm/#composition_decomposition

Text in a Unicode character encoding form (such as UTF-8 or UTF-16) is said to be in NFC if it doesn't contain any combining sequence that could be replaced with a precomposed character ...

Not entirely true. E.g., Indic characters such as U+09DC BENGALI LETTER RRA decompose into consonant + nukta, but are not recomposed by NFC.

Is it worth tweaking the text to accommodate that?
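For anyone who wants to check the behaviour locally, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are just for illustration, and the results depend on the Unicode data bundled with your Python build.

```python
import unicodedata

rra = "\u09DC"  # BENGALI LETTER RRA, a precomposed character

# Canonical decomposition splits it into consonant + nukta.
decomposed = unicodedata.normalize("NFD", rra)

# NFC decomposes first and then recomposes where the rules allow it.
recomposed = unicodedata.normalize("NFC", rra)

print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in recomposed])
print("NFC restores the precomposed character:", recomposed == rra)
```

If the issue description is right, the last line reports `False`: the NFC form is the two-character sequence, not U+09DC.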

aphillips · 2019-01-20T01:01:49Z

Fair enough. I just removed the sentence, as it doesn't appear to add anything, and made some adjustments to the adjacent text.

Addressed w3c#190

asmusf · 2019-01-21T05:39:31Z

It's worth mentioning that, as too many people get the wrong idea about NFC. The more places where you can find out that the truth is more nuanced, the better.

aphillips · 2019-01-21T06:03:04Z

@asmusf too true. But we say it several times in a row in the remaining text. Have a look at https://w3c.github.io/charmod-norm/#composition_decomposition and note that the removed text was fairly redundant to begin with.

asmusf · 2019-01-22T01:27:00Z

I find that statements like the following do NOT cover the case of "composition exceptions", only the effect of canonical reordering:

"Users are cautioned that the resulting character sequence can still contain combining marks: not all character sequences have a precomposed equivalent and some scripts depend on combining marks for encoding. There are even cases where a given base character and combining mark is not replaced with a precomposed character because the combination is "blocked" by another combining mark in the sequence."

If you would like to extend the discussion in the note to cover the example of BENGALI, that would be good. (It would help to introduce the concept of these "composition exceptions" by name). Most of them are found in certain complex scripts and we find that they are completely off the radar even of users of those scripts.
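To make the distinction concrete, here is a rough sketch in Python (standard `unicodedata` only). The three sample strings are my own picks for illustration, not taken from the spec: one where an intervening mark blocks composition, one where an intervening mark does not block it, and one where nothing intervenes but the pair is still left uncomposed.

```python
import unicodedata

def nfc(text):
    """Shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", text)

def show(label, text):
    """Print the code points of a string before and after NFC."""
    before = " ".join(f"U+{ord(c):04X}" for c in text)
    after = " ".join(f"U+{ord(c):04X}" for c in nfc(text))
    print(f"{label}: {before} -> {after}")

# Blocked: overline and acute share combining class 230, so the overline
# sits between 'a' and the acute and prevents a + acute from composing.
blocked = "a\u0305\u0301"

# Not blocked: the horn has a lower combining class (216) than the acute,
# and there is no precomposed 'a with horn', so a + acute still composes.
not_blocked = "a\u031B\u0301"

# Excluded: nothing intervenes and U+09DC exists, yet NFC leaves the pair alone.
excluded = "\u09A1\u09BC"

show("blocked", blocked)
show("not blocked", not_blocked)
show("excluded", excluded)
```

Only the third case is a composition exclusion; the first two come from combining classes and the ordering of marks.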

When data MUST be in NFC, as in IDNA 2008, we discover that people file bugs . . . so this is something that desperately needs to be covered if explaining normalization at this level of detail.

Addresses #190

aphillips · 2019-01-23T16:47:13Z

Try the text in the commit I just merged.

@asmusf I agree that the text was just specific enough to cause confusion for close observers of the effect of normalization. I tend to disagree that we are describing normalization at that level of detail.

So... I converted the note to a warning so that it is blindingly visible. I rephrased the warning part to include the concept of exceptions, although I didn't termify "composition exception". And I added a final sentence about the purpose of NFC:

What NFC gives the user is a string that can be compared to other NFC strings for equality with the minimum number of combining marks for that purpose.

@r12a Any thoughts?

r12a · 2019-01-23T17:24:27Z

What NFC gives the user is a string that can be compared to other NFC strings for equality with the minimum number of combining marks for that purpose.

Hmm. That's not actually true either, since precomposed characters exist for nukta combinations in scripts like Devanagari and Bengali but NFC doesn't use them. So it's not the minimum number of combining marks in that case, because you could have 0 by using the precomposed character.
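A quick way to see this, sketched in Python with the standard `unicodedata` module. The helper names `changed_by_nfc` and `count_marks` are made up for this comment, and the scan is limited to the Devanagari and Bengali blocks (U+0900 to U+09FF) as an assumption to keep it short.

```python
import unicodedata

def changed_by_nfc(first, last):
    """Return the characters in [first, last] that NFC replaces with something else."""
    return [
        chr(cp)
        for cp in range(first, last + 1)
        if unicodedata.normalize("NFC", chr(cp)) != chr(cp)
    ]

def count_marks(text):
    """Count characters with a non-zero canonical combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)

unreachable = changed_by_nfc(0x0900, 0x09FF)

for ch in unreachable:
    nfc = unicodedata.normalize("NFC", ch)
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}: {count_marks(ch)} marks precomposed, "
          f"{count_marks(nfc)} marks in NFC")
```

Each character listed has zero combining marks in its precomposed form and at least one in its NFC form, which is the opposite of "minimum number of combining marks".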

Actually, i think that the rationale behind NFC has more to do with a nominal compatibility with legacy standards. So i guess i should try to suggest something. How about this (changes signalled using bold):

These two types of Unicode-defined equivalence are then grouped by another pair of variations: "decomposition" and "composition". In "decomposition", separable logical parts of a visual character are broken out into a sequence of base characters and combining marks and the resulting code points are put into a fixed, canonical order. In "composition", the decomposition is performed and then combining marks are recombined according to certain rules with their base characters.

Roughly speaking, NFC is defined such that each combining character sequence (a base character followed by one or more combining characters) is replaced, as far as possible, by a canonically equivalent precomposed character.

It is rather important to notice what this does not mean. The resulting character sequence can still contain combining marks, since not all character sequences have a precomposed equivalent. Indeed, as we've seen, many scripts offer no alternative to the use of combining marks, such as the Devanagari vowels in this example. In other cases, a given base character and combining mark is not replaced with a precomposed character because the combination is blocked by normalization rules. For example, some Indic scripts do not compose certain sequences of base plus diacritic, even though a matching precomposed character exists, due to composition exclusion rules. Composition may also be blocked by another combining mark between the two characters that would otherwise combine.
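As a rough illustration of the proposed wording, the following Python sketch (standard `unicodedata` only; the sample strings and the `marks` helper are illustrative assumptions) shows decomposition putting marks into canonical order, composition recombining where it can, and combining marks surviving NFC where no precomposed form is available.

```python
import unicodedata

def marks(text):
    """Return the characters whose general category is a Mark (Mn, Mc, Me)."""
    return [ch for ch in text if unicodedata.category(ch).startswith("M")]

def codepoints(text):
    return " ".join(f"U+{ord(c):04X}" for c in text)

samples = {
    "e with acute": "\u00E9",                    # has a precomposed form
    "q + dot above + dot below": "q\u0307\u0323",  # no precomposed form; marks get reordered
    "Devanagari ka + vowel sign i": "\u0915\u093F",  # script relies on combining marks
}

for label, text in samples.items():
    nfd = unicodedata.normalize("NFD", text)
    nfc = unicodedata.normalize("NFC", text)
    print(label)
    print("  NFD:", codepoints(nfd))
    print("  NFC:", codepoints(nfc), f"({len(marks(nfc))} marks remain)")
```

Only the first sample ends up free of combining marks in NFC; the other two keep theirs because there is nothing to compose them into.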

I'd just omit the paragraph you quoted in the previous comment. "What NFC gives the user..."

Addressed #190 with @r12a's text.

aphillips · 2019-01-23T18:05:28Z

@r12a I took your text verbatim. Thanks!

asmusf · 2019-01-23T18:14:10Z

Distinct improvement.

aphillips self-assigned this Jan 20, 2019

aphillips added a commit to aphillips/charmod-norm that referenced this issue Jan 20, 2019

Adjusted fallback order of large example text to use NoToFu font first.

1f82fa8

Addressed w3c#190

aphillips added a commit to aphillips/charmod-norm that referenced this issue Jan 23, 2019

Addresses w3c#190

03a8ea2

aphillips added a commit that referenced this issue Jan 23, 2019

Merge pull request #194 from aphillips/gh-pages

ca2e235

Addresses #190

aphillips added a commit to aphillips/charmod-norm that referenced this issue Jan 23, 2019

Addressed w3c#190 with @r12a's text.

b65d8ac

aphillips added a commit that referenced this issue Jan 23, 2019

Merge pull request #195 from aphillips/gh-pages

4b2f4a3

Addressed #190 with @r12a's text.

r12a added the close? label Jan 23, 2019

aphillips closed this as completed Jan 30, 2019
