Not all precomposed characters are reachable by NFC #190
Fair enough. I just removed the sentence, as it didn't appear to add anything, and made some adjustments to the adjacent text.
It's worth mentioning, as too many people get the wrong idea about NFC. The more places where you can find out that the truth is more nuanced, the better.
@asmusf too true. But we say it several times in a row in the remaining text. Have a look at https://w3c.github.io/charmod-norm/#composition_decomposition and note that the removed text was fairly redundant to begin with.
I find statements like this: "Users are cautioned that the resulting character sequence can still contain combining marks: not all character sequences have a precomposed equivalent and some scripts depend on combining marks for encoding. There are even cases where a given base character and combining mark is not replaced with a precomposed character because the combination is "blocked" by another combining mark in the sequence."

This does NOT cover the case of "composition exceptions", but only the effect of canonical reordering. If you would like to extend the discussion in the note to cover the example of BENGALI, that would be good. (It would help to introduce the concept of these "composition exceptions" by name.) Most of them are found in certain complex scripts, and we find that they are completely off the radar even of users of those scripts. When data MUST be in NFC, as in IDNA 2008, we discover that people file bugs . . . so this is something that desperately needs to be covered if explaining normalization at this level of detail.
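As an aside, the "blocked" effect described in the quoted passage is easy to observe with Python's standard unicodedata module. This is an illustrative sketch (the characters chosen here are my own example, not taken from the charmod text): a combining mark of the same combining class sitting between a base and a later mark prevents the later mark from composing.

```python
import unicodedata

# U+0301 COMBINING ACUTE ACCENT normally composes with "a"
# into the precomposed U+00E1 under NFC.
composed = unicodedata.normalize("NFC", "a\u0301")
assert composed == "\u00e1"

# But U+0305 COMBINING OVERLINE has the same combining class (230),
# so it is NOT reordered past U+0301, and it "blocks" the acute
# from composing with the base. NFC leaves the sequence untouched.
blocked = unicodedata.normalize("NFC", "a\u0305\u0301")
assert blocked == "a\u0305\u0301"
```

The point being: even though a precomposed a-acute exists, NFC cannot reach it here because of the intervening mark.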
Try the text in the commit I just merged. @asmusf I agree that the text was just specific enough to cause confusion for close observers of the effect of normalization. I tend to disagree that we are attempting that detailed a description. So... I converted the note to a warning so that it is blindingly visible. I rephrased the warning part to include the concept of exceptions, although I didn't termify "composition exception". And I added a final sentence about the purpose of NFC:
@r12a Any thoughts?
Hmm. That's not actually true either, since precomposed characters exist for nukta combinations in scripts like Devanagari and Bengali but NFC doesn't use them. So it's not the minimum number of combining marks in that case, because you could have 0 by using the precomposed character. Actually, I think that the rationale behind NFC has more to do with a nominal compatibility with legacy standards. So I guess I should try to suggest something. How about this (changes signalled using bold):

These two types of Unicode-defined equivalence are then grouped by another pair of variations: "decomposition" and "composition". In "decomposition", separable logical parts of a visual character are broken out into a sequence of base characters and combining marks and the resulting code points are put into a fixed, canonical order. In "composition", the decomposition is performed and then combining marks are recombined according to certain rules with their base characters. Roughly speaking, NFC is defined such that each combining character sequence (a base character followed by one or more combining characters) is replaced, as far as possible, by a canonically equivalent precomposed character. It is rather important to notice what this does not mean. The resulting character sequence can still contain combining marks, since not all character sequences have a precomposed equivalent. Indeed, as we've seen, many scripts offer no alternative to the use of combining marks, such as the Devanagari vowels in this example. In other cases, a given base character and combining mark is not replaced with a precomposed character because the combination is blocked by normalization rules. For example, some Indic scripts do not compose certain sequences of base plus diacritic, even though a matching precomposed character exists, due to composition exclusion rules. Composition may also be blocked by another combining mark between the two characters that would otherwise combine.
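The Devanagari nukta case mentioned above can be checked directly with Python's standard unicodedata module. This is a quick illustration under my own choice of character (U+0958 DEVANAGARI LETTER QA, which the UCD lists as a composition exclusion), not text from the spec:

```python
import unicodedata

qa = "\u0958"  # DEVANAGARI LETTER QA (precomposed)

# NFD splits it into KA (U+0915) + NUKTA (U+093C)...
assert unicodedata.normalize("NFD", qa) == "\u0915\u093c"

# ...and NFC does NOT put it back together: U+0958 is a composition
# exclusion, so the precomposed character is unreachable via NFC.
assert unicodedata.normalize("NFC", qa) == "\u0915\u093c"
```

So NFC applied to data containing the precomposed letter actually *increases* the number of combining marks, which is why "minimum number of combining marks" is the wrong framing.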
I'd just omit the paragraph you quoted in the previous comment. "What NFC gives the user..."
@r12a I took your text verbatim. Thanks! |
Distinct improvement.
https://w3c.github.io/charmod-norm/#composition_decomposition
Not entirely true. E.g. Indic characters such as U+09DC BENGALI LETTER RRA decompose into consonant + nukta, but are not recomposed by NFC.
Is it worth tweaking the text to accommodate that?
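For concreteness, the RRA behavior can be demonstrated with Python's standard unicodedata module (a quick sanity check, not part of the proposed spec wording):

```python
import unicodedata

rra = "\u09dc"  # BENGALI LETTER RRA (precomposed)

# NFD decomposes it to DDA (U+09A1) + NUKTA (U+09BC)...
decomposed = unicodedata.normalize("NFD", rra)
assert decomposed == "\u09a1\u09bc"

# ...but NFC does not recompose it: U+09DC is a composition
# exclusion, so even starting from the precomposed character,
# NFC yields the two-code-point sequence.
assert unicodedata.normalize("NFC", rra) == "\u09a1\u09bc"
```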