
Not all precomposed characters are reachable by NFC #190

Closed
r12a opened this issue Jan 17, 2019 · 8 comments

@r12a
Contributor

r12a commented Jan 17, 2019

https://w3c.github.io/charmod-norm/#composition_decomposition

Text in a Unicode character encoding form (such as UTF-8 or UTF-16) is said to be in NFC if it doesn't contain any combining sequence that could be replaced with a precomposed character ...

Not entirely true. E.g., Indic characters such as U+09DC BENGALI LETTER RRA decompose into consonant + nukta but are not recomposed by NFC.

Is it worth tweaking the text to accommodate that?
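
A minimal sketch of the behaviour described above, using Python's standard `unicodedata` module. The choice of Python and the variable names are assumptions for illustration and are not part of the charmod-norm text.

```python
import unicodedata

rra = "\u09DC"  # BENGALI LETTER RRA (precomposed)
nfd = unicodedata.normalize("NFD", rra)
nfc = unicodedata.normalize("NFC", rra)

# Both forms yield the same two code points:
# U+09A1 BENGALI LETTER DDA + U+09BC BENGALI SIGN NUKTA
print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfc])

# NFC does not give back the precomposed character
print(nfc == rra)
```

So a string holding only U+09DC is itself not in NFC, even though it contains no combining sequence.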

@aphillips
Contributor

Fair enough. I just removed the sentence, as it doesn't appear to add anything and did some adjustments to the adjacent text.

@aphillips aphillips self-assigned this Jan 20, 2019
aphillips added a commit to aphillips/charmod-norm that referenced this issue Jan 20, 2019
@asmusf

asmusf commented Jan 21, 2019

It's worth mentioning that, as too many people get the wrong idea about NFC. The more places where you can find out that the truth is more nuanced, the better.

@aphillips
Contributor

@asmusf too true. But we say it several times in a row in the remaining text. Have a look at https://w3c.github.io/charmod-norm/#composition_decomposition and note that the removed text was fairly redundant to begin with.

@asmusf

asmusf commented Jan 22, 2019

I find statements like this:

"Users are cautioned that the resulting character sequence can still contain combining marks: not all character sequences have a precomposed equivalent and some scripts depend on combining marks for encoding. There are even cases where a given base character and combining mark is not replaced with a precomposed character because the combination is "blocked" by another combining mark in the sequence."

Which does NOT cover the case of "composition exceptions", but only the effect of canonical reordering.

If you would like to extend the discussion in the note to cover the example of BENGALI, that would be good. (It would help to introduce the concept of these "composition exceptions" by name). Most of them are found in certain complex scripts and we find that they are completely off the radar even of users of those scripts.

When data MUST be in NFC, as in IDNA 2008, we discover that people file bugs . . . so this is something that desperately needs to be covered if explaining normalization at this level of detail.
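
As a rough illustration of why this surprises people when data must be in NFC, here is a sketch of a generic check. The helper names `is_nfc` and `singleton_offenders` are hypothetical, and this is not how any particular IDNA implementation validates labels.

```python
import unicodedata

def is_nfc(text):
    """True if NFC normalization leaves `text` unchanged."""
    return unicodedata.normalize("NFC", text) == text

def singleton_offenders(text):
    """Code points that NFC rewrites even in isolation (hypothetical helper).

    This per-character check does not catch sequences that only change
    in context; it is meant to surface cases like composition exclusions.
    """
    return [f"U+{ord(ch):04X}" for ch in text
            if unicodedata.normalize("NFC", ch) != ch]

label = "\u09AC\u09DC"  # BENGALI LETTER BA + precomposed BENGALI LETTER RRA
print(is_nfc(label))
print(singleton_offenders(label))
```

A user who typed the precomposed letter would see the label rejected or silently rewritten, which is the kind of situation that leads to bug reports.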

aphillips added a commit to aphillips/charmod-norm that referenced this issue Jan 23, 2019
aphillips added a commit that referenced this issue Jan 23, 2019
@aphillips
Contributor

Try the text in the commit I just merged.

@asmusf I agree that the text was just specific enough to cause confusion for close observers of the effect of normalization. I tend to disagree that we are describing normalization at that level of detail.

So... I converted the note to a warning so that it is blindingly visible. I rephrased the warning part to include the concept of exceptions, although I didn't termify "composition exception". And I added a final sentence about the purpose of NFC:

What NFC gives the user is a string that can be compared to other NFC strings for equality with the minimum number of combining marks for that purpose.

@r12a Any thoughts?

@r12a
Contributor Author

r12a commented Jan 23, 2019

What NFC gives the user is a string that can be compared to other NFC strings for equality with the minimum number of combining marks for that purpose.

Hmm. That's not actually true either, since precomposed characters exist for nukta combinations in scripts like Devanagari and Bengali but NFC doesn't use them. So it's not the minimum number of combining marks in that case, because you could have 0 by using the precomposed character.
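
To make the counting argument concrete, here is a small sketch using Python's `unicodedata` module. The helper `combining_mark_count` is a hypothetical name, and "combining mark" is approximated here as any code point with a non-zero canonical combining class.

```python
import unicodedata

def combining_mark_count(text):
    """Count code points with a non-zero canonical combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)

qa = "\u0958"  # DEVANAGARI LETTER QA (precomposed)
nfc_qa = unicodedata.normalize("NFC", qa)  # U+0915 KA + U+093C NUKTA

# The precomposed spelling has no combining marks; the NFC spelling has one
print(combining_mark_count(qa), combining_mark_count(nfc_qa))
```

The NFC form therefore carries one combining mark where the precomposed character would carry none.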

Actually, I think that the rationale behind NFC has more to do with a nominal compatibility with legacy standards. So I guess I should try to suggest something. How about this (changes signalled using bold):


These two types of Unicode-defined equivalence are then grouped by another pair of variations: "decomposition" and "composition". In "decomposition", separable logical parts of a visual character are broken out into a sequence of base characters and combining marks and the resulting code points are put into a fixed, canonical order. In "composition", the decomposition is performed and then combining marks are recombined according to certain rules with their base characters.

Roughly speaking, NFC is defined such that each combining character sequence (a base character followed by one or more combining characters) is replaced, as far as possible, by a canonically equivalent precomposed character.
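
The decomposition, canonical ordering, and recomposition steps described above can be sketched with Python's `unicodedata` module. The sample strings are assumptions chosen for illustration.

```python
import unicodedata

# Precomposed e-acute decomposes to base + combining mark, then recomposes
e_acute = "\u00E9"
e_nfd = unicodedata.normalize("NFD", e_acute)   # "e" + U+0301
e_nfc = unicodedata.normalize("NFC", e_nfd)     # back to U+00E9

# Marks typed in a non-canonical order: acute (ccc 230) before dot below (ccc 220)
typed = "a\u0301\u0323"
typed_nfd = unicodedata.normalize("NFD", typed)  # reordered: a, U+0323, U+0301
typed_nfc = unicodedata.normalize("NFC", typed)  # U+1EA1 (a with dot below) + U+0301

for s in (e_nfd, e_nfc, typed_nfd, typed_nfc):
    print([f"U+{ord(c):04X}" for c in s])
```

In the last case NFC composes the base with the dot below but leaves the acute as a combining mark, because no precomposed character exists for the full sequence.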

It is rather important to notice what this does not mean. The resulting character sequence can still contain combining marks, since not all character sequences have a precomposed equivalent. Indeed, as we've seen, many scripts offer no alternative to the use of combining marks, such as the Devanagari vowels in this example. In other cases, a given base character and combining mark is not replaced with a precomposed character because the combination is blocked by normalization rules. For example, some Indic scripts do not compose certain sequences of base plus diacritic, even though a matching precomposed character exists, due to composition exclusion rules. Composition may also be blocked by another combining mark between the two characters that would otherwise combine.
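
The blocking case in the last sentence can be sketched as follows, again with Python's `unicodedata` module. The sample sequence is an assumption chosen for illustration: U+0305 COMBINING OVERLINE has the same combining class as U+0301 COMBINING ACUTE ACCENT and has no precomposed form with "a".

```python
import unicodedata

# Without an intervening mark, base + acute composes to U+00E1
plain = unicodedata.normalize("NFC", "a\u0301")

# With an overline between them, the acute is blocked from the base,
# so the sequence is left as three code points
blocked_input = "a\u0305\u0301"
blocked = unicodedata.normalize("NFC", blocked_input)

print([f"U+{ord(c):04X}" for c in plain])
print([f"U+{ord(c):04X}" for c in blocked])
```
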

I'd just omit the paragraph you quoted in the previous comment. "What NFC gives the user..."

aphillips added a commit to aphillips/charmod-norm that referenced this issue Jan 23, 2019
aphillips added a commit that referenced this issue Jan 23, 2019
@aphillips
Contributor

@r12a I took your text verbatim. Thanks!

@asmusf

asmusf commented Jan 23, 2019

Distinct improvement.

@r12a r12a added the close? label Jan 23, 2019