-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalisation for Case-Insensitive Comparison #172
Comments
Thank you, Richard57. I've been struggling trying to understand the "right" way to do this. The Unicode Character Database's Case-Folding.txt file clearly states at the top: NOTE: case folding does not preserve normalization formats!So the 3.1 Matching Algorithm which puts normalization first seems strange, because after case folding, the string may no longer be normalized, and this is clearly wrong for a string matching algorithm. However, I haven't found a reference that normalization prior to case folding is or is not required for consist case folding results. Do you have a reference for U+0345 being the only exception to requiring normalization before case folding? |
Sorry, I've no reference. I had to work it out for myself, and the UTC basically ignored me when I pointed out that case folding as defined did not respect canonical equivalence. |
I found it on the last page here: http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf D145 A string X is a canonical caseless match for a string Y if and only if: The invocations of canonical decomposition (NFD normalization) before case folding in |
For completeness, I should also point out that compatibility caseless matching is more complicated, going beyond ypogegrammeni as a troublesome character. D146 A string X is a compatibility caseless match for a string Y if and only if: |
@EricSharkey/@Richard57 I'm aware that case fold doesn't preserve normalization. However, I had overlooked D145. I suspect we need to rework the algorithm in light of this. Providing canonical (or compatibility) equivalence should at least be an option for users of the spec. That suggests four versions of the "algorithm":
Thoughts? |
…case fold normalized forms of matching)
Initial cut in my forked editors's copy aphillips@b359ebb Comments welcome as I start to work through the re-factor. |
I think this is addressed by the current version. Please comment as needed. |
I'm working on both the WASI filename spec and a best-practices RFC for filesystem i18n. In the i18n working group, normalization of filenames to NFC was seen as a viable option (in contrast to Apple's NFD normalization). I came here to track down this discrepancy between charmod and the Unicode caseless matching spec, as I wanted to know if NFC normalization could be substituted for NFD. @aphillips PR specifically skips NFC normalization post-casefolding. I think such a step would be more idiot proof and should be preferred if there are no compatibility concerns. Thoughts? |
@indolering Note that the PR you link says NFD, but that was not the last PR. The text that is live specifies NFC. (Specifically NFD -> casefold -> NFC). I drew an action item in today's I18N teleconference to supply additional clarifying text and a worked example. Thanks! |
Either way, it would be wise to ask Unicode to clarify the standard, as the wording "if and only if" is logician speak for requiring that exact series of steps, which would make the current EDIT: I believe you have to do NFD beforehand because I think the casefold data mappings are only defined for decomposed characters. An optimized implementation could skip a lot of this, however. |
Chapter 5 Implementation Guidelines, section 18 Case Mappings, page #240:
So you are fine. |
In Section 3.1 Step 3 'Normalisation', do we really want the case-insensitive Unicode full case folding comparison of "ᾳ͙" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+0345 COMBINING GREEK YPOGEGRAMMENI> and "α͙ι" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+03B9 GREEK SMALL LETTER IOTA> to depend on the choice of normalisation? NF(K)C yields 'different', while NF(K)D yields 'identical'. (The combining mark U+0359 was added to support the retranscription of deteriorated Greek manuscripts.)
The sequence of Step 4, "case-folding" and Step 6 "compare code points" does not work properly. For example, in the comparison of the NFC strings "sś" <U+0073, U+015B> and "ß́" <U+00DF latin small letter sharp s, U+0301>, default case-folding yields the strings <U+0073, U+015B> and <U+0073, U+0073, U+0301>. However, converting to NFD and then case-folding would yield <U+0073, U+0073, U+0301> for both strings.
Normalisation is required after case-folding. By contrast, apart from strings containing U+0345 when fully decomposed, normalisation (i.e. NFC/NFD) is not required before case-folding. However, compatibility decomposition, if applied, would be required before case-folding.
The text was updated successfully, but these errors were encountered: