Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalisation for Case-Insensitive Comparison #172

Closed
Richard57 opened this issue May 11, 2018 · 11 comments
Closed

Normalisation for Case-Insensitive Comparison #172

Richard57 opened this issue May 11, 2018 · 11 comments
Assignees
Labels

Comments

@Richard57
Copy link

In Section 3.1 Step 3 'Normalisation', do we really want the case-insensitive Unicode full case folding comparison of "ᾳ͙" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+0345 COMBINING GREEK YPOGEGRAMMENI> and "α͙ι" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+03B9 GREEK SMALL LETTER IOTA> to depend on the choice of normalisation? NF(K)C yields 'different', while NF(K)D yields 'identical'. (The combining mark U+0359 was added to support the retranscription of deteriorated Greek manuscripts.)

The sequence of Step 4, "case-folding" and Step 6 "compare code points" does not work properly. For example, in the comparison of the NFC strings "sś" <U+0073, U+015B> and "ß́" <U+00DF latin small letter sharp s, U+0301>, default case-folding yields the strings <U+0073, U+015B> and <U+0073, U+0073, U+0301>. However, converting to NFD and then case-folding would yield <U+0073, U+0073, U+0301> for both strings.

Normalisation is required after case-folding. By contrast, apart from strings containing U+0345 when fully decomposed, normalisation (i.e. NFC/NFD) is not required before case-folding. However, compatibility decomposition, if applied, would be required before case-folding.

@EricSharkey
Copy link

Thank you, Richard57. I've been struggling trying to understand the "right" way to do this. The Unicode Character Database's Case-Folding.txt file clearly states at the top:

NOTE: case folding does not preserve normalization formats!

So the 3.1 Matching Algorithm which puts normalization first seems strange, because after case folding, the string may no longer be normalized, and this is clearly wrong for a string matching algorithm.

However, I haven't found a reference that normalization prior to case folding is or is not required for consist case folding results. Do you have a reference for U+0345 being the only exception to requiring normalization before case folding?

@Richard57
Copy link
Author

Sorry, I've no reference. I had to work it out for myself, and the UTC basically ignored me when I pointed out that case folding as defined did not respect canonical equivalence.

@EricSharkey
Copy link

I found it on the last page here:

http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf

D145 A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD( X ))) = NFD(toCasefold(NFD( Y )))

The invocations of canonical decomposition (NFD normalization) before case folding in
D145 are to catch very infrequent edge cases. Normalization is not required before case
folding, except for the character U+0345 combining greek ypogegrammeni and any
characters that have it as part of their canonical decomposition, such as U+1FC3
greeksmall letter eta with ypogegrammeni. In practice, optimized versions of canonical
caseless matching can catch these special cases, thereby avoiding an extra normalization
step for each comparison.

@EricSharkey
Copy link

For completeness, I should also point out that compatibility caseless matching is more complicated, going beyond ypogegrammeni as a troublesome character.

D146 A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD( X ))))) =
NFKD(toCasefold(NFKD(toCasefold(NFD( Y )))))

@aphillips aphillips self-assigned this Jul 14, 2018
@aphillips
Copy link
Contributor

@EricSharkey/@Richard57 I'm aware that case fold doesn't preserve normalization. However, I had overlooked D145. I suspect we need to rework the algorithm in light of this. Providing canonical (or compatibility) equivalence should at least be an option for users of the spec.

That suggests four versions of the "algorithm":

  1. "Lightweight equivalence". The current "algorithm" with no normalization or casefolding.
  2. "ASCII equivalence". The current "algorithm" with no normalization and ASCII case fold (for vocabularies that need this).
  3. Canonical equivalence. NFD -> casefold -> NFD
  4. Compatibility equivalence. NFD -> casefold -> NFKD -> casefold -> NFKD

Thoughts?

aphillips added a commit to aphillips/charmod-norm that referenced this issue Jul 20, 2018
@aphillips
Copy link
Contributor

Initial cut in my forked editors's copy aphillips@b359ebb

Comments welcome as I start to work through the re-factor.

@aphillips
Copy link
Contributor

I think this is addressed by the current version. Please comment as needed.

@indolering
Copy link

I'm working on both the WASI filename spec and a best-practices RFC for filesystem i18n. In the i18n working group, normalization of filenames to NFC was seen as a viable option (in contrast to Apple's NFD normalization). I came here to track down this discrepancy between charmod and the Unicode caseless matching spec, as I wanted to know if NFC normalization could be substituted for NFD.

@aphillips PR specifically skips NFC normalization post-casefolding. I think such a step would be more idiot proof and should be preferred if there are no compatibility concerns.

Thoughts?

@aphillips
Copy link
Contributor

@indolering Note that the PR you link says NFD, but that was not the last PR. The text that is live specifies NFC. (Specifically NFD -> casefold -> NFC).

I drew an action item in today's I18N teleconference to supply additional clarifying text and a worked example. Thanks!

@indolering
Copy link

indolering commented Nov 12, 2020

It's still not clear to me that performing NFD before case-folding is necessary for compatibility reasons and probably slower than using NFC. At least in on one mailing list discussion on the normalization of filenames, it was the consensus that NFC should be used as it is faster in practice (since most text is in NFC form).

Either way, it would be wise to ask Unicode to clarify the standard, as the wording "if and only if" is logician speak for requiring that exact series of steps, which would make the current NFC(casefold(NFD(c)) incompatible with the Unicode caseless matching standard. I don't think this would be the case in practice, but given the further complications outlined in NFKC and NFKD matching, you really should check to make sure.

EDIT: I believe you have to do NFD beforehand because I think the casefold data mappings are only defined for decomposed characters. An optimized implementation could skip a lot of this, however.

@indolering
Copy link

Chapter 5 Implementation Guidelines, section 18 Case Mappings, page #240:

Normalization also interacts with case folding. For any string X, let Q(X) = NFC(toCasefold(NFD(X))). In other words, Q(X) is the result of normalizing X, then case folding the result, then putting the result into Normalization Form NFC format. Because of the way normalization and case folding are defined, Q(Q(X)) = Q(X). Repeatedly applying Q does not change the result; case folding is closed under canonical normalization for either Normalization Form NFC or NFD.

Case folding is not, however, closed under compatibility normalization for either Normalization Form NFKD or NFKC. That is, given R(X) = NFKC(toCasefold(NFD(X))), there are some strings such that R(R(X)) ≠ R(X).

So you are fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants