Normalisation for Case-Insensitive Comparison #172

Richard57 · 2018-05-11T19:30:37Z

In Section 3.1 Step 3 'Normalisation', do we really want the case-insensitive Unicode full case folding comparison of "ᾳ͙" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+0345 COMBINING GREEK YPOGEGRAMMENI> and "α͙ι" <U+03B1 GREEK SMALL LETTER ALPHA, U+0359 COMBINING ASTERISK BELOW, U+03B9 GREEK SMALL LETTER IOTA> to depend on the choice of normalisation? NF(K)C yields 'different', while NF(K)D yields 'identical'. (The combining mark U+0359 was added to support the retranscription of deteriorated Greek manuscripts.)

The sequence of Step 4, "case-folding" and Step 6 "compare code points" does not work properly. For example, in the comparison of the NFC strings "sś" <U+0073, U+015B> and "ß́" <U+00DF latin small letter sharp s, U+0301>, default case-folding yields the strings <U+0073, U+015B> and <U+0073, U+0073, U+0301>. However, converting to NFD and then case-folding would yield <U+0073, U+0073, U+0301> for both strings.

Normalisation is required after case-folding. By contrast, apart from strings containing U+0345 when fully decomposed, normalisation (i.e. NFC/NFD) is not required before case-folding. However, compatibility decomposition, if applied, would be required before case-folding.

EricSharkey · 2018-06-08T17:57:56Z

Thank you, Richard57. I've been struggling trying to understand the "right" way to do this. The Unicode Character Database's Case-Folding.txt file clearly states at the top:

NOTE: case folding does not preserve normalization formats!

So the 3.1 Matching Algorithm which puts normalization first seems strange, because after case folding, the string may no longer be normalized, and this is clearly wrong for a string matching algorithm.

However, I haven't found a reference that normalization prior to case folding is or is not required for consist case folding results. Do you have a reference for U+0345 being the only exception to requiring normalization before case folding?

Richard57 · 2018-06-08T18:30:18Z

Sorry, I've no reference. I had to work it out for myself, and the UTC basically ignored me when I pointed out that case folding as defined did not respect canonical equivalence.

EricSharkey · 2018-06-11T13:13:18Z

I found it on the last page here:

http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf

D145 A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD( X ))) = NFD(toCasefold(NFD( Y )))

The invocations of canonical decomposition (NFD normalization) before case folding in
D145 are to catch very infrequent edge cases. Normalization is not required before case
folding, except for the character U+0345 combining greek ypogegrammeni and any
characters that have it as part of their canonical decomposition, such as U+1FC3
greeksmall letter eta with ypogegrammeni. In practice, optimized versions of canonical
caseless matching can catch these special cases, thereby avoiding an extra normalization
step for each comparison.

EricSharkey · 2018-06-11T13:36:00Z

For completeness, I should also point out that compatibility caseless matching is more complicated, going beyond ypogegrammeni as a troublesome character.

D146 A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD( X ))))) =
NFKD(toCasefold(NFKD(toCasefold(NFD( Y )))))

aphillips · 2018-07-16T16:41:16Z

@EricSharkey/@Richard57 I'm aware that case fold doesn't preserve normalization. However, I had overlooked D145. I suspect we need to rework the algorithm in light of this. Providing canonical (or compatibility) equivalence should at least be an option for users of the spec.

That suggests four versions of the "algorithm":

"Lightweight equivalence". The current "algorithm" with no normalization or casefolding.
"ASCII equivalence". The current "algorithm" with no normalization and ASCII case fold (for vocabularies that need this).
Canonical equivalence. NFD -> casefold -> NFD
Compatibility equivalence. NFD -> casefold -> NFKD -> casefold -> NFKD

Thoughts?

…case fold normalized forms of matching)

aphillips · 2018-07-20T20:43:46Z

Initial cut in my forked editors's copy aphillips@b359ebb

Comments welcome as I start to work through the re-factor.

aphillips · 2019-01-14T18:45:40Z

I think this is addressed by the current version. Please comment as needed.

indolering · 2020-10-31T21:44:29Z

I'm working on both the WASI filename spec and a best-practices RFC for filesystem i18n. In the i18n working group, normalization of filenames to NFC was seen as a viable option (in contrast to Apple's NFD normalization). I came here to track down this discrepancy between charmod and the Unicode caseless matching spec, as I wanted to know if NFC normalization could be substituted for NFD.

@aphillips PR specifically skips NFC normalization post-casefolding. I think such a step would be more idiot proof and should be preferred if there are no compatibility concerns.

Thoughts?

aphillips · 2020-11-12T20:31:45Z

@indolering Note that the PR you link says NFD, but that was not the last PR. The text that is live specifies NFC. (Specifically NFD -> casefold -> NFC).

I drew an action item in today's I18N teleconference to supply additional clarifying text and a worked example. Thanks!

indolering · 2020-11-12T22:18:30Z

~~It's still not clear to me that performing NFD before case-folding is necessary for compatibility reasons~~ and probably slower than using NFC. At least in on one mailing list discussion on the normalization of filenames, it was the consensus that NFC should be used as it is faster in practice (since most text is in NFC form).

Either way, it would be wise to ask Unicode to clarify the standard, as the wording "if and only if" is logician speak for requiring that exact series of steps, which would make the current NFC(casefold(NFD(c)) incompatible with the Unicode caseless matching standard. I don't think this would be the case in practice, but given the further complications outlined in NFKC and NFKD matching, you really should check to make sure.

EDIT: I believe you have to do NFD beforehand because I think the casefold data mappings are only defined for decomposed characters. An optimized implementation could skip a lot of this, however.

indolering · 2020-11-15T04:40:28Z

Chapter 5 Implementation Guidelines, section 18 Case Mappings, page #240:

Normalization also interacts with case folding. For any string X, let Q(X) = NFC(toCasefold(NFD(X))). In other words, Q(X) is the result of normalizing X, then case folding the result, then putting the result into Normalization Form NFC format. Because of the way normalization and case folding are defined, Q(Q(X)) = Q(X). Repeatedly applying Q does not change the result; case folding is closed under canonical normalization for either Normalization Form NFC or NFD.

Case folding is not, however, closed under compatibility normalization for either Normalization Form NFKD or NFKC. That is, given R(X) = NFKC(toCasefold(NFD(X))), there are some strings such that R(R(X)) ≠ R(X).

So you are fine.

aphillips self-assigned this Jul 14, 2018

aphillips added a commit to aphillips/charmod-norm that referenced this issue Jul 20, 2018

Addresses w3c#172 (initial set of edits to start supplying different …

b359ebb

…case fold normalized forms of matching)

aphillips added the close? label Jan 14, 2019

aphillips closed this as completed Jan 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Normalisation for Case-Insensitive Comparison #172

Normalisation for Case-Insensitive Comparison #172

Richard57 commented May 11, 2018

EricSharkey commented Jun 8, 2018

Richard57 commented Jun 8, 2018

EricSharkey commented Jun 11, 2018

EricSharkey commented Jun 11, 2018

aphillips commented Jul 16, 2018

aphillips commented Jul 20, 2018

aphillips commented Jan 14, 2019

indolering commented Oct 31, 2020

aphillips commented Nov 12, 2020

indolering commented Nov 12, 2020 •

edited

Loading

indolering commented Nov 15, 2020

Normalisation for Case-Insensitive Comparison #172

Normalisation for Case-Insensitive Comparison #172

Comments

Richard57 commented May 11, 2018

EricSharkey commented Jun 8, 2018

NOTE: case folding does not preserve normalization formats!

Richard57 commented Jun 8, 2018

EricSharkey commented Jun 11, 2018

EricSharkey commented Jun 11, 2018

aphillips commented Jul 16, 2018

aphillips commented Jul 20, 2018

aphillips commented Jan 14, 2019

indolering commented Oct 31, 2020

aphillips commented Nov 12, 2020

indolering commented Nov 12, 2020 • edited Loading

indolering commented Nov 15, 2020

indolering commented Nov 12, 2020 •

edited

Loading