Add worked examples of case folding [I18N-ACTION-974] #214

Merged
merged 9 commits into from May 6, 2021

aphillips commented Dec 2, 2020

Adds new examples illustrating how casefolding and normalization interact.



Merge commits to my fork
- Add a new subsection illustrating the interplay between normalization and case folding
- Modify recommendation text to reference new examples
aphillips requested a review from r12a on December 2, 2020 00:57
r12a commented Dec 4, 2020

Here's how i'm framing the topic for myself. Does it help? (Either for planning your document explanation, or for showing me where i'm missing something.)

If you take the precomposed (NFC) character ᾌ [U+1F8C GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI] (which is the most common way to represent this combination of base and diacritic characters) and case fold it without normalising first, you end up with:
ἄι [U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA + U+03B9 GREEK SMALL LETTER IOTA]

If you start from a fully decomposed (NFD) sequence representing the same letter, ᾌ [U+0391 GREEK CAPITAL LETTER ALPHA + U+0313 COMBINING COMMA ABOVE + U+0301 COMBINING ACUTE ACCENT + U+0345 COMBINING GREEK YPOGEGRAMMENI]
you end up with
ἄι [U+03B1 GREEK SMALL LETTER ALPHA + U+0313 COMBINING COMMA ABOVE + U+0301 COMBINING ACUTE ACCENT + U+03B9 GREEK SMALL LETTER IOTA]

Clearly, these two don't match, and some normalisation will be necessary. However, in both of those cases, the acute accent is associated with the alpha base character.

If, however, you begin with the half-precomposed sequence ᾌ [U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI + U+0301 COMBINING ACUTE ACCENT]
you end up with
ἀί [U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI + U+03B9 GREEK SMALL LETTER IOTA + U+0301 COMBINING ACUTE ACCENT]
where the acute accent is associated with the iota.

This produces a sequence that can't be normalised to match the others!
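
A minimal Python sketch of the three cases above, assuming Python's built-in `str.casefold()` (which applies Unicode full case folding) stands in for the case fold step. The variable names and the `nfd` and `code_points` helpers are illustrative only.

```python
import unicodedata

def nfd(s: str) -> str:
    """Illustrative helper: canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s)

def code_points(s: str) -> str:
    """Illustrative helper: show a string as U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Three canonically equivalent spellings of the same letter.
precomposed = "\u1F8C"                   # single precomposed code point
decomposed = "\u0391\u0313\u0301\u0345"  # fully decomposed sequence
half = "\u1F88\u0301"                    # precomposed letter + combining acute

# Case fold each spelling without normalising first.
f_pre, f_dec, f_half = (s.casefold() for s in (precomposed, decomposed, half))

print(code_points(f_pre))   # U+1F04 U+03B9
print(code_points(f_dec))   # U+03B1 U+0313 U+0301 U+03B9
print(code_points(f_half))  # U+1F00 U+03B9 U+0301

# Normalising after the fold reconciles the first two results...
print(nfd(f_pre) == nfd(f_dec))   # True
# ...but not the third: the acute now follows the iota, which is a base character.
print(nfd(f_half) == nfd(f_dec))  # False
```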

The way to resolve this problem is to normalise all the text beforehand to NFD. Then
ᾌ [U+1F8C GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI]
and
ᾌ [U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI + U+0301 COMBINING ACUTE ACCENT]
both end up the same as the decomposed version, i.e.
ᾌ [U+0391 GREEK CAPITAL LETTER ALPHA + U+0313 COMBINING COMMA ABOVE + U+0301 COMBINING ACUTE ACCENT + U+0345 COMBINING GREEK YPOGEGRAMMENI]

If you now case-fold that sequence, it produces a match for all cases.
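
The same fix as a short Python sketch, again assuming `str.casefold()` as the case fold. `nfd_then_fold` is a hypothetical helper name, not something from the document under review.

```python
import unicodedata

def nfd_then_fold(s: str) -> str:
    """Hypothetical helper: normalise to NFD first, then case fold."""
    return unicodedata.normalize("NFD", s).casefold()

spellings = [
    "\u1F8C",                    # precomposed
    "\u0391\u0313\u0301\u0345",  # fully decomposed
    "\u1F88\u0301",              # half-precomposed
]

# All three spellings collapse to a single folded form.
keys = {nfd_then_fold(s) for s in spellings}
print(len(keys))  # 1
print(keys == {"\u03B1\u0313\u0301\u03B9"})  # True
```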

What i'm still not clear about is why you need to double-normalise.

aphillips commented

@r12a Thanks, that's a useful and clear explanation and I'll borrow heavily from it in my next revision. I just now posted my "work in progress" version.

Double normalization is needed because each pass does a different job. The first pass (before the case fold) handles the prosgegrammeni problem. The second pass (after the case fold) handles all of the other cases that flow from case folding: it canonicalizes the code points and the ordering of combining marks, since the case fold can itself produce denormalized values. Hopefully the just-posted text makes that clearer (although it still needs more work).
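
A rough Python sketch of what each pass contributes, assuming `str.casefold()` as the case fold and NFD for both passes. The helper names `fold_then_nfd` and `caseless_key` are made up for illustration.

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def fold_then_nfd(s: str) -> str:
    """Illustrative: post-fold normalisation only."""
    return nfd(s.casefold())

def caseless_key(s: str) -> str:
    """Illustrative: normalise, case fold, normalise again."""
    return nfd(nfd(s).casefold())

# Post-fold pass: folding different spellings leaves different code points
# (U+00E5 versus U+0061 U+030A), which normalising afterwards reconciles.
print("\u00C5".casefold() == "A\u030A".casefold())            # False
print(fold_then_nfd("\u00C5") == fold_then_nfd("A\u030A"))    # True

# Pre-fold pass: the prosgegrammeni case is not fixed by the post-fold pass alone.
print(fold_then_nfd("\u1F8C") == fold_then_nfd("\u1F88\u0301"))  # False
print(caseless_key("\u1F8C") == caseless_key("\u1F88\u0301"))    # True
```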

r12a commented Dec 4, 2020

Note, btw, that i think it really helps to have the names in addition to the code point values, for people to understand what's going on. (They are all marked up in my comment, so if you're able to see the edited version that may help you. If you can't, i could send you a plain text version.)

r12a commented Dec 4, 2020

Btw, my Greek char app now has a casefold function (in the drop down menu). You'll also find it very useful to use the normalisation switch on the panel lower down, which allows you to transform the text to NFC or NFD, or prevents any normalisation as you paste/type if you set to 'None'. https://r12a.github.io/pickers/grek/

aphillips commented

@r12a I always use the char [U+1234 NAME GOES HERE] for individual characters (with styles) when done.

I will probably make a new subsection out of the prosgegrammeni note and add discussion of why the pre-normalization step is optional and the workaround for the affected characters. Then I need to update 3.2.2.3/3.2.2.4 to reflect that.

…ion with case fold.

Added notes about the optional step.
Adjusted text.
- Moved 'user-supplied values' next to 'syntactic content'
- Changed 'natural language content' to 'textual content'
- Added definition of 'natural language' with link to LTLI
- Modified all references to match
- Assorted cleanup. Some light editing of sections being addressed.
- Added a > 4 character example for hex notation (smiling cat)
- Remove subscript 16 from hex range of code point and style
  consistently with document.
- Remove 'termref' from one anchor for consistent styling.
- Change example3 from 'coverImage' to 'downloadLocation' (which is not
  localizable)
- Fix spacing in example3
aphillips merged commit 21284b0 into w3c:gh-pages on May 6, 2021