qa-html-css-normalization: Clarifications for para on NFKC/NFKD
r12a committed Aug 31, 2022
1 parent e0e0458 commit 17cc050
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions questions/qa-html-css-normalization.en.html
@@ -98,7 +98,7 @@ <h2>What is Unicode normalization?</h2>
<p>Four <dfn>normalization forms</dfn> are specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The <span class="qchar">C</span> stands for (pre-)composed, and the <span class="qchar">D</span> for decomposed. The <span class="qchar">K</span> stands for compatibility. </p>
<p><span class="leadin">NFD</span> uses Unicode rules to maximally decompose a code point into component parts. For example, the Vietnamese letter <span class="codepoint" translate="no"><bdi lang="vi">&#x1EC1;</bdi> [<span class="uname">U+1EC1 LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE</span>]</span> becomes the sequence <span class="codepoint" translate="no"><bdi lang="vi">&#x0065;&#x0302;&#x0300;</bdi> [<span class="uname">U+0065 LATIN SMALL LETTER E</span> + <span class="uname">U+0302 COMBINING CIRCUMFLEX ACCENT</span> + <span class="uname">U+0300 COMBINING GRAVE ACCENT</span>]</span>.</p>
<p><span class="leadin">NFC</span> runs that process in reverse, and will also completely compose partially decomposed sequences. However, this composition process is only applied to a subset of the Unicode repertoire. For example, the sequence <span class="codepoint" translate="no"><bdi lang="en">&#x0067;&#x0300;</bdi> [<span class="uname">U+0067 LATIN SMALL LETTER G</span> + <span class="uname">U+0300 COMBINING GRAVE ACCENT</span>]</span> has no precomposed form, and is unaffected by normalization.</p>
-<p><span class="leadin">NFKC and NFKD</span> were initially introduced to provide round-trip compatibility with other character sets. This applies to code points that represent such things as glyph variants, shaped forms, alternative compositions, and so on, that can also be represented by other ‘canonical’ code points already in Unicode. NFKD and NFKC normalization replaces these code points with canonical characters or character sequences, and you cannot convert back to the original code points. In principle, such compatibility variants should not be used.</p>
+<p><span class="leadin">NFKC and NFKD</span> were introduced to handle characters that were included in Unicode in order to provide compatibility with other character sets. This applies to code points that represent such things as glyph variants, shaped forms, alternative compositions, and so on. NFKD and NFKC normalization replaces these code points with canonical characters or character sequences, and you cannot convert back to the original code points. In principle, such compatibility variants should not be used.</p>
</section>
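
The behaviour described in this hunk can be checked with a short script. This is a minimal sketch rather than part of the commit: it assumes Python's standard `unicodedata` module, and the ligature U+FB01 is an illustrative compatibility character chosen here, not one the article mentions.

```python
import unicodedata

# U+1EC1 LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE
precomposed = "\u1EC1"

# NFD maximally decomposes: base letter, then circumflex, then grave.
decomposed = unicodedata.normalize("NFD", precomposed)
nfd_matches = decomposed == "\u0065\u0302\u0300"

# NFC reverses the process, and also completes a partially
# decomposed sequence (U+00EA followed by U+0300).
recomposed = unicodedata.normalize("NFC", decomposed)
partial = unicodedata.normalize("NFC", "\u00EA\u0300")

# g + combining grave has no precomposed form, so NFC leaves it alone.
g_grave = "\u0067\u0300"
g_grave_nfc = unicodedata.normalize("NFC", g_grave)

# Assumed example of a compatibility character: U+FB01 LATIN SMALL LIGATURE FI.
# NFC keeps it, NFKC replaces it with plain "fi", and the original
# code point cannot be recovered from the result.
ligature = "\uFB01"
ligature_nfc = unicodedata.normalize("NFC", ligature)
ligature_nfkc = unicodedata.normalize("NFKC", ligature)

print([f"U+{ord(c):04X}" for c in decomposed])
print(recomposed == precomposed, partial == precomposed)
print(g_grave_nfc == g_grave)
print(ligature_nfc == ligature, ligature_nfkc)
```

Printing code points rather than the characters themselves avoids relying on how a terminal renders combining marks.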


@@ -113,7 +113,7 @@ <h3>Choosing a normalization form</h3>

<p>Natural language content aimed at human consumption does not need to all be in one normalized form – there may sometimes be good reasons to mix normalized forms. Applications that try to match one piece of text with another should, however, compare normalized versions of both.</p>

-<p style="">Unfortunately, normalization doesn't always take place before content is compared, and a particularly important case is when CSS selectors are compared with HTML class names or ids, as style is applied to a page. If the word <span class="qterm">világ</span> is used in precomposed form in the HTML (eg. <code>&lt;span class=&quot;világ&quot;&gt;</code>), but in decomposed form in the CSS (eg. <code>.vila&#x0301;g { font-style: italic; }</code>), then the selector won't match the class name.</p>
+<p style="">Unfortunately, normalization doesn't always take place before content is compared, and a particularly important case is when CSS selectors are compared with HTML class names or ids, as style is applied to a page. If the word <span class="qterm">világ</span> (meaning 'world' in Hungarian) is used in precomposed form in the HTML (eg. <code>&lt;span class=&quot;világ&quot;&gt;</code>), but in decomposed form in the CSS (eg. <code>.vila&#x0301;g { font-style: italic; }</code>), then the selector won't match the class name.</p>

<p style="">The following example shows this. The CSS selector is decomposed, whereas one class name in the HTML is decomposed and the other precomposed. As you should be able to see, only the decomposed class name is matched to the style. But notice also that it is not possible to distinguish the two forms in the source text.</p>
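
The mismatch between the selector and the class name can be reproduced outside the browser. The sketch below is an illustration under stated assumptions, not how any browser matches selectors: the helper `same_identifier` is a hypothetical name, and NFC is used here only as one reasonable choice of common form for the comparison.

```python
import unicodedata

# Class name as written in the HTML: precomposed á (U+00E1).
html_class = "vil\u00E1g"

# Selector as written in the CSS: a + U+0301 COMBINING ACUTE ACCENT.
css_selector = "vila\u0301g"

# A plain code point comparison fails even though both
# strings display identically.
raw_match = html_class == css_selector


def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: compare two identifiers after
    normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


normalized_match = same_identifier(html_class, css_selector)

print(len(html_class), len(css_selector))
print(raw_match, normalized_match)
```

A check like this could run in a build step or linter to flag class names and selectors that differ only in normalization form, since the two cannot be told apart by eye in the source text.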
