questions/qa-html-css-normalization: Add more info about normalisation forms. Start on security issues.
r12a committed Aug 24, 2022
1 parent c55133e commit 2e19253
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions questions/qa-html-css-normalization.en.html
@@ -89,13 +89,16 @@ <h2 class="notoc">Quick check</h2>
<section id="n11nwhat">
<h2>What are normalization forms?</h2>

-<p>In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word <span class="qterm">világ</span>. The fourth letter could be stored in memory as a <dfn>precomposed</dfn> (single) character <span class="uname">U+00E1 LATIN SMALL LETTER A WITH ACUTE</span> or as a <dfn>decomposed</dfn> sequence of <span class="uname">U+0061 LATIN SMALL LETTER A</span> followed by <span class="uname">U+0301 COMBINING ACUTE ACCENT</span> (two characters). </p>
+<p>In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word <span class="qterm">világ</span>. The fourth letter could be stored in memory as a <dfn>precomposed</dfn> (single) code point <span class="uname">U+00E1 LATIN SMALL LETTER A WITH ACUTE</span> or as a <dfn>decomposed</dfn> sequence of <span class="uname">U+0061 LATIN SMALL LETTER A</span> followed by <span class="uname">U+0301 COMBINING ACUTE ACCENT</span> (two code points). </p>

<p><img src="qa-html-css-normalization-data/vilag.png" alt=" " /></p>

<p>The Unicode Standard allows either of these alternatives, but requires that both be treated as identical (i.e. they are 'canonically equivalent'). So that equivalent strings match reliably, an application will usually <dfn>normalize</dfn> text before performing searches or comparisons. Normalization, in this particular case, means converting the text to use all precomposed or all decomposed characters. </p>

-<p>Four <dfn>normalization forms</dfn> are specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The <span class="qchar">C</span> stands for (pre-)composed, and the <span class="qchar">D</span> for decomposed. The <span class="qchar">K</span> stands for compatibility. In fact, not all sequences of Unicode characters have precomposed equivalents, but there are rules to indicate the many that do. </p>
+<p>Four <dfn>normalization forms</dfn> are specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The <span class="qchar">C</span> stands for (pre-)composed, and the <span class="qchar">D</span> for decomposed. The <span class="qchar">K</span> stands for compatibility. </p>
+<p><span class="leadin">NFD</span> uses Unicode rules to maximally decompose a code point into component parts. For example, the Vietnamese letter <span class="codepoint" translate="no"><bdi lang="vi">&#x1EC1;</bdi> [<span class="uname">U+1EC1 LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE</span>]</span> becomes the sequence <span class="codepoint" translate="no"><bdi lang="vi">&#x0065;&#x0302;&#x0300;</bdi> [<span class="uname">U+0065 LATIN SMALL LETTER E</span> + <span class="uname">U+0302 COMBINING CIRCUMFLEX ACCENT</span> + <span class="uname">U+0300 COMBINING GRAVE ACCENT</span>]</span>.</p>
+<p><span class="leadin">NFC</span> runs that process in reverse, and will also completely compose partially decomposed sequences. However, this composition process is only applied to a subset of the Unicode repertoire. For example, the sequence <span class="codepoint" translate="no"><bdi lang="en">&#x0067;&#x0300;</bdi> [<span class="uname">U+0067 LATIN SMALL LETTER G</span> + <span class="uname">U+0300 COMBINING GRAVE ACCENT</span>]</span> has no precomposed form, and is unaffected by normalization.</p>
+<p><span class="leadin">NFKC and NFKD</span> were initially introduced to provide round-trip compatibility with other character sets. This applies to code points that represent such things as glyph variants, shaped forms, alternative compositions, and so on, that can also be represented by other ‘canonical’ code points already in Unicode. NFKD and NFKC normalization replace them with canonical characters or character sequences, and you cannot convert back to the original code points. In principle, such compatibility variants should not be used.</p>
</section>
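
The behaviour described in this section can be checked with a short script. The following is a minimal sketch using Python's standard `unicodedata` module, which is not part of the commit; the helper name `codepoints` and the choice of U+FB01 LATIN SMALL LIGATURE FI as the compatibility example are assumptions made for illustration only.

```python
import unicodedata


def codepoints(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Two canonically equivalent spellings of the Hungarian word "világ".
precomposed = "vil\u00E1g"   # á stored as U+00E1
decomposed = "vila\u0301g"   # a (U+0061) followed by U+0301

# A raw comparison sees two different code point sequences...
assert precomposed != decomposed
# ...but the strings match once both use the same normalization form.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# NFD maximally decomposes U+1EC1 into a base letter and two marks.
nfd = unicodedata.normalize("NFD", "\u1EC1")
print(codepoints(nfd))  # ['U+0065', 'U+0302', 'U+0300']

# NFC reverses that, and also completes a partially decomposed
# sequence (U+00EA followed by U+0300).
assert unicodedata.normalize("NFC", nfd) == "\u1EC1"
assert unicodedata.normalize("NFC", "\u00EA\u0300") == "\u1EC1"

# g + combining grave has no precomposed form, so NFC leaves it alone.
g_grave = "g\u0300"
assert unicodedata.normalize("NFC", g_grave) == g_grave

# Compatibility example (illustrative choice): NFC keeps the ligature,
# NFKC replaces it with plain "fi", and the original cannot be recovered.
ligature = "\uFB01"
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "fi"
```

Writing the strings with escapes rather than literal characters keeps the example independent of how the script file itself happens to be normalized.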


@@ -140,7 +143,8 @@ <h3>Choosing a normalization form</h3>
<section id="converting">
<h3>Converting the normalization form of a page</h3>

-<p style="">You should also be careful about automatically converting content into a particular normalization form, as it may obliterate some careful uses of differently normalized forms, such as in the carefully crafted examples of <span lang="hu" class="qterm">világ</span> above, or in filenames or URLs, or text included in the page from elsewhere, etc.</p>
+<p style="">You should also try to avoid automatically converting content from one normalization form to another, as it may obliterate some important code point distinctions, such as in the carefully crafted examples of <span lang="hu" class="qterm">világ</span> above, or in filenames or URLs, or text included in the page from elsewhere, etc.</p>
+<p style="">It may also introduce a security risk, especially in code syntax. For example, the following code points are canonically equivalent: </p>
</section>
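
To make the two risks in this section concrete, here is a hedged sketch in Python using the standard `unicodedata` module. The filenames, the helper `new_ascii_after_normalizing`, and the choice of U+037E GREEK QUESTION MARK are all assumptions for illustration; the diff is cut off before the article's own list of code points, so this pair is not necessarily one the article names.

```python
import unicodedata

# Hypothetical filenames that differ only in normalization form.
files = {
    "vil\u00E1g.html": "precomposed name",
    "vila\u0301g.html": "decomposed name",
}
assert len(files) == 2  # two distinct keys before conversion

# Blanket conversion to NFC collapses them into a single name,
# so one of the two files can no longer be addressed.
normalized_names = {unicodedata.normalize("NFC", name) for name in files}
assert len(normalized_names) == 1

# U+037E GREEK QUESTION MARK is canonically equivalent to
# U+003B SEMICOLON, so normalization can create syntax characters.
greek_question_mark = "\u037E"
assert greek_question_mark != ";"
assert unicodedata.normalize("NFC", greek_question_mark) == ";"


def new_ascii_after_normalizing(text, form="NFC"):
    """Return ASCII characters present only after normalizing text."""
    before = {ch for ch in text if ch.isascii()}
    after = {ch for ch in unicodedata.normalize(form, text) if ch.isascii()}
    return after - before


print(new_ascii_after_normalizing("a\u037Eb"))  # {';'}
```

A check along these lines could run before any automatic conversion, flagging content where normalization would introduce characters that carry meaning in code.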

