questions/qa-html-css-normalization: Add more info about normalisation forms. Start on security issues.
w3c · Aug 24, 2022 · 2e19253
1 parent c55133e
commit 2e19253
Showing 1 changed file with 7 additions and 3 deletions.
diff --git a/questions/qa-html-css-normalization.en.html b/questions/qa-html-css-normalization.en.html
@@ -89,13 +89,16 @@ <h2 class="notoc">Quick check</h2>
 <section id="n11nwhat">
 <h2>What are normalization forms?</h2>
 
-<p>In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word <span class="qterm">világ</span>. The fourth letter could be stored in memory as a <dfn>precomposed</dfn> (single) character <span class="uname">U+00E1 LATIN SMALL LETTER A WITH ACUTE</span> or as a <dfn>decomposed</dfn> sequence of <span class="uname">U+0061 LATIN SMALL LETTER A</span> followed by <span class="uname">U+0301 COMBINING ACUTE ACCENT</span> (two characters). </p>
+<p>In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word <span class="qterm">világ</span>. The fourth letter could be stored in memory as a <dfn>precomposed</dfn> (single) code point <span class="uname">U+00E1 LATIN SMALL LETTER A WITH ACUTE</span> or as a <dfn>decomposed</dfn> sequence of <span class="uname">U+0061 LATIN SMALL LETTER A</span> followed by <span class="uname">U+0301 COMBINING ACUTE ACCENT</span> (two code points). </p>
 
 <p><img src="qa-html-css-normalization-data/vilag.png" alt=" " /></p>
 
 <p>The Unicode Standard allows either of these alternatives, but requires that both be treated as identical (i.e. they are 'canonically equivalent'). To improve effectiveness, an application will usually <dfn>normalize</dfn> text before performing searches or comparisons. Normalization, in this particular case, means converting the text to use all precomposed or all decomposed characters. </p>
 
-<p>Four <dfn>normalization forms</dfn> are specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The <span class="qchar">C</span> stands for (pre-)composed, and the <span class="qchar">D</span> for decomposed. The <span class="qchar">K</span> stands for compatibility. In fact, not all sequences of Unicode character have precomposed equivalents, but there are rules to indicate the many that do. </p>
+<p>Four <dfn>normalization forms</dfn> are specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The <span class="qchar">C</span> stands for (pre-)composed, and the <span class="qchar">D</span> for decomposed. The <span class="qchar">K</span> stands for compatibility. </p>
+<p><span class="leadin">NFD</span> uses Unicode rules to maximally decompose a code point into component parts. For example, the Vietnamese letter <span class="codepoint" translate="no"><bdi lang="vi">&#x1EC1;</bdi> [<span class="uname">U+1EC1 LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE</span>]</span> becomes the sequence <span class="codepoint" translate="no"><bdi lang="vi">&#x0065;&#x0302;&#x0300;</bdi> [<span class="uname">U+0065 LATIN SMALL LETTER E</span> + <span class="uname">U+0302 COMBINING CIRCUMFLEX ACCENT</span> + <span class="uname">U+0300 COMBINING GRAVE ACCENT</span>]</span>.</p>
+<p><span class="leadin">NFC</span> runs that process in reverse, and will also completely compose partially decomposed sequences. However, this composition process is only applied to a subset of the Unicode repertoire. For example, the sequence <span class="codepoint" translate="no"><bdi lang="en">&#x0067;&#x0300;</bdi> [<span class="uname">U+0067 LATIN SMALL LETTER G</span> + <span class="uname">U+0300 COMBINING GRAVE ACCENT</span>]</span> has no precomposed form, and is unaffected by normalization.</p>
+<p><span class="leadin">NFKC and NFKD</span> were initially introduced to provide round-trip compatibility with other character sets. This applies to code points that represent such things as glyph variants, shaped forms, alternative compositions, and so on, that can also be represented by other ‘canonical’ code points already in Unicode. NFKD and NFKC normalization replaces them with canonical characters or character sequences, and you cannot convert back to the original code points. In principle, such compatibility variants should not be used.</p>
 </section>
 
 
@@ -140,7 +143,8 @@ <h3>Choosing a normalization form</h3>
 <section id="converting">
 <h3>Converting the normalization form of a page</h3>
 
-<p style="">You should also be careful about automatically converting content into a particular normalization form, as it may obliterate some careful uses of differently normalized forms, such as in the carefully crafted examples of <span lang="hu" class="qterm">világ</span> above, or in filenames or URLs, or text included in the page from elsewhere, etc.</p>
+<p style="">You should also try to avoid automatically converting content from one normalization form to another, as it may obliterate some important code point distinctions, such as in the carefully crafted examples of <span lang="hu" class="qterm">világ</span> above, or in filenames or URLs, or text included in the page from elsewhere, etc.</p>
+<p style="">It may also introduce a security risk, especially in code syntax. For example, the following code points are canonically equivalent: </p>
 </section>
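
The behaviour described in the added paragraphs can be checked with a short script. The following is a minimal sketch using Python's standard `unicodedata` module, which is not part of this commit. The helper `code_points` is a hypothetical name introduced for illustration. The `fi` ligature (U+FB01) is also an illustrative choice, not an example from the article.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the code points of text as U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The two spellings of the Hungarian word used in the article.
precomposed = "vil\u00E1g"   # fourth letter is U+00E1
decomposed = "vila\u0301g"   # fourth letter is U+0061 + U+0301

# The code point sequences differ, so a plain comparison fails.
print(precomposed == decomposed)
# Once both are in the same normalization form, they compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# NFD maximally decomposes U+1EC1 into a base letter plus two combining marks.
print(code_points(unicodedata.normalize("NFD", "\u1EC1")))

# NFC also fully composes a partially decomposed sequence (U+00EA + U+0300).
print(code_points(unicodedata.normalize("NFC", "\u00EA\u0300")))

# g + combining grave has no precomposed form, so NFC leaves it unchanged.
print(code_points(unicodedata.normalize("NFC", "g\u0300")))

# Compatibility normalization replaces a compatibility code point, here the
# fi ligature (an example chosen for this sketch), and cannot be reversed.
ligature = "\uFB01"
print(code_points(unicodedata.normalize("NFKC", ligature)))
# Canonical normalization (NFC) leaves the ligature untouched.
print(unicodedata.normalize("NFC", ligature) == ligature)
```

Comparing strings only after normalizing both sides to the same form is the pattern the article describes for searches and comparisons. The last two lines show why converting to NFKC or NFKD loses information that NFC and NFD preserve.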