Serious major rewrite of the case mapping/case folding section
aphillips committed Nov 2, 2017
1 parent 4f0b1c0 commit ce183d0
Showing 1 changed file with 83 additions and 53 deletions.
136 changes: 83 additions & 53 deletions index.html
@@ -516,51 +516,52 @@ <h2>The String Matching Problem</h2>
types of text variation that affect both user perception of text on the Web and the string processing on which
the Web relies.</p>
<section id="definitionCaseFolding">
<h3>Case Folding</h3>
<h3>Case Mapping and Case Folding</h3>
<p>Some scripts and writing systems make a distinction between UPPER,
lower, and Title case characters. Most scripts, such as the Brahmic
scripts of India, the Arabic script, and the scripts used to
write Chinese, Japanese, or Korean do not have a case distinction, but
some important ones do. Examples of such scripts include the Latin
script used in the majority of this document, as well as scripts such
as Greek, Armenian, and Cyrillic. </p>
<p>Some document formats or protocols seek to aid interoperability or
provide an aid to content authors by ignoring case variations in the
<a data-lt="vocabulary">vocabulary</a> they define or in user-defined values permitted by the
format or protocol. For example, this occurs when matching element
names
between an HTML document and its associated style sheet. Consider this
HTML fragment: </p>
<aside class="example">
<pre>&lt;style type="text/css"&gt;

SPAN.hello {
text-decoration: underline;
}
&lt;/style&gt;

&lt;span class="hello"&gt;Hello World!&lt;/span&gt;
</pre>
</aside>
<p>The <code class="kw" translate="no">SPAN</code> in the stylesheet
matches the <code class="kw" translate="no">span</code> element in the
document, even though the stylesheet uses uppercase and the HTML markup
does not.</p>
<p><dfn>Case folding</dfn> is the process of making two texts identical
which differ in case but are otherwise "the same".</p>
<p>Case folding might, at first, appear simple. However there are
variations that need to be considered when treating the full range of
Unicode in diverse languages. For more information,
<cite>[[!Unicode]]</cite> Chapter 5 (in v8.0, <a href="">Section 5.18</a>)
discusses case mappings in detail.</p>

<p>Unicode defines the default case fold mapping for each Unicode code point.
Since most scripts do not provide a case distinction, most Unicode code
points do not require a case fold mapping. For those characters that
have a case fold mapping, the majority have a simple, straight-forward
mapping to a single matching (generally lowercase) code point. Unicode
calls these the <code class="kw">common</code> case fold mappings, as they are shared by
Unicode's case fold mappings.

<p>For those scripts which have a case distinction, Unicode defines a <em>default</em> UPPER, lower, and Title case character mapping for each Unicode code point. These default mappings can be found in the Unicode Character Database (UCD). Case mapping, at first, appears simple. However, there are variations that need to be considered when treating the full range of Unicode in diverse languages.</p>


<aside class="note">
<p>For more information, <cite>[[!Unicode]]</cite> Chapter 5 (in v8.0, <a href="">Section 5.18</a>) discusses case mappings and case folding in detail. </p>
</aside>

<aside class="example">
<p>For example, here is a character that has a mapping for each of the three case variations. These mappings are defined in the Unicode Character Database (UCD).</p>
<table>
<tr>
<th>Character</th>
<th>Uppercase</th>
<th>Lowercase</th>
<th>Titlecase</th>
</tr>
<tr>
<td class="exampleChar">&#x1c5;</td>
<td class="exampleChar">&#x1c4;</td>
<td class="exampleChar">&#x1c6;</td>
<td class="exampleChar">&#x1c5;</td>
</tr>
<tr>
<td>U+01C5</td>
<td>U+01C4</td>
<td>U+01C6</td>
<td>U+01C5</td>
</tr>
</table>
</aside>
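<p>The three default mappings in the table above can be checked with a minimal Python sketch. It assumes Python 3, whose built-in <code>str.upper()</code>, <code>str.lower()</code>, and <code>str.title()</code> apply the default (untailored) Unicode case mappings; the variable names are illustrative only.</p>

```python
# U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
# (the titlecase member of a three-way case group).
dz_title = "\u01c5"

# Apply Python's default (language-neutral) case mappings.
mappings = {
    "upper": dz_title.upper(),
    "lower": dz_title.lower(),
    "title": dz_title.title(),
}

# Print each mapping as a code point, for comparison with the table.
for name, mapped in mappings.items():
    print(f"{name}: U+{ord(mapped):04X}")
```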


<p><dfn>Case folding</dfn> is the process of making two texts which differ only in case identical for comparison purposes. This is distinct from case mapping for display purposes. As with the default case mappings, Unicode defines default case fold mappings for each Unicode code point. Unicode defines two forms of case fold mapping, which we'll examine below.</p>

<p>Since most scripts do not have a case distinction, most Unicode code points do not require a case fold mapping, just as they do not require case mappings. For those characters that
have a case fold mapping, the majority have a simple, straightforward mapping to a single matching (generally lowercase) code point. Unicode
calls these the <code class="kw">common</code> case fold mappings, as they are shared by both of Unicode's case fold mappings.
</p>

<aside class="example">
@@ -588,18 +589,13 @@ <h3>Case Folding</h3>
</aside>


<p>In addition to the <code class="kw">common</code> case folding mappings, a few characters
have a case fold mapping that would normally map one
Unicode character to more than one during case folding. These are called the <code class="kw">full</code> case fold mappings.
Together with the <code class="kw">common</code> case fold mappings, these provide the
default case fold mapping for all of Unicode. This case fold mapping is referred to in this
document as <dfn id="dfn-UnicodeC+F">Unicode C+F</dfn>.
<p>A few characters have a case fold mapping that maps one Unicode code point to two or more code points during case folding. These are called the <code class="kw">full</code> case fold mappings. Together with the <code class="kw">common</code> case fold mappings, these provide the default case fold mapping for all of Unicode. This case fold mapping is referred to in this document as <dfn id="dfn-UnicodeC+F">Unicode C+F</dfn>.
</p>

<aside class="example">
<p>One well-known example of a 'full' case fold mapping is the character <span class="qchar">&#xdf;</span>
<p>One well-known example of a <code class="kw">full</code> case fold mapping is the character <span class="qchar">&#xdf;</span>
<span class="uname" translate="no">U+00DF LATIN SMALL LETTER SHARP S</span>, a letter that is commonly
used in the German language. The 'full' mapping of this character is to two ASCII letters 's'. (The upper case mapping is to "SS".)
used in the German language. The <code class="kw">full</code> case fold mapping of this character is to two ASCII letters 's'. The upper case mapping is to "SS".
</p>
<table>
<tr>
@@ -609,7 +605,6 @@ <h3>Case Folding</h3>
<td><span class="uname">LATIN SMALL LETTER SHARP S</span> to <span class="uname">LATIN SMALL LETTER S</span> + <span class="uname">LATIN SMALL LETTER S</span></td>
</tr>
</table>

</aside>
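<p>A minimal Python sketch of the sharp s example follows. It assumes Python 3, whose <code>str.casefold()</code> is documented as implementing Unicode's default full case folding; the helper <code>caseless_equal</code> is a hypothetical name introduced here for illustration, not part of any specification.</p>

```python
sharp_s = "\u00df"  # U+00DF LATIN SMALL LETTER SHARP S

folded = sharp_s.casefold()  # full case fold: one code point becomes two
uppered = sharp_s.upper()    # default uppercase mapping
lowered = sharp_s.lower()    # already lowercase, so it is left unchanged


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after applying the default full case fold."""
    return a.casefold() == b.casefold()


print(folded, uppered, lowered)
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

<p>Because folding discards information, a comparison of this kind also treats distinct words such as "Maße" and "Masse" as equal.</p>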

<p>Because some applications cannot allocate additional storage when
@@ -622,31 +617,34 @@ <h3>Case Folding</h3>

<aside class="example">

<p>Other examples can be found in the Greek script, where several precomposed characters have multi-character
case fold mappings. The table below shows one such example, the character <code>U+1F9B</code> (<span class="uname" translate="no">GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI</span>) and its <code class="kw">full</code> and <code class="kw">simple</code> case fold mappings:</p>
<p>Examples of <code class="kw">full</code> versus <code class="kw">simple</code> case fold variations can be found in the Greek script, where several precomposed characters have multi-character case fold mappings. The table below shows one such example, the character <code>U+1F9B</code> (<span class="uname" translate="no">GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI</span>) and its <code class="kw">full</code> and <code class="kw">simple</code> case fold mappings:</p>

<table style="width: 100%">

<tr>
<td class="exampleChar">&#x1f9b;</td>
<td>&#x21d2;</td>
<td class="exampleChar">&#x1f23;&#x03b9;</td>
<td><em>Full:</em> <code>U+1F23&nbsp;U+03B9</code> <span class="uname">GREEK SMALL LETTER ETA WITH DASIA AND VARIA</span> + <span class="uname">GREEK SMALL LETTER IOTA</span></td>
<td><code class="kw">full</code>: <code>U+1F23&nbsp;U+03B9</code> <span class="uname">GREEK SMALL LETTER ETA WITH DASIA AND VARIA</span> + <span class="uname">GREEK SMALL LETTER IOTA</span></td>
</tr>
<tr>
<td class="exampleChar">&#x1f9b;</td>
<td>&#x21d2;</td>
<td class="exampleChar">&#x1f93;</td>
<td><em>Simple:</em> <code>U+1F93</code> <span class="uname" translate="no">GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI</span></td>
<td><code class="kw">simple</code>: <code>U+1F93</code> <span class="uname" translate="no">GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI</span></td>
</tr>
</table>

</aside>
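<p>The difference in length between the two fold forms for <code>U+1F9B</code> can be observed with a short Python sketch. Python 3 provides only the full fold (<code>str.casefold()</code>); it has no built-in simple case fold, so this sketch assumes the lowercase mapping as a stand-in, which coincides with the simple fold for this particular code point but not for every character.</p>

```python
# U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
eta = "\u1f9b"

# Full case fold: the result is longer than the input.
full_fold = eta.casefold()

# Stand-in for the simple case fold (assumption: for this code point the
# simple fold and the default lowercase mapping are the same character).
simple_fold_stand_in = eta.lower()

print([f"U+{ord(c):04X}" for c in full_fold])
print([f"U+{ord(c):04X}" for c in simple_fold_stand_in])
print(len(eta), len(full_fold), len(simple_fold_stand_in))
```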

<p>Note that case folding removes information from a string which cannot
be recovered later. For example, two <span class="qchar">s</span> letters in German do not necessarily represent <span class="qchar">&#xdf;</span> in unfolded text.</p>
<p>Another aspect of case folding is that it can be language sensitive.
Unicode defines default case mappings for each encoded character, but

<section id="caseMappingLanguageSensitivity">
<h3>Language Sensitivity</h3>

<p>Another aspect of case mapping and case folding is that they can be language sensitive.
Unicode defines <em>default</em> case mappings and case fold mappings for each encoded character, but
these are only defaults and are not appropriate in all cases. Some
languages need case folding to be tailored to meet specific linguistic
needs. One common example of this is the Turkic languages written in the
@@ -673,6 +671,37 @@ <h3>Case Folding</h3>
case, this word appears like this: <span class="qterm"><code>DİYARBAKIR</code></span>.
Notice that the ASCII letter <span class="qchar">i</span> maps to <span class="uname" translate="no">U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE</span>, while the letter <span class="qchar">ı</span> (<span class="uname" translate="no">U+0131 LATIN SMALL LETTER DOTLESS I</span>) maps to the ASCII uppercase <span class="qchar">I</span>. Similarly, case folding <span class="qchar">I</span> to <span class="qchar">i</span> would change the meaning of the text in Turkish, even though this is the expected mapping in other languages, such as English or German.</p>
</aside>
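<p>The contrast between the default and the Turkic uppercase mappings can be sketched in Python. Python 3's <code>str.upper()</code> applies only the default mappings, so the function <code>turkic_upper</code> below is a hypothetical, deliberately simplified tailoring written for illustration; a production system would more likely rely on a locale-aware library built on CLDR data.</p>

```python
# Hypothetical Turkic tailoring: handle the dotted/dotless i pairs
# before applying the default uppercase mapping to everything else.
_TURKIC_UPPER = str.maketrans({
    "i": "\u0130",  # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "I",  # LATIN SMALL LETTER DOTLESS I -> ASCII I
})


def turkic_upper(text: str) -> str:
    """Uppercase text using a simplified Turkic rule for i and dotless i."""
    return text.translate(_TURKIC_UPPER).upper()


city = "diyarbak\u0131r"

default_upper = city.upper()        # both i and dotless i become ASCII I
tailored_upper = turkic_upper(city)  # the dotted/dotless distinction survives

print(default_upper)
print(tailored_upper)
```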

</section>

<section id="caseFoldApplication">
<h3>Uses for Case Folding</h3>
<p>Some document formats or protocols seek to aid interoperability or
provide an aid to content authors by ignoring case variations in the
<a data-lt="vocabulary">vocabulary</a> they define or in user-defined values permitted by the
format or protocol.</p>


<aside class="example">

<p>One example where this occurs is when matching element names between an HTML document and its associated style sheet. Consider this HTML fragment: </p>
<pre>&lt;style type="text/css"&gt;

SPAN.hello {
text-decoration: underline;
}
&lt;/style&gt;

&lt;span class="hello"&gt;Hello World!&lt;/span&gt;
</pre>

<p>The <code class="kw" translate="no">SPAN</code> in the stylesheet
matches the <code class="kw" translate="no">span</code> element in the
document, even though the stylesheet uses uppercase and the HTML markup
does not.</p>
</aside>


<p>Sometimes case can vary in a way that is not semantically meaningful
or is not fully under the user's control. This is particularly true
when searching a document, but may sometimes also apply
@@ -707,6 +736,7 @@ <h3>Case Folding</h3>
These case-fold mappings are defined in the <cite>Common Locale Data
Repository</cite> [[UAX35]] project of the Unicode Consortium.</p>
<p>For advice on how to handle case folding see <a href="#handlingCaseFolding"></a>.</p>
</section>
</section>
<section id="unicodeNormalization">
<h3>Unicode Normalization</h3>