From ce183d0674dc96b8ab962764d4f0a62a4e9a0d3e Mon Sep 17 00:00:00 2001
From: "@aphillips"
Date: Thu, 2 Nov 2017 13:20:17 -0700
Subject: [PATCH] Serious major rewrite of the case mapping/case folding section

---
 index.html | 136 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 83 insertions(+), 53 deletions(-)

diff --git a/index.html b/index.html
index 6ae7ee8..4f9854e 100644
--- a/index.html
+++ b/index.html
@@ -516,7 +516,7 @@

The String Matching Problem

types of text variation that affect both user perception of text on the Web and the string processing on which the Web relies.

-

Case Folding

+

Case Mapping and Case Folding

Some scripts and writing systems make a distinction between UPPER, lower, and Title case characters. Most scripts, such as the Brahmic scripts of India, the Arabic script, and the scripts used to @@ -524,43 +524,44 @@

Case Folding

some important ones do. Examples of such scripts include the Latin script used in the majority of this document, as well as scripts such as Greek, Armenian, and Cyrillic.

-

Some document formats or protocols seek to aid interoperability or - provide an aid to content authors by ignoring case variations in the - vocabulary they define or in user-defined values permitted by the - format or protocol. For example, this occurs when matching element - names - between an HTML document and its associated style sheet. Consider this - HTML fragment:

- -

The SPAN in the stylesheet - matches the span element in the - document, even though the stylesheet uses uppercase and the HTML markup - does not.

-

Case folding is the process of making two texts identical - which differ in case but are otherwise "the same".

-

Case folding might, at first, appear simple. However there are - variations that need to be considered when treating the full range of - Unicode in diverse languages. For more information, - [[!Unicode]] Chapter 5 (in v8.0, Section 5.18) - discusses case mappings in detail.

- -

Unicode defines the default case fold mapping for each Unicode code point. - Since most scripts do not provide a case distinction, most Unicode code - points do not require a case fold mapping. For those characters that - have a case fold mapping, the majority have a simple, straight-forward - mapping to a single matching (generally lowercase) code point. Unicode - calls these the common case fold mappings, as they are shared by - Unicode's case fold mappings. + +

For those scripts which have a case distinction, Unicode defines a default UPPER, lower, and Title case character mapping for each Unicode code point. These default mappings can be found in the Unicode Character Database (UCD). Case mapping, at first, appears simple. However, there are variations that need to be considered when treating the full range of Unicode in diverse languages.

+ + + + + + + +

Case folding is the process of making two texts which differ only in case identical for comparison purposes. This is distinct from case mapping for display purposes. As with the default case mappings, Unicode defines default case fold mappings for each Unicode code point. Unicode defines two forms of case fold mapping, which we'll examine below.
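The difference between case mapping for display and case folding for comparison can be illustrated with a minimal sketch. It assumes Python's built-in `str.upper`, `str.lower`, and `str.casefold` as stand-ins for the Unicode default case mappings and case fold mappings. The helper name `caseless_match` is hypothetical, and the sketch deliberately ignores Unicode normalization.

```python
def caseless_match(a: str, b: str) -> bool:
    """Compare two strings after case folding (no normalization is applied)."""
    return a.casefold() == b.casefold()


# Case mapping: produces text that is intended for display.
shout = "Straße".upper()

# Case folding: produces a comparison key, not text to show to users.
same = caseless_match("Straße", "STRASSE")

# Lowercasing both sides is not a substitute for case folding here.
lower_only = "Straße".lower() == "STRASSE".lower()

print(shout, same, lower_only)  # STRASSE True False
```

In this sketch the folded strings match, while the lowercased strings do not, because lowercasing leaves U+00DF unchanged.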

+ +

As with case mappings, most Unicode code points do not require a case fold mapping, since most scripts do not have a case distinction. For those characters that + have a case fold mapping, the majority have a simple, straightforward mapping to a single matching (generally lowercase) code point. Unicode + calls these the common case fold mappings, as they are shared by both of Unicode's case fold mappings.

-

In addition to the common case folding mappings, a few characters - have a case fold mapping that would normally map one - Unicode character to more than one during case folding. These are called the full case fold mappings. - Together with the common case fold mappings, these provide the - default case fold mapping for all of Unicode. This case fold mapping is referred to in this - document as Unicode C+F. +

A few characters have a case fold mapping that maps one Unicode code point to two or more code points during case folding. These are called the full case fold mappings. Together with the common case fold mappings, these provide the default case fold mapping for all of Unicode. This case fold mapping is referred to in this document as Unicode C+F.
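As a rough illustration of full case fold mappings, the following sketch lists the characters in a string whose case fold is longer than one code point. It assumes that Python's `str.casefold` applies full case folding and so approximates Unicode C+F. The helper name `fold_expansions` and the sample text are hypothetical.

```python
def fold_expansions(text: str) -> dict:
    """Map each character whose case fold is longer than one code point to its fold."""
    return {ch: ch.casefold() for ch in text if len(ch.casefold()) > 1}


def code_points(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# U+00DF (sharp s) and U+FB01 (fi ligature) both expand when folded.
expansions = fold_expansions("Straße \uFB01sh")
for ch, folded in expansions.items():
    print(code_points(ch), "->", code_points(folded))
# U+00DF -> U+0073 U+0073
# U+FB01 -> U+0066 U+0069
```

Because the folded text can be longer than the input, code that folds strings in place or into fixed-size buffers needs to allow for this growth.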

Because some applications cannot allocate additional storage when @@ -622,8 +617,7 @@

Case Folding

+ +
+ +
+

Uses for Case Folding

+

Some document formats or protocols seek to aid interoperability or + provide an aid to content authors by ignoring case variations in the + vocabulary they define or in user-defined values permitted by the + format or protocol.

+ + + + +

Sometimes case can vary in a way that is not semantically meaningful or is not fully under the user's control. This is particularly true when searching a document, but may sometimes also apply @@ -707,6 +736,7 @@

Case Folding

These case-fold mappings are defined in the Common Locale Data Repository [[UAX35]] project of the Unicode Consortium.

For advice on how to handle case folding see .

+

Unicode Normalization