From da12ef02e2e4b1641b6d0172627aa88914ae38e6 Mon Sep 17 00:00:00 2001 From: "@aphillips" Date: Fri, 8 Apr 2016 16:40:46 -0700 Subject: [PATCH] Addressing additional comments and adding specific bits of text. - added a better introduction to Section 4 about find text - issue-78 hebrew/arabic short vowels: added text to the searching section - attempted to address JcK's concern about UTS39 reference, albeit in a temporary manner --- index.html | 58 +++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 42 insertions(+), 16 deletions(-) diff --git a/index.html b/index.html index deceba9..1c89364 100644 --- a/index.html +++ b/index.html @@ -702,8 +702,8 @@

Canonical vs. Compatibility Equivalence

represent the same abstract character. When correctly displayed, these should always have the same visual appearance and behavior. Generally speaking, two canonically equivalent Unicode texts should - be considered to be identical as text. Canonical decomposition - removes these primary distinctions between two texts.

+ be considered to be identical as text. Unicode defines a process called + canonical decomposition that removes these primary distinctions between two texts.
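The effect described above can be observed directly with any conformant normalization API. The following minimal sketch uses Python's unicodedata module, which is an illustrative choice only, not something this document requires:

```python
import unicodedata

# U+00E9 (precomposed "é") and "e" + U+0301 (COMBINING ACUTE ACCENT)
# are canonically equivalent: two encodings of the same abstract character.
precomposed = "\u00e9"
decomposed = "e\u0301"

# The code point sequences themselves differ...
assert precomposed != decomposed

# ...but canonical decomposition (NFD) removes the distinction,
# so the normalized forms compare equal:
assert (unicodedata.normalize("NFD", precomposed)
        == unicodedata.normalize("NFD", decomposed))
```

Canonical composition (NFC) would likewise map both sequences to the same precomposed form.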

Examples of canonical equivalence defined by Unicode include:

Compatibility equivalence is a weaker equivalence - between characters or sequences of characters that represent the + between Unicode characters or sequences of Unicode characters that represent the same abstract character, but may have a different visual appearance - or behavior. Generally a compatibility decomposition removes + or behavior. Generally the process called compatibility decomposition removes formatting variations, such as superscript, subscript, rotated, circled, and so forth, but other variations also occur. In many cases, characters with compatibility decompositions represent a distinction of a semantic nature; replacing the use of distinct characters with their compatibility decomposition can therefore - cause problems and texts that are equivalent after compatibility - decomposition often were not perceived as being identical beforehand - and usually should not be treated as equivalent by a formal + change the meaning of the text. Texts that are equivalent after + compatibility decomposition often were not perceived as being + identical beforehand and SHOULD NOT be treated as equivalent by a formal language.
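The semantic hazard described above can be sketched with the compatibility normalization forms; the snippet again uses Python's unicodedata module purely for illustration:

```python
import unicodedata

# U+00B2 (SUPERSCRIPT TWO) has a compatibility decomposition,
# but no canonical decomposition, to the digit "2".
area = "100 m\u00b2"  # "100 m²", i.e. one hundred square metres

# Canonical normalization (NFC) preserves the superscript distinction...
assert unicodedata.normalize("NFC", area) == "100 m\u00b2"

# ...while compatibility normalization (NFKC) folds the formatting away,
# silently turning "square metres" into plain "100 m2":
assert unicodedata.normalize("NFKC", area) == "100 m2"
```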

The following table illustrates various kinds of compatibility equivalence in Unicode:

@@ -868,7 +868,8 @@

Canonical vs. Compatibility Equivalence

In the above table, it is important to note that the characters - illustrated are actual Unicode codepoints. They were + illustrated are actual Unicode codepoints, not just presentational + variations due to context or style. Each character was encoded into Unicode for compatibility with various legacy character encodings. They should not be confused with the normal kinds of presentational processing used on their non-compatibility @@ -1071,7 +1072,9 @@

Limitations of Normalization

if somewhat less "identical-looking" spoofs such as l vs. 1 or O vs. 0.

Confusable characters, regardless of script, can present spoofing - and other security risks. For more information on homographs and confusability, see [[UTS39]].

+ and other security risks. There are a variety of specifications and + standards that attempt to document or describe the issues of homographs and confusability. + One such example is [[UTS39]].

Finally, note that Unicode Normalization, even the K Compatibility forms, does not bring together characters that have the same intrinsic meaning or function, but which vary in appearance or usage. For example, U+002E (.) and U+3002 (。) @@ -1752,23 +1755,46 @@

Considerations for Matching Natural Language Content

document as part of the overall rearchitecting of the document. The text here is incomplete and needs further development. Contributions from the community are invited.

-

Searching content (one example is using the "find" command in your - browser) generates different user expectations and thus has different +

The preceding sections of this document were concerned with string + matching in formal languages, but there are other types of common text + matching operations on the Web.

+

Full natural language searching is a broad topic well beyond the + aspirations of this document. However, implementers often need to + provide simple "find text" algorithms and specifications often try to + define APIs to support these needs. Find operations on text generate different user expectations and thus have different requirements from the need for absolute identity matching needed by - document formats and protocols. Searching text has different - contextual needs and often provides different features.

+ document formats and protocols. This section describes the + requirements and considerations when designing a "find text" feature + or protocol. It is important to note that domain-specific requirements + may impose additional restrictions or alter the considerations + presented here.

One description of Unicode string searching can be found in Section 8 (Searching and Matching) of [[UTS10]].

One of the primary considerations for string searching is that, quite often, the user's input is not identical to the way that the text is - encoded in the text being searched. Users generally expect matching to + encoded in the text being searched. This often happens because the + text can vary in ways the user cannot predict or because the user's + keyboard or input method does not provide ready access to the textual + variations needed. In these cases, users generally expect matching to be more "promiscuous", particularly when they don't add additional - effort to their input. For example, they expect a term entered in + effort to their input.
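A minimal sketch of this mismatch, using Python's unicodedata module for illustration: the searched document encodes a term in precomposed form while the user's input method produces combining sequences, and a raw code point search fails until both sides are normalized to the same form:

```python
import unicodedata

# The document contains precomposed "é" (U+00E9)...
haystack = "r\u00e9sum\u00e9 attached"
# ...but the user's input method produced "e" + U+0301 (combining acute).
needle = "re\u0301sume\u0301"

# A raw code point comparison fails to find the term:
assert haystack.find(needle) == -1

# Normalizing both sides to a common form (NFC here) restores the match:
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(haystack).find(nfc(needle)) == 0
```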

+

For example, a user might expect a term entered in lowercase to match uppercase equivalents. Conversely, when the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the - base letter—they expect their search results to match (only) their + base letter—they might expect their search results to match (only) their more-specific input.
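One common heuristic for this behavior is sometimes called "smartcase": fold case only when the user's input is entirely lowercase. The sketch below is illustrative only; the function name and the exact policy are assumptions of this example, not requirements of this document:

```python
def find_text(term: str, text: str) -> int:
    """Return the index of the first match of term in text, or -1.

    "Smartcase" sketch: a term entered entirely in lowercase matches
    case-insensitively; a term containing any uppercase (extra user
    effort) must match exactly.
    """
    if term == term.casefold():
        # Lowercase input: fold both sides for a "promiscuous" match.
        # (Folding can change string lengths for some characters,
        # so the returned index is approximate in the general case.)
        return text.casefold().find(term.casefold())
    # Mixed- or uppercase input: require an exact match.
    return text.find(term)

# Lowercase "resume" matches "Resume"; "Resume" does not match "resume".
```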

+

A different case is where the text can vary in multiple ways, but + the user can enter only a single search term. For example, the + Japanese language uses two different phonetic scripts, hiragana + and katakana. These scripts encode the same phonemes; thus + the user might expect that typing a search term in hiragana + would find the exact same word spelled out in katakana. A + different example might be the presence or absence of short vowels in + the Arabic and Hebrew scripts. For most languages in these scripts, + the inclusion of the short vowels is entirely optional, but the + presence of vowels in the text being searched might impede a match if the + user doesn't enter or know to enter them.
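Variations such as these can be addressed by folding both the search term and the target text before comparison. The sketch below (illustrative Python, not a normative algorithm) maps hiragana letters onto the corresponding katakana by exploiting the fixed 0x60 code point offset between the two blocks, and strips non-spacing combining marks, which removes Arabic and Hebrew short-vowel points among others:

```python
import unicodedata

def fold_for_search(s: str) -> str:
    """Loosely fold text for matching (illustrative only; lossy)."""
    out = []
    for ch in unicodedata.normalize("NFD", s):
        cp = ord(ch)
        if 0x3041 <= cp <= 0x3096:
            # Hiragana letter -> corresponding katakana letter.
            out.append(chr(cp + 0x60))
        elif unicodedata.category(ch) != "Mn":
            # Keep everything except non-spacing combining marks
            # (e.g. Arabic fatha, Hebrew points).
            out.append(ch)
    return "".join(out)

# Hiragana さくら folds to katakana サクラ, so the two spellings match:
assert fold_for_search("さくら") == fold_for_search("サクラ")
# Vowelled Arabic كَتَبَ matches the unvowelled spelling كتب:
assert fold_for_search("كَتَبَ") == fold_for_search("كتب")
```

Real implementations use tailored fold tables: stripping every combining mark is far too aggressive for many languages, since it also erases distinctions (such as accents in French) that users may expect a search to respect.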

This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra