diff --git a/index.html b/index.html index e2ebf87..698ac92 100644 --- a/index.html +++ b/index.html @@ -171,12 +171,16 @@

Terminology

Much of the terminology needed to understand this document is provided by the Internationalization Glossary [[I18N-GLOSSARY]]. Some terms are also defined by [[CHARMOD-NORM]] and can be found in the Terminology and Notation section of that document.

+

Unicode, also known as the Universal Character Set, allows Web documents to be authored in any of the world's writing systems, scripts, or languages, on any computing platform, and then to be exchanged, read, and searched by the Web's users around the world. The first few chapters of the Unicode Standard [[Unicode]] provide useful background reading. Also see the Unicode Collation Algorithm [[UTS10]], which contains a chapter on searching.

+

Corpus The natural language text contained by a document or set of documents which the user would like to search.

Segmentation The process of breaking natural language text up into distinct words and phrases. This often includes operations such as "named entity recognition" (such as recognizing that the three word sequence Dr. Jonas Salk is a person's name).

Stemming A process or operation that reduces words to their "stem" or root. For example, the words runs, ran, and running all share the stem run. This is sometimes called (more formally) lemmatization and the stem is sometimes called the lemma.

-

Unicode, also known as the Universal Character Set, allows Web documents to be authored in any of the world's writing systems, scripts, or languages, on any computing platforms and then to be exchanged, read, and searched by the Web's users around the world. The first few chapters of the Unicode Standard [[Unicode]] provide useful background reading. Also see the Unicode Collation Algorithm [[UTS10]], which contains a chapter on searching.

+

Full-Text Search refers to searches that process the entire contents of the textual document or set of documents. Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

+

Frequently this means that a full-text search employs indexes and natural language processing. When you are using a search engine, you are using a form of full text search. Full text search often breaks natural language text into words or phrases (this is called segmentation) and may apply complex processing to get at the semantic "root" values of words (this is called stemming). These processes are sensitive to language, context, and many other aspects of textual variation.

+ @@ -204,12 +208,8 @@

String Searching in Natural Language Content

Users of the Web often want to search documents for particular words or phrases within the natural language text of a given document. This is different from the sorts of programmatic matching needed by formal languages (for example, markup languages such as [[HTML]]; style sheets [[CSS21]]; or data formats such as [[TURTLE]] or [[JSON-LD]]), which are described by our document [[CHARMOD-NORM]].

-

There are different types of string searching.

- -

Full-Text Search refers to searches that process the entire contents of the textual document or set of documents. Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

- -

Frequently this means that a full-text search employs indexes and natural language processing. When you are using a search engine, you are using a form of full text search. Full text search often breaks natural language text into words or phrases (this is called segmentation) and may apply complex processing to get at the semantic "root" values of words (this is called stemming). These processes are sensitive to language, context, and many other aspects of textual variation.

- +

There are different types of string searching. +

One limited form of full-text search—and the topic of this document—is sub-string matching. One familiar form of sub-string matching is the "find" feature of your browser. A sub-string match searches the body or corpus of a document with the user's input, seeking a match. Find operations can have different options or implementation details, such as the addition or removal of case sensitivity, or whether the feature supports different aspects of a regular expression language or "wildcards".

One way that sub-string matching usually differs from other types of full-text search is that, while it may use algorithms in an attempt to suppress or ignore textual variations, it does not produce matches that contain additional or unspecified character sequences, words, or phrases.

@@ -219,23 +219,71 @@

Comparison of some differences between sub-string matching and other types o

-

Additional Types of Equivalence

-

The note Character Model for the World-Wide Web: String Matching [[CHARMOD-NORM]] describes a number of textual equivalences found in [[Unicode]]. There are other types of equivalence that are interesting when performing string searching. The forms of equivalence found in [[CHARMOD-NORM]] are all based on character properties assigned by Unicode or due to the mapping of legacy character encodings to the Unicode character set. Most of the "interesting equivalences" in this section go outside of those defined by Unicode. These additional "text normalizations" are sometimes application, natural language, or domain specific and shouldn't be overlooked by specifications or implementations as an additional consideration.

+

The Character Model for the World-Wide Web: String Matching [[CHARMOD-NORM]] describes textual equivalences such as case folding and different Unicode normalization forms which can affect sub-string matching. + +

There are other types of equivalence that are interesting when performing sub-string searching. The forms of equivalence mentioned above are all based on character properties assigned by Unicode or due to the mapping of legacy character encodings to the Unicode character set. Many of the "interesting equivalences" in this section go outside of those defined by Unicode. These additional "text normalizations" are sometimes application, natural language, or domain specific and should not be overlooked by specifications or implementations as an additional consideration.

+ +
+

Variations in Content and User Input

+ +

Quite often, the user's input does not use a sequence of code points identical to that in the text being searched. This can happen for a variety of reasons. Sometimes it is because the text varies in ways the user cannot predict. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed—or because the user cannot be bothered to input the text accurately.

+ +

For example, users often omit accents when entering search terms in Latin-script languages, particularly on mobile keyboards, even though the text they are searching includes the accents. In these cases, users generally expect the search operation to be more "promiscuous" to make up for their failure to make the additional effort needed.
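One common way to approximate this accent-tolerant behavior is to decompose both strings and drop combining marks before comparing. The following is a minimal sketch, not a recommended algorithm; the function names `strip_marks` and `loose_contains` are hypothetical, and removing every nonspacing mark is an assumption that will be too aggressive for some languages.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop nonspacing marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def loose_contains(corpus: str, query: str) -> bool:
    """Sub-string test that ignores diacritics on both sides."""
    return strip_marks(query) in strip_marks(corpus)


print(loose_contains("La côte est belle", "cote"))
```

Because the marks are removed from both the query and the corpus, an accented query also matches unaccented text, which may or may not be what the user expects.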

+ +

A different example might be the presence or absence of short vowels in the Arabic and Hebrew scripts. For most languages in these scripts, the inclusion of the short vowels is entirely optional, but the presence of vowels in text being searched might impede a match if the user doesn't enter or know to enter them.

+ +

Arabic users generally do not enter short vowels and most Arabic texts do not include these vowels—but some texts do include them. Searching is affected by this, but meaning generally is not. A generalized description of this might be "optional to encode" sequences.

+ +

German users sometimes enter umlauts directly and sometimes replace them with the base letter followed by e, e.g. Dürst vs. Duerst. Either spelling is recognizable and has the same meaning. The umlauts are probably "better" than the e spelling, but German speakers are not confused by the difference.

+ +

Note well that other languages use the same characters for different purposes. The formal name of the "umlaut" diacritic in Unicode is diaeresis and languages such as French, Spanish, and occasionally English use it to indicate the need to pronounce a specific letter, such as the word "ambigüedad" in Spanish or a name like "Zoë" in English.

+ +

Users in languages such as French sometimes omit accents when inputting search terms because it is more work to enter the correct character, even though this affects the meaning. For example, they might type cote and might expect to find the variations (which have different meanings) like côte or côté, etc. This is "misspelling".

+ +

(Ed.: This is the current discussion in #10)

+ +

Some languages like Bengali have minor variations in how to "spell" a word which affects the pronunciation slightly but which are considered the same. For example: রাণি (rɑn̈i) vs. রাণী (rɑn̈ī) vs. রানি (rɑni) vs. রানী (rɑnī).

+ +

This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.

+ +
+ +
+

Case Folding

+ +

A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice versa). Most sub-string matching features, such as the browser "find" command, offer a user-selectable option for matching the case of the input to that of the text.

+ +

For a survey of case folding, see the discussion here in [[CHARMOD-NORM]].
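As a rough illustration of case-insensitive sub-string matching, the sketch below uses Python's built-in `str.casefold()`, which applies Unicode full case folding. The helper name `caseless_contains` is hypothetical, and the sketch deliberately ignores normalization and language-specific tailoring (such as Turkish dotted and dotless i).

```python
def caseless_contains(corpus: str, query: str) -> bool:
    """Sub-string test after Unicode full case folding of both strings."""
    return query.casefold() in corpus.casefold()


print(caseless_contains("Die STRASSE ist lang", "straße"))
```

Full case folding maps "ß" to "ss", so the query above matches text written in capitals, which plain lowercasing would miss.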

+ +

Script Equivalence

-

Some languages are written in more than one script. For example, Japanese uses two syllabic scripts, hiragana and katakana. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts.

+ +

Some languages are written in more than one script. For example, Japanese uses two syllabic scripts, hiragana and katakana. These scripts encode the same phonemes; thus the user might expect that typing in a search term in hiragana would find the exact same word spelled out in katakana. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts.
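A minimal sketch of kana folding follows. It relies on the fact that the hiragana letters U+3041 through U+3096 sit exactly 0x60 code points below their katakana counterparts. The function names are hypothetical, and the sketch makes no attempt to handle half-width katakana, the prolonged sound mark, or kanji spellings of the same word.

```python
HIRAGANA_START = 0x3041  # HIRAGANA LETTER SMALL A
HIRAGANA_END = 0x3096    # HIRAGANA LETTER SMALL KE
KANA_OFFSET = 0x60       # distance from a hiragana letter to its katakana twin


def hiragana_to_katakana(text: str) -> str:
    """Map each hiragana letter to the corresponding katakana letter."""
    out = []
    for ch in text:
        cp = ord(ch)
        if HIRAGANA_START <= cp <= HIRAGANA_END:
            out.append(chr(cp + KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def kana_insensitive_contains(corpus: str, query: str) -> bool:
    """Sub-string test that treats hiragana and katakana as equivalent."""
    return hiragana_to_katakana(query) in hiragana_to_katakana(corpus)


print(hiragana_to_katakana("ひらがな"))
```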

+ +

East Asian Width

-

Needs more work. This is not intended to be more than a placeholder.

-

Some compatibility characters were encoded into Unicode to account for single- or multibyte representation in legacy character encodings or for compatibility with certain layout behaviors in East Asian languages. For example, the full-width characters in the range U+FF01 through U+FF5E or the half-width katakana characters.
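One possible way to neutralize width differences is Unicode compatibility normalization. The sketch below uses NFKC from Python's standard `unicodedata` module; the helper names are hypothetical. Note the assumption being made: NFKC folds many compatibility characters besides width variants (ligatures, superscripts, and so on), so it is broader than a pure width fold.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Fold full-width and half-width forms via NFKC (also folds other compatibility characters)."""
    return unicodedata.normalize("NFKC", text)


def width_insensitive_contains(corpus: str, query: str) -> bool:
    """Sub-string test after width folding of both strings."""
    return fold_width(query) in fold_width(corpus)


print(fold_width("ＡＢＣ１２３"))
```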

+ +
@@ -324,26 +372,12 @@

Digit Shaping

-
-

Encoding Variation

- -

See Indic doc and #10 for details

-

Some words or graphemes can be encoded or "spelled differently".

- -

Arabic users generally do not enter short vowels and most Arabic texts do not include these vowels—but some texts do include them. Searching is affected by this, but meaning generally is not. A generalized description of this might be "optional to encode" sequences.

- -

German users sometimes enter accents directly and sometimes by replacing umlauts with the letter e, e.g. Dürst vs. Duerst. Either spelling is potentially correct and has the same meaning. The umlauts are probably "better" than the e spelling, but Germans are not confused by the difference. (Note well that other languages do not always share this use of the umlaut). This might be viewed as an "alternately encoded spelling".

- -

French users sometimes omit entering accents when (for example) searching because it is more work to enter the correct character, even though this affects the meaning. For example, they might type cote and might expect to find the variations (which have different meanings) like côte or côtè, etc. This is "misspelling".

- -

(Ed.: This is the current discussion in #10)

-

Some languages like Bengali have minor variations in how to "spell" a word which affects the pronunciation slightly but which are considered the same. For example: রাণি (rɑn̈i) vs. রাণী (rɑn̈ī) vs. রানি (rɑni) vs. রানী (rɑnī).

-
-

Orthographic Variation

Some languages have different orthographic traditions that vary by region or dialect or allow different spellings of the same word. Searches and spell-checking may need to know about these variations.

+ +

US English and UK English have different spelling traditions, which manifest in different ways. For example, color versus colour or exchanging the letters s and z as in internationaliZation vs. internationaliSation. A few words have even more divergent spellings, such as jail vs. gaol.

@@ -352,48 +386,10 @@

Whitespace Normalization

Some languages use whitespace to separate words, sentences, or paragraphs while others do not. When performing sub-string matching, different forms of whitespace found in [[Unicode]] must be normalized so that the match succeeds.
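As a simple sketch of whitespace normalization, the helper below collapses every run of Unicode whitespace into a single U+0020 SPACE. It relies on Python's `str.split()` with no arguments, which splits on any character for which `str.isspace()` is true (including U+00A0 NO-BREAK SPACE and U+3000 IDEOGRAPHIC SPACE). The function names are hypothetical, and format characters such as U+200B ZERO WIDTH SPACE are not covered by this approach.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of Unicode whitespace to one space and trim the ends."""
    return " ".join(text.split())


def whitespace_insensitive_contains(corpus: str, query: str) -> bool:
    """Sub-string test after whitespace normalization of both strings."""
    return normalize_whitespace(query) in normalize_whitespace(corpus)


print(normalize_whitespace("hello\u00a0\u3000world"))
```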

-
-

Variations in User Input

- -

One of the primary considerations for string searching is that, quite often, the user's input is not identical to the way that the text being searched is encoded.

+
+

Input Effort

-

One primary reason this happens is because the text can vary in ways the user cannot predict. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed—or because the user cannot be bothered to input the text accurately. For example, users often omit accents when entering Latin-script languages, particularly on mobile keyboards, even though the text they are searching includes the accents. In these cases, users generally expect the search operation to be more "promiscuous" to make up for the failure to add additional effort to their input.

- -

For example, a user might expect a term entered in lowercase to match uppercase equivalents. Conversely, when the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.

- -

A different case is where the text can vary in multiple ways, but the user can only type a single search term in. For example, the Japanese language uses two different phonetic scripts, hiragana and katakana. These scripts encode the same phonemes; thus the user might expect that typing in a search term in hiragana would find the exact same word spelled out in katakana.

- -

A different example might be the presence or absence of short vowels in the Arabic and Hebrew scripts. For most languages in these scripts, the inclusion of the short vowels is entirely optional, but the presence of vowels in text being searched might impede a match if the user doesn't enter or know to enter them.

- -

This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.

- -

Consider a document containing these strings: "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ".

- -

In the table below, the user's input (on the left) might be considered a match for the above items as follows:

- - - - - - - - - - - - - - - - - - - - - - - -
User Input | Matched Strings
e (lowercase 'e') | "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ"
E (uppercase 'E') | "RE-RESUME" and "RE-RÉSUMÉ"
é (lowercase 'e' with acute accent) | "re-résumé" and "RE-RÉSUMÉ"
É (uppercase 'E' with acute accent) | "RE-RÉSUMÉ"
+

In addition to variations of case or the use of accents, Unicode also has an array of canonical equivalents or compatibility characters (as described in the sections above) that might impact string searching.

For example, consider the letter "K". Characters with a compatibility mapping to U+004B LATIN CAPITAL LETTER K include:

@@ -433,6 +429,40 @@

Considerations for Searching

Implementers often need to provide simple "find text" algorithms, and specifications often try to define APIs to support these needs. Find operations on text generate different user expectations and thus have different requirements from the absolute identity matching needed by document formats and protocols. It is important to note that domain-specific requirements may impose additional restrictions or alter the considerations presented here.

+

Increasing input effort from the user SHOULD be mirrored by more selective matching.

+ +

When the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.
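One way to sketch this "more effort, more selective" behavior is to relax the comparison only along the dimensions the user did not bother to specify: ignore case when the query has no uppercase letters, and ignore diacritics when the query has none. This is an illustrative assumption about how such a rule could be realized, not an algorithm defined by this document; the function names are hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop nonspacing marks (category Mn) and return the result in NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def effort_aware_contains(text: str, query: str) -> bool:
    """Sub-string test that is only as strict as the query itself."""
    q = unicodedata.normalize("NFC", query)
    t = unicodedata.normalize("NFC", text)
    if strip_marks(q) == q:  # no diacritics typed: ignore them in the text
        t = strip_marks(t)
    if q == q.lower():       # no uppercase typed: ignore case in the text
        t = t.lower()
    return q in t


docs = ["re-resume", "RE-RESUME", "re-résumé", "RE-RÉSUMÉ"]
for query in ["e", "E", "é", "É"]:
    print(query, [d for d in docs if effort_aware_contains(d, query)])
```

With these illustrative strings, a plain lowercase "e" matches all four, while "É" matches only the string that contains that exact letter.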

+ + +