useExperimentalStyles: true,
// specification status (e.g. WD, LCWD, NOTE, etc.). If in doubt use ED.
specStatus: "ED",
publishDate: "2017-12-06",
previousPublishDate: "2015-11-19",
previousMaturity: "WD",

localBiblio: {
    "UTS18": {
        title: "Unicode Technical Standard #18: Unicode Regular Expressions",
        href: "https://unicode.org/reports/tr18/",
        authors: [ "Mark Davis", "Andy Heninger" ]
    },
    "Encoding": {
        title: "Encoding",
        href: "https://www.w3.org/TR/encoding/",
        authors: [ "Anne van Kesteren", "Joshua Bell", "Addison Phillips" ]
    },
    "UTS10": {
        title: "Unicode Technical Standard #10: Unicode Collation Algorithm",
        href: "https://www.unicode.org/reports/tr10/",
        authors: [ "Mark Davis", "Ken Whistler", "Markus Scherer" ]
    },
    "UAX9": {
        title: "Unicode Standard Annex #9: Unicode Bidirectional Algorithm",
        href: "https://unicode.org/reports/tr9/",
        authors: [ "Mark Davis", "Aharon Lanin", "Andrew Glass" ]
    },
    "UAX11": {
        title: "Unicode Standard Annex #11: East Asian Width",
        href: "https://www.unicode.org/reports/tr11/",
        authors: [ "Ken Lunde 小林劍" ]
    },
    "UAX29": {
        title: "Unicode Standard Annex #29: Unicode Text Segmentation",
        href: "https://www.unicode.org/reports/tr29/",
        authors: [ "Mark Davis" ]
    },
    "UTS39": {
        title: "Unicode Technical Standard #39: Unicode Security Mechanisms",
        href: "https://www.unicode.org/reports/tr39/",
        authors: [ "Mark Davis", "Michel Suignard" ]
    },
    "UTR36": {
        title: "Unicode Technical Report #36: Unicode Security Considerations",
        href: "https://www.unicode.org/reports/tr36/",
        authors: [ "Mark Davis", "Michel Suignard" ]
    },
    "UTR50": {
        title: "Unicode Technical Report #50: Unicode Vertical Text Layout",
        href: "https://www.unicode.org/reports/tr50/",
        authors: [ "Koji Ishii 石井宏治" ]
    },
    "UTR51": {
        title: "Unicode Technical Report #51: Unicode Emoji",
        href: "https://www.unicode.org/reports/tr51/",
        authors: [ "Mark Davis", "Peter Edberg" ]
    },
    "STRING-SEARCH": {
        title: "Character Model for the World Wide Web: String Searching",
        href: "https://w3c.github.io/string-search/",
        authors: [ "Addison Phillips" ]
    },
}
};

Terminology and Notation

wish to treat the default pair of grapheme clusters "ch" as a single grapheme cluster. Note that the interaction between the language of string content and the end-user's preferences might be complex.


Characters that are identical or confusable in appearance can present spoofing and other security risks.

Identical-Appearing Characters and the Limitations of Normalization


Character Escapes and Includes

Most document formats or protocols provide an escaping mechanism to permit the inclusion of characters that are otherwise difficult to input, process, or encode. These escaping mechanisms provide an

Character Escapes

<span class="h&#xe9;llo">Hello World!</span>


You would expect that text to display like the following: Hello world!


In order for this to work, the user-agent (browser) had to match two strings representing the class name héllo, even though the CSS and HTML each used a different escaping mechanism. The above fragment demonstrates one way that text can vary and still be considered "the same" according to a specification: the class name h\e9llo matched the class name in the HTML mark-up h&#xe9;llo (and would also match the literal value héllo using the code point é [U+00E9 LATIN SMALL LETTER E WITH ACUTE]).
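This kind of cross-format match can be sketched in Python. The HTML numeric character reference is expanded with the standard html module; the CSS side uses a deliberately minimal, hypothetical escape expander (a real CSS tokenizer has more rules than this):

```python
import html
import re

# HTML numeric character reference, expanded by the HTML parser:
html_class = html.unescape("h&#xe9;llo")

def expand_css_escapes(value: str) -> str:
    # Minimal CSS hex-escape expansion: a backslash, 1-6 hex digits,
    # and an optional terminating space (illustrative only).
    return re.sub(
        r"\\([0-9A-Fa-f]{1,6})\s?",
        lambda m: chr(int(m.group(1), 16)),
        value,
    )

css_class = expand_css_escapes(r"h\e9llo")

# After each format's own escapes are expanded, both class names are
# the same code point sequence and therefore match:
assert html_class == css_class == "h\u00e9llo"
```

The comparison succeeds only because each escaping mechanism is expanded on its own side before the code points are compared.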


Formal languages and document formats often offer facilities for including a piece of text from one resource inside another. An include is a mechanism for inserting content into the body of a resource. Include mechanisms import content into a resource at processing time. This affects the structure of the document and potentially matching against the vocabulary of the document. Examples of includes are entity references in XML, the XInclude [[XInclude]] specification, and @import rules in CSS.


An include is said to be include normalized if it does not begin with a combining mark (either in the form of a character escape or as a character literal in the included resource).
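A minimal sketch of this check in Python; the helper name is illustrative, and the test treats the Unicode general categories Mn, Mc, and Me as combining marks:

```python
import unicodedata

def is_include_normalized(text: str) -> bool:
    # An include is "include normalized" when it does not begin
    # with a combining mark (categories Mn, Mc, Me).
    if not text:
        return True
    return unicodedata.category(text[0]) not in ("Mn", "Mc", "Me")

assert is_include_normalized("voil\u00e0")
assert not is_include_normalized("\u0301voila")
```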


Invisible Unicode Characters


Emoji Sequences

(indicated by U+FE0F VARIATION SELECTOR-16) presentation of the base emoji.

Still another wrinkle in the use of emoji is flags. National flags can be composed using country codes derived from the [[BCP47]] registry, such as the sequence 🇿 [U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z] 🇲 [U+1F1F2 REGIONAL INDICATOR SYMBOL LETTER M], which is the country code (ZM) for the country Zambia: 🇿🇲. Other regional or special purpose flags can be composed using a flag emoji with various symbols or with regional indicator codes terminating in a cancel tag. For example, the flag of Scotland (🏴󠁧󠁢󠁳󠁣󠁴󠁿) can be composed like this: 🏴 [U+1F3F4 WAVING BLACK FLAG] 󠁧 [U+E0067 TAG LATIN SMALL LETTER G] 󠁢 [U+E0062 TAG LATIN SMALL LETTER B] 󠁳 [U+E0073 TAG LATIN SMALL LETTER S] 󠁣 [U+E0063 TAG LATIN SMALL LETTER C] 󠁴 [U+E0074 TAG LATIN SMALL LETTER T] 󠁿 [U+E007F CANCEL TAG].
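Both composition mechanisms can be sketched in Python (the helper name flag_for is illustrative):

```python
# Regional indicator symbols are the ASCII letters A-Z offset into
# the range U+1F1E6..U+1F1FF:
def flag_for(country_code: str) -> str:
    return "".join(
        chr(0x1F1E6 + ord(c) - ord("A")) for c in country_code.upper()
    )

zambia = flag_for("ZM")   # U+1F1FF U+1F1F2

# Tag characters mirror ASCII at U+E0000..U+E007F; a subdivision flag
# is WAVING BLACK FLAG + tag letters + CANCEL TAG:
scotland = (
    "\U0001F3F4"
    + "".join(chr(0xE0000 + ord(c)) for c in "gbsct")
    + "\U000E007F"
)
```

Both values display as a single flag image on platforms that support these sequences, even though the Scotland flag is seven code points long.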


Each of these mechanisms can be used together, so quite complex sequences of characters can be used to form a single emoji grapheme or image. Even very similar emoji sequences might

Legacy Character Encodings

Other Types of Equivalence


The preceding types of character equivalence are all based on character properties assigned by Unicode or due to the mapping of legacy character encodings to the Unicode character set. There also exist certain types of "interesting equivalence" that may be useful, particularly in searching text, that are outside of the equivalences defined by Unicode.



There are additional kinds of equivalence or processing that are appropriate when performing natural language searching or "find" features. These are described in another part of the Character Model series of documents ([[STRING-SEARCH]]). Specifications for a vocabulary, or those that define a matching algorithm for use in a formal syntax, SHOULD avoid applying additional custom folding, mapping, or processing such as described in that document, since these interfere with producing consistent, predictable results.




For example, Japanese uses two syllabic scripts, hiragana and katakana. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts. These additional "text normalizations" are sometimes application, natural language, or domain specific and shouldn't be overlooked by specifications or implementations as an additional consideration.
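As an illustration of such an application-specific folding (a simplistic sketch for a search feature, not a recommended step for formal matching), the katakana block can be mapped to hiragana by a fixed code point offset:

```python
def kata_to_hira(s: str) -> str:
    # Katakana ァ..ヶ (U+30A1..U+30F6) sit a fixed offset of 0x60
    # above the corresponding hiragana ぁ..ゖ (U+3041..U+3096).
    return "".join(
        chr(ord(c) - 0x60) if "\u30A1" <= c <= "\u30F6" else c
        for c in s
    )

assert kata_to_hira("カタカナ") == "かたかな"
```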

-

Another similar example is called digit shaping. Some scripts, such as Arabic or Thai, have their own digit characters for the numbers from 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text actually might contain the Unicode characters for the local digits. Users attempting to search a document might expect that typing one form of digit will find the equivalent digits.
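A sketch of such a digit folding in Python, using the Unicode decimal-digit property (again, something for a search feature, not for formal-syntax matching):

```python
import unicodedata

def fold_digits(s: str) -> str:
    # Map every decimal digit character (category Nd) to its
    # ASCII equivalent; leave everything else alone.
    return "".join(
        str(unicodedata.digit(c)) if unicodedata.category(c) == "Nd" else c
        for c in s
    )

assert fold_digits("\u0660\u0661\u0662") == "012"  # Arabic-Indic digits
assert fold_digits("\u0E51") == "1"                # Thai digit one
```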


The Matching Algorithm

  • Perform any additional matching tailoring specific to the specification.
  • Compare the resulting sequences of code points for identity.

    Converting to a Sequence of Unicode Code Points



    [C] Content authors SHOULD enter and store resources in a Unicode character encoding (generally UTF-8 on the Web).



    [C] Content authors SHOULD choose a normalizing transcoder when converting legacy encoded text or resources to Unicode unless the mapping of specific characters interferes with the meaning.


    The first step in comparing text is to ensure that both use the same digital representation. This means that implementations need to convert any text in a legacy character encoding to a sequence of Unicode code points. Normally this is done by applying a transcoder to convert the data to a consistent Unicode encoding form (such as UTF-8 or UTF-16). This allows bitwise comparison of the strings in order to determine string equality.


    A normalizing transcoder is a transcoder that performs a conversion from a legacy character encoding to Unicode and ensures that the result is in Unicode Normalization Form C (NFC). For most legacy character encodings, it is possible to construct a normalizing transcoder (by using any transcoder followed by a normalizer); it is not possible to do so if the legacy character encoding's repertoire contains characters not represented in Unicode. While normalizing transcoders only produce character sequences that are in NFC, the converted character sequence might not be include normalized (for example, if it begins with a combining mark).
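A normalizing transcoder can be sketched in Python as an ordinary transcoder followed by an NFC normalizer (the function name is illustrative):

```python
import unicodedata

def normalizing_transcode(data: bytes, encoding: str) -> str:
    # Any transcoder followed by a normalizer yields a normalizing
    # transcoder, provided the legacy repertoire maps into Unicode.
    return unicodedata.normalize("NFC", data.decode(encoding))

# windows-1252 0xE9 maps to é [U+00E9]; the result is already in NFC.
assert normalizing_transcode(b"caf\xe9", "windows-1252") == "caf\u00e9"
```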


    Because document formats on the Web often interact with or are processed using additional, external resources (for example, a CSS style sheet being applied to an HTML document), the consistent representation of text becomes important when matching values between documents that use different character encodings. Use of a normalizing transcoder helps ensure interoperability by making legacy encoded documents match the normally expected Unicode character sequence for most languages.


Most transcoders used on the Web produce NFC as their output, but several do not. This is usually to allow the transcoder to be round-trip compatible with the source legacy character encoding, to preserve other character distinctions, or to be consistent with other transcoders in use in user-agents. This means that the Encoding specification [[!Encoding]] and various other important transcoding implementations include a number of non-normalizing transcoders. Indeed, most compatibility characters in Unicode exist solely for round-trip conversion from legacy encodings and a number of these have singleton canonical mappings in NFC. You saw an example of this earlier in the document with Å [U+212B ANGSTROM SIGN].
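The U+212B singleton can be seen directly with Python's unicodedata module:

```python
import unicodedata

# U+212B ANGSTROM SIGN has a singleton canonical mapping: NFC replaces
# it with Å [U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE].
angstrom_sign = "\u212B"
assert unicodedata.normalize("NFC", angstrom_sign) == "\u00C5"
```

A trivially one-to-one transcoder that emits U+212B for a legacy "angstrom" code point is therefore not a normalizing transcoder.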


Bear in mind that most transcoders produce NFC output and that even those transcoders that do not produce NFC for all characters produce NFC for the preponderance of characters. In particular, there are no commonly-used transcoders that produce decomposed forms where precomposed forms exist or which produce a different combining character sequence from the normalized sequence (and this is true for all of the transcoders in [[!Encoding]]).


    Expanding Character Escapes and Includes


    Most document formats and protocols provide a means for encoding characters as an escape sequence or including external data, including text, into a resource. This is discussed in detail in Section 4.6 of [[!CHARMOD]] as well as above.

When performing matching, it is important to know when to interpret character escapes so that a match succeeds (or fails) appropriately. Normally, escapes, references, and includes are processed


<p id="&#x300;">Combining mark used as the value of 'id' attribute</p>


    Although technically the combining mark  ̀ [U+0300 COMBINING GRAVE ACCENT​] combines with the preceding quote mark, HTML does not consider the character (whether or not it is encoded as an entity) to form part of the HTML syntax.


    When performing a matching operation on a resource, the general rule is to expand escapes on the same "level" as the user is interacting with. For example, when considering the above example, a tool to view the source of the HTML would show the escape sequence &#x300; as a string of characters starting with an ampersand. A JavaScript program, by contrast, operates on the browser's interpretation of the document and would match the character U+0300 as the value of the attribute id.


    When processing the syntax of a document format, escapes are usually converted to the character sequence they represent before the processing of the syntax, except where explicitly forbidden by the format's processing rules. This allows resources to include characters of all types into the resource's syntactic structures.


    In some cases, pre-processing escapes creates problems. For example, expanding the sequence &lt; before parsing an HTML document would produce document errors.

    Choice of Normalization Form


    A specific Unicode normalization form is not always appropriate or available to content authors and the text encoding choices of users might not be obvious to downstream consumers of the data. As shown in this document, there are many different ways that content authors or applications could choose to represent the same semantic values when inputting or exchanging text. Normalization can remove distinctions that the users applied intentionally. Therefore:


[S] Specifications SHOULD NOT specify Unicode normalization in string matching for vocabularies.


    [I] Implementations MUST NOT alter the normalization form of syntactic or natural language content being exchanged, read, parsed, or processed except when required to do so as a side-effect of text transformation such as transcoding the content to a Unicode character encoding, case mapping or folding, or other user-initiated change, as consumers or the content itself might depend on the de-normalized representation.


    [I] Authoring tools SHOULD provide a means of normalizing resources and warn the user when a given resource is not in Unicode Normalization Form C.


    [S] Specifications of text-based formats and protocols that as part of their syntax definition require the text be in a normalized form MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC. Such a specification needs to address the requirements in .


    Specifications are generally discouraged from requiring formats or protocols to store or exchange data in a normalized form unless there are specific, clear reasons why the additional requirement is necessary. As many document formats on the Web do not require normalization, content authors might occasionally rely on denormalized character sequences. A normalization step could negatively affect such content.


    The canonical normalization forms (form NFC or form NFD) are intended to preserve the meaning and presentation of the text to which they are applied. This is not always the case, which is one reason why normalization is not recommended. NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding), as well as data created by current software or entered by users on most (but not all) keyboards, is already in this form. NFC also has a slight compactness advantage and is a better match to user expectations in most languages with respect to the relationship between characters and graphemes.


    [S] Specifications SHOULD NOT specify compatibility normalization forms (NFKC, NFKD).


    [I] Implementations MUST NOT apply compatibility normalization forms (NFKC, NFKD) unless specifically requested by the end user.


    The compatibility normalization forms (form NFKC and form NFKD) change the structure and lose the meaning of the text in important ways. Users sometimes use characters with a compatibility mapping in Unicode on purpose or they use characters in a legacy character encoding that have a compatibility mapping when converted to Unicode. This has to be considered intentional on the part of the content author. Although NFKC/NFKD can sometimes be useful in "find" operations or string searching natural language content, erasing compatibility differences is harmful.
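A few of these compatibility mappings, shown with Python's unicodedata module:

```python
import unicodedata

# NFKC erases compatibility distinctions that authors may intend to keep:
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"   # ﬁ LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFKC", "\u2460") == "1"    # ① CIRCLED DIGIT ONE
assert unicodedata.normalize("NFKC", "\u00B2") == "2"    # ² SUPERSCRIPT TWO
```

In each case the folded result is no longer distinguishable from plain text the author may also have used, which is why these forms are unsuitable for identity matching.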


    Requiring NFC requires additional care on the part of the specification developer, as content on the Web generally is not in a known normalization state. Boundary and error conditions for denormalized content need to be carefully considered and well-specified in these cases.


    [S] Specifications MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.


    [C] Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages.


    [C] Content authors SHOULD always encode text using consistent Unicode character sequences to facilitate matching, even if a Unicode normalization form is included in the matching performed by the format or implementation.


    In order for their content to be processed consistently, content authors should try to use a consistent sequence of code points to represent the same text. While content can be in any normalization form or might use a de-normalized (but valid) Unicode character sequence, inconsistency of representation will cause implementations to treat the different sequences as different. The best way to ensure consistent selection, access, extraction, processing, or display is to always use NFC.
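For example, in Python (a minimal demonstration of the point above):

```python
import unicodedata

composed = "\u00E9"      # é as a single code point
decomposed = "e\u0301"   # e + U+0301 COMBINING ACUTE ACCENT

# Canonically equivalent, but not the same code point sequence,
# so a code-point-based match fails:
assert composed != decomposed

# Consistently using NFC makes the two sequences identical:
assert unicodedata.normalize("NFC", decomposed) == composed
```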


    [C] Content authors SHOULD NOT include combining marks without a preceding base character in a resource.


    There can be exceptions to this. For example, when making a list of characters (such as a list of [[!Unicode]] characters), an author might want to use combining marks without a corresponding base character. However, use of a combining mark without a base character can cause unintentional display or, with naive implementations that combine the combining mark with adjacent syntactic content or other natural language content, processing problems. For example, if you were to use a combining mark, such as the character  ́ [U+0301 COMBINING ACUTE ACCENT​], as the start of a class attribute value in HTML, the class name might not display properly in your editor and be difficult to edit.


Some recommended base characters include ◌ [U+25CC DOTTED CIRCLE] (when the base character needs to be visible) or [U+00A0 NO-BREAK SPACE] (when the base character should be invisible).


Since content authors do not always follow these guidelines:


[S] Specifications of vocabularies MUST define the boundaries between syntactic content and character data as well as entity boundaries (if the language has any include mechanism). These need to include any boundary that could create conflicts when instances of the language are processed or matched, while allowing for character escapes designed to express arbitrary characters.


    Considerations When Requiring Normalization


    When a specification requires Unicode normalization for storage, transmission, or string matching, some additional considerations need to be addressed by the specification authors as well as by implementers of that specification:


    [S] Where operations can produce denormalized output from normalized text input, specifications MUST define whether the resulting output is required to be normalized or not. Specifications MAY state that performing normalization is optional for some operations; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off.
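String concatenation is a simple operation of this kind: joining two NFC strings can yield output that is no longer in NFC (a sketch; unicodedata.is_normalized requires Python 3.8 or later):

```python
import unicodedata

prefix = "e"        # in NFC
suffix = "\u0301"   # COMBINING ACUTE ACCENT; a lone mark is also in NFC

# The concatenation of two normalized strings need not be normalized:
joined = prefix + suffix
assert not unicodedata.is_normalized("NFC", joined)
assert unicodedata.normalize("NFC", joined) == "\u00E9"
```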


    [S] Specifications that require normalization MUST NOT make the implementation of normalization optional. Interoperability of matching cannot be achieved if some implementations normalize while others do not.


    An implementation that is required to perform normalization needs to consider these requirements:


    [I] Normalization-sensitive operations MUST NOT be performed unless the implementation has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.


    [I] A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.


    [I] Authoring tool implementations SHOULD warn users or prevent the input or creation of syntactic content starting with a combining mark that could interfere with processing, display, or interchange.



    Choice of Case Folding


    One important consideration in string identity matching is whether the comparison is case sensitive or case insensitive.


[S] Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used.


    [C] Content authors SHOULD always spell identifiers using consistent upper, lower, and mixed case formatting to facilitate matching, even if case-insensitive matching is supported by the format or implementation.

    Case-sensitive matching


    [S] Case-sensitive matching is RECOMMENDED for matching syntactic content, including user-defined values.


Vocabularies usually put a premium on predictability for content authors and users. Case-sensitive matching is the easiest to implement and introduces the least potential for confusion, since it generally consists of a comparison of the underlying Unicode code point sequence. Because it is not affected by considerations such as language-specific case mappings, it produces the least surprise for document authors that have included words, such as the Turkish examples above, in their syntactic content.


Case insensitivity is usually reserved for processing natural language content, such as a feature for searching text. However, cases exist in which case-insensitivity is desirable. When case-insensitive matching is necessary, there are several implementation choices that a formal language needs to consider.

    Unicode case-insensitive matching


    [S] Specifications that define case-insensitive matching in vocabularies that include more than the Basic Latin (ASCII) range of Unicode MUST specify Unicode full casefold matching.


    [S] Specifications SHOULD allow the full range of Unicode for user-defined values.


    Vocabularies generally should allow for a wide range of Unicode characters, particularly for user-supplied values, so as to enable use by the broadest range of languages and cultures without disadvantage. As a result, text operations such as case folding need to address the full range of Unicode and not just selected portions. When case-insensitive matching is desired, this means using Unicode case folding:


    The Unicode simple casefolding form is not appropriate for string identity matching on the Web.
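As a non-normative sketch, Python's str.casefold() is documented to implement Unicode full case folding, so it can illustrate the difference between plain code point comparison and full casefold matching (the function names here are illustrative, not from any specification):

```python
def matches_case_sensitive(a: str, b: str) -> bool:
    # Code point by code point comparison: predictable and locale-independent.
    return a == b

def matches_caseless(a: str, b: str) -> bool:
    # Unicode full casefolding maps "ß" to "ss"; simple casefolding would not.
    return a.casefold() == b.casefold()

assert not matches_case_sensitive("straße", "STRASSE")
assert matches_caseless("straße", "STRASSE")

# The default (untailored) folding does not honor Turkish expectations:
# dotless "ı" (U+0131) still fails to match "I".
assert not matches_caseless("ı", "I")
```

Note that the last assertion is exactly the kind of surprise discussed elsewhere in this section: full casefolding is predictable, but not language-tailored.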

    ASCII case-insensitive matching


    [S] Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.


    A formal language whose vocabulary is limited to ASCII and which does not allow user-defined names or identifiers can specify ASCII case-insensitive matching. An example of this is HTML, which defines the use of ASCII case-insensitive comparison for element and attribute names defined by the HTML specification.
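A minimal sketch of ASCII case-insensitive comparison in Python (the helper names are illustrative): only the letters A through Z are folded, and every other code point is compared exactly, so non-ASCII lookalikes do not accidentally match.

```python
# Fold only A-Z to a-z; all other code points pass through unchanged.
ASCII_FOLD = {cp: cp + 0x20 for cp in range(ord("A"), ord("Z") + 1)}

def ascii_caseless_equal(a: str, b: str) -> bool:
    return a.translate(ASCII_FOLD) == b.translate(ASCII_FOLD)

assert ascii_caseless_equal("DIV", "div")    # HTML-style tag name matching
assert not ascii_caseless_equal("É", "é")    # non-ASCII is compared exactly

# Contrast with Unicode casefolding, where U+212A KELVIN SIGN folds to "k":
assert "\u212A".casefold() == "k"
assert not ascii_caseless_equal("\u212A", "k")
```

This is why ASCII case-insensitive matching is only safe for vocabularies that are genuinely ASCII-only.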



    A vocabulary is considered to be "ASCII-only" if and only if all tokens and identifiers are defined by the specification directly and these identifiers or tokens use only the Basic Latin subset of Unicode. If user-defined identifiers are permitted, the full range of Unicode characters (limited, as appropriate, for security or interchange concerns, see [[UTR36]]) should be allowed and Unicode case insensitivity used for identity matching.


An ASCII-only vocabulary can exist inside a document format or protocol that allows a larger range of Unicode in identifiers or values. For example, [[CSS-SYNTAX-3]] defines the format of CSS style sheets in a way that allows the full range of Unicode to be used for identifiers and values. However, CSS specifications always define CSS keywords using a subset of the ASCII range. The vocabulary of CSS is thus ASCII-only, even though many style sheets contain identifiers or data values that are not ASCII.

    Language-specific tailoring


    Locale- or language-specific tailoring is most appropriate when it is part of natural language processing operations (which is beyond the scope of this document). Because language-specific tailoring of case mapping or case folding produces different results from the generic case folding rules, these should be avoided in formal languages, where predictability is at a premium.


    [S] Specifications that define case-insensitive matching in vocabularies SHOULD NOT specify language-sensitive case-insensitive matching.


[S] If language-sensitive case-insensitive matching is specified, Unicode case-fold mappings SHOULD be tailored according to language and the source of the language used for each tailoring MUST be specified.


    Two strings being matched can be in different languages and might appear in yet a third language context. Which language to use for case folding therefore depends on the application and user expectations.
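To make the tailoring question concrete, here is a hypothetical Turkish-tailored casefold in Python (the function and its mappings are illustrative only, not a normative tailoring): it maps I to dotless ı and dotted İ to i before applying the default fold.

```python
def turkish_casefold(s: str) -> str:
    # Hypothetical tailoring: apply the Turkish dotted/dotless I mappings
    # before the untailored Unicode full casefold.
    return s.replace("I", "\u0131").replace("\u0130", "i").casefold()

# "KIRMIZI" / "kırmızı" ("red" in Turkish): the untailored fold fails to
# match them, while the Turkish-tailored fold succeeds.
assert "KIRMIZI".casefold() != "kırmızı".casefold()
assert turkish_casefold("KIRMIZI") == turkish_casefold("kırmızı")
```

The two folds give different answers for the same pair of strings, which is exactly why the language used for tailoring must be specified.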


Language-specific tailoring is not recommended for formal languages because language information can be hard to obtain, verify, or manage, and because the resulting operations can produce results that frustrate users or that fail for some users and succeed for others, depending on the language configuration that they are using or the configuration of the system where the match is performed.


    [S] Operations that are language-specific SHOULD include language-specific case folding where appropriate.


For example, the CSS text-transform property is language-sensitive when used to case map strings.


    Although Unicode case folding is the preferred case-insensitive matching for document formats and protocols, content authors and users of languages that have mappings different from the default can still be surprised by the results, since their expectations are generally consistent with the languages that they speak.


    Language-sensitive string comparison is often referred to as being locale-sensitive, since most programming languages and operating environments access language-specific tailoring using their respective locale-based APIs. For example, see the java.text.Collator class in the Java programming language or Intl.Collator in JavaScript.


    Additional Match Tailoring


Some implementations might require additional tailoring to assist with matching. This might include removing certain Unicode controls or invisible markers, mapping together or removing characters that are part of the syntax, or performing a whitespace trim.


Specifications need to clearly define any additional tailoring done as part of the matching process. Care should be taken not to interfere with the encoding of different languages. For example, a process that removes all combining characters based on Unicode character classes will not support languages that rely on combining marks and will lead to user frustration. An example of this would be the various Indic scripts, which use combining marks to encode or suppress vowels.
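The hazard described above can be sketched in Python (the helper name is illustrative): a naive tailoring that strips every combining mark silently changes Devanagari text.

```python
import unicodedata

def strip_combining_marks(s: str) -> str:
    # Naive tailoring that drops every combining mark (general categories
    # Mn, Mc, Me). Shown only to illustrate why this is harmful.
    return "".join(ch for ch in s
                   if not unicodedata.category(ch).startswith("M"))

# Devanagari "कि" (ki) encodes its vowel with combining U+093F; stripping
# the mark yields "क" (ka), destroying the word's meaning.
assert strip_combining_marks("\u0915\u093F") == "\u0915"
```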


    Requirements for Resources


These requirements pertain to the authoring and creation of documents and are intended as guidelines for resource authors.


    [C] Resources SHOULD be produced, serialized, and exchanged in Unicode Normalization Form C (NFC).


In order to be processed correctly a resource must use a consistent sequence of code points to represent text. While content can be in any normalization form or may use a de-normalized (but valid) Unicode character sequence, inconsistency of representation will cause implementations to treat the differing sequences as "different". The best way to ensure consistent selection, access, extraction, processing, or display is to always use NFC.
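A short Python sketch of the problem and the NFC remedy (unicodedata is the standard library's Unicode database module):

```python
import unicodedata

composed = "\u00E9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two sequences render identically but compare as different.
assert composed != decomposed

# Normalizing both to NFC restores a consistent representation.
assert unicodedata.normalize("NFC", decomposed) == composed
```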


[I] Implementations MUST NOT normalize any resource during processing, storage, or exchange except with explicit permission from the user.


The [[!Encoding]] specification includes a number of transcoders that do not produce Unicode text in a normalized form when converting to Unicode from a legacy character encoding. This is necessary to preserve round-trip behavior and other character distinctions. Indeed, many compatibility characters in Unicode exist solely for round-trip conversion from legacy encodings. Earlier versions of this specification recommended or required that implementations use a normalizing transcoder that produced Unicode Normalization Form C (NFC), but, given that this is at odds with how transcoders are actually implemented, this version no longer includes this requirement. Bear in mind that most transcoders produce NFC output and that even those transcoders that do not produce NFC for all characters mainly produce NFC for the preponderance of characters. In particular, there are no commonly-used transcoders that produce decomposed forms where precomposed forms exist or which produce a different combining character sequence from the normalized sequence.


[C] Authors SHOULD NOT include combining marks without a preceding base character in a resource.


There can be exceptions to this. For example, when making a list of characters (such as a list of [[!Unicode]] characters), an author might want to use combining marks without a corresponding base character. However, use of a combining mark without a base character can cause unintentional display or, with naive implementations that combine the combining mark with adjacent syntactic content or other natural language content, processing problems. For example, if you were to use a combining mark, such as the character U+0301 Combining Acute Accent, as the start of a "class" attribute value in HTML, the class name might not display properly in your editor.
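An authoring tool could detect this situation with a simple check (a sketch; the function name is illustrative): combining marks carry a Unicode general category beginning with "M".

```python
import unicodedata

def starts_with_combining_mark(s: str) -> bool:
    # Combining marks have general category Mn, Mc, or Me,
    # all of which begin with "M".
    return bool(s) and unicodedata.category(s[0]).startswith("M")

assert starts_with_combining_mark("\u0301wide")   # mark with no base character
assert not starts_with_combining_mark("e\u0301")  # base character precedes mark
```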


[S] Specifications of text-based formats and protocols MAY specify that all or part of the textual content of that format or protocol is normalized using Unicode Normalization Form C (NFC).


Specifications are generally discouraged from requiring formats or protocols to store or exchange data in a normalized form unless there are specific, clear reasons why the additional requirement is necessary. As many document formats on the Web do not require normalization, content authors might occasionally rely on denormalized character sequences and a normalization step could negatively affect such content.


Requiring NFC demands additional care on the part of the specification developer, as content on the Web generally is not in a known normalization state. Boundary and error conditions for denormalized content need to be carefully considered and well specified in these cases.


    Requirements for Specifications


    This section discusses the requirements that different types of specification ought to consider. Most new specifications should use the Requirements for Non-Normalizing Specifications. Specifications that require Unicode normalization should use the Requirements for Unicode Normalizing Specifications.


    Other Matching and Processing Considerations


    While matching strings and tokens in a formal language is the primary concern of this document, sometimes a specification needs to consider additional types of matching beyond pure string equality.


    Regular Expressions

[S] Specifications that define a regular expression syntax MUST provide at least Basic Unicode Level 1 support per [[!UTS18]] and SHOULD provide Extended or Tailored (Levels 2 and 3) support.

Regular expression syntaxes are sometimes useful in defining a format or protocol, since they allow users to specify values that are only partially known or which can vary in predictable ways. As seen in the various sections of this document, there is variation in the different ways that characters can be encoded in Unicode and this potentially interferes with how strings are specified or matched in expressions. For example, counting characters might need to depend on grapheme boundaries rather than the number of Unicode code points used; caseless matching might need to consider variations in case folding; or the Unicode normalization of the expression or text being processed might need to be considered.

Unicode Regular Expressions Level 1 support includes the ability to specify Unicode code points in regular expressions, including via the use of escapes, and to access Unicode character properties as well as certain kinds of boundaries common to most regular expression syntaxes.

Level 2 extends this with a number of important capabilities, notably the ability to select text on certain kinds of grapheme cluster boundary and support for case conversion (two topics mentioned extensively above). Level 3 provides for locale [[LTLI]] based tailoring of regular expressions, which is less useful in formal languages but can be useful in processing natural language content.

Requirements for Non-Normalizing Specifications

The following requirements pertain to any specification that specifies that normalization is not to be applied automatically to content (which should include all new specifications):


[S] Specifications that do not normalize MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.


[S][I] Specifications and implementations MUST NOT assume that content is in any particular normalization form.


The normalization form or lack of normalization for any given content has to be considered intentional in these cases.


[I] Implementations MUST NOT alter the normalization form of syntactic or natural language content being exchanged, read, parsed, or processed except when required to do so as a side-effect of text transformation such as transcoding the content to a Unicode character encoding, case mapping/folding, or other user-initiated change, as consumers or the content itself might depend on the de-normalized representation.


[S] Specifications MUST specify that string matching takes the form of "code point-by-code point" comparison of the Unicode character sequence, or, if a specific Unicode character encoding is specified, code unit-by-code unit comparison of the sequences.


    Requirements for Unicode Normalizing Specifications


This section contains requirements for specifications of text-based formats and protocols that define Unicode Normalization as a requirement. New specifications SHOULD NOT require normalization unless special circumstances apply.


[S] Specifications of text-based formats and protocols that, as part of their syntax definition, require that the text be in normalized form MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC.


[S] [I] A normalizing text-processing component which receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.


[I] A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.


[S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between syntactic content and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.


[I] Authoring tool implementations for a formal language that does not mandate full-normalization SHOULD either prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after syntactic content, or SHOULD warn users when they do so.


[S] Where operations can produce denormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional.


[S] Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized.

    diff --git a/local.css b/local.css index df806f3..3c542c3 100644 --- a/local.css +++ b/local.css @@ -94,13 +94,15 @@ samp, kbd { -div.requirement { - counter-increment: requirement; +.requirement { +/* counter-increment: requirement; */ background-color:#FFC; + font-style: italic; + font-weight: bold; } -div.requirement p:before { - content: "C" counter(requirement) " \00A0"; +.requirement p:before { +/* content: "C" counter(requirement) " \00A0"; */ font-family:Tahoma, Geneva, NoToFu, sans-serif; font-weight: bold; font-size: smaller; @@ -108,6 +110,15 @@ div.requirement p:before { color: #63F; } +span.qrec { + font-family:Tahoma, Geneva, NoToFu, sans-serif; + font-weight: bold; + font-style: normal; + font-size: smaller; + text-transform: capitalize; + color: #63F; +} + SPAN.h\e9llo { text-decoration: underline; }