Converting to a Sequence of Unicode Code Points
-
- The first step in comparing text is to ensure that both use the same character encoding form. Applications or implementations need to convert any text in a legacy character encoding to a Unicode encoding [[Encoding]] or convert disparate Unicode character encodings to the one they will use for comparison purposes.
-
- A normalizing transcoder is a transcoder that performs
- a conversion from a legacy character encoding to Unicode and ensures that the result is in
- Unicode Normalization Form C (NFC). For most legacy character encodings, it
- is possible to construct a normalizing transcoder (by using any
- transcoder followed by a normalizer); it is not possible to do so if
- the legacy character encoding 's repertoire
- contains characters not represented in Unicode.
- Previous versions of this document recommended the use of a normalizing transcoder when mapping from a
- legacy character encoding to Unicode. Normalizing transcoders are expected to produce only character sequences in
- Unicode Normalization Form C (NFC), although the resulting character sequence might still be partially
- de-normalized (for example, if it begins with a combining mark).
+ [C] Content authors SHOULD enter and store resources in a Unicode character encoding (generally UTF-8 on the Web).
- It turns out that, while most transcoders used on the Web produce Normalization Form C as their output,
- several do not. The difference is important if the transcoder is to be round-trip
- compatible with the source legacy character encoding or consistent with the transcoders used by
- browsers and other user-agents on the Web. This includes several of the transcoders in [[Encoding]].
+ [C] Content authors SHOULD choose a normalizing transcoder when converting legacy encoded text or resources to Unicode unless the mapping of specific characters interferes with the meaning.
+
+ The first step in comparing text is to ensure that both use the same digital representation. This means that implementations need to convert any text in a legacy character encoding to a sequence of Unicode code points. Normally this is done by applying a transcoder to convert the data to a consistent Unicode encoding form (such as UTF-8 or UTF-16). This allows bitwise comparison of the strings in order to determine string equality.
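The conversion step described above can be sketched with a short example (the sample text and encodings are assumptions chosen for illustration):

```python
# "café" represented in two different character encodings: the raw byte
# sequences differ, so a bitwise comparison of the bytes fails.
latin1_bytes = b"caf\xe9"      # "café" in ISO-8859-1
utf8_bytes = b"caf\xc3\xa9"    # "café" in UTF-8
assert latin1_bytes != utf8_bytes

# Transcoding both to the same Unicode representation first...
s1 = latin1_bytes.decode("iso-8859-1")
s2 = utf8_bytes.decode("utf-8")

# ...allows a simple code point comparison to determine string equality.
assert s1 == s2
```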
+
+ A normalizing transcoder is a transcoder that performs a conversion from a legacy character encoding to Unicode and ensures that the result is in Unicode Normalization Form C (NFC). For most legacy character encodings, it is possible to construct a normalizing transcoder (by using any transcoder followed by a normalizer); it is not possible to do so if the legacy character encoding's repertoire contains characters not represented in Unicode. While normalizing transcoders only produce character sequences that are in NFC, the converted character sequence might still not be fully normalized (for example, if it begins with a combining mark).
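Such a normalizing transcoder can be approximated by composing an ordinary transcoder with a normalizer. A minimal sketch using Python's `unicodedata` module (the function name is invented for illustration):

```python
import unicodedata

def normalizing_transcode(data: bytes, encoding: str) -> str:
    # Any ordinary transcoder (here, bytes.decode) followed by a
    # normalization step yields output in Normalization Form C.
    return unicodedata.normalize("NFC", data.decode(encoding))

# UTF-8 bytes for "e" followed by U+0301 COMBINING ACUTE ACCENT:
decomposed = b"e\xcc\x81"
# The normalizing transcoder produces the single precomposed U+00E9.
assert normalizing_transcode(decomposed, "utf-8") == "\u00e9"
```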
+
+ Because document formats on the Web often interact with or are processed using additional, external resources (for example, a CSS style sheet being applied to an HTML document), the consistent representation of text becomes important when matching values between documents that use different character encodings. Use of a normalizing transcoder helps ensure interoperability by making legacy encoded documents match the normally expected Unicode character sequence for most languages.
+
+ Most transcoders used on the Web produce NFC as their output, but several do not. This is usually to allow the transcoder to be round-trip compatible with the source legacy character encoding, to preserve other character distinctions, or to be consistent with other transcoders in use in user-agents. This means that the Encoding specification [[!Encoding]] and various other important transcoding implementations include a number of non-normalizing transcoders. Indeed, most compatibility characters in Unicode exist solely for round-trip conversion from legacy encodings and a number of these have singleton canonical mappings in NFC. You saw an example of this earlier in the document with Å [U+212B ANGSTROM SIGN ] .
+
+ Bear in mind that most transcoders produce NFC output and that even those transcoders that do not produce NFC for all characters produce NFC for the preponderance of characters. In particular, there are no commonly-used transcoders that produce decomposed forms where precomposed forms exist or which produce a different combining character sequence from the normalized sequence (and this is true for all of the transcoders in [[!Encoding]]).
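The ANGSTROM SIGN example mentioned above can be verified with any standard normalizer; a sketch using Python's `unicodedata` module:

```python
import unicodedata

angstrom_sign = "\u212b"  # Å U+212B ANGSTROM SIGN (round-trip compatibility)
a_with_ring = "\u00c5"    # Å U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

# The code points are distinct, so a non-normalizing transcoder that emits
# U+212B produces text that fails a naive comparison with U+00C5...
assert angstrom_sign != a_with_ring
# ...but U+212B has a singleton canonical mapping to U+00C5 under NFC.
assert unicodedata.normalize("NFC", angstrom_sign) == a_with_ring
```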
-
-
[C][I] For content authors, it is RECOMMENDED that content converted from a legacy character encoding
- be normalized to Unicode Normalization Form C unless the mapping of specific characters interferes with
- the meaning.
-
-
-
[I] Authoring tools SHOULD provide a means of normalizing resources
- and warn the user when a given resource is not in Unicode
- Normalization Form C.
-
+
Expanding Character Escapes and Includes
- Most document formats and protocols provide a means for
- encoding characters or including external data, including text, into a
- resource . This is discussed in detail in Section 4.6 of [[!CHARMOD]]
- as well as above .
+ Most document formats and protocols provide a means for encoding characters as an escape sequence or including external data, including text, into a resource . This is discussed in detail in Section 4.6 of [[!CHARMOD]] as well as above .
When performing matching, it is important to know when to interpret character escapes so that
a match succeeds (or fails) appropriately. Normally, escapes, references, and includes are processed
@@ -1623,318 +1447,174 @@
Expanding Character Escapes and Includes
<p id="̀">Combining mark used as the value of 'id' attribute<p>
- Although technically the combining mark U+0300
combines with the preceding quote mark,
- HTML does not consider the character (whether or not it is encoded as an entity) to form part of the
- HTML syntax.
- When performing a matching operation on a resource, the general rule is to expand escapes on the same "level" as the user is interacting with. For example, when considering the above example, a tool to view the source of the HTML would show the escape sequence ̀
as a string of characters starting with an ampersand. A JavaScript program, by contrast, operates on the browser's interpretation of the document and would match the character ̀ [U+0300 COMBINING GRAVE ACCENT ] as the value of the attribute id
.
- When processing the syntax of a document format, escapes should be
- converted to the character sequence they represent before the
- processing of the syntax, unless explicitly forbidden by the format's
- processing rules. This allows resources to include characters of all
- types into the resource's syntactic structures.
- In some cases, pre-processing escapes creates problems.
- For example, expanding the sequence <
before parsing an HTML
- document would produce document errors.
+ Although technically the combining mark ̀ [U+0300 COMBINING GRAVE ACCENT ] combines with the preceding quote mark, HTML does not consider the character (whether or not it is encoded as an entity) to form part of the HTML syntax.
+
+ When performing a matching operation on a resource, the general rule is to expand escapes on the same "level" as the user is interacting with. For example, when considering the above example, a tool to view the source of the HTML would show the escape sequence ̀
as a string of characters starting with an ampersand. A JavaScript program, by contrast, operates on the browser's interpretation of the document and would match the character U+0300
as the value of the attribute id
.
+
+ When processing the syntax of a document format, escapes are usually converted to the character sequence they represent before the processing of the syntax, except where explicitly forbidden by the format's processing rules. This allows resources to include characters of all types into the resource's syntactic structures.
+
+ In some cases, pre-processing escapes creates problems. For example, expanding the sequence <
before parsing an HTML document would produce document errors.
Choice of Normalization Form
- Specifications SHOULD avoid specifying Unicode normalization.
- Implementations SHOULD NOT apply Unicode normalization unless the user requests it or it is required by a specification. .
- Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages.
- Content authors SHOULD always encode text using consistent Unicode character sequences to facilitate matching, even if a Unicode normalization form is included in the matching performed by the format or implementation.
- Note that NFC is not always appropriate or available to content authors. The encoding choices of end users might not be obvious to downstream consumers of the data and normalization can remove distinctions that the users applied intentionally. Given that there are many different ways that content authors or applications could choose to represent the same semantic values when inputting or exchanging text, if a specification needs to choose a normalization form, be aware of the following considerations:
+ A specific Unicode normalization form is not always appropriate or available to content authors and the text encoding choices of users might not be obvious to downstream consumers of the data. As shown in this document, there are many different ways that content authors or applications could choose to represent the same semantic values when inputting or exchanging text. Normalization can remove distinctions that the users applied intentionally. Therefore:
+
+
+ [S] Specifications SHOULD NOT specify the Unicode normalization in string matching for vocabularies.
+
+ [I] Implementations MUST NOT alter the normalization form of syntactic or natural language content being exchanged, read, parsed, or processed except when required to do so as a side-effect of text transformation such as transcoding the content to a Unicode character encoding, case mapping or folding, or other user-initiated change, as consumers or the content itself might depend on the de-normalized representation.
+
+ [I] Authoring tools SHOULD provide a means of normalizing resources and warn the user when a given resource is not in Unicode Normalization Form C.
+
+ [S] Specifications of text-based formats and protocols that as part of their syntax definition require the text be in a normalized form MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC. Such a specification needs to address the requirements in .
+
+ Specifications are generally discouraged from requiring formats or protocols to store or exchange data in a normalized form unless there are specific, clear reasons why the additional requirement is necessary. As many document formats on the Web do not require normalization, content authors might occasionally rely on denormalized character sequences. A normalization step could negatively affect such content.
+
+ The canonical normalization forms (form NFC or form NFD) are intended to preserve the meaning and presentation of the text to which they are applied. This is not always the case, which is one reason why normalization is not recommended. NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding), as well as data created by current software or entered by users on most (but not all) keyboards, is already in this form. NFC also has a slight compactness advantage and is a better match to user expectations in most languages with respect to the relationship between characters and graphemes.
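The compactness difference and the lossless round trip between the canonical forms can be sketched as follows (the sample string is an assumption for the example):

```python
import unicodedata

s = "caf\u00e9"                         # "café" with precomposed é (NFC)
nfd = unicodedata.normalize("NFD", s)   # decomposes é into e + U+0301

assert len(s) == 4 and len(nfd) == 5    # NFC is slightly more compact
assert unicodedata.normalize("NFC", nfd) == s   # canonical round trip
```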
+
+ [S] Specifications SHOULD NOT specify compatibility normalization forms (NFKC, NFKD).
+
+ [I] Implementations MUST NOT apply compatibility normalization forms (NFKC, NFKD) unless specifically requested by the end user.
+
+ The compatibility normalization forms (form NFKC and form NFKD) change the structure and lose the meaning of the text in important ways. Users sometimes use characters with a compatibility mapping in Unicode on purpose or they use characters in a legacy character encoding that have a compatibility mapping when converted to Unicode. This has to be considered intentional on the part of the content author. Although NFKC/NFKD can sometimes be useful in "find" operations or string searching natural language content, erasing compatibility differences is harmful.
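A few examples of the distinctions that compatibility normalization erases (a sketch; the characters are chosen for illustration):

```python
import unicodedata

# NFKC folds away compatibility distinctions that authors may have
# used intentionally:
assert unicodedata.normalize("NFKC", "\u00b2") == "2"    # ² SUPERSCRIPT TWO
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"   # ﬁ LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFKC", "\u2460") == "1"    # ① CIRCLED DIGIT ONE
```

This looseness can help a "find" operation match more text, but as a storage or interchange form it destroys information.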
+
+ Requiring NFC requires additional care on the part of the specification developer, as content on the Web generally is not in a known normalization state. Boundary and error conditions for denormalized content need to be carefully considered and well-specified in these cases.
+
+ [S] Specifications MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.
+
+ [C] Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages.
+
+ [C] Content authors SHOULD always encode text using consistent Unicode character sequences to facilitate matching, even if a Unicode normalization form is included in the matching performed by the format or implementation.
+
+ In order for their content to be processed consistently, content authors should try to use a consistent sequence of code points to represent the same text. While content can be in any normalization form or might use a de-normalized (but valid) Unicode character sequence, inconsistency of representation will cause implementations to treat the different sequences as different. The best way to ensure consistent selection, access, extraction, processing, or display is to always use NFC.
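For example, the same text entered with inconsistent code point sequences fails a naive comparison, while a consistent (here, NFC) representation matches:

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # é as e + U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed   # implementations treat these as different
assert unicodedata.normalize("NFC", composed) == \
       unicodedata.normalize("NFC", decomposed)
```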
+
+ [C] Content authors SHOULD NOT include combining marks without a preceding base character in a resource.
+
+ There can be exceptions to this. For example, when making a list of characters (such as a list of [[!Unicode]] characters), an author might want to use combining marks without a corresponding base character. However, use of a combining mark without a base character can cause unintentional display or, with naive implementations that combine the combining mark with adjacent syntactic content or other natural language content, processing problems. For example, if you were to use a combining mark, such as the character ́ [U+0301 COMBINING ACUTE ACCENT ] , as the start of a class
attribute value in HTML, the class name might not display properly in your editor and be difficult to edit.
+
+ Some recommended base characters include ◌ [U+25CC DOTTED CIRCLE ] (when the base character needs to be visible) or [U+00A0 NO-BREAK SPACE ] (when the base character should be invisible).
+
+ Since content authors do not always follow these guidelines:
+
+ [S] Specifications of vocabularies MUST define the boundaries between syntactic content and character data as well as entity boundaries (if the language has any include mechanism). These need to include any boundary that may create conflicts when processing or matching content when instances of the language are processed, while allowing for character escapes designed to express arbitrary characters.
+
+
+
+ Considerations When Requiring Normalization
+
+ When a specification requires Unicode normalization for storage, transmission, or string matching, some additional considerations need to be addressed by the specification authors as well as by implementers of that specification:
+
+ [S] Where operations can produce denormalized output from normalized text input, specifications MUST define whether the resulting output is required to be normalized or not. Specifications MAY state that performing normalization is optional for some operations; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off.
+
+ [S] Specifications that require normalization MUST NOT make the implementation of normalization optional. Interoperability of matching cannot be achieved if some implementations normalize while others do not.
+
+ An implementation that is required to perform normalization needs to consider these requirements:
+
+ [I] Normalization-sensitive operations MUST NOT be performed unless the implementation has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
+
+ [I] A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.
+
+ [I] Authoring tool implementations SHOULD warn users or prevent the input or creation of syntactic content starting with a combining mark that could interfere with processing, display, or interchange.
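The hazard that the re-normalization requirement guards against can be sketched as follows (Python 3.8+, for `unicodedata.is_normalized`; the sample strings are assumptions):

```python
import unicodedata

# Two strings, each individually in Normalization Form C...
prefix = "cafe"
suffix = "\u0301s"   # begins with U+0301 COMBINING ACUTE ACCENT
assert unicodedata.is_normalized("NFC", prefix)
assert unicodedata.is_normalized("NFC", suffix)

# ...whose concatenation is not: the combining mark now follows "e".
joined = prefix + suffix
assert not unicodedata.is_normalized("NFC", joined)

# A normalizing component must behave as if it re-normalized here.
assert unicodedata.is_normalized("NFC", unicodedata.normalize("NFC", joined))
```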
+
- The canonical normalization forms (form NFC or form NFD) are intended to preserve the meaning and presentation of the text to which they are applied. This is not always the case, which is one reason why normalization is not recommended. NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding), as well as data created by current software or entered by users on most (but not all) keyboards, is already in this form. NFC also has a slight compactness advantage and is a better match to user expectations with respect to the character vs. grapheme issue. For storage or interchange, if normalization is to be applied, form NFC is RECOMMENDED.
- The compatibility normalization forms (form NFKC and form NFKD) change the structure and lose the meaning of the text in important ways. These normalization forms do produce more promiscuous matching, which is usually undesirable in a string matching context, but can be useful in "find" operations or string searching. The NFKD and NFKC normalization forms SHOULD NOT be used for storage or interchange of text. String matching applications or specifications SHOULD avoid specifying these normalization forms unless there is a compelling reason.
Choice of Case Folding
+
One important consideration in string identity matching is whether the comparison is case sensitive or case insensitive.
- Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used.
-
-
-
[C] Content authors SHOULD always spell identifiers using consistent upper, lower, and mixed case formatting to facilitate matching, even if case-insensitive matching is supported by the format or implementation.
-
+ [C] Content authors SHOULD always spell identifiers using consistent upper, lower, and mixed case formatting to facilitate matching, even if case-insensitive matching is supported by the format or implementation.
Case-sensitive matching
-
-
[S] Case-sensitive matching is RECOMMENDED for new protocols and formats.
-
- Case-sensitive matching is the easiest to implement and introduces
- the least potential for confusion, since it generally consists of a
- comparison of the underlying Unicode code point sequence. Because it
- is not affected by considerations such as language-specific case
- mappings, it produces the least surprise for document authors that
- have included words, such as the Turkish examples above, in their
- syntactic content.
- However, cases exist in which case-insensitivity is desirable. Where case-insensitive matching is desired, there are several
- implementation choices that a formal language needs to consider.
+
+ [S] Case-sensitive matching is RECOMMENDED for matching syntactic content, including user-defined values.
+
+ Vocabularies usually put a premium on predictability for content authors and users. Case-sensitive matching is the easiest to implement and introduces the least potential for confusion, since it generally consists of a comparison of the underlying Unicode code point sequence. Because it is not affected by considerations such as language-specific case mappings, it produces the least surprise for document authors that have included words, such as the Turkish examples above, in their syntactic content.
+
+ Case insensitivity is usually reserved for processing natural language content , such as providing a text search feature. However, cases exist in which case-insensitivity is desirable. When case-insensitive matching is necessary, there are several implementation choices that a formal language needs to consider.
Unicode case-insensitive matching
- Vocabularies generally should allow for a wide range of Unicode characters, particularly for user-defined values, so as to enable use by the broadest range of languages and cultures without disadvantage. As a result, text operations such as case folding need to address the full range of Unicode and not just selected portions. When case-insensitive matching is desired, this means using Unicode case folding :
-
-
- The Unicode simple casefolding form is not appropriate for string identity matching on the Web.
+ [S] Specifications that define case-insensitive matching in vocabularies that include more than the Basic Latin (ASCII) range of Unicode MUST specify Unicode full casefold matching.
+
+ [S] Specifications SHOULD allow the full range of Unicode for user-defined values.
+
+ Vocabularies generally should allow for a wide range of Unicode characters, particularly for user-supplied values , so as to enable use by the broadest range of languages and cultures without disadvantage. As a result, text operations such as case folding need to address the full range of Unicode and not just selected portions. When case-insensitive matching is desired, this means using Unicode case folding:
+ The Unicode simple casefolding form is not appropriate for string identity matching on the Web.
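As an illustrative sketch, Python's `str.casefold()` performs Unicode full case folding (note that fully canonical caseless matching as defined by Unicode also interleaves normalization, which is omitted here for brevity):

```python
def caseless_equal(a: str, b: str) -> bool:
    # Unicode full casefold matching: str.casefold() applies the full
    # (length-changing) fold, unlike simple folding or ASCII lower().
    return a.casefold() == b.casefold()

assert caseless_equal("Straße", "STRASSE")   # ß full-folds to "ss"
assert "Straße".lower() != "strasse"         # simple lowercasing fails here
assert caseless_equal("ΣΊΣΥΦΟΣ", "Σίσυφος")  # both Greek sigmas fold to σ
```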
ASCII case-insensitive matching
-
+
+ [S] Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.
+
A formal language whose vocabulary is limited to ASCII and which does not allow user-defined names or identifiers can specify ASCII case-insensitive matching. An example of this is HTML, which defines the use of ASCII case-insensitive comparison for element and attribute names defined by the HTML specification.
-
-
[S] For a vocabulary limited to the Basic Latin (ASCII) subset of Unicode, ASCII case-insensitive matching MAY be specified.
-
-
- A vocabulary is considered to be "ASCII-only" if and only if all
- tokens and identifiers are defined by the specification directly and
- these identifiers or tokens use only the Basic Latin subset of
- Unicode. If user-defined identifiers are permitted, the full range of
- Unicode characters (limited, as appropriate, for security or
- interchange concerns, see [[UTR36]]) should be allowed and Unicode
- case insensitivity used for identity matching.
- Note that an ASCII-only vocabulary can exist inside a document format
- or protocol that allows a larger range of Unicode in identifiers or
- values.
- For example [[CSS-SYNTAX-3]] defines the format of CSS
- style sheets in a way that allows the full range of Unicode to be used
- for identifiers and values. However, CSS specifications always define
- CSS keywords using a subset of the ASCII range. The vocabulary of CSS is
- thus ASCII-only, even though many style sheets contain identifiers or
- data values that are not ASCII.
+ A vocabulary is considered to be "ASCII-only" if and only if all tokens and identifiers are defined by the specification directly and these identifiers or tokens use only the Basic Latin subset of Unicode. If user-defined identifiers are permitted, the full range of Unicode characters (limited, as appropriate, for security or interchange concerns, see [[UTR36]]) should be allowed and Unicode case insensitivity used for identity matching.
+
+ An ASCII-only vocabulary can exist inside a document format or protocol that allows a larger range of Unicode in identifiers or values. For example [[CSS-SYNTAX-3]] defines the format of CSS style sheets in a way that allows the full range of Unicode to be used for identifiers and values. However, CSS specifications always define CSS keywords using a subset of the ASCII range. The vocabulary of CSS is thus ASCII-only, even though many style sheets contain identifiers or data values that are not ASCII.
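A minimal sketch of ASCII case-insensitive matching (the helper function is hypothetical, not taken from any parser implementation):

```python
def ascii_ci_equal(a: str, b: str) -> bool:
    # Fold only A-Z to a-z; every other code point must match exactly.
    def fold(s: str) -> str:
        return "".join(
            chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s
        )
    return fold(a) == fold(b)

assert ascii_ci_equal("DIV", "div")      # ASCII letters fold
assert not ascii_ci_equal("\u0130", "i") # non-ASCII (İ) is left untouched
```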
Language-specific tailoring
- Locale- or language-specific tailoring is most appropriate when it is part of natural language processing operations. Because language-specific tailoring of case mapping or case folding produces different results from the generic case folding rules, these should be avoided in formal languages, where predictability is at a premium.
-
-
[S][I] Locale- or language-specific tailoring is NOT RECOMMENDED for specifications and implementations that define string
- matching as part of the definition of a format, protocol, or formal language.
-
-
- Language-sensitive string comparison is often referred to as being locale-sensitive , since most programming
- languages and operating environments access language-specific tailoring
- using their respective locale-based APIs. For example, see the java.text.Collator
class
- in the Java programming language or Intl.Collator
in JavaScript.
-
-
-
Language-sensitive case-insensitive matching in document formats and protocols is NOT RECOMMENDED.
-
- This is because language information can be hard to obtain, verify, or manage and because the resulting operations can produce results that frustrate users or which fail for some users and succeed for others depending on the language configuration that they are using. Operations that are themselves language-specific can include language-specific case folding where appropriate.
- Although Unicode case folding is the preferred case-insensitive matching for document formats and protocols, content authors and users can be surprised by the results, since their expectations are generally consistent with the languages that they speak.
+
+ Locale- or language-specific tailoring is most appropriate when it is part of natural language processing operations (which is beyond the scope of this document). Because language-specific tailoring of case mapping or case folding produces different results from the generic case folding rules, these should be avoided in formal languages, where predictability is at a premium.
+
+ [S] Specifications that define case-insensitive matching in vocabularies SHOULD NOT specify language-sensitive case-insensitive matching.
+
+ [S] If language-sensitive case-insensitive matching is specified, Unicode case-fold mappings SHOULD be tailored according to language and the source of the language used for each tailoring MUST be specified.
+
+ Two strings being matched can be in different languages and might appear in yet a third language context. Which language to use for case folding therefore depends on the application and user expectations.
+
+ Language-specific tailoring is not recommended for formal languages because the language information can be hard to obtain, verify, or manage and because the resulting operations can produce results that frustrate users or which fail for some users and succeed for others depending on the language configuration that they are using or the configuration of the system where the match is performed.
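A sketch of the surprise that untailored folding can produce; Python's `casefold()` applies the default (untailored) Unicode mappings:

```python
# Default Unicode case folding pairs I/i, which matches English
# expectations but not Turkish ones (where the pairs are I/ı and İ/i).
assert "MAIL".casefold() == "mail"

# U+0130 (İ) full-casefolds to "i" + U+0307 COMBINING DOT ABOVE, so a
# caseless search for "liste" fails to match Turkish "LİSTE":
assert "L\u0130STE".casefold() != "liste"
assert "L\u0130STE".casefold() == "li\u0307ste"
```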
+
+ [S] Operations that are language-specific SHOULD include language-specific case folding where appropriate.
+
+ For example, the CSS operation text-transform
is language-sensitive when used to case map strings.
+
+ Although Unicode case folding is the preferred case-insensitive matching for document formats and protocols, content authors and users of languages that have mappings different from the default can still be surprised by the results, since their expectations are generally consistent with the languages that they speak.
+
+ Language-sensitive string comparison is often referred to as being locale-sensitive, since most programming languages and operating environments access language-specific tailoring using their respective locale-based APIs. For example, see the java.text.Collator
class in the Java programming language or Intl.Collator
in JavaScript.
+