
<!DOCTYPE html>
<html dir="ltr" lang="en">
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="content-type">
    <title>Character Model for the World Wide Web: String Matching and Searching</title>
    <link rel="canonical" href="http://www.w3.org/TR/2015/WD-charmod-norm-20151119/"/>
    <!-- local styles. Includes the styles from http://www.w3.org/International/docs/styleguide -->
    <link rel="stylesheet" href="local.css" type="text/css">
	<script src="http://pandora.aptest.com/w3c/respec-style/builds/respec-w3c-common.js" class="remove"></script>
	<!--script src="http://www.w3.org/Tools/respec/respec-w3c-common" async class="remove"></script-->
    <script class="remove">
      var respecConfig = {
          useExperimentalStyles: true,
          // specification status (e.g. WD, LCWD, NOTE, etc.). If in doubt use ED.
          specStatus:				"ED",
          //publishDate:  			"2015-XX-XX",
          //previousPublishDate:  	"2015-11-19",
          //previousMaturity:  		"WD",

          noRecTrack:           true,
          shortName:            "charmod-norm",
          copyrightStart: 		"2004",
          edDraftURI:   		"http://w3c.github.io/charmod-norm/",

          // lcEnd: "2009-08-05",

          // editors, add as many as you like
          // only "name" is required
          editors:  [
              { name: "Addison Phillips", 
                company: "Invited Expert" },
          ],

          // authors, add as many as you like. 
          //authors:  [
          //    { name: "Your Name", url: "http://example.org/",
          //      company: "Your Company", companyURL: "http://example.com/" },
          //],
          
          // name of the WG
          wg:           "Internationalization Working Group",
          wgURI:        "http://www.w3.org/International/core/",
          wgPublicList: "www-international",
          
		  bugTracker: { new: "https://github.com/w3c/charmod-norm/issues", open: "https://github.com/w3c/charmod-norm/issues" } ,
		otherLinks: [
			{
			key: "Github",
			data: [
				{
			  	value: "repository",
			  	href: "https://github.com/w3c/charmod-norm/"
		 		}
				]
			}
			],

          // URI of the patent status for this WG, for Rec-track documents
          // !!!! IMPORTANT !!!!
          // This is important for Rec-track documents, do not copy a patent URI from a random
          // document unless you know what you're doing. If in doubt ask your friendly neighbourhood
          // Team Contact.
          wgPatentURI:  "http://www.w3.org/2004/01/pp-impl/32113/status",
		  

		  localBiblio: {
		"UTS18": {
		    title: "Unicode Technical Standard #18: Unicode Regular Expressions",
			href: "http://unicode.org/reports/tr18/",
			authors: [ "Mark Davis", "Andy Heninger" ]
		},
		
		"Encoding": {
			title: "Encoding",
			href: "http://www.w3.org/TR/encoding/",
			authors: [ "Anne van Kesteren", "Joshua Bell", "Addison Phillips" ]
		},
		
		"UTS10": {
			title: "Unicode Technical Standard #10: Unicode Collation Algorithm",
			href: "http://www.unicode.org/reports/tr10/",
			authors: [ "Mark Davis", "Ken Whistler", "Markus Scherer" ]
		},
		
		"UAX11": {
		    title: "Unicode Standard Annex #11: East Asian Width",
		    href: "http://www.unicode.org/reports/tr11/",
		    authors: [ "Ken Lunde 小林劍" ]
		},
		
		"UTR29": {
			title: "Unicode Text Segmentation",
			href: "http://www.unicode.org/reports/tr29/",
			authors: [ "Mark Davis" ]
		},
		
		"UTR36": {
			title: "Unicode Technical Report #36: Unicode Security Considerations",
			href: "http://www.unicode.org/reports/tr36/",
			authors: [ "Mark Davis", "Michel Suignard" ]
		},
		
		"UTR50": {
		    title: "Unicode Technical Report #50: Unicode Vertical Text Layout",
		    href: "http://www.unicode.org/reports/tr50/",
		    authors: [ "Koji Ishii 石井宏治" ]
		},
		
		"Nicol": {
			title: "The Multilingual World Wide Web, Chapter 2: The WWW As A Multilingual Application",
			href: "http://www.mind-to-mind.com/i18n/multilingual-www.html",
			authors: [ "Gavin Nicol" ]
		}
		
	}
		  
      };
	  

</script> </head>
  <body>
    <section id="abstract">
      <p>This document builds upon <cite>Character Model for the World Wide
          Web 1.0: Fundamentals </cite>[[!CHARMOD]] to provide authors of
        specifications, software developers, and content developers a common
        reference on string identity matching on the World Wide Web and thereby
        increase interoperability. </p>
    </section>
    <section id="sotd">
      <div class="note">
        <p>This version of the document represents a significant change from the
          <a href="http://www.w3.org/TR/2012/WD-charmod-norm-20120501/">earlier
            editions</a>. Much of the content is changed and the recommendations
          are significantly altered. This fact is reflected in a change to the
          name of the document from "Character Model: Normalization".</p>
      </div>
      <div class="note">
        <p data-lang="en" style="font-weight: bold; font-size: 120%">Sending
          comments on this document</p>
        <p data-lang="en">If you wish to make comments regarding this document,
          please raise them as <a href="https://github.com/w3c/charmod-norm/issues"

            style="font-size: 120%;">github issues</a> against the <a href="http://www.w3.org/TR/2015/WD-charmod-norm-20151119/"

            style="font-size: 120%">latest dated version in /TR</a>. Only send
          comments by email if you are unable to raise issues on github (see
          links below). All comments are welcome.</p>
        <p data-lang="en">To make it easier to track comments, please raise
          separate issues or emails for each comment, and point to the section
          you are commenting on using a URL for the dated version of the
          document.</p>
      </div>
    </section>
    <section id="intro">
      <h2>Introduction</h2>
      <section id="goals">
        <h3>Goals and Scope</h3>
        <p>The goal of the Character Model for the World Wide Web is to
          facilitate use of the Web by all people, regardless of their language,
          script, writing system, and cultural conventions, in accordance with
          the <a href="http://www.w3.org/Consortium/mission"><cite>W3C goal of
              universal access</cite></a>. One basic prerequisite to achieve
          this goal is to be able to transmit and process the characters used
          around the world in a well-defined and well-understood way.</p>
        <p class="note">This document builds on <cite>Character Model for the
            World Wide Web: Fundamentals</cite> [[!CHARMOD]]. Understanding the
          concepts in that document is important to being able to understand
          and apply this document successfully.</p>
        <p>This part of the Character Model for the World Wide Web covers string
          matching—the process by which a specification or implementation
          defines whether two string values are the same or different from one
          another. It describes the ways in which texts that are semantically
          equivalent can be encoded differently and the impact this has on
          matching operations important to formal languages (such as those used
          in the formats and protocols that make up the Web). Finally, it
          discusses the problem of substring searching within documents.</p>
        <p>The main target audience of this specification is W3C specification
          developers. This specification and parts of it can be referenced from
          other W3C specifications and it defines conformance criteria for W3C
          specifications, as well as other specifications.</p>
        <p>Other audiences of this specification include software developers,
          content developers, and authors of specifications outside the W3C.
          Software developers and content developers implement and use W3C
          specifications. This specification defines some conformance criteria
          for implementations (software) and content that implement and use W3C
          specifications. It also helps software developers and content
          developers to understand the character-related provisions in W3C
          specifications.</p>
        <p>The character model described in this specification provides authors
          of specifications, software developers, and content developers with a
          common reference for consistent, interoperable text manipulation on
          the World Wide Web. Working together, these three groups can build a
          globally accessible Web.</p>
      </section>
      <section id="structure">
        <h3>Structure of this Document</h3>
        <p>This document defines two basic building blocks for the Web related
          to this problem. First, it defines rules and processes for String
          Identity Matching in document formats. These rules are designed for
          the identifiers and structural markup (<a href="#def_syntactic_content" class="termref">syntactic content</a>)
          used in document formats, so that each is processed consistently,
          and are targeted at specification writers. Second, it defines broader
          guidelines for handling user-visible text, such as the natural language
          text that forms most of the <strong>content</strong> of the Web. These
          guidelines are targeted at implementers.</p>
        <p>This document is divided into three main sections.</p>
        <p>The <a href="#problemStatement">first section</a> lays out the
          problems involved in string matching, describes the effects of Unicode
          and case folding on these problems, and outlines the normalization
          mechanisms that might be used to address them.</p>
        <p>The <a href="#identityMatching">second section</a> provides
          requirements and recommendations for string identity matching for use
          in <span class="qterm">formal languages</span>, such as many of the
          document formats defined in W3C Specifications. This section is
          primarily concerned with making the Web functional and providing document
          authors with consistent results. </p>
        <p>The <a href="#searching">third section</a> discusses considerations
          for the handling of content by implementations, such as browsers or
          text editors on the Web. This section is mainly concerned with how and
          why to preserve the author's original sequences and how to search or find
          content in natural language text. </p>
      </section>
      <section id="background">
        <h3>Background</h3>
        <p>This section provides some historical background on the topics
          addressed in this specification.</p>
        <p>At the core of the character model is the Universal Character Set
          (UCS), defined jointly by the <cite>Unicode Standard</cite>
          [[!Unicode]] and ISO/IEC 10646 [[!ISO10646]]. In this document, <dfn>Unicode</dfn>
          is used as a synonym for the Universal Character Set. A successful
          character model allows Web documents authored in the world's writing
          systems, scripts, and languages (and on different platforms) to be
          exchanged, read, and searched by the Web's users around the world.</p>
        <p>The first few chapters of the <cite>Unicode Standard</cite>
          [[!Unicode]] provide useful background reading.</p>
        <p>For information about the requirements that informed the development
          of important parts of this specification, see <cite>Requirements for
            String Identity Matching and String Indexing</cite> [[CHARREQ]].</p>
      </section>
      <section id="terminology">
        <h3>Terminology and Notation</h3>
        <p>This section contains terminology and notation specific to this
          document.</p>
        <p>The Web is built on text-based formats and protocols. In order to
          describe string matching or searching effectively, it is necessary to
          establish terminology that allows us to talk about the different kinds
          of text within a given format or protocol, as the requirements and
          details vary significantly. </p>
        <p>Unicode code points are denoted as <code class="kw" translate="no">U+hhhh</code>,
          where <code class="kw" translate="no">hhhh</code> is a sequence of at
          least four, and at most six hexadecimal digits. For example, the
          character <span class="qchar">€</span> <span class="uname" translate="no">EURO
            SIGN</span> has the code point <span class="uname" translate="no">U+20AC</span>.</p>
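        <aside class="example">
          <p>As an illustration of this notation only, the following Python
            sketch formats the code point of a single character in the
            <code class="kw" translate="no">U+hhhh</code> form. The helper name
            <code>code_point_label</code> is an assumption made for this example
            and is not defined by any specification.</p>

```python
def code_point_label(ch: str) -> str:
    """Return the U+hhhh label for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # :04X pads to at least four uppercase hexadecimal digits; code points
    # above U+FFFF naturally use five or six digits.
    return f"U+{ord(ch):04X}"


print(code_point_label("\u20ac"))  # EURO SIGN
```

        </aside>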
        <p>Some characters that are used in the various examples might not
          appear as intended unless you have the appropriate font. Care has been
          taken to ensure that the examples nevertheless remain understandable.</p>
        <p>A <dfn data-lt="legacy character encoding|legacy character encodings">legacy
            character encoding</dfn> is a character encoding not based on the
          Unicode character set.</p>
        <p>A <dfn data-lt="transcoder|transcoders">transcoder</dfn> is a process that converts code units (generally bytes) from a <a>legacy character encoding</a> 
        to a <a href="http://www.w3.org/TR/2005/REC-charmod-20050215/#Unicode_Encoding_Form">Unicode encoding form</a>.</p>
        <p><dfn id="def_syntactic_content">Syntactic content</dfn> is any text in a document format or
          protocol that belongs to the structure of the format or protocol. This
          definition can include values that are not typically thought of as
          "markup", such as the name of a field in an HTTP header, as well as
          all of the characters that form the structure of a format or protocol.
          For example, <span class="qchar">&lt;</span> or <span class="qchar">&gt;</span>
          are part of the syntactic content in an HTML document. </p>
        <p>Syntactic content usually is defined by a specification or specifications and
          includes both the defined, reserved keywords for the given protocol or
          format as well as string tokens and identifiers that are defined by
          document authors to form the structure of the document (rather than
        the "content" of the document).</p>
        <aside class="example">
          <p><cite>XML</cite> [[XML10]] defines specific elements, attributes,
            and values that are reserved across all XML documents. Thus, the
            word <code class="kw" translate="no">encoding</code> has a defined
            meaning inside the XML document declaration: it is a reserved name.
            XML also allows a user to define elements and attributes for a given
            document using a DTD. In a document that uses a DTD that defines an
            element called <code class="kw">&lt;muffin&gt;</code>, <span class="qterm">muffin</span>
          is a part of the syntactic content.</p>
        </aside>
        <p><dfn>Natural language content</dfn> refers to the language-bearing
          content in a document and <b>not</b> to any of the surrounding syntactic content
          or identifiers that form part of the document structure. You can think
          of it as the actual "content" of the document or the "message" in a
          given protocol. Note that the natural language content can include
          items such as the document title as well as prose content within the
          document.</p>
        <p>A <dfn data-lt="resource|resources">resource</dfn> is a given
          document, file, or protocol "message" which includes both the <a>natural
            language content</a> as well as the <a href="#def_syntactic_content" class="termref">syntactic content</a>
          such as identifiers surrounding or containing it. For example, in an
          HTML document that also has some CSS and a few <code class="kw" translate="no">script</code>
          tags with embedded JavaScript, the entire HTML document, considered as
          a file, is the resource.</p>
        <p>A <dfn id="def_vocabulary">vocabulary</dfn> provides the list of
          reserved names as well as the set of rules and specifications
          controlling how user values (such as identifiers) can be assigned in a
          format or protocol. This can include restrictions on range, order, or
          type of characters that can appear in different places. For example,
          HTML defines the names of its elements and attributes, as well as
          enumerated attribute values, which defines the "vocabulary" of HTML
          <a href="#def_syntactic_content" class="termref">syntactic content</a>. ECMAScript restricts the range of characters that can appear
          at the start or in the body of an identifier or variable name (while
          different rules apply to the values of, say, string literals).</p>
        <p>A <dfn data-lt="grapheme|graphemes">grapheme</dfn> is a sequence of
          one or more Unicode characters in a visual representation of some text
          that a typical user would perceive as being a single unit (<q>character</q>).
          Graphemes are important for a number of text operations such as
          sorting or text selection, so it is necessary to be able to compute
          the boundaries between each user-perceived character. Unicode defines
          the default mechanism for computing graphemes in <cite>Unicode
            Standard Annex #29: Text Segmentation</cite> [[!UTR29]] and calls
          this approximation a <dfn>grapheme cluster</dfn>. There are two types
          of default grapheme cluster defined. Unless otherwise noted, grapheme
          cluster in this document refers to an extended default grapheme
          cluster. (A discussion of grapheme clusters is also given at the end
          of Section 2.10 of the <cite>Unicode Standard</cite>, [[!Unicode]].)</p>
        <p>Because different natural languages have different needs, grapheme clusters
          can also sometimes require tailoring. For example, a Slovak user might
          wish to treat the default pair of grapheme clusters "ch" as a single
          grapheme cluster. Note that the interaction between the language of
          string content and the end-user's preferences might be complex.</p>
        <aside class="example">
          <p>The Hindi word for Unicode <q>यूनिकोड</q> is composed of a
            sequence of seven Unicode characters from the Devanagari script (<span

              class="uname" translate="no">U+092F U+0942 U+0928 U+093F U+0915
              U+094B U+0921</span>). However, most users would identify this
            word as containing four units of text—यू, नि, को, and ड. Each of the
            first three graphemes consists of two characters: a syllable and a
            modifying vowel character. So the word contains seven Unicode
            characters, but only four graphemes.</p>
        </aside>
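        <aside class="example">
          <p>The difference between code points and user-perceived units in the
            word above can be sketched in Python. The function
            <code>approximate_clusters</code> is a hypothetical helper that
            simply attaches combining marks to the preceding character. It is a
            deliberate simplification for illustration and does not implement
            the extended grapheme cluster rules of [[!UTR29]], so it should not
            be relied on for general text.</p>

```python
import unicodedata


def approximate_clusters(text: str) -> list[str]:
    """Group each character with the combining marks that follow it.

    This is a simplification for illustration, not the UAX #29 algorithm.
    """
    clusters: list[str] = []
    for ch in text:
        # General categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The Hindi word from the example above, written as escapes.
word = "\u092f\u0942\u0928\u093f\u0915\u094b\u0921"
print(len(word))                        # number of code points
print(len(approximate_clusters(word)))  # number of approximate clusters
```

        </aside>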
        <section>
        <h5>Terminology Examples</h5>
        <p>This section illustrates some of the terminology defined above.</p>
          <div style="background-color:white;text-align: left; border-style: solid; border-width:3px; padding-left: 50px; padding-right: 50px; padding-top: 10px; width: 80%">
            <p> <span class="markup">&lt;<span class="vocabulary">html</span> <span class="vocabulary">lang</span>="en" <span class="vocabulary">dir</span>="<span class="vocabulary">ltr</span>"&gt;<br>
                &lt;<span class="vocabulary">head</span>&gt;</span></p>
            <p><span class="markup">&nbsp; &lt;<span class="vocabulary">meta</span> <span class="vocabulary">charset</span>="UTF-8"&gt;<br>
                &nbsp;&nbsp;&lt;<span class="vocabulary">title</span>&gt;</span><span class="shakespeare">Shakespeare</span><span

                class="markup">&lt;/<span class="vocabulary">title</span>&gt;<br>
                &lt;/<span class="vocabulary">head</span>&gt;<br>
                &lt;<span class="vocabulary">body</span>&gt;<br>
                &nbsp;&nbsp;&lt;<span class="vocabulary">img</span> <span class="vocabulary">src</span>="<span class="userValue">shakespeare.jpg</span>"
                <span class="vocabulary">alt</span>="<span class="userValue"><span class="shakespeare">William
                    Shakespeare</span></span>" <span class="vocabulary">id</span>="<span class="userValue">shakespeare_image</span>"&gt;<br>
                <br>
                &nbsp;&nbsp;&lt;<span class="vocabulary">p</span>&gt;</span><span class="shakespeare">What<span

                  class="markup">&amp;#x2019;</span>s in a name? That which we
                call a rose by any other name would smell as sweet.</span><span

                class="markup">&lt;/<span class="vocabulary">p</span>&gt;<br>
                &lt;/<span class="vocabulary">body</span>&gt;<br>
                &lt;/<span class="vocabulary">html</span>&gt;</span> </p>
          </div>
          <ul style="text-align:left">
          <li>Everything inside the black rectangle (that is, in this HTML file)
            is part of the resource.</li>
          <li><a>Syntactic content</a> is shown in a <span class="markup">monospaced font</span>.</li>
          <li><a>Natural language content</a> is shown in a <span class="shakespeare">bold
              blue font with a gray background</span>.</li>
          <li>User values are shown in <span class="userValue">italics</span>.</li>
          <li><a>Vocabulary</a> is shown with <span class="vocabulary">red underlining</span>.</li>
          <li>All of the text above (all text in a text file) makes up a
            resource. It's possible that a given resource will contain no
            natural language content at all (consider an HTML document
            consisting of four empty <code>div</code> elements styled to be
            orange rectangles). It's also possible that a resource will contain
            <em>no</em> syntactic content and consist solely of natural language content:
            for example, a plain text file with a soliloquy from <cite>Hamlet</cite>
            in it. Notice too that the numeric character reference <code>&amp;#x2019;</code>
            appears in the natural language content and belongs to both the
          natural language content and the syntactic content in this resource.</li>
          </ul>
        </section>
      </section>
      <section id="conformance">
        <h4>Conformance</h4>
        <p>This specification places conformance criteria on specifications, on
          software (implementations) and on Web content. To aid the reader, all
          conformance criteria are preceded by <span class="qterm">[X]</span>
          where <span class="qchar">X</span> is one of <span class="qchar">S</span>
          for specifications, <span class="qchar">I</span> for software
          implementations, and <span class="qchar">C</span> for Web content.
          These markers indicate the relevance of the conformance criteria and
          allow the reader to quickly locate relevant conformance criteria by
          searching through this document.</p>
        <p>Specifications conform to this document if they:</p>
        <ol type="1">
          <li>
            <p> do not violate any conformance criteria preceded by [S] where
              the imperative is MUST or MUST NOT,</p>
          </li>
          <li>
            <p>document the reason for any deviation from criteria where the
              imperative is <span class="rfc2119">SHOULD</span>, <span class="rfc2119">SHOULD
                NOT</span>, or <span class="rfc2119">RECOMMENDED</span>,</p>
          </li>
          <li>
            <p> make it a conformance requirement for implementations to conform
              to this document,</p>
          </li>
          <li>
            <p> make it a conformance requirement for content to conform to this
              document.</p>
          </li>
        </ol>
        <p>Software conforms to this document if it does not violate any
          conformance criteria preceded by [I].</p>
        <p>Content conforms to this document if it does not violate any
          conformance criteria preceded by [C].</p>
        <div class="note">
          <p><span class="note-head">NOTE: </span>Requirements placed on
            specifications might indirectly cause requirements to be placed on
            implementations or content that claim to conform to those
            specifications.</p>
        </div>
        <p>Where this specification contains a procedural description, it is to
          be understood as a way to specify the desired external behavior.
          Implementations can use other means of achieving the same results, as
          long as observable behavior is not affected.</p>
      </section>
    </section>
    <section id="problemStatement">
      <h2>The String Matching Problem</h2>
      <p>The Web is primarily made up of document formats and protocols based on
        character data. These formats or protocols can be viewed as a set of
        text files (<a data-lt="resource">resources</a>) that include some form
        of structural markup or syntactic content. Processing such syntactic content or document data requires
        string-based operations such as matching, indexing, searching, sorting,
        regular expression matching, and so forth. As a result, the Web is
        sensitive to the different ways in which text might be represented in a
        document. Failing to consider the different ways in which the same text
        can be represented can confuse users or cause unexpected or frustrating
        results.</p>
      <section id="definitionCaseFolding">
        <h3>Case Folding</h3>
        <p>Some scripts and writing systems make a distinction between UPPER,
          lower, and Title case characters. Most scripts, such as the Brahmic
          scripts of India, the Arabic script, and the non-Latin scripts used to
          write Chinese, Japanese, or Korean, do not have a case distinction, but
          some important ones do. Examples of such scripts include the Latin
          script used in the majority of this document, as well as scripts such
          as Greek, Armenian, or Cyrillic. </p>
        <p>Some document formats or protocols seek to aid interoperability or
          provide an aid to content authors by ignoring case variations in the
          <a data-lt="vocabulary">vocabulary</a> they define or in user-defined values permitted by the
          format or protocol. For example, this occurs when matching class names
          between an HTML document and its associated style sheet. Consider this
          HTML fragment: </p>
          <aside class="example">
        <pre>&lt;style type="text/css"&gt;

  SPAN.h\e9llo {
     text-decoration: underline;
  }
&lt;/style&gt;

&lt;span class="h&amp;#xe9;llo"&gt;Hello World!&lt;/span&gt;
</pre>
</aside>
        <p>The <code class="kw" translate="no">SPAN</code> in the stylesheet
          matches the <code class="kw" translate="no">span</code> element in
          the document, even though one is uppercase and the other is not.</p>
        <p><dfn>Case folding</dfn> is the process of making two texts that
          differ in case, but are otherwise "the same", identical.</p>
        <p>Case folding might, at first, appear simple. However, there are
          variations that need to be considered when treating the full range of
          Unicode in diverse languages. For more information, see Section 5.18 of
          <cite>[[!Unicode]]</cite>, which discusses case folding in detail.</p>
          
        <p>Unicode defines the default case fold mapping for each Unicode code point.
         Since most scripts do not provide a case distinction, most Unicode code 
		points do not require a case fold mapping. For those characters that 
		have a case fold mapping, the majority have a simple, straightforward 
		mapping to a single matching (generally lowercase) code point. Unicode 
		calls these the <code class="kw">common</code> case fold mappings, as they are shared by 
		both the <code class="kw">simple</code> and <code class="kw">full</code> case fold mappings described below.
         </p>
		  <p>In addition to the <code class="kw">common</code> case folding mappings, a few characters 
		  have a case fold mapping that would normally require more than one 
		  Unicode character. These are called the <code class="kw">full</code> case fold mappings. 
		  Together with the <code class="kw">common</code> case fold mappings, these provide the 
		  default case fold mapping for all of Unicode. This case fold mapping is referred to in this 
		  document as <dfn id="dfn-UnicodeC+F">Unicode C+F</dfn>.
         </p>
		  <p>Because some applications cannot allocate additional storage when 
		  performing a case fold operation, Unicode provides a <code class="kw">simple</code> case 
		  fold mapping. For each character whose <code class="kw">full</code> mapping would produce 
		  a different number of code points, the simple mapping uses a single code point 
		  for comparison purposes instead. Unlike the full mapping, this mapping invariably alters the 
		  content (and potentially the meaning) of the text. This <code class="kw">simple</code> case fold mapping, referred to in this document 
		  as <dfn id="UnicodeC+S">Unicode C+S</dfn>, is not appropriate for the Web. </p>
		  
		  <aside class="example">
		  <p>One well-known example of a 'full' case fold mapping is the character <span class="qchar">&#xdf;</span>
		  <span class="uname" translate="no">U+00DF LATIN SMALL LETTER SHARP S</span>, a letter that is commonly
		  used in the German language. The 'full' mapping of this character is to two ASCII letters 's'. 
		  There is no 'simple' mapping for this letter. </p>
		  <p>Other examples can 
		  be found in the Greek script, where several precomposed characters have multi-character
		  case fold mappings. For example, consider the character <code>U+1F9B</code> (<span class="uname" translate="no">GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND 
					PROSGEGRAMMENI</span>). This character has both a <code class="kw">full</code> and <code class="kw">simple</code> mapping:</p>
		  	<table style="width: 100%">
				<tr>
					<th>Source</th>
					<th>Full</th>
					<th>Simple</th>
					<th>Comments</th>
				</tr>
				<tr>
					<td>ᾛ <code class="kw">U+1F9B</code></td>
					<td>ἣι <code class="kw">U+1F23&nbsp;U+03B9</code></td>
					<td>ᾓ <code class="kw">U+1F93</code></td>
					<td><span class="uname">GREEK SMALL LETTER ETA WITH DASIA AND VARIA</span> + <span class="uname">GREEK SMALL LETTER IOTA</span><br>
					<em>versus</em> <span class="uname" translate="no">GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI</span></td>
				</tr>
			  </table>

		  </aside>
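		  <aside class="example">
		  <p>The two <code class="kw">full</code> mappings above can be observed with Python's built-in 
		  <code>str.casefold()</code>, which is documented as applying the case folding algorithm of the 
		  Unicode Standard and, as far as this sketch assumes, corresponds to the common plus full mappings. 
		  The helper name <code>unicode_cf_equal</code> is hypothetical. The last comparison also shows 
		  that folding discards information: once folded, the two spellings can no longer be told apart.</p>

```python
def unicode_cf_equal(a: str, b: str) -> bool:
    """Compare two strings after applying Python's full case folding."""
    return a.casefold() == b.casefold()


sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S
eta = "\u1f9b"      # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI

# One code point folds to two code points under the full mapping.
print([f"U+{ord(c):04X}" for c in sharp_s.casefold()])
print([f"U+{ord(c):04X}" for c in eta.casefold()])

# The folded forms are identical, so the original spelling is not recoverable.
print(unicode_cf_equal("STRASSE", "stra\u00dfe"))
```

		  </aside>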
          
        <p>Note that case folding removes information from a string which cannot 
		be recovered later. </p>
        <p>Another aspect of case folding is that it can be language sensitive.
          Unicode defines default case mappings for each encoded character, but
          these are only defaults and are not appropriate in all cases. Some
          languages need case-folding to be tailored to meet specific linguistic
          needs. One common example of this is the Turkic languages written in the
          Latin script.</p>
        <aside class="example">
          <p>The name of the Turkish city "<code>Diyarbakır</code>"
           contains both the dotted and dotless
            letters <span class="qchar">i</span>. When rendered into upper
            case, this word appears like this: <span class="qterm"><code>DİYARBAKIR</code></span>.
            Notice that the ASCII letter <span class="qchar">i</span> maps to <span

              class="uname" translate="no">U+0130 LATIN CAPITAL LETTER I WITH
              DOT ABOVE</span>, while the letter <span class="qchar">ı</span> (<span

              class="uname" translate="no">U+0131 LATIN SMALL LETTER DOTLESS I</span>)
            maps to the ASCII uppercase <span class="qchar">I</span>. </p>
        </aside>
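        <p>As a rough illustration, the sketch below contrasts the untailored default mapping with a Turkic-aware one. The helper <code>turkic_upper</code> is a hypothetical name introduced here and covers only the two letters discussed above; a production system would more likely rely on locale data such as CLDR.</p>

```python
def default_upper(text):
    # Python's str.upper() applies the untailored Unicode default mappings.
    return text.upper()

def turkic_upper(text):
    # Hypothetical helper: handle the two Turkic exceptions first,
    # then fall back to the default mappings for everything else.
    tailored = text.replace("i", "\u0130").replace("\u0131", "I")
    return tailored.upper()

city = "Diyarbak\u0131r"
print(default_upper(city))  # both i letters collapse to ASCII I
print(turkic_upper(city))   # dotted and dotless i stay distinct
```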
        <p>Sometimes case can vary in a way that is not semantically meaningful
          or is not fully under the user's control. This is particularly true
          when <a href="#searching">searching</a> a document, but also applies
          when defining rules for matching user- or content-generated values,
          such as identifiers. In these situations, case-<em>in</em>sensitive
          matching might be desirable instead.</p>
        <p>When defining a <a>vocabulary</a>, one important consideration is
          whether the values are restricted to the ASCII subset of Unicode or if
          the vocabulary permits the use of characters (such as accents on Latin
          letters or a broad range of Unicode including non-Latin scripts) that
          potentially have more complex case folding requirements.</p>
        <p>To address these different requirements, there are four types of casefold matching defined by this document for the purposes of
        string identity matching in document formats or protocols:</p>
        <p><dfn data-lt="case-sensitive">Case sensitive matching</dfn>: code
          points are compared directly with no case folding.</p>
        <p><dfn data-lt="ASCII case-insensitive">ASCII case-insensitive matching</dfn>
          compares a sequence of code points as if all ASCII code points in the
          range 0x41 to 0x5A (A to Z) were mapped to the corresponding code
          points in the range 0x61 to 0x7A (a to z). When a vocabulary is itself
          constrained to ASCII, ASCII case-insensitive matching can be required.
        </p>
        <p id="uci"><dfn data-lt="Unicode case-insensitive">Unicode
            case-insensitive matching</dfn> compares a sequence of code points
          as if the Unicode-defined, language-independent default case folding
          (C+F) mentioned above had been applied to both input sequences.</p>
        <p><dfn>Language-sensitive case-insensitive matching</dfn> is useful in
          the rare case where a document format or protocol contains information
          about the language of the syntactic content and where language-sensitive case
          folding might sensibly be applied. <span class="requirement">In these
            cases, tailoring of the Unicode case-fold mappings above to match
            the expectations of that language SHOULD be specified and applied.</span>
          These case-fold mappings are defined in the <cite>Common Locale Data
            Repository</cite> [[UAX35]] project of the Unicode Consortium.</p>
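        <p>The first three matching types can be sketched as follows. This is an illustration under stated assumptions rather than a normative algorithm: the function names are hypothetical, and <code>str.casefold()</code> is used as an approximation of full default case folding. Language-sensitive matching is omitted because the Python standard library does not provide tailored case folding.</p>

```python
import string

# Translation table that maps only A-Z to a-z; all other code points are untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def case_sensitive_match(a, b):
    # Code points are compared directly, with no folding.
    return a == b

def ascii_case_insensitive_match(a, b):
    # Only the ASCII letters are folded before comparison.
    return a.translate(_ASCII_LOWER) == b.translate(_ASCII_LOWER)

def unicode_case_insensitive_match(a, b):
    # str.casefold() applies full default case folding without language tailoring.
    return a.casefold() == b.casefold()

print(ascii_case_insensitive_match("Content-Type", "content-type"))
print(ascii_case_insensitive_match("\u00C9", "\u00E9"))
print(unicode_case_insensitive_match("\u00C9", "\u00E9"))
```

        <p>Note that the ASCII variant deliberately treats non-ASCII letters such as <span class="qchar">É</span> and <span class="qchar">é</span> as distinct, which is why it suits only vocabularies constrained to ASCII.</p>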
      </section>
      <section id="unicodeNormalization">
        <h3>Unicode Normalization</h3>
        <p>Other kinds of variations can occur in Unicode text: some <a data-lt="grapheme">graphemes</a>
          can be represented by several different Unicode code point sequences.
          Consider the character &#x01FA; <span class="uname" translate="no"> U+01FA
            LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE</span>. Here are
          some of the different character sequences that an HTML document could
          use to represent this character:</p>
        <ul class="dropExampleList">
          <li class="dropExampleItem"><span class="dropExample">&#x01FA;</span> <span class="uname" translate="no">U+01FA</span>—A "precomposed" character.</li>
          <li class="dropExampleItem"><span class="dropExample">A&#x030A;&#x0301;</span><span

              class="uname" translate="no">A&nbsp;+&nbsp;U+030A&nbsp;+&nbsp;U+0301</span>—
            A <span class="qterm">base</span> letter <span class="qchar">A</span>
            followed by two combining marks (<span class="uname" translate="no">U+030A
              COMBINING RING ABOVE</span> and <span class="uname" translate="no">U+0301
              COMBINING ACUTE ACCENT</span>)</li>
          <li class="dropExampleItem"><span class="dropExample">&#x00C5;&#x0301;</span><span class="uname"

              translate="no">U+00C5 + U+0301</span>—An accented letter (<span class="uname"

              translate="no">U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE</span>)
            followed by a combining accent (<span class="uname" translate="no">U+0301
              COMBINING ACUTE ACCENT</span>)</li>
          <li class="dropExampleItem"><span class="dropExample">&#x212B;&#x0301;</span><span class="uname"

              translate="no">U+212B + U+0301</span>—A compatibility character (<span

              class="uname" translate="no">U+212B ANGSTROM SIGN</span>) followed
            by a combining accent (<span class="uname" translate="no">U+0301
              COMBINING ACUTE ACCENT</span>)</li>
          <li class="dropExampleItem"><span class="dropExample">&#xFF21;&#x030A;&#x0301;</span><span

              class="uname" translate="no">U+FF21 + U+030A + U+0301</span>— A
            compatibility character (<span class="uname" translate="no">U+FF21
              FULLWIDTH LATIN CAPITAL LETTER A</span>) followed by two combining
            marks (<span class="uname" translate="no">U+030A COMBINING RING
              ABOVE</span> and <span class="uname" translate="no">U+0301
              COMBINING ACUTE ACCENT</span>)</li>
        </ul>
        <p>Each of the above strings contains the same apparent 
        <span class="quote">meaning</span> as <span class="qchar">Ǻ</span> (<span class="uname" translate="no">U+01FA
              LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE</span>), but each
            one is encoded slightly differently. More variations are possible,
            but are omitted for brevity.</p>
        <p>Because applications need to find the semantic equivalence in texts
          that use different code point sequences, Unicode defines a means of
          making two semantically equivalent texts identical: the Unicode
          Normalization Forms [[!UAX15]].</p>
        <p><a data-lt="resource">Resources</a> are often susceptible to the
          effects of these variations because their specifications and
          implementations on the Web do not require Unicode Normalization of the
          text, nor do they take into consideration the string matching
          algorithms used when processing the syntactic content and natural language content later. For this
          reason, content developers need to ensure that they have provided a
          consistent representation in order to avoid problems later.</p>
        <p>However, it can be difficult for users to ensure that a given <a data-lt="resource">resource</a>
          or set of resources uses a consistent textual representation because
          the differences are usually not visible when viewed as text. Tools and
          implementations thus need to consider the difficulties experienced by
          users when visually or logically equivalent strings that "ought to"
          match (in the user's mind) are considered to be distinct values.
          Providing a means for users to see these differences and/or normalize
          them as appropriate makes it possible for end users to avoid failures
          that spring from invisible differences in their source documents. For
          example, the W3C Validator warns when an HTML document is not fully in
          Unicode Normalization Form C.</p>
        <section id="canonical_compatibility">
          <h4>Canonical vs. Compatibility Equivalence</h4>
          <p>Unicode defines two types of equivalence between characters: <em>canonical
              equivalence</em> and <em>compatibility equivalence</em>.</p>
          <p><dfn>Canonical equivalence</dfn> is a fundamental equivalency
            between Unicode characters or sequences of Unicode characters that
            represent the same abstract character. When correctly displayed,
            these should always have the same visual appearance and behavior.
            Generally speaking, two canonically equivalent Unicode texts should
            be considered to be identical as text. Canonical decomposition
            removes these primary distinctions between two texts.</p>
          <p>Examples of canonical equivalence defined by Unicode include:</p>
          <ul class="dropExampleList">
            <li class="dropExampleItem"><span class="dropExample">Ç<span style="font-size:75%">
                  vs.</span>C&#x0327;</span> <em>Precomposed versus combining
                sequences.</em> Some characters can be composed from a base
              character followed by one or more combining characters. The same
              characters are sometimes also encoded as a distinct "precomposed"
              character. In this example, the character <span class="qchar">Ç</span>
              <span class="uname" translate="no">U+00C7</span> is canonically
              equivalent to the base character <span class="qchar">C</span> <span

                class="uname" translate="no">U+0043</span> followed by the
              combining cedilla character <span class="qchar">̧</span> <span class="uname"

                translate="no">U+0327</span>. Such equivalence can extend to
              characters with multiple combining marks.</li>
            <li class="dropExampleItem"><span class="dropExample">q&#x0307;&#x0323;<span style="font-size:75%">
                  vs.</span>q&#x0323;&#x0307;</span> <em>Order of combining marks.</em> When
              a base character is modified by multiple combining marks, the
              order of the combining marks might not represent a distinct
              character. Here the sequence <span class="qterm">q&#x0307;&#x0323;</span> (<span
                class="uname" translate="no">U+0071 U+0307 U+0323</span>) and <span
                class="qterm">q&#x0323;&#x0307;</span> (<span class="uname" translate="no">U+0071
                U+0323 U+0307</span>) are equivalent, even though the combining
              marks are in a different order. Note that this example is chosen
              carefully: the dot-above character and dot-below character are on
              opposite "sides" of the base character. The order of combining
              diacritics on the same side has a positional meaning.</li>
            <li class="dropExampleItem"><span class="dropExample">&#x2126;<span style="font-size:75%">
                  vs.</span>Ω</span> <em>Singleton mappings.</em> These result
              from the need to separately encode otherwise equivalent characters
              to support legacy character encodings. In this example, the Ohm
              symbol <span class="qchar">Ω</span> <span class="uname" translate="no">U+2126</span>
              is canonically equivalent (and identical in appearance) to the
              Greek letter Omega <span class="qchar">Ω</span> <span class="uname"

                translate="no">U+03A9</span>.</li>
            <li class="dropExampleItem"><span class="dropExample">가<span style="font-size:75%">
                  vs.</span>&#x1100;&#x1161;</span> <em>Hangul.</em> The Hangul script is
              used to write the Korean language. This script is constructed
              logically, with each syllable being a roughly-square <a>grapheme</a>
              formed from specific sub-parts that represent consonants and
              vowels. These specific sub-parts, called <em>jamo</em>, are
              encoded in Unicode. So too are the precomposed syllables. Thus the
              syllable <span class="qchar">가</span>&nbsp; <span class="uname"

                translate="no">U+AC00</span> is canonically equivalent to its
              constituent <em>jamo</em> characters <span class="qchar">ᄀ</span>&nbsp;<span

                class="uname" translate="no">U+1100</span> and <span class="qchar">ᅡ</span>&nbsp;<span

                class="uname" translate="no">U+1161</span>.</li>
          </ul>
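          <p>The four kinds of canonical equivalence listed above can be checked with a short sketch. It assumes Python's standard <code>unicodedata</code> module; the helper name <code>canonically_equivalent</code> is hypothetical.</p>

```python
import unicodedata

def canonically_equivalent(a, b):
    # Two strings are canonically equivalent if their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

pairs = [
    ("\u00C7", "C\u0327"),               # precomposed vs. combining sequence
    ("q\u0307\u0323", "q\u0323\u0307"),  # order of combining marks
    ("\u2126", "\u03A9"),                # singleton mapping
    ("\uAC00", "\u1100\u1161"),          # Hangul syllable vs. jamo
]
results = [canonically_equivalent(a, b) for a, b in pairs]
print(results)

# Marks on the same side of the base are not reordered, so these differ.
print(canonically_equivalent("a\u0301\u0300", "a\u0300\u0301"))
```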
          <p><dfn>Compatibility equivalence</dfn> is a weaker equivalence
            between characters or sequences of characters that represent the
            same abstract character, but may have a different visual appearance
            or behavior. Generally a compatibility decomposition removes
            formatting variations, such as superscript, subscript, rotated,
            circled, and so forth, but other variations also occur. In many
            cases, characters with compatibility decompositions represent a
            distinction of a semantic nature, so replacing distinct
            characters with their compatibility decomposition can
            cause problems. Texts that are equivalent after compatibility
            decomposition often were not perceived as being identical beforehand
            and usually should not be treated as equivalent by a formal
            language.</p>
          <p>The following table illustrates various kinds of compatibility
            equivalence in Unicode:</p>
              <table class="data">
                <thead>
                  <tr>
                    <th colspan="5">Compatibility Equivalence</th>
                  </tr>
                </thead>
                <tbody>
                  <tr>
                    <td class="long"><strong>Font variants</strong>—characters that have a
                      specific visual appearance (generally associated with a
                      specialized use, such as in mathematics).</td>
                    <td style="text-align: center" colspan="2"> <span class="sampleCharacter">ℌ</span></td>
                    <td style="text-align: center" colspan="2"> <span class="sampleCharacter">ℍ</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong>Breaking versus non-breaking</strong>—variations
                      in breaking or joining rules, such as the difference
                      between a <span class="qterm">normal</span> and a
                      non-breaking space.</td>
                    <td style="text-align: center" colspan="4"><span class="uname"
                        translate="no">U+00A0 NO-BREAK SPACE</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong>Presentation forms of Arabic</strong>—
                      characters that encode the specific shapes&nbsp;(initial,
                      medial, final, isolated) needed by visual legacy encodings
                      of the Arabic script.</td>
                    <td style="text-align: center"> <span class="sampleCharacter">ﻨ</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">ﻧ</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">ﻦ</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">ﻥ</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong>Circled</strong>—numbers, letters, and other
                      characters in a circled, bullet, or other presentational
                      form; often used for lists, footnotes, and specialized
                      presentation</td>
                    <td style="text-align: center" colspan="1"> <span class="sampleCharacter">①</span></td>
                    <td style="text-align: center" colspan="1"> <span class="sampleCharacter">❿</span></td>
                    <td style="text-align: center" colspan="1"> <span class="sampleCharacter">㉄</span></td>
                    <td style="text-align: center" colspan="1"> <span class="sampleCharacter">㊞</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong>Width variation, size, rotated presentation
                        forms</strong>—narrow vs. wide presentational forms of
                      characters (such as those associated with legacy
                      multibyte encodings), as well as "rotated" presentation
                      forms necessary for vertical text.</td>
                    <td style="text-align: center"><span class="sampleCharacter">ｶ</span></td>
                    <td style="text-align: center"><span class="sampleCharacter">カ</span></td>
                    <td style="text-align: center"><span class="sampleCharacter">︷</span></td>
                    <td style="text-align: center"><span class="sampleCharacter">{</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong>Superscripts/subscripts</strong>—superscript or
                      subscript letters, numbers, and symbols.</td>
                    <td style="text-align: center"> <span class="sampleCharacter">⁹</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">₉</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">ª</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">₊</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong><span class="quote">Squared</span> characters</strong>—East
                      Asian (particularly kana) sequences encoded as a
                      presentation form to fit in a single ideographic "cell" in
                      text.</td>
                    <td style="text-align: center"> <span class="sampleCharacter">㌀</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">㍐</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">🄠</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">㎉</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong>Fractions</strong>—precomposed vulgar fractions,
                      often encoded for compatibility with font glyph sets.</td>
                    <td style="text-align: center"> <span class="sampleCharacter">¼</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">½</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">⅟</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">↉</span></td>
                  </tr>
                  <tr>
                    <td class="long"><strong>Others</strong>—compatibility characters encoded
                      for other reasons, generally for compatibility with legacy
                      character encodings. Many of these characters are simply a
                      sequence of characters encoded as a single presentational
                      unit.</td>
                    <td style="text-align: center"> <span class="sampleCharacter">ǆ</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">⑴</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">⒈</span></td>
                    <td style="text-align: center"> <span class="sampleCharacter">⻳</span></td>
                  </tr>
                </tbody>
              </table>
          <p>In the above table, it is important to note that the characters
            illustrated are <em>actual Unicode codepoints</em>. They were
            encoded into Unicode for compatibility with various legacy character
            encodings. They should not be confused with the normal kinds of
            presentational processing used on their non-compatibility
            counterparts.</p>
          <p>For example, most Arabic-script text uses the characters in the
            Arabic script block of Unicode (starting at <span class="uname" translate="no">U+0600</span>).
            The actual glyphs used to display the text are selected using fonts
            and text processing logic based on the position inside a word
            (initial, medial, final, or isolated), in a process called
            "shaping". In the table above, the four presentation forms of the
            Arabic letter <span class="uname" translate="no">NOON</span> are
            shown. The characters shown are compatibility characters in the
            Arabic Presentation Forms-B block (which starts at <span
              class="uname" translate="no">U+FE70</span>). Each of them
            represents a specific "positional" shape, and each of the four code
            points shown has a compatibility decomposition to the <span class="quote">regular</span>
            Arabic letter <span class="uname" translate="no">U+0646 NOON</span>.</p>
          <p>Similarly, the variations in half-width and full-width forms and rotated
            characters (for use in vertical text) are encoded as separate code
            points, mainly for compatibility with legacy character encodings. In 
		  many cases these variations are associated with the Unicode properties 
		  described in <cite>East Asian Width</cite> [[UAX11]]. See also <cite>Unicode 
		  Vertical Text Layout</cite> [[UTR50]] for a discussion of vertical text 
		  presentation forms.</p>
          <p>In the case of characters with compatibility decompositions, such
            as those shown above, the <span class="qchar">K</span> Unicode
            Normalization forms convert the text to the "normal" or "expected"
            Unicode code point. But the existence of these compatibility
            characters cannot be taken to imply that similar appearance
            variations produced in the normal course of text layout and
            presentation are affected by Unicode Normalization. They are not.</p>
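          <p>As an illustration, the sketch below applies a canonical form (NFC) and a compatibility form (NFKC) to a few compatibility characters of the kinds shown in the table. It assumes Python's standard <code>unicodedata</code> module, and the choice of sample characters is ours.</p>

```python
import unicodedata

samples = {
    "\uFEE8": "\u0646",  # ARABIC LETTER NOON MEDIAL FORM to regular NOON
    "\uFF76": "\u30AB",  # HALFWIDTH KATAKANA LETTER KA to KATAKANA LETTER KA
    "\u2460": "1",       # CIRCLED DIGIT ONE to DIGIT ONE
    "\u2079": "9",       # SUPERSCRIPT NINE to DIGIT NINE
}
for source, expected in samples.items():
    canonical = unicodedata.normalize("NFC", source)
    compat = unicodedata.normalize("NFKC", source)
    # NFC leaves the compatibility character alone; NFKC replaces it.
    print(unicodedata.name(source), canonical == source, compat == expected)
```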
        </section>
        <section id="composition_decomposition">
          <h4>Composition vs. Decomposition</h4>
          <p>These two types of Unicode-defined equivalence are then grouped by
            another pair of variations: "decomposition" and "composition". In
            "decomposition", separable logical parts of a visual character are
            broken out into a sequence of base characters and combining marks
            and the resulting code points are put into a fixed, canonical order.
            In "composition", the decomposition is performed and then any
            combining marks are recombined, if possible, with their base
            characters. Note that this does <strong>not</strong> mean that all
            of the combining marks have been removed from the resulting
            normalized text. </p>
          <div class="note">
            <p>Roughly speaking, <abbr title="Normalization Form C">NFC</abbr>
              is defined such that each combining character sequence (a base
              character followed by one or more combining characters) is
              replaced, as far as possible, by a canonically equivalent
              precomposed character. Text in a Unicode character encoding form
              (such as UTF-8 or UTF-16) is said to be in NFC if it doesn't
              contain any combining sequence that could be replaced with a
              precomposed character and if any remaining combining sequence is
              in canonical order.</p>
          </div>
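          <p>A minimal sketch of an NFC check follows, assuming Python 3.8 or later, where <code>unicodedata.is_normalized</code> is available. The wrapper name <code>is_nfc</code> is hypothetical.</p>

```python
import unicodedata

def is_nfc(text):
    # unicodedata.is_normalized requires Python 3.8 or later.
    return unicodedata.is_normalized("NFC", text)

print(is_nfc("\u01FA"))         # precomposed character
print(is_nfc("A\u030A\u0301"))  # could be replaced by a precomposed character
print(is_nfc("q\u0307\u0323"))  # marks not in canonical order
print(is_nfc("q\u0323\u0307"))  # no precomposed form exists; marks in canonical order
```

          <p>The last case shows that text in NFC can still contain combining marks when no precomposed character exists for the sequence.</p>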
        </section>
        <section id="normalization_forms">
          <h4>Unicode Normalization Forms</h4>
          <p>The Unicode Normalization Forms are named using letter codes, with
            'C' standing for Composition, 'D' for Decomposition, and 'K' for
            Compatibility decomposition. Having converted a resource to a
            sequence of Unicode characters and unescaped any escape sequences,
            we can finally "normalize" the Unicode texts given in the example
            above. Here are the resulting sequences in each Unicode
            Normalization form for the U+01FA example given earlier: </p>
          <figure>
            <div>
              <table class="data">
                <thead>
                  <tr>
                    <th>Original Codepoints</th>
                    <th>NFC</th>
                    <th>NFD</th>
                    <th>NFKC</th>
                    <th>NFKD</th>
                  </tr>
                </thead>
<tbody>
                  <tr>
                    <td class="b-clear">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                  </tr>
                  <tr>
                    <td class="b-clear">&#x00C5;&#x0301;<br>
                      <span class="tableSub">U+00C5 U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                  </tr>
                  <tr>
                    <td class="b-clear">&#x212B;&#x0301;<br>
                      <span class="tableSub">U+212B U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                  </tr>
                  <tr>
                    <td class="b-clear">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub"> U+0041 U+030A U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                  </tr>
                  <tr>
                    <td class="b-clear">&#xFF21;&#x030A;&#x0301;<br>
                      <span class="tableSub">U+FF21 U+030A U+0301</span></td>
                    <td class="b3">&#xFF21;&#x030A;&#x0301;<br>
                      <span class="tableSub">U+FF21 U+030A U+0301</span></td>
                    <td class="b3">&#xFF21;&#x030A;&#x0301;<br>
                      <span class="tableSub"> U+FF21 U+030A U+0301</span></td>
                    <td class="b1">&#x01FA;<br>
                      <span class="tableSub">U+01FA</span></td>
                    <td class="b2">A&#x030A;&#x0301;<br>
                      <span class="tableSub">U+0041 U+030A U+0301</span></td>
                  </tr>
                </tbody>	
               </table>
            </div>
            <figcaption>Comparison of Unicode Normalization Forms</figcaption> </figure>
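          <p>The table above can be reproduced with a short sketch, assuming Python's standard <code>unicodedata</code> module. The helper <code>code_points</code> is a hypothetical name used only for display.</p>

```python
import unicodedata

inputs = [
    "\u01FA",
    "\u00C5\u0301",
    "\u212B\u0301",
    "A\u030A\u0301",
    "\uFF21\u030A\u0301",
]

def code_points(text):
    # Render a string as a space-separated list of U+XXXX values.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

table = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    table[form] = [unicodedata.normalize(form, s) for s in inputs]
    print(form, [code_points(s) for s in table[form]])

# Collect every distinct result across all four forms.
distinct = {s for column in table.values() for s in column}
print(len(distinct))
```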
          <p>Unicode Normalization reduces these (and other potential sequences
            of escapes representing the same character) to just three possible
            variations. However, Unicode Normalization doesn't remove all
            textual distinctions and sometimes the application of Unicode
            Normalization can remove meaning that is distinctive or meaningful
            in a given context. For example: </p>
          <ul>
            <li>Not all compatibility characters have a compatibility
              decomposition.</li>
            <li>Some characters that look alike or have similar semantics are
              actually distinct in Unicode and don't have canonical or
              compatibility decompositions to link them together. For example, <span

                class="qchar">。</span> <span class="uname" translate="no">U+3002
                IDEOGRAPHIC FULL STOP</span> is used as a <span class="quote">period</span>
              at the end of sentences in languages such as Chinese or Japanese.
              However, it is not considered equivalent to the ASCII <span class="quote">period</span>
              character <span class="uname" translate="no">U+002E FULL STOP</span>.</li>
            <li>Some character variations are not handled by the Unicode
              Normalization Forms. For example, UPPER, Title, and lowercase
              variations are a separate and distinct textual variation that must
              be separately handled when comparing text.</li>
            <li>Normalization can remove meaning. For example, the character
              sequence <span class="qterm"><samp>8½</samp></span> (including
              the character <span class="uname" translate="no">U+00BD VULGAR
                FRACTION ONE HALF</span>), when normalized using one of the <span

                class="quote">compatibility</span> normalization forms (that is,
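              NFKD or NFKC), becomes the sequence <samp>81&#x2044;2</samp>
              (the digits 8, 1, and 2 separated by <span class="uname" translate="no">U+2044 FRACTION SLASH</span>),
              which reads like <span class="quote">eighty-one halves</span> rather than <span class="quote">eight and a half</span>.</li>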
          </ul>
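          <p>Some of these limitations can be observed directly. The sketch below assumes Python's standard <code>unicodedata</code> module; the helper name <code>nfkc</code> and the sample strings are ours.</p>

```python
import unicodedata

def nfkc(text):
    return unicodedata.normalize("NFKC", text)

# Look-alike punctuation is not unified by normalization.
print(nfkc("\u3002") == ".")
# Case differences survive normalization and need separate case folding.
print(nfkc("HTML") == nfkc("html"))
# Compatibility normalization rewrites the fraction and changes how the text reads.
print([f"U+{ord(ch):04X}" for ch in nfkc("8\u00BD")])
```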
        </section>
      </section>
      <section id="characterEscapes">
        <h3>Character Escapes</h3>
        <p>Most document formats or protocols provide an escaping mechanism to
          permit the inclusion of characters that are otherwise difficult to
          input, process, or encode. These escaping mechanisms provide an
          additional equivalent means of representing characters inside a given
          resource. They also allow for the encoding of Unicode characters not
          represented in the character encoding scheme used by the document.</p>
        <p>For example, <span class="qchar">€</span> <span class="uname" translate="no">U+20AC
            EURO SIGN</span> can also be encoded in HTML as the hexadecimal
          character reference <code>&amp;#x20ac;</code> or as the decimal character reference <code>&amp;#8364;</code>.
          In a JavaScript or JSON file, it can appear as <code>\u20ac</code>
          while in a CSS stylesheet it can appear as <code>\20ac</code>. All of
          these representations encode the same literal character value: <span

            class="qchar">€</span>.</p>
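        <p>A rough sketch of how these escapes resolve to the same character is shown below. It assumes Python's standard <code>html</code>, <code>json</code>, and <code>re</code> modules. The function <code>decode_css_escapes</code> is a hypothetical, simplified decoder written for this illustration and is not a complete CSS parser.</p>

```python
import html
import json
import re

def decode_css_escapes(text):
    # Hypothetical, simplified decoder: handles only hexadecimal escapes
    # (1 to 6 hex digits plus one optional trailing whitespace character).
    pattern = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\r\n\f]?")
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

decoded = [
    html.unescape("&#x20ac;"),     # HTML hexadecimal form
    html.unescape("&#8364;"),      # HTML decimal form
    json.loads('"\\u20ac"'),       # JSON string escape
    decode_css_escapes(r"\20ac"),  # CSS escape
]
print(decoded)
```

        <p>Once unescaped, the four values are identical strings, so any matching performed afterwards sees no difference between them.</p>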
        <p>Character escapes are normally interpreted before a document is
          processed and strings within the format or protocol are matched.
          Returning to an example we used above: </p>
<aside class="example">
        <pre>&lt;style type="text/css"&gt;

  span.h\e9llo {
      text-decoration:underline;
  }
&lt;/style&gt;

&lt;span class="h&amp;#xe9;llo"&gt;Hello World!&lt;/span&gt;
</pre>
</aside>
        <p>You would expect that text to display like the following: <span class="héllo">Hello
            World!</span></p>
        <p>In order for this to work, the user-agent (browser) had to match two
          strings representing the class name <code>héllo</code>, even though
          the CSS and HTML each used a different escaping mechanism. The above
          fragment demonstrates one way that text can vary and still be
          considered "the same" according to a specification: the class name <code>h\e9llo</code>
          matched the class name in the HTML mark-up <code>h&amp;#xe9;llo</code>
          (and would also match the literal value <code>héllo</code> using the
          code point <span class="uname" translate="no">U+00E9</span>).</p>
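        <aside class="example">
          <p>As a rough illustration of escapes resolving to the same character sequence, the
            following Python sketch expands an HTML character reference, a CSS escape, and a JSON
            escape and compares the results. The helper <code>css_unescape</code> is hypothetical
            and handles only the hexadecimal form of CSS escapes; it is not a complete CSS
            tokenizer.</p>

```python
import html
import json
import re

def css_unescape(ident):
    # Simplified: a backslash, 1-6 hex digits, then one optional whitespace
    # character. Other CSS escape forms are not handled in this sketch.
    return re.sub(r"\\([0-9A-Fa-f]{1,6})\s?",
                  lambda m: chr(int(m.group(1), 16)), ident)

# "\x26" is the ampersand, spelled as a Python escape so that this listing
# stays safe to embed in markup.
from_html = html.unescape("h\x26#xe9;llo")
from_css = css_unescape(r"h\e9llo")
from_json = json.loads(r'"h\u00e9llo"')

# All three escaped spellings yield the same code points as the literal.
print(from_html == from_css == from_json == "h\u00e9llo")  # True
```

          <p>Comparing the strings before expansion would report three different values, which is
            why escapes are normally interpreted before matching.</p>
        </aside>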
      </section>
      <section id="unicodeControls">
        <h3>Unicode Controls and Invisible Markers</h3>
        <p>Unicode provides a number of invisible, special-purpose characters 
		that help document authors control the appearance or performance of 
		text. Because these characters are invisible, users are not always aware 
		of their presence or absence. As a result, these characters can 
		interfere with string matching when they are part of the encoded 
		character sequence but the expected matching text does not include them. 
		Some examples of these characters include:</p>
          <p>The Unicode control characters <span class="uname" translate="no">U+200D Zero Width Joiner</span> (also known 
        as <em>ZWJ</em>) and <span class="uname" translate="no">U+200C Zero Width Non-Joiner</span> (also known as 
		  <em>ZWNJ</em>). 
		Their original use was to control
          ligature formation, either preventing the formation of undesirable ligatures or encouraging the formation
          of desirable ones. However, their primary use today is to control 
		joining and shape selection in Arabic and Indic scripts. For example, ZWJ and ZWNJ are used in some Indic scripts to allow 
		  authors to specify the shape that certain conjuncts take. See the 
		  discussion in Chapter 12 of [[!Unicode]].</p>
		  <aside class="example">
		  <div class="example-title marker"></div>
			  <p>The <span class="uname" translate="no">Zero Width Non-Joiner</span> is used in Persian to 
		  prevent certain "normal" Arabic script joining. In these cases, the 
		  character affects the meaning. For example, the word تنها ("alone") and the word تن‌ها&nbsp; ("bodies" 
		  or "corpuses") are encoded as "<span class="uname">U+062A 
		  U+0646 U+0647 U+0627</span>" and "<span class="uname">U+062A U+0646 
			  <span style="text-decoration:underline">U+200C</span> U+0647 U+0627</span>" 
		  respectively, the only difference being the ZWNJ in the latter word.</p>
		  </aside>
		  <p>Variation selectors (<span class="uname">U+FE00</span> through 
		  <span class="uname" translate="no">U+FE0F</span>) are 
        characters used to select an alternate appearance or glyph 
        (see Character Model: Fundamentals [[CHARMOD]]). For example, they are used to select between black-and-white and color emoji. 
        These are also used in predefined ideographic variation sequences (<span class="qterm">IVS</span>). Many
        examples are given in the "Standardized Variants" portion of the Unicode Character Database (UCD).</p>
		  <p>A few scripts also provide a way to encode visual variation selection: a prominent example of this is the Mongolian 
		  script's free 
        variation selectors (<span class="uname">U+180B</span> through 
		  <span class="uname" translate="no">U+180D</span>). </p>
		  <p>The character <span class="uname" translate="no">U+034F Combining Grapheme Joiner</span>, 
		  whose name is misleading (as it does not join graphemes or affect line 
		  breaking), is used to separate characters that might otherwise be 
		  considered a grapheme for the purposes of sorting or to provide a 
		  means of maintaining certain textual distinctions when applying Unicode 
		  normalization to text. </p>
		  <p>Whitespace variations can also affect the interpretation and 
		  matching of text. Examples include the various non-breaking space 
		  characters, such as NBSP and NNBSP.</p>
		  <p><span class="uname" translate="no">U+200B Zero Width Space</span> is a character used to 
		  indicate word boundaries in text where spaces do not otherwise appear. 
		  For example, it might be used in a Thai language document to assist 
		  with word-breaking. </p>
		  <p>The <span class="uname" translate="no">U+00AD Soft Hyphen</span> can be used in text 
		  to indicate a potential or preferred hyphenation position. It only 
		  becomes visible when the text is reflowed to wrap at that position.</p>
		  <p>In almost all of these cases, users may not be aware of or cannot 
		  be sure if a given document or text string has included or omitted one 
		  of these characters. Because text matching depends on matching the 
		  underlying codepoints, variation in the encoding of the text due to 
		  these markers can cause matches that ought to succeed to mysteriously 
		  fail (from the point of view of the user).</p>
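        <aside class="example">
          <p>The effect on matching can be sketched in Python as follows. The set
            <code>INVISIBLE</code> and the helper <code>strip_invisible</code> are assumptions made
            for this illustration; they cover only some of the characters named in this section
            and are not a list defined by Unicode or by this document.</p>

```python
# Illustrative subset of the invisible characters discussed above.
INVISIBLE = {
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u200b",  # ZERO WIDTH SPACE
    "\u00ad",  # SOFT HYPHEN
    "\u034f",  # COMBINING GRAPHEME JOINER
}

def strip_invisible(text):
    # Drop every character that appears in the illustrative set.
    return "".join(ch for ch in text if ch not in INVISIBLE)

typed = "hyphenation"
stored = "hy\u00adphen\u00adation"   # the same word with soft hyphens

print(typed == stored)                   # False
print(typed == strip_invisible(stored))  # True

# The Persian words from the example above differ only by U+200C.
alone = "\u062a\u0646\u0647\u0627"
bodies = "\u062a\u0646\u200c\u0647\u0627"
print(strip_invisible(bodies) == alone)  # True
```

          <p>The last comparison shows the cost of removing these characters blindly: two words
            with different meanings become indistinguishable.</p>
        </aside>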

      </section>
      <section id="legacyCharacterEncoding">
        <h3>Legacy Character Encodings</h3>
        <p><a>Resources</a> can use different character encoding
          schemes, including <a>legacy character encodings</a>, to serialize
          document formats on the Web. Each character encoding scheme uses
          different byte values and sequences to represent a given subset of the
          Universal Character Set.</p>
        <div class="note">
          <p>Choosing a Unicode character encoding, such as UTF-8, for all
            documents, formats, and protocols is strongly encouraged, since no
            additional utility is gained from using a legacy character
            encoding and the considerations in the rest of this section would be
            completely avoided.</p>
        </div>
        <p>For example, <span class="qchar">€</span> (<span class="uname" translate="no">U+20AC
            EURO SIGN</span>) is encoded as the byte sequence <code>0xE2.82.AC</code>
          in the <code class="kw">UTF-8</code> character encoding. This same
          character is encoded as the byte sequence <code>0x80</code> in the
          legacy character encoding <code class="kw">windows-1252</code>.
          (Other legacy character encodings may not provide any byte sequence to
          encode the character.)</p>
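        <aside class="example">
          <p>A short Python sketch of the byte sequences described above. The codec names are
            those of Python's standard library; <code>iso-8859-1</code> is used here only as one
            example of a legacy encoding with no byte sequence for this character.</p>

```python
euro = "\u20ac"

utf8_bytes = euro.encode("utf-8")
cp1252_bytes = euro.encode("windows-1252")
print(utf8_bytes.hex(), cp1252_bytes.hex())  # e282ac 80

# Different bytes, same character once each is decoded with its own encoding.
print(utf8_bytes.decode("utf-8") == cp1252_bytes.decode("windows-1252"))  # True

try:
    euro.encode("iso-8859-1")
except UnicodeEncodeError:
    print("iso-8859-1 cannot encode U+20AC")
```
        </aside>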
        <p>Specifications mainly address these resulting variations by
          considering each document to be a sequence of Unicode characters after
          converting from the document's character encoding (be it a legacy
          character encoding or a Unicode encoding such as UTF-8) and then
          unescaping any character escapes before proceeding to process the
          document. </p>
        <p class="note">Even within a single legacy character encoding there can
          be variations in implementation. One famous example is the legacy
          Japanese encoding <code class="kw">Shift_JIS</code>. Different
          transcoder implementations faced choices about how to map specific
          byte sequences to Unicode. So the byte sequence <code>0x81.60</code>
          (<code>0x2141</code> in the JIS X 0208 character set) was mapped by
          some implementations to <span class="uname" translate="no">U+301C
            WAVE DASH</span> while others chose <span class="uname" translate="no">U+FF5E
            FULLWIDTH TILDE</span>. This means that two reasonable,
          self-consistent, transcoders could produce different Unicode character
          sequences from the same input. The <cite>Encoding</cite> [[Encoding]]
          specification exists, in part, to ensure that Web implementations use
          interoperable and identical mappings. However, there is no guarantee
          that transcoders inconsistent with the Encoding specification won't be
          applied to documents found on the Web or used to process data
          appearing in a particular document format or protocol.</p>
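        <aside class="example">
          <p>This kind of divergence can be observed with two decoders from Python's standard
            library, used here purely as an illustration. The sketch decodes the Shift_JIS bytes
            0x81 0x60 (JIS X 0208 <code>0x2141</code>) with the <code>shift_jis</code> and
            <code>cp932</code> codecs; which of these corresponds to any particular Web
            implementation is not asserted here.</p>

```python
raw = b"\x81\x60"

via_shift_jis = raw.decode("shift_jis")
via_cp932 = raw.decode("cp932")

print(f"U+{ord(via_shift_jis):04X}")  # U+301C
print(f"U+{ord(via_cp932):04X}")      # U+FF5E

# Same input bytes, different Unicode output, so a code point comparison fails.
print(via_shift_jis == via_cp932)     # False
```
        </aside>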
      </section>
      <section id="otherEquivalences">
         <h3>Other Types of Equivalence</h3>
         <p>The preceding types of character equivalence are all based on 
		 character properties assigned by Unicode or due to the mapping of 
		 legacy character encodings to the Unicode character set. There also 
		 exist certain types of "interesting equivalence" that may be useful, 
		 particularly in searching text, that are outside of the equivalences 
		 defined by Unicode. For example, Japanese uses two syllabic scripts,
		 <code>hiragana</code> and <code>katakana</code>. A 
		 user searching a document may type in one script, but wish to find 
		 equivalent text in both scripts. These additional "text 
		 normalizations" are sometimes application, natural language, or domain 
		 specific and shouldn't be overlooked by specifications or 
		 implementations as an additional consideration.</p>
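        <aside class="example">
          <p>One possible application-level folding for the Japanese case is sketched below. It
            relies on the fact that the katakana letters U+30A1 through U+30F6 are encoded exactly
            0x60 code points above their hiragana counterparts. The helper <code>fold_kana</code>
            is hypothetical and deliberately ignores halfwidth katakana and other details a real
            search feature would need to consider.</p>

```python
# Map each katakana letter U+30A1..U+30F6 to the hiragana letter 0x60 below it.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def fold_kana(text):
    return text.translate(KATAKANA_TO_HIRAGANA)

query = "\u30ab\u30bf\u30ab\u30ca"      # "katakana" written in katakana
document = "\u304b\u305f\u304b\u306a"   # the same syllables in hiragana

print(query == document)                        # False
print(fold_kana(query) == fold_kana(document))  # True
```
        </aside>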
      </section>
    </section>
    <section id="identityMatching">
      <h2>String Matching of Syntactic Content in Document Formats and Protocols</h2>
        <p>In the Web environment, where strings can be encoded in different 
		encodings, using different character sequences, and with variations such 
		as case, it's important to
          establish a consistent process for evaluating string identity.</p>
      <p>This chapter defines the implementation and requirements for string
        matching in <a href="#def_syntactic_content" class="termref">syntactic content</a>.</p>
      <section id="matchingAlgorithm">
        <h2>The Matching Algorithm</h2>
        <p>This section defines the algorithm for matching strings. String
          identity matching MUST be performed as if the following steps were
          followed: </p>
        <ol>
          <li>Conversion to a common Unicode encoding form of the strings to be
            compared [[Encoding]].</li>
          <li>
            <p>Expansion of all character escapes and includes.</p>
            <div class="note">
              <p>The expansion of character escapes and includes is dependent on
                context, that is, on which <a href="#def_syntactic_content" class="termref">syntactic content</a> or programming language is
                considered to apply when the string matching operation is
                performed. Consider a search for the string <span class="qterm">suçon</span>
                in an XML document containing <code>su&amp;#xE7;on</code> but
                not <code>suçon</code>. If the search is performed in a plain
                text editor, the context is <span class="new-term">plain text</span>
                (no <a href="#def_syntactic_content" class="termref">syntactic content</a> or programming language applies), the <code class="kw">&amp;#xE7;</code>
                character escape is not recognized, hence not expanded and the
                search fails. If the search is performed in an XML browser, the
                context is <code>XML</code>, the character escape (defined by
                XML) is expanded and the search succeeds. </p>
              <p>An intermediate case would be an XML editor that <em>purposefully</em>
                provides a view of an XML document with entity references left
                unexpanded. In that case, a search over that pseudo-XML view
                will deliberately <em>not</em> expand entities: in that
                particular context, entity references are not considered
                includes and need not be expanded.</p>
            </div>
          </li>
          <li>Perform one of the following case foldings, as appropriate:
            <ol>
              <li><em><a href="#case-sensitive">Case sensitive</a></em>: Go to
                step 4.</li>
              <li><em><a data-lt="ASCII case-insensitive">ASCII case folding</a></em>:
                map all code points in the range 0x41 to 0x5A (A to Z) to the
                corresponding code points in the range 0x61 to 0x7A (a to z).</li>
              <li><em><a data-lt="case folding">Unicode case folding</a></em>:
                map all code points to their Unicode C+F case fold equivalents.
                Note that this can change the length of the string.</li>
            </ol>
          </li>
          <li>Remove <a href="#unicodeControls">Unicode control characters</a></li>
          <li>Test the resulting sequences of code points bit-by-bit for
            identity.</li>
        </ol>
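        <aside class="example">
          <p>The steps above can be sketched in Python as follows. This is a minimal
            illustration under stated assumptions, not a reference implementation: the names
            <code>match_key</code>, <code>strings_match</code>, and <code>IGNORABLE</code> are
            hypothetical, HTML character references stand in for "escapes and includes" (the
            correct expansion depends on context), and the set of removed characters is only an
            illustrative subset.</p>

```python
import html

ASCII_FOLD = {cp: cp + 0x20 for cp in range(0x41, 0x5B)}         # A-Z to a-z
IGNORABLE = {"\u200b", "\u200c", "\u200d", "\u00ad", "\u034f"}   # assumed subset

def match_key(raw, encoding, folding="none", expand=html.unescape):
    text = raw.decode(encoding)            # step 1: common Unicode form
    text = expand(text)                    # step 2: expand escapes
    if folding == "ascii":                 # step 3: optional case folding
        text = text.translate(ASCII_FOLD)
    elif folding == "unicode":
        text = text.casefold()             # full (C+F) folding; may change length
    # step 4: remove the invisible markers
    return "".join(ch for ch in text if ch not in IGNORABLE)

def strings_match(a, b, folding="none"):
    # a and b are (bytes, encoding) pairs; step 5 compares the code points.
    return match_key(*a, folding=folding) == match_key(*b, folding=folding)

# "\x26" is the ampersand, spelled as a Python escape.
escaped = ("su\x26#xE7;on".encode("ascii"), "ascii")
literal = ("su\u00e7on".encode("windows-1252"), "windows-1252")
print(strings_match(escaped, literal))  # True
```

          <p>Passing a different <code>expand</code> function (or the identity function for plain
            text) models the context dependence described in the note under step 2.</p>
        </aside>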
      </section>
      <section id="convertingToCommonUnicodeForm">
        <h2>Converting to a Common Unicode Form</h2>

        <p>A <dfn>normalizing transcoder</dfn> is a <a>transcoder</a> that performs
        a conversion from a <a>legacy character encoding</a> to Unicode <em>and</em> ensures that the result is in
        Unicode Normalization Form C. For most legacy character encodings, it
          is possible to construct a normalizing transcoder (by using any
          transcoder followed by a normalizer); it is not possible to do so if
          the <a>legacy character encoding</a>'s <a href="http://www.w3.org/TR/2005/REC-charmod-20050215/#def-repertoire">repertoire</a>
          contains characters not represented in Unicode.</p>

        <p>Previous versions of this document recommended the use of a <a>normalizing transcoder</a> when mapping from a 
        legacy character encoding to Unicode. Normalizing transcoders are expected to produce only character sequences in 
        Unicode Normalization Form C (NFC), although the resulting character sequence might still be partially
        de-normalized (for example, if it begins with a combining mark).</p>
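        <aside class="example">
          <p>A normalizing transcoder built as "any transcoder followed by a normalizer" might look
            like the following Python sketch. The function name <code>normalizing_decode</code> is
            hypothetical. The example uses the <code>cp1258</code> codec (windows-1258), in which
            byte 0xEC maps to <span class="uname" translate="no">U+0301 COMBINING ACUTE ACCENT</span>.</p>

```python
import unicodedata

def normalizing_decode(raw, encoding):
    # Transcode first, then normalize the result to NFC.
    return unicodedata.normalize("NFC", raw.decode(encoding))

raw = b"a\xec"                      # "a" followed by a combining acute accent
plain = raw.decode("cp1258")        # two code points: U+0061 U+0301
normalized = normalizing_decode(raw, "cp1258")

print(len(plain), len(normalized))  # 2 1

# A leading combining mark has no base character to compose with, so it
# survives normalization and can still combine with preceding text later.
leading = normalizing_decode(b"\xeca", "cp1258")
print(unicodedata.combining(leading[0]) != 0)  # True
```
        </aside>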
        
        <p>It turns out that, while most transcoders used on the Web produce Normalization Form C as their output,
        several do not, including several of the transcoders in [[Encoding]]. The difference is important if the
        transcoder is to be round-trip compatible with the source legacy character encoding or consistent with the
        transcoders used by browsers and other user-agents on the Web.</p>
        
        <div class="requirement">
          <p>[C][I] For content authors, it is RECOMMENDED that content converted from a legacy character encoding
          be normalized to Unicode Normalization Form C unless the mapping of specific characters interferes with
          the meaning.</p>
        </div>
        <div class="requirement">
          <p>[I] Authoring tools SHOULD provide a means of normalizing resources
            and warn the user when a given resource is not in Unicode
            Normalization Form C.</p>
        </div>
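        <aside class="example">
          <p>An authoring tool could implement such a check along the following lines. This is a
            sketch only: the function name <code>nfc_report</code> is hypothetical, and
            <code>unicodedata.is_normalized</code> requires Python 3.8 or later. The input is never
            modified; the normalized copy is offered so that the user can decide whether to accept
            it.</p>

```python
import unicodedata

def nfc_report(text):
    """Return (is_nfc, nfc_copy) without modifying the input."""
    if unicodedata.is_normalized("NFC", text):
        return True, text
    return False, unicodedata.normalize("NFC", text)

# "resume" with decomposed accents: e + U+0301 twice.
is_nfc, suggestion = nfc_report("re\u0301sume\u0301")
if not is_nfc:
    print("warning: resource is not in Unicode Normalization Form C")
```
        </aside>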
        <section>
          <h4>Choice of Normalization Form</h4>
          <p>Given that there are many character sequences that content authors
            or applications could choose when inputting or exchanging text, and
            that when providing text in a normalized form, there are different
            options for the normalization form to be used, what form is most
            appropriate for content on the Web?</p>
          <p>For use on the Web, it is important not to lose compatibility
            distinctions, which are often important to the content (see Chapter
            5 <span class="qterm">Characters with Compatibility Mappings</span>
            in <cite>Unicode in XML and other Markup Languages</cite>
            [[UNICODE-XML]] for a discussion). The NFKD and NFKC normalization
            forms are therefore excluded.</p>
          <p>Among the remaining two forms, NFC has the advantage that almost
            all legacy data (if transcoded trivially, one-to-one, to a Unicode
            encoding), as well as data created by current software, is already
            in this form; NFC also has a slight compactness advantage and is a
            better match to user expectations with respect to the character vs.
            <a>grapheme</a> issue. This document therefore recommends, when
            possible, that all content be stored and exchanged in Unicode
            Normalization Form C (NFC).</p>
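          <aside class="example">
            <p>The trade-offs between the four forms can be observed with a small Python sketch.
              The sample string is an arbitrary choice for this illustration: it contains the
              ligature U+FB01, the superscript U+00B2, and the precomposed letter U+00E9.</p>

```python
import unicodedata

sample = "\ufb01n\u00b2 caf\u00e9"

forms = {name: unicodedata.normalize(name, sample)
         for name in ("NFC", "NFD", "NFKC", "NFKD")}

for name, value in forms.items():
    print(name, len(value), ascii(value))

# NFC leaves this sample untouched; NFD expands U+00E9 into two code points;
# the K forms additionally replace the ligature and the superscript with
# plain letters and digits, losing the compatibility distinction.
```
          </aside>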
        </section>
        <section id="content-reqs">
          <h4>Requirements for Resources</h4>
          <p>These requirements pertain to the authoring and creation of
            documents and are intended as guidelines for resource authors.</p>
          <div class="requirement">
            <p>[C] Resources SHOULD be produced, stored, and exchanged in
              Unicode Normalization Form C (NFC). </p>
          </div>
          <div class="note">
            <p>In order to be processed correctly a resource must use a
              consistent sequence of code points to represent text. While
              content can be in any normalization form or may use a
              de-normalized (but valid) Unicode character sequence,
              inconsistency of representation will cause implementations to
              treat the different sequences as "different". The best way to
              ensure consistent selection, access, extraction, processing, or
              display is to always use NFC. </p>
          </div>
          <div class="requirement">
            <p>[I] Implementations MUST NOT normalize any resource during
              processing, storage, or exchange except with explicit permission
              from the user.</p>
          </div>
          <div class="note">
          <p>The [[!Encoding]] specification includes a number of <a>transcoders</a> that do not produce
             Unicode text in a normalized form when converting to Unicode from a legacy character encoding.
             This is necessary to preserve round-trip behavior and other character distinctions. Indeed, many
             compatibility characters in Unicode exist solely for round-trip conversion from legacy encodings.
             Earlier versions of this specification recommended or required that implementations use a 
             normalizing transcoder that produced Unicode Normalization Form C (NFC), but, given that this
             is at odds with how transcoders are actually implemented, this version no longer includes
             this requirement. Bear in mind that most transcoders produce NFC output and that even those
             transcoders that do not produce NFC for all characters mainly produce NFC for the preponderance
             of characters. In particular, there are no commonly-used transcoders that produce decomposed forms where 
             precomposed forms exist or which produce a different combining character sequence from the
             normalized sequence.</p>

          </div>
          <div class="requirement">
            <p>[C] Authors SHOULD NOT include combining marks without a
              preceding base character in a resource.</p>
          </div>

          <p>There can be exceptions to this. For example, when making a list of
            characters (such as a list of [[!Unicode]] characters), an author might want to use 
		  combining marks without a corresponding base character. However, use 
		  of a combining mark without a base character can cause
            unintentional display or, with naive implementations that combine the
            combining mark with adjacent syntactic content or other natural language
            content, processing problems. For example, if you were to use 
            a combining mark, such as the character 
            <span class="uname" translate="no">U+0301 Combining Acute Accent</span>,
            as the start of a "class" attribute value in HTML, the class name
            might not display properly in your editor.</p>
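          <aside class="example">
            <p>A tool could detect this situation with a check like the following Python sketch.
              The function name <code>starts_with_combining_mark</code> is hypothetical, and the
              test is deliberately simple: it treats any character whose Unicode general category
              begins with "M" (Mn, Mc, or Me) as a combining mark.</p>

```python
import unicodedata

def starts_with_combining_mark(text):
    # An empty string has no first character and is reported as False.
    return bool(text) and unicodedata.category(text[0]).startswith("M")

print(starts_with_combining_mark("\u0301abc"))  # True
print(starts_with_combining_mark("a\u0301bc"))  # False
```
          </aside>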
        </section>
        <div class="requirement">
          <p>[S] Specifications of text-based formats and protocols MAY specify
            that all or part of the textual content of that format or protocol
            is normalized using Unicode Normalization Form C (NFC).</p>
        </div>
        <p>Specifications are generally discouraged from requiring formats or
          protocols to store or exchange data in a normalized form unless there
          are specific, clear reasons why the additional requirement is
          necessary. As many document formats on the Web do not require
          normalization, content authors might occasionally rely on denormalized
          character sequences and a normalization step could negatively affect
          such content.</p>
        <div class="note">
          <p>Requiring NFC requires additional care on the part of the
            specification developer, as content on the Web generally is not in a
            known normalization state. Boundary and error conditions for
            denormalized content need to be carefully considered and well
            specified in these cases. </p>
        </div>
        <section id="non-normalizing">
          <h4> Non-Normalizing Specification Requirements </h4>

          <p>The following requirements pertain to any specification that
            specifies explicitly that normalization is not to be applied
            automatically to content (which SHOULD include all new
            specifications): </p>
          <div class="requirement">
            <p>[S] Specifications that do not normalize MUST document or provide
              a health-warning if canonically equivalent but disjoint Unicode
              character sequences represent a security issue. </p>
          </div>
          <div class="requirement">
            <p>[S][I] Specifications and implementations MUST NOT assume that
              content is in any particular normalization form. </p>
          </div>
          <p>The normalization form or lack of normalization for any given
            content has to be considered intentional in these cases.</p>
          <div class="requirement">
            <p>[I] Implementations MUST NOT alter the normalization form of
              content being exchanged, read, parsed, or processed except when
              required to do so as a side-effect of transcoding the content to a
              Unicode character encoding, as content might depend on the
              de-normalized representation. </p>
          </div>
          <p class="issue"> The following requirement was noted by Mati as
            being problematic. It was not marked with mustard and needs further
            consideration. </p>
          <div class="requirement">
            <p>[S] Specifications MUST specify that string matching takes the
              form of "code point-by-code point" comparison of the Unicode
              character sequence, or, if a specific Unicode character encoding
              is specified, code unit-by-code unit comparison of the sequences.
            </p>
          </div>
          <p class="issue">Following requirements added 2013-10-29. Needs
            discussion of regular expressions.</p>
          <div class="requirement">
            <p>[S][I] Specifications that define a regular expression syntax
              MUST provide at least Basic Unicode Level 1 support per [[!UTS18]]
              and SHOULD provide Extended or Tailored (Levels 2 and 3) support.</p>
          </div>
        </section>
        <section id="normalizing-spec">
          <h4> Unicode Normalizing Specification Requirements </h4>
          <p>This section contains requirements for specifications of text-based formats and protocols that define
            Unicode Normalization as a requirement. New specifications SHOULD NOT require normalization
            unless special circumstances apply.</p>
          <div class="requirement">
            <p>[S] Specifications of text-based formats and protocols that, as
              part of their syntax definition, require that the text be in
              normalized form MUST define string matching in terms of normalized
              string comparison and MUST define the normalized form to be NFC. </p>
          </div>
          <div class="requirement">
            <p>[S] [I] A normalizing text-processing component which receives
              suspect text MUST NOT perform any normalization-sensitive
              operations unless it has first either confirmed through inspection
              that the text is in normalized form or it has re-normalized the
              text itself. Private agreements MAY, however, be created within
              private systems which are not subject to these rules, but any
              externally observable results MUST be the same as if the rules had
              been obeyed. </p>
          </div>
          <div class="requirement">
            <p>[I] A normalizing text-processing component which modifies text
              and performs normalization-sensitive operations MUST behave as if
              normalization took place after each modification, so that any
              subsequent normalization-sensitive operations always behave as if
              they were dealing with normalized text. </p>
          </div>
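          <aside class="example">
            <p>Concatenation is a simple case of a modification that can denormalize text. In the
              Python sketch below, both inputs are in NFC but their concatenation is not. The
              helper <code>nfc_concat</code> is hypothetical, and
              <code>unicodedata.is_normalized</code> requires Python 3.8 or later.</p>

```python
import unicodedata

def nfc_concat(left, right):
    # Behave as if normalization took place after the modification.
    return unicodedata.normalize("NFC", left + right)

left, right = "cafe", "\u0301s"   # right begins with a combining acute accent

both_nfc = (unicodedata.is_normalized("NFC", left)
            and unicodedata.is_normalized("NFC", right))
naive_is_nfc = unicodedata.is_normalized("NFC", left + right)

print(both_nfc, naive_is_nfc)              # True False
print(ascii(nfc_concat(left, right)))      # 'caf\xe9s'
```
          </aside>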
          <div class="requirement">
            <p>[S] Specifications of text-based languages and protocols SHOULD
              define precisely the construct boundaries necessary to obtain a
              complete definition of full-normalization. These definitions
              SHOULD include at least the boundaries between syntactic content and
              character data as well as entity boundaries (if the language has
              any include mechanism), SHOULD include any other boundary that
              may create denormalization when instances of the language are
              processed, but SHOULD NOT include character escapes designed to
              express arbitrary characters. </p>
          </div>
          <div class="requirement">
            <p>[I] Authoring tool implementations for a formal language that
              does not mandate full-normalization SHOULD either prevent users
              from creating content with composing characters at the beginning
              of constructs that may be significant, such as at the beginning of
              an entity that will be included, immediately after a construct
              that causes inclusion or immediately after syntactic content, or SHOULD warn
              users when they do so. </p>
          </div>
          <div class="requirement">
            <p>[S] Where operations can produce denormalized output from
              normalized text input, specifications of API components
              (functions/methods) that implement these operations MUST define
              whether normalization is the responsibility of the caller or the
              callee. Specifications MAY state that performing normalization is
              optional for some API components; in this case the default SHOULD
              be that normalization is performed, and an explicit option SHOULD
              be used to switch normalization off. Specifications SHOULD NOT
              make the implementation of normalization optional. </p>
          </div>
          <div class="requirement">
            <p>[S] Specifications that define a mechanism (for example an API or
              a defining language) for producing textual data objects SHOULD
              require that the final output of this mechanism be normalized. </p>
          </div>
        </section>
      </section>
      <section id="expandingCharacterEscapes">
        <h2>Expanding Character Escapes and Includes</h2>
        <p>Character escapes, such as HTML's numeric character references (for example, <code>&amp;#x20AC;</code>)
        or named entity references (<code>&amp;amp;</code>), and other included values that are intended
        to form part of matched string values require expansion when matching strings.</p>
        <p class="issue">Edit me!</p>
      </section>
      <section id="handlingCaseFolding">
        <h2>Handling Case Folding</h2>
        <p>As described <a href="#definitionCaseFolding">above</a>, one 
		important consideration in string identity matching is whether the
          comparison is case sensitive or case insensitive.</p>
        <div class="requirement">
          <p>[S] <a href="#case-sensitive">Case sensitive</a> matching is
            RECOMMENDED as the default for new protocols and formats.</p>
        </div>
        <p>However, cases exist in which case-insensitivity is desirable.</p>
        <p>Where case-insensitive matching is desired, there are several
          implementation choices that a formal language needs to consider. If
          the vocabulary of strings to be compared is limited to the Basic Latin
          (ASCII) subset of Unicode, ASCII case-insensitive matching MAY be
          used.</p>
        <p>If the vocabulary of strings to be compared is not limited, then <a>ASCII
            case-insensitive</a> matching MUST NOT be used. <a href="#uci">Unicode
            case-insensitive</a> matching MUST be applied, even if the
          vocabulary does not allow the full range of Unicode.</p>
        <p><a href="#uci">Unicode case-insensitive</a> matching can take several
          forms. Unicode defines the "common" (C) casefoldings for characters
          that always have 1:1 mappings of the character to its case folded form
          and this covers the majority of characters that have a case folding. A
          few characters in Unicode have a 1:many case folding. This 1:many
          mapping is called the "full" (F) case fold mapping. For compatibility
          with certain types of implementation, Unicode also defines a "simple"
          (S) case fold that is always 1:1.</p>
        <div class="requirement">
          <p>Because the "simple" case-fold mapping removes information that can
            be important to forming an identity match, the "Common plus Full"
            (or "Unicode C+F") case fold mapping is RECOMMENDED for Unicode
            case-insensitive matching.</p>
        </div>
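        <aside class="example">
          <p>Python's <code>str.casefold()</code> applies full case folding, so it can be used to
            illustrate the 1:many mappings mentioned above. The function name
            <code>unicode_ci_equal</code> is hypothetical. Python has no built-in for the "simple"
            fold, so only the full form is shown.</p>

```python
def unicode_ci_equal(a, b):
    # Compare the full case folds of both strings.
    return a.casefold() == b.casefold()

sharp_s = "\u00df"                 # LATIN SMALL LETTER SHARP S
print(len(sharp_s), len(sharp_s.casefold()))           # 1 2
print(unicode_ci_equal("Stra\u00dfe", "STRASSE"))      # True
print("\ufb01".casefold() == "fi")                     # True (fi ligature)
```

          <p>A fold restricted to 1:1 mappings would leave the sharp s as a single code point, so
            the second comparison would fail.</p>
        </aside>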
        <p>A vocabulary is considered to be "ASCII-only" if and only if all
          tokens and identifiers are defined by the specification directly and
          these identifiers or tokens use only the Basic Latin subset of
          Unicode. If user-defined identifiers are permitted, the full range of
          Unicode characters (limited, as appropriate, for security or
          interchange concerns, see [[UTR36]]) SHOULD be allowed and Unicode
          case insensitivity used for identity matching.</p>
        <div class="requirement">
          <p>ASCII case-insensitive matching MUST only be applied to
            vocabularies that are restricted to ASCII. Unicode
            case-insensitivity MUST be used for all other vocabularies.</p>
        </div>
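        <aside class="example">
          <p>The difference between the two kinds of case-insensitive matching can be sketched in
            Python. Both function names are hypothetical; <code>ascii_ci_equal</code> folds only
            the code points 0x41 through 0x5A, while <code>unicode_ci_equal</code> relies on
            <code>str.casefold()</code>.</p>

```python
ASCII_FOLD = {cp: cp + 0x20 for cp in range(0x41, 0x5B)}   # A-Z to a-z

def ascii_ci_equal(a, b):
    return a.translate(ASCII_FOLD) == b.translate(ASCII_FOLD)

def unicode_ci_equal(a, b):
    return a.casefold() == b.casefold()

print(ascii_ci_equal("DIV", "div"))                     # True
print(ascii_ci_equal("\u00c9COLE", "\u00e9cole"))       # False
print(unicode_ci_equal("\u00c9COLE", "\u00e9cole"))     # True

# U+212A KELVIN SIGN folds to "k" under Unicode case folding only.
print(ascii_ci_equal("\u212a", "k"), unicode_ci_equal("\u212a", "k"))  # False True
```

          <p>The ASCII fold is adequate for a vocabulary restricted to ASCII, but silently fails to
            match as soon as other characters are permitted.</p>
        </aside>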
        <p>Note that an ASCII-only vocabulary can exist inside a document format
          or protocol that allows a larger range of Unicode in identifiers or
          values.</p>
        <p class="issue">Insert example from CSS here.</p>
        <div class="requirement">
          <p>Case sensitive matching is RECOMMENDED as the default for any new
            protocol or format.</p>
        </div>
        <p>Case-sensitive matching is the easiest to implement and introduces
          the least potential for confusion, since it generally consists of a
          comparison of the underlying Unicode code point sequence. Because it
          is not affected by considerations such as language-specific case
          mappings, it produces the least surprise for document authors that
          have included words (such as the Turkish example above) in their
          syntactic content.</p>
        <div class="requirement">
          <p>If the vocabulary is not restricted to ASCII or permits
            user-defined values that use a broader range of Unicode, ASCII
            case-insensitive matching MUST NOT be required.</p>
        </div>
        <div class="requirement">
          <p>[S][I] The <a>Unicode C+F</a> case-fold form is RECOMMENDED as the
            case-insensitive matching for <a data-lt="vocabulary">vocabularies</a>.
            The Unicode C+S form MUST NOT be used for string identity matching
            on the Web.</p>
        </div>
        <p>Language-sensitive case-insensitive matching in document formats and
          protocols is NOT RECOMMENDED because language information can be hard
          to obtain, verify, or manage and the resulting operations can produce
          results that frustrate users.</p>
        <div class="requirement">
          <p>[C] Identifiers SHOULD use consistent case (upper, lower, mixed
            case) to facilitate matching, even if case-insensitive matching is
            supported by the format or implementation. </p>
        </div>
        <section id="formal-language">
          <h4>Requirements for Specifications</h4>
          <p>These requirements pertain to specifications for document formats
            or programming/scripting languages and their implementations.</p>
          <div class="requirement">
            <p>[S][I] Specifications and implementations that define string
              matching as part of the definition of a format, protocol, or
              formal language (which might include operations such as parsing,
              matching, tokenizing, etc.) MUST define the criteria and matching
              forms used. These MUST be one of: </p>
            <ul>
              <li>Case-sensitive</li>
              <li>Unicode case-insensitive using Unicode case-folding C+F</li>
              <li>ASCII case-insensitive</li>
            </ul>
          </div>
          <div class="requirement">
            <p>[S] Specifications SHOULD NOT specify case-insensitive comparison
              of strings.</p>
          </div>
          <div class="requirement">
            <p>[S] Specifications that specify case-insensitive comparison for
              non-ASCII vocabularies SHOULD specify Unicode case-folding C+F.</p>
          </div>
          <p>In some limited cases, locale- or language-specific tailoring might
            also be appropriate. However, such cases are generally linked to
            natural language processing operations. Because they produce
            potentially different results from the generic case folding rules,
            these should be avoided in formal languages, where predictability is
            at a premium. </p>
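          <p>To illustrate why tailoring reduces predictability, the sketch
            below compares the generic folding with a hypothetical Turkic
            tailoring. The tailoring table is an assumption modelled on the "T"
            entries of the Unicode <code>CaseFolding.txt</code> data file, and
            the function names are illustrative only.</p>

```python
# Hypothetical Turkic tailoring, modelled on the "T" entries in CaseFolding.txt:
# U+0049 (I) folds to U+0131 (dotless i) and U+0130 folds to U+0069 (i).
_TURKIC = str.maketrans({"I": "\u0131", "\u0130": "i"})

def fold_default(s: str) -> str:
    """Generic, language-independent case folding."""
    return s.casefold()

def fold_turkic(s: str) -> str:
    """Apply the tailored mappings first, then the generic folding."""
    return s.translate(_TURKIC).casefold()

keyword = "title"
candidate = "TITLE"
print(fold_default(candidate) == fold_default(keyword))  # True
print(fold_turkic(candidate) == fold_turkic(keyword))    # False
```

          <p>The same pair of strings matches or fails to match depending on
            which folding is in effect, which is the kind of variation a
            formal language usually needs to avoid.</p>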
          <div class="requirement">
            <p>[S] Specifications MAY specify ASCII case-insensitive comparison
              for portions of a format or protocol that are restricted to an
              ASCII-only vocabulary.</p>
          </div>
          <p>This requirement applies to formal languages whose keywords are all
            ASCII and which do not allow user-defined names or identifiers. An
            example of this is HTML, which defines the use of ASCII
            case-insensitive comparison for element and attribute names defined
            by the HTML specification.</p>
          <div class="requirement">
            <p>[S][I] Specifications and implementations MUST NOT specify
              ASCII-only case-insensitive matching for values or constructs that
              permit non-ASCII characters. </p>
          </div>
        </section>
        <section id="non-normalizing-requirements">
          <h4> Non-Normalizing Specification Requirements </h4>
          <div class="requirement">
            <p>[S][I] For vocabularies and values that are not restricted to
              Basic Latin (ASCII), case-insensitive matching MUST specify either
              Unicode C+F or locale-sensitive string comparison. </p>
          </div>
        </section>
      </section>
      <section id="handlingUnicodeControls">
        <h2>Handling Unicode Controls and Invisible Markers</h2>
        <div class="requirement">
          <p>Applications that do string matching SHOULD ignore Unicode
            formatting controls such as variation selectors; grapheme or word
            joiners; or other non-semantic controls.</p>
        </div>
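        <p>One possible way to ignore such characters before comparison is
          sketched below. The selection rule is an assumption rather than a
          definition from this document: it drops characters in general
          category Cf, variation selectors, and U+034F COMBINING GRAPHEME
          JOINER. Some format characters are meaningful in particular scripts,
          so a real implementation might need a narrower list. The name
          <code>strip_invisible</code> is hypothetical.</p>

```python
import unicodedata

def strip_invisible(s: str) -> str:
    """Drop format controls, variation selectors, and the grapheme joiner."""
    kept = []
    for ch in s:
        if unicodedata.category(ch) == "Cf":
            continue  # e.g. U+200D ZWJ, U+2060 WORD JOINER, U+200E LRM
        if "VARIATION SELECTOR" in unicodedata.name(ch, ""):
            continue  # e.g. U+FE00..U+FE0F and U+E0100..U+E01EF
        if ch == "\u034F":
            continue  # COMBINING GRAPHEME JOINER (category Mn, not Cf)
        kept.append(ch)
    return "".join(kept)

print(strip_invisible("a\u2060b") == "ab")             # True
print(strip_invisible("\u2764\uFE0F") == "\u2764")     # True
```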
      </section>
    </section>
    <section id="searching">
      <h2>String Searching in Natural Language Content</h2>
      <p>Many Web implementations and applications have a different sort of
        string matching requirement from the one described above: the need for
        users to search documents for particular words or phrases of text. This
        section addresses the various factors that an implementer might need
        to consider when implementing natural language text processing on the
        Web <em>other than</em> that mandated by a formal language or document
        format.</p>
      <p>There are several different kinds of string searching.</p>
      <p>When you are using a search engine, you are generally using a form of
        full text search. <dfn>Full text search</dfn> generally breaks natural
        language text into word segments and may apply complex processing to get
        at the semantic "root" values of words. For example, if the user
        searches for "run", you might want to find words like "running", "ran",
        or "runs" in addition to the actual search term "run". This process,
        naturally, is sensitive to language, context, and many other aspects of
        textual variation. It is also beyond the scope of this document.</p>
      <p>Another form of string searching, which we'll concern ourselves with
        here, is sub-string matching or "find" operations. This is the direct
        searching of the body or "corpus" of a document with the user's input.
        Find operations can have different options or implementation details,
        such as the addition or removal of case sensitivity, or whether the
        feature supports different aspects of a regular expression language or
        "wildcards".</p>
      <section id="searchingConsiderations">
        <h2>Considerations for Matching Natural Language Content</h2>
        <p class="issue">This section was identified as a new area needing
          documentation as part of the overall rearchitecting of the document. The
          text here is incomplete and needs further development. Contributions
          from the community are invited.</p>
        <p>Searching content (for example, using the "find" command in your
          browser) generates different user expectations, and thus has
          different requirements, from the absolute identity matching needed by
          document formats and protocols. Searching text has different
          contextual needs and often provides different features.</p>
        <p>One description of Unicode string searching can be found in Section 8
          (Searching and Matching) of [[UTS10]].</p>
        <p>One of the primary considerations for string searching is that, quite
          often, the user's input is not identical to the way that the text is
          encoded in the content being searched. Users generally expect matching
          to be more "promiscuous", particularly when they put no additional
          effort into their input. For example, they expect a term entered in
          lowercase to match uppercase equivalents. Conversely, when the user
          expends more effort on the input (by using the shift key to produce
          uppercase or by entering a letter with diacritics instead of just the
          base letter), they expect their search results to match only their
          more-specific input.</p>
        <p>This effect might vary depending on context as well. For example, a
          person using a physical keyboard may have direct access to accented
          letters, while a virtual or on-screen keyboard may require extra
          effort to access and select the same letters.</p>
        <p>Consider a document containing these strings: "re-resume",
          "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ".</p>
        <p>In the table below, the user's input (on the left) might be
          considered a match for the above items as follows:</p>
        <table class="data">
          <tbody>
            <tr>
              <th scope="col">User Input</th>
              <th scope="col">Matched Strings</th>
            </tr>
            <tr>
              <td>e (lowercase 'e')</td>
              <td>"re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ"</td>
            </tr>
            <tr>
              <td>E (uppercase 'E')</td>
              <td>"RE-RESUME" and "RE-RÉSUMÉ"</td>
            </tr>
            <tr>
              <td>é (lowercase 'e' with acute accent)</td>
              <td>"re-résumé" and "RE-RÉSUMÉ"</td>
            </tr>
            <tr>
              <td>É (uppercase 'E' with acute accent)</td>
              <td>"RE-RÉSUMÉ"</td>
            </tr>
          </tbody>
        </table>
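        <p>The asymmetric behavior in the table can be sketched as follows.
          This is one possible reading of the table, not a prescribed
          algorithm: an accent or an uppercase letter in the query is treated
          as a constraint, while unaccented lowercase input matches loosely.
          All function names are hypothetical, and the base-plus-marks
          segmentation is only a rough stand-in for grapheme clusters.</p>

```python
import unicodedata

def clusters(s: str) -> list:
    """Split NFD text into (base, marks) pairs."""
    out = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and out:
            base, marks = out[-1]
            out[-1] = (base, marks + ch)
        else:
            out.append((ch, ""))
    return out

def unit_matches(query, target) -> bool:
    q_base, q_marks = query
    t_base, t_marks = target
    if q_marks and q_marks != t_marks:
        return False             # the user typed an accent: require it
    if q_base != q_base.casefold():
        return q_base == t_base  # the user typed uppercase: require it
    return q_base == t_base.casefold()

def found(query: str, text: str) -> bool:
    """True if the query matches some run of clusters in the text."""
    q, t = clusters(query), clusters(text)
    return any(
        all(unit_matches(qu, tu) for qu, tu in zip(q, t[i:i + len(q)]))
        for i in range(len(t) - len(q) + 1)
    )

docs = ["re-resume", "RE-RESUME", "re-r\u00E9sum\u00E9", "RE-R\u00C9SUM\u00C9"]
for query in ["e", "E", "\u00E9", "\u00C9"]:
    print(query, [d for d in docs if found(query, d)])
```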
        <p>In addition to variations of case or the use of accents, Unicode also
          has an array of canonical equivalents or compatibility characters (as
          described in the sections above) that might impact string searching.</p>
        <p>For example, consider the letter "K". Characters whose canonical or
          compatibility decomposition contains <code>U+004B LATIN CAPITAL
          LETTER K</code> include:</p>
        <ol>
          <li>Ķ U+0136</li>
          <li>Ǩ U+01E8</li>
          <li>ᴷ U+1D37</li>
          <li>Ḱ U+1E30</li>
          <li>Ḳ U+1E32</li>
          <li>Ḵ U+1E34</li>
          <li>K U+212A</li>
          <li>Ⓚ U+24C0</li>
          <li>㎅ U+3385</li>
          <li>㏍ U+33CD</li>
          <li>㏎ U+33CE</li>
          <li>Ｋ U+FF2B</li>
          <li>(a variety of mathematical symbols such as U+1D40A, U+1D43E,
            U+1D472, U+1D4A6, U+1D4DA)</li>
          <li>🄚 U+1F11A</li>
          <li>🄺 U+1F13A</li>
        </ol>
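        <p>The relationship between the characters listed above and "K" can be
          checked with the decomposition data in Python's standard
          <code>unicodedata</code> module, as in the sketch below. The helper
          name <code>decomposes_to_k</code> is hypothetical, and the results
          depend on the Unicode version bundled with the interpreter.</p>

```python
import unicodedata

def decomposes_to_k(ch: str, form: str = "NFKD") -> bool:
    """True if the chosen decomposition of ch contains U+004B."""
    return "K" in unicodedata.normalize(form, ch)

samples = ["\u0136", "\u212A", "\u24C0", "\u3385", "\uFF2B",
           "\U0001D40A", "\U0001F11A"]

for ch in samples:
    canonical = decomposes_to_k(ch, "NFD")
    compat = decomposes_to_k(ch, "NFKD")
    print(f"U+{ord(ch):04X} canonical={canonical} compatibility={compat}")
```

        <p>Some of these characters (such as U+0136 and U+212A) reach "K"
          through canonical decomposition alone, while others (such as U+24C0
          and U+FF2B) do so only under compatibility decomposition.</p>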
        <p>Other differences include Unicode Normalization forms (or lack
          thereof). There are also ignorable characters (such as the variation
          selectors), whitespace differences, bidirectional controls, and other
          code points that can interfere with a match. </p>
        <p>Users might also expect certain kinds of equivalence to be applied to
          matching. For example, a Japanese user might expect that hiragana,
          katakana, and half-width compatibility katakana equivalents all match
          each other (regardless of which form is used in the search input or
          encoded in the text).</p>
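        <p>A rough sketch of such a kana-insensitive search key follows. It
          assumes that NFKC normalization is an acceptable way to widen
          half-width katakana (it also folds many unrelated compatibility
          characters) and relies on katakana U+30A1 to U+30F6 being encoded
          exactly 0x60 above the corresponding hiragana. The name
          <code>kana_key</code> is hypothetical.</p>

```python
import unicodedata

# Katakana U+30A1..U+30F6 sit exactly 0x60 above the matching hiragana.
_KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def kana_key(s: str) -> str:
    """Search key that treats hiragana, katakana, and half-width katakana alike."""
    widened = unicodedata.normalize("NFKC", s)  # half-width to full-width
    return widened.translate(_KATA_TO_HIRA)     # katakana to hiragana

half_width = "\uFF76\uFF80\uFF76\uFF85"
katakana = "\u30AB\u30BF\u30AB\u30CA"
hiragana = "\u304B\u305F\u304B\u306A"
print(kana_key(half_width) == kana_key(katakana) == kana_key(hiragana))  # True
```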
        <p>When searching text, the concept of "grapheme boundaries" and
          "user-perceived characters" can be important. See Section 3 of <cite>Character
            Model for the World Wide Web: Fundamentals</cite> [[!CHARMOD]] for a
          description. For example, if the user has entered a capital "A" into a
          search box, should the software find the character À (<span class="uname"
            translate="no">U+00C0 LATIN CAPITAL LETTER A WITH GRAVE</span>)?
          What about the character "A" followed by <span class="uname"
            translate="no">U+0300 COMBINING GRAVE ACCENT</span>? What about
          writing systems, such as Devanagari, which use combining marks to
          suppress or express certain vowels?</p>
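        <p>The questions above can be made concrete with a short sketch. It
          assumes that "ignoring accents" is implemented by decomposing to NFD
          and dropping every character whose general category begins with "M".
          That is a deliberately naive policy, and the name
          <code>strip_marks</code> is hypothetical.</p>

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Naive accent-insensitive key: decompose, then drop all mark characters."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

precomposed = "\u00C0"   # LATIN CAPITAL LETTER A WITH GRAVE
combining = "A\u0300"    # "A" followed by COMBINING GRAVE ACCENT

print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(strip_marks(precomposed) == "A")                         # True

# Devanagari KA followed by VOWEL SIGN I: dropping the mark changes the syllable.
print(strip_marks("\u0915\u093F") == "\u0915")                 # True
```

        <p>The last line shows why a mark-stripping rule that suits Latin text
          can be inappropriate for other writing systems: the vowel sign is
          part of what the user perceives as the character.</p>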
      </section>
    </section>
    <section>
      <h2 id="changeLog" class="informative">Changes Since the Last Published
        Version</h2>
      <p>The following changes have been made since the <a href="http://www.w3.org/TR/2014/WD-charmod-norm-20140715/Overview.html">Working
          Draft</a> of 2014-07-15: </p>
      <ul>
        <li>Added this change log.</li>
        <li>Moved the section <a href="#unicodeNormalization">Unicode
            Normalization</a> after the section <a href="#definitionCaseFolding">Casefolding</a>
          and adjusted text appropriately</li>
        <li>Added the example and explanatory text about case matching of the
          HTML fragment in the section <a href="#definitionCaseFolding">Casefolding</a></li>
        <li>Added the definitions for "grapheme cluster" and "grapheme" in <a href="#terminology">Terminology
            and Notation</a></li>
        <li>Addition of section discussing <a href="#unicodeControls">Unicode
            controls</a>, including a new requirement.</li>
        <li>Shakespeare -&gt; natural language content; Wildebeest -&gt;
          resource; namespace -&gt; vocabulary</li>
        <li>Changed order of sections in section on "The String Matching
          Problem"</li>
        <li>Edited intro and integrated the case folding text from the string
          matching algorithm into the case folding section.</li>
        <li>Replaced the table in Section 2.2.1 as a first attempt to fix the
          various examples we borrowed from UAX15.</li>
        <li>Replaced first table in normalization section with a list of
          examples, addressing existing ednote.</li>
        <li> Extensive changes to incorporate the "standard" styles for
          International docs.</li>
        <li>Added explanatory text to the compatibility equivalents examples.
          Added characters to the table to further illustrate each category.
          Removed the "note" marker around additional explanatory text and
          edited. Removed the ednote saying this was needed.</li>
        <li>Changes to SOTD and top matter to reflect new i18n publication
          process.</li>
      </ul>
      <p>See the <a href="https://github.com/w3c/charmod-norm/commits/gh-pages">GitHub
          commit log</a> for more details.</p>
    </section>
    <section>
      <h2 id="Acknowledgements" class="informative">Acknowledgements</h2>
      <p>The W3C Internationalization Working Group and Interest Group, as well
        as others, provided many comments and suggestions. The Working Group
        would like to thank: Mati Allouche, Ebrahim Byagowi, John Cowan, Martin
        Dürst, Behdad Esfahbod, John Klensin, Amir Sarabadani, and all of the
        CharMod contributors over the many years of this document's
        development.</p>
      <p>The previous version of this document was edited by:</p>
      <ul>
        <li>François Yergeau, Invited Expert (and before at Alis Technologies)</li>
        <li>Martin J. Dürst, (until Dec 2004 while at W3C)</li>
        <li>Richard Ishida, W3C (and before at Xerox)</li>
        <li>Misha Wolf, (until Dec 2002 while at Reuters Ltd.)</li>
        <li>Tex Texin, (until Dec 2004 while an Invited Expert, and before at
          Progress Software)</li>
      </ul>
    </section>
  </body>
</html>