[giow] (0) Try to clean up the stuff about Unicode characters.

Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=12100

git-svn-id: http://svn.whatwg.org/webapps@6184 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Jun 3, 2011
1 parent cf5c3bc commit 45c598c8be892fb6c52622e1dc762a3fb788d724
Showing with 106 additions and 138 deletions.
  1. +35 −46 complete.html
  2. +33 −44 index
  3. +38 −48 source
different <meta charset> elements applying in each case.
-->

<p>The term <dfn title="">Unicode character</dfn> is used to mean a
<i title="">Unicode scalar value</i> (i.e. any Unicode code point
that is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
<p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
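
As a rough illustration of the definition above (not part of the commit): a code point is a Unicode scalar value when it lies in the range U+0000 to U+10FFFF and is not in the surrogate range U+D800 to U+DFFF. The helper name below is hypothetical, chosen only for this sketch.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """Return True if code_point is a Unicode scalar value.

    Hypothetical helper: a scalar value is any code point in
    U+0000..U+10FFFF that is not a surrogate (U+D800..U+DFFF).
    """
    in_unicode_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_unicode_range and not is_surrogate
```

Under this sketch, U+0041 and U+10FFFF qualify, while U+D800 and anything above U+10FFFF do not.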



is passed an Infinity or Not-a-Number (NaN) value, a
<code><a href=#not_supported_err>NOT_SUPPORTED_ERR</a></code> exception must be raised.</p>

<p>Except where otherwise specified, if a method has an argument
of type <code>DOMString</code>, or if an IDL attribute is assigned
a new value of type <code>DOMString</code>, the user agent must
<span title=dfn-obtain-unicode>convert the
<code>DOMString</code> to a sequence of Unicode characters</span>
to obtain the string on which the algorithms in this specification
are to operate. <a href=#refsWEBIDL>[WEBIDL]</a></p>

</dd>

<dt>JavaScript</dt>
characters as defined by UTF-8.</p>

<p>If any percent-encoded octets in that component are not valid
UTF-8 sequences, then return an error and abort these steps.</p>
UTF-8 sequences (e.g. sequences of percent-encoded octets that
expand to surrogate code points), then return an error and abort
these steps.</p>

<p>Apply the IDNA ToASCII algorithm to the matching substring,
with both the AllowUnassigned and UseSTD3ASCIIRules flags

<dd>

<p>The contents of that file, interpreted as string of
Unicode characters, are the script source.</p>
<p>The contents of that file, interpreted as a Unicode
string, are the script source.</p>

<p>To obtain the string of Unicode characters, the user
agent run the following steps:</p>
<p>To obtain the Unicode string, the user agent run the
following steps:</p>

<ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content
Type metadata</a>, if any, specifies a character
star = %x002A ; U+002A ASTERISK (*)
slash = %x002F ; U+002F SOLIDUS (/)
not-newline = %x0000-0009 / %x000B-10FFFF
; a Unicode character other than U+000A LINE FEED (LF)
; a <a href=#unicode-character>Unicode character</a> other than U+000A LINE FEED (LF)
not-star = %x0000-0029 / %x002B-10FFFF
; a Unicode character other than U+002A ASTERISK (*)
; a <a href=#unicode-character>Unicode character</a> other than U+002A ASTERISK (*)
not-slash = %x0000-002E / %x0030-10FFFF
; a Unicode character other than U+002F SOLIDUS (/)</pre>
; a <a href=#unicode-character>Unicode character</a> other than U+002F SOLIDUS (/)</pre>

<p class=note>This corresponds to putting the contents of the
element in JavaScript comments.</p>
parsing the provided byte stream. If the stream lacks this WebVTT
file signature, then the parser aborts.</p>

<p>When converting the bytes into Unicode characters, if the
encoding used is UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as
UTF-8, with error handling">decoded with the error handling</a>
defined in this specification, and all U+0000 NULL characters must
be replaced by U+FFFD REPLACEMENT CHARACTERs.</p>
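
A minimal sketch of the decoding and U+0000 replacement described here, assuming Python's built-in `errors="replace"` handler as a stand-in for the specification's own "decoded as UTF-8, with error handling" algorithm. The two may differ in how many U+FFFD characters they emit for a given malformed sequence. The function name is hypothetical.

```python
def decode_webvtt_bytes(data: bytes) -> str:
    """Decode bytes as UTF-8 and replace U+0000 with U+FFFD.

    Assumption: Python's 'replace' error handler only approximates
    the specification's UTF-8 error handling.
    """
    text = data.decode("utf-8", errors="replace")
    # Replace every U+0000 NULL with U+FFFD REPLACEMENT CHARACTER.
    return text.replace("\x00", "\ufffd")
```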

<p>The <dfn id=webvtt-parser-algorithm>WebVTT parser algorithm</dfn> is as follows:</p>

<ol><li><p>Let <var title="">input</var> be the string being parsed,
after conversion to Unicode and after the replacement of U+0000
NULL characters described above.</li>
after conversion to Unicode.</li>

<li><p>Replace all U+0000 NULL characters in <var title="">input</var> by U+FFFD REPLACEMENT CHARACTERs.</li>

<li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially pointing at the start of the
string. In an <a href=#incremental-webvtt-parser>incremental WebVTT parser</a>, when this
<li><p>Let <var title="">decoded fragid</var> be the result of
expanding any sequences of percent-encoded octets in <var title="">fragid</var> that are valid UTF-8 sequences into Unicode
characters as defined by UTF-8. If any percent-encoded octets in
that string are not valid UTF-8 sequences, then skip this step and
the next one.</p>
that string are not valid UTF-8 sequences (e.g. they expand to
surrogate code points), then skip this step and the next one.</p>

<li><p>If this step was not skipped and there is an element in the
DOM that has an <a href=#concept-id title=concept-id>ID</a> exactly equal to <var title="">decoded
fragid</var>, then the first such element in tree order is
<a href=#the-indicated-part-of-the-document>the indicated part of the document</a>; stop the algorithm
here.</li>
DOM that has an <a href=#concept-id title=concept-id>ID</a> exactly equal to
<var title="">decoded fragid</var>, then the first such element in
tree order is <a href=#the-indicated-part-of-the-document>the indicated part of the document</a>; stop
the algorithm here.</li>

<li><p>If there is an <code><a href=#the-a-element>a</a></code> element in the DOM that has a
<code title=attr-a-name><a href=#attr-a-name>name</a></code> attribute whose value is
colon = %x003A ; U+003A COLON (:)
bom = %xFEFF ; U+FEFF BYTE ORDER MARK
name-char = %x0000-0009 / %x000B-000C / %x000E-0039 / %x003B-10FFFF
; a Unicode character other than U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), or U+003A COLON (:)
; a <a href=#unicode-character>Unicode character</a> other than U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), or U+003A COLON (:)
any-char = %x0000-0009 / %x000B-000C / %x000E-10FFFF
; a Unicode character other than U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)</pre>
; a <a href=#unicode-character>Unicode character</a> other than U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)</pre>

<p>Event streams in this format must always be encoded as
UTF-8. <a href=#refsRFC3629>[RFC3629]</a></p>
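
The `name-char` and `any-char` productions above can be sketched as single-character predicates. This is only an illustration under the assumption that "Unicode character" means a scalar value, as the linked definition states; the function names are hypothetical.

```python
def _is_scalar_value(ch: str) -> bool:
    # A Python str may hold lone surrogates; exclude them explicitly.
    return not (0xD800 <= ord(ch) <= 0xDFFF)


def is_any_char(ch: str) -> bool:
    """any-char: a Unicode character other than LF or CR."""
    return _is_scalar_value(ch) and ch not in ("\n", "\r")


def is_name_char(ch: str) -> bool:
    """name-char: a Unicode character other than LF, CR, or COLON."""
    return is_any_char(ch) and ch != ":"
```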
<h4 id=text-1><span class=secno>13.1.3 </span>Text</h4>

<p><dfn id=syntax-text title=syntax-text>Text</dfn> is allowed inside elements,
attribute values, and comments. Text must consist of Unicode
characters. Text must not contain U+0000 characters. Text must not
contain permanently undefined Unicode characters (noncharacters).
Text must not contain control characters other than <a href=#space-character title="space character">space characters</a>. Extra constraints
are placed on what is and what is not allowed in text based on where
the text is to be put, as described in the other sections.</p>
attribute values, and comments. Text must consist of <a href=#unicode-character title="Unicode character">Unicode characters</a>. Text must not
contain U+0000 characters. Text must not contain permanently
undefined Unicode characters (noncharacters). Text must not contain
control characters other than <a href=#space-character title="space character">space
characters</a>. Extra constraints are placed on what is and what
is not allowed in text based on where the text is to be put, as
described in the other sections.</p>
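
The constraints on text listed above could be checked along these lines. This is a sketch, not the specification's algorithm. It assumes that "control characters" means Unicode general category Cc, and that "space characters" are U+0009, U+000A, U+000C, U+000D, and U+0020. All names are hypothetical.

```python
import unicodedata

# Assumed set of HTML "space characters".
SPACE_CHARACTERS = {"\t", "\n", "\f", "\r", " "}


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def text_violations(text: str) -> list:
    """Return (index, reason) pairs for characters not allowed in text."""
    problems = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            problems.append((index, "surrogate code point"))
        elif cp == 0:
            problems.append((index, "U+0000"))
        elif is_noncharacter(cp):
            problems.append((index, "noncharacter"))
        elif unicodedata.category(ch) == "Cc" and ch not in SPACE_CHARACTERS:
            problems.append((index, "control character"))
    return problems
```

An empty list means the string satisfies these general constraints; the extra, context-dependent constraints mentioned above are out of scope for the sketch.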


<h5 id=newlines><span class=secno>13.1.3.1 </span>Newlines</h5>
<h4 id=overview-of-the-parsing-model><span class=secno>13.2.1 </span>Overview of the parsing model</h4>

<p>The input to the HTML parsing process consists of a stream of
Unicode characters, which is passed through a
Unicode code points, which is passed through a
<a href=#tokenization>tokenization</a> stage followed by a <a href=#tree-construction>tree
construction</a> stage. The output is a <code><a href=#document>Document</a></code>
object.</p>

<h4 id=the-input-stream><span class=secno>13.2.2 </span>The <dfn>input stream</dfn></h4>

<p>The stream of Unicode characters that comprises the input to the
<p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
that encoding is <i>tentative</i> or <i>certain</i>, is <a href=#meta-charset-during-parse>used during the parsing</a> to
determine whether to <a href=#change-the-encoding>change the encoding</a>. If no
encoding is necessary, e.g. because the parser is operating on a
stream of Unicode characters and doesn't have to use an encoding at
all, then the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
Unicode stream and doesn't have to use an encoding at all, then the
<a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
<i>irrelevant</i>.</p>

<ol><li><p>If the user has explicitly instructed the user agent to
<h5 id=preprocessing-the-input-stream><span class=secno>13.2.2.3 </span>Preprocessing the input stream</h5>

<p>Given an encoding, the bytes in the input stream must be
converted to Unicode characters for the tokenizer, as described by
converted to Unicode code points for the tokenizer, as described by
the rules for that encoding, except that the leading U+FEFF BYTE
ORDER MARK character, if any, must not be stripped by the encoding
layer (it is stripped by the rule below).</p> <!-- this is to
