
Commit

[giow] (2) Rejig the wording of the character encoding section to make it more precise and in particular to not make CR processing require look-ahead.

Affected topics: HTML, HTML Syntax and Parsing

git-svn-id: http://svn.whatwg.org/webapps@6991 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Feb 13, 2012
1 parent de2b5d8 commit 0a42fa6
Showing 4 changed files with 209 additions and 182 deletions.
137 changes: 73 additions & 64 deletions complete.html
@@ -1115,7 +1115,7 @@ <h2 class="no-num no-toc">Living Standard &mdash; Last Updated 13 February 2012<
<li><a href=#parsing><span class=secno>12.2 </span>Parsing HTML documents</a> <li><a href=#parsing><span class=secno>12.2 </span>Parsing HTML documents</a>
<ol> <ol>
<li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li> <li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
<li><a href=#the-input-stream><span class=secno>12.2.2 </span>The input stream</a> <li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
<ol> <ol>
<li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li> <li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
<li><a href=#character-encodings-0><span class=secno>12.2.2.2 </span>Character encodings</a></li> <li><a href=#character-encodings-0><span class=secno>12.2.2.2 </span>Character encodings</a></li>
@@ -13639,7 +13639,7 @@ <h4 id=opening-the-input-stream><span class=secno>3.4.1 </span>Opening the input


<p>If the document has an <a href=#active-parser>active parser</a> that isn't a <p>If the document has an <a href=#active-parser>active parser</a> that isn't a
<a href=#script-created-parser>script-created parser</a>, and the <a href=#insertion-point>insertion <a href=#script-created-parser>script-created parser</a>, and the <a href=#insertion-point>insertion
point</a> associated with that parser's <a href=#the-input-stream>input point</a> associated with that parser's <a href=#input-stream>input
stream</a> is not undefined (that is, it <em>does</em> point to stream</a> is not undefined (that is, it <em>does</em> point to
somewhere in the input stream), then the method does somewhere in the input stream), then the method does
nothing. Abort these steps and return the <code><a href=#document>Document</a></code> nothing. Abort these steps and return the <code><a href=#document>Document</a></code>
@@ -13783,7 +13783,7 @@ <h4 id=opening-the-input-stream><span class=secno>3.4.1 </span>Opening the input
entry.</li> entry.</li>


<li><p>Finally, set the <a href=#insertion-point>insertion point</a> to point at <li><p>Finally, set the <a href=#insertion-point>insertion point</a> to point at
just before the end of the <a href=#the-input-stream>input stream</a> (which at this just before the end of the <a href=#input-stream>input stream</a> (which at this
point will be empty).</li> point will be empty).</li>


<li><p>Return the <code><a href=#document>Document</a></code> on which the method was <li><p>Return the <code><a href=#document>Document</a></code> on which the method was
@@ -13833,7 +13833,7 @@ <h4 id=closing-the-input-stream><span class=secno>3.4.2 </span>Closing the input
with the document, then abort these steps.</li> with the document, then abort these steps.</li>


<li><p>Insert an <a href=#explicit-eof-character>explicit "EOF" character</a> at the end <li><p>Insert an <a href=#explicit-eof-character>explicit "EOF" character</a> at the end
of the parser's <a href=#the-input-stream>input stream</a>.</li> of the parser's <a href=#input-stream>input stream</a>.</li>


<li><p>If there is a <a href=#pending-parsing-blocking-script>pending parsing-blocking script</a>, <li><p>If there is a <a href=#pending-parsing-blocking-script>pending parsing-blocking script</a>,
then abort these steps.</li> then abort these steps.</li>
@@ -13922,14 +13922,14 @@ <h4 id=document.write()><span class=secno>3.4.3 </span><code title=dom-document-
the user <a href=#refused-to-allow-the-document-to-be-unloaded>refused to allow the document to be the user <a href=#refused-to-allow-the-document-to-be-unloaded>refused to allow the document to be
unloaded</a>, then abort these steps. Otherwise, the unloaded</a>, then abort these steps. Otherwise, the
<a href=#insertion-point>insertion point</a> will point at just before the end of <a href=#insertion-point>insertion point</a> will point at just before the end of
the (empty) <a href=#the-input-stream>input stream</a>.</p> the (empty) <a href=#input-stream>input stream</a>.</p>


</li> </li>


<li> <li>


<p>Insert the string consisting of the concatenation of all the <p>Insert the string consisting of the concatenation of all the
arguments to the method into the <a href=#the-input-stream>input stream</a> just arguments to the method into the <a href=#input-stream>input stream</a> just
before the <a href=#insertion-point>insertion point</a>.</p> before the <a href=#insertion-point>insertion point</a>.</p>


</li> </li>
@@ -64273,12 +64273,12 @@ <h4 id=read-html><span class=secno>6.5.2 </span><dfn title=navigate-html>Page lo
an <a href=#html-documents title="HTML documents">HTML document</a>, set its <a href=#concept-document-content-type title=concept-document-content-type>content type</a> to "<code title="">text/html</code>", create an <a href=#html-parser>HTML parser</a>, and an <a href=#html-documents title="HTML documents">HTML document</a>, set its <a href=#concept-document-content-type title=concept-document-content-type>content type</a> to "<code title="">text/html</code>", create an <a href=#html-parser>HTML parser</a>, and
associate it with the document. Each <a href=#concept-task title=concept-task>task</a> that the <a href=#networking-task-source>networking task associate it with the document. Each <a href=#concept-task title=concept-task>task</a> that the <a href=#networking-task-source>networking task
source</a> places on the <a href=#task-queue>task queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a> runs must then fill the source</a> places on the <a href=#task-queue>task queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a> runs must then fill the
parser's <a href=#the-input-stream>input stream</a> with the fetched bytes and cause parser's <a href=#the-input-byte-stream>input byte stream</a> with the fetched bytes and
the <a href=#html-parser>HTML parser</a> to perform the appropriate processing cause the <a href=#html-parser>HTML parser</a> to perform the appropriate
of the input stream.</p> processing of the input stream.</p>


<p class=note>The <a href=#the-input-stream>input stream</a> converts bytes into <p class=note>The <a href=#the-input-byte-stream>input byte stream</a> converts bytes
characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part, into characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part,
on character encoding information found in the real <a href=#content-type title=Content-Type>Content-Type metadata</a> of the resource; on character encoding information found in the real <a href=#content-type title=Content-Type>Content-Type metadata</a> of the resource;
the "sniffed type" is not used for this purpose.</p> the "sniffed type" is not used for this purpose.</p>


@@ -64377,9 +64377,9 @@ <h4 id=read-text><span class=secno>6.5.4 </span><dfn title=navigate-text>Page lo
state</a>. Each <a href=#concept-task title=concept-task>task</a> that the state</a>. Each <a href=#concept-task title=concept-task>task</a> that the
<a href=#networking-task-source>networking task source</a> places on the <a href=#task-queue>task <a href=#networking-task-source>networking task source</a> places on the <a href=#task-queue>task
queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a> queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a>
runs must then fill the parser's <a href=#the-input-stream>input stream</a> with the runs must then fill the parser's <a href=#the-input-byte-stream>input byte stream</a> with
fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform the the fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform
appropriate processing of the input stream.</p> the appropriate processing of the input stream.</p>


<p>The rules for how to convert the bytes of the plain text document <p>The rules for how to convert the bytes of the plain text document
into actual characters, and the rules for actually rendering the into actual characters, and the rules for actually rendering the
@@ -81111,13 +81111,13 @@ <h3 id=parsing><span class=secno>12.2 </span>Parsing HTML documents</h3>


<h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</h4> <h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</h4>


<p class=overview><object data=images/parsing-model-overview.svg height=450 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p> <p class=overview><object data=images/parsing-model-overview.svg height=535 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p>


<p>The input to the HTML parsing process consists of a stream of <p>The input to the HTML parsing process consists of a stream of
Unicode code points, which is passed through a <a href=#unicode-code-point title="Unicode code point">Unicode code points</a>, which
<a href=#tokenization>tokenization</a> stage followed by a <a href=#tree-construction>tree is passed through a <a href=#tokenization>tokenization</a> stage followed by a
construction</a> stage. The output is a <code><a href=#document>Document</a></code> <a href=#tree-construction>tree construction</a> stage. The output is a
object.</p> <code><a href=#document>Document</a></code> object.</p>


<p class=note>Implementations that <a href=#non-scripted>do not <p class=note>Implementations that <a href=#non-scripted>do not
support scripting</a> do not have to actually create a DOM support scripting</a> do not have to actually create a DOM
@@ -81157,21 +81157,50 @@ <h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of
</div> </div>





<div class=impl> <div class=impl>


<h4 id=the-input-stream><span class=secno>12.2.2 </span>The <dfn>input stream</dfn></h4> <h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>


<p>The stream of Unicode code points that comprises the input to the <p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a file system). The bytes encode the actual characters according to a
particular <em>character encoding</em>, which the user agent must particular <i>character encoding</i>, which the user agent must use
use to decode the bytes into characters.</p> to decode the bytes into characters.</p>


<p class=note>For XML documents, the algorithm user agents must <p class=note>For XML documents, the algorithm user agents must
use to determine the character encoding is given by the XML use to determine the character encoding is given by the XML
specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p> specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>


<p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
used to determine the character encoding.</p>

<p>Given an encoding, the bytes in the <a href=#the-input-byte-stream>input byte
stream</a> must be converted to Unicode code points for the
tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
that encoding, except that the leading U+FEFF BYTE ORDER MARK
character, if any, must not be stripped by the encoding layer (it is
stripped by the rule below).</p> <!-- this is to prevent two leading
BOMs from being both stripped, once by the decoder, and once by the
parser -->

<p>Bytes or sequences of bytes in the original byte stream that
could not be converted to Unicode code points must be converted to
U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
handling">decoded with the error handling</a> defined in this
specification.</p>

<p class=note>Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification (e.g.
invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
errors that conformance checkers are expected to report.</p>

<p>Any byte or sequence of bytes in the original byte stream that is
<a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
error</a>.</p>
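
The decoding requirements added in this hunk can be illustrated with a small Python sketch. It is not the specification's algorithm: the function names `decode_input_bytes` and `strip_one_leading_bom` are hypothetical, only UTF-8 is handled, and Python's `errors="replace"` handling is assumed to be close enough to the spec's "decoded as UTF-8, with error handling" rules for illustration, which may not hold for every malformed sequence. The point shown is the division of labour: the decoder keeps a leading U+FEFF, and a later step ignores at most one.

```python
BOM = "\ufeff"  # U+FEFF BYTE ORDER MARK


def decode_input_bytes(data: bytes) -> str:
    """Decode a UTF-8 input byte stream into code points.

    Undecodable bytes become U+FFFD REPLACEMENT CHARACTER. Python's plain
    "utf-8" codec (unlike "utf-8-sig") leaves a leading BOM in place, which
    matches the requirement that the encoding layer must not strip it.
    """
    return data.decode("utf-8", errors="replace")


def strip_one_leading_bom(text: str) -> str:
    """Ignore exactly one leading U+FEFF, if present (the later rule)."""
    return text[1:] if text.startswith(BOM) else text


def to_input_stream(data: bytes) -> str:
    # Two separate stages, so that two leading BOMs are never both removed.
    return strip_one_leading_bom(decode_input_bytes(data))
```

Keeping the two stages separate is what prevents the case named in the source comment, where a decoder and the parser each strip a BOM and a document that starts with two of them loses both.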



<h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5> <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>


@@ -81428,7 +81457,7 @@ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Dete
</ol><p>The <a href="#document's-character-encoding">document's character encoding</a> must immediately </ol><p>The <a href="#document's-character-encoding">document's character encoding</a> must immediately
be set to the value returned from this algorithm, at the same time be set to the value returned from this algorithm, at the same time
as the user agent uses the returned value to select the decoder to as the user agent uses the returned value to select the decoder to
use for the input stream.</p> use for the input byte stream.</p>


<hr><p>When an algorithm requires a user agent to <dfn id=prescan-a-byte-stream-to-determine-its-encoding>prescan a byte <hr><p>When an algorithm requires a user agent to <dfn id=prescan-a-byte-stream-to-determine-its-encoding>prescan a byte
stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps. stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
@@ -81438,7 +81467,7 @@ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Dete
<ol><li> <ol><li>


<p>Let <var title="">position</var> be a pointer to a byte in the <p>Let <var title="">position</var> be a pointer to a byte in the
input stream, initially pointing at the first byte. If at any input byte stream, initially pointing at the first byte. If at any
point during these steps the user agent either runs out of bytes point during these steps the user agent either runs out of bytes
or reaches its <var title="">end condition</var>, then abort the or reaches its <var title="">end condition</var>, then abort the
<a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its encoding</a> <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its encoding</a>
@@ -81575,8 +81604,8 @@ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Dete
</dl></li> </dl></li>


<li><i>Next byte</i>: Move <var title="">position</var> so it <li><i>Next byte</i>: Move <var title="">position</var> so it
points at the next byte in the input stream, and return to the step points at the next byte in the input byte stream, and return to the
above labeld <i>loop</i>.</li> step above labeld <i>loop</i>.</li>


</ol><p>When the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its </ol><p>When the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
encoding</a> algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an attribute</dfn>, encoding</a> algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an attribute</dfn>,
@@ -81851,32 +81880,12 @@ <h5 id=character-encodings-0><span class=secno>12.2.2.2 </span>Character encodin


<h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preprocessing the input stream</h5> <h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preprocessing the input stream</h5>


<p>Given an encoding, the bytes in the input stream must be <p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
converted to Unicode code points for the tokenizer, as described by into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the
the rules for that encoding, except that the leading U+FEFF BYTE various APIs that directly manipulate the input stream.</p>
ORDER MARK character, if any, must not be stripped by the encoding
layer (it is stripped by the rule below).</p> <!-- this is to
prevent two leading BOMs from being both stripped, once by the
decoder, and once by the parser -->

<p>Bytes or sequences of bytes in the original byte stream that
could not be converted to Unicode code points must be converted to
U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
handling">decoded with the error handling</a> defined in this
specification.</p>

<p class=note>Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification
(e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
errors that conformance checkers are expected to report.</p>

<p>Any byte or sequence of bytes in the original byte stream that is
<a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
error</a>.</p>


<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
any are present.</p> any are present in the <a href=#input-stream>input stream</a>.</p>


<p class=note>The requirement to strip a U+FEFF BYTE ORDER MARK <p class=note>The requirement to strip a U+FEFF BYTE ORDER MARK
character regardless of whether that character was used to determine character regardless of whether that character was used to determine
@@ -81898,18 +81907,18 @@ <h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preproce
undefined Unicode characters (noncharacters).</p> undefined Unicode characters (noncharacters).</p>


<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF) <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are characters are treated specially. All CR characters must be
followed by LF characters must be removed, and any CR characters not converted to LF characters, and any LF characters that immediately
followed by LF characters must be converted to LF characters. Thus, follow a CR character must be ignored. Thus, newlines in HTML DOMs
newlines in HTML DOMs are represented by LF characters, and there are represented by LF characters, and there are never any CR
are never any CR characters in the input to the characters in the input to the <a href=#tokenization>tokenization</a> stage.</p>
<a href=#tokenization>tokenization</a> stage.</p>
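
The difference between the two wordings of the CR/LF rule can be sketched in Python. This is an illustration only, and the class name `NewlineNormalizer` and its `feed` method are hypothetical. The old wording needs to know what follows a CR before deciding what to do with it; the new wording converts every CR immediately and only needs to remember whether the previous character was a CR, so it also works when input arrives in chunks.

```python
class NewlineNormalizer:
    """Streaming CR/LF normalization with one bit of state and no look-ahead."""

    def __init__(self) -> None:
        # True when the most recently seen character was a CR.
        self._last_was_cr = False

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            if ch == "\r":
                # Every CR is converted to LF as soon as it is seen.
                out.append("\n")
                self._last_was_cr = True
            elif ch == "\n" and self._last_was_cr:
                # An LF that immediately follows a CR is ignored.
                self._last_was_cr = False
            else:
                out.append(ch)
                self._last_was_cr = False
        return "".join(out)
```

Because the state survives between calls, a CR at the end of one chunk and an LF at the start of the next still produce a single LF, without the first call having to wait for more input.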


<p>The <dfn id=next-input-character>next input character</dfn> is the first character in the <p>The <dfn id=next-input-character>next input character</dfn> is the first character in the
input stream that has not yet been <dfn id=consumed>consumed</dfn>. Initially, <a href=#input-stream>input stream</a> that has not yet been <dfn id=consumed>consumed</dfn>
the <i><a href=#next-input-character>next input character</a></i> is the first character in the or explicit ignored by the requirements in this section. Initially,
input. The <dfn id=current-input-character>current input character</dfn> is the last character the <i><a href=#next-input-character>next input character</a></i> is the first character in the input.
to have been <i><a href=#consumed>consumed</a></i>.</p> The <dfn id=current-input-character>current input character</dfn> is the last character to have
been <i><a href=#consumed>consumed</a></i>.</p>


<p>The <dfn id=insertion-point>insertion point</dfn> is the position (just before a <p>The <dfn id=insertion-point>insertion point</dfn> is the position (just before a
character or just before the end of the input stream) where content character or just before the end of the input stream) where content
@@ -81920,9 +81929,9 @@ <h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preproce
undefined.</p> undefined.</p>


<p>The "EOF" character in the tables below is a conceptual character <p>The "EOF" character in the tables below is a conceptual character
representing the end of the <a href=#the-input-stream>input stream</a>. If the parser representing the end of the <a href=#input-stream>input stream</a>. If the parser
is a <a href=#script-created-parser>script-created parser</a>, then the end of the is a <a href=#script-created-parser>script-created parser</a>, then the end of the
<a href=#the-input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF" <a href=#input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF"
character</dfn> (inserted by the <code title=dom-document-close><a href=#dom-document-close>document.close()</a></code> method) is character</dfn> (inserted by the <code title=dom-document-close><a href=#dom-document-close>document.close()</a></code> method) is
consumed. Otherwise, the "EOF" character is not a real character in consumed. Otherwise, the "EOF" character is not a real character in
the stream, but rather the lack of any further characters.</p> the stream, but rather the lack of any further characters.</p>
@@ -88477,7 +88486,7 @@ <h4 id=the-end><span class=secno>12.2.6 </span>The end</h4>
</ol><p>When the user agent is to <dfn id=abort-a-parser>abort a parser</dfn>, it must run </ol><p>When the user agent is to <dfn id=abort-a-parser>abort a parser</dfn>, it must run
the following steps:</p> the following steps:</p>


<ol><li><p>Throw away any pending content in the <a href=#the-input-stream>input <ol><li><p>Throw away any pending content in the <a href=#input-stream>input
stream</a>, and discard any future content that would have been stream</a>, and discard any future content that would have been
added to it.</li> added to it.</li>


@@ -89291,7 +89300,7 @@ <h3 id=serializing-html-fragments><span class=secno>12.3 </span>Serializing HTML


<li> <li>


<p>Place into the <a href=#the-input-stream>input stream</a> for the <a href=#html-parser>HTML <p>Place into the <a href=#input-stream>input stream</a> for the <a href=#html-parser>HTML
parser</a> just created the <var title="">input</var>. The parser</a> just created the <var title="">input</var>. The
encoding <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is encoding <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
<i>irrelevant</i>.</p> <i>irrelevant</i>.</p>
Binary file modified images/parsing-model-overview.png

