
[giow] (2) Rejig the wording of the character encoding section to make it more precise and in particular to not make CR processing require look-ahead.

Affected topics: HTML, HTML Syntax and Parsing

git-svn-id: http://svn.whatwg.org/webapps@6991 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Feb 13, 2012
1 parent de2b5d8 commit 0a42fa65a4a49cdbcebad2dac8c6f830d065aeba
Showing with 209 additions and 182 deletions.
  1. +73 −64 complete.html
  2. BIN images/parsing-model-overview.png
  3. +73 −64 index
  4. +63 −54 source
@@ -1115,7 +1115,7 @@ <h2 class="no-num no-toc">Living Standard &mdash; Last Updated 13 February 2012<
<li><a href=#parsing><span class=secno>12.2 </span>Parsing HTML documents</a>
<ol>
<li><a href=#overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</a></li>
<li><a href=#the-input-stream><span class=secno>12.2.2 </span>The input stream</a>
<li><a href=#the-input-byte-stream><span class=secno>12.2.2 </span>The input byte stream</a>
<ol>
<li><a href=#determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</a></li>
<li><a href=#character-encodings-0><span class=secno>12.2.2.2 </span>Character encodings</a></li>
@@ -13639,7 +13639,7 @@ <h4 id=opening-the-input-stream><span class=secno>3.4.1 </span>Opening the input

<p>If the document has an <a href=#active-parser>active parser</a> that isn't a
<a href=#script-created-parser>script-created parser</a>, and the <a href=#insertion-point>insertion
point</a> associated with that parser's <a href=#the-input-stream>input
point</a> associated with that parser's <a href=#input-stream>input
stream</a> is not undefined (that is, it <em>does</em> point to
somewhere in the input stream), then the method does
nothing. Abort these steps and return the <code><a href=#document>Document</a></code>
@@ -13783,7 +13783,7 @@ <h4 id=opening-the-input-stream><span class=secno>3.4.1 </span>Opening the input
entry.</li>

<li><p>Finally, set the <a href=#insertion-point>insertion point</a> to point at
just before the end of the <a href=#the-input-stream>input stream</a> (which at this
just before the end of the <a href=#input-stream>input stream</a> (which at this
point will be empty).</li>

<li><p>Return the <code><a href=#document>Document</a></code> on which the method was
@@ -13833,7 +13833,7 @@ <h4 id=closing-the-input-stream><span class=secno>3.4.2 </span>Closing the input
with the document, then abort these steps.</li>

<li><p>Insert an <a href=#explicit-eof-character>explicit "EOF" character</a> at the end
of the parser's <a href=#the-input-stream>input stream</a>.</li>
of the parser's <a href=#input-stream>input stream</a>.</li>

<li><p>If there is a <a href=#pending-parsing-blocking-script>pending parsing-blocking script</a>,
then abort these steps.</li>
@@ -13922,14 +13922,14 @@ <h4 id=document.write()><span class=secno>3.4.3 </span><code title=dom-document-
the user <a href=#refused-to-allow-the-document-to-be-unloaded>refused to allow the document to be
unloaded</a>, then abort these steps. Otherwise, the
<a href=#insertion-point>insertion point</a> will point at just before the end of
the (empty) <a href=#the-input-stream>input stream</a>.</p>
the (empty) <a href=#input-stream>input stream</a>.</p>

</li>

<li>

<p>Insert the string consisting of the concatenation of all the
arguments to the method into the <a href=#the-input-stream>input stream</a> just
arguments to the method into the <a href=#input-stream>input stream</a> just
before the <a href=#insertion-point>insertion point</a>.</p>

</li>
@@ -64273,12 +64273,12 @@ <h4 id=read-html><span class=secno>6.5.2 </span><dfn title=navigate-html>Page lo
an <a href=#html-documents title="HTML documents">HTML document</a>, set its <a href=#concept-document-content-type title=concept-document-content-type>content type</a> to "<code title="">text/html</code>", create an <a href=#html-parser>HTML parser</a>, and
associate it with the document. Each <a href=#concept-task title=concept-task>task</a> that the <a href=#networking-task-source>networking task
source</a> places on the <a href=#task-queue>task queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a> runs must then fill the
parser's <a href=#the-input-stream>input stream</a> with the fetched bytes and cause
the <a href=#html-parser>HTML parser</a> to perform the appropriate processing
of the input stream.</p>
parser's <a href=#the-input-byte-stream>input byte stream</a> with the fetched bytes and
cause the <a href=#html-parser>HTML parser</a> to perform the appropriate
processing of the input stream.</p>

<p class=note>The <a href=#the-input-stream>input stream</a> converts bytes into
characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part,
<p class=note>The <a href=#the-input-byte-stream>input byte stream</a> converts bytes
into characters for use in the <a href=#tokenization title=tokenization>tokenizer</a>. This process relies, in part,
on character encoding information found in the real <a href=#content-type title=Content-Type>Content-Type metadata</a> of the resource;
the "sniffed type" is not used for this purpose.</p>

@@ -64377,9 +64377,9 @@ <h4 id=read-text><span class=secno>6.5.4 </span><dfn title=navigate-text>Page lo
state</a>. Each <a href=#concept-task title=concept-task>task</a> that the
<a href=#networking-task-source>networking task source</a> places on the <a href=#task-queue>task
queue</a> while the <a href=#fetch title=fetch>fetching algorithm</a>
runs must then fill the parser's <a href=#the-input-stream>input stream</a> with the
fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform the
appropriate processing of the input stream.</p>
runs must then fill the parser's <a href=#the-input-byte-stream>input byte stream</a> with
the fetched bytes and cause the <a href=#html-parser>HTML parser</a> to perform
the appropriate processing of the input stream.</p>

<p>The rules for how to convert the bytes of the plain text document
into actual characters, and the rules for actually rendering the
@@ -81111,13 +81111,13 @@ <h3 id=parsing><span class=secno>12.2 </span>Parsing HTML documents</h3>

<h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of the parsing model</h4>

<p class=overview><object data=images/parsing-model-overview.svg height=450 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p>
<p class=overview><object data=images/parsing-model-overview.svg height=535 width=345><img alt="" height=450 src=http://images.whatwg.org/parsing-model-overview.png width=345></object></p>

<p>The input to the HTML parsing process consists of a stream of
Unicode code points, which is passed through a
<a href=#tokenization>tokenization</a> stage followed by a <a href=#tree-construction>tree
construction</a> stage. The output is a <code><a href=#document>Document</a></code>
object.</p>
<a href=#unicode-code-point title="Unicode code point">Unicode code points</a>, which
is passed through a <a href=#tokenization>tokenization</a> stage followed by a
<a href=#tree-construction>tree construction</a> stage. The output is a
<code><a href=#document>Document</a></code> object.</p>

<p class=note>Implementations that <a href=#non-scripted>do not
support scripting</a> do not have to actually create a DOM
@@ -81157,21 +81157,50 @@ <h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of
</div>



<div class=impl>

<h4 id=the-input-stream><span class=secno>12.2.2 </span>The <dfn>input stream</dfn></h4>
<h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>

<p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
particular <em>character encoding</em>, which the user agent must
use to decode the bytes into characters.</p>
particular <i>character encoding</i>, which the user agent must use
to decode the bytes into characters.</p>

<p class=note>For XML documents, the algorithm user agents must
use to determine the character encoding is given by the XML
specification. This section does not apply to XML documents. <a href=#refsXML>[XML]</a></p>

<p>The <a href=#encoding-sniffing-algorithm>encoding sniffing algorithm</a> defined below is
used to determine the character encoding.</p>

<p>Given an encoding, the bytes in the <a href=#the-input-byte-stream>input byte
stream</a> must be converted to Unicode code points for the
tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
that encoding, except that the leading U+FEFF BYTE ORDER MARK
character, if any, must not be stripped by the encoding layer (it is
stripped by the rule below).</p> <!-- this is to prevent two leading
BOMs from being both stripped, once by the decoder, and once by the
parser -->

<p>Bytes or sequences of bytes in the original byte stream that
could not be converted to Unicode code points must be converted to
U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
handling">decoded with the error handling</a> defined in this
specification.</p>

<p class=note>Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification (e.g.
invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
errors that conformance checkers are expected to report.</p>

<p>Any byte or sequence of bytes in the original byte stream that is
<a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
error</a>.</p>


<h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>

@@ -81428,7 +81457,7 @@ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Dete
</ol><p>The <a href="#document's-character-encoding">document's character encoding</a> must immediately
be set to the value returned from this algorithm, at the same time
as the user agent uses the returned value to select the decoder to
use for the input stream.</p>
use for the input byte stream.</p>

<hr><p>When an algorithm requires a user agent to <dfn id=prescan-a-byte-stream-to-determine-its-encoding>prescan a byte
stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
@@ -81438,7 +81467,7 @@ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Dete
<ol><li>

<p>Let <var title="">position</var> be a pointer to a byte in the
input stream, initially pointing at the first byte. If at any
input byte stream, initially pointing at the first byte. If at any
point during these steps the user agent either runs out of bytes
or reaches its <var title="">end condition</var>, then abort the
<a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its encoding</a>
@@ -81575,8 +81604,8 @@ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Dete
</dl></li>

<li><i>Next byte</i>: Move <var title="">position</var> so it
points at the next byte in the input stream, and return to the step
above labeld <i>loop</i>.</li>
points at the next byte in the input byte stream, and return to the
step above labeld <i>loop</i>.</li>

</ol><p>When the <a href=#prescan-a-byte-stream-to-determine-its-encoding>prescan a byte stream to determine its
encoding</a> algorithm says to <dfn id=concept-get-attributes-when-sniffing title=concept-get-attributes-when-sniffing>get an attribute</dfn>,
@@ -81851,32 +81880,12 @@ <h5 id=character-encodings-0><span class=secno>12.2.2.2 </span>Character encodin

<h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preprocessing the input stream</h5>

<p>Given an encoding, the bytes in the input stream must be
converted to Unicode code points for the tokenizer, as described by
the rules for that encoding, except that the leading U+FEFF BYTE
ORDER MARK character, if any, must not be stripped by the encoding
layer (it is stripped by the rule below).</p> <!-- this is to
prevent two leading BOMs from being both stripped, once by the
decoder, and once by the parser -->

<p>Bytes or sequences of bytes in the original byte stream that
could not be converted to Unicode code points must be converted to
U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
handling">decoded with the error handling</a> defined in this
specification.</p>

<p class=note>Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification
(e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
errors that conformance checkers are expected to report.</p>

<p>Any byte or sequence of bytes in the original byte stream that is
<a href=#misinterpreted-for-compatibility>misinterpreted for compatibility</a> is a <a href=#parse-error>parse
error</a>.</p>
<p>The <dfn id=input-stream>input stream</dfn> consists of the characters pushed
into it as the <a href=#the-input-byte-stream>input byte stream</a> is decoded or from the
various APIs that directly manipulate the input stream.</p>

<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
any are present.</p>
any are present in the <a href=#input-stream>input stream</a>.</p>

<p class=note>The requirement to strip a U+FEFF BYTE ORDER MARK
character regardless of whether that character was used to determine
@@ -81898,18 +81907,18 @@ <h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preproce
undefined Unicode characters (noncharacters).</p>

<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are
followed by LF characters must be removed, and any CR characters not
followed by LF characters must be converted to LF characters. Thus,
newlines in HTML DOMs are represented by LF characters, and there
are never any CR characters in the input to the
<a href=#tokenization>tokenization</a> stage.</p>
characters are treated specially. All CR characters must be
converted to LF characters, and any LF characters that immediately
follow a CR character must be ignored. Thus, newlines in HTML DOMs
are represented by LF characters, and there are never any CR
characters in the input to the <a href=#tokenization>tokenization</a> stage.</p>

<p>The <dfn id=next-input-character>next input character</dfn> is the first character in the
input stream that has not yet been <dfn id=consumed>consumed</dfn>. Initially,
the <i><a href=#next-input-character>next input character</a></i> is the first character in the
input. The <dfn id=current-input-character>current input character</dfn> is the last character
to have been <i><a href=#consumed>consumed</a></i>.</p>
<a href=#input-stream>input stream</a> that has not yet been <dfn id=consumed>consumed</dfn>
or explicit ignored by the requirements in this section. Initially,
the <i><a href=#next-input-character>next input character</a></i> is the first character in the input.
The <dfn id=current-input-character>current input character</dfn> is the last character to have
been <i><a href=#consumed>consumed</a></i>.</p>

<p>The <dfn id=insertion-point>insertion point</dfn> is the position (just before a
character or just before the end of the input stream) where content
@@ -81920,9 +81929,9 @@ <h5 id=preprocessing-the-input-stream><span class=secno>12.2.2.3 </span>Preproce
undefined.</p>

<p>The "EOF" character in the tables below is a conceptual character
representing the end of the <a href=#the-input-stream>input stream</a>. If the parser
representing the end of the <a href=#input-stream>input stream</a>. If the parser
is a <a href=#script-created-parser>script-created parser</a>, then the end of the
<a href=#the-input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF"
<a href=#input-stream>input stream</a> is reached when an <dfn id=explicit-eof-character>explicit "EOF"
character</dfn> (inserted by the <code title=dom-document-close><a href=#dom-document-close>document.close()</a></code> method) is
consumed. Otherwise, the "EOF" character is not a real character in
the stream, but rather the lack of any further characters.</p>
@@ -88477,7 +88486,7 @@ <h4 id=the-end><span class=secno>12.2.6 </span>The end</h4>
</ol><p>When the user agent is to <dfn id=abort-a-parser>abort a parser</dfn>, it must run
the following steps:</p>

<ol><li><p>Throw away any pending content in the <a href=#the-input-stream>input
<ol><li><p>Throw away any pending content in the <a href=#input-stream>input
stream</a>, and discard any future content that would have been
added to it.</li>

@@ -89291,7 +89300,7 @@ <h3 id=serializing-html-fragments><span class=secno>12.3 </span>Serializing HTML

<li>

<p>Place into the <a href=#the-input-stream>input stream</a> for the <a href=#html-parser>HTML
<p>Place into the <a href=#input-stream>input stream</a> for the <a href=#html-parser>HTML
parser</a> just created the <var title="">input</var>. The
encoding <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is
<i>irrelevant</i>.</p>
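
Note on the reworded requirements: the following is a minimal Python sketch, not part of the commit, of how the new wording could be implemented when the input arrives in chunks. The names `decode_utf8_bytes` and `InputStreamPreprocessor` are hypothetical. It assumes UTF-8 input and uses Python's built-in `errors="replace"` handler as a stand-in for the specification's own UTF-8 error handling, which may differ in how many U+FFFD characters it emits for a given invalid sequence. The point it illustrates is that converting every CR to LF and then ignoring an immediately following LF needs only one bit of remembered state, with no look-ahead past the end of a chunk.

```python
def decode_utf8_bytes(data: bytes) -> str:
    """Decode input bytes to code points (assumption: the encoding is UTF-8).

    Python's plain "utf-8" codec leaves a leading U+FEFF in the output
    (unlike "utf-8-sig"), so the decoder does not strip the BOM itself.
    Undecodable bytes become U+FFFD REPLACEMENT CHARACTER.
    """
    return data.decode("utf-8", errors="replace")


class InputStreamPreprocessor:
    """Hypothetical streaming preprocessor for decoded characters."""

    def __init__(self) -> None:
        self._at_start = True       # no character has been seen yet
        self._last_was_cr = False   # previous character was a CR

    def push(self, chunk: str) -> str:
        """Return the characters of `chunk` that reach the tokenizer."""
        out = []
        for ch in chunk:
            if self._at_start:
                self._at_start = False
                if ch == "\ufeff":
                    continue        # ignore exactly one leading BOM
            if ch == "\r":
                out.append("\n")    # every CR is converted to LF
                self._last_was_cr = True
                continue
            if ch == "\n" and self._last_was_cr:
                self._last_was_cr = False
                continue            # LF immediately after CR is ignored
            self._last_was_cr = False
            out.append(ch)
        return "".join(out)


# Usage: a CR at the end of one chunk and an LF at the start of the next
# still collapse into a single LF.
pre = InputStreamPreprocessor()
first = pre.push(decode_utf8_bytes(b"\xef\xbb\xbfa\r"))
second = pre.push("\nb\r\rc")
```

Under the old wording ("CR characters that are followed by LF characters must be removed"), a CR at the end of a chunk could not be resolved until the next character arrived. With the new wording the CR is emitted as LF immediately and only the following LF, if any, is dropped.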
Binary file not shown.