
[giow] (2) Strip a leading BOM from scripts in workers, if any. Also, use more of the encoding spec.

Fixing https://www.w3.org/Bugs/Public/show_bug.cgi?id=17839
Affected topics: DOM APIs, HTML, HTML Syntax and Parsing, Offline Web Applications, Workers

git-svn-id: http://svn.whatwg.org/webapps@7782 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Mar 29, 2013
1 parent 0fbdead commit 5b130bc0958d71763bf91c57e3d4b947f7c49f1b
Showing with 176 additions and 252 deletions.
  1. +59 −78 complete.html
  2. +59 −78 index
  3. +58 −96 source
<p class=note>This complexity results from the historical decision to define the DOM API in
terms of 16 bit (UTF-16) <a href=#code-unit title="code unit">code units</a>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>

<p>When a byte stream is to be <dfn id=decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</dfn>, the user agent
must return the result of running the <a href=#utf-8-decoder>utf-8 decoder</a> on that byte stream.</p>




<ul class=brief><li><dfn id=getting-an-encoding>Getting an encoding</dfn>

<li>The <dfn id=encoder>encoder</dfn> and <dfn id=decoder>decoder</dfn> algorithms for various encodings, including
the <dfn id=utf-8-encoder>utf-8 encoder</dfn> and <dfn id=utf-8-decoder>utf-8 decoder</dfn>
the <dfn id=utf-8-encoder>UTF-8 encoder</dfn> and <dfn id=utf-8-decoder>UTF-8 decoder</dfn>

<li>The generic <dfn id=decode>decode</dfn> algorithm which takes a byte stream and an encoding and
returns a character stream

</ul><p class=note>The <a href=#utf-8-decoder>utf-8 decoder</a> is distinct from the <i>utf-8 decode
algorithm</i>. The latter is not used by this specification.</p>
<li>The <dfn id=utf-8-decode>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any

</ul><p class=note>The <a href=#utf-8-decoder>UTF-8 decoder</a> is distinct from the <i>UTF-8 decode
algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
former.</p>

</dd>
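The distinction drawn in the note above, between the UTF-8 decoder and the UTF-8 decode algorithm, can be sketched roughly as follows. This is a minimal illustration, not the Encoding specification's algorithm text: the function names `utf8_decoder` and `utf8_decode` are hypothetical, and Python's built-in `"utf-8"` codec with `errors="replace"` is assumed to be a close enough stand-in for the decoder's error handling.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # EF BB BF, the UTF-8 encoding of U+FEFF


def utf8_decoder(data: bytes) -> str:
    """Hypothetical stand-in for the 'UTF-8 decoder'.

    Invalid sequences become U+FFFD; a leading BOM is NOT removed.
    """
    return data.decode("utf-8", errors="replace")


def utf8_decode(data: bytes) -> str:
    """Hypothetical stand-in for the 'UTF-8 decode' algorithm.

    Strips one leading BOM, if any, then invokes the decoder.
    """
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return utf8_decoder(data)


# A worker script saved with a BOM by some editor (made-up content).
worker_script = UTF8_BOM + b"postMessage('ready');"

with_bom = utf8_decoder(worker_script)    # keeps U+FEFF as the first character
without_bom = utf8_decode(worker_script)  # BOM removed before decoding
```

Under these assumptions, worker scripts, manifests, and event streams would go through something like `utf8_decode`, while cookie strings and WebSocket close reasons would go through `utf8_decoder` and keep any leading U+FEFF.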

<code><a href=#document>Document</a></code>'s <a href=#origin>origin</a> is not a scheme/host/port tuple, the user agent must
throw a <code><a href=#securityerror>SecurityError</a></code> exception. Otherwise, the user agent must first <a href=#obtain-the-storage-mutex>obtain
the storage mutex</a> and then return the cookie-string for <a href="#the-document's-address">the document's address</a>
for a "non-HTTP" API, <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>. <a href=#refsCOOKIES>[COOKIES]</a>
for a "non-HTTP" API, decoded using the <a href=#utf-8-decoder>UTF-8 decoder</a>. <a href=#refsCOOKIES>[COOKIES]</a>
<a class=fingerprint href=#fingerprint><img alt="(This is a fingerprinting vector.)" height=64 src=http://images.whatwg.org/fingerprint.png width=46></a>
</p>


<p>To obtain the Unicode string, the user agent must run the following steps:</p>

<ol><li><p>For each of the rows in the following table, starting with the first one and going
down, if the file has as many or more bytes available than the number of bytes in the
first column, and the first bytes of the file match the bytes given in the first column,
then set <var title="">character encoding</var> to the encoding given in the cell in the
second column of that row, and jump to the bottom step in this series of steps:</p>

<!-- this table is present in several forms in this file; keep them in sync -->
<table id=table-script-bom><thead><tr><th>Bytes in Hexadecimal
<th>Encoding
<tbody><!-- nobody uses this
<tr>
<td>00 00 FE FF
<td>UTF-32BE
<tr>
<td>FF FE 00 00
<td>UTF-32LE
--><tr><td>FE FF
<td>Big-endian UTF-16
<tr><td>FF FE
<td>Little-endian UTF-16
<tr><td>EF BB BF
<td>UTF-8
<!-- nobody uses this
<tr>
<td>DD 73 66 73
<td>UTF-EBCDIC
-->
</table><p class=note>This step looks for Unicode Byte Order Marks (BOMs).</p>

</li>

<li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
<ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
specifies a character encoding, and the user agent supports that encoding, then let <var title="">character encoding</var> be that encoding, and jump to the bottom step in this
series of steps.</li>

<li><p>Let <var title="">character encoding</var> be <var><a href="#the-script-block's-fallback-character-encoding">the script block's fallback
character encoding</a></var>.</li>

<li><p>Convert the file to Unicode using <var>character encoding</var>, following the
rules for doing so given by the specification for <var><a href="#the-script-block's-type">the script block's
type</a></var>.</li>
<li>

<p>If the specification for <var><a href="#the-script-block's-type">the script block's type</a></var> gives specific rules for
decoding files in that format to Unicode, follow them, using <var>character
encoding</var> as the character encoding specified by higher-level protocols, if
necessary.</p> <!-- e.g. XML -->

<p>Otherwise, <a href=#decode>decode</a> the file to Unicode, using <var>character
encoding</var> as the fallback encoding.</p>

<p class=note>The <a href=#decode>decode</a> algorithm overrides <var>character
encoding</var> if the file contains a BOM.</p>

</li>

</ol></dd>
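The revised steps above (Content-Type charset if supported, otherwise the script block's fallback, with a BOM overriding either) can be sketched as follows. This is an illustrative approximation only: `decode`, `script_source`, and `is_supported` are hypothetical names, Python codec labels are assumed in place of the Encoding specification's labels, and the case where the script type defines its own decoding rules is not modeled.

```python
import codecs
from typing import Optional

# BOM table from the steps above, checked in order.
BOMS = [
    (b"\xfe\xff", "utf-16-be"),    # FE FF    -> big-endian UTF-16
    (b"\xff\xfe", "utf-16-le"),    # FF FE    -> little-endian UTF-16
    (b"\xef\xbb\xbf", "utf-8"),    # EF BB BF -> UTF-8
]


def is_supported(label: str) -> bool:
    """Return True if this Python build knows the encoding label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def decode(data: bytes, fallback: str) -> str:
    """Decode data; a leading BOM overrides the fallback and is dropped."""
    encoding = fallback
    for bom, bom_encoding in BOMS:
        if data.startswith(bom):
            encoding = bom_encoding
            data = data[len(bom):]
            break
    return data.decode(encoding, errors="replace")


def script_source(data: bytes,
                  content_type_charset: Optional[str],
                  block_fallback: str) -> str:
    """Choose the character encoding, then decode the script file."""
    if content_type_charset and is_supported(content_type_charset):
        character_encoding = content_type_charset
    else:
        character_encoding = block_fallback
    return decode(data, character_encoding)
```

For example, under these assumptions a file that starts with FF FE is decoded as little-endian UTF-16 even if the Content-Type metadata declares `windows-1252`.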

<p>When a user agent is to <dfn id=parse-a-manifest>parse a manifest</dfn>, it means that the user agent must run the
following steps:</p>

<ol><li><p>Decode the byte stream corresponding with the manifest to be parsed <a href=#decoded-as-utf-8,-with-error-handling title="decoded
as UTF-8, with error handling">as UTF-8, with error handling</a>. <!--All U+0000 NULL
characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
the same anyway)--></li>
<ol><li>

<p><a href=#utf-8-decode>UTF-8 decode</a> the byte stream corresponding with the manifest to be parsed.</p>

<p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips a leading BOM, if any.</p>

<!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
both will be treated the same anyway)-->

</li>

<li><p>Let <var title="">base URL</var> be the <a href=#absolute-url>absolute URL</a> representing the
manifest.</li>
<li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
pointing at the first character.</li>

<li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
then advance <var title="">position</var> to the next character.</li>

<li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
a simple event</a> named <code title=event-error>error</code> at that object. Abort these
steps.</p>

<p>If the attempt succeeds, then let <var title="">source</var> be the script resource
<a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>.
</p>
<p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
<a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>

<p>Let <var title="">language</var> be JavaScript.</p>

<code><a href=#networkerror>NetworkError</a></code> exception and abort all these
steps.</p>

<p>If the attempt succeeds, then let <var title="">source</var> be
the script resource <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
handling</a>.
</p>
<p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
<a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>

<p>Let <var title="">language</var> be JavaScript.</p>


<h4 id=event-stream-interpretation><span class=secno>10.2.5 </span>Interpreting an event stream</h4>

<p>Streams must be <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
handling</a>.
</p>
<p>Streams must be decoded using the <a href=#utf-8-decode>UTF-8 decode</a> algorithm.</p>

<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
<p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips one leading UTF-8 Byte Order Mark
(BOM), if any.</p>

<p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
action, whose <code title=dom-CloseEvent-wasClean><a href=#dom-closeevent-wasclean>wasClean</a></code> attribute is initialized to
true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code title=dom-CloseEvent-code><a href=#dom-closeevent-code>code</a></code> attribute is initialized to <i><a href=#the-websocket-connection-close-code>the WebSocket connection
close code</a></i>, and whose <code title=dom-CloseEvent-reason><a href=#dom-closeevent-reason>reason</a></code> attribute is
initialized to <i><a href=#the-websocket-connection-close-reason>the WebSocket connection close reason</a></i> <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
handling</a>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event at the
<code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
initialized to the result of applying the <a href=#utf-8-decoder>UTF-8 decoder</a> to <i><a href=#the-websocket-connection-close-reason>the WebSocket
connection close reason</a></i>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event
at the <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>

</ol><div class=warning>


<h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>

<!--CLEANUP-->
<p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
<p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
stream</a> must be converted to Unicode code points for the
tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
that encoding, except that the leading U+FEFF BYTE ORDER MARK
character, if any, must not be stripped by the encoding layer (it is
stripped by the rule below).</p> <!-- this is to prevent two leading
BOMs from being both stripped, once by the decoder, and once by the
parser -->

<p>Bytes or sequences of bytes in the original byte stream that
could not be converted to Unicode code points must be converted to
U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
handling">decoded with the error handling</a> defined in this
specification.</p>
that encoding's <a href=#decoder>decoder</a>.</p>

<p class=note>Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification (e.g.
invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
errors that conformance checkers are expected to report.</p>

<p class=note>Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms; they
are stripped by the algorithm below.</p>

<p class=warning>The decoder algorithms describe how to handle invalid input; for security
reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
sequences are handled can result in, amongst other problems, script injection vulnerabilities
("XSS").</p>
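To illustrate why the warning above matters, here is a small sketch of lossy versus strict handling of an invalid byte sequence. It assumes Python's built-in UTF-8 codec as a stand-in for the specification's decoder; the helper name `decode_utf8_lossy` and the sample bytes are made up for the example.

```python
def decode_utf8_lossy(data: bytes) -> str:
    """Decode as UTF-8, turning invalid byte sequences into U+FFFD."""
    return data.decode("utf-8", errors="replace")


# C0 BC is an overlong (and therefore invalid) encoding of "<" (U+003C).
# A decoder that accepted it would let a "<" slip past a byte-level filter.
suspicious = b"\xc0\xbc" + b"script>"

lossy = decode_utf8_lossy(suspicious)

# A strict decode of the same bytes is rejected outright.
try:
    suspicious.decode("utf-8", errors="strict")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

In this sketch the overlong bytes never turn into a literal `<`; they surface as replacement characters instead, which is the kind of consistent behavior the warning asks implementations to follow.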


<h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>

UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
in implementations of this specification.</p>

<p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
agents must default to little-endian UTF-16.</p>
<p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
has been found, user agents must default to little-endian UTF-16.</p>
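The little-endian default described above can be sketched as follows. This is a simplified illustration with a hypothetical function name, `decode_self_describing_utf16`; for brevity it drops the BOM itself, whereas in the parser the BOM is stripped by a separate step rather than by the decoder.

```python
def decode_self_describing_utf16(data: bytes) -> str:
    """Decode 'self-describing' UTF-16, defaulting to little-endian."""
    if data.startswith(b"\xff\xfe"):
        # FF FE: little-endian BOM.
        return data[2:].decode("utf-16-le", errors="replace")
    if data.startswith(b"\xfe\xff"):
        # FE FF: big-endian BOM.
        return data[2:].decode("utf-16-be", errors="replace")
    # No BOM found: default to little-endian, per the requirement above.
    return data.decode("utf-16-le", errors="replace")


no_bom = decode_self_describing_utf16(b"h\x00i\x00")
big_endian = decode_self_describing_utf16(b"\xfe\xff\x00h\x00i")
```

The explicit branch is deliberate: Python's plain `"utf-16"` codec falls back to the platform's native byte order when no BOM is present, so it would not reliably match this requirement.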

<p class=note>The requirement to default UTF-16 to little-endian rather than big-endian is a
<a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a desire for compatibility with legacy
