Commit

Let the Encoding standard deal with the BOM
The Encoding standard has a decode algorithm that lets the BOM override
any input encoding and also skips the BOM. These are the semantics
shared by a variety of formats, including text/html.

With this change HTML hooks into that directly rather than duplicating
the prose.

Fixes part of #657.
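
Illustration (not part of the commit): a minimal sketch of the BOM semantics the message describes, in which a leading BOM overrides the supplied encoding and is then skipped. The function name `decode`, the `_BOMS` table, and the use of Python codec names are assumptions made for this example. Python's `errors="replace"` only approximates the error handling the Encoding standard defines for its decoders.

```python
# Sketch of the BOM handling in the Encoding standard's "decode" algorithm.
# Names and Python codec labels are illustrative assumptions.

# BOM byte sequences and the encodings they select. These are the same three
# rows as the table this commit removes from the HTML source.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def decode(data: bytes, fallback_encoding: str) -> str:
    """Decode data, letting a leading BOM override fallback_encoding."""
    encoding = fallback_encoding
    for bom, bom_encoding in _BOMS:
        if data.startswith(bom):
            encoding = bom_encoding   # the BOM wins over the caller's label
            data = data[len(bom):]    # the BOM itself is skipped
            break
    # Malformed sequences become U+FFFD instead of raising an exception.
    return data.decode(encoding, errors="replace")


# A UTF-8 page mislabeled as GB2312, as in the case cited in the diff below:
# the BOM overrides the label and does not appear in the output.
print(decode(b"\xef\xbb\xbf\xe4\xb8\xad", "gb2312"))
```

Only one BOM is consumed in this sketch, so a second BOM in the input would decode to an ordinary U+FEFF character.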
annevk committed Feb 9, 2016
1 parent c485b70 commit 83ebb72
Showing 1 changed file with 6 additions and 69 deletions.
75 changes: 6 additions & 69 deletions source
@@ -2615,8 +2615,7 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d

<li><dfn data-noexport="" data-x-href="https://encoding.spec.whatwg.org/#concept-encoding-get">Getting an encoding</dfn>

<li>The <dfn data-noexport="">encoder</dfn> and <dfn data-noexport="">decoder</dfn> algorithms
for various encodings
<li>The <dfn data-noexport="">encoder</dfn> algorithm for various encodings

<li>The generic <dfn data-noexport=""
data-x-href="https://encoding.spec.whatwg.org/#decode">decode</dfn> algorithm which takes a
@@ -96745,16 +96744,16 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
character encoding.</p>

<p>Given a character encoding, the bytes in the <span>input byte stream</span> must be converted
to Unicode code points for the tokenizer's <span>input stream</span>, as described by the rules
for that encoding's <span>decoder</span>.</p>
to characters for the tokenizer's <span>input stream</span>, by passing the <span>input byte
stream</span> and character encoding to <span>decode</span>.</p>

<p class="note">A leading Byte Order Mark (BOM) causes the character encoding argument to be
ignored and will itself be skipped.</p>

<p class="note">Bytes or sequences of bytes in the original byte stream that did not conform to
the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors
that conformance checkers are expected to report. <ref spec=ENCODING></p>

<p class="note">Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
are stripped by the algorithm below.</p>

<p class="warning">The decoder algorithms describe how to handle invalid input; for security
reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
sequences are handled can result in, amongst other problems, script injection vulnerabilities
@@ -96830,60 +96829,6 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {

</li>

<li>

<!-- Doing this step before honouring HTTP is important for supporting
http://kb.dsqq.cn/html/2012-09/16/node_193.htm
which is encoded as UTF-8 but is incorrectly labeled as
Content-Type: text/html; charset=GB2312
-->

<p>For each of the rows in the following table, starting with the first one and going down, if
there are as many or more bytes available than the number of bytes in the first column, and the
first bytes of the file match the bytes given in the first column, then return the encoding
given in the cell in the second column of that row, with the <span
data-x="concept-encoding-confidence">confidence</span> <i>certain</i>, and abort these steps:</p>

<!-- this table is present in several forms in this file; keep them in sync -->
<table>
<thead>
<tr>
<th>Bytes in Hexadecimal
<th>Encoding
<tbody>
<!-- nobody uses this
<tr>
<td>00 00 FE FF
<td>UTF-32BE
<tr>
<td>FF FE 00 00
<td>UTF-32LE
-->
<tr>
<td>FE FF
<td>Big-endian UTF-16
<tr>
<td>FF FE
<td>Little-endian UTF-16
<tr>
<td>EF BB BF
<td>UTF-8
<!-- nobody uses this
<tr>
<td>DD 73 66 73
<td>UTF-EBCDIC
-->
</table>

<p class="note">This step looks for Unicode Byte Order Marks (BOMs).</p>

<p class="note">That this step happens before the next one honoring the HTTP
`<code>Content-Type</code>` header is a <span>willful violation</span> of the HTTP

foolip (Member) commented on Feb 9, 2016:

This willful violation is still there, just less obvious. Do you think it's worth keeping this note in some form?

specification, motivated by a desire to be maximally compatible with legacy content. <ref
spec=HTTP></p>

</li>

<li><p>If the transport layer specifies a character encoding, and it is supported, return that
encoding with the <span data-x="concept-encoding-confidence">confidence</span> <i>certain</i>, and
abort these steps.</p></li>
@@ -97788,14 +97733,6 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
<p>The <dfn>input stream</dfn> consists of the characters pushed into it as the <span>input byte
stream</span> is decoded or from the various APIs that directly manipulate the input stream.</p>

<p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present in the
<span>input stream</span>.</p>

<p class="note">The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether
that character was used to determine the byte order is a <span>willful violation</span> of
Unicode, motivated by a desire to increase the resilience of user agents in the face of na&iuml;ve
transcoders.</p>

<p>Any occurrences of any characters in the ranges U+0001 to U+0008, <!-- HT, LF allowed --> <!--
U+000B is in the next list --> <!-- FF, CR allowed --> U+000E to U+001F, <!-- ASCII allowed -->
U+007F <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0 to U+FDEF, and

0 comments on commit 83ebb72
