Let the Encoding standard deal with the BOM

The Encoding standard has a decode algorithm that lets a leading BOM override any input encoding and then skips that BOM. These semantics are shared by a variety of formats, including text/html. With this change, HTML hooks into that algorithm directly rather than duplicating the prose. Fixes part of #657.
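
As an illustration of the behaviour described above, here is a minimal sketch of BOM-aware decoding. It is not the Encoding standard's algorithm text. The function name `decode_with_bom`, the `BOMS` table, and the use of Python codec names are assumptions made for this example. Python's codecs and its `errors="replace"` handling also differ in detail from the decoders and error handling the Encoding standard defines.

```python
# Sketch only: BOM sniffing in the spirit of the Encoding standard's "decode".
# The three BOMs below match the table this commit removes from the HTML source.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def decode_with_bom(data: bytes, fallback_encoding: str) -> str:
    """Decode bytes, letting a leading BOM override fallback_encoding.

    Hypothetical helper: fallback_encoding is a Python codec name, not an
    Encoding standard label.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # The BOM wins over the caller's encoding and is itself skipped.
            return data[len(bom):].decode(encoding, errors="replace")
    # No BOM: use the encoding the caller determined (e.g. from HTTP).
    return data.decode(fallback_encoding, errors="replace")
```

For example, `decode_with_bom(b"\xef\xbb\xbfhi", "gb2312")` decodes as UTF-8 despite the GB2312 argument, which mirrors the mislabeled-page case mentioned in the removed source comment. Only one BOM is consumed, so a second U+FEFF would remain in the output.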
whatwg · foolip · Feb 9, 2016 · 83ebb72
1 parent c485b70
Showing 1 changed file with 6 additions and 69 deletions.
diff --git a/source b/source
@@ -2615,8 +2615,7 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
 
      <li><dfn data-noexport="" data-x-href="https://encoding.spec.whatwg.org/#concept-encoding-get">Getting an encoding</dfn>
 
-     <li>The <dfn data-noexport="">encoder</dfn> and <dfn data-noexport="">decoder</dfn> algorithms
-     for various encodings
+     <li>The <dfn data-noexport="">encoder</dfn> algorithm for various encodings
 
      <li>The generic <dfn data-noexport=""
      data-x-href="https://encoding.spec.whatwg.org/#decode">decode</dfn> algorithm which takes a
@@ -96745,16 +96744,16 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
   character encoding.</p>
 
   <p>Given a character encoding, the bytes in the <span>input byte stream</span> must be converted
-  to Unicode code points for the tokenizer's <span>input stream</span>, as described by the rules
-  for that encoding's <span>decoder</span>.</p>
+  to characters for the tokenizer's <span>input stream</span>, by passing the <span>input byte
+  stream</span> and character encoding to <span>decode</span>.</p>
+
+  <p class="note">A leading Byte Order Mark (BOM) causes the character encoding argument to be
+  ignored and will itself be skipped.</p>
 
   <p class="note">Bytes or sequences of bytes in the original byte stream that did not conform to
   the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors
   that conformance checkers are expected to report. <ref spec=ENCODING></p>
 
-  <p class="note">Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
-  are stripped by the algorithm below.</p>
-
   <p class="warning">The decoder algorithms describe how to handle invalid input; for security
   reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
   sequences are handled can result in, amongst other problems, script injection vulnerabilities
@@ -96830,60 +96829,6 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
 
    </li>
 
-   <li>
-
-    <!-- Doing this step before honouring HTTP is important for supporting
-            http://kb.dsqq.cn/html/2012-09/16/node_193.htm
-         which is encoded as UTF-8 but is incorrectly labeled as
-            Content-Type: text/html; charset=GB2312
-    -->
-
-    <p>For each of the rows in the following table, starting with the first one and going down, if
-    there are as many or more bytes available than the number of bytes in the first column, and the
-    first bytes of the file match the bytes given in the first column, then return the encoding
-    given in the cell in the second column of that row, with the <span
-    data-x="concept-encoding-confidence">confidence</span> <i>certain</i>, and abort these steps:</p>
-
-    <!-- this table is present in several forms in this file; keep them in sync -->
-    <table>
-     <thead>
-      <tr>
-       <th>Bytes in Hexadecimal
-       <th>Encoding
-     <tbody>
-<!-- nobody uses this
-      <tr>
-       <td>00 00 FE FF
-       <td>UTF-32BE
-      <tr>
-       <td>FF FE 00 00
-       <td>UTF-32LE
--->
-      <tr>
-       <td>FE FF
-       <td>Big-endian UTF-16
-      <tr>
-       <td>FF FE
-       <td>Little-endian UTF-16
-      <tr>
-       <td>EF BB BF
-       <td>UTF-8
-<!-- nobody uses this
-      <tr>
-       <td>DD 73 66 73
-       <td>UTF-EBCDIC
--->
-    </table>
-
-    <p class="note">This step looks for Unicode Byte Order Marks (BOMs).</p>
-
-    <p class="note">That this step happens before the next one honoring the HTTP
-    `<code>Content-Type</code>` header is a <span>willful violation</span> of the HTTP
-    specification, motivated by a desire to be maximally compatible with legacy content. <ref
-    spec=HTTP></p>
-
-   </li>
-
    <li><p>If the transport layer specifies a character encoding, and it is supported, return that
    encoding with the <span data-x="concept-encoding-confidence">confidence</span> <i>certain</i>, and
    abort these steps.</p></li>
@@ -97788,14 +97733,6 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
   <p>The <dfn>input stream</dfn> consists of the characters pushed into it as the <span>input byte
   stream</span> is decoded or from the various APIs that directly manipulate the input stream.</p>
 
-  <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present in the
-  <span>input stream</span>.</p>
-
-  <p class="note">The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether
-  that character was used to determine the byte order is a <span>willful violation</span> of
-  Unicode, motivated by a desire to increase the resilience of user agents in the face of na&iuml;ve
-  transcoders.</p>
-
   <p>Any occurrences of any characters in the ranges U+0001 to U+0008, <!-- HT, LF allowed --> <!--
   U+000B is in the next list --> <!-- FF, CR allowed --> U+000E to U+001F, <!-- ASCII allowed -->
   U+007F <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0 to U+FDEF, and