Clean up requirements around encodings
Rely on the Encoding standard to define all encodings. Forbid other
encodings explicitly.

Fixes https://www.w3.org/Bugs/Public/show_bug.cgi?id=24120
annevk committed Sep 3, 2015
1 parent eed97f0 commit a731806
Showing 1 changed file with 42 additions and 81 deletions.
123 changes: 42 additions & 81 deletions source
@@ -2249,25 +2249,15 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
<dfn data-x="encoding label">encoding labels</dfn>, referred to as the encoding's <i>name</i> and
<i>labels</i> in the Encoding standard. <ref spec=ENCODING></p>

<p>An <dfn>ASCII-compatible character encoding</dfn> is a single-byte or variable-length
<span>encoding</span> in which the bytes 0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C -
0x3F, 0x41 - 0x5A, and 0x61 - 0x7A<!-- is that list ok? do any character sets we want to support
do things outside that range? -->, ignoring bytes that are the second and later bytes of multibyte
sequences, all correspond to single-byte sequences that map to the same Unicode characters as
those bytes in Windows-1252<!--ANSI_X3.4-1968 (US-ASCII)-->. <ref spec=ENCODING></p>

<p class="note">This includes such encodings as Shift_JIS, HZ-GB-2312, and variants of ISO-2022,
even though it is possible in these encodings for bytes like 0x70 to be part of longer sequences
that are unrelated to their interpretation as ASCII. It excludes UTF-16 variants, as well as
obsolete legacy encodings such as UTF-7, GSM03.38, and EBCDIC variants.</p>
<p>A <dfn>UTF-16 encoding</dfn> is UTF-16BE or UTF-16LE. <ref spec=ENCODING></p>

<!--
We'll have to change that if anyone comes up with a way to have a document that is valid as two
different encodings at once, with different <meta charset> elements applying in each case.
-->
<p>An <dfn>ASCII-compatible encoding</dfn> is any <span>encoding</span> that is not a
<span>UTF-16 encoding</span>. <ref spec=ENCODING></p>

<p>The term <dfn>a UTF-16 encoding</dfn> refers to any variant of UTF-16: UTF-16LE or UTF-16BE,
regardless of the presence or absence of a BOM. <ref spec=ENCODING></p>
<p class="note">Since support for encodings that are not defined in the WHATWG Encoding standard
is prohibited, <span data-x="UTF-16 encoding">UTF-16 encodings</span> are the only encodings that
this specification needs to treat as not being <span
data="ASCII-compatible encoding">ASCII-compatible encodings</span>.

<p>The term <dfn>code unit</dfn> is used as defined in the Web IDL specification: a 16 bit
unsigned integer, the smallest atomic component of a <code
@@ -6557,8 +6547,8 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d

</li>

<li><p>If <var>encoding</var> is <span>a UTF-16 encoding</span>, then change the value
of <var>encoding</var> to UTF-8.</p></li>
<li><p>If <var>encoding</var> is a <span>UTF-16 encoding</span>, then change the value of
<var>encoding</var> to UTF-8.</p></li>

<li>

@@ -14628,7 +14618,7 @@ people expect to have work and what is necessary.
<span>encoding</span> is not explicitly given by <span data-x="Content-Type">Content-Type
metadata</span>, and the document is not <span>an <code>iframe</code> <code
data-x="attr-iframe-srcdoc">srcdoc</code> document</span>, then the character encoding used must be
an <span>ASCII-compatible character encoding</span>, and the encoding must be specified using a
an <span>ASCII-compatible encoding</span>, and the encoding must be specified using a
<code>meta</code> element with a <code data-x="attr-meta-charset">charset</code> attribute or a
<code>meta</code> element with an <code data-x="attr-meta-http-equiv">http-equiv</code> attribute
in the <span data-x="attr-meta-http-equiv-content-type">Encoding declaration state</span>.</p>
@@ -14647,7 +14637,7 @@ people expect to have work and what is necessary.
with a <code data-x="attr-meta-charset">charset</code> attribute or a <code>meta</code> element
with an <code data-x="attr-meta-http-equiv">http-equiv</code> attribute in the <span
data-x="attr-meta-http-equiv-content-type">Encoding declaration state</span>, then the character
encoding used must be an <span>ASCII-compatible character encoding</span>.</p>
encoding used must be an <span>ASCII-compatible encoding</span>.</p>

<p>Authors should use UTF-8. Conformance checkers may advise authors against using legacy
encodings. <ref spec=ENCODING></p>
@@ -14658,38 +14648,14 @@ people expect to have work and what is necessary.

</div>

<p>Encodings in which a series of bytes in the range 0x20 to 0x7E can encode characters other than
the corresponding characters in the range U+0020 to U+007E represent a potential security
vulnerability: a user agent that does not support the encoding (or does not support the label used
to declare the encoding, or does not use the same mechanism to detect the encoding of unlabeled
content as another user agent) might end up interpreting technically benign plain text content as
HTML tags and JavaScript. Authors should therefore not use these encodings. For example, this
applies to encodings in which the bytes corresponding to "<code data-x="">&lt;script></code>" in
ASCII can encode a different string. Authors should not use such encodings, which are known to
include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->,
HZ-GB-2312<!-- has crazy handling of ASCII "~" -->, JOHAB <!-- a supplementary encoding in KS C
5601-1992 Annex 3 (= KS X 1001:1998 Annex 3) --> (Windows code page 1361), encodings based on
ISO-2022<!-- http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
https://lists.w3.org/Archives/Public/public-whatwg-archive/2009Oct/0527.html -->, and encodings
based on EBCDIC. Furthermore, authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings,
which also fall into this category; these encodings were never intended for use for Web content.
<ref spec=RFC1345><!-- for the JIS types -->
<ref spec=RFC1842><!-- HZ-GB-2312 -->
<ref spec=RFC1468><!-- ISO-2022-JP -->
<ref spec=RFC2237><!-- ISO-2022-JP-1 -->
<ref spec=RFC1554><!-- ISO-2022-JP-2 -->
<ref spec=CP50220><!-- CP50220, the compatibility replacement for ISO-2022-JP -->
<ref spec=RFC1922><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<ref spec=RFC1557><!-- ISO-2022-KR -->
<ref spec=CESU8>
<ref spec=UTF7>
<ref spec=BOCU1>
<ref spec=SCSU>
<!-- no idea what to reference for JOHAB or EBCDIC, so... -->
</p>
<p>Authors must not use encodings that are not defined in the WHATWG Encoding standard.
Additionally, authors should not use ISO-2022-JP. <ref spec=ENCODING></p>

<p>Authors should not use UTF-32, as the encoding detection algorithms described in this
specification intentionally do not distinguish it from UTF-16. <ref spec=UNICODE></p>
<p class="note">Some encodings that are not defined in the WHATWG Encoding standard use bytes in
the range 0x20 to 0x7E, inclusive, to encode characters other than the corresponding characters in
the range U+0020 to U+007E, inclusive, and represent a potential security vulnerability: A user
agent might end up interpreting supposedly benign plain text content as HTML tags and
JavaScript.</p>

<p class="note">Using non-UTF-8 encodings can have unexpected results on form submission and URL
encodings, which use the <span>document's character encoding</span> by default.</p>
@@ -43333,7 +43299,7 @@ interface <dfn>HTMLFormElement</dfn> : <span>HTMLElement</span> {
character encodings that are to be used for the submission. If specified, the value must be an
<span>ordered set of unique space-separated tokens</span> that are <span>ASCII
case-insensitive</span>, and each token must be an <span>ASCII case-insensitive</span> match for
one of the <span data-x="encoding label">labels</span> of an <span>ASCII-compatible character
one of the <span data-x="encoding label">labels</span> of an <span>ASCII-compatible
encoding</span>. <ref spec=ENCODING></p>

<p>The <dfn><code data-x="attr-form-name">name</code></dfn> attribute represents the
@@ -57926,8 +57892,8 @@ fur
<span>encoding</span> to <var>candidate encodings</var>.</p></li>

<li><p>If the <i>allow non-ASCII-compatible encodings</i> flag is not set, remove any encodings
that are not <span data-x="ASCII-compatible character encoding">ASCII-compatible character
encodings</span> from <var>candidate encodings</var>.</p></li>
that are not <span data-x="ASCII-compatible encoding">ASCII-compatible encodings</span> from
<var>candidate encodings</var>.</p></li>

<li><p>If <var>candidate encodings</var> is empty, return UTF-8 and abort these
steps.</p></li>
@@ -57981,8 +57947,8 @@ fur

<p>Otherwise, if the <code>form</code> element has no <code
data-x="attr-form-accept-charset">accept-charset</code> attribute, but the <span>document's
character encoding</span> is an <span>ASCII-compatible character encoding</span>, then that is
the selected character encoding.</p>
character encoding</span> is an <span>ASCII-compatible encoding</span>, then that is the
selected character encoding.</p>

<p>Otherwise, let the selected character encoding be UTF-8.</p>

@@ -58231,7 +58197,7 @@ fur

<p>Otherwise, if the <code>form</code> element has no <code
data-x="attr-form-accept-charset">accept-charset</code> attribute, but the <span>document's
character encoding</span> is an <span>ASCII-compatible character encoding</span>, then that is
character encoding</span> is an <span>ASCII-compatible encoding</span>, then that is
the selected character encoding.</p>

<p>Otherwise, let the selected character encoding be UTF-8.</p>
@@ -99548,8 +99514,8 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
substeps.</p></li>

<li><p>If <var>parent document</var>'s <span data-x="document's character
encoding">character encoding</span> is not an <span>ASCII-compatible character encoding</span>,
then abort these substeps.</p></li>
encoding">character encoding</span> is not an <span>ASCII-compatible encoding</span>, then
abort these substeps.</p></li>

<li><p>Return <var>parent document</var>'s <span data-x="document's character
encoding">character encoding</span>, with the <span
@@ -100118,7 +100084,7 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {

<!-- the next two steps are redundant with steps in the 'change the encoding' algorithm -->

<li><p>If <var>charset</var> is <span>a UTF-16 encoding</span>, change the value of
<li><p>If <var>charset</var> is a <span>UTF-16 encoding</span>, change the value of
<var>charset</var> to UTF-8.</p></li>

<li><p>If <var>charset</var> is the x-user-defined encoding, change the value of
@@ -100343,17 +100309,12 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
<h5>Character encodings</h5>

<p>User agents must support the encodings defined in the WHATWG Encoding standard. User agents
should not support other encodings.</p>

<p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings. <ref spec=CESU8> <ref spec=UTF7> <ref spec=BOCU1> <ref spec=SCSU></p>

<p>Support for encodings based on EBCDIC is especially discouraged. This encoding is rarely used
for publicly-facing Web content. Support for UTF-32 is also especially discouraged. This encoding
is rarely used, and frequently implemented incorrectly.</p>
must not support other encodings.</p>

<p class="note">This specification does not make any attempt to support EBCDIC-based encodings and
UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behaviour
in implementations of this specification.</p>
<p class="note">The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC,
and UTF-32. This specification does not make any attempt to support prohibited encodings in its
algorithms; support and use of prohibited encodings would thus lead to unexpected behaviour. <ref
spec=CESU8> <ref spec=UTF7> <ref spec=BOCU1> <ref spec=SCSU></p>


<h5>Changing the encoding while parsing</h5>
@@ -100365,15 +100326,15 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {

<ol>

<li><p>If the encoding that is already being used to interpret the input stream is <span>a UTF-16
<li><p>If the encoding that is already being used to interpret the input stream is a <span>UTF-16
encoding</span>, then set the <span data-x="concept-encoding-confidence">confidence</span> to
<i>certain</i> and abort these steps. The new encoding is ignored; if it was anything but the
same encoding, then it would be clearly incorrect.</p></li>

<!-- the next two steps are redundant with similar logic in the sniffer -->
<!-- if you add anything else here, then factor it out into a common algorithm -->

<li><p>If the new encoding is <span>a UTF-16 encoding</span>, change it to UTF-8.</p></li>
<li><p>If the new encoding is a <span>UTF-16 encoding</span>, change it to UTF-8.</p></li>

<li><p>If the new encoding is the x-user-defined encoding, change it to Windows-1252. <ref spec=ENCODING></p></li> <!-- apparently this was a Chrome invention, later
picked up by Mozilla -->
@@ -104169,18 +104130,18 @@ document.body.appendChild(text);

<p id="meta-charset-during-parse">If the element has a <code
data-x="attr-meta-charset">charset</code> attribute, and <span>getting an encoding</span> from
its value results in a supported <span>ASCII-compatible character encoding</span> or <span>a
UTF-16 encoding</span>, and the <span data-x="concept-encoding-confidence">confidence</span> is
currently <i>tentative</i>, then <span>change the encoding</span> to the resulting encoding.</p>
its value results in an <span>encoding</span>, and the
<span data-x="concept-encoding-confidence">confidence</span> is currently <i>tentative</i>, then
<span>change the encoding</span> to the resulting encoding.</p>

<p>Otherwise, if the element has an <code data-x="attr-meta-http-equiv">http-equiv</code>
attribute whose value is an <span>ASCII case-insensitive</span> match for the string "<code
data-x="">Content-Type</code>", and the element has a <code
data-x="attr-meta-content">content</code> attribute, and applying the <span>algorithm for
extracting a character encoding from a <code>meta</code> element</span> to that attribute's
value returns a supported <span>ASCII-compatible character encoding</span> or <span>a UTF-16
encoding</span>, and the <span data-x="concept-encoding-confidence">confidence</span> is
currently <i>tentative</i>, then <span>change the encoding</span> to the extracted encoding.</p>
value returns an <span>encoding</span>, and the
<span data-x="concept-encoding-confidence">confidence</span> is currently <i>tentative</i>, then
<span>change the encoding</span> to the extracted encoding.</p>

</dd>

@@ -113734,7 +113695,7 @@ if (s = prompt('What is your name?')) {
<dt>Optional parameters:</dt>
<dd>No parameters</dd>
<dt>Encoding considerations:</dt>
<dd>7bit (US-ASCII encoding of octets that themselves can be encoding text using any <span>ASCII-compatible character encoding</span>)</dd>
<dd>7bit (US-ASCII encoding of octets that themselves can be encoding text using any <span>ASCII-compatible encoding</span>)</dd>
<!--ADD-TOPIC:Security-->
<dt>Security considerations:</dt>
<dd>
@@ -116175,7 +116136,7 @@ if (s = prompt('What is your name?')) {
<th> <code data-x="">accept-charset</code>
<td> <code data-x="attr-form-accept-charset">form</code>
<td> Character encodings to use for <span>form submission</span>
<td> <span>Ordered set of unique space-separated tokens</span>, <span>ASCII case-insensitive</span>, consisting of <span data-x="encoding label">labels</span> of <span data-x="ASCII-compatible character encoding">ASCII-compatible character encodings</span>*
<td> <span>Ordered set of unique space-separated tokens</span>, <span>ASCII case-insensitive</span>, consisting of <span data-x="encoding label">labels</span> of <span data-x="ASCII-compatible encoding">ASCII-compatible encodings</span>*
<tr>
<th> <code data-x="">accesskey</code>
<td> <span data-x="attr-accesskey">HTML elements</span>
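
Illustrative note (not part of the commit): the simplified definitions in this change can be sketched roughly as follows. This is a minimal sketch, assuming encodings are passed around as already-resolved names rather than labels; every function name here is hypothetical and does not appear in the spec source. It covers only three pieces of the diff: the new definition of an ASCII-compatible encoding, the UTF-16 and x-user-defined overrides applied to a declared encoding, and the form submission case where the form has no accept-charset attribute.

```python
# Hypothetical helper names; a sketch of the definitions in the diff above,
# not an implementation of the Encoding standard's label resolution.

UTF_16_ENCODINGS = frozenset({"utf-16be", "utf-16le"})


def is_utf16_encoding(name: str) -> bool:
    """A UTF-16 encoding is UTF-16BE or UTF-16LE."""
    return name.strip().lower() in UTF_16_ENCODINGS


def is_ascii_compatible_encoding(name: str) -> bool:
    """An ASCII-compatible encoding is any encoding that is not a UTF-16 encoding."""
    return not is_utf16_encoding(name)


def adjust_declared_encoding(name: str) -> str:
    """Apply the two overrides from the diff: UTF-16 becomes UTF-8,
    and x-user-defined becomes windows-1252. Other names pass through."""
    if is_utf16_encoding(name):
        return "UTF-8"
    if name.strip().lower() == "x-user-defined":
        return "windows-1252"
    return name


def form_encoding_without_accept_charset(document_encoding: str) -> str:
    """For a form with no accept-charset attribute: use the document's
    character encoding if it is ASCII-compatible, otherwise UTF-8."""
    if is_ascii_compatible_encoding(document_encoding):
        return document_encoding
    return "UTF-8"
```

Under these assumptions, `adjust_declared_encoding("UTF-16LE")` yields `"UTF-8"`, and a document encoded as UTF-16BE submits forms as UTF-8. Whether a given string names an encoding at all is left to the Encoding standard's "getting an encoding" step, which this sketch does not model.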