
The document's encoding doesn't properly get overwritten by a BOM #1910

Closed
domenic opened this issue Oct 15, 2016 · 4 comments
Assignees
Labels
clarification Standard could be clearer

Comments

@domenic
Member

domenic commented Oct 15, 2016

(This is a spec factoring issue, mostly.)

The only place in the spec that I can find that sets the document's character encoding is https://html.spec.whatwg.org/#determining-the-character-encoding:document's-character-encoding-3 which says

The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input byte stream.

Here, "this algorithm" is the "encoding sniffing algorithm". However, this algorithm doesn't deal with BOMs. BOMs are dealt with later, in https://html.spec.whatwg.org/#the-input-byte-stream:encoding-sniffing-algorithm:

Usually, the encoding sniffing algorithm defined below is used to determine the character encoding.

Given a character encoding, the bytes in the input byte stream must be converted to characters for the tokenizer's input stream, by passing the input byte stream and character encoding to decode.

So per spec, if you have a page with, say, Content-Type: text/html;charset=windows-1252 whose first few bytes are a UTF-8 BOM:

  1. The encoding sniffing algorithm detects windows-1252 as "the character encoding"
  2. It passes the input stream bytes + windows-1252 to the Encoding Standard's decode algorithm, which then decodes as UTF-8.
  3. However, per the first quote, the document's character encoding is immediately set to windows-1252, not UTF-8, since the BOM override is hidden inside the decode algorithm and invisible to the HTML spec.
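The mismatch in the steps above can be sketched as follows. This is an illustrative model, not spec text: the `decode` function below mimics the Encoding Standard's behavior of letting a BOM silently override the caller-supplied encoding, so the HTML spec's "document's character encoding" and the actual decoding disagree.

```python
# Sketch of why the document encoding and the actual decode can disagree.
# The Encoding Standard's decode algorithm checks for a BOM first and
# overrides the fallback encoding it was handed; the HTML spec never sees
# that override. (Only the UTF-8 BOM case is modeled here for brevity.)
UTF8_BOM = b"\xef\xbb\xbf"

def decode(stream: bytes, fallback: str) -> str:
    # BOM sniffing happens inside decode, invisible to the caller.
    if stream.startswith(UTF8_BOM):
        return stream[len(UTF8_BOM):].decode("utf-8")
    return stream.decode(fallback)

body = UTF8_BOM + "café".encode("utf-8")
document_encoding = "windows-1252"      # what the sniffing algorithm returned
text = decode(body, document_encoding)  # but the bytes decode as UTF-8
# text == "café", yet the document's character encoding was already
# set to windows-1252 per the first quote.
```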

One way of fixing this is to make the decode algorithm return both an output stream and an encoding, but I guess that will involve updating a lot of call sites, and is fairly inelegant. Another is to have a special operation that HTML uses instead of decode, which returns those two things. Maybe there is something better.

@domenic domenic added the clarification Standard could be clearer label Oct 15, 2016
@domenic
Member Author

domenic commented Oct 15, 2016

Maybe the best fix would be if https://html.spec.whatwg.org/#prescan-a-byte-stream-to-determine-its-encoding looked for the BOMs first (before the loop) and bailed out there. It duplicates things a bit, but seems less invasive.

Edit: no, that would allow the transport-layer encoding to override the BOM, which is incorrect. Nevermind. It would have to be in https://html.spec.whatwg.org/#encoding-sniffing-algorithm if we want to do it that way.

@domenic
Member Author

domenic commented Oct 15, 2016

How I ended up factoring this for jsdom:

  • I added a new "get BOM encoding" mechanism that performs step 4 of https://encoding.spec.whatwg.org/#decode
  • I made both Encoding's "decode" and HTML's "encoding sniffing algorithm" call out to "get BOM encoding".
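A minimal sketch of what such a shared "get BOM encoding" helper might look like (the name and signature are illustrative, taken from the comment above rather than from either spec). It performs the BOM check that step 4 of the Encoding Standard's decode algorithm describes, returning the BOM-indicated encoding label or `None`:

```python
from typing import Optional

def get_bom_encoding(data: bytes) -> Optional[str]:
    # Hypothetical helper mirroring step 4 of the Encoding Standard's
    # decode algorithm: match the leading bytes against the three BOMs.
    if data[:3] == b"\xef\xbb\xbf":
        return "utf-8"
    if data[:2] == b"\xfe\xff":
        return "utf-16be"
    if data[:2] == b"\xff\xfe":
        return "utf-16le"
    return None
```

With this factoring, both decode and the encoding sniffing algorithm can call the helper, so the HTML spec can set the document's character encoding to the BOM-indicated value instead of the transport-layer one.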

@ArkadiuszMichalski

Similar bug: #1077

@domenic
Member Author

domenic commented Oct 16, 2016

Oh, thanks, this looks like a dupe of that.
