
The document's encoding doesn't properly get overwritten by a BOM #1910

Closed
domenic opened this issue Oct 15, 2016 · 4 comments
Assignees
Labels
clarification Standard could be clearer

Comments

@domenic
Member

domenic commented Oct 15, 2016

(This is a spec factoring issue, mostly.)

The only place in the spec that I can find that sets the document's character encoding is https://html.spec.whatwg.org/#determining-the-character-encoding:document's-character-encoding-3 which says

The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input byte stream.

Here, "this algorithm" is the "encoding sniffing algorithm". However, this algorithm doesn't deal with BOMs. BOMs are dealt with later, in https://html.spec.whatwg.org/#the-input-byte-stream:encoding-sniffing-algorithm:

Usually, the encoding sniffing algorithm defined below is used to determine the character encoding.

Given a character encoding, the bytes in the input byte stream must be converted to characters for the tokenizer's input stream, by passing the input byte stream and character encoding to decode.

So per spec, if you have a page with, say, Content-Type: text/html;charset=windows-1252 whose first few bytes are a UTF-8 BOM:

  1. The encoding sniffing algorithm detects windows-1252 as "the character encoding"
  2. It passes the input stream bytes + windows-1252 to the Encoding Standard's decode algorithm, which then decodes as UTF-8.
  3. However, per the first quote, the document's character encoding is immediately set to windows-1252, not UTF-8, since the BOM override is hidden inside the decode algorithm and invisible to the HTML spec.
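The mismatch in the steps above can be sketched as follows. This is an illustrative model, not spec text: the `decode` function below mimics the Encoding Standard's behavior of letting a BOM silently override the caller-supplied encoding, so the HTML spec's "document's character encoding" and the actual decoding disagree.

```python
# Sketch of why the document encoding and the actual decode can disagree.
# The Encoding Standard's decode algorithm checks for a BOM first and
# overrides the fallback encoding it was handed; the HTML spec never sees
# that override. (Only the UTF-8 BOM case is modeled here for brevity.)
UTF8_BOM = b"\xef\xbb\xbf"

def decode(stream: bytes, fallback: str) -> str:
    # BOM sniffing happens inside decode, invisible to the caller.
    if stream.startswith(UTF8_BOM):
        return stream[len(UTF8_BOM):].decode("utf-8")
    return stream.decode(fallback)

body = UTF8_BOM + "café".encode("utf-8")
document_encoding = "windows-1252"      # what the sniffing algorithm returned
text = decode(body, document_encoding)  # but the bytes decode as UTF-8
# text == "café", yet the document's character encoding was already
# set to windows-1252 per the first quote.
```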

One way of fixing this is to make the decode algorithm return both an output stream and an encoding, but I guess that will involve updating a lot of call sites, and is fairly inelegant. Another is to have a special operation that HTML uses instead of decode, which returns those two things. Maybe there is something better.

@domenic domenic added the clarification Standard could be clearer label Oct 15, 2016
@domenic
Member Author

domenic commented Oct 15, 2016

Maybe the best fix would be if https://html.spec.whatwg.org/#prescan-a-byte-stream-to-determine-its-encoding looked for the BOMs first (before the loop) and bailed out there. It duplicates things a bit, but seems less invasive.

Edit: no, that would allow the transport-layer encoding to override the BOM, which is incorrect. Nevermind. It would have to be in https://html.spec.whatwg.org/#encoding-sniffing-algorithm if we want to do it that way.

@domenic
Member Author

domenic commented Oct 15, 2016

How I ended up factoring this for jsdom:

  • I added a new "get BOM encoding" mechanism that performs step 4 of https://encoding.spec.whatwg.org/#decode
  • I made both Encoding's "decode" and HTML's "encoding sniffing algorithm" call out to "get BOM encoding".
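A minimal sketch of what such a shared "get BOM encoding" helper might look like (the name and signature are illustrative, taken from the comment above rather than from either spec). It performs the BOM check that step 4 of the Encoding Standard's decode algorithm describes, returning the BOM-indicated encoding label or `None`:

```python
from typing import Optional

def get_bom_encoding(data: bytes) -> Optional[str]:
    # Hypothetical helper mirroring step 4 of the Encoding Standard's
    # decode algorithm: match the leading bytes against the three BOMs.
    if data[:3] == b"\xef\xbb\xbf":
        return "utf-8"
    if data[:2] == b"\xfe\xff":
        return "utf-16be"
    if data[:2] == b"\xff\xfe":
        return "utf-16le"
    return None
```

With this factoring, both decode and the encoding sniffing algorithm can call the helper, so the HTML spec can set the document's character encoding to the BOM-indicated value instead of the transport-layer one.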

@ArkadiuszMichalski

Similar bug: #1077

@domenic
Member Author

domenic commented Oct 16, 2016

Oh, thanks, this looks like a dupe of that.
