
Cleanup around the HTML parser and how it sniffs encodings #1077

Closed
ArkadiuszMichalski opened this issue Apr 20, 2016 · 12 comments · Fixed by #5359
Assignees
Labels
clarification Standard could be clearer

Comments

@ArkadiuszMichalski

ArkadiuszMichalski commented Apr 20, 2016

Context
From the current text I can't determine the behavior of Document.characterSet|charset|inputEncoding when loading a document. In addition, @annevk indicated that "getting an output encoding" should only be used when encoding, not when decoding, yet algorithms that cooperate with the parser use it. It would be clearer to write "getting an output encoding to encode", because "encoding" means both the thing and the action. It looks like an unnecessary change was made in 6a31c26. You could revert it, or define "getting an output encoding to decode" (mapping UTF-16 to UTF-8 and x-user-defined to windows-1252) and use it where necessary.

But I'm still wondering where exactly the document's encoding is set when the input has a BOM. Where is the prose for that action? The sniffing algorithm doesn't mention the BOM, and decode doesn't change the document's encoding.
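The two hooks being contrasted here can be sketched roughly as follows (Python used purely for illustration; `get_output_encoding_to_decode` is the hypothetical variant proposed in this comment, not part of any spec):

```python
def get_output_encoding(encoding: str) -> str:
    """Sketch of the Encoding Standard's "get an output encoding" hook,
    meant only for *encoding* (e.g. <form> submission and URLs)."""
    if encoding in ("replacement", "UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    return encoding

def get_output_encoding_to_decode(encoding: str) -> str:
    """Hypothetical "get an output encoding to decode" variant suggested
    above: maps UTF-16 to UTF-8 and x-user-defined to windows-1252.
    The name and exact mapping are illustrative."""
    if encoding in ("UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    if encoding == "x-user-defined":
        return "windows-1252"
    return encoding
```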

annevk added a commit that referenced this issue Apr 24, 2016
In 6a31c26 I applied “get an output
encoding” a bit enthusiastically, which gives the wrong result for the
replacement encoding. It is only to be used when encoding (e.g., <form>
submission and URLs), not decoding.

This fixes the regression part of #1077.
@annevk
Member

annevk commented Apr 24, 2016

I created a fix for the regression.

The other problem is that the Encoding Standard ultimately decides what encoding is used for decoding, but it has no means to communicate that back to the caller (the caller just gets the decoded stream).

Not sure how this would be described ideally. @hsivonen @inexorabletash, opinions?

@ArkadiuszMichalski
Author

ArkadiuszMichalski commented Apr 25, 2016

In HTML, decode is used only by the HTML parser and when fetching a script. In the first case the BOM should correct the document's encoding, but for scripts I don't know whether the BOM affects any properties or slots.
Basically, you could just make this prose clearer, e.g. add "Also, when decode changes the encoding (due to a BOM), then...".

Decode is also used in XHR (text response), and again I don't know whether the BOM affects any properties or slots. More interesting is the document response; I wonder whether the BOM has any effect there. I need to find some tests.

@inexorabletash
Member

Re: Encoding - can the "decode" hook just return a tuple, i.e. "Return output and encoding"? I can see more work in HTML to distinguish e.g. the "sniffed encoding" from the "used encoding" and to actually assign the latter to the document's character encoding, as @ArkadiuszMichalski mentions.

And then update XHR to ignore or use the returned encoding (depending on the results of @ArkadiuszMichalski's test). Or, if we don't want to touch other dependent specs, do the same thing we'd do in a real language: rename "decode" to "detect encoding and decode" and add a "decode" with the old signature.

(But that seems obvious so I feel like I'm missing something.)
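The tuple-returning "decode" could look roughly like this (a minimal, non-streaming sketch in Python; the signature is hypothetical, not spec text):

```python
def decode(data: bytes, fallback: str):
    """Sketch of a "decode" hook that returns both the decoded text and
    the encoding actually used, so callers can assign the latter to the
    document's character encoding."""
    boms = ((b"\xef\xbb\xbf", "UTF-8", "utf-8"),
            (b"\xff\xfe", "UTF-16LE", "utf-16-le"),
            (b"\xfe\xff", "UTF-16BE", "utf-16-be"))
    for bom, label, codec in boms:
        if data.startswith(bom):
            # A BOM wins over the fallback, and its bytes are consumed.
            return data[len(bom):].decode(codec, errors="replace"), label
    return data.decode(fallback, errors="replace"), fallback
```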

@annevk
Member

annevk commented Apr 26, 2016

@inexorabletash I guess it is rather obvious, was just curious whether you had a preference.

@ArkadiuszMichalski
Author

ArkadiuszMichalski commented Apr 26, 2016

Back to XHR: the text response is correct. Browsers recognize a BOM at the beginning, remove it, and change the encoding. Firefox has a bug with the UTF-16 BOM (but the UTF-8 BOM works), and a test was created.

For the document response (step 5), Firefox and Chrome work the same as the HTML parser: they respect <meta>, and when a BOM exists they change the encoding and set the document's encoding to the correct value. In short:

- UTF-8 (encoding in editor), iso-8859-16 (<meta>) = document.characterSet: iso-8859-16
- UTF-8 (encoding in editor), iso-8859-16 (<meta>), UTF-16LE BOM = document.characterSet: UTF-16LE
- UTF-16LE (encoding in editor), UTF-16LE (<meta>) = document.characterSet: UTF-8
- UTF-16LE (encoding in editor), UTF-16LE (<meta>), UTF-16LE BOM = document.characterSet: UTF-16LE
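The third row (a UTF-16LE <meta> yielding UTF-8) matches how HTML treats encoding labels extracted from content during sniffing: UTF-16 labels are mapped to UTF-8, since bytes that can be prescanned as ASCII clearly aren't UTF-16. A rough, non-normative sketch of that mapping:

```python
def encoding_from_meta(label: str) -> str:
    """Rough sketch of HTML's handling of an encoding label found in a
    <meta> during sniffing: UTF-16 labels are mapped to UTF-8, and
    x-user-defined is mapped to windows-1252. Illustration only."""
    enc = label.strip().lower()
    if enc in ("utf-16", "utf-16le", "utf-16be"):
        return "UTF-8"
    if enc == "x-user-defined":
        return "windows-1252"
    return enc
```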

@annevk
Member

annevk commented Apr 26, 2016

FWIW, "encoding in editor" has no effect here. It's not an input to the algorithm since it cannot be determined.

@hsivonen
Member

Not sure how this would be described ideally. @hsivonen @inexorabletash, opinions?

I think the main problem with the integration of HTML and the Encoding Standard is that the Encoding Standard doesn't accommodate 1) incremental decoding and 2) internal encoding declarations.

The Encoding Standard makes BOM sniffing part of "decode", and the HTML Standard does not sniff for the BOM on its own, so spec-wise the UA is supposed to buffer up to 1024 bytes and sniff them before letting the "decode" step sniff for the BOM.

This does not match a reasonable implementation, such as Gecko. Gecko sniffs for the BOM incrementally as bytes arrive (doesn't even wait for 3 bytes like the Encoding Standard says if fewer bytes make a decision possible) and skips sniffing if a BOM was found.

I think the Encoding Standard should cater to the existence of languages like HTML and CSS that support non-BOM internal encoding declarations and allow the language-specific sniffing algorithms to be plugged in between BOM sniffing and the actual decode phase.

Spec-wise, if you want the Encoding Standard to continue to own BOM sniffing, I suggest extending the current "decode" algorithm such that it takes three inputs: a byte stream, a fallback encoding, and a (potentially null) sniffing state machine for an internal encoding declaration. Then make it output a stream of characters and the encoding actually used. Require that the sniffing state machine be able to operate incrementally so that it yields an encoding as soon as it has seen enough, and make the "decode" algorithm feed it up to 1024 bytes (as soon as the bytes arrive--not waiting for 1024 bytes first; it would be nice for the BOM sniffing not to buffer 3 bytes first if the first byte already can't be the first byte of a UTF-8 BOM).

Then have HTML and CSS specify their calls to decode as passing in a sniffing state machine.

Implementation-wise, I wouldn't make users of a streaming encoding conversion API pass in the state machine, but would make them perform BOM sniffing first, search for an internal encoding declaration second, and then invoke an incremental decoder.

FWIW, encoding-rs provides a low-level streaming API for encoders and decoders as well as high-level non-streaming convenience methods. The convenience methods decode and decode_with_replacement return a pair whose second item is the encoding that was actually used. (Likewise with encode_with_replacement. There's currently no convenience method for encoding without replacement, because I'm not aware of a use case in the Web Platform considering that encoding to UTF-8 is the same in both cases as UTF-8 has no unmappables.)
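The incremental BOM sniffing described above (deciding as soon as the bytes seen settle the question, without waiting for three bytes) can be sketched as a small state machine. This is an illustration of the idea, not Gecko's actual code or spec text:

```python
class BomSniffer:
    """Incremental BOM sniffer: decides as soon as the bytes seen so far
    either match a BOM or rule every BOM out. Illustrative sketch."""

    BOMS = ((b"\xef\xbb\xbf", "UTF-8"),
            (b"\xff\xfe", "UTF-16LE"),
            (b"\xfe\xff", "UTF-16BE"))

    def __init__(self):
        self.buf = b""
        self.encoding = None  # encoding named by a BOM, once decided
        self.decided = False

    def feed(self, byte: bytes) -> bool:
        """Feed one byte; return True once a decision has been made."""
        if self.decided:
            return True
        self.buf += byte
        for bom, enc in self.BOMS:
            if self.buf == bom:
                self.encoding = enc
                self.decided = True
                return True
        # Still undecided only while the bytes so far are a proper
        # prefix of some BOM; otherwise no BOM is possible.
        if any(bom.startswith(self.buf) for bom, _ in self.BOMS):
            return False
        self.decided = True
        return True
```

Note that a first byte like 0x41 decides immediately, matching the behavior described for Gecko.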

Firefox has bug when using UTF-16 BOM (but UTF-8 BOM works)

I checked recently and the UTF-8 BOM didn't work. It's one of the Gecko bugs I hope encoding-rs will fix by providing BOM sniffing right in the decoder objects for callers like text/plain XHR.

@hsivonen
Member

Oh, and the above doesn't account for the HTML spec currently allowing heuristic sniffing. If you still want to allow that, it's probably easier to make the HTML spec use a distinct BOM sniffing hook than to let BOM sniffing in the HTML context be baked into "decode".

@ArkadiuszMichalski
Author

ArkadiuszMichalski commented Apr 26, 2016

I checked recently and the UTF-8 BOM didn't work. It's one of the Gecko bugs I hope encoding-rs will fix by providing BOM sniffing right in the decoder objects for callers like text/plain XHR.

Ah, you're right, no BOM worked. But when I fetch a UTF-8 file with a UTF-8 BOM (the first 3 bytes), the BOM is somehow removed, because the first character of responseText is not the BOM (this confused me). So I set the header to another encoding (like ISO-8859-2), and when reading responseText the BOM is present, but the encoding was not changed to UTF-8 (it is still ISO-8859-2).

@andreubotella
Member

I was recently pointed to this issue, and I find it surprising that this problem hasn't been fixed in the specifications yet, even though it's not an issue in any of the three major browser engines, where the document's character encoding follows the BOM.

To recap, the fixes proposed so far, from easiest to hardest to implement, are:

  1. Add a new hook to the Encoding standard that takes a token stream and returns the BOM's encoding, if any. This could then be used in both Encoding's decode and HTML's encoding sniffing algorithms. @domenic implemented this in jsdom (The document's encoding doesn't properly get overwritten by a BOM #1910 (comment)), and from a conversation on IRC he still considers it a valid solution for the specifications.
  2. Have Encoding's decode algorithm return a tuple of a token stream and an encoding, as proposed by @inexorabletash for example.
  3. Finally, @hsivonen proposes that decode takes an algorithm / state machine as an additional optional parameter, so that the sniffing for format-specific encoding declarations takes place at the same time as the BOM sniffing, in order to better match implementations.

While it's worth matching the internal behavior of implementations, all of these options produce the same observable behavior. Furthermore, option 3 fails to take into account MIME Sniffing's algorithm to sniff a mislabeled feed, which would have to run in parallel with HTML's encoding sniffing (though I haven't checked what browsers actually do).

In any case, if a BOM is found, the encoding's confidence should be certain. While there is no real need to prevent user agents from changing the encoding while parsing, since the effective encoding would be identical, this helps make clear that there's no need to reparse.
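Option 1 can be sketched as a peek-only hook plus its use at the top of encoding sniffing (a rough Python illustration; the function names are made up, and the real HTML algorithm has many more steps):

```python
def bom_sniff(stream: bytes):
    """Peek-only hook: return the encoding named by a BOM, if any,
    without consuming any bytes from the stream."""
    if stream.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if stream.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if stream.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None

def sniff_encoding(stream: bytes, meta_encoding=None, fallback="windows-1252"):
    """Simplified sketch of encoding sniffing using the hook: a BOM
    beats <meta> and the fallback, and yields confidence "certain",
    so no reparse is ever needed in that case."""
    bom = bom_sniff(stream)
    if bom is not None:
        return bom, "certain"
    if meta_encoding is not None:
        return meta_encoding, "tentative"
    return fallback, "tentative"
```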

@annevk
Member

annevk commented Mar 9, 2020

I like option 1. It adds some redundancy, but that's okay, as these are legacy callers. New places would exclusively use UTF-8 anyway.

@andreubotella
Member

andreubotella commented Mar 14, 2020

I'm working on PRs for option 1, which I'll open in a few days if no one disagrees. @inexorabletash? @hsivonen?

andreubotella pushed a commit to andreubotella/html that referenced this issue Mar 16, 2020
This change fixes a bug where document's character encoding was set to
the return value of the encoding sniffing algorithm rather than to the
actual encoding used, which differed when the stream started with a
byte-order mark. This change incorporates BOM sniffing into the encoding
sniffing algorithm, ensuring both encodings are identical.

Closes whatwg#1077.
annevk pushed a commit to whatwg/encoding that referenced this issue Mar 24, 2020
This change moves the BOM splitting part of the decode hook into a separate hook which does not consume any bytes of the token stream.

This will allow fixing a long-standing issue in the HTML encoding sniffing algorithm with the document's character encoding being set to the wrong result when there is a BOM: whatwg/html#1077.

Closes #128.
annevk pushed a commit that referenced this issue Mar 24, 2020
This change fixes a bug where document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the actual encoding used, which differed when the stream started with a byte order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring both encodings are identical.

Tests: web-platform-tests/wpt#22276.

Closes #1077.
6 participants