
Cleanup around the HTML parser and how it sniffs encodings #1077

Closed
ArkadiuszMichalski opened this issue Apr 20, 2016 · 12 comments · Fixed by #5359
Assignees
Labels
clarification Standard could be clearer

Comments

@ArkadiuszMichalski

ArkadiuszMichalski commented Apr 20, 2016

Context
From the current text I can't determine the behavior of Document.characterSet|charset|inputEncoding when loading a document. In addition, @annevk indicated that "getting an output encoding" should only be used when encoding, not when decoding, yet algorithms that cooperate with the parser use it. It would be clearer to write "getting an output encoding to encode", because "encoding" means both the thing and the action. It looks like an unnecessary change was made in 6a31c26. You could revert it, or define "getting an output encoding to decode" (mapping UTF-16 to UTF-8 and x-user-defined to windows-1252) and use it where necessary.

But I'm still wondering where exactly the document's encoding is set when the input has a BOM. Where is the prose for that action? The sniffing algorithm doesn't mention the BOM, and decode doesn't change the document's encoding.
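The two hooks being contrasted here can be sketched roughly as follows (Python used purely for illustration; `get_output_encoding_to_decode` is the hypothetical variant proposed in this comment, not part of any spec):

```python
def get_output_encoding(encoding: str) -> str:
    """Sketch of the Encoding Standard's "get an output encoding" hook,
    meant only for *encoding* (e.g. <form> submission and URLs)."""
    if encoding in ("replacement", "UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    return encoding

def get_output_encoding_to_decode(encoding: str) -> str:
    """Hypothetical "get an output encoding to decode" variant suggested
    above: maps UTF-16 to UTF-8 and x-user-defined to windows-1252.
    The name and exact mapping are illustrative."""
    if encoding in ("UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    if encoding == "x-user-defined":
        return "windows-1252"
    return encoding
```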

annevk added a commit that referenced this issue Apr 24, 2016
In 6a31c26 I applied “get an output
encoding” a bit enthusiastically, which gives the wrong result for the
replacement encoding. It is only to be used when encoding (e.g., <form>
submission and URLs), not decoding.

This fixes the regression part of #1077.
@annevk
Member

annevk commented Apr 24, 2016

I created a fix for the regression.

The other problem is that the Encoding Standard ultimately decides what encoding is used for decoding, but it has no means to communicate that back to the caller (the caller just gets the decoded stream).

Not sure how this would be described ideally. @hsivonen @inexorabletash, opinions?

@ArkadiuszMichalski
Author

ArkadiuszMichalski commented Apr 25, 2016

In HTML, decode is used only by the HTML parser and when fetching a script. In the first case the BOM should correct the document's encoding, but for scripts I don't know whether the BOM affects any properties or slots.
Basically, you could just make this prose clearer, e.g. add "Also, when decode changes the encoding (due to a BOM), then...".

Decode is also used in XHR (text response), and again I don't know whether the BOM affects any properties or slots. More interesting is the document response; I wonder whether the BOM has any effect there. I need to find some tests.

@inexorabletash
Member

Re: Encoding - can the "decode" hook just return a tuple, i.e. "Return output and encoding"? I can see more work in HTML to distinguish e.g. the "sniffed encoding" from the "used encoding" and to actually assign the latter to the document's character encoding, as @ArkadiuszMichalski mentions.

And then update XHR to ignore or use the returned encoding (depending on the results of @ArkadiuszMichalski's test). Or, if we don't want to touch other dependent specs, do the same thing we'd do in a real language: rename "decode" to "detect encoding and decode" and add a "decode" with the old signature.

(But that seems obvious so I feel like I'm missing something.)
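The tuple-returning "decode" could look roughly like this (a minimal, non-streaming sketch in Python; the signature is hypothetical, not spec text):

```python
def decode(data: bytes, fallback: str):
    """Sketch of a "decode" hook that returns both the decoded text and
    the encoding actually used, so callers can assign the latter to the
    document's character encoding."""
    boms = ((b"\xef\xbb\xbf", "UTF-8", "utf-8"),
            (b"\xff\xfe", "UTF-16LE", "utf-16-le"),
            (b"\xfe\xff", "UTF-16BE", "utf-16-be"))
    for bom, label, codec in boms:
        if data.startswith(bom):
            # A BOM wins over the fallback, and its bytes are consumed.
            return data[len(bom):].decode(codec, errors="replace"), label
    return data.decode(fallback, errors="replace"), fallback
```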

@annevk
Member

annevk commented Apr 26, 2016

@inexorabletash I guess it is rather obvious, was just curious whether you had a preference.

@ArkadiuszMichalski
Author

ArkadiuszMichalski commented Apr 26, 2016

Back to XHR: the text response is correct. Browsers recognize a BOM at the beginning, remove it, and change the encoding. Firefox has a bug with the UTF-16 BOM (but the UTF-8 BOM works), and a test was created.

For the document response (step 5), Firefox and Chrome work the same as the HTML parser: they respect <meta>, and when a BOM exists they change the encoding and set the document's encoding to the correct value. In short:

- UTF-8 (encoding in editor), iso-8859-16 (<meta>) = document.characterSet: iso-8859-16
- UTF-8 (encoding in editor), iso-8859-16 (<meta>), UTF-16LE BOM = document.characterSet: UTF-16LE
- UTF-16LE (encoding in editor), UTF-16LE (<meta>) = document.characterSet: UTF-8
- UTF-16LE (encoding in editor), UTF-16LE (<meta>), UTF-16LE BOM = document.characterSet: UTF-16LE
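The third row (a UTF-16LE <meta> yielding UTF-8) matches how HTML treats encoding labels extracted from content during sniffing: UTF-16 labels are mapped to UTF-8, since bytes that can be prescanned as ASCII clearly aren't UTF-16. A rough, non-normative sketch of that mapping:

```python
def encoding_from_meta(label: str) -> str:
    """Rough sketch of HTML's handling of an encoding label found in a
    <meta> during sniffing: UTF-16 labels are mapped to UTF-8, and
    x-user-defined is mapped to windows-1252. Illustration only."""
    enc = label.strip().lower()
    if enc in ("utf-16", "utf-16le", "utf-16be"):
        return "UTF-8"
    if enc == "x-user-defined":
        return "windows-1252"
    return enc
```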

@annevk
Member

annevk commented Apr 26, 2016

FWIW, "encoding in editor" has no effect here. It's not an input to the algorithm since it cannot be determined.

@hsivonen
Member

Not sure how this would be described ideally. @hsivonen @inexorabletash, opinions?

I think the main problem with the integration of HTML and the Encoding Standard is that the Encoding Standard doesn't accommodate 1) incremental decoding and 2) internal encoding declarations.

The Encoding Standard makes BOM sniffing part of "decode", and the HTML Standard does not sniff for the BOM on its own, so spec-wise the UA is supposed to buffer up to 1024 bytes and sniff them before letting the "decode" step sniff for the BOM.

This does not match a reasonable implementation, such as Gecko. Gecko sniffs for the BOM incrementally as bytes arrive (doesn't even wait for 3 bytes like the Encoding Standard says if fewer bytes make a decision possible) and skips sniffing if a BOM was found.

I think the Encoding Standard should cater to the existence of languages like HTML and CSS that support non-BOM internal encoding declarations and allow the language-specific sniffing algorithms to be plugged in between BOM sniffing and the actual decode phase.

Spec-wise, if you want the Encoding Standard to continue to own BOM sniffing, I suggest extending the current "decode" algorithm such that it takes three inputs: a byte stream, a fallback encoding, and a (potentially null) sniffing state machine for an internal encoding declaration. Then make it output a stream of characters and the encoding actually used. Require that the sniffing state machine be able to operate incrementally so that it yields an encoding as soon as it has seen enough, and make the "decode" algorithm feed it up to 1024 bytes (as soon as the bytes arrive--not waiting for 1024 bytes first; it would be nice for the BOM sniffing not to buffer 3 bytes first if the first byte already can't be the first byte of a UTF-8 BOM).

Then have HTML and CSS specify their calls to decode as passing in a sniffing state machine.

Implementation-wise, I wouldn't make users of a streaming encoding conversion API pass in the state machine, but would make them perform BOM sniffing first, search for an internal encoding declaration second, and then invoke an incremental decoder.

FWIW, encoding-rs provides a low-level streaming API for encoders and decoders as well as high-level non-streaming convenience methods. The convenience methods decode and decode_with_replacement return a pair whose second item is the encoding that was actually used. (Likewise with encode_with_replacement. There's currently no convenience method for encoding without replacement, because I'm not aware of a use case in the Web Platform considering that encoding to UTF-8 is the same in both cases as UTF-8 has no unmappables.)
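The incremental BOM sniffing described above (deciding as soon as the bytes seen settle the question, without waiting for three bytes) can be sketched as a small state machine. This is an illustration of the idea, not Gecko's actual code or spec text:

```python
class BomSniffer:
    """Incremental BOM sniffer: decides as soon as the bytes seen so far
    either match a BOM or rule every BOM out. Illustrative sketch."""

    BOMS = ((b"\xef\xbb\xbf", "UTF-8"),
            (b"\xff\xfe", "UTF-16LE"),
            (b"\xfe\xff", "UTF-16BE"))

    def __init__(self):
        self.buf = b""
        self.encoding = None  # encoding named by a BOM, once decided
        self.decided = False

    def feed(self, byte: bytes) -> bool:
        """Feed one byte; return True once a decision has been made."""
        if self.decided:
            return True
        self.buf += byte
        for bom, enc in self.BOMS:
            if self.buf == bom:
                self.encoding = enc
                self.decided = True
                return True
        # Still undecided only while the bytes so far are a proper
        # prefix of some BOM; otherwise no BOM is possible.
        if any(bom.startswith(self.buf) for bom, _ in self.BOMS):
            return False
        self.decided = True
        return True
```

Note that a first byte like 0x41 decides immediately, matching the behavior described for Gecko.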

Firefox has bug when using UTF-16 BOM (but UTF-8 BOM works)

I checked recently and the UTF-8 BOM didn't work. It's one of the Gecko bugs I hope encoding-rs will fix by providing BOM sniffing right in the decoder objects for callers like text/plain XHR.

@hsivonen
Member

Oh, and the above doesn't account for the HTML spec currently allowing heuristic sniffing. If you still want to allow that, it's probably easier to make the HTML spec use a distinct BOM sniffing hook than to let BOM sniffing in the HTML context be baked into "decode".

@ArkadiuszMichalski
Author

ArkadiuszMichalski commented Apr 26, 2016

I checked recently and the UTF-8 BOM didn't work. It's one of the Gecko bugs I hope encoding-rs will fix by providing BOM sniffing right in the decoder objects for callers like text/plain XHR.

Ah, you're right, no BOM worked. But when I fetch a UTF-8 file with a UTF-8 BOM (the first 3 bytes), the BOM is somehow removed, because the first character of responseText is not the BOM (this confused me). So I set the header to another encoding (like ISO-8859-2), and when reading responseText the BOM is present, but the encoding was not changed to UTF-8 (it is still ISO-8859-2).

@andreubotella
Member

I was recently pointed to this issue, and I find it surprising that this problem hasn't been fixed in the specifications yet, even though it's not an issue in any of the three major browser engines, where the document's character encoding follows the BOM.

To recap, the fixes proposed so far, from easiest to hardest to implement, are:

  1. Add a new hook to the Encoding standard that takes a token stream and returns the BOM's encoding, if any. This could then be used in both Encoding's decode and HTML's encoding sniffing algorithms. @domenic implemented this in jsdom (The document's encoding doesn't properly get overwritten by a BOM #1910 (comment)), and from a conversation on IRC he still considers it a valid solution for the specifications.
  2. Have Encoding's decode algorithm return a tuple of a token stream and an encoding, as proposed by @inexorabletash for example.
  3. Finally, @hsivonen proposes that decode takes an algorithm / state machine as an additional optional parameter, so that the sniffing for format-specific encoding declarations takes place at the same time as the BOM sniffing, in order to better match implementations.

While it's worth matching the internal behavior of implementations, all of these options produce the same observable behavior. Furthermore, option 3 fails to take into account MIME Sniffing's algorithm to sniff a mislabeled feed, which would have to run in parallel with HTML's encoding sniffing (though I haven't checked what browsers actually do).

In any case, if a BOM is found, the encoding's confidence should be certain. While there is no real need to prevent user agents from changing the encoding while parsing, since the effective encoding would be identical, this helps make clear that there's no need to reparse.
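Option 1 can be sketched as a peek-only hook plus its use at the top of encoding sniffing (a rough Python illustration; the function names are made up, and the real HTML algorithm has many more steps):

```python
def bom_sniff(stream: bytes):
    """Peek-only hook: return the encoding named by a BOM, if any,
    without consuming any bytes from the stream."""
    if stream.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if stream.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if stream.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None

def sniff_encoding(stream: bytes, meta_encoding=None, fallback="windows-1252"):
    """Simplified sketch of encoding sniffing using the hook: a BOM
    beats <meta> and the fallback, and yields confidence "certain",
    so no reparse is ever needed in that case."""
    bom = bom_sniff(stream)
    if bom is not None:
        return bom, "certain"
    if meta_encoding is not None:
        return meta_encoding, "tentative"
    return fallback, "tentative"
```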

@annevk
Member

annevk commented Mar 9, 2020

I like option 1. It adds some redundancy, but that's okay, as these are legacy callers. New places would exclusively use UTF-8 anyway.

@andreubotella
Member

andreubotella commented Mar 14, 2020

I'm working on PRs for option 1, which I'll open in a few days if no one disagrees. @inexorabletash? @hsivonen?

andreubotella pushed a commit to andreubotella/html that referenced this issue Mar 16, 2020
This change fixes a bug where document's character encoding was set to
the return value of the encoding sniffing algorithm rather than to the
actual encoding used, which differed when the stream started with a
byte-order mark. This change incorporates BOM sniffing into the encoding
sniffing algorithm, ensuring both encodings are identical.

Closes whatwg#1077.
annevk pushed a commit to whatwg/encoding that referenced this issue Mar 24, 2020
This change moves the BOM splitting part of the decode hook into a separate hook which does not consume any bytes of the token stream.

This will allow fixing a long-standing issue in the HTML encoding sniffing algorithm with the document's character encoding being set to the wrong result when there is a BOM: whatwg/html#1077.

Closes #128.
annevk pushed a commit that referenced this issue Mar 24, 2020
This change fixes a bug where document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the actual encoding used, which differed when the stream started with a byte order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring both encodings are identical.

Tests: web-platform-tests/wpt#22276.

Closes #1077.
6 participants