Implement character encoding detection / conversion #18

Open
kmcallister opened this Issue Jul 31, 2014 · 4 comments


@kmcallister
Contributor

See the HTML spec and the WHATWG Encoding spec. This also entails noticing <meta charset=...> and <meta http-equiv="Content-Type"> as we parse.

@SimonSapin
Member

rust-encoding implements the Encoding spec. We already use it in Servo for CSS.

HTML’s "prescan a byte stream to determine its encoding" algorithm (the one that looks for <meta charset=...> and <meta http-equiv="Content-Type">) works on the byte stream, as its name says, before tokenization starts.

I believe html5ever should implement the encoding sniffing algorithm (of which the prescan is a part) independently of the tokenizer and parser (although they might share code internally), so that the overall parsing is, conceptually (a rough sketch follows the list):

  1. Determine (sniff) the encoding from bytes and metadata
  2. Decode bytes to Unicode text (using rust-encoding) with the encoding chosen at step 1
  3. Tokenize text to tokens
  4. Parse tokens into a tree

(It would also be nice to have some kind of Unicode stream: rust-lang/rfcs#57)
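
As a rough sketch of steps 1 and 2 only, under simplifying assumptions (the function name sniff_and_decode is invented, the sniffing looks only at a BOM and a transport-layer label, and BOM bytes are not stripped), using rust-encoding's encoding_from_whatwg_label and DecoderTrap; the resulting text would then be handed to the tokenizer (step 3) and tree builder (step 4):

extern crate encoding;

use encoding::label::encoding_from_whatwg_label;
use encoding::{DecoderTrap, Encoding, EncodingRef};

// Illustrative only, not html5ever API: steps 1 and 2 of the pipeline above.
fn sniff_and_decode(bytes: &[u8], transport_label: Option<&str>) -> String {
    // Step 1 (simplified): a BOM wins, then the transport-layer label, then a
    // UTF-8 fallback. A full sniffer would also run the 1024-byte <meta>
    // prescan and follow the spec's precedence rules; BOM bytes are left in
    // place here for brevity.
    let enc: EncodingRef = if bytes.starts_with(b"\xEF\xBB\xBF") {
        encoding::all::UTF_8
    } else if bytes.starts_with(b"\xFF\xFE") {
        encoding::all::UTF_16LE
    } else if bytes.starts_with(b"\xFE\xFF") {
        encoding::all::UTF_16BE
    } else {
        transport_label
            .and_then(encoding_from_whatwg_label)
            .unwrap_or(encoding::all::UTF_8)
    };
    // Step 2: decode the bytes to Unicode text, replacing malformed sequences.
    enc.decode(bytes, DecoderTrap::Replace)
        .unwrap_or_else(|e| e.into_owned())
}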

@SimonSapin
Member

Per the spec, the tree builder can also reload the document with a different encoding when it encounters a relevant <meta> tag: https://html.spec.whatwg.org/multipage/#parsing-main-inhead:change-the-encoding

This seems annoying, given that html5ever is not necessarily running in Servo. @hsivonen suggests on #whatwg @ Freenode that we might get away with ignoring that part of the spec:

15:13 SimonSapin hsivonen: Do you know if there is a way around keeping an unbounded amount of input in memory in case the parser decides to "change the encoding", when parsing HTML from bytes?
15:14 hsivonen SimonSapin: there intentionally is not supposed to be one
15:14 SimonSapin and, to test if browsers do it, are you aware of a case where "prescan a byte stream to determine its encoding" would fail to find an encoding when given 1024 bytes, but tree construction would still "change the encoding" based on a meta start tag in the first 1024 bytes?
15:16 SimonSapin hsivonen: intentionally? Why?
15:19 hsivonen SimonSapin: to answer the previous question: after the 1024-byte boundary, the parser instance commits to one encoding. However, a late meta or a Japanese/Russian/Ukrainian detector can still trigger a reload with a different encoding
15:20 hsivonen SimonSapin: in which case a new parser instance starts a new parse
15:20 SimonSapin hsivonen: I’m trying to decide what to do in html5ever, which doesn’t necessarily have a notion of reload
15:21 hsivonen SimonSapin: IIRC, WebKit/Blink doesn't support late triggering a reload. I don't know if they do it for their Japanese detection
15:22 hsivonen SimonSapin: as for "intentionally", the intention of the 1024-byte boundary is precisely to make sure that the parser doesn't keep buffering forever and not produce any output
15:22 hsivonen SimonSapin: I suggest committing to an encoding at the latest when you've seen 1024 bytes
15:23 hsivonen SimonSapin: I can't recall why I implemented the late thing in the new parser
15:23 hsivonen SimonSapin: initially, I made the detectors see at most 1024 bytes so that they couldn't trigger a reload
15:23 hsivonen SimonSapin: but that broke Japanese Planet Debian
15:23 hsivonen SimonSapin: and people get really nervous if you break a Japanese site
15:24 hsivonen so...
15:24 hsivonen Japanese Planet Debian has since been fixed
15:25 hsivonen it's quite possible that we could get rid of the Russian and Ukrainian detectors and limit the Japanese detector to 1024 bytes and the sky wouldn't fall
15:25 SimonSapin hsivonen: I see, thanks. So only run the byte-based prescanner, or can tree construction find meta tags that the pre-scanner doesn’t?
15:26 SimonSapin hsivonen: I’m referring to https://html.spec.whatwg.org/mult...main-inhead:change-the-encoding
15:26 hsivonen SimonSapin: I suggest only running the prescanner. (but I bet it's possible to construct something that the prescanner doesn't see but the tree builder sees)
15:27 annevk hsivonen: WebKit only has a Japanese detector iirc
15:27 hsivonen SimonSapin: oh. the reason I added support for late may be that the spec said so!
15:27 hsivonen SimonSapin: but IIRC, WebKit doesn't honor the spec there
15:27 annevk we should fix the spec
15:28 hsivonen SimonSapin: it's quite possible that the spec says so because the old parser in Gecko behaved like that
15:28 hsivonen SimonSapin: I'm not sure what IE did at the time the spec was written, but my vague recollection is that it supported late
15:30 hsivonen hmm. an obvious way to create a <meta> seen by the tree builder but not by the prescanner is, of course, document.write
15:30 jgraham This wasn't one of the cases where Hixie was concerned about the security impact of an attacker that could cause early termination of the byte stream?
15:30 hsivonen jgraham: I don't recall this topic co-occurring with that topic
15:31 * jgraham isn't quite sure what such an attack would look like given incremental parsing
15:31 hsivonen jgraham: that was about comments and scripts
15:32 jgraham OK
15:32 hsivonen SimonSapin: so I suggest 1) implementing just the prescan until 1024 bytes, 2) being aware that you might end up having to implement something that allows you to signal to the browsing context to reload if #1 Breaks the Web, 3) researching if old IE actually supports late and if it doesn't, filing a spec bug
15:33 hsivonen it's possible that a spec bug is warranted just based on the success of WebKit, though
15:33 SimonSapin hsivonen: Chrome doesn’t reload, it switches encodings mid-stream: https://gist.github.com/anonymous/addad9f51781a6cd2cf9
15:33 SimonSapin Firefox reloads
15:33 hsivonen SimonSapin: whoa!
15:34 SimonSapin Firefox makes two HTTP requests
15:34 hsivonen SimonSapin: Firefox making two requests is expected
15:34 hsivonen SimonSapin: the Chrome behavior is news to me
15:35 SimonSapin Chrome 46 dev, don’t have Release at hand
15:37 hsivonen annevk: my current assumption is that the Russian and Ukrainian detectors misfiring is a greater problem than the problems they fix, but I don't have proof
15:38 hsivonen annevk: I want to get rid of those two detectors but I feel I need something more concrete than a guess that they have negative utility

In the test case below, Chrome 46 dev switches encodings mid-stream (but not exactly at the point of the <meta> tag) rather than reloading: it renders è, then èèé. Firefox 39 renders è, then èè, then reloads (makes a new HTTP request), renders èè, then èèé.

import time
from wsgiref.simple_server import make_server

def simple_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/html')]
    start_response(status, headers)
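    # First chunk: a ~1024-byte comment fills the prescan window, so no
    # encoding declaration is found there; it ends with one non-ASCII character.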
    yield u"""
<!DOCTYPE html>
<!-- 1024 bytes:
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
-->
è""".encode('utf8')
    time.sleep(2)
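    # Second chunk, sent after the delay: a late <meta charset> well past the
    # 1024-byte boundary, with non-ASCII characters before and after it.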
    yield u"è<meta charset=utf-8>é".encode('utf8')


httpd = make_server('', 8000, simple_app)
print("Listening on port 8000....")
httpd.serve_forever()
@gsnedders

@SimonSapin what happens when it's in response to a POST request?

@bors-servo added a commit that referenced this issue Jan 26, 2016
@bors-servo Auto merge of #188 - servo:tendrilsink, r=nox
Rewrite the high-level API (driver module) to use TendrilSink

This depends on servo/tendril#23.

This also adds an API to parse from bytes, which is part of #18.

r? @nox

2d9d8ac
@SimonSapin
Member

#188 added a from_bytes method to html5ever::driver::Parser that does some of this (pick an encoding from Content-Type charset or a BOM, and use rust-encoding).

It should be possible to modify the html5ever::driver::detect_encoding function to do the rest of this (in particular scanning for <meta> elements) without further API changes.
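
For illustration, a much-simplified sketch of the <meta> scanning part; prescan_meta_charset is an invented standalone helper, not detect_encoding's actual signature, and it skips most of the spec's "prescan a byte stream to determine its encoding" algorithm (no <meta http-equiv="Content-Type"> handling, no comment skipping, minimal attribute parsing):

// Much-simplified illustration, not html5ever code and not the full spec
// algorithm: naively find the first "charset=" within the first 1024 bytes
// and return its value as an encoding label. Because it ignores tag and
// comment structure, it can misfire on "charset=" appearing in ordinary text.
fn prescan_meta_charset(bytes: &[u8]) -> Option<String> {
    // Only the first 1024 bytes participate in the prescan.
    let window = &bytes[..bytes.len().min(1024)];
    let lower = window.to_ascii_lowercase();
    let needle: &[u8] = b"charset=";
    let start = lower.windows(needle.len()).position(|w| w == needle)? + needle.len();
    // Accept an optionally quoted value and stop at the first delimiter.
    let value = &lower[start..];
    let value = value
        .strip_prefix(b"\"")
        .or_else(|| value.strip_prefix(b"'"))
        .unwrap_or(value);
    let end = value
        .iter()
        .position(|&b| b"\"' ;/>".contains(&b))
        .unwrap_or(value.len());
    let label = String::from_utf8_lossy(&value[..end]).trim().to_string();
    if label.is_empty() { None } else { Some(label) }
}

Capping the scan at 1024 bytes matches hsivonen's suggestion above to commit to an encoding at that boundary; the prescan result would rank below the BOM and Content-Type checks that from_bytes already performs.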
