Implement character encoding detection / conversion #18

Open
kmcallister opened this Issue Jul 31, 2014 · 4 comments


@kmcallister
Contributor

See the HTML spec and the WHATWG Encoding spec. This also entails noticing <meta charset=...> and <meta http-equiv="Content-Type"> as we parse.

@SimonSapin
Member

rust-encoding implements the Encoding spec. We already use it in Servo for CSS.

HTML’s "prescan a byte stream to determine its encoding" algorithm (the one that looks for <meta charset=...> and <meta http-equiv="Content-Type">) works on the byte stream, as its name says, before tokenization starts.

I believe html5ever should implement the encoding sniffing algorithm (of which the prescan is a part) independently of the tokenizer and parser (although they might share code internally), so that the overall parsing is, conceptually (a rough sketch follows the list):

  1. Determine (sniff) the encoding from bytes and metadata
  2. Decode bytes to Unicode text (using rust-encoding) with the encoding chosen at step 1
  3. Tokenize text to tokens
  4. Parse tokens into a tree

(It would also be nice to have some kind of Unicode stream: rust-lang/rfcs#57)
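
As a rough sketch of steps 1 and 2 only, under simplifying assumptions (the function name sniff_and_decode is invented, the sniffing looks only at a BOM and a transport-layer label, and BOM bytes are not stripped), using rust-encoding's encoding_from_whatwg_label and DecoderTrap; the resulting text would then be handed to the tokenizer (step 3) and tree builder (step 4):

extern crate encoding;

use encoding::label::encoding_from_whatwg_label;
use encoding::{DecoderTrap, Encoding, EncodingRef};

// Illustrative only, not html5ever API: steps 1 and 2 of the pipeline above.
fn sniff_and_decode(bytes: &[u8], transport_label: Option<&str>) -> String {
    // Step 1 (simplified): a BOM wins, then the transport-layer label, then a
    // UTF-8 fallback. A full sniffer would also run the 1024-byte <meta>
    // prescan and follow the spec's precedence rules; BOM bytes are left in
    // place here for brevity.
    let enc: EncodingRef = if bytes.starts_with(b"\xEF\xBB\xBF") {
        encoding::all::UTF_8
    } else if bytes.starts_with(b"\xFF\xFE") {
        encoding::all::UTF_16LE
    } else if bytes.starts_with(b"\xFE\xFF") {
        encoding::all::UTF_16BE
    } else {
        transport_label
            .and_then(encoding_from_whatwg_label)
            .unwrap_or(encoding::all::UTF_8)
    };
    // Step 2: decode the bytes to Unicode text, replacing malformed sequences.
    enc.decode(bytes, DecoderTrap::Replace)
        .unwrap_or_else(|e| e.into_owned())
}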

@SimonSapin
Member

Per the spec, the tree builder can also reload the document with a different encoding when it encounters a relevant <meta> tag: https://html.spec.whatwg.org/multipage/#parsing-main-inhead:change-the-encoding

This seems annoying, given that html5ever is not necessarily running in Servo. @hsivonen suggests on #whatwg @ Freenode that we might get away with ignoring that part of the spec:

15:13 SimonSapin hsivonen: Do you know if there is a way around keeping an unbounded amount of input in memory in case the parser decides to "change the encoding", when parsing HTML from bytes?
15:14 hsivonen SimonSapin: there intentionally is not supposed to be one
15:14 SimonSapin and, to test if browsers do it, are you aware of a case where "prescan a byte stream to determine its encoding" would fail to find an encoding when given 1024 bytes, but tree construction would still "change the encoding" based on a meta start tag in the first 1024 bytes?
15:16 SimonSapin hsivonen: intentionally? Why?
15:19 hsivonen SimonSapin: to answer the previous question: after the 1024-byte boundary, the parser instance commits to one encoding. However, a late meta or a Japanese/Russian/Ukrainian detector can still trigger a reload with a different encoding
15:20 hsivonen SimonSapin: in which case a new parser instance starts a new parse
15:20 SimonSapin hsivonen: I’m trying to decide what to do in html5ever, which doesn’t necessarily have a notion of reload
15:21 hsivonen SimonSapin: IIRC, WebKit/Blink doesn't support late triggering a reload. I don't know if they do it for their Japanese detection
15:22 hsivonen SimonSapin: as for "intentionally", the intention of the 1024-byte boundary is precisely to make sure that the parser doesn't keep buffering forever and not produce any output
15:22 hsivonen SimonSapin: I suggest committing to an encoding at the latest when you've seen 1024 bytes
15:23 hsivonen SimonSapin: I can't recall why I implemented the late thing in the new parser
15:23 hsivonen SimonSapin: initially, I made the detectors see at most 1024 bytes so that they couldn't trigger a reload
15:23 hsivonen SimonSapin: but that broke Japanese Planet Debian
15:23 hsivonen SimonSapin: and people get really nervous if you break a Japanese site
15:24 hsivonen so...
15:24 hsivonen Japanese Planet Debian has since been fixed
15:25 hsivonen it's quite possible that we could get rid of the Russian and Ukrainian detectors and limit the Japanese detector to 1024 bytes and the sky wouldn't fall
15:25 SimonSapin hsivonen: I see, thanks. So only run the byte-based prescanner, or can tree construction find meta tags that the pre-scanner doesn’t?
15:26 SimonSapin hsivonen: I’m referring to https://html.spec.whatwg.org/mult...main-inhead:change-the-encoding
15:26 hsivonen SimonSapin: I suggest only running the prescanner. (but I bet it's possible to construct something that the prescanner doesn't see but the tree builder sees)
15:27 annevk hsivonen: WebKit only has a Japanese detector iirc
15:27 hsivonen SimonSapin: oh. the reason I added support for late may be that the spec said so!
15:27 hsivonen SimonSapin: but IIRC, WebKit doesn't honor the spec there
15:27 annevk we should fix the spec
15:28 hsivonen SimonSapin: it's quite possible that the spec says so because the old parser in Gecko behaved like that
15:28 hsivonen SimonSapin: I'm not sure what IE did at the time the spec was written, but my vague recollection is that it supported late
15:30 hsivonen hmm. an obvious way to create a <meta> seen by the tree builder but not by the prescanner is, of course, document.write
15:30 jgraham This wasn't one of the cases where Hixie was concerned about the security impact of an attacker that could cause early termination of the byte stream?
15:30 hsivonen jgraham: I don't recall this topic co-occurring with that topic
15:31 * jgraham isn't quite sure what such an attack would look like given incremental parsing
15:31 hsivonen jgraham: that was about comments and scripts
15:32 jgraham OK
15:32 hsivonen SimonSapin: so I suggest 1) implementing just the prescan until 1024 bytes, 2) being aware that you might end up having to implement something that allows you to signal to the browsing context to reload if #1 Breaks the Web, 3) researching if old IE actually supports late and if it doesn't, filing a spec bug
15:33 hsivonen it's possible that a spec bug is warranted just based on the success of WebKit, though
15:33 SimonSapin hsivonen: Chrome doesn’t reload, it switches encodings mid-stream: https://gist.github.com/anonymous/addad9f51781a6cd2cf9
15:33 SimonSapin Firefox reloads
15:33 hsivonen SimonSapin: whoa!
15:34 SimonSapin Firefox makes two HTTP requests
15:34 hsivonen SimonSapin: Firefox making two requests is expected
15:34 hsivonen SimonSapin: the Chrome behavior is news to me
15:35 SimonSapin Chrome 46 dev, don’t have Release at hand
15:37 hsivonen annevk: my current assumption is that the Russian and Ukrainian detectors misfiring is a greater problem than the problems they fix, but I don't have proof
15:38 hsivonen annevk: I want to get rid of those two detectors but I feel I need something more concrete than a guess that they have negative utility

In the test case below, Chrome 46 dev switches encodings mid-stream (but not exactly at the point of the <meta> tag) rather than reloading: it renders è, then èèé. Firefox 39 renders è, then èè, then reloads (makes a new HTTP request), renders èè, then èèé.

import time
from wsgiref.simple_server import make_server

def simple_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/html')]
    start_response(status, headers)
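    # First chunk: a ~1024-byte comment fills the prescan window, so no
    # encoding declaration is found there; it ends with one non-ASCII character.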
    yield u"""
<!DOCTYPE html>
<!-- 1024 bytes:
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
-->
è""".encode('utf8')
    time.sleep(2)
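    # Second chunk, sent after the delay: a late <meta charset> well past the
    # 1024-byte boundary, with non-ASCII characters before and after it.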
    yield u"è<meta charset=utf-8>é".encode('utf8')


httpd = make_server('', 8000, simple_app)
print("Listening on port 8000....")
httpd.serve_forever()
@gsnedders

@SimonSapin what happens when it's in response to a POST request?

@bors-servo added a commit that referenced this issue Jan 26, 2016
@bors-servo Auto merge of #188 - servo:tendrilsink, r=nox
Rewrite the high-level API (driver module) to use TendrilSink

This depends on servo/tendril#23.

This also adds an API to parse from bytes, which is part of #18.

r? @nox

2d9d8ac
@SimonSapin
Member

#188 added a from_bytes method to html5ever::driver::Parser that does some of this (pick an encoding from Content-Type charset or a BOM, and use rust-encoding).

It should be possible to modify the html5ever::driver::detect_encoding function to do the rest of this (in particular scanning for <meta> elements) without further API changes.
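
For illustration, a much-simplified sketch of the <meta> scanning part; prescan_meta_charset is an invented standalone helper, not detect_encoding's actual signature, and it skips most of the spec's "prescan a byte stream to determine its encoding" algorithm (no <meta http-equiv="Content-Type"> handling, no comment skipping, minimal attribute parsing):

// Much-simplified illustration, not html5ever code and not the full spec
// algorithm: naively find the first "charset=" within the first 1024 bytes
// and return its value as an encoding label. Because it ignores tag and
// comment structure, it can misfire on "charset=" appearing in ordinary text.
fn prescan_meta_charset(bytes: &[u8]) -> Option<String> {
    // Only the first 1024 bytes participate in the prescan.
    let window = &bytes[..bytes.len().min(1024)];
    let lower = window.to_ascii_lowercase();
    let needle: &[u8] = b"charset=";
    let start = lower.windows(needle.len()).position(|w| w == needle)? + needle.len();
    // Accept an optionally quoted value and stop at the first delimiter.
    let value = &lower[start..];
    let value = value
        .strip_prefix(b"\"")
        .or_else(|| value.strip_prefix(b"'"))
        .unwrap_or(value);
    let end = value
        .iter()
        .position(|&b| b"\"' ;/>".contains(&b))
        .unwrap_or(value.len());
    let label = String::from_utf8_lossy(&value[..end]).trim().to_string();
    if label.is_empty() { None } else { Some(label) }
}

Capping the scan at 1024 bytes matches hsivonen's suggestion above to commit to an encoding at that boundary; the prescan result would rank below the BOM and Content-Type checks that from_bytes already performs.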
