Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement character encoding detection / conversion #18

Open
kmcallister opened this issue Jul 31, 2014 · 12 comments
Open

Implement character encoding detection / conversion #18

kmcallister opened this issue Jul 31, 2014 · 12 comments

Comments

@kmcallister
Copy link
Contributor

See the HTML spec and the WHATWG Encoding spec. This also entails noticing <meta charset=...> and <meta http-equiv="Content-Type"> as we parse.

@SimonSapin
Copy link
Member

rust-encoding implements the Encoding spec. We already use it in Servo for CSS.

HTML’s prescan the byte stream to determine its encoding algorithm (the one that looks for <meta charset=...> and <meta http-equiv="Content-Type">) works on the byte stream (as the name says), before tokenization starts.

I believe html5ever should implement the encoding sniffing algorithm (of which the prescan is part of) independently of the tokenizer and parser (although they might share code internally) so that the overall parsing is, conceptually:

  1. Determine (sniff) the encoding from bytes and metadata
  2. Decode bytes to Unicode text (using rust-encoding) with the encoding chosen at step 1
  3. Tokenize text to tokens
  4. Parse tokens into a tree

(It would also be nice to have some kind of Unicode stream: rust-lang/rfcs#57)

@SimonSapin
Copy link
Member

SimonSapin commented Aug 3, 2015

Per the spec, the tree builder can also reload the document with a different encoding when it encounters a relevant <meta> tag: https://html.spec.whatwg.org/multipage/#parsing-main-inhead:change-the-encoding

This seems annoying, given that html5ever is not necessarily running in Servo. @hsivonen suggests on #whatwg @ Freenode that we might get away with ignoring that part of the spec:

15:13 SimonSapin hsivonen: Do you know if there is a way around keeping unbounded amount of input in memory in case the parser decides to "change the encoding", when parsing HTML from bytes?
15:14 hsivonen SimonSapin: there intentionally is not supposed to be one
15:14 SimonSapin and, to test if browsers do it, are you aware of a case where "prescan a byte stream to determine its encoding" would fail to find an encoding when given 1024 bytes, but tree construction would still "change the encoding" based on a meta start tag in the first 1024 bytes?
15:16 SimonSapin hsivonen: intentionally? Why?
15:19 hsivonen SimonSapin: to answer the previous question: after the 1024-byte boundary, the parser instance commits to one encoding. However, a late meta or a Japanese/Russian/Ukrainian detector can still trigger a reload with a different encoding
15:20 hsivonen SimonSapin: in which case a new parser instance starts a new parse
15:20 SimonSapin hsivonen: I’m trying to decide what to do in html5ever, which doesn’t necessarily have a notion of reload
15:21 hsivonen SimonSapin: IIRC, WebKit/Blink doesn't support late triggering a reload. I don't know if they do it for their Japanese detection
15:22 hsivonen SimonSapin: as for "intentionally", the intention of the 1024-byte boundary is precisely to make sure that the parser doesn't keep buffering forever and not produce any output
15:22 hsivonen SimonSapin: I suggest committing to an encoding at the latest when you've seen 1024 bytes
15:23 hsivonen SimonSapin: I can't recall why I implemented the late thing in the new parser
15:23 hsivonen SimonSapin: initially, I make the detectors see at most 1024 bytes so that they couldn't trigger a reload
15:23 hsivonen SimonSapin: but that broke Japanese Planet Debian
15:23 hsivonen SimonSapin: and people get really nervous if you break a Japanese site
15:24 hsivonen so...
15:24 hsivonen Japanese Planet Debian has since been fixed
15:25 hsivonen it's quite possible that we could get rid of the Russian and Ukrainian detectors and limit the Japanese detector to 1024 bytes and the sky wouldn't fall
15:25 SimonSapin hsivonen: I see, thanks. So only run the byte-based prescanner, or can tree construction find meta tags that the pre-scanner doesn’t?
15:26 SimonSapin hsivonen: I’m referring to https://html.spec.whatwg.org/mult...main-inhead:change-the-encoding
15:26 hsivonen SimonSapin: I suggest only running the prescanner. (but I bet it's possible to construct something that the prescanner doesn't see but the tree builder sees)
15:27 annevk hsivonen: WebKit only has a Japanese detector iirc
15:27 hsivonen SimonSapin: oh. the reason I added support for late may be that the spec said so!
15:27 hsivonen SimonSapin: but IIRC, WebKit doesn't honor the spec there
15:27 annevk we should fix the spec
15:28 hsivonen SimonSapin: it quite possible that the spec says so because the old parser in Gecko behaved like that
15:28 hsivonen SimonSapin: I'm not sure what IE did at the time the spec was written, but my vague recollection is that it supported late
15:30 hsivonen hmm. an obvious way to create a seen by the tree builder but not by the prescanner is, of course, document.write
15:30 jgraham This wasn't one of the cases where Hixie was concerned about the security impact of an attacker that could cause early termination of the byte stream?
15:30 hsivonen jgraham: I don't recall this topic co-occurring with that topic
15:31 * jgraham isn't quite sure what such an attack would look like given incremental parsing
15:31 hsivonen jgraham: that was about comments and scripts
15:32 jgraham OK
15:32 hsivonen SimonSapin: so I suggest 1) implementing just the prescan until 1024 bytes, 2) being aware that you might end up having to implement something that allows you to signal to the browsing context to reload if # 1 Breaks the Web, 3) researching if old IE actually supports late and if it doesn't, filing a spec bug
15:33 hsivonen it's possible that a spec bug is warranted just based on the success of WebKit, though
15:33 SimonSapin hsivonen: Chrome doesn’t reaload, it switches encodings mid-stream: https://gist.github.com/anonymous/addad9f51781a6cd2cf9
15:33 SimonSapin Firefox reloads
15:33 hsivonen SimonSapin: whoa!
15:34 SimonSapin Firefox makes two HTTP requests
15:34 hsivonen SimonSapin: Firefox making two requests is expected
15:34 hsivonen SimonSapin: the Chrome behavior is news to me
15:35 SimonSapin Chrome 46 dev, don’t have Release at hand
15:37 hsivonen annevk: my current assumption is that the Russian and Ukrainian detectors misfiring is a greater problem than the problems they fix, but I don't have proof
15:38 hsivonen annevk: I want to get rid of those two detectors but I feel I need something more concrete than a guess that they have negative utility

In the test case below, Chrome 46 dev switches encodings mid-stream (but not exactly at the point of the <meta> tag) rather than reloading: it renders è, then èèé. Firefox 39 renders è, then èè, then reloads (makes a new HTTP request), renders èè, then èèé.

import time
from wsgiref.simple_server import make_server

def simple_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/html')]
    start_response(status, headers)
    yield u"""
<!DOCTYPE html>
<!-- 1024 bytes:
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
-->
è""".encode('utf8')
    time.sleep(2)
    yield u"è<meta charset=utf-8>é".encode('utf8')


httpd = make_server('', 8000, simple_app)
print("Listening on port 8000....")
httpd.serve_forever()

@gsnedders
Copy link

@SimonSapin what happens when it's in response to a POST request?

bors-servo pushed a commit that referenced this issue Jan 26, 2016
Rewrite the high-level API (driver module) to use TendrilSink

This depends on servo/tendril#23.

This also adds an API to parse from bytes, which is part of #18.

r? @nox

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/servo/html5ever/188)
<!-- Reviewable:end -->
@SimonSapin
Copy link
Member

#188 added a from_bytes method to html5ever::driver::Parser that does some of this (pick an encoding from Content-Type charset or a BOM, and use rust-encoding).

It should be possible to modify the html5ever::driver::detect_encoding function to do the rest of this (in particular scanning for <meta> elements) without further API changes.

@shinglyu
Copy link

You mentioned

pick an encoding from Content-Type charset

But I tried

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=big5" />

I still got diamonds

selection_136

@jdm
Copy link
Member

jdm commented Aug 24, 2017

@shinglyu Servo doesn't process meta tags yet.

@shinglyu
Copy link

@jdm: I see, so the parser has the ability to do so but it's not hooked up in servo yet?

@SimonSapin
Copy link
Member

SimonSapin commented Aug 24, 2017

I meant Content-Type the HTTP header. And yes, using that API in Servo never landed because tests failed because </script> handling was incorrect.

@shinglyu
Copy link

@SimonSapin Is servo/servo#9730 what you were talking about?

@SimonSapin
Copy link
Member

Yes.

@shinglyu
Copy link

I wrote a test server to have Content-Type=text/html; charset=big5 in my HTTP header. (see the screenshot from webkit devtool)
selection_137

But servo still renders it incorrectly.

A dumb question: how do I break in the html5ever code using gdb from servo? I tried b driver.rs:<line no> but didn't work. I want to check if the encoding is correct, or is there other place in servo I can break to check the encoding parsing? (Looks like the document.characterSet() API is hardcoded to UTF-8?)

@SimonSapin
Copy link
Member

No need for a debugger, I can confirm that current Servo always uses UTF-8 for HTML. (servo/servo#9730 never landed.)

https://github.com/servo/servo/blob/1aacbce053/components/script/dom/servoparser/mod.rs#L346-L365

What happened with #9730 is that I first tried to build an abstraction in html5ever with a nice simplified API, and then realized it didn’t fit what Servo needs. So this time around I suggest first implementing the encoding sniffing algorithm in Servo, and then later see what kind of API we can build to move it into html5ever.

And we should use encoding_rs instead of rust-encoding, now that Gecko ships it. I have some ideas for adding encoding_rs support in Tendril.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants