The argument to document.write is a sequence of UCS-2 code units and we need a way to interface this with the UTF-8 parser. My plan is:
(Edit: Largely superseded by this proposal)
As much as I’d like to, I don’t know that we can convince other implementations to replace lone surrogates with U+FFFD. For those that use UCS-2 internally (every one but us), this is pure overhead and has a performance cost.
And it’s not just document.write. Lone surrogates can end up anywhere in the DOM through APIs, and other browsers happily keep them there.
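For reference, here is a minimal sketch of what U+FFFD replacement at the `document.write` boundary might look like, using only the Rust standard library. The function name `utf16_to_utf8_lossy` is hypothetical, and this is not necessarily how Servo would do it. It only illustrates the conversion cost being discussed.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical helper: convert a sequence of 16-bit code units to UTF-8,
/// replacing every lone surrogate with U+FFFD.
fn utf16_to_utf8_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().copied())
        // `decode_utf16` yields Err for each unpaired surrogate.
        .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}
```

Properly paired surrogates come through as a single supplementary code point; only the unpaired ones are lost.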
Another solution could be WTF-8: rust-lang/rust#12056 (comment). It’s a superset of UTF-8 (like UTF-8 is a superset of ASCII) that allows surrogates, but only if they’re unpaired. (Concatenating two WTF-8 strings is not just concatenating the bytes: it also needs to check for a newly paired surrogate at the boundary and convert that pair to the UTF-8 representation of a single code point.)
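To make the concatenation rule concrete, here is a rough sketch operating on raw byte slices. It assumes both inputs are already well-formed WTF-8; the names `wtf8_concat`, `trailing_lead_surrogate`, and `leading_trail_surrogate` are made up for illustration and are not from any existing library.

```rust
/// Decode a 3-byte sequence (1110xxxx 10xxxxxx 10xxxxxx) to its code point.
fn decode3(b: &[u8]) -> u32 {
    ((b[0] as u32 & 0x0F) << 12) | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F)
}

/// If `bytes` ends with an encoded lead surrogate (U+D800..=U+DBFF), return it.
/// Lead surrogates encode as ED A0..=AF xx.
fn trailing_lead_surrogate(bytes: &[u8]) -> Option<u32> {
    let n = bytes.len();
    if n >= 3 && bytes[n - 3] == 0xED && (0xA0..=0xAF).contains(&bytes[n - 2]) {
        Some(decode3(&bytes[n - 3..]))
    } else {
        None
    }
}

/// If `bytes` starts with an encoded trail surrogate (U+DC00..=U+DFFF), return it.
/// Trail surrogates encode as ED B0..=BF xx.
fn leading_trail_surrogate(bytes: &[u8]) -> Option<u32> {
    if bytes.len() >= 3 && bytes[0] == 0xED && (0xB0..=0xBF).contains(&bytes[1]) {
        Some(decode3(&bytes[..3]))
    } else {
        None
    }
}

/// Concatenate two well-formed WTF-8 byte strings (assumed, not checked).
fn wtf8_concat(a: &[u8], b: &[u8]) -> Vec<u8> {
    match (trailing_lead_surrogate(a), leading_trail_surrogate(b)) {
        (Some(hi), Some(lo)) => {
            // The boundary forms a surrogate pair: merge it into one
            // supplementary code point and emit its 4-byte UTF-8 form.
            let cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            let c = char::from_u32(cp).expect("paired surrogates give a valid scalar value");
            let mut out = Vec::with_capacity(a.len() + b.len() - 2);
            out.extend_from_slice(&a[..a.len() - 3]);
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            out.extend_from_slice(&b[3..]);
            out
        }
        _ => {
            // No new pair at the boundary: plain byte concatenation is fine.
            let mut out = a.to_vec();
            out.extend_from_slice(b);
            out
        }
    }
}
```

Only the last three bytes of the left string and the first three of the right need to be inspected, so the extra cost over a plain append is constant.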
Is it out of the question that the spec would allow but not mandate U+FFFD replacement? When I brought this up before, people seemed to think it was enough of a corner case that we could get away with it (spec wording changes or no).
“Allow but not mandate” sounds bad for interop on principle, though I don’t know how much it really matters here. But even if we replace in document.write, surrogates can still get in through DOM or CSSOM APIs.
When replacing lone surrogates in CSSOM was brought up in the CSS WG, the conclusion was "no change". (Though it’s not clear to me that the arguments for change were well represented then. I attended the meeting remotely, audio only, with very bad sound quality.)
WTF-8 is a thing now: http://email@example.com/msg00921.html
I’ve changed my mind on the above. I’d like Servo to try UTF-8 everywhere in the DOM and what you first suggested here for document.write.
https://github.com/kmcallister/tendril encompasses my latest proposal.