Closed
See servo/servo#3704.
The argument to `document.write` is a sequence of UCS-2 code units, and we need a way to interface this with the UTF-8 parser. My plan is:
(Edit: Largely superseded by this proposal)
- Convert to UTF-8 as soon as possible.
- Convert invalid surrogate sequences to U+FFFD 'REPLACEMENT CHARACTER'. This is a deviation from the spec, but nobody has objected strongly in the course of various discussions. There was even talk of amending the spec to allow this behavior, since it's currently written under the assumption that all parsers use UCS-2 natively.
- If a `document.write` input ends with a leading surrogate, we can't convert it yet, so save this single `u16` in the `BufferQueue` alongside the UTF-8 buffers.
- If a `document.write` input starts with a trailing surrogate, and there's a saved leading surrogate in the `BufferQueue`, then replace both with the appropriate Unicode character as UTF-8.
- If the parser receives any other input and there's a saved leading surrogate, drop the saved surrogate and prepend U+FFFD to the input. (This means that a script split an invalid surrogate sequence across multiple `document.write` calls, or wrote a lone leading surrogate and then finished.)
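
To make the plan concrete, here is a rough sketch of the conversion step. `Utf16Feeder`, `feed`, and `finish` are hypothetical names for illustration, not existing html5ever API, and the saved surrogate lives in a small standalone struct here rather than in the `BufferQueue`. One extra assumption: an empty input leaves a saved leading surrogate in place, since the plan doesn't say what an empty write should do.

```rust
/// Hypothetical helper (not part of html5ever): converts chunks of UTF-16
/// code units to UTF-8, carrying a saved leading surrogate between calls.
#[derive(Default)]
pub struct Utf16Feeder {
    /// Leading surrogate saved from the end of the previous chunk, if any.
    pending_lead: Option<u16>,
}

fn is_lead(u: u16) -> bool {
    (0xD800..=0xDBFF).contains(&u)
}

fn is_trail(u: u16) -> bool {
    (0xDC00..=0xDFFF).contains(&u)
}

impl Utf16Feeder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Convert one `document.write`-style chunk to UTF-8.
    pub fn feed(&mut self, input: &[u16]) -> String {
        let mut out = String::new();
        let mut units = input;

        // Resolve a leading surrogate saved by the previous call.
        if let Some(lead) = self.pending_lead.take() {
            match units.first() {
                // Assumption: an empty write keeps the saved surrogate.
                None => {
                    self.pending_lead = Some(lead);
                    return out;
                }
                // Saved lead + trailing surrogate: emit the combined character.
                Some(&trail) if is_trail(trail) => {
                    let cp = 0x10000
                        + ((u32::from(lead) - 0xD800) << 10)
                        + (u32::from(trail) - 0xDC00);
                    out.push(char::from_u32(cp).unwrap_or('\u{FFFD}'));
                    units = &units[1..];
                }
                // Any other input: drop the saved surrogate, prepend U+FFFD.
                Some(_) => out.push('\u{FFFD}'),
            }
        }

        // A leading surrogate at the very end can't be converted yet; save it.
        if let Some((&last, rest)) = units.split_last() {
            if is_lead(last) {
                self.pending_lead = Some(last);
                units = rest;
            }
        }

        // Everything else converts now; invalid surrogates become U+FFFD.
        for r in char::decode_utf16(units.iter().copied()) {
            out.push(r.unwrap_or('\u{FFFD}'));
        }
        out
    }

    /// End of input: a surrogate that is still saved becomes U+FFFD.
    pub fn finish(&mut self) -> String {
        match self.pending_lead.take() {
            Some(_) => String::from("\u{FFFD}"),
            None => String::new(),
        }
    }
}
```

With this sketch, feeding `[0x0061, 0xD83D]` and then `[0xDE00, 0x0062]` should yield `"a"` followed by U+1F600 and `"b"`, while a saved surrogate followed by an ordinary code unit yields U+FFFD in front of that unit's output.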