javascript URL: define JS string-to-byte conversion better #1129

domenic · 2016-04-26T13:25:23Z

Response bodies are bytes, but the algorithm (as of #1107) uses a JS string as a response body.

Black box testing plan

At #1107 (comment) Boris gave a test plan that would allow us to figure out the string -> byte conversion in a black box way:

In terms of test matrix, if we need to determine this in a black-box way, it seems to me that the following are somewhat useful cases to test:

Return string is all ASCII (charCodeAt() < 128 for all indices).

Return string has charCodeAt() < 256 for all indices, but does not fall into case 1.

Return string has does not have any charCodeAt values corresponding to UTF-16 surrogate code unit values, but does not fall into cases 1 or 2.

Return string has surrogate code units which are all paired properly.

Return string has unpaired surrogate code units.

Each of these should be tested in situations in which the source of the javascript: URL is either UTF-8 or ISO-8859-1/Windows-1252. That is, either an iframe in a document with that encoding with src pointing to a javascript: URL, or a link in a document with that encoding with href pointing to a javascript: URL. Probably test both scenarios.

The tests should look for the following things:

What is the document.body.textContent of the resulting document?

What is the document.charset of the resulting document?

Relevant implementer reports

Gecko

From #1107 (comment)

What Gecko does in terms of conversion to bytes is that it examines the returned string to see whether all charCodeAt() values are 255 or less. If so, the string is treated as byte-inflated ISO-8859-1 data (and a response is synthesized which has "ISO-8859-1" as its encoding, with the byte-deflated bytes as data). This allows generation of non-text data, dating back to when we supported javascript: in <img>, say.

Otherwise the return value is treated as a sequence of UTF-16 code units encoding a Unicode string and converted to UTF-8 bytes (insert handwaving about what happens to unpaired surrogates here). The synthesized response has "UTF-8" as its encoding.

Blink

From #1107 (comment) with further analysis by Boris in #1107 (comment):

I don't think Blink does conversion to bytes at all here. See the FIXME comment in https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/loader/DocumentWriter.cpp&l=75&ct=xref_jump_to_def&cl=GROK&gsn=appendReplacingData as called from https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/loader/DocumentLoader.cpp&sq=package:chromium&l=684&rcl=1461583037 (DocumentLoader::replaceDocumentWhileExecutingJavaScriptURL).

EdgeHTML

From #1107 (comment)

In Edge we specify two code pages for transformation. The first is the calculated code page which is always CP_UCS_2 which translates to Unicode, ISO 10646 according to comments. We then specify as the source code page CPSRC_NATIVEDATA which means native data, known to be CP_UCS_2 so don't allow any sort of fallbacks.

The text was updated successfully, but these errors were encountered:

Fixes #301, by aligning with the 3/4 browser majority and checking the type of the completion value, turning non-strings and thrown errors into 204s. (Thrown errors are still reported, however.) While working on this algorithm, we fix #945 by copying the HTTPS state to response. This also does some minor cleanup to clarify that "run a classic script" returns undefined when scripting is disabled. #1129 was opened to track a remaining open issue discovered, which is exactly how the JS string completion value becomes a response body. For now the spec includes a warning saying that this is underspecified.

annevk · 2017-07-26T14:08:34Z

Here's a WPT for this (you can replace the encoding windows-1252 for the other test):

<meta charset=UTF-8>
<title>javascript URL string return values</title>
<script src=/resources/testharness.js></script>
<script src=/resources/testharnessreport.js></script>
<div id=log></div>
<script>
const testInputs = [
  [0x41],
  [0x80,0xFF],
  [0x80,0xFF,0x100],
  [0xD83D,0xDE0D],
  [0xDE0D,0x41]
];
testInputs.forEach(input => {
  async_test(t => {
    const frame = document.createElement("iframe");
    //t.add_cleanup(() => frame.remove());
    frame.src = "javascript:[" + input + "].map(b => String.fromCharCode(b)).join('')";
    t.step_timeout(() => {
      assert_equals(frame.contentDocument.body.textContent, input.map(b => String.fromCharCode(b)).join(""));
      assert_equals(document.charset, frame.contentDocument.charset);
      t.done();
    }, 200);
    document.body.appendChild(frame);
  });
});
</script>

Chrome and Safari pass these. (That means lone surrogates remain lone surrogates.)

Firefox doesn't:

For UTF-8 harness: returns windows-1252 for the first test; decodes using windows-1252 for the second test so gets € rather than U+0080 (it also reports windows-1252 in that test, but that point is not normally reached); replaces the lone surrogate with U+FFFD in the last test.
For windows-1252 harness: same failures, but also returns UTF-8 for the last three tests (arguably better?).

Edge doesn't:

For UTF-8 harness: returns unicode for the first test instead of UTF-8; replaces the lone surrogate with U+FFFD in the last test.
For windows-1252 harness: returns unicode for the first test instead of windows-1252; gives Chinese/Japanese code points and a question mark back for the remainder of the tests. I can investigate in more detail, but I'm not sure we care.

I'll try add the <a href> tests now.

annevk · 2017-07-26T14:24:35Z

For those I get similar results in Firefox and Edge, but both Chrome and Safari cannot be tested in this way because they do not respect the target attribute on such hyperlinks?!

<meta charset=UTF-8>
<title>javascript URL string return values</title>
<script src=/resources/testharness.js></script>
<script src=/resources/testharnessreport.js></script>
<div id=log></div>
<script>
const testInputs = [
  [0x41],
  [0x80,0xFF],
  [0x80,0xFF,0x100],
  [0xD83D,0xDE0D],
  [0xDE0D,0x41]
];
testInputs.forEach(input => {
  const javascriptURL = "javascript:[" + input + "].map(b => String.fromCharCode(b)).join('')",
        output = input.map(b => String.fromCharCode(b)).join("");

  async_test(t => {
    const frame = document.createElement("iframe");
    t.add_cleanup(() => frame.remove());
    frame.src = javascriptURL;
    t.step_timeout(() => {
      assert_equals(frame.contentDocument.body.textContent, output);
      assert_equals(frame.contentDocument.charset, document.charset);
      t.done();
    }, 200);
    document.body.appendChild(frame);
  });

  async_test(t => {
    const frame = document.createElement("iframe"),
          href = document.createElement("a");
    t.add_cleanup(() => { frame.remove(); href.remove(); });
    frame.name = "hi" + input;
    href.target = "hi" + input;
    href.href = javascriptURL;
    t.step_timeout(() => {
      assert_equals(frame.contentDocument.body.textContent, output);
      assert_equals(frame.contentDocument.charset, document.charset);
      t.done();
    }, 200)
    document.body.appendChild(frame);
    document.body.appendChild(href);
    href.click();
  });
});
</script>

annevk · 2017-07-26T15:00:39Z

By the way, if we actually need this to be about bytes, my recommendation would be always UTF-8 and UTF-8 encode. If we can somehow deal with a response containing a JavaScript string, we could just match Chrome/WebKit. (Possible, but probably ugly. Much better to require UTF-8 encode I think.)

annevk · 2017-07-27T11:58:21Z

Issues filed for target attribute not working:

TimothyGu · 2019-06-11T22:20:31Z

Some updates on the Blink front. After @natechapin's chromium/chromium@9d459e0 (scheduled for Chrome 77), Chrome now always does UTF-8 encoding/decoding for javascript URLs rather than using the JavaScript string directly, which is the best case scenario. So the last test case (unpaired surrogates) in #1129 (comment) now fails. Additionally, the window-1252 case has been failing for a while now (at least since Chrome 75) as Chrome has been forcing UTF-8 charset on the resulting document.

TimothyGu · 2019-06-11T22:40:22Z

Additionally, the <a> tests in #1129 (comment) now run in Chrome, as they fixed the target bug in Chrome 76. However, in 77 because they started doing UTF-8 encoding/decoding, the very last test case (unpaired surrogates) also started failing.

Closes #1129.

domenic · 2021-06-16T21:41:10Z

So it looks like @annevk's tests were committed to WPT at some point, yay: results.

However I'm trying to spec this and I think I must be confused on the desired behavior for the [0xDE0D,0x41] test case.

If we spec the body as doing a UTF-8 encode of the JS string '\uDE0D\u41', we get the bytes 0xEF 0xBF 0xBD 0x41, right?

But when you feed that body into a UTF-8 decoder, which is what I presume would happen (we'd force the charset to UTF-8 via Content-Type), then the result would be the JS string '\uFFFDA'. Right?

However Chrome seems to instead produce the JS string '\uFFFD\uFFFD\uFFFDA'.

Is my understanding correct?

Closes #1129.

TimothyGu · 2021-06-17T05:43:58Z

@domenic This looks potentially like a Blink bug. When we set a breakpoint at where UTF-16 JavaScript string is converted to UTF-8, we can see that the converted result is actually ED B8 8D 41, where the unpaired surrogate is translated to UTF-8 directly without any checking. This is probably an artifact of kLenientUTF8Conversion, where kStrictUTF8ConversionReplacingUnpairedSurrogatesWithFFFD would do the right thing. Filed https://crbug.com/1221018.

Closes #1129.

Closes whatwg#1129.

domenic added the compat Standard is not web compatible or proprietary feature needs standardizing label Apr 26, 2016

domenic mentioned this issue Apr 26, 2016

Fix #301: detail how javascript: return values become response bodies #1107

Merged

domenic added the topic: navigation label May 2, 2016

annevk changed the title ~~Define the JS string-to-byte conversion for javascript: URL results better~~ javascript URL: define JS string-to-byte conversion better Jul 27, 2017

domenic added the topic: javascript: URLs label Jun 15, 2018

domenic mentioned this issue Mar 11, 2021

Fix same-URL navigation session history handling #6476

Merged

domenic added a commit that referenced this issue Jun 16, 2021

Treat javascript: URL results as UTF-8 bytes

9c881cf

Closes #1129.

domenic mentioned this issue Jun 16, 2021

Treat javascript: URL results as UTF-8 bytes #6781

Merged

3 tasks

domenic added a commit that referenced this issue Jun 16, 2021

Treat javascript: URL results as UTF-8 bytes

0b99c86

Closes #1129.

domenic closed this as completed in #6781 Jun 22, 2021

domenic added a commit that referenced this issue Jun 22, 2021

Treat javascript: URL results as UTF-8 bytes

66605d1

Closes #1129.

mfreed7 pushed a commit to mfreed7/html that referenced this issue Jun 3, 2022

Treat javascript: URL results as UTF-8 bytes

6c55326

Closes whatwg#1129.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

javascript URL: define JS string-to-byte conversion better #1129

javascript URL: define JS string-to-byte conversion better #1129

domenic commented Apr 26, 2016

annevk commented Jul 26, 2017

annevk commented Jul 26, 2017

annevk commented Jul 26, 2017 •

edited

annevk commented Jul 27, 2017

TimothyGu commented Jun 11, 2019 •

edited

TimothyGu commented Jun 11, 2019

domenic commented Jun 16, 2021

TimothyGu commented Jun 17, 2021 •

edited

javascript URL: define JS string-to-byte conversion better #1129

javascript URL: define JS string-to-byte conversion better #1129

Comments

domenic commented Apr 26, 2016

Black box testing plan

Relevant implementer reports

Gecko

Blink

EdgeHTML

annevk commented Jul 26, 2017

annevk commented Jul 26, 2017

annevk commented Jul 26, 2017 • edited

annevk commented Jul 27, 2017

TimothyGu commented Jun 11, 2019 • edited

TimothyGu commented Jun 11, 2019

domenic commented Jun 16, 2021

TimothyGu commented Jun 17, 2021 • edited

annevk commented Jul 26, 2017 •

edited

TimothyGu commented Jun 11, 2019 •

edited

TimothyGu commented Jun 17, 2021 •

edited