Unescaped '>' should probably not be allowed in URLs #291

bzbarsky · 2017-04-05T21:04:46Z

The standard way, going back to at least the mid-90s, to mark up URLs in text is <url>. This, of course, relies on unescaped > not being allowed in URLs. This is clearly stated, with exactly this rationale, in RFC 1738 section 2.2. The URL standard should have similar provisions.

I don't know what that should mean for URL parsing, but in terms of serialization '>' should always be escaped in URLs, imo.

I just tested browser behavior, and:

Firefox consistently escapes '>' in path, userinfo, query, fragment. '>' in host or port causes parsing failure.
Safari escapes '>' in path, userinfo, query. It allows '>' unchanged in host and fragment. '>' in port causes parsing failure.
Chrome escapes '>' in path, userinfo, query, host. It allows '>' unchanged in fragment. '>' in port causes parsing failure.
Edge escapes '>' in path and host. It allows '>' unchanged in fragment and query. '>' in port causes parsing failure. Presence of userinfo causes parsing failure no matter what.

Testcase used:

<pre><script>
  var strs = [
    "http://test>test/foo\\bar",
    "http://a>b@test/foo\\bar",
    "http://test/foo\\bar/#a>b",
    "http://test/foo\\bar/?a=c>d",
    "http://test:2>3/foo\\bar",
    "http://test/foo>bar\\baz",
  ];
  for (var str of strs) {
    var a = document.createElement("a");
    a.setAttribute("href", str);
    var href;
    try {
      href = a.href;
    } catch(e) {
      href = "href getter threw";
    }
    var url;
    try {
      url = (new URL(str).href);
    } catch(e) {
      url = "constructor threw";
    }
    document.writeln(str, " -- ", href, " -- ", url);
  }
</script>

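with the \\ bits in there as a way to tell whether parsing failed in the href case: a successful parse of an http URL turns the backslashes into forward slashes, while on failure the href getter returns the attribute value unchanged.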

bzbarsky · 2017-04-05T21:08:46Z

Note also that there are various other standards (e.g. the one for the Link HTTP header) that rely on being able to put <> around a URL to delimit it.
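To illustrate why the <> delimiting convention depends on '>' being escaped, here is a minimal sketch of a deliberately naive extractor. The function name `extractBracketedUrl` and the sample header values are hypothetical and are not taken from any of the standards mentioned; real Link header parsers are more involved.

```javascript
// Hypothetical, deliberately naive extractor: returns whatever sits
// between the first "<" and the next ">" in the given text.
function extractBracketedUrl(text) {
  const match = /<([^>]*)>/.exec(text);
  return match ? match[1] : null;
}

// With '>' percent-encoded, the whole URL is recovered.
console.log(extractBracketedUrl('<http://test/foo%3Ebar>; rel="next"'));
// → http://test/foo%3Ebar

// With an unescaped '>', the URL is cut off at the first '>'.
console.log(extractBracketedUrl('<http://test/foo>bar>; rel="next"'));
// → http://test/foo
```

Any consumer that scans for the closing '>' has the same problem, which is the interoperability risk being described here.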

Currently, we percent-encode characters in "fragment state" using the C0 control percent-encode set. Firefox encodes more than that, and it seems reasonable to align around that behavior for reasons spelled out in #291 and the comments of #344.

This patch adds a new "fragment percent-encode set" which contains the C0 control percent-encode set, along with:

* 0x20 (SP)
* 0x22 (")
* 0x3C (<)
* 0x3E (>)
* 0x60 (`)

Tests: web-platform-tests/wpt#7776.

Closes #344.
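The set described in the patch can be sketched roughly as follows. This is an illustration under stated assumptions, not the spec's algorithm text: the helper names `inFragmentPercentEncodeSet` and `percentEncodeFragment` are hypothetical, the input is assumed to be UTF-8 encoded before the check, and the C0 control percent-encode set is taken to be the C0 controls plus everything above 0x7E.

```javascript
// Extra code points listed in the patch description: SP, ", <, >, `
const FRAGMENT_EXTRAS = new Set([0x20, 0x22, 0x3c, 0x3e, 0x60]);

// Hypothetical helper: is this byte in the "fragment percent-encode set"?
function inFragmentPercentEncodeSet(byte) {
  // C0 control percent-encode set: C0 controls (0x00-0x1F) and bytes above 0x7E.
  const inC0ControlSet = byte <= 0x1f || byte > 0x7e;
  return inC0ControlSet || FRAGMENT_EXTRAS.has(byte);
}

// Hypothetical helper: percent-encode a fragment string byte by byte.
function percentEncodeFragment(fragment) {
  const bytes = new TextEncoder().encode(fragment); // UTF-8 bytes
  let out = "";
  for (const byte of bytes) {
    out += inFragmentPercentEncodeSet(byte)
      ? "%" + byte.toString(16).toUpperCase().padStart(2, "0")
      : String.fromCharCode(byte);
  }
  return out;
}

console.log(percentEncodeFragment("a>b")); // → a%3Eb
```

Existing `%` characters are left alone in this sketch, so an already-escaped `%3E` passes through unchanged.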

annevk · 2020-05-06T16:41:28Z

Apart from host it seems this is in order: https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9leGFtcGxlLmNvbS88Pj88PiM8Pg==&base=YWJvdXQ6Ymxhbms=. Probably due to #347.

Host is tracked by #458.
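The viewer link above carries its input base64-encoded; decoded, the input is `https://example.com/<>?<>#<>` with base `about:blank`. A rough way to repeat the check locally, assuming a runtime whose `URL` global follows the current URL Standard (for example a recent Node.js or browser):

```javascript
// Same input and base as the viewer link, written out in decoded form.
const url = new URL("https://example.com/<>?<>#<>", "about:blank");

// Each component should come back with '<' and '>' percent-encoded.
console.log(url.pathname); // → /%3C%3E
console.log(url.search);   // → ?%3C%3E
console.log(url.hash);     // → #%3C%3E
```

Older implementations that predate the fragment change may still print the fragment unescaped.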

annevk added the topic: parser label Apr 10, 2017

annevk mentioned this issue Sep 15, 2017

Consider percent-encoding more characters in "fragment state" #344

Closed

mikewest mentioned this issue Oct 9, 2017

Percent-encode additional characters in "fragment state". #347

Merged

annevk closed this as completed May 6, 2020
