Should we unescape characters in path? #606

TimothyGu · 2021-05-21T09:08:08Z

Consider

<a href="https://jsdom.github.io/whatwg-url/">normal</a>,
<a href="https://jsdom.github.io/wh%61twg-url/">encoded 'a'</a>

In Chrome, both a elements have pathname "/whatwg-url/", indicating that the %61 was unescaped.

In Safari, the second a element has pathname "/wh%61twg-url/", but when navigating, /whatwg-url/ is actually the destination.

In Firefox, the second a element has pathname "/wh%61twg-url/", and that pathname is used when navigating (resulting in a 404 error with GitHub pages). However, confusingly, the address bar in the browser chrome has the valid "whatwg-url", so if you go to the address bar and press Enter it works:

The spec currently doesn't do any unescaping. Should we?

The text was updated successfully, but these errors were encountered:

karwa · 2021-05-21T11:15:04Z

It sounds like a good idea to decode them, IMO. The latest HTTP semantics draft spec says:

Scheme-based normalization (Section 6.2.3 of [RFC3986]) of "http" and "https" URIs involves the following additional rules:
...
Characters other than those in the "reserved" set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [RFC3986]).

(I believe this would also apply to at least the query)

We already do the other HTTP-specific normalisations (removing default ports, root path instead of empty, lowercased host name), as well as other normalisations (e.g. exotic IP addresses), so I think it makes sense to do this, too. Some part of the system will have to - best to do it as soon as possible at the URL level to avoid mismatches like those you’ve described.

alwinb · 2021-07-07T06:27:37Z

Related issues: #369, #565 and #87. Quick questions, probably discussed there:

Which set of characters (code points?) would then be unescaped (aka. the unreserved set)
What then to do with invalid sequences becoming valid sequences after unescaping, such as e.g. http://example/%1%61

I think #87 gives the reasons for not doing this.

It is still important to define the semantics of escape sequences, for server-side URL handling and for interoperability though. Currently the standard does not discuss that.

philloooo · 2021-09-23T21:11:10Z

Filed a bug in Chromium to track this. https://crbug.com/1252531

annevk · 2021-09-24T07:05:00Z

FWIW, I'm not sure I agree with the notion that we need to define the semantics. Or at least in such a way that there is only one valid interpretation. Servers can interpret URLs however they wish and they do not necessarily need to agree with each other on that. If one server considers a and A the same and another does not, that's perfectly acceptable.

alwinb · 2021-09-25T10:50:04Z

What about the semantics of percent encoded sequences though? Not the additional protocol- or application specific normalisations, but the analogue of Percent-Encoding Normalization in the RFC.

I don't think the URL Standard currently states that %61 may be considered equivalent to a, other than in the domain (where it is obligatory). It might even make sense to mandate that, so that it is safe to assume that e.g. http://example/%61 and http://example/a refer to the same resource, whether browsers normalise them to the same URL or not.

If we do make the semantics explicit then the question arises of how to correct for invalid escape sequences. So that %6%31 does not end up being (percent-encoding-) equivalent with %61 and then transitively with a.

karwa · 2021-09-25T12:01:46Z

We already define that %2E and %2e are equivalent to . in path components (even a mixed component like .%2E is treated equivalently to ..).

We also add percent-encoding for certain characters - both in the parser and via the component setters. It logically follows that we expect anybody who processes the resulting URL to treat them as equivalent, otherwise we would have produced a URL which points to a different resource than the user intended.

Basically, if we're not happy with saying that %61 must always be the same as a (and that is the only interpretation), then the following operation:

var url = new URL("http://example.com/foo/bar");

url.href; // "http://example.com/foo/bar"

url.pathname = "what should this do???";

url.href; // "http://example.com/what%20should%20this%20do%3F%3F%3F"

Should also fail. Otherwise, the string what%20should%20this%20do%3F%3F%3F is not necessarily the same as what should this do???.

I don't think that's workable. At least for ASCII code-points, they must be equivalent.

annevk · 2021-09-27T07:00:37Z

It logically follows that we expect anybody who processes the resulting URL to treat them as equivalent, otherwise we would have produced a URL which points to a different resource than the user intended.

Maybe? That really depends on whether the user knows what the parser will do.

I think it's okay to say that for path/query/fragment we generally expect https://url.spec.whatwg.org/#string-percent-decode to work, but I'm not sure why we'd mandate things we cannot really require. If a server wants to treat %61 and a differently, it can.

valenting · 2021-09-27T08:20:13Z

I agree with Anne. Ideally we'd do as little processing as possible on a URL, and let the server handle them as well as they can.
There are corner cases beyond percent encoding. For example http://example.com/path/to//file (two slashes) and http://example.com/path/to/file (one slash) are essentially equivalent from the filesystem's point of view, but depending on the web server you're using, they might not be. While the URL parser could say that we should collapse the two paths, it's probably more important that we keep the processing to a minimal in order to not change the URL's initial form.

alwinb · 2021-09-29T11:51:01Z

I think it's okay to say that for path/query/fragment we generally expect https://url.spec.whatwg.org/#string-percent-decode to work

A good way to go about that is to point out in the standard that there are multiple normalisations/ equivalence relations on URLs that are in common use, and add a section, maybe? to explain that. I suppose, in the WHATWG style it would contain algorithms that compute them. (With a statement that they are not normative for browsers, where this is the case).

Percent encoding normalisation / equivalence is a good one to start with.

Collapsing slashes could be another one to mention. IIRC this is relevant also to another open issue about JS module specifiers.

annevk · 2021-09-29T12:00:47Z

I suppose we could add something to https://url.spec.whatwg.org/#url-equivalence that is very clearly scoped to non-browser contexts. There's also query parameter order and such (to the extent we want to acknowledge query parameters there, not sure).

guybedford · 2021-10-08T13:52:04Z

Treating a feature like this as part of equivalence instead of canonical serialization would mean it definitely wouldn't apply to JS module system instance identification. For example, all JS module systems compare the href not based on the equivalence algorithm and for this reason fragments result in separate module instances already, and users have already adopted this as a feature. Not sure which way I stand on that but just noting that these decisions should always be considered with reference to their modules implications at this point please.

annevk · 2021-10-08T16:46:43Z

@guybedford yeah, the same is true for many other systems out there. And to be clear, it wouldn't change the default equivalence algorithm, it would just be an option if you want more URLs to be the same (that also can reasonably be argued to the be same).

malaire · 2021-10-15T18:49:40Z

The issue #565 which was just closed as duplicate has some additional discussion which might be worth checking out by those following this issue. But I agree on keeping any further discussion here.

alwinb · 2021-10-24T10:16:09Z

Also reading #369 again, it has an especially good description. That issue also explains the address bar behaviour of Firefox.

My take is that we have to make a decision about how to handle invalid escape sequences. Not knowing/ deciding on how to handle those leads to the current situation by default rather than by making a deliberate decision. And if it ends up not being used by browsers, then it is still valuable to specify a recommended behaviour for other applications.

To move on with #606 (comment), I think it is needed. And I think we have to revisit the reserved characters. Unless we’d go for a comparison on URL records with fully decoded components. But I’m not sure that is desirable, especially in the query string.

By the way, I think adding something there is really great!

Jikstra · 2021-12-28T15:45:49Z

Any news on this? The regression around actix/actix-web#2398 is sadly holding us back updating actix-web to the latest beta. svenstaro/miniserve#677

karwa · 2022-01-23T18:16:57Z

I'm currently exploring implementing this in Swift, as over-encoding/removing over-encoding is an important feature for interop with our existing RFC-2396 URL type, as well as a generally useful feature. Having looked a the previous issues, I'm reasonably convinced this is possible. I'm not seeing any insurmountable challenges.

Maybe? That really depends on whether the user knows what the parser will do.

I don't really find this very satisfying; the same argument could be made the other way. If the user is expected to have a deep and detailed understanding of the parser, any behaviour is reasonable and nothing needs to be justified. It's a kind of cyclical reasoning where things happen because they happen.

If a server wants to treat %61 and a differently, it can.

On the one hand, this is is demonstrably true because - well, form-encoding 😔. A + and a %2B may certainly be different depending on how the query is interpreted.

On the other hand, at least for some characters in some components, that behaviour would not appear to be web compatible. Routers, caches and CDNs will sometimes decode these bytes, and expect that they do not change the meaning of the URL. The discussions in previous issues seems to indicate that many browsers very much do expect these to be equivalent. This leads to the idea that we need some kind of "unreserved set" (perhaps per-component).

Such a server would serve different resources to different browsers for the same URL, which seems at-odds with the idea of interoperability or the web as a platform. The evidence in this issue indicates that GitHub Pages is apparently performing as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server. If GHP is indeed entitled to behave that way, it suggests that all browsers which successfully navigate to that URL are wrong - which again, does not seem to be a web-compatible position.

There are corner cases beyond percent encoding. For example http://example.com/path/to//file (two slashes) and http://example.com/path/to/file (one slash) are essentially equivalent from the filesystem's point of view, but depending on the web server you're using, they might not be. While the URL parser could say that we should collapse the two paths, it's probably more important that we keep the processing to a minimal in order to not change the URL's initial form.

The difference, IMO, is that the URL parser does not add or remove empty path components (any more! It used to do that to file URLs). It does, however, add and remove percent-encoding, meaning there is already implicit acceptance that doing so does not change the meaning of the URL.

By definition, if the parser does something (e.g. turning http://ex%61mple.com -> http://example.com), it must preserve meaning, as any attempt to utter the former as a URL record results in the latter, and URLs are records:

A URL is a struct that represents a universal identifier. To disambiguate from a valid URL string it can also be referred to as a URL record.

We are forced to accept that the web's model of URLs, as defined by the various browser implementations over the decades, includes this assumption that percent-encoding may be safely added or removed in certain circumstances, and that a standard which attempts to describe that model must define that process and the circumstances where it applies.

annevk · 2022-12-16T13:31:13Z

The evidence in this issue indicates that GitHub Pages is apparently performing as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server.

I don't think it does? Browsers that do not behave like Firefox (and Safari behaves like Firefox so something changed since OP) would not be able to visit the 404 at https://jsdom.github.io/wh%61twg-url/ as they would instead get the resource at https://jsdom.github.io/whatwg-url/ which is different.

I'm going to treat this as a clarification issue as per #606 (comment). PRs welcome.

gibson042 · 2022-12-16T16:52:14Z

@annevk By "treat this as a clarification issue", you mean continuing to preserve percent-encoding of characters outside the RFC 3986 reserved set in https://url.spec.whatwg.org/#path-state ? That is permitted by the current HTTP semantics document (which does not require normalization but does make clear that such gratuitous encoding maintains the interpretation of a URI, i.e. it still identifies the same resource), although it does put more burden on user code that is trying to be robust. How would you feel about an issue requesting a normalize method?

annevk · 2022-12-16T16:59:03Z

Yeah. A new API seems fair (though see also https://whatwg.org/faq#adding-new-features and https://whatwg.org/working-mode#changes). Deciding on the semantics might be tricky, but hopefully we can figure something out. (Would be nice if that also tackled query params and such.)

This CL is part of the URL interop 2023 effort. "Intent to Implement and Ship" is [1]. Currently, when Chrome parses a URL, it decodes percent-encoded ASCII characters in URL path. However, this behavior doesn't align with the URL Standard [2]. The CL fixes this behavior to retain percent-encoded ASCII characters in URL's path. Before: > const url = new URL("http://example.com/%41"); > url.href "http://example.com/A" After: > const url = new URL("http://example.com/%41"); > url.href "http://example.com/%41" Interoperability: - Chrome isn't compliant, while Firefox and Safari are compliant. - I've tested URL APIs in non-browser environments and libraries, such as Deno's `URL` implementation [3] and Rust's `url` crate [4], both of which are standard-compliant. Background: The existing behavior seems to be a result of past decisions. The comment in `url_canon_path.cc` states: > // This table was used to be designed to match exactly what IE did > // with the characters. Impact: Regarding implementation, web-exported URL APIs, GURL, and KURL share the same URL parser and canonicalization backend. Given that these URL classes are widely used both internally or externally, predicting all possible consequences and risks is challenging. Given the very low user metrics [5], we received approval to land [1], but with a kill switch in place. UMA: Usage: 0.000071% (URL.Path.UnescapeEscapedChar [5], as of Aug 2023) This number isn't specific to any particular use case and represents a an upper bound. The actual impact is likely lower. Interaction with web servers: Before: When a user enters "https://example.com/%41" in the address bar or clicks a link like <a href="https://example.com/%41">, Chrome sends "/A" to the server. After:: Chrome now sends "/%41" to the server, without decoding, similar to Safari and Firefox. Note that Chrome's address bar will still display "https://example.com/A" because the address bar formats URLs in their own way. For websites, how to handle percent-encoded characters in a URL's path is up to each website. Since they can receive such URLs from various clients, not just Chrome, this isn't a new issue for most websites. They typically decode a URL's path before processing. Another concern relates to Chromium's internal code or developers who rely on the current behavior, intentionally or not. For example, this CL might lead to issues in cases like: ``` const hash = {}; const url1 = new URL("http://example.com/%41"); hash[url1.href] = "v1"; // ... const url2 = new URL("http://example.com/A"); hash[url2.href] // Assumed that "v1" is retrieved, but this is no longer true. ``` According to the URL Standard, `url1` and `url2` are not equivalent [6], but some clients might depend on Chrome's current behavior as a feature. This presents a risk. Additional notes: - This change only affects the URL's path. Other parts like the host are not impacted. - There was a discussion about Chrome's behavior [7]. The consensus is that Chrome's behavior should be fixed for better interoperability. - There's a proposal to add a normalization interface [8] to URL. - [1] https://groups.google.com/a/chromium.org/g/blink-dev/c/1L8vW_Xo8eY/m/3Otq2TkvAwAJ - [2] https://url.spec.whatwg.org/#url-parsing - [3] https://deno.land/api@v1.34.3?s=URL - [4] https://docs.rs/url/latest/url/ - [5] https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1bb9e227dc4889fd2efbf5755d256c62 - [6] https://url.spec.whatwg.org/#url-equivalence - [7] whatwg/url#606 - [8] whatwg/url#729 Bug: 1252531 Change-Id: I135b4efbe76bc58ba5b6c5ce652ed0aa72002249 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4607744 Reviewed-by: Daniel Cheng <dcheng@chromium.org> Reviewed-by: James Lee <ljjlee@google.com> Reviewed-by: Avi Drissman <avi@chromium.org> Reviewed-by: Emily Stark <estark@chromium.org> Commit-Queue: Hayato Ito <hayato@chromium.org> Cr-Commit-Position: refs/heads/main@{#1191900}

TimothyGu added the topic: parser label May 21, 2021

philloooo mentioned this issue Sep 23, 2021

Add id member to manifest w3c/manifest#988

Merged

5 tasks

alwinb mentioned this issue Oct 8, 2021

Proposal: file: URL separator deduplication #552

Open

TimothyGu mentioned this issue Oct 15, 2021

How should parser handle percent-encoded characters like %66 U+0066 (f) in path segments? #565

Closed

robjtede mentioned this issue Oct 26, 2021

files: percent-decode url path actix/actix-web#2398

Merged

5 tasks

annevk added clarification Standard could be clearer and removed topic: parser labels Dec 16, 2022

gibson042 mentioned this issue Dec 16, 2022

Proposal: Add a normalization interface #729

Open

lemire mentioned this issue Mar 2, 2023

Additional tests related to pathToFileURL ada-url/ada#248

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should we unescape characters in path? #606

Should we unescape characters in path? #606

TimothyGu commented May 21, 2021

karwa commented May 21, 2021 •

edited

Loading

alwinb commented Jul 7, 2021 •

edited

Loading

philloooo commented Sep 23, 2021

annevk commented Sep 24, 2021

alwinb commented Sep 25, 2021

karwa commented Sep 25, 2021 •

edited

Loading

annevk commented Sep 27, 2021

valenting commented Sep 27, 2021

alwinb commented Sep 29, 2021

annevk commented Sep 29, 2021

guybedford commented Oct 8, 2021

annevk commented Oct 8, 2021

malaire commented Oct 15, 2021 •

edited

Loading

alwinb commented Oct 24, 2021 •

edited

Loading

Jikstra commented Dec 28, 2021

karwa commented Jan 23, 2022 •

edited

Loading

annevk commented Dec 16, 2022

gibson042 commented Dec 16, 2022

annevk commented Dec 16, 2022

Should we unescape characters in path? #606

Should we unescape characters in path? #606

Comments

TimothyGu commented May 21, 2021

karwa commented May 21, 2021 • edited Loading

alwinb commented Jul 7, 2021 • edited Loading

philloooo commented Sep 23, 2021

annevk commented Sep 24, 2021

alwinb commented Sep 25, 2021

karwa commented Sep 25, 2021 • edited Loading

annevk commented Sep 27, 2021

valenting commented Sep 27, 2021

alwinb commented Sep 29, 2021

annevk commented Sep 29, 2021

guybedford commented Oct 8, 2021

annevk commented Oct 8, 2021

malaire commented Oct 15, 2021 • edited Loading

alwinb commented Oct 24, 2021 • edited Loading

Jikstra commented Dec 28, 2021

karwa commented Jan 23, 2022 • edited Loading

annevk commented Dec 16, 2022

gibson042 commented Dec 16, 2022

annevk commented Dec 16, 2022

karwa commented May 21, 2021 •

edited

Loading

alwinb commented Jul 7, 2021 •

edited

Loading

karwa commented Sep 25, 2021 •

edited

Loading

malaire commented Oct 15, 2021 •

edited

Loading

alwinb commented Oct 24, 2021 •

edited

Loading

karwa commented Jan 23, 2022 •

edited

Loading