Need an "unreserved" character set (and better define how to percent-encode arbitrary strings) #369

Closed
mgiuca opened this issue Jan 19, 2018 · 19 comments · Fixed by #513
Labels
topic: model (For issues with the abstract-but-normative bits)

Comments

@mgiuca
Collaborator

mgiuca commented Jan 19, 2018

This is a bit of a jumble of related issues that all stem from one root problem: URL Standard (unlike RFC 3986) does not have a concept of an "unreserved" character set. Apologies that this is a bit of an essay, but since these are all inter-related, I thought I would just group them into one discussion.

Why an unreserved set?

To give some background, RFC 3986's unreserved set (ASCII alphanumeric plus -._~) is the set of characters that are interchangeable in their percent-encoded and non-encoded forms: "URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource." (The earlier RFC 2396 defined a slightly larger unreserved set: ASCII alphanumeric plus !'()*-._~, which will be relevant later.)

In other words, the RFC divides the set of valid URL characters into two subsets: reserved and unreserved. Percent-encoding or percent-decoding a reserved character may change the meaning of the URL (e.g., "?abc=def" and "?abc%3Ddef" have different meanings). Percent-encoding or percent-decoding an unreserved character does not change the meaning of the URL (e.g., "/abc/" and "/%61%62%63/" should be considered equivalent, with "/abc/" being the normalized representation).

URL Standard does not have an equivalent concept, and this manifests as several problems (each of which could have its own bug, but I think it helps to group these together):

  1. URL equivalence is broken. RFC 3986 considers "/abc/" and "/%61%62%63/" to be equivalent; URL Standard does not. URL Standard treats "=" / "%3D" the same way as "a" / "%61" --- in both cases, it does not consider these equivalent. It needs to recognise the equivalence of "a" and "%61". Furthermore, it doesn't even recognise equivalence of uppercase and lowercase percent-encoded bytes: "%3D" and "%3d" are not considered equivalent, because the URL Parser does not normalize lowercase percent-encoded bytes to uppercase. (Note: I like that URL Parser serves as a "normalization" pass, with equivalence being trivially defined as compare serialization for equality. Therefore, I would like to address this in the Parser, rather than modifying the equivalence algorithm itself.)
  2. URL rendering is broken. Currently, the standard just says to percent decode all sequences ("unless that renders those sequences invisible"). This displays URLs ambiguously, because for example "?abc=def" and "?abc%3Ddef" will both be displayed as "?abc=def". It's impossible for the reader to know whether that "=" represents a syntactic "=" or a literal U+003D character represented by "%3D". It's only safe to percent decode characters when rendering if we're sure it doesn't change the semantics of the URL. (A good rule would be that Parse(Serialize(url)) == Parse(Render(url)) should be true for all URLs.) Right now, there is no code point (with a few exceptions, but not the cases I'm talking about) that can be decoded without changing the way the URL parses.
  3. There is no well-defined algorithm or character set for safely encoding an arbitrary string into a URL. Of course, there is encodeURIComponent (defined in ECMAScript), but I'm not sure how to reference an ECMAScript API from a web standard. I think URL Standard itself should define how to safely encode a string for a URL. Case in point: registerProtocolHandler is incorrectly specified as using the "default encode set", which is the old name for the path percent-encode set, which doesn't encode enough characters (in particular, '&' and '='). I'll file a separate bug on HTML, but the fix is quite difficult to define (other than "use encodeURIComponent") because URL Standard doesn't define an equivalent set of characters.
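For reference, encodeURIComponent leaves exactly the ASCII alphanumerics plus !'()*-._~ bare and percent-encodes everything else:

// encodeURIComponent escapes everything except A-Z a-z 0-9 ! ' ( ) * - . _ ~
encodeURIComponent("abc");       // "abc"
encodeURIComponent("!'()*-._~"); // "!'()*-._~" (left bare)
encodeURIComponent("&=?#");      // "%26%3D%3F%23"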

So how does adding an unreserved set help with these?

  1. The URL Parser would be updated to decode any percent-encode sequences in the unreserved set. This fixes equivalence because "%61" would decode to "a", and thus "a" would be equivalent to "%61". (We should also fix it so it normalizes "%3d" to "%3D", but that's a separate issue.)
  2. URL Rendering would be updated to only decode percent-encoded bytes above 0x7f (i.e., non-ASCII characters). This is because the parser would have already decoded all unreserved characters (the only ASCII characters that are safe to decode). The only job for the renderer is to decode the non-ASCII characters (which are technically unreserved, but don't appear in the unreserved set, so that the parser produces ASCII-only strings).
  3. We would also define a "default encode set" as the complement of the unreserved set. Other specs (like registerProtocolHandler, and my draft Web Share Target API) would be encouraged to percent-encode all characters in this set.

What should be in the unreserved set?

So what characters should be in the unreserved set? I propose three alternatives (from largest to smallest):

  1. ASCII alphanum plus !$'()*,-.;_~. This reserves the bare minimum set of characters. I compiled this list by carefully reading the URL standard and deciding whether each ASCII character has any special meaning. The characters listed above have no intrinsic meaning anywhere in the URL standard (note that '.' has special meaning in single- and double-dot path segments, but "." and "%2E" are already considered equivalent in that regard).
  2. ASCII alphanum plus !'()*-._~. This matches RFC 2396, the older IETF standard.
  3. ASCII alphanum plus -._~. This matches RFC 3986.
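Expressed as purely illustrative JavaScript character classes (names are mine, not spec text), the three options are:

// Illustrative names only; the numbers match the options above.
const UNRESERVED_OPT1 = /^[A-Za-z0-9!$'()*,\-.;_~]$/; // minimal reservation
const UNRESERVED_OPT2 = /^[A-Za-z0-9!'()*\-._~]$/;    // RFC 2396
const UNRESERVED_OPT3 = /^[A-Za-z0-9\-._~]$/;         // RFC 3986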

Of these, I prefer option 2 (match RFC 2396). Option 1 is the most "logical" because it can be directly derived from reading the rest of the spec, but it doesn't leave any room for either this spec, or any individual schemes, to add their own special meaning to any new characters (which was the purpose of reserved characters in the first place). Option 3 matches the most recent IETF URL specification, which deliberately moved !'()* into the reserved set, but I don't think this move had much impact on implementations. For example, encodeURIComponent still uses the reserved set from RFC 2396. Option 2 exactly matches the encode set of encodeURIComponent. Furthermore, choosing Option 2 more or less matches Chrome's current behaviour (though it differs from one context to another, as discussed below).

An open question is whether non-ASCII characters should appear in the unreserved set. This mostly doesn't matter, because all non-ASCII characters are in all of the percent-encode sets, so they always get normalized into encoded form. Technically, they act like unreserved characters because the URL's semantics don't change as you encode/decode them. But I am leaving them out because the unreserved characters should be those that normalize to being decoded.

Study of current implementations

A WHATWG standard is supposed to describe how implementations actually behave. My experiments with Chrome 63 and Firefox 52 suggest that implementations do not follow the current URL standard at all, and are much closer to matching what I suggest above. (Disclaimer: I work for Google on the Chrome team.)

URL equivalence

I can't find a good built-in way on the browser side to test URL equivalence (since the URL class has no equivalence method). But we can use this function to test equivalence of URL strings, based on the browser's implementation of URL parsing and serializing:

// Compare two URL strings by parsing both against the document's base URL and
// comparing the serializations; the parser acts as the normalization pass.
function urlStringsEquivalent(a, b) {
    return new URL(a, document.baseURI).href == new URL(b, document.baseURI).href;
}

Here, Chrome mostly matches RFC 3986's notion of syntax-based equivalence:

  • urlStringsEquivalent('a', 'a'): true
  • urlStringsEquivalent('a', '%61'): true (normalized to 'a')
  • urlStringsEquivalent('~', '%7E'): true (normalized to '~')
  • urlStringsEquivalent('=', '%3D'): false (not normalized)
  • urlStringsEquivalent('*', '%2A'): false (not normalized)
  • urlStringsEquivalent('<', '%3C'): true (normalized to '%3C')

Specifically, Chrome's URL parser decodes all characters in the RFC 3986 unreserved set: ASCII alphanum plus -._~.

But Chrome also fails to normalize case when it doesn't decode a percent-encoded sequence:

  • urlStringsEquivalent('%6e', '%6E'): true (normalized to 'n')
  • urlStringsEquivalent('%3d', '%3D'): false (not normalized)

Firefox, on the other hand, follows the current URL standard:

  • urlStringsEquivalent('a', 'a'): true
  • urlStringsEquivalent('a', '%61'): false (not normalized)
  • urlStringsEquivalent('~', '%7E'): false (not normalized)
  • urlStringsEquivalent('=', '%3D'): false (not normalized)
  • urlStringsEquivalent('*', '%2A'): false (not normalized)
  • urlStringsEquivalent('<', '%3C'): true (normalized to '%3C')
  • urlStringsEquivalent('%6e', '%6E'): false (not normalized)
  • urlStringsEquivalent('%3d', '%3D'): false (not normalized)

In my opinion, the spec and Firefox should change so that these "unreserved" characters (particularly alphanumeric) are equivalent to their percent-encoded counterparts. Though I think instead of using Chrome's set (RFC 3986), we should use RFC 2396, for compatibility with encodeURIComponent (hence urlStringsEquivalent('*', '%2A') should return true).

URL rendering

Paste this URL in the address bar:

https://example.com/%20%21%22%23%24%25%26%27%28%29%2a%2b%2c%2d%2e%2f%30%31%32%33%34%35%36%37%38%39%3a%3b%3c%3d%3e%3f%40%41%42%43%44%45%46%47%48%49%4a%4b%4c%4d%4e%4f%50%51%52%53%54%55%56%57%58%59%5a%5b%5c%5d%5e%5f%60%61%62%63%64%65%66%67%68%69%6a%6b%6c%6d%6e%6f%70%71%72%73%74%75%76%77%78%79%7a%7b%7c%7d%7e%7f%ce%a9

Chrome decodes the following characters: ASCII alphanum, non-ASCII, and "-.<>_~. All other characters remain encoded. This is the RFC 3986 unreserved set, plus "<>, which is the intersection of the URL Standard fragment and query encode sets (those three characters are always encoded by the parser, so like unreserved characters, they have the same semantics whether encoded or not). Parse(Serialize(url)) == Parse(Render(url)) is true for Chrome for all URLs.

Firefox decodes the following characters: ASCII alphanum, non-ASCII, backtick, and !"'()*-.<>[\]^_{|}~. All other characters remain encoded. This is the RFC 2396 unreserved set, plus backtick and "<>[\]^{|}. I'm not sure what the rationale behind Firefox's decode set is.

Parse(Serialize(url)) == Parse(Render(url)) is not true for Firefox. For example, the URL "https://example.com/%2A": Parse(Serialize(url)) gives "https://example.com/%2A", while Parse(Render(url)) gives "https://example.com/*".

Clearly, neither of these implementations follows the standard, which says to decode all characters. Therefore, the spec should change to more closely match implementations, preferably using RFC 2396's unreserved set, for consistency. We could also throw in ", < and >, since these will be re-encoded upon parsing. Whatever is decided, it should be the case that Parse(Serialize(url)) == Parse(Render(url)).
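A minimal sketch of such a rendering step (renderURL is an illustrative name, not spec text), assuming we decode only percent-encoded bytes at or above 0x80:

function renderURL(serialized) {
  // Decode runs of percent-encoded non-ASCII bytes (0x80-0xFF) as UTF-8;
  // every ASCII escape is left exactly as the parser produced it.
  return serialized.replace(/(?:%[89A-Fa-f][0-9A-Fa-f])+/g, run => {
    try {
      return decodeURIComponent(run);
    } catch (e) {
      return run; // not valid UTF-8: keep it encoded
    }
  });
}

renderURL("https://example.com/%CE%A9%3D"); // "https://example.com/Ω%3D"

Reparsing the rendered form re-encodes the non-ASCII characters, so Parse(Serialize(url)) == Parse(Render(url)) holds for this sketch.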

Encoding arbitrary strings

Let's take a look at registerProtocolHandler's escaping behaviour when a URL is escaped before being substituted into the "%s" template string. The spec says to escape it with the "default encode set", which no longer exists but links to the path percent-encode set, which is: C0 control chars, space, backtick, non-ASCII, and "#<>?{}.

I'll test this by navigating to httpbin and running this code in the Console:

navigator.registerProtocolHandler("mailto", "/get?address=%s", "httpbin")

Now a malicious site can inject other query parameters by linking you to "mailto:foo@example.com&launchmissiles=true".

According to the spec, this is supposed to open https://httpbin.org/get?address=mailto:foo@example.com&launchmissiles=true. That's a query parameter injection attack. httpbin displays:

  "args": {
    "address": "mailto:foo@example.com", 
    "launchmissiles": "true"
  },

Fortunately, Chrome and Firefox both encode many more characters. In both cases, they open https://httpbin.org/get?address=mailto%3Afoo%40example.com%26launchmissiles%3Dtrue, so the '&' and '=' are correctly interpreted as part of the email address, not separate arguments. httpbin displays:

  "args": {
    "address": "mailto:foo@example.com&launchmissiles=true"
  },

Chrome uses the RFC 2396 reserved set, matching encodeURIComponent. Firefox additionally leaves a few characters unencoded: <>[\]{|} (but nothing important).

I think the correct fix is to change registerProtocolHandler's spec (in HTML) to match encodeURIComponent. However, there isn't an easy way to do that, short of calling into the ECMAScript-defined encodeURIComponent method, or explicitly listing all characters. If we had an appropriate "reserved set" or "default encode set" (the complement of the unreserved set) in the URL Standard, then registerProtocolHandler can just use that.
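For what it's worth, encodeURIComponent already produces exactly the query value both browsers generate above:

encodeURIComponent("mailto:foo@example.com&launchmissiles=true");
// => "mailto%3Afoo%40example.com%26launchmissiles%3Dtrue"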

Note that I am developing the Web Share Target API and need basically the same thing as registerProtocolHandler. At the moment, I've written it as "userinfo percent-encode set", but that still doesn't capture all the characters I need (especially '&').

Recommendations

Given all of the above, I would like to make the following changes to URL Standard:

  1. Define an "unreserved set", probably as ASCII alphanumeric plus !'()*-._~ (which matches RFC 2396, but the exact set is debatable).
  2. Define a "reserved set" or "default encode set" as the complement of unreserved. (This set would include all C0 control chars, as well as all non-ASCII characters.) (If it's called "default encode set", then registerProtocolHandler is automatically fixed. Otherwise we have to update registerProtocolHandler to use the reserved set's name instead.)
  3. Add a recommendation that other standards use the default encode set for sanitizing strings before inserting them into a URL. Note that this is equivalent to ECMAScript's encodeURIComponent function.
  4. The URL Parser needs to be updated to normalize percent-encoded sequences: values in the unreserved set need to be decoded. Values not in the unreserved set need to have their hex digits uppercased. (This is actually fairly hard, due to the way the parser is written. Some refactoring required, but doable.) Note that this automatically fixes equivalence.
  5. The URL Rendering algorithm needs to be updated. Instead of decoding all characters, only decode non-ASCII characters, and optionally, ", < and > (the intersection of the fragment and query encode sets). Note that this algorithm should satisfy Parse(Serialize(url)) == Parse(Render(url)) for all URLs.

Doing so would solve a number of issues outlined above, and bring the spec much closer to existing implementations. It would then make sense to update implementations to match the new spec.

I am quite familiar with the URL Standard and am volunteering to make the required changes, if there is consensus. Also, I don't strictly need there to be a reserved / unreserved set. These three problems could be fixed individually. But it makes the most sense to conceptualize this as reserved vs unreserved, and then tie a bunch of other definitions off of those concepts.

Regards,

Matt

@rmisev
Member

rmisev commented Jan 19, 2018

  1. The URL Parser would be updated to decode any percent-encode sequences in the unreserved set. This fixes equivalence because "%61" would decode to "a", and thus "a" would be equivalent to "%61".

This decoding can cause a reparse problem, see #87 (comment) :

...
I disagree with this proposal and rfc3986 because I think a canonicalized URL should always be untouched when reparsed. With this proposal, "http://host/%%36%31" would be canonicalized to "http://host/%61" which when reparsed would become "http://host/a" which is bad. Right now Chrome percent-encodes the first '%' in "http://host/%%36%31" which is strange, and Edge throws an exception.
...

It's problematic because lone "%" characters are still allowed. I think the problem can be solved by percent-encoding "%" as "%25" when it isn't part of a valid percent-encode sequence (see issue #170), but this can be problematic as well; see bug https://bugzilla.mozilla.org/show_bug.cgi?id=61269

@mgiuca
Collaborator Author

mgiuca commented Jan 22, 2018

Update: Someone else coincidentally filed the registerProtocolHandler issue at whatwg/html#3377.

This decoding can cause a reparse problem, see #87 (comment).

I see. This issue is more complex than I thought (mostly because of the nested-escape issue).

It's problematic because alone % are still allowed. I think the problem can be solved by percent encoding % to %25 when it isn't part of valid percent encode sequence (see issue: #170), but this can be problematic as well

Yeah, we can't do that. It would break registerProtocolHandler, which is based on the (IMHO rather flimsy) assumption that "%s" parses as "%s" (despite a validation error). I was encouraged to rely on the same mechanism in w3c/web-share-target#31.

Having said that, I think we can solve it basically the way that Chrome solved it. Having proper equivalence defined for URLs is kind of important. I don't think we should jettison that concept because there is an edge case that causes trouble.

Here's what I'm proposing:

A "decodable percent sequence" is a "%" followed by two hex digits representing a byte in the unreserved set.

  • When the URL Parser encounters a %, it consumes it, then looks ahead at the following characters.
  • If it finds two hexdigits, it consumes them. Then,
    • If those hexdigits are a byte value in the unreserved set, it emits that byte value's code point.
    • Else, it emits those hexdigits, converted to uppercase.
  • Else, if it finds a hexdigit followed by a decodable percent sequence, a decodable percent sequence followed by a hexdigit, or two decodable percent sequences, validation error, and it emits "%25" without consuming those tokens.
  • Else, validation error, and it emits "%".

Test cases:

  • "%61" -> "a"
  • "%3d" -> "%3D"
  • "%%361" -> "%2561" (with validation error)
  • "%6%31" -> "%2561" (with validation error)
  • "%6%3D" -> "%6%3D" (with validation error)
  • "%%36%31" -> "%2561" (with validation error)
  • "%6%%331" -> "%6%2531" (with validation error)
  • "%6%2531" -> "%6%2531" (with validation error)
  • "%s" -> "%s" (with validation error)

I think that covers it, including nested cases. What we gain from this is that we can define a set of characters that must be considered equivalent.

Let me further justify why we need this.

@annevk (on #87)
the mapping of an HTTP request to a resource on disk is not exactly governed by the URL Standard. How servers deal with URLs and what amount of normalization they apply is very much up to them.
...
What I'm favor of is that clients do not normalize and treat your given examples as distinct. This follows from how we define the URL parser and then pass the URL to the networking subsystem, etc.

A server however can still see those paths and treat them as equivalent. That would be up to the server library that maps paths to resources. I don't think we have to take a stance on whether such mapping takes place in the URL Standard.

I agree that servers (let's call them "URL processors" -- any application that breaks down a URL and uses its pieces, whether mapping it onto a file system, or otherwise) should be free to treat certain characters, such as '$', as equivalent to their encoded counterparts, or not, as they wish. What we're missing is a mandate that URL processors must treat other characters, such as 'a', as equivalent to their encoded counterparts.

Let's call these two character sets "reserved" and "unreserved". Encoding or decoding a reserved character may or may not change the meaning of the URL (depending on the processor). Encoding or decoding an unreserved character does not change the meaning of the URL. These sets impact rendering and encoding as follows:

  • URL rendering should not be allowed to decode any character in the reserved set, because that would change the meaning of the URL and present the reader with an ambiguous string (that could represent one of many URLs). Conversely, URL rendering can freely decode any character in the unreserved set.
  • When encoding an arbitrary string to be inserted into a URL (for example: with registerProtocolHandler, but also any time this is done by an application), any character in the reserved set must be encoded, so that it surely represents the literal character, and not some syntactic component of either the URL syntax, or a quirk of some unknown URL processor.

The current status quo is essentially that the "unreserved" set is the null set. This means that:

  • URL rendering cannot decode any character, because even decoding "%61" to "a" could change the meaning of the URL.
  • When encoding an arbitrary string, every character must be encoded. e.g., constructing a query string "name=%s", if the name is "Matt", we have to produce "name=%4D%61%74%74"; otherwise some URL processor could treat the letter "a" specially. Encoding it as "%61" is the only way to be sure it's treated as a literal.

Putting those two together, a URL with "name=%4D%61%74%74" has to be rendered as "name=%4D%61%74%74", so all URLs are ugly and impossible for a human to read.

Now you may be saying: "Come on, don't be so pedantic. No URL processor is going to treat "a" differently to "%61", so surely we don't need to encode it!" OK, but how do I choose which characters need to be encoded and which don't? How do I know which characters will be treated equivalently to their encoded versions, and which won't? I have no more faith that a URL processor will consider "a" and "%61" equivalent than I do for "=" and "%3D". In order to know what characters need to be encoded, the URL specification needs to explicitly state which characters a URL processor is allowed to treat specially (the reserved set) and which it isn't (the unreserved set).

Corollary: If we don't define an unreserved set, we still need to define some set of characters that registerProtocolHandler (and Web Share Target) should encode. What is that set? It can't be any of the existing percent-encode sets, since they don't encode enough characters. Sure, we could throw in a few more characters like '&' and '=', but what is the right answer to "which characters need to be encoded to ensure the characters in the string aren't treated specially by the URL processor?". My search for an answer to this question led me to the conclusion that we need to bring back the RFC 3986 concept of an unreserved set.

@annevk
Member

annevk commented Jan 22, 2018

https://%6d%6D/%6d%6D?%6d%6D#%6d%6D yields https://mm/mm?%6d%6D#%6d%6D in Chrome, meaning URL processors would still face the choices you suggest they'd no longer have to face if we change something here (or are you indeed proposing to normalize all and change Chrome as well?). Given that Edge normalizes the path too it seems reasonable to consider that, but it does not really address your larger point, which I'm not quite sure I fully understand.

@bsittler

Slightly aside the point of this issue but here goes anyhow: is there any good reason to even allow %-encoding of ASCII alphanumerics? Is there actually enough legitimate usage or an otherwise-impossible scenario reliant on this feature to justify it? It seems to me like it's primarily allowing naïve filters to be bypassed, similar to overlong UTF-8 encodings -- which are thankfully banned on the web for reasons of security. Is there any reason we cannot likewise ban these?

@mgiuca
Collaborator Author

mgiuca commented Jan 22, 2018

@annevk:
https://%6d%6D/%6d%6D?%6d%6D#%6d%6D yields https://mm/mm?%6d%6D#%6d%6D in Chrome, ... (or are you indeed proposing to normalize all and change Chrome as well?).

Right. I don't see any reason to not normalize in the query and fragment as well, and update Chrome to match.

Theoretically, it shouldn't matter whether any particular URL processor normalizes "%6D" as "m" or not, because "%6D" should be considered equivalent to "m". The only problem is a technicality that equivalence is defined by the URL parser, so we need to specify that "%6D" decodes to "m" in the parser, otherwise it isn't considered equivalent.

but it does not really address your larger point, which I'm not quite sure I fully understand.

I'll go from my most pragmatic concern to most ideological:

  1. If you have an arbitrary string and need to insert it into a URL (as we do in registerProtocolHandler and Web Share Target), it is difficult to find the right set of characters to percent-encode. None of the encoding sets defined in URL Standard are appropriate. This leads directly to whatwg/html#3377 ("navigator.registerProtocolHandler specification divergent from both implementations"): registerProtocolHandler's specification is wrong. Fixing the spec would basically be changing it to say "encode all code points that are not in the RFC 2396 unreserved set". We could just do that, and stop there, but that feels unsatisfactory.
  2. The reason it feels unsatisfactory is that there is no logical reason (based on the text of URL Standard) why '$' should be encoded but 'a' should not. It "feels right" to encode '$' but not 'a', but that's because we're used to decades of both software and specs doing so. There's nothing in URL Standard to differentiate these characters. Also, if you carefully read the spec, there is nothing special about '&', outside of x-www-form-urlencoded and the URLSearchParams JavaScript API. Nothing in the core definition of a URL (the parser, serializer and equivalence logic) treats '&' any differently to '$' or 'a'. So I would have no reason for registerProtocolHandler to encode '&' other than "we all know that '&' delimits query parameters". I feel like the URL standard needs to have a much clearer policy on what characters should be encoded and what characters are safe to leave bare.
  3. If there's nothing in the spec that explicitly says "a and %61 mean exactly the same thing", then theoretically I can't be sure that a URL processor won't use 'a' as some kind of delimiter, while "%61" is used to represent a literal 'a', just as URL processors commonly use '&' as a delimiter, while "%26" is used to represent a literal '&'. Even though it's unlikely that 'a' would be used as a syntax character, the spec allows it. Thus, theoretically, a general algorithm for encoding a string for insertion into a URL must encode all characters, just to be sure. While this may be only a theoretical concern, the specification document for a core technology of the Internet should not allow for implementations that break everybody's expectations while still following the letter of the law. By contrast, RFC 3986 gives us a defined list of characters that do not need to be encoded because the processor on the other end is not allowed to distinguish the encoded and non-encoded form. That's what we need.

@bsittler:
is there any good reason to even allow %-encoding of ASCII alphanumerics? Is there actually enough legitimate usage or an otherwise-impossible scenario reliant on this feature to justify it?

If you're going to go down this path, I'd want other unreserved characters (like '_' and '~') treated the same. Otherwise you create three classes of character: reserved, unreserved and non-encodable, with the same problem just for a smaller set of characters.

It seems to me like it's primarily allowing naïve filters to be bypassed, similar to overlong UTF-8 encodings -- which are thankfully banned on the web for reasons of security. Is there any reason we cannot likewise ban these?

I can't speak to whether this would mitigate any realistic security problems. My feeling is that it's been legal to encode unreserved characters for 20 years and making it illegal now would break countless pieces of software --- especially since different encoders encode slightly different sets of characters (e.g., Python's urllib.parse.quote encodes '~', even though it's in the unreserved set, so if we made "%7E" illegal, URLs generated by Python would become illegal).

(Also note that the UTF-8 standard itself banned overlong encodings from 2003 onwards; this isn't a web-specific restriction.)

@annevk
Member

annevk commented Jan 23, 2018

@bsittler I strongly suspect it's not web-compatible, but I welcome Chrome demonstrating otherwise.

@mgiuca there are many things not entirely logical on the web, but if you can solve this particular one I'm not opposed. As I said earlier, more normalization seems nice if we can get away with it.

@valenting @achristensen07 @travisleithead would you be okay with more aggressive normalization of percent-encoded bits in URLs if Chrome (or some other entity) can demonstrate it's feasible?

@mgiuca
Collaborator Author

mgiuca commented Jan 23, 2018

@annevk:
@mgiuca there are many things not entirely logical on the web, but if you can solve this particular one I'm not opposed. As I said earlier, more normalization seems nice if we can get away with it.

Cool, I'll write a draft change but I won't polish it too much since this is still being debated.

If we decide not to change normalization, I still think we need to solve the other two issues: rendering and encoding arbitrary strings. Rendering needs to state a certain set of characters to be decoded. Encoding needs to state a certain set of characters to be encoded (and, for example, registerProtocolHandler would use that set).

@domenic
Member

domenic commented Jan 23, 2018

My personal take is that we should work to solve these issues in order of most practical to most theoretical, with corresponding urgency. Here I am using @mgiuca's enumeration in #369 (comment). So I would suggest:

  1. We should solve whatwg/html#3377 ("navigator.registerProtocolHandler specification divergent from both implementations") first, ASAP, since it's a mismatch with browsers and a potential interop problem. In doing so we'll define a new encode set, either in HTML or URL.
  2. Then we can move the encode set into URL (if it's not there already), and give more general advice along the lines of "when inserting arbitrary strings into URLs, use this encode set"
  3. Then we can delve into issues around URL processors (like servers), URL renderers (like browser URL bars), URL equivalence, and what their relationship to encoding should be. I don't feel like I understand this space well enough, but I think if I re-read @mgiuca's comments I would have ideas of concrete steps here. But I think we should leave this for last, once we've laid the right foundation, as it's a tricky area.

(3) is the only area where I think we would make normative changes to URL parsing, based on the principle that URLs should be equivalent if and only if they parse the same. (Which IMO is a very good principle.)

@mgiuca, @annevk, does this make any sense as an approach? Although I suppose @annevk has already pinged the other implementers for their take on (3), so maybe we're just going straight for solving everything at once :)

In general I appreciate @mgiuca's thinking about how this standard applies in a larger context, and think we should definitely work on incorporating such suggestions.

@mgiuca
Collaborator Author

mgiuca commented Jan 23, 2018

@domenic Yes, that sounds like the right order of steps.

(1) unblocks Web Share Target and fixes registerProtocolHandler. That's the main practical issue I'm trying to solve.

(3) doesn't really have a pressing issue that I'm trying to solve, but I think it's nice to fix it.

Edit: Although having said that, I'd like to come to an agreement on what the "unreserved" set would ultimately be in (3) (e.g., is it from RFC 2396, 3986, or some other set?), because that will inform the encode set in (1).

@annevk
Member

annevk commented Jan 23, 2018

3986 seems safest. I just realized that the problems you allude to though will continue to exist for non-ASCII data, which is why I think I gave up on pursuing something grander here since the producer and consumer will need to have some agreement at some level anyway.

@mgiuca
Collaborator Author

mgiuca commented Jan 23, 2018

3986 seems safest.

The only problem (as I said in my initial "essay") is that many "arbitrary string encoders" (including encodeURIComponent and both Chrome's and Firefox's implementations of registerProtocolHandler) use the RFC 2396 set. I think we can work with either set, since the delta between them (!'()*) consists of non-syntactic characters, and it shouldn't matter whether we treat them as reserved or unreserved.

(Note that the default-encode set should include all reserved characters, but it's OK for it to be a superset of the reserved characters, and thus unnecessarily encode some unreserved characters. So I think it's safer actually to have a larger unreserved set from 2396.)

I just realized that the problems you allude to though will continue to exist for non-ASCII data, which is why I think I gave up on pursuing something grander here since the producer and consumer will need to have some agreement at some level anyway.

Actually, non-ASCII data is a non-issue. Both the current URL Standard and RFC 3987 treat any non-ASCII character equivalently to its encoded form (by virtue of normalizing them to percent-encoded form). The same is true of all characters in the C0 control set, which are normalized to encoded form.
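For example, any conforming parser already normalizes non-ASCII path characters to their encoded form:

new URL("https://example.com/Ω").href; // "https://example.com/%CE%A9"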

Any character that is normalized either to encoded or non-encoded form does not trigger any of the above issues. It doesn't matter if such a character is rendered encoded or non-encoded, because it has the same meaning. It doesn't matter if such a character is encoded by an "arbitrary string encoder" or not, because it has the same meaning.

So as far as I can tell, this whole issue revolves around ASCII characters outside of the C0 control set, which are not normalized one way or the other.

@annevk
Member

annevk commented Jan 23, 2018

Well, it matters in the same sense I think as you don't know whether %FF is bogus UTF-8 or meant to represent byte 0xFF.

As for what the unreserved set should be, if you make it larger, don't you encroach on what 3986 considers reserved and for which changing the encoding would change the meaning? Aligning with JavaScript's notion seems nice too though, in a way. I guess I don't feel strongly.

@LEW21

LEW21 commented Apr 11, 2018

While RFC 3986 (from 2005) has tried to re-reserve !'()*, I think this should be considered a failed attempt, as the most popular encoder - encodeURIComponent - ignores this reservation. RFC 2396 (from 1998) explicitly declared them as unreserved, and it's just impossible to reserve a thing that was formally unreserved for 7 years - and that other standards, like ECMAScript 3 (from 1999), were building upon.

@mgiuca
Collaborator Author

mgiuca commented Apr 12, 2018

I agree with @LEW21. By making the reserved set match RFC 2396, we would be undoing the change of the newer RFC, but I believe that would align with the WHATWG mission of describing things as they are, not how they "should be". Most software that I've seen matches RFC 2396, including common programming language libraries.

(Of course, there will be software that follows RFC 3986 also. It's a tough decision.)

@mnot
Member

mnot commented Apr 12, 2018

Just trying a couple, JS seems very much in the minority here...

Python:

>>> import urllib.parse
>>> urllib.parse.quote("!'()*")
'%21%27%28%29%2A'

PHP:

urlencode("!'()*");
=> "%21%27%28%29%2A"

Perl:

use URI::Escape;
print(uri_escape("!'()*"));
%21%27%28%29%2A

Ruby:

require "erb"
include ERB::Util
puts url_encode("!'()*")
%21%27%28%29%2A

Go:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	fmt.Println(url.PathEscape("!'()*"))
}

%21%27%28%29%2A

Java:

import java.net.URLEncoder;

String e = URLEncoder.encode("!'()*", "UTF-8");
System.out.println(e);

%21%27%28%29*
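
And for contrast, the JavaScript behaviour in question:

encodeURIComponent("!'()*"); // "!'()*" (all left bare)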

@mgiuca
Collaborator Author

mgiuca commented Apr 12, 2018

Hmm, I didn't know all of those @mnot -- I thought Java at least was based on the old standard. It looks like Java is based on x-www-form-urlencoded which is a different set again (it doesn't encode *-._). This means we should at least have * in the unreserved set, since it is left bare by the x-www-form-urlencoded encoder (but encoded by other encoders) and must therefore be considered equivalent between the two forms.

The thing is though, that it's safer to leave characters in the unreserved set, as long as they aren't used for any syntax. That way, encoders are free to encode them, or not, as they choose, without changing the semantics. If we chose -._~ (RFC 3986) as the unreserved set, then encoders based on RFC 3986 will be fine, but anything that encodes !'()* will potentially change the semantics of those characters (because URL standard will consider "!" and "%21" to be non-equivalent, for example).

If we choose !'()*-._~ (RFC 2396) as the unreserved set, encoders based on RFC 3986 will be fine; maybe they don't encode "!", maybe they do; either way it will be viewed exactly the same by the parser. But encoders based on RFC 2396, or x-www-form-urlencoded will also be fine. Basically, the larger the set, the better, as long as none of those symbols have an existing meaning in URL syntax (which I don't believe they do).

That's why I suggested a possibly even wider set: !$'()*,-.;_~, which is all characters that have no special meaning in URL syntax. (This adds '$', ',', and ';' to the RFC 2396 set.) Adding any other character potentially runs into trouble.

@LEW21

LEW21 commented Apr 12, 2018

Python does not match any RFC - it encodes ~, and doesn't encode /:

from urllib.parse import quote

quote('-._~') # -._%7E
quote('/') # /

I want to make a merge request to Python after this spec decides on what to do - but no idea if they will agree to change the behavior, or to add a new function.

PHP encodes ~:

echo urlencode('-._~'); # -._%7E
echo urlencode('/'); # %2F

Perl:

use URI::Escape;
print(uri_escape("-._~")); # -._~
print(uri_escape("/")); # %2F

Ruby:

require "erb"
include ERB::Util
puts url_encode("-._~") # -._~
puts url_encode("/") # %2F

Go:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	fmt.Println(url.PathEscape("-._~")) // -._~
	fmt.Println(url.PathEscape("/"))    // %2F
}

Java encodes ~:

import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class HelloWorld {
  public static void main(String[] args) {
    try {
	    System.out.println(URLEncoder.encode("-._~", "UTF-8")); // -._%7E
	    System.out.println(URLEncoder.encode("/", "UTF-8")); // %2F
    } catch (UnsupportedEncodingException e) {}
  }
}

@annevk
Member

annevk commented May 7, 2020

In whatwg/html#3377 (comment) I have attempted to summarize what browsers do for registerProtocolHandler(). My suggestion is to adopt that for a new "external string encode set" which you could use if you have some string you want to insert in a path, query, or fragment and are not sure of its contents.

If after that folks still want to pursue more drastic URL parser changes I suggest we do that in a new issue and keep this focused on the external string/templating case and the precedent for that in implementations.

annevk added the topic: model label and removed the topic: parser label on May 7, 2020
@mgiuca
Collaborator Author

mgiuca commented May 8, 2020

I think that a proposed "external string encode set" would exactly satisfy my request here. It wouldn't be specific to RPH; as you said, it's for any time you have a string with no idea of its contents and want to shove it in a URL (i.e., JavaScript's encodeURIComponent).

annevk added a commit that referenced this issue May 9, 2020
And use it internally. This is also an initial step for #369.
annevk added a commit that referenced this issue May 12, 2020
And use it internally. This is also an initial step for #369.
annevk added a commit that referenced this issue May 12, 2020
Also start using a hyphen for percent-encode and percent-decode consistently and clarify the various operations and how they relate.

This helps #369 and closes #296.
annevk added a commit that referenced this issue May 12, 2020
annevk added a commit that referenced this issue Jun 24, 2020
annevk added a commit that referenced this issue Aug 4, 2020