Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Inconsistent normalization of % escaping #50

Closed
bblimke opened this Issue · 4 comments

2 participants

@bblimke

Addressable is inconsistent in escaping uri's with %

Addressable::URI.parse("https://encrypted.google.com/search?abc=%20").normalize.to_s
=> "https://encrypted.google.com/search?abc=%20"

Addressable::URI.parse("https://encrypted.google.com/search?abc=%2B").normalize.to_s
=> "https://encrypted.google.com/search?abc=+"

There is no way to know if normalized uri will have characters escaped or not.

This issue is a result of https://github.com/myronmarston/vcr/issues/86 and bblimke/webmock#134

@sporkmonger
Owner

Hmm, yeah that '+' probably should not be being decoded.

However, "There is no way to know if normalized uri will have characters escaped or not" is not accurate. Addressable normalizes to the set of characters that the URI spec explicitly allows in decoded form for each component. This tends to be dramatically more permissive of decoded characters than most people are used to, but it's more precisely correct.

The '+' character is a weird edge case where the correct answer is often unclear. The URI spec indicates it should be decoded, but the HTML specification would make decoding it ambiguous as to whether it should be treated as a '+' or a space. I continue to curse the people who put that paragraph into the HTML spec. One of the worst-thought-out and worst-specified bits of text ever in an accepted spec.

So as an accommodation to the HTML spec and widely used conventions, I opt to break the URI spec in the handling of the '+' character within query strings only.

But what all this should tell you is that if you're looking for specific characters to be encoded and specific characters to be unencoded, normalize may not be the method you want. The encoding methods in Addressable can take an optional character class string which directly controls what gets encoded.

@bblimke

Thanks for the detailed explanation. I understand the problem with '+' characters now. Indeed, the HTML spec is a pain.

Does it makes sense to change character class used in normalized_query and to not decode %2B? If it doesn't I will try to encode query with a custom character class in webmock, instead of using normalized_query.

@sporkmonger
Owner

I started trying to get this resolved, but it turns out to be a lot harder to do correctly than I expected.

The trick with the '+' character is that it needs to be handled differently from all other characters. Normally you would unencode, apply Unicode NFKC normalization, then selectively re-encode. However, the '+' character should simply have nothing happen to it at all. However this turns out to be hard to do, because the normalize method is currently implemented sort of like this:

encode(unicode_normalize(unencode(input)))

The trick is that if you tell the unencode method not to unencode "%2B", now the encode method is going to want to encode the "%" character as "%25".

@sporkmonger
Owner

Closed by #99.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.