Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Inconsistent normalization of % escaping #50

bblimke opened this Issue Sep 24, 2011 · 4 comments


None yet
2 participants

bblimke commented Sep 24, 2011

Addressable is inconsistent in escaping uri's with %

=> "https://encrypted.google.com/search?abc=%20"

=> "https://encrypted.google.com/search?abc=+"

There is no way to know if normalized uri will have characters escaped or not.

This issue is a result of https://github.com/myronmarston/vcr/issues/86 and bblimke/webmock#134


sporkmonger commented Sep 25, 2011

Hmm, yeah that '+' probably should not be being decoded.

However, "There is no way to know if normalized uri will have characters escaped or not" is not accurate. Addressable normalizes to the set of characters that the URI spec explicitly allows in decoded form for each component. This tends to be dramatically more permissive of decoded characters than most people are used to, but it's more precisely correct.

The '+' character is a weird edge case where the correct answer is often unclear. The URI spec indicates it should be decoded, but the HTML specification would make decoding it ambiguous as to whether it should be treated as a '+' or a space. I continue to curse the people who put that paragraph into the HTML spec. One of the worst-thought-out and worst-specified bits of text ever in an accepted spec.

So as an accommodation to the HTML spec and widely used conventions, I opt to break the URI spec in the handling of the '+' character within query strings only.

But what all this should tell you is that if you're looking for specific characters to be encoded and specific characters to be unencoded, normalize may not be the method you want. The encoding methods in Addressable can take an optional character class string which directly controls what gets encoded.

bblimke commented Sep 25, 2011

Thanks for the detailed explanation. I understand the problem with '+' characters now. Indeed, the HTML spec is a pain.

Does it makes sense to change character class used in normalized_query and to not decode %2B? If it doesn't I will try to encode query with a custom character class in webmock, instead of using normalized_query.


sporkmonger commented Oct 31, 2011

I started trying to get this resolved, but it turns out to be a lot harder to do correctly than I expected.

The trick with the '+' character is that it needs to be handled differently from all other characters. Normally you would unencode, apply Unicode NFKC normalization, then selectively re-encode. However, the '+' character should simply have nothing happen to it at all. However this turns out to be hard to do, because the normalize method is currently implemented sort of like this:


The trick is that if you tell the unencode method not to unencode "%2B", now the encode method is going to want to encode the "%" character as "%25".


sporkmonger commented Dec 28, 2012

Closed by #99.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment