Distinction between encoded and non-encoded path separator is normalized away #4

mjwillson · 2009-12-19T23:42:20Z

The problem is that 'normalize' is un-encoding percent-encoded separator characters in the URI path. So eg

irb(main):035:0> Addressable::URI.parse("/search/foo%2Fbar") == Addressable::URI.parse("/search/foo/bar")
=> true

These are two different URLs; one has 2 path components, the other has 3. Most web app routing code would treat them differently - quite important when you're doing routes for search queries for example :)

In terms of the spec, only 'unreserved' characters can be percent-encoded or decoded without changing the semantics of the URL. If one of the reserved chars (gen-delims or sub-delims) has been encoded, you have to assume that it might have been encoded for a reason, to distinguish it from the use of that character as a separator.

So I think 'normalize' should only normalize the encoding of unreserved chars, and leave reserved ones untouched.

There were some docs saying it was designed to be user-friendly rather than strictly spec-compliant, so I guess that might be why, although even then - applications rely on encoding distinctions for separator characters in URI paths, so if the user copy/pastes a URL with these distinctions, it needs to get through to the web app untouched.

It also seems like .normalize is used under the hood in a lot of places in the codebase (eg in URI#==), so I'd prefer the default implementation of it to be spec-compliant, with a "user_friendly_normalize" as an alternative if needs be...

The text was updated successfully, but these errors were encountered:

sporkmonger · 2009-12-21T07:20:19Z

Well, you're absolutely right that normalize should not be doing this. I'll fix it as soon as I have a few spare cycles.

sporkmonger · 2009-12-22T09:03:51Z

Bleh, so it turns out that this is actually a very difficult problem when you try to solve it in conjunction with multi-byte characters and unicode NFKC normalization. Might take me awhile to resolve unfortunately.

sporkmonger · 2009-12-24T22:25:40Z

Fixed in 48e0b24

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distinction between encoded and non-encoded path separator is normalized away #4

Distinction between encoded and non-encoded path separator is normalized away #4

mjwillson commented Dec 19, 2009

sporkmonger commented Dec 21, 2009

sporkmonger commented Dec 22, 2009

sporkmonger commented Dec 24, 2009

Distinction between encoded and non-encoded path separator is normalized away #4

Distinction between encoded and non-encoded path separator is normalized away #4

Comments

mjwillson commented Dec 19, 2009

sporkmonger commented Dec 21, 2009

sporkmonger commented Dec 22, 2009

sporkmonger commented Dec 24, 2009