Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinction between encoded and non-encoded path separator is normalized away #4

Closed
mjwillson opened this issue Dec 19, 2009 · 3 comments

Comments

@mjwillson
Copy link

The problem is that 'normalize' is un-encoding percent-encoded separator characters in the URI path. So eg

irb(main):035:0> Addressable::URI.parse("/search/foo%2Fbar") == Addressable::URI.parse("/search/foo/bar")
=> true

These are two different URLs; one has 2 path components, the other has 3. Most web app routing code would treat them differently - quite important when you're doing routes for search queries for example :)

In terms of the spec, only 'unreserved' characters can be percent-encoded or decoded without changing the semantics of the URL. If one of the reserved chars (gen-delims or sub-delims) has been encoded, you have to assume that it might have been encoded for a reason, to distinguish it from the use of that character as a separator.

So I think 'normalize' should only normalize the encoding of unreserved chars, and leave reserved ones untouched.

There were some docs saying it was designed to be user-friendly rather than strictly spec-compliant, so I guess that might be why, although even then - applications rely on encoding distinctions for separator characters in URI paths, so if the user copy/pastes a URL with these distinctions, it needs to get through to the web app untouched.

It also seems like .normalize is used under the hood in a lot of places in the codebase (eg in URI#==), so I'd prefer the default implementation of it to be spec-compliant, with a "user_friendly_normalize" as an alternative if needs be...

@sporkmonger
Copy link
Owner

Well, you're absolutely right that normalize should not be doing this. I'll fix it as soon as I have a few spare cycles.

@sporkmonger
Copy link
Owner

Bleh, so it turns out that this is actually a very difficult problem when you try to solve it in conjunction with multi-byte characters and unicode NFKC normalization. Might take me awhile to resolve unfortunately.

@sporkmonger
Copy link
Owner

Fixed in 48e0b24

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants