Validating and escaping mismatch in pathname #112

Closed
nurse opened this Issue Apr 4, 2016 · 2 comments

nurse commented Apr 4, 2016

https://url.spec.whatwg.org/#path-state says path segments are handled as follows, which means characters other than URL code points are invalid:

If c is not a URL code point and not "%", syntax violation.

The URL code points are ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E0000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

A syntax violation indicates a non-fatal mismatch between input and syntax requirements. User agents, especially conformance checkers are encouraged to report them somewhere.

But it also says that only characters in the default encode set should be escaped:

Otherwise, UTF-8 percent encode c using the default encode set, and append the result to buffer.

The simple encode set are C0 controls and all code points greater than U+007E.
The default encode set is the simple encode set and code points U+0020, '"', "#", "<", ">", "?", "`", "{", and "}".

This means "[", "]", "^", and "|" are invalid in a URL path but aren't escaped.
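To double-check that list, here is a rough Python sketch that walks the ASCII range and collects every character that is neither a URL code point nor in the default encode set, using only the definitions quoted above. The function names are mine, not the spec's, and it only looks at ASCII. It also reports "\", which I left out of the list above; as I understand it, the path state treats "\" as a separator for special URLs before this step is reached, so it is a separate case.

```python
import string

# ASCII subset of the URL code points, as quoted from the spec above.
URL_CODE_POINTS = set(string.ascii_letters + string.digits + "!$&'()*+,-./:;=?@_~")


def in_simple_encode_set(ch):
    """C0 controls and all code points greater than U+007E."""
    cp = ord(ch)
    return cp <= 0x1F or cp > 0x7E


def in_default_encode_set(ch):
    """Simple encode set plus U+0020, '"', "#", "<", ">", "?", "`", "{", "}"."""
    return in_simple_encode_set(ch) or ch in ' "#<>?`{}'


def invalid_but_unescaped():
    """ASCII characters that raise a syntax violation yet pass through unescaped.

    "%" is skipped because the path state checks it separately.
    """
    return [
        chr(cp)
        for cp in range(0x80)
        if chr(cp) != "%"
        and chr(cp) not in URL_CODE_POINTS
        and not in_default_encode_set(chr(cp))
    ]


print(invalid_but_unescaped())  # ['[', '\\', ']', '^', '|']
```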

Chrome 49.0.2623.87 and Safari 9.1 (11601.5.17.1) don't escape "[" and "]" but do escape "^" and "|".
Firefox 39.0 escapes all of them, but 45.0 escapes only "^" (it sends "[", "]", and "|" as is).

Given those results, I think "[" and "]" should be added to the URL code points, but I'm unsure about "^" and "|".

Member

annevk commented Apr 4, 2016

Okay, so there's an intentional difference between what is conforming and what a user agent is supposed to do. At the moment "what is conforming" matches the IETF RFCs this specification replaces and "what a user agent is supposed to do" roughly matches what implementations are doing (hopefully they will all align over time).

Maybe the time has arrived when we can start considering expanding "what is conforming".

Member

annevk commented Dec 19, 2016

I considered this again and I think that for now we best leave the current definition of syntax in place. Once the parser is more stable (implemented by most browsers and conformance checkers) we can perhaps start looking at the conformance criteria again.

@annevk annevk closed this Dec 19, 2016
