Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reparsing problem with non-special URL and double-dot path component #415

Closed
zealousidealroll opened this issue Sep 6, 2018 · 9 comments · Fixed by #505
Closed

Reparsing problem with non-special URL and double-dot path component #415

zealousidealroll opened this issue Sep 6, 2018 · 9 comments · Fixed by #505
Labels
interop Implementations are not interoperable with each other needs tests Moving the issue forward requires someone to write tests topic: parser

Comments

@zealousidealroll
Copy link

zealousidealroll commented Sep 6, 2018

When the URL a:/a/..//a is parsed, the resulting URL object has no hostname, and has the path //a

c = 'a'
state = scheme start state
buffer = ""
pointer = 0


c = ':'
state = scheme state
buffer = "a"
pointer = 1


scheme = "a"
buffer = ""
state = path or authority state
pointer = 2


pointer = 3
c = 'a'
state = path state
pointer = 2


state = path state
pointer = 3
c = 'a'
buffer = "a"


state = path state
pointer = 4
c = '/'
path = [ "a" ]
buffer = ""


state = path state
pointer = 5
c = '.'
path = [ "a" ]
buffer = "."


state = path state
pointer = 6
c = '.'
path = [ "a" ]
buffer = ".."


state = path state
pointer = 7
c = '/'
path = [""]
buffer = ""


state = path state
pointer = 8
c = '/'
path = ["", ""]
buffer = ""


state = path state
pointer = 9
c = 'a'
path = ["", ""]
buffer = "a"


state = path state
pointer = 10
c = EOF
path = ["", "", "a"]
buffer = ""

When the resulting URL is serialized, it gets serialized as a://a, which, if it gets reparsed, gets an empty path and a hostname a.

Found while trying to find a spec-compliant resolution for servo/rust-url#459

This isn't a problem for special URLs, which always have a host.

@GPHemsley
Copy link
Member

What should the correct behavior be?

I'm thinking this should serialize to a:///a? (I haven't tested whether that reparses correctly.)

@zealousidealroll
Copy link
Author

zealousidealroll commented Sep 6, 2018

According to https://quuz.org/url/liveview.html#a:///a , a:///a and a:/a parse to the same thing (a URL with protocol a: and path /a) so I would think that it ought to get serialized to a:////a where it'll get the //a path (consistent with its original parse).

Though I'm not sure if the "empty host" that a:////a gets and the "null host" that a:/a/..//a are actually the same thing.

@zealousidealroll
Copy link
Author

Looking at other implementations:

  • Firefox leaves the path empty (which seems very wrong)
  • Edge thinks a: is a drive letter
    • If I change the scheme to am:, it changes the path to /.//a, basically adding a disambiguating . to the front.

@annevk
Copy link
Member

annevk commented Sep 7, 2018

For https://jsdom.github.io/whatwg-url/#url=YTovYS8uLi8vYQ==&base=YWJvdXQ6Ymxhbms= Safari behaves per spec (and has a reparsing issue), Firefox seems to fail to parse the URL, and Chrome gives /a/..//a as path, which is weird, but basically means they don't have relative URL handling for this scheme.

I guess when we decide to use relative URL handling we should really set the host to a non-null value.

@annevk annevk added topic: parser needs tests Moving the issue forward requires someone to write tests interop Implementations are not interoperable with each other labels Sep 7, 2018
@domenic
Copy link
Member

domenic commented Sep 7, 2018

Firefox seems to fail to parse the URL

Interesting aside: I don't think it fails to parse. As far as I can tell it's parsing it as having protocol a: but some kind of hidden "scheme data" of /a/..//a. Parsing failure always results in a protocol of just :.

Weirder and weirder, new URL("a:/a/..//ab") in Firefox has a pathname of /a/..//a. It's only <a> elements that have the hidden "scheme data".

zealousidealroll pushed a commit to zealousidealroll/rust-url that referenced this issue Dec 11, 2018
This is a willfull violation of the URL specification for serialization.
It retains spec-compliance for parsing, it aligns with the behavior
of Microsoft Edge, and it "fixes" an acknowledged specification bug.

See whatwg/url#415

Fixes servo#397
zealousidealroll pushed a commit to zealousidealroll/rust-url that referenced this issue Dec 11, 2018
This is a willfull violation of the URL specification for serialization.
It retains spec-compliance for parsing, it aligns with the behavior
of Microsoft Edge, and it "fixes" an acknowledged specification bug.

See whatwg/url#415

Fixes servo#397
zealousidealroll added a commit to zealousidealroll/rust-url that referenced this issue Dec 11, 2018
This is a willfull violation of the URL specification for serialization.
It retains spec-compliance for parsing, it aligns with the behavior
of Microsoft Edge, and it "fixes" an acknowledged specification bug.

See whatwg/url#415

Fixes servo#397
@annevk
Copy link
Member

annevk commented May 9, 2020

If I'm reading https://tools.ietf.org/html/rfc3986#section-5.2.2 correctly that has this problem as well. It forbids a scheme followed by an absolute path from taking the form scheme://path as that's ambiguous as it repeatedly points out (elsewhere in the document), but then scheme:/doesnotmatter/..//path is fine and would also yield this state.

https://tools.ietf.org/html/rfc3986#section-5.3 does not account for it either.

I suggested changing the host above, but upon re-reading this issue I rather like what Edge did here as it is isolated to serialization and preserves intent better.

annevk added a commit that referenced this issue May 9, 2020
Adjust the URL serializer so it does not output something which during the next parse operation would yield a host.

whatwg-url: ...

Tests: ...

Fixes #415.
@annevk
Copy link
Member

annevk commented May 13, 2020

@valenting @achristensen07 this is worth fixing, right? It's only for non-special schemes, but it seems those should be free of idempotency issues too.

annevk added a commit that referenced this issue Jun 24, 2020
Adjust the URL serializer so it does not output something which during the next parse operation would yield a host.

whatwg-url: ...

Tests: ...

Fixes #415.
rmisev added a commit to rmisev/web-platform-tests that referenced this issue Aug 19, 2020
@annevk
Copy link
Member

annevk commented Aug 21, 2020

@valenting @achristensen07 ping.

@valenting
Copy link
Collaborator

@valenting @achristensen07 this is worth fixing, right? It's only for non-special schemes, but it seems those should be free of idempotency issues too.

I think it makes sense that non-special URLs should also be idempotent. 👍

annevk pushed a commit to web-platform-tests/wpt that referenced this issue Aug 24, 2020
annevk added a commit that referenced this issue Aug 24, 2020
Adjust the URL serializer so it does not output something which during the next parse operation would yield a host.

whatwg-url: jsdom/whatwg-url#148.

Tests: web-platform-tests/wpt#25113.

Fixes #415.
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Aug 27, 2020
…testonly

Automatic update from web-platform-tests
Test non-special URLs are idempotent

See whatwg/url#415 and
whatwg/url#505 for context.

--

wpt-commits: 551c9d604fb8b97d3f8c65793bb047d15baddbc2
wpt-pr: 25113
ambroff pushed a commit to ambroff/gecko that referenced this issue Nov 4, 2020
…testonly

Automatic update from web-platform-tests
Test non-special URLs are idempotent

See whatwg/url#415 and
whatwg/url#505 for context.

--

wpt-commits: 551c9d604fb8b97d3f8c65793bb047d15baddbc2
wpt-pr: 25113
ryanhaddad pushed a commit to WebKit/WebKit that referenced this issue Dec 22, 2020
https://bugs.webkit.org/show_bug.cgi?id=215762

Reviewed by Tim Horton.

LayoutTests/imported/w3c:

* web-platform-tests/url/a-element-expected.txt:
* web-platform-tests/url/a-element-xhtml-expected.txt:
* web-platform-tests/url/url-constructor-expected.txt:
* web-platform-tests/url/url-setters-expected.txt:

Source/WTF:

whatwg/url#505 added an interesting edge case to the URL serialization:
"If url’s host is null, url’s path’s size is greater than 1, and url’s path[0] is the empty string, then append U+002F (/) followed by U+002E (.) to output."
The problem was that URLs like "a:/a/..//a" would be parsed into "a://a" with a pathname of "//a" and an empty host.  If "a://a" was then reparsed, it would again have an href of "a://a"
but its host would be "a" and it would have an empty path.  There is consensus that URL parsing should be idempotent, so we need to do something different here.
According to whatwg/url#415 (comment) this follows what Edge did (and then subsequently abandoned when they switched to Chromium)
to make URL parsing idempotent by adding "/." before the path in the edge case of a URL with a non-special scheme (not http, https, wss, etc.) and a null host and a non-empty path that
has an empty first segment.  All the members of the URL remain unchanged except the full serialization (href).  This is not important in practice, but important in theory.

Our URL parser tries very hard to use the exact same WTF::String object given as input if it can.  However, this step is better implemented as a post-processing step that will almost never happen
because otherwise we would have to parse the entire path twice to find out if we need to add "./" or if the "./" that may have already been there needs to stay.  This is illustrated with the test URL
"t:/.//p/../../../..//x" which does need the "./".

In the common case, this adds one well-predicted branch to URL parsing, so I expect performance to be unaffected.  Since this is such a rare edge case of URLs, I expect no compatibility problems.

* wtf/URL.cpp:
(WTF::URL::pathStart const):
* wtf/URL.h:
(WTF::URL::pathStart const): Deleted.
* wtf/URLParser.cpp:
(WTF::URLParser::copyURLPartsUntil):
(WTF::URLParser::URLParser):
(WTF::URLParser::needsNonSpecialDotSlash const):
(WTF::URLParser::addNonSpecialDotSlash):
* wtf/URLParser.h:

Tools:

* TestWebKitAPI/Tests/WTF/URLParser.cpp:
(TestWebKitAPI::TEST_F):



Canonical link: https://commits.webkit.org/229956@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@267837 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interop Implementations are not interoperable with each other needs tests Moving the issue forward requires someone to write tests topic: parser
Development

Successfully merging a pull request may close this issue.

5 participants