Parser generates invalid URLs #379
Comments
I'd also consider excluding ` from the fragment percent-encode set, like it was excluded from the query in #17, so that the fragment percent-encode set is contained within the query percent-encode set.
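The containment proposed here can be checked mechanically. A minimal sketch, assuming (from my reading of the spec at the time, not from this thread) that the fragment percent-encode set adds space, ", <, > and ` to the C0 control percent-encode set, and that the query state encodes space, ", #, < and > on top of the same base. The names `fragmentExtras`, `queryExtras` and `difference` are illustrative, not spec identifiers:

```javascript
// ASCII additions on top of the C0 control percent-encode set.
// ASSUMPTION: these lists reflect my reading of the spec, not this thread.
const fragmentExtras = new Set([' ', '"', '<', '>', '`']);
const queryExtras = new Set([' ', '"', '#', '<', '>']);

// Returns the members of `a` that are missing from `b`.
function difference(a, b) {
  return [...a].filter((ch) => !b.has(ch));
}

// Code points that currently prevent "fragment set is a subset of query set".
const onlyInFragment = difference(fragmentExtras, queryExtras);
console.log(onlyInFragment);
```

Under these assumptions, the backtick is the only code point standing in the way of the containment.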
I don't think we're in a position to change these, unless implementations vary enough for a particular code point that there's some leeway. What we should maybe do is document more clearly when the parser doesn't "fix up" an invalid URL.
I think it would definitely be worth attempting to change these if at all possible, or else expanding the definition of valid URL.
I don't think we can change most of these due to compatibility, but if someone wants to have another attempt at doing the research I'm willing to assist. I don't think we should change the definition of what's valid either though. Validity in part helps signal problems you might face, including with legacy software. (Same as with HTML.)
It seems extraordinarily strange that we're not only giving developers the tools to create invalid URLs, but we're also encouraging other standards to produce invalid URLs and pass them on to the rest of the ecosystem (e.g., over HTTP). In that case I'd question why we're calling such URLs invalid at all. At that point they're just "URLs produced by all software that follows the URL Standard", and validity doesn't buy us much. I don't believe you can parse/serialize any string through the HTML parser and get invalid HTML, for example. (Maybe some edge cases exist, but these characters are hardly edge cases.)
Other standards? I don't see how this is different from #118.
Sure. Every standard that uses the URL parser on user input is currently producing invalid URLs, right? Including standards that then use the URL serializer and send the result to the network or elsewhere. #118 stemmed from suggestions that the parser should fail when given an invalid URL. This issue is a concern about the parser producing an invalid URL.
@domenic they can produce invalid URLs, sure. Just like the HTML parser can produce invalid HTML.
#118 was precisely about this. That the syntax (valid URLs) conflicts with the parser.
@annevk, I'd agree that this is a case of garbage in, garbage out if it only happened when parsing invalid URLs. However, it also happens when changing components of a URL object:
x = new URL('http://localhost')
x.search = 'a#{}'
x.href // "http://localhost/?a%23{}"
Also, the browsers don't follow the spec 100% here; for example, Chrome escapes
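To see which characters a given implementation lets through a setter, one can scan the serialized result for the code points listed in the issue description. A rough sketch only: `findUnencoded` and `SUSPECT` are hypothetical names, and what gets logged depends on the runtime's URL implementation:

```javascript
// The ASCII characters the issue lists as neither URL code points nor '%'.
const SUSPECT = ['[', '\\', ']', '^', '`', '{', '|', '}'];

// Hypothetical helper: returns the suspect characters present in a serialized URL.
function findUnencoded(href) {
  return SUSPECT.filter((ch) => href.includes(ch));
}

// Same steps as in the comment above, followed by the scan.
const x = new URL('http://localhost');
x.search = 'a#{}';
console.log(x.href, findUnencoded(x.href));
```

The same helper can be pointed at `pathname`, `hash`, `username` and so on to compare setters across browsers.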
OK, these are the results for Chrome, Firefox, Edge: https://docs.google.com/spreadsheets/d/1mSl2N2Wrc7ZdKy2ArhLg0t3EHI2DOdHaDyWMZQByHE4/edit?usp=sharing Note: I'm not sure if I've set spec behavior of I was testing with:
On Edge,
Looking at the results, I think that:
The only confusing one is:
@LEW21 I had another look.
I've added a number of test cases to test the distinct percent encode sets. The PR (for wpt) is here. Some observations,
Firefox and Chrome deviate from the spec in the following:
It's easy to make mistakes with this, so additional eyes would be good.
FWIW as part of my work I've found some requests using unencoded curly braces and double quotes in URIs' paths. I think it's an NVIDIA update tool or something. See that test in the Rust crate httparse: https://github.com/seanmonstar/httparse/blob/8bd42db76543dee9f7e172d3f14a6666530f220c/tests/uri.rs#L3682-L3693
https://url.spec.whatwg.org/commit-snapshots/a1b789c6b6c36fcdb16311da5abd177e84151fca/#url-parsing
This leads to the creation of invalid URLs: ones that contain [, \, ], ^, `, {, |, }, which are neither URL code points nor '%' and trigger validation errors:
I think that either:
Related issues: #378, #17
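For reference, the "neither URL code points nor '%'" check can be sketched as below. This follows my understanding of the URL Standard's definition of URL code points (ASCII alphanumerics, a fixed punctuation list, and U+00A0 to U+10FFFD minus surrogates and noncharacters); `isUrlCodePoint` and `invalidUnits` are hypothetical names, and the '%' handling is simplified because it does not verify that two hex digits follow:

```javascript
// Sketch of the URL code point definition (assumption: matches the spec text).
function isUrlCodePoint(cp) {
  if (cp < 0x80) {
    const ch = String.fromCodePoint(cp);
    return /[A-Za-z0-9]/.test(ch) || "!$&'()*+,-./:;=?@_~".includes(ch);
  }
  if (cp < 0xA0 || cp > 0x10FFFD) return false;
  if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogates
  if (cp >= 0xFDD0 && cp <= 0xFDEF) return false; // noncharacters
  if ((cp & 0xFFFE) === 0xFFFE) return false; // U+xFFFE and U+xFFFF
  return true;
}

// Returns the characters of a single URL component that are neither
// URL code points nor '%'. Simplified: '%' is accepted unconditionally.
function invalidUnits(component) {
  const out = [];
  for (const ch of component) {
    if (ch === '%') continue;
    if (!isUrlCodePoint(ch.codePointAt(0))) out.push(ch);
  }
  return out;
}

console.log(invalidUnits('a%23{}'));
```

It is meant to be run on one component at a time (for example the query without its leading `?`), since delimiters such as `#` are not URL code points themselves.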