Parser generates invalid URLs #379

Open
LEW21 opened this issue Apr 10, 2018 · 16 comments
Labels
topic: parser topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing)

Comments

@LEW21

LEW21 commented Apr 10, 2018

https://url.spec.whatwg.org/commit-snapshots/a1b789c6b6c36fcdb16311da5abd177e84151fca/#url-parsing

For each byte in buffer:​

If byte is less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("), 0x23 (#), 0x3C (<), or 0x3E (>), append byte, percent encoded, to url’s query.

Otherwise, append a code point whose value is byte to url’s query.

This leads to creation of invalid URLs - ones that contain [, \, ], ^, `, {, |, }, which are neither URL code points nor '%' and trigger validation errors:

Otherwise:

If c is not a URL code point and not U+0025 (%), validation error.

I think that either:

  • the list of valid query characters should be expanded to include more characters, or
  • these ones should be escaped too.
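The mismatch can be reproduced mechanically. The following is a minimal sketch, assuming the query-state rule quoted above and only the ASCII portion of the URL code point definition (alphanumerics plus `!$&'()*+,-./:;=?@_~`). The function names are invented for illustration and are not spec terms.

```javascript
// Rule quoted above (query state): a byte is percent-encoded if it is
// < 0x21, > 0x7E, or one of 0x22 ("), 0x23 (#), 0x3C (<), 0x3E (>).
function queryStateEncodesByte(byte) {
  return byte < 0x21 || byte > 0x7e || [0x22, 0x23, 0x3c, 0x3e].includes(byte);
}

// ASCII URL code points: alphanumerics plus a fixed set of punctuation.
const ASCII_URL_CODE_POINTS = /^[A-Za-z0-9!$&'()*+,\-.\/:;=?@_~]$/;

// Collect ASCII characters that the rule passes through unencoded even though
// they are neither URL code points nor "%".
function leakedInvalidQueryChars() {
  let leaked = '';
  for (let byte = 0; byte <= 0x7f; byte++) {
    const ch = String.fromCharCode(byte);
    if (!queryStateEncodesByte(byte) && ch !== '%' && !ASCII_URL_CODE_POINTS.test(ch)) {
      leaked += ch;
    }
  }
  return leaked;
}

console.log(leakedInvalidQueryChars()); // [\]^`{|}
```

The result is exactly the list of characters named above.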

Related issues: #378, #17

LEW21 changed the title from: "For each byte in buffer: If byte is less than ..." to: query state: "For each byte in buffer: If byte is less than ..." (Apr 10, 2018)
LEW21 changed the title from: query state: "For each byte in buffer: If byte is less than ..." to: query state: Parser generates invalid URLs (Apr 10, 2018)
@LEW21
Author

LEW21 commented Apr 10, 2018

Actually, there is a similar problem for other URL parts, too.

For example:

  • generated path may contain [, \, ], ^, | - which are not valid there;
  • generated fragment may contain the above and #, {, } - which are not valid there;
  • if a state override is given, a valid path may contain ? (which will then be escaped)
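The path and fragment lists above can be derived the same way. This sketch assumes the ASCII parts of the percent-encode sets were defined as written in the comments below at the time of the snapshot; that is an assumption worth checking against the snapshot itself, and the helper names are hypothetical.

```javascript
// Assumed definitions (ASCII part only), not copied from the snapshot:
//   fragment set = C0 controls, space, ", <, >, `
//   path set     = fragment set plus #, ?, {, }
const FRAGMENT_SET = new Set([' ', '"', '<', '>', '`']);
const PATH_SET = new Set([...FRAGMENT_SET, '#', '?', '{', '}']);

const isAsciiUrlCodePoint = (ch) => /^[A-Za-z0-9!$&'()*+,\-.\/:;=?@_~]$/.test(ch);

// Printable ASCII that the given set leaves unencoded even though it is
// neither a URL code point nor "%".
function leakedChars(encodeSet) {
  let out = '';
  for (let cp = 0x21; cp <= 0x7e; cp++) {
    const ch = String.fromCharCode(cp);
    if (!encodeSet.has(ch) && ch !== '%' && !isAsciiUrlCodePoint(ch)) out += ch;
  }
  return out;
}

console.log('path:', leakedChars(PATH_SET));         // path: [\]^|
console.log('fragment:', leakedChars(FRAGMENT_SET)); // fragment: #[\]^{|}
```

Note that this only looks at the encode sets. It ignores the separate handling of \ in the path of special URLs.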

@LEW21
Author

LEW21 commented Apr 10, 2018

I'd also consider excluding ` from the fragment percent-encode set, like it was excluded from query in #17, so that fragment percent-encode set is contained within query percent-encode set.

@annevk
Member

annevk commented Apr 11, 2018

I don't think we're in a position to change these, unless implementations vary enough for a particular code point that there's some leeway.

What we should maybe do is document more clearly when the parser doesn't "fix up" an invalid URL.

@domenic
Member

domenic commented Apr 11, 2018

I think it would definitely be worth attempting to change these if at all possible, or else expanding the definition of valid URL.

@annevk
Member

annevk commented Apr 11, 2018

I don't think we can change most of these due to compatibility, but if someone wants to have another attempt at doing the research I'm willing to assist.

I don't think we should change the definition of what's valid either, though. Validity in part helps signal problems you might face, including with legacy software. (Same as with HTML.)

@domenic
Member

domenic commented Apr 11, 2018

It seems extraordinarily strange that we're not only giving developers the tools to create invalid URLs, but we're also encouraging other standards to produce invalid URLs and pass them on to the rest of the ecosystem (e.g., over HTTP). In that case I'd question why we're calling such URLs invalid at all. At that point they're just "URLs produced by all software that follows the URL Standard", and validity doesn't buy us much.

I don't believe you can parse/serialize any string through the HTML parser and get invalid HTML, for example. (Maybe some edge cases exist, but these characters are hardly edge cases.)

@annevk
Member

annevk commented Apr 11, 2018

Other standards? I don't see how this is different from #118.

@domenic
Member

domenic commented Apr 11, 2018

Sure. Every standard that uses the URL parser on user input is currently producing invalid URLs, right? Including standards that then use the URL serializer and send the result to the network or elsewhere.

#118 stemmed from suggestions that the parser should fail when given an invalid URL. This issue is a concern about the parser producing an invalid URL.

@annevk
Member

annevk commented Apr 11, 2018

@domenic they can produce invalid URLs, sure. Just like the HTML parser can produce invalid HTML.

@annevk
Member

annevk commented Apr 11, 2018

#118 was precisely about this. That the syntax (valid URLs) conflicts with the parser.

@LEW21
Author

LEW21 commented Apr 11, 2018

@annevk, I'd agree that this is a case of Garbage-In-Garbage-Out if it only happened when parsing invalid URLs. However, it also happens when changing components of a URL object:

const x = new URL('http://localhost');
x.search = 'a#{}'; // valid starting URL, component changed through the setter
x.href; // "http://localhost/?a%23{}"
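To make the "invalid" part concrete, here is a simplified sketch of the validity check quoted at the top of the thread ("not a URL code point and not U+0025 (%)"). The function names are hypothetical, and the sketch does not check that each % is followed by two hex digits, which full validation would also do.

```javascript
// True if cp is a URL code point: selected ASCII, or U+00A0..U+10FFFD
// excluding surrogates and noncharacters.
function isUrlCodePoint(cp) {
  if (cp < 0x80) {
    return /^[A-Za-z0-9!$&'()*+,\-.\/:;=?@_~]$/.test(String.fromCodePoint(cp));
  }
  if (cp < 0xa0 || cp > 0x10fffd) return false;
  if (cp >= 0xd800 && cp <= 0xdfff) return false; // surrogates
  if (cp >= 0xfdd0 && cp <= 0xfdef) return false; // noncharacter block
  return (cp & 0xfffe) !== 0xfffe;                // U+xFFFE and U+xFFFF
}

// Code points in a serialized component that would trigger a validation error.
function invalidCodePoints(component) {
  const bad = [];
  for (const ch of component) {
    if (ch !== '%' && !isUrlCodePoint(ch.codePointAt(0))) bad.push(ch);
  }
  return bad;
}

console.log(invalidCodePoints('a%23{}')); // [ '{', '}' ]
```

Applied to the query produced by the setter above, the check flags { and }, even though the starting URL was valid.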

Also, the browsers don't 100% follow the spec here - for example Chrome escapes ^ and | in paths, while the standard says it should not. I'll do more tests to check how browsers behave on all chars in all places.

@LEW21
Author

LEW21 commented Apr 11, 2018

OK, these are the results for Chrome, Firefox, Edge: https://docs.google.com/spreadsheets/d/1mSl2N2Wrc7ZdKy2ArhLg0t3EHI2DOdHaDyWMZQByHE4/edit?usp=sharing


Note: I'm not sure if I've set spec behavior of \ correctly.

I was testing with:

const U = () => new URL('http://localhost');
const s = 'x "#%<>?[\\]^`{|}$';
let u;
u = U(); u.pathname = s; u.href; // path
u = U(); u.search = s; u.href;   // query
u = U(); u.hash = s; u.href;     // fragment

On Edge, u.pathname = s and u.search = s were throwing errors, so I had to check these character-by-character with loops:

for (let c of s) {try { u = U(); u.pathname = 'x' + c + 'x'; console.log(c, u.href); } catch (e) { console.log(c, e) }}
for (let c of s) {try { u = U(); u.search   = 'x' + c + 'x'; console.log(c, u.href); } catch (e) { console.log(c, e) }}
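The per-character loops above can be folded into one helper that also covers the fragment. This is only a sketch of the same test procedure. `probeSetter` is a hypothetical name, and the results depend entirely on the engine it runs in, so no expected encodings are shown.

```javascript
// Hypothetical helper: set one component to 'x' + c + 'x' for every c in chars
// and record the resulting href, or the error name if the setter throws.
function probeSetter(component, chars, base = 'http://localhost') {
  const results = new Map();
  for (const c of chars) {
    try {
      const url = new URL(base);
      url[component] = 'x' + c + 'x';
      results.set(c, url.href);
    } catch (e) {
      results.set(c, `threw ${e.name}`);
    }
  }
  return results;
}

// Same character list as in the manual test.
const probeChars = 'x "#%<>?[\\]^`{|}$';
for (const component of ['pathname', 'search', 'hash']) {
  console.log(component, probeSetter(component, probeChars));
}
```

Collecting the output into a Map per component makes it easier to diff results between browsers than reading console lines.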

@LEW21
Author

LEW21 commented Apr 11, 2018

Looking at the results, I think that:

  • ^ should be added to the path percent-encode set
  • ` should be moved from the fragment percent-encode set to the path percent-encode set
  • path, query and fragment should treat [, ] and | as valid
  • query and fragment should treat ?, \, ^, `, { and } as valid
  • fragment should treat # as valid

The only confusing one is:

  • \ in paths - no idea what to do here, I don't really understand the backslash-magic in the spec

LEW21 changed the title from: query state: Parser generates invalid URLs to: Parser generates invalid URLs (Apr 11, 2018)
@domenic domenic added topic: parser topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing) labels Mar 4, 2020
@annevk
Member

annevk commented May 4, 2020

@LEW21 I had another look.

  • Encoding ^ for paths is still a reasonable suggestion as Chrome and Firefox both do encode it, but since Safari matches the specification and nobody else made an attempt it seems somewhat unfair to them to change this now and I'm not sure it's much better. @achristensen07 thoughts?
  • What is encoded in fragments changed in 7a3c69f and both Chrome and Firefox seem to align with that. Safari does not (yet) encode ` however.
  • \ in paths means / for special URLs per the specification and in most implementations.
  • I don't think we should change what is valid as per earlier statements.
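The backslash point is easy to see with a special (http) URL. A small sketch, assuming an implementation that follows the specification here (the variable name is arbitrary):

```javascript
// The string literal below is http://localhost/a\b\c once the JavaScript
// escapes are resolved.
const backslashUrl = new URL('http://localhost/a\\b\\c');

// For special schemes the parser treats "\" like "/", so the backslashes
// never reach the serialized path.
console.log(backslashUrl.pathname); // "/a/b/c"
console.log(backslashUrl.href);     // "http://localhost/a/b/c"
```

This is why \ does not show up as an unencoded character in the path of special URLs: it is rewritten as a path separator, not passed through.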

@alwinb
Contributor

alwinb commented Feb 23, 2021

I've added a number of test cases to test the distinct percent encode sets. The PR (for wpt) is here.

Some observations:

  • Safari complies in all but the fragment cases. It does not encode additional characters in the fragment.
  • It seems that Firefox and Chrome do not encode additional characters in the username, password, path, query and fragment of non-special URLs.

Firefox and Chrome deviate from the spec in the following:

  • Firefox and Chrome also encode ^ in the path of special URLs. Chrome also encodes |.
  • Firefox and Chrome also encode ' and in the username and password (of special URLs) and Firefox also encodes ..
  • However Firefox does not encode | in the password, in contrast with the spec.

It's easy to make mistakes with this, so additional eyes would be good.

@nox
Member

nox commented Jun 18, 2022

FWIW as part of my work I've found some requests using unencoded curly braces and double quotes in URIs' paths. I think it's an NVIDIA update tool or something.

See that test in the Rust crate httparse: https://github.com/seanmonstar/httparse/blob/8bd42db76543dee9f7e172d3f14a6666530f220c/tests/uri.rs#L3682-L3693


5 participants