Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider percent-encoding more characters in "fragment state" #344

Closed
mikewest opened this issue Sep 15, 2017 · 33 comments
Closed

Consider percent-encoding more characters in "fragment state" #344

mikewest opened this issue Sep 15, 2017 · 33 comments

Comments

@mikewest
Copy link
Member

@mikewest mikewest commented Sep 15, 2017

Currently, https://url.spec.whatwg.org/#fragment-state requires that we percent-encode characters in a URL's fragment using the C0 control percent-encode set. Firefox currently encodes more than that, which I think actually makes a good deal of sense.

How would y'all feel about shifting fragment state encoding to match query state? That is, encoding "less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("), 0x23 (#), 0x3C (<), or 0x3E (>)"?

@annevk
Copy link
Member

@annevk annevk commented Sep 15, 2017

@annevk
Copy link
Member

@annevk annevk commented Sep 15, 2017

@valenting
Copy link

@valenting valenting commented Sep 15, 2017

How would y'all feel about shifting fragment state encoding to match query state? That is, encoding "less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("), 0x23 (#), 0x3C (<), or 0x3E (>)"?

That sounds good. It would be nice to converge onto something clear.
Firefox also escapes ` in the fragment, but I don't know if we want to keep it that way.

@jasnell
Copy link
Collaborator

@jasnell jasnell commented Sep 15, 2017

I'd be good with this. (ping @TimothyGu)

@TimothyGu
Copy link
Member

@TimothyGu TimothyGu commented Sep 15, 2017

Sounds good to me.

@achristensen07
Copy link
Collaborator

@achristensen07 achristensen07 commented Sep 15, 2017

What is the motivation for this change? I wouldn't oppose encoding more characters, but I would oppose using the query encoding set as a model.

@mikewest
Copy link
Member Author

@mikewest mikewest commented Sep 18, 2017

@achristensen07: @annevk linked to some related conversations that discuss motivations (@bzbarsky's discussion of <> in #291 is compelling, for instance). I'm hopping on this due to some marginal risk of code injection when developers reflect url.hash into their pages: that risk would be mitigated if we were a little more strict about encoding these characters.

Based on some spot-checking, Firefox encodes at least ", `, <, and > in addition to everything above 0x7E, which is an indication that we can tighten things up here while maintaining compatibility with the web.

Do you have a set of characters you'd prefer to use instead of the query encoding set? As long as the set includes < and >, I'll be happy. :)

@annevk
Copy link
Member

@annevk annevk commented Sep 18, 2017

@rmisev
Copy link
Member

@rmisev rmisev commented Sep 18, 2017

Do you have a set of characters you'd prefer to use instead of the query encoding set? As long as the set includes < and >, I'll be happy. :)

I'll prefer to encode "less than 0x21 (!)" and 0x22 (") as well:

Code point Why encode?
0x20 (SP) See #150 (comment) and https://bugzilla.mozilla.org/show_bug.cgi?id=1148861
0x22 (") For a safe copy/paste into <a href="http://example.com#with%22quote"> vs <a href="http://example.com#with"quote">

@mikewest
Copy link
Member Author

@mikewest mikewest commented Sep 18, 2017

@achristensen07
Copy link
Collaborator

@achristensen07 achristensen07 commented Sep 18, 2017

I’m fine with using the path set for the path and fragment.

@mikewest
Copy link
Member Author

@mikewest mikewest commented Sep 19, 2017

Any objections to the path set? If not, I'll put together a PR and tests.

@rmisev
Copy link
Member

@rmisev rmisev commented Sep 19, 2017

It seems a bit stricter set than in RFC 3986, which allows U+003F (?) in non-encoded form.
See https://tools.ietf.org/html/rfc3986#section-3.5

@bzbarsky
Copy link

@bzbarsky bzbarsky commented Sep 19, 2017

Yes, I'm pretty worried about '?' there in compat terms, esp given it being explicitly called out as allowed in the RFC. Actually somewhat worried about '#' and '{' and '}' too, if no one escapes those right now... How many people pass JSON in fragments, for example?

The actual set of chars the RFC allows in fragments, just for reference, is:

  fragment    = *( pchar / "/" / "?" )

where pchar is the set of characters allowed in path segments. So basically, "everything allowed in paths, plus '?'". Which is definitely more restrictive than what UAs do.

@mikewest
Copy link
Member Author

@mikewest mikewest commented Sep 20, 2017

@bzbarsky: Firefox encodes " right now, which would probably make it difficult to put valid JSON in fragments. Is there a subset of characters that you'd be more comfortable encoding? Are you happy with Firefox's current behavior?

@bzbarsky
Copy link

@bzbarsky bzbarsky commented Sep 20, 2017

which would probably make it difficult to put valid JSON in fragments.

Well, without escaping the ", right. I guess at that point you're unescaping, so even if the { and } were escaped they'd get unescaped. I did check that encode/decodeURIComponent escape/unescape those bits.

As for comfort... Basically, I'm comfortable encoding anything that at least one major browser encodes. I'm a little worried about encoding things that none of them encode. I'm very worried about ?, given that it's long been explicitly specced as OK in fragments.

@mikewest
Copy link
Member Author

@mikewest mikewest commented Sep 21, 2017

Basically, I'm comfortable encoding anything that at least one major browser encodes.

Ok. Would coalescing around Firefox's current encoding make sense? As noted above, that includes at least ", `, <, and > in addition to everything above 0x7E (maybe you could point me to the right spot? My searchfox foo is poor. :( ).

@rmisev
Copy link
Member

@rmisev rmisev commented Sep 26, 2017

Yes. I tested the Firefox Nightly: it percent encodes: 0x20 (SP) and below, 0x22 ("), 0x3C (<), 0x3E (>), 0x60 (`), and code points above 0x7E.

I think it is reasonable to standardize this percent encoding set.

My testing code.

FYI: Safari only encodes code points below 0x20 and above 0x7E; Chrome - code points below 0x20; Edge does not encode at all.

@bzbarsky
Copy link

@bzbarsky bzbarsky commented Oct 3, 2017

Ok. Would coalescing around Firefox's current encoding make sense?

I don't know, honestly. At some point it would be good to involve someone from Firefox's networking team here. Maybe @mcmanus knows who would be a good contact for that.

maybe you could point me to the right spot?

I believe the right spot is as of today http://searchfox.org/mozilla-central/rev/a4702203522745baff21e519035b6c946b7d710d/xpcom/io/nsEscape.cpp#259-271

That table has, for each ASCII char, a bitmask indicating where it does NOT need to be escaped. The bit for fragments is 1<<9, so they get escaped for all the entries in the table that are < 512 (hey at least this one is easy to skim for...).

That looks like 0x00-0x20 (control chars and space), 0x22 ("), 0x25 (%), 0x3C (<), 0x3E (>), 0x60 (`) and anything above 0x7E. Which I guess is what @rmisev got via his testing, good.

@mcmanus
Copy link

@mcmanus mcmanus commented Oct 3, 2017

@valenting is the goto here and he seems ok with this.

@mikewest
Copy link
Member Author

@mikewest mikewest commented Oct 9, 2017

Uploaded a PR at #347 and tests at web-platform-tests/wpt#7641. Feedback welcome!

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

@bzbarsky also mentioned U+0025 (%), though @rmisev did not. Did anyone look into that difference?

@rmisev
Copy link
Member

@rmisev rmisev commented Oct 10, 2017

There is a comment about the U+0025 (%): http://searchfox.org/mozilla-central/rev/a4702203522745baff21e519035b6c946b7d710d/xpcom/io/nsEscape.cpp#344-345

// Also the % will not be escaped until forced
// See bugzilla bug 61269 for details why we changed this

I guess it answers your question.

@valenting
Copy link

@valenting valenting commented Oct 10, 2017

I'll file a bug to stop escaping % (in the fragment).
I'm pretty sure we don't need that anymore.

@reschke
Copy link

@reschke reschke commented Oct 10, 2017

If you don't escape "%", how do you reliably remove percent-encoding????

This is a violation of the rule "you have to escape the escape character", no?

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

I don't think you can reliably remove it for all URLs, in practice. I'm pretty sure I looked into that and it falls down for the query due to compatibility and maybe also for the path.

@reschke
Copy link

@reschke reschke commented Oct 10, 2017

So what is code that does percent decoding supposed to do with:

%2525

?

The answer is clear for RFC 3986. If it's not clear for this spec this hints at a problem of this spec, no?

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

@reschke percent-decoding isn't something that needs to be interoperable per se. In any event, we're getting off-topic. It would be best to discuss this in a separate issue. An example of where this becomes problematic would also help.

@reschke
Copy link

@reschke reschke commented Oct 10, 2017

I'm not going to explain why removing of escaping needs to predictable in general, sorry.

For browser, the relevant question would be where to navigate to when somebody navigates to

http://example.com/#x%2525

and the HTML document contains elements with the IDs "x%25" and "x%2525".

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

It would go to the latter per https://html.spec.whatwg.org/#the-indicated-part-of-the-document. Only if that's not there would it go the former. Pretty sure that matches implementations.

@reschke
Copy link

@reschke reschke commented Oct 10, 2017

OK, so this might "work" for HTML. But not every fragment identifier is resolved in the context of HTML, not even in browsers.

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

I'm not going to explain why removing of escaping needs to predictable in general, sorry.

It needs to be predictable where it can be observable, but, e.g., whether a server errors on /% or allows it doesn't matter much.

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

FWIW, thanks to @reschke's comments I had a closer look and discovered a whole bunch of issues around fragments: whatwg/html#3111. None really about the percent decoding aspect itself (other than applying it to strings rather than bytes), but more about encodings.

@annevk annevk closed this in #347 Dec 5, 2017
annevk added a commit that referenced this issue Dec 5, 2017
Currently, we percent-encode characters in "fragment state" using the C0
control percent-encode set. Firefox encodes more than that, and it seems
reasonable to align around that behavior for reasons spelled out in #291
and the comments of #344.

This patch adds a new "fragment percent-encode set" which contains the
C0 control percent-encode set, along with:

* 0x20 (SP)
* 0x22 (")
* 0x3C (<)
* 0x3E (>)
* 0x60 (`)

Tests: web-platform-tests/wpt#7776.

Closes #344.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
10 participants