Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use percent decode on strings #3111

Merged
merged 1 commit into from May 23, 2018
Merged

Do not use percent decode on strings #3111

merged 1 commit into from May 23, 2018

Conversation

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

Also stop using "UTF-8 decode without BOM or fail" and instead use "UTF-8 decode without BOM" for fragment identifiers.

Tests: ...


/browsing-the-web.html ( diff )
/infrastructure.html ( diff )

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

Firefox and Safari use "UTF-8 decode without BOM" here. It seems that Edge and Chrome do not. I tested with <div id=&#xFFFD; and a fragment of #%C2 (which is illegal UTF-8). I think it's better to align with Firefox and Safari as that behavior is more consistent and potentially allows for the removal of "UTF-8 decode without BOM or fail" down the line.

cc @tkent-google

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

Note that we regressed defining what happens upon failure in c230f55. We could revert to that behavior instead, terminating the algorithm when decode returns failure, but I would prefer not using the "or fail" variant at all per above.

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

While creating tests I discovered this only happens if the document encoding is UTF-8. It seems like fragments might be tied to the document encoding too somehow?

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

It also seems that none of the hooks defined in https://encoding.spec.whatwg.org/#specification-hooks work here, as even #%EF%BB%BF%C2 does not force UTF-8.

@hsivonen thoughts on what our options are here and what we should do?

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

Note that I'm not sure what browsers do here makes a whole lot of sense, given that when the URL is parsed they'll translate an input in a fragment such as U+00FF to its UTF-8 encoded equivalent. Meaning that if you have #&#xFF; in a URL and id=&#xFF; elsewhere, it'll break if the document is not UTF-8 encoded.

@domenic
Copy link
Member

@domenic domenic commented Oct 10, 2017

Sounds like there's still some work to do here, so tagging accordingly. Also happy if you want to split up the quick fixes (e.g. using the correct string-accepting algorithms) from the normative fixes.

@annevk
Copy link
Member Author

@annevk annevk commented Oct 11, 2017

Per my analysis in my last comment I think it would be best if we could merge this as-is and get implementations to align. I think not using UTF-8 decode here would more likely result in breakage of legacy pages. Granted, they might have been broken for quite a while now.

It would be nice to hear at least one implementation agree though before filing bugs.

@domenic
Copy link
Member

@domenic domenic commented Oct 11, 2017

OK, well, at least there will be tests so we can see how bad the mismatch with implementations is, right?

I'm a little confused how changing things to differ from how implementations work now could cause breakage of legacy pages, and it seems like the kind of thing we might need compat analysis for.

annevk added a commit to web-platform-tests/wpt that referenced this issue Dec 18, 2017
@hsivonen
Copy link
Member

@hsivonen hsivonen commented Dec 18, 2017

I don't know what the appropriately reasonable and Web-compatible definition is.

As for what Gecko does, the name attribute value undergoes special treatment before ending up in the DOM, which violates the usual mapping of HTML source to the DOM. (Gecko bug.)

@annevk
Copy link
Member Author

@annevk annevk commented Dec 18, 2017

After further discussion with @hsivonen on IRC it turns out Firefox also exposes a "decode without BOM handling" hook (not present in the Encoding Standard) used exclusively for this (as far as web-exposed exposure goes).

It seems however that other browsers are somewhat inconsistent here. I cannot browse to an U+00FF ID using %FF on a windows-1252 page in Edge for instance. I think we could attempt to restrict this to UTF-8.

(The reason why I think that will make more legacy pages work is that the URL parser now UTF-8 encodes. It used to use the local encoding. However, decoding still uses the local encoding, which likely broke legacy pages. If they haven't been updated, switching to decoding with UTF-8 will make them work again.)

@annevk
Copy link
Member Author

@annevk annevk commented Jan 9, 2018

@tkent-google any thoughts on this? You didn't cover this #2902 so it'd be good to know what you think of exclusively using UTF-8.

@tkent-google
Copy link
Collaborator

@tkent-google tkent-google commented Jan 11, 2018

It seems Google Chrome doesn't support non-UTF8 percent decoding for scroll-to-the-fragment, and I couldn't find any bug reports about non-UTF8 encoding support in our bug database. So I don't oppose exclusively using UTF-8.

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 12, 2018
@annevk
Copy link
Member Author

@annevk annevk commented Jan 12, 2018

@tkent-google it doesn't seem that Chrome just runs UTF-8 decode though and calls it a day, per my tests in web-platform-tests/wpt#8723. Is there anything else going on?

@tkent-google
Copy link
Collaborator

@tkent-google tkent-google commented Jan 15, 2018

I think test results indicates Chrome runs UTF-8 decode for percent encoded fragments.

  • fragment-and-encoding.html
    location.hash = "%EF%BB%BF"; works because Chrome decodes it as UTF-8 even with charset=windows-1252
    location.hash = "%FF"; doesn't work because it's not valid UTF-8.

  • framgent-and-encoding-2.html
    location.hash = "%C2"; doesn't work because it's not valid UTF-8.

@whatwg whatwg deleted a comment from bonebizz21 Jan 15, 2018
@annevk
Copy link
Member Author

@annevk annevk commented Jan 15, 2018

@tkent-google ah, so if you cannot decode you try to do a literal match? E.g., it would end up matching id=%FF? I was thinking you would look for U+FFFD (which is the result of decoding that byte with error handling).

@tkent-google
Copy link
Collaborator

@tkent-google tkent-google commented Jan 16, 2018

if you cannot decode you try to do a literal match?

No. Literal matching is tried before percent decoding. It matched to the current algorithm of "determine what the indicated part of the document"

Correction of the previous comment:
I'm sorry that my previous comment wasn't correct.
I found location.hash = "%FF"; worked with Chrome. It matches to id=&#xFF. I reviewed the implementation, and the current algorithm of percent-decoding is:

  1. UTF-8 decode without BOM or fail
  2. If it failed, decode it as ISO-8859-1.

https://cs.chromium.org/chromium/src/url/url_util.cc?type=cs&q=DecodeURLEscapeSequences&sq=package:chromium&l=787

@annevk
Copy link
Member Author

@annevk annevk commented Jan 16, 2018

The way that seems to work though is that anything that was already UTF-8 decoded is kept. I don't think we want to standardize on that. That is, for %EF%BB%BF%FF you end up with U+FEFF U+00FF in Chrome, which is really weird.

Falling back to https://infra.spec.whatwg.org/#isomorphic-decode when UTF-8 decode fails seems reasonable, but we should do it across the entire input string.

@annevk
Copy link
Member Author

@annevk annevk commented Jan 17, 2018

@tkent-google would such a change be acceptable?

@tkent-google
Copy link
Collaborator

@tkent-google tkent-google commented Jan 18, 2018

Falling back to https://infra.spec.whatwg.org/#isomorphic-decode when UTF-8 decode fails seems reasonable, but we should do it across the entire input string.

It's acceptable. Other browser vendors might have different opinions.

I just added counters to Chrome for mixed decoding case, isomorphic decoding case, and so on. http://crbug.com/802988

MXEBot pushed a commit to mirror/chromium that referenced this issue Jan 18, 2018
- url_util.*
  url::DecodeURLEscapeSequences() returns what encodings are applied
  on decoding.

- KURL.*
  Add optional DecodeURLResult argument to blink::
  DecodeURLEscapeSequences().

- web_feature.mojom, LocalFrameView.cpp, and enums.xml
  Add UseCounters.

These UseCounters are for the discussion in
whatwg/html#3111 .

Bug: 802988
Change-Id: Ie171212a2ca97ee5dc0e5bb4eeb98463865e5ab3
Reviewed-on: https://chromium-review.googlesource.com/869696
Reviewed-by: Mike West <mkwst@chromium.org>
Commit-Queue: Kent Tamura <tkent@chromium.org>
Cr-Commit-Position: refs/heads/master@{#530004}
@annevk
Copy link
Member Author

@annevk annevk commented Jan 23, 2018

Thanks for adding the ScrollToFragment* counters to https://www.chromestatus.com/metrics/feature/timeline/popularity! I guess we'll revisit this issue in several months to see what options we have.

@tkent-google
Copy link
Collaborator

@tkent-google tkent-google commented May 16, 2018

I just added counters to Chrome for mixed decoding case, isomorphic decoding case, and so on. http://crbug.com/802988

Currently data/graphs in chromestatus.com are broken. Data from the internal source:

ScrollToFragmentRequested: 30.78%
ScrollToFragmentSucceedWithRaw: 6.490%
ScrollToFragmentSucceedWithASCII: 0.009423%
ScrollToFragmentSucceedWithUTF8: 0.01023%
ScrollToFragmentSucceedWithIsomorphic: 0.001333%
ScrollToFragmentSucceedWithMixed: 0.001326%
ScrollToFragmentFailWithASCII: 27.95%
ScrollToFragmentFailWithUTF8: 1.425%
ScrollToFragmentFailWithIsomorphic: 0.2417%
ScrollToFragmentFailWithMixed: 0.2415%

Note:

  • These numbers are ratio of page views.
  • Our current code tries to find a raw fragment string without percent-decoding first.
    It succeeds in 6.490% page views.
  • ScrollToFragment{Succeed,Fail}With{ASCII,UTF8,Isomorphic,Mixed} mean how we treated percent-decode result. ASCII means percent-decode result doesn't contain non-ASCII code. UTF8 means the result contain non-ASCII code and decode as UTF-8 successfully.

Summary:

  • We don't need to support isomorphic decoding, and of course chrome-specific mixed decoding.
  • We don't know how many pages need scroll-to-fragment with non-UTF8 page encodings. It should be less than 1.425% + 0.2417% + 0.2415%. However the failure rate of scroll-to-fragment is very high even with ASCII, and we can't guess it.

Also stop using "UTF-8 decode without BOM or fail" and instead use "UTF-8 decode without BOM" for fragment identifiers.

Tests: web-platform-tests/wpt#8723.
@annevk annevk force-pushed the annevk/percent-decode branch from 298a597 to a53dfde May 22, 2018
@annevk
Copy link
Member Author

@annevk annevk commented May 22, 2018

Okay, I think that means we should proceed with this PR as-is. I rebased it so OP now includes links to readable previews/diffs and the commit message links to the tests.

annevk added a commit to web-platform-tests/wpt that referenced this issue May 22, 2018
@annevk
Copy link
Member Author

@annevk annevk commented May 22, 2018

@domenic do you want to review this?

Copy link
Member

@domenic domenic left a comment

Editorially LGTM. I've gotten a little lost on what the tests/browser bugs situation is here, but I trust you'll take care of it.

@annevk
Copy link
Member Author

@annevk annevk commented May 23, 2018

Thanks. @tkent-google, could you review web-platform-tests/wpt#8723?

This will require subtle changes to all browsers. I file bugs once all the bits are reviewed.

tkent-google added a commit to web-platform-tests/wpt that referenced this issue May 23, 2018
* Navigation fragment decode and encodings

See whatwg/html#3111 for context.

* make tests more usable

* address review feedback
@annevk annevk merged commit ce8404f into master May 23, 2018
2 checks passed
@annevk annevk deleted the annevk/percent-decode branch May 23, 2018
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Jun 5, 2018
… a=testonly

Automatic update from web-platform-testsNavigation fragment decode and encodings (#8723)

* Navigation fragment decode and encodings

See whatwg/html#3111 for context.

* make tests more usable

* address review feedback

--

wpt-commits: 5b878a1e5de29aa4e68c48e0122878f983f036ff
wpt-pr: 8723
nctl144 pushed a commit to scrapy/url-chromium that referenced this issue Jul 12, 2018
- url_util.*
  url::DecodeURLEscapeSequences() returns what encodings are applied
  on decoding.

- KURL.*
  Add optional DecodeURLResult argument to blink::
  DecodeURLEscapeSequences().

- web_feature.mojom, LocalFrameView.cpp, and enums.xml
  Add UseCounters.

These UseCounters are for the discussion in
whatwg/html#3111 .

Bug: 802988
Change-Id: Ie171212a2ca97ee5dc0e5bb4eeb98463865e5ab3
Reviewed-on: https://chromium-review.googlesource.com/869696
Reviewed-by: Mike West <mkwst@chromium.org>
Commit-Queue: Kent Tamura <tkent@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#530004}
Cr-Mirrored-From: https://chromium.googlesource.com/chromium/src
Cr-Mirrored-Commit: dadd77a1df99121e213ab3f418fe0dfcbea6bf91
robin-raymond pushed a commit to webrtc-uwp/zzz-obsolete.chromium-tools that referenced this issue Sep 20, 2018
- url_util.*
  url::DecodeURLEscapeSequences() returns what encodings are applied
  on decoding.

- KURL.*
  Add optional DecodeURLResult argument to blink::
  DecodeURLEscapeSequences().

- web_feature.mojom, LocalFrameView.cpp, and enums.xml
  Add UseCounters.

These UseCounters are for the discussion in
whatwg/html#3111 .

Bug: 802988
Change-Id: Ie171212a2ca97ee5dc0e5bb4eeb98463865e5ab3
Reviewed-on: https://chromium-review.googlesource.com/869696
Reviewed-by: Mike West <mkwst@chromium.org>
Commit-Queue: Kent Tamura <tkent@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#530004}
Cr-Mirrored-From: https://chromium.googlesource.com/chromium/src
Cr-Mirrored-Commit: dadd77a1df99121e213ab3f418fe0dfcbea6bf91
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue Oct 3, 2019
… a=testonly

Automatic update from web-platform-testsNavigation fragment decode and encodings (#8723)

* Navigation fragment decode and encodings

See whatwg/html#3111 for context.

* make tests more usable

* address review feedback

--

wpt-commits: 5b878a1e5de29aa4e68c48e0122878f983f036ff
wpt-pr: 8723

UltraBlame original commit: 455d53e15aff62c2234c19995a0e2c5f7b5127b6
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Oct 3, 2019
… a=testonly

Automatic update from web-platform-testsNavigation fragment decode and encodings (#8723)

* Navigation fragment decode and encodings

See whatwg/html#3111 for context.

* make tests more usable

* address review feedback

--

wpt-commits: 5b878a1e5de29aa4e68c48e0122878f983f036ff
wpt-pr: 8723

UltraBlame original commit: 455d53e15aff62c2234c19995a0e2c5f7b5127b6
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Oct 3, 2019
… a=testonly

Automatic update from web-platform-testsNavigation fragment decode and encodings (#8723)

* Navigation fragment decode and encodings

See whatwg/html#3111 for context.

* make tests more usable

* address review feedback

--

wpt-commits: 5b878a1e5de29aa4e68c48e0122878f983f036ff
wpt-pr: 8723

UltraBlame original commit: 455d53e15aff62c2234c19995a0e2c5f7b5127b6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants