URL encoding of CSS values #9301

annevk · 2023-09-04T08:28:55Z

A long time ago @zcorpan wrote a great set of tests to determine whether features used the URL parser with a non-UTF-8 encoding. As you might expect, there is no interoperability:

https://wpt.fyi/results/html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html%3Finclude%3Dcss

Everyone seems to agree that @import uses UTF-8, but CSS values depend on the stylesheet encoding in Chromium and WebKit. I wanted to double check with the WG that this is intentional and everyone is on board with aligning with the standard or if this is something that should be changed.

zcorpan · 2023-09-04T09:33:29Z

Using utf-8 as the URL encoding for all CSS features is what the test currently expects, and what Firefox does. I tried to find compat bugs but didn't find any.

Is it possible for webkit and chromium to try switching to utf-8?

annevk · 2023-09-04T14:33:49Z

If Chromium is willing to switch around the same time I think that would be ok for WebKit. I'm personally not too worried about compatibility fallout for this.

annevk · 2023-09-06T11:34:57Z

Forgot to copy @chrishtr on this issue. I suspect he can weigh in for Chromium and then maybe it can be removed from the agenda.

tabatkins · 2023-09-07T00:45:47Z

I don't see any particular reason to not align, only the possibility of compat issues. So long as those end up fine, let's please align with the rest of the platform.

foolip · 2023-09-07T13:36:10Z

What kind of compat issues could there plausibly be here? If I'm reading https://url.spec.whatwg.org/#concept-basic-url-parser correctly the encoding only matters in the query string, so is it cases like url(http://example.com/?text=Ångström) when percent-encoding isn't used in the CSS source?

annevk · 2023-09-07T16:52:08Z

Correct, coupled with the stylesheet not using UTF-8 as its encoding.
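
To make the case above concrete, here is a minimal sketch of how the same query value percent-encodes differently depending on the encoding used. It relies on Python's `urllib.parse.quote`, which is not the WHATWG URL parser and uses a different percent-encode set; the helper name `encode_query_value` is made up for illustration.

```python
from urllib.parse import quote


def encode_query_value(text: str, encoding: str) -> str:
    """Percent-encode `text` after converting it to bytes in `encoding`.

    Illustrative only: this approximates, but does not implement, the
    query percent-encoding step of the URL Standard.
    """
    return quote(text, safe="", encoding=encoding)


# Stylesheet-encoding behavior (here assumed to be windows-1252).
legacy = encode_query_value("Ångström", "windows-1252")
# Behavior when the encoding is always UTF-8.
utf8 = encode_query_value("Ångström", "utf-8")

print(legacy)  # %C5ngstr%F6m
print(utf8)    # %C3%85ngstr%C3%B6m
```

A server that expects one form and receives the other would see a different query string, which is the only plausible source of compat fallout here.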

foolip · 2023-09-08T15:22:46Z

I tried the change in https://chromium-review.googlesource.com/4846986 and the only test failures are in WPT. That's good.

With my Blink API owner hat on, to ship this change in Blink now I'd want to see some kind of compat analysis showing the risk is low, either by looking for such cases in httparchive, or via Chrome use counters.

However, if this behavior was already shipped in Firefox and Safari without hiccups I'd be convinced it is low enough risk to just try it.

annevk · 2023-09-08T20:01:24Z

WebKit is not compliant. I have a patch, but there are similar concerns around stability that would be resolved if Chromium shipped: WebKit/WebKit#17383. 😅

css-meeting-bot · 2023-09-14T08:21:27Z

The CSS Working Group just discussed URL encoding of CSS values, and agreed to the following:

RESOLVED: for all urls we encode them as utf-8 when they go down the network stack

The full IRC log of that discussion

<emilio> TabAtkins: basic question is what encodings are used by things in the platform to parse URLs
<emilio> ... there's very little interop
<emilio> ... but with the tests that show up right now it seems everybody uses utf-8 for @import
<emilio> ... but in a stylesheet it uses whatever the stylesheet encoding is
<emilio> ... annevk wants to make it explicit
<emilio> ... is everyone fine with that?
<miriam> q?
<miriam> ack TabAtkins
<emilio> ... zcorpan mentions that firefox using utf-8 for all URLs
<emilio> ... and seems to be fine with that
<emilio> ... nobody is concerned about compat fallout
<chris> I would prefer to use UTF-8 always
<chris> q+
<emilio> ... nobody really uses non-ascii-compatible stuff
<emilio> ... so really just a question of standardizing it
<miriam> ack chris
<emilio> ... two choices: all urls are utf-8, or @import utf-8 and sheet encoding for the rest
<fremy> q+
<emilio> chris: I'd prefer utf-8 everywhere
<emilio> emilio: +1 to that, seems better to be consistent
<miriam> ack fremy
<emilio> fremy: is that likely to create invalid files?
<myles> q+ to ask a silly question
<bramus> +1 to chris’s remark
<emilio> emilio: wdym as invalid?
<emilio> fremy: you may end up with content that is not parsable in the encoding
<emilio> TabAtkins: fix is using utf-8
<emilio> fremy: why do we bother to allow utf-8 if the file is in a different encoding
<emilio> TabAtkins: it's not when you parse the file, it's about how you feed it to the url parser
<miriam> ack myles
<Zakim> myles, you wanted to ask a silly question
<emilio> myles: is the proposal that a stylesheet is in an encoding, you find a url and switch encoding?
<emilio> TabAtkins: no, you parse as normal but you feed the url parser telling it that it's utf-8
<andreubotella> IIUC this is for url-encoded bytes (%20 and so on)
<miriam> ack dbaron
<emilio> dbaron: there is a very old backwards compat behavior which is that URLs carry around the encoding of the document that contained them
<emilio> ... so that when the network fetch happens you send the bytes to the server instead of a decoded version of them
<emilio> ... ideally that should only happen when this backwards compat hack is required
<emilio> ... and has been phased out generally to use utf-8 rather than that
<emilio> myles: so you get a file, decode decode decode, some stylesheet has a URL and you need to go back and send those bytes rather than the decoded stuff?
<emilio> dbaron: the new old way of doing it, not the old old way, is that you store the encoding of the thing in which you found the url along the url
<emilio> ... purpose of that is that the server gets the same bytes as the document
<emilio> ... which is a horrible hack to mimic the old old behavior
<emilio> ... which was where the web just carried bytes around
<emilio> myles: So the new old behavior is you round-trip (go bytes to encoding, and then when the request go back to bytes)
<emilio> dbaron: yeah, and there's a migration away from that where we just send utf-8 to the server
<emilio> ... not sure what the status of that migration is
<emilio> ... clearly there's a difference between @import and other urls here
<emilio> myles: so proposal at hand is you decode, see a url, then decode those bytes as utf-8?
<emilio> TabAtkins: dbaron explained it better, when we put the url in the network stack we just stop carrying that encoding and always put it as utf-8
<emilio> myles: perfect
<emilio> dbaron: hope my memory about this is right
<emilio> TabAtkins: proposal is for all urls we encode them as utf-8 when they go down the network stack in accordance with the url standard's recommendation
<emilio> RESOLVED: for all urls we encode them as utf-8 when they go down the network stack

foolip · 2023-09-14T11:37:27Z

@annevk sounds like we're both saying we'd be happy to ship if the other proves it safe by shipping first. Deadlock.

Another way forward is to do the kind of web compat analysis required by the Blink launch process. In this case, I think an analysis of CSS resources in http archive could be enough. Concretely, look for any CSS files served with a non-UTF8 encoding, and within those look for any non-ASCII bytes in the query component of any URL. Make a list of all of them, then randomly sample 50. For each, load the site including that CSS in Chrome, Firefox and Safari and see if there's any observable difference.

Is that something you'd be willing to do? I could do the blink-dev paperwork pointing to such an analysis and predict we'd try to ship it.
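
The scan described above could look roughly like the following sketch. It assumes the CSS has already been decoded to text, uses a simplified regular expression rather than a real CSS tokenizer, and the function name `urls_with_non_ascii_query` is hypothetical. It is not the query that was eventually run against httparchive.

```python
import re

# Simplified matcher for url(...) tokens, quoted or unquoted.
# A real CSS tokenizer handles escapes and comments; this does not.
URL_TOKEN = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def urls_with_non_ascii_query(css_text: str) -> list[str]:
    """Return url() values whose query component contains non-ASCII."""
    hits = []
    for match in URL_TOKEN.finditer(css_text):
        url = match.group(2)
        # Drop the fragment first, then take everything after the first "?".
        before_fragment = url.partition("#")[0]
        _, sep, query = before_fragment.partition("?")
        if sep and not query.isascii():
            hits.append(url)
    return hits


sample = (
    "a { background: url(http://example.com/?text=Ångström) } "
    'b { background: url("x.png?v=1") }'
)
print(urls_with_non_ascii_query(sample))
```

Only the first rule in `sample` is reported, because the second URL has an ASCII-only query.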

zcorpan · 2023-09-14T13:51:42Z

I can take a stab at writing a query.

tabatkins · 2023-10-23T20:11:28Z

Okay, I've dropped in some text about this.

Some legacy implementations preserved the original encoding of URLs
(as represented in the stylesheet)
and reproduced that encoding when communicating over the network.
UAs must not do this;
when it's necessary to re-encode a URL into bytes
to communicate over the network,
the URL must be encoded as UTF-8,
regardless of the original stylesheet encoding.

I think this sufficiently captures the context? Let me know if I can word anything better.

zcorpan · 2023-10-25T23:20:07Z

@tabatkins the URL parser will percent-encode the query, so the bytes that go over the network will be limited to the ASCII range. The issue is which encoding to use when percent-encoding the query string (always utf-8, or the encoding of the stylesheet or document). See https://url.spec.whatwg.org/#query-encoding-example

So maybe a bit clearer would be to say that for URLs in CSS, the URL parser's encoding argument must be omitted (i.e. use the default, UTF-8). https://url.spec.whatwg.org/#concept-url-parser

zcorpan · 2023-10-25T23:47:22Z

I've now queried httparchive. Getting the actual encoding is quite tricky, and so I gave up on that. I assume a big chunk is utf-8, but certainly not everything.

The dataset is from 2022-07-01 (same as 2022 Web Almanac). Total number of pages is 7,303,959.

Number of pages with non-ASCII in the query string in url() in CSS: 2231. So at most 0.03% of pages in the dataset (likely less since this includes utf-8 pages).

An example match is url('https://fonts.googleapis.com/css?family=Noto+Sans+JP:900&text=西部警察2020年12月29日発売決定！「大都会」シリーズ') (but the page for this uses utf-8).

(I excluded the cases where the first non-ASCII character in the query string is "â" because there were some that used non-ASCII quote marks and with encoding mismatch it became e.g. url(â€˜./fonts/Avenir-Next.eot?#iefixâ€™) – which is unlikely to work to begin with, and therefore not interesting to include.)
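
As a small illustration of where those "â" sequences come from, the sketch below encodes a curly quote as UTF-8 and then misreads the bytes as windows-1252. The assumption that the affected files were UTF-8 decoded as windows-1252 is mine, inferred from the pattern; the function name is hypothetical.

```python
def misdecode_utf8_as_windows_1252(text: str) -> str:
    """Encode `text` as UTF-8, then decode those bytes as windows-1252."""
    return text.encode("utf-8").decode("windows-1252")


print(misdecode_utf8_as_windows_1252("‘"))  # â€˜
print(misdecode_utf8_as_windows_1252("’"))  # â€™
```

Each curly quote is three bytes in UTF-8 starting with 0xE2, which windows-1252 reads as "â", hence the filter on that character.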

Full results: https://docs.google.com/spreadsheets/d/1i9Gvs1JIDo5mOw-rwPc5ppI6KrbXC8ol2XJ7h9PGAVs/edit?usp=sharing

tabatkins · 2023-10-27T20:22:49Z

Okay, rephrased. @zcorpan, does this wording look better?

(Also, thanks for the archive trawling, and further review in general! Looks pretty safe, then.)

annevk · 2023-11-03T23:19:55Z

Two things:

1. It's really that when you invoke the URL parser you don't use its encoding argument. All the percent-encoding happens underneath and CSS doesn't really directly access that (to my knowledge).
2. I landed a patch for this in WebKit so it'll get some more exposure: WebKit/WebKit@c195704.

zcorpan · 2023-11-06T14:28:07Z

As @annevk said, https://url.spec.whatwg.org/#concept-url-parser is the appropriate entry point.

for URLs in CSS, the URL parser's encoding argument must be omitted (i.e. use the default, UTF-8)

tabatkins · 2023-11-06T20:03:04Z

Yeah, after discussing with @annevk out-of-thread, I realized that the URL design is, for historical reasons, really weird. I had assumed that URLs stored their value normally, as codepoints, and then percent-encoded when you asked them to be serialized; instead, they store ASCII only, and percent-encode any non-ASCII (or ASCII-but-weird) codepoints during parsing according to a specified encoding.

Fix incoming.

tabatkins · 2023-11-06T20:08:22Z

Okay, I'm finally confident I've gotten this right. Thanks for the wording suggestion, @zcorpan, I basically used that.

zcorpan · 2023-11-06T20:33:47Z

LGTM :)

fantasai · 2023-11-23T00:02:03Z

@zcorpan Sorry to bug you again, but I kind of insisted we rewrite the paragraph so that non-URL-spec-editors-and-peers can understand what's happening. :) New text is:

When interpreting URLs expressed in CSS, the URL parser’s encoding argument must be omitted (i.e. use the default, UTF-8), regardless of the stylesheet encoding.

Note: In other words, a URL written in CSS will always percent-encode non-ASCII codepoints using UTF-8 in the URL object (and thus whenever using the URL value for e.g. network requests), regardless of the stylesheet’s own encoding. Note that this occurs after decoding the stylesheet into Unicode code points.

(The core sentence hasn't changed, we just added a bunch of explanatory text, which is hopefully correct.)
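
The order of operations in the note can be sketched as follows: decode the stylesheet bytes using the stylesheet's encoding, then percent-encode the query using UTF-8. This is only an approximation built on Python's `urllib.parse.quote`, not the URL Standard's parser, and the function name `query_as_requested` and the naive string slicing are assumptions made for brevity.

```python
from urllib.parse import quote

CSS_SOURCE = "a { background: url(http://example.com/?text=Ångström) }"


def query_as_requested(stylesheet_bytes: bytes, stylesheet_encoding: str) -> str:
    """Return the percent-encoded query of the single url() in the sheet."""
    # Step 1: decode the stylesheet into code points using its own encoding.
    css_text = stylesheet_bytes.decode(stylesheet_encoding)
    # Step 2: pull out the query (naive slicing, fine for this fixed sample).
    query = css_text.split("?", 1)[1].split(")", 1)[0]
    # Step 3: percent-encode non-ASCII code points using UTF-8,
    # ignoring the stylesheet encoding entirely.
    return quote(query, safe="=&", encoding="utf-8")


for enc in ("windows-1252", "utf-8"):
    print(enc, query_as_requested(CSS_SOURCE.encode(enc), enc))
# Both lines end in: text=%C3%85ngstr%C3%B6m
```

The stylesheet encoding affects only step 1, so two sheets with the same decoded text produce the same URL.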

annevk · 2023-11-23T12:10:21Z

@fantasai it's correct, but it really only applies to the query portion of the URL. The remainder was already using UTF-8. And all of this is also the recommended default, so I'm not sure it's worth calling out as if it's something special, but that's up to you. (As in, I'd have a note when you do supply that argument.)

zcorpan · 2023-11-23T21:14:52Z

Yes it looks correct. It's the recommended default but it's different from URLs in HTML links, so it can still be surprising if one isn't already familiar with these details.

annevk added css-values-3 Agenda+ css-values-4 Current Work css-values-5 labels Sep 4, 2023

This was referenced Sep 4, 2023

Correct URL encoding of CSS WebKit/WebKit#17383

Merged

/html/infrastructure/urls/resolving-urls/query-encoding/* are disabled in Mozilla and Chromium web-platform-tests/wpt#4934

Open

astearns added this to Unslotted in TPAC 2023 agenda Sep 7, 2023

astearns moved this from Unslotted to Thursday Morning in TPAC 2023 agenda Sep 7, 2023

css-meeting-bot removed the Agenda+ label Sep 14, 2023

mozilla-apprentice mentioned this issue Sep 14, 2023

URL encoding of CSS values mozilla/wg-decisions#1249

Open

mirisuzanne added the Needs Edits label Sep 14, 2023

chromium-helper mentioned this issue Sep 14, 2023

URL encoding of CSS values chromium-helper/csswg-resolutions#232

Closed

tabatkins added a commit that referenced this issue Oct 23, 2023

[css-values-4] Per WG resolution, URLs are always utf-8 when used over the network. #9301

6089e24

tabatkins closed this as completed Oct 23, 2023

tabatkins added Closed Accepted by CSSWG Resolution and removed Needs Edits labels Oct 23, 2023

zcorpan reopened this Oct 25, 2023

tabatkins added a commit that referenced this issue Oct 27, 2023

[css-values-4] Rephrase the utf-8 URL-encoding requirement. #9301

80972b9

tabatkins added a commit that referenced this issue Nov 6, 2023

[css-values-4] One more go at fixing the URL parsing encoding. #9301

4da2655

tabatkins closed this as completed Nov 6, 2023

fantasai added a commit that referenced this issue Nov 23, 2023

[css-values-4] Make the URL percent-encoding statement actually make sense to mere mortals. #9301

2da1902

fantasai added a commit that referenced this issue Nov 23, 2023

[css-values-4] Make into a note #9301

b373bb8
