Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL encoding of CSS values #9301

Closed
annevk opened this issue Sep 4, 2023 · 23 comments
Closed

URL encoding of CSS values #9301

annevk opened this issue Sep 4, 2023 · 23 comments

Comments

@annevk
Copy link
Member

annevk commented Sep 4, 2023

A long time ago @zcorpan wrote a great set of tests to determine whether features used the URL parser with a non-UTF-8 encoding. As you might expect, there is no interoperability:

https://wpt.fyi/results/html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html%3Finclude%3Dcss

Everyone seems to agree that @import uses UTF-8, but CSS values depend on the stylesheet encoding in Chromium and WebKit. I wanted to double check with the WG that this is intentional and everyone is on board with aligning with the standard or if this is something that should be changed.

@zcorpan
Copy link
Member

zcorpan commented Sep 4, 2023

Using utf-8 as the URL encoding for all CSS features is what the test currently expects, and what Firefox does. I tried to find compat bugs but didn't find any.

Is it possible for webkit and chromium to try switching to utf-8?

@annevk
Copy link
Member Author

annevk commented Sep 4, 2023

If Chromium is willing to switch around the same time I think that would be ok for WebKit. I'm personally not too worried about compatibility fallout for this.

@annevk
Copy link
Member Author

annevk commented Sep 6, 2023

Forgot to copy @chrishtr on this issue. I suspect he can weigh in for Chromium and then maybe it can be removed from the agenda.

@tabatkins
Copy link
Member

I don't see any particular reason to not align, only the possibility of compat issues. So long as those end up fine, let's please align with the rest of the platform.

@astearns astearns added this to Unslotted in TPAC 2023 agenda Sep 7, 2023
@astearns astearns moved this from Unslotted to Thursday Morning in TPAC 2023 agenda Sep 7, 2023
@foolip
Copy link
Member

foolip commented Sep 7, 2023

What kind of compat issues could there plausibly be here? If I'm reading https://url.spec.whatwg.org/#concept-basic-url-parser correctly the encoding only matters in the query string, so is it cases like url(http://example.com/?text=Ångström) when percent-encoding isn't used in the CSS source?

@annevk
Copy link
Member Author

annevk commented Sep 7, 2023

Correct, coupled with the stylesheet not using UTF-8 as its encoding.

@foolip
Copy link
Member

foolip commented Sep 8, 2023

I tried the change in https://chromium-review.googlesource.com/4846986 and the only test failures are in WPT. That's good.

With my Blink API owner hat on, to ship this change in Blink now I'd want to see some kind of compat analysis showing the risk is low, either by looking for such cases in httparchive, or via Chrome use counters.

However, if this behavior was already shipped in Firefox and Safari without hiccups I'd be convinced is low enough risk to just try it.

@annevk
Copy link
Member Author

annevk commented Sep 8, 2023

WebKit is not compliant. I have a patch, but there are similar concerns around stability that would be resolved if Chromium shipped: WebKit/WebKit#17383. 😅

@css-meeting-bot
Copy link
Member

The CSS Working Group just discussed URL encoding of CSS values, and agreed to the following:

  • RESOLVED: for all urls we encode them as utf-8 when they go down the network stack
The full IRC log of that discussion <emilio> TabAtkins: basic question is what encodings are used by things in the platform to parse URLs
<emilio> ... there's very little interop
<emilio> ... but with the tests that show up right now it seems everybody uses utf-8 for @import
<emilio> ... but in a stylesheet it uses whatever the stylesheet encoding is
<emilio> ... annevk wants to make it explicit
<emilio> ... is everyone fine with that?
<miriam> q?
<miriam> ack TabAtkins
<emilio> ... zcorpan mentions that firefox using utf-8 for all URLs
<emilio> ... and seems to be fine with that
<emilio> ... nobody is concerned about compat fallout
<chris> I would prefer to use UTF-8 always
<chris> q+
<emilio> ... nobody really uses non-ascii-compatible stuff
<emilio> ... so really just a question of standardizing it
<miriam> ack chris
<emilio> ... two choices: all urls are utf-8, or @import utf-8 and sheet encoding for the rest
<fremy> q+
<emilio> chris: I'd prefer utf-8 everywhere
<emilio> emilio: +1 to that, seems better to be consistent
<miriam> ack fremy
<emilio> fremy: is that likely to create invalid files?
<myles> q+ to ask a silly question
<bramus> +1 to chris’s remark
<emilio> emilio: wdym as invalid?
<emilio> fremy: you may end up with content that is not parsable in the encoding
<emilio> TabAtkins: fix is using utf-8
<emilio> fremy: why do we bother to allow utf-8 if the file is in a different encoding
<emilio> TabAtkins: it's not when you parse the file, it's about how you feed it to the url parser
<miriam> ack myles
<Zakim> myles, you wanted to ask a silly question
<emilio> myles: is the proposal that a stylesheet is in an encoding, you find a url and switch encoding?
<emilio> TabAtkins: no, you parse as normal but you feed the url parser telling it that it's utf-8
<andreubotella> IIUC this is for url-encoded bytes (%20 and so on)
<miriam> ack dbaron
<emilio> dbaron: there is a very old backwards compat behavior which is that URLs carry around the encoding of the document that contained them
<emilio> ... so that when the network fetch happens you send the bytes to the server instead of a decoded version of them
<emilio> ... ideally that should only happen when this backwards compat hack is required
<emilio> ... and has been phased out generally to use utf-8 rather than that
<emilio> myles: so you get a file, decode decode decode, some stylesheet has a URL and you need to go back and send those bytes rather than the decoded stuff?
<emilio> dbaron: the new old way of doing it, not the old old way, is that you store the encoding of the thing in which you found the url along the url
<emilio> ... purpose of that is that the server gets the same bytes as the document
<emilio> ... which is a horrible hack to mimic the old old behavior
<emilio> ... which was where the web just carried bytes around
<emilio> myles: So the new old behavior is you round-trip (go bytes to encoding, and then when the request go back to bytes)
<emilio> dbaron: yeah, and there's a migration away from that where we just send utf-8 to the server
<emilio> ... not sure what the status of that migration is
<emilio> ... clearly there's a difference between @import and other urls here
<emilio> myles: so proposal at hand is you decode, see a url, then decode those bytes as utf-8?
<emilio> TabAtkins: dbaron explained it better, when we put the url in the network stack we just stop carrying that encoding and always put it as utf-8
<emilio> myles: perfect
<emilio> dbaron: hope my memory about this is right
<emilio> TabAtkins: proposal is for all urls we encode them as utf-8 when they go down the network stack in accordance with the url standard's recomendation
<emilio> RESOLVED: for all urls we encode them as utf-8 when they go down the network stack

@foolip
Copy link
Member

foolip commented Sep 14, 2023

@annevk sounds like we're both saying we'd be happy to ship if the other proves it safe by shipping first. Deadlock.

Another way forward is to do the kind of web compat analysis required by the Blink launch process. In this case, I think an analysis of CSS resources in http archive could be enough. Concretely, look for any CSS files served with a non-UTF8 encoding, and within those look for any non-ASCII bytes in the query component of any URL. Make a list of all of them, then randomly sample 50. For each, load the site including that CSS in Chrome, Firefox and Safari and see if there's any observable difference.

Is that something you'd be willing to do? I could do the blink-dev paperwork pointing to such an analysis and predict we'd try to ship it.

@zcorpan
Copy link
Member

zcorpan commented Sep 14, 2023

I can take a stab at writing a query.

@tabatkins
Copy link
Member

Okay, I've dropped in some text about this.

Some legacy implementations preserved the original encoding of URLs
(as represented in the stylesheet)
and reproduced that encoding when communicating over the network.
UAs must not do this;
when it's necessary to re-encode a URL into bytes
to communicate over the network,
the URL must be encoded as UTF-8,
regardless of the original stylesheet encoding.

I think this sufficiently captures the context? Let me know if I can word anything better.

@zcorpan
Copy link
Member

zcorpan commented Oct 25, 2023

@tabatkins the URL parser will percent-encode the query, so the bytes that go over the network will be limited to the ASCII range. The issue is which encoding to use when percent-encoding the query string (always utf-8, or the encoding of the stylesheet or document). See https://url.spec.whatwg.org/#query-encoding-example

So maybe a bit clearer would be to say that for URLs in CSS, the URL parser's encoding argument must be omitted (i.e. use the default, UTF-8). https://url.spec.whatwg.org/#concept-url-parser

@zcorpan zcorpan reopened this Oct 25, 2023
@zcorpan
Copy link
Member

zcorpan commented Oct 25, 2023

I've now queried httparchive. Getting the actual encoding is quite tricky, and so I gave up on that. I assume a big chunk is utf-8, but certainly not everything.

The dataset is from 2022-07-01 (same as 2022 Web Almanac). Total number of pages is 7,303,959.

Number of pages with non-ASCII in the query string in url() in CSS: 2231. So at most 0.03% of pages in the dataset (likely less since this includes utf-8 pages).

An example match is url('https://fonts.googleapis.com/css?family=Noto+Sans+JP:900&text=西部警察2020年12月29日 発売決定!「大都会」シリーズ') (but the page for this uses utf-8).

(I excluded the cases where the first non-ASCII character in the query string is "â" because there were some that used non-ASCII quote marks and with encoding mismatch it became e.g. url(‘./fonts/Avenir-Next.eot?#iefix’) – which is unlikely to work to begin with, and therefore not interesting to include.)

Full results: https://docs.google.com/spreadsheets/d/1i9Gvs1JIDo5mOw-rwPc5ppI6KrbXC8ol2XJ7h9PGAVs/edit?usp=sharing

@tabatkins
Copy link
Member

Okay, rephrased. @zcorpan, does this wording look better?

(Also, thanks for the archive trawling, and further review in general! Looks pretty safe, then.)

@annevk
Copy link
Member Author

annevk commented Nov 3, 2023

Two things:

  1. It's really that when you invoke the URL parser you don't use its encoding argument. All the percent-encoding happens underneath and CSS doesn't really directly access that (to my knowledge).
  2. I landed a patch for this in WebKit so it'll get some more exposure: WebKit/WebKit@c195704.

@zcorpan
Copy link
Member

zcorpan commented Nov 6, 2023

As @annevk said, https://url.spec.whatwg.org/#concept-url-parser is the appropriate entry point.

for URLs in CSS, the URL parser's encoding argument must be omitted (i.e. use the default, UTF-8)

@tabatkins
Copy link
Member

Yeah, after discussing with @annevk out-of-thread, I realized that the URL design is, for historical reasons, really weird. I had assumed that URLs stored their value normally, as codepoints, and then percent-encoded when you asked them to be serialized; instead, they store ASCII only, and percent-encode any non-ASCII (or ASCII-but-weird) codepoints during parsing according to a specified encoding.

Fix incoming.

@tabatkins
Copy link
Member

Okay, I'm finally confident I've gotten this right. Thanks for the wording suggestion, @zcorpan, I basically used that.

@zcorpan
Copy link
Member

zcorpan commented Nov 6, 2023

LGTM :)

fantasai added a commit that referenced this issue Nov 23, 2023
@fantasai
Copy link
Collaborator

fantasai commented Nov 23, 2023

@zcorpan Sorry to bug you again, but, I kindof insisted we re-write the paragraph so that it can be understandable what's happening to non-URL-spec-editors-and-peers. :) New text is:

When interpreting URLs expressed in CSS, the URL parser’s encoding argument must be omitted (i.e. use the default, UTF-8), regardless of the stylesheet encoding.

Note: In other words, a URL written in CSS will always percent-encode non-ASCII codepoints using UTF-8 in the URL object (and thus whenever using the URL value for e.g. network requests), regardless of the stylesheet’s own encoding. Note that this occurs after decoding the stylesheet into Unicode code points.

(The core sentence hasn't changed, we just added a bunch of explanatory text, which is hopefully correct.)

fantasai added a commit that referenced this issue Nov 23, 2023
@annevk
Copy link
Member Author

annevk commented Nov 23, 2023

@fantasai it's correct, but it really only applies to the query portion of the URL. The remainder was already using UTF-8. And all of this is also the recommended default, so I'm not sure it's worth calling out as if it's something special, but that's up to you. (As in, I'd have a note when you do supply that argument.)

@zcorpan
Copy link
Member

zcorpan commented Nov 23, 2023

Yes it looks correct. It's the recommended default but it's different from URLs in HTML links, so it can still be surprising if one isn't already familiar with these details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
TPAC 2023 agenda
Thursday Morning
Development

No branches or pull requests

7 participants