Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[css-syntax] Give up on <urange> production? #8835

Open
tabatkins opened this issue May 12, 2023 · 19 comments
Open

[css-syntax] Give up on <urange> production? #8835

tabatkins opened this issue May 12, 2023 · 19 comments

Comments

@tabatkins
Copy link
Member

In the previous versions of Syntax I tried my best to be grammar-agnostic, so you could parse CSS generically, and only throw out invalid rules/declarations as a final step. Implementations actually grammar-checked as they went and didn't add invalid stuff to the stylesheets, but that was meant to be an unobservable performance optimization.

However, we also wanted to fix the "u+a { color: red; } is invalid because <urange> isn't valid in a selector" problem; CSS2 defined <urange> as a generic token. We discussed this in the mailing list (transferred to #3588), and I ended up solving it by defining a very complex production that required preserving the representation of tokens, which nothing else in the whole spec required. Accordingly, no browser has actually implemented this: Chrome just never fixed the bug, and Firefox implemented it by invoking a different tokenizer.

(<an+b> is also now a complex production, but aside from preserving whether a positive number had a + or not, it relied on the standard token values, so wasn't as problematic. Everybody seems to have actually followed the spec text.)

However, now that Syntax is required to grammar-check as it parses, we can potentially correct this in a less weird way, by having a tokenizer flag that gets invoked only for the 'unicode-range' descriptor and allows the tokenizer to produce urange tokens again. This'll better match what Firefox is doing, and should make it easy for Chrome to correctly implement as well.

Thoughts?

@tabatkins
Copy link
Member Author

/cc @emilio @sesse @andruud (I'm not sure who to ping from WebKit anymore)

@emilio
Copy link
Collaborator

emilio commented May 13, 2023

cc @SimonSapin in case he has any opinions, since he implemented the Firefox parser

@SimonSapin
Copy link
Contributor

However, now that Syntax is required to grammar-check as it parses,

What does this mean?

@tabatkins
Copy link
Member Author

@SimonSapin The new Nesting behavior we agreed on is that, in nesting contexts, you first attempt to parse a declaration and grammar-validate it; if that fails, you back up and instead parse as a rule. This is how we resolved the grammar overlap with foo:bar ... looking like both a declaration and a rule. You can see this in https://drafts.csswg.org/css-syntax/#consume-block-contents (note that the "validate" check is done in https://drafts.csswg.org/css-syntax/#consume-declaration).

@SimonSapin
Copy link
Contributor

Firefox has a mechanism to "rewind" arbitrarily far to implement <something> | <other> alternate syntax, so that aspect of Nesting isn’t be a problem. But that’s all at the parser level.

Having a descriptor require switching "modes" in the tokenizer is more than that. It feels unsatisfactory but if it’s the least bad solution given other constraints then 🤷

@emilio
Copy link
Collaborator

emilio commented May 17, 2023

Yeah I don't think this is needed for nesting afaict.

@tabatkins
Copy link
Member Author

Yeah, this isn't about Nesting, it's just a wart I've been worrying about in Syntax since I wrote it. Literally nobody implements the spec as written.

Having a descriptor require switching "modes" in the tokenizer is more than that.

That is, effectively, what Firefox is doing right now anyway, in https://github.com/servo/rust-cssparser/blob/master/src/unicode_range.rs#L43. Once you see a u ident, you grab the rest of the tokens in the property, drop back to the text range they covered, and then run a custom parser over the text to try and extract a UnicodeRange token.

@andruud
Copy link
Member

andruud commented May 18, 2023

I'm sceptical about switching tokenizer modes as a general concept that can happen anywhere, because tokenization is logically separated from parsing (in Blink anyway), and in many cases we tokenize the input in full before looking at any of the tokens (which means it's too late to switch modes by the time we discover the need for that switch).

I think the current spec is not too unreasonble, except for the requirement that tokens retain their representation.

Thought: implementations must already be able to retain the original string for declarations due to custom properties, can we work with that here?

  • Remove <urange>, instead:
  • unicode-range: <any-value>.
  • Spec that unicode-range retains the original string of its declaration, exactly like custom props.
  • At some point interpret that original string into a range per 7.1.

How would we define unicode-range if we introduced it today? unicode-range: <string>?

@tabatkins
Copy link
Member Author

Thought: implementations must already be able to retain the original string for declarations due to custom properties, can we work with that here?

Yeah, that probably works, tho it would mean that we lose the ability to check the syntax for the property at parse time. Maybe we can do this, but still hardcode a check into the parser that works on the string value of the entire declaration?

How would we define unicode-range if we introduced it today? unicode-range: ?

Maybe a string, but we could also have really easy used existing syntax. For example, if we'd just used u- rather than u+ this wouldn't be a problem at all: u-14a and u-100-14a are single idents with unambiguously interpretable values, and u-1?? is an ident followed by ? delim tokens. The grammar would just have been <urange> = <custom-ident> '?'*, with a restriction that no whitespace is allowed in the production, and then some prose rules about how to interpret the ident's value + question marks into a range. Or we could've done the same with a hash token, since that's already used for hex values in colors; it also uses ident syntax after the hash, so it would parse identically - #100-14a, etc.

@andruud
Copy link
Member

andruud commented May 19, 2023

Maybe we can do this, but still hardcode a check into the parser that works on the string value of the entire declaration?

Yeah, that was my thinking. We say in prose that (original) strings that don't match the microsyntax are invalid (parse-time). This is arguably "ugly", but at least it's isolated to unicode-range and the rest of CSS doesn't have to care.

Maybe a string, but we could also have really easy used existing syntax.

If we need unicode ranges in a new property/descriptor in the future, would we use a <string> for that? Asking because we should probably keep the look-at-the-original-string-hacks to a bare minimum.

having a tokenizer flag that gets invoked only for the 'unicode-range' descriptor and allows the tokenizer to produce urange tokens again.

By the way, I think we can still implement "parse the original string" (in Blink) even if it's specified like this, as it's not observable what we actually do (I think). I'm just a little worried that we're opening the door to lots "oh, we'll just switch tokenizer mode to solve that" in the future, which may result in a lot of new "Chrome never fixed the bug"-situations. On the other hand we seem to be opening some unwanted door regardless, so if the mode-switch is a substantially nicer spec than the alternatives, then maybe it indeed is the least bad option ...

@SimonSapin
Copy link
Contributor

Thought: implementations must already be able to retain the original string for declarations due to custom properties,

Is that really the case? Do custom properties not work if an implementation saves tokens? (Regardless of whether it might prefer a string as it’s more compact in memory)

@tabatkins
Copy link
Member Author

Do custom properties not work if an implementation saves tokens? (Regardless of whether it might prefer a string as it’s more compact in memory)

Yes, there's a very explicit requirement for saving the original string representation (or enough information to reconstruct it), because of things like putting numeric-ish data into custom properties (for non-CSS purposes) and expecting it to come out unmangled by serialization when they read it later. See https://drafts.csswg.org/css-variables/#serializing-custom-props (particularly the Note at the end of the section) and the discussion at https://logs.csswg.org/irc.w3.org/css/2016-09-19/#e722375.

I mean, you can save tokens, so long as each one individually remembers the bit of the string it was parsed from. There's no observable difference between that and saving the entire property value as a unit.

@tabatkins
Copy link
Member Author

By the way, I think we can still implement "parse the original string" (in Blink) even if it's specified like this, as it's not observable what we actually do (I think). I'm just a little worried that we're opening the door to lots "oh, we'll just switch tokenizer mode to solve that" in the future, which may result in a lot of new "Chrome never fixed the bug"-situations. On the other hand we seem to be opening some unwanted door regardless, so if the mode-switch is a substantially nicer spec than the alternatives, then maybe it indeed is the least bad option ...

That's a fair worry! I can assure that, at least so long as I'm around to keep a hand in CSS's syntax, that won't happen. I'm extremely annoyed that we have to do this in the first place; I'm just trying to find the most realistic way to write it that implementations will actually follow, since my first attempt failed (Chrome never updated to it; Firefox does something completely different).

But since we do have a "preserve the string exactly" requirement in custom properties (which should probably be at least noted in the Syntax spec, since it does inform how you have to build your parser somewhat), I think leaning on that for unicode-range is probably the most straightforward way to go. (Plus, Firefox's behavior will then match the spec.) I'll put down some text and see how I feel about it.

@tabatkins
Copy link
Member Author

Okay, first pass at a new unicode-range approach is checked in. The tokenizer now has a flag that allows or disallows unicode ranges, and the "consume a declaration" explicitly grabs the text of the declarations' value and retokenizes it with the flag on if it sees the declaration's name is "unicode-range". The tokenization rules were taken verbatim from the old version of CSS Syntax, so the Chrome impl of the token is exactly correct still. (Firefox is close but not exact - it'll consume more than 6 digits+question marks, and just fail to produce a token if it does so. This doesn't matter in practice, tho.)

The invocation is very obviously clumsy, and there's a note in the algorithm explicitly calling out that this is to fix a legacy syntax mistake, so it should be clear this isn't to be reproduced.

I also explicitly introduced the concept that your token stream needs to be able to access the underlying string, and invoke that capability in "consume a declaration" if you're parsing a custom property. That should be the only place it's technically needed.

@cdoublev
Copy link
Collaborator

I mean, you can save tokens, so long as each one individually remembers the bit of the string it was parsed from. There's no observable difference between that and saving the entire property value as a unit.

I think there are minor differences with trailing whitespaces/comments at the top-level of the declaration value.

You also loose the ability to serialize the fallback argument of var() with the original string but the other parts as tokens. This is something I suggested in #6484 but I am aware that this may not be ideal for memory savings.

This may be subjective, but it also seems easier to handle whitespaces disallowed in <an+b> subproductions (rather than for a whole production) when each token remembers its own string bit.

@tabatkins
Copy link
Member Author

The only part of the "original string" that an+b needed to remember was whether a positive number had a + or not on it; I've reworded the spec to just track that information explicitly on the token and rely on it, rather than preserve the exact string.

@cdoublev
Copy link
Collaborator

cdoublev commented Mar 6, 2024

Do you think unicode-range should keep <urange># as its value definition, and require the start/end code point to be lower than 110000, and start <= end?

U+110000, U+11????, U+1-0, are currently invalid unicode-range values on WPT and in Chrome and FF.


Check if three code points would start a unicode-range does not allow u+? and similar, which are valid on WPT and in Chrome and FF.


I do not think <unicode-range-token> can be produced when setting CSSFontFaceRule.style.unicodeRange, at the moment.

@mscuthbert
Copy link

If <urange> does disappear I hope that the unicode-range value can be specified more completely than just <string> for the aid of parsers. The ability to specify more than one (comma-separated) unicode range is at present scarcely documented, but is crucial for users of fonts whose ranges have been broken up due to historical circumstances in the unicode spec. The musical symbols ranges, for instance, is divided into at least two (three or more depending on completeness) ranges, and some fonts (including the W3C Standard Music Font Layout, SMuFL which defines both Unicode standard and Private Use definitions) depend on implementors being able to subset non-connected ranges to create composite-font rendering experiences.

@cdoublev
Copy link
Collaborator

cdoublev commented May 3, 2024

It can be parsed against <unicode-range-token>#, ignoring the minor flaws reported above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants