-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[css-syntax] Give up on <urange> production? #8835
Comments
cc @SimonSapin in case he has any opinions, since he implemented the Firefox parser |
What does this mean? |
@SimonSapin The new Nesting behavior we agreed on is that, in nesting contexts, you first attempt to parse a declaration and grammar-validate it; if that fails, you back up and instead parse as a rule. This is how we resolved the grammar overlap with |
Firefox has a mechanism to "rewind" arbitrarily far to implement Having a descriptor require switching "modes" in the tokenizer is more than that. It feels unsatisfactory but if it’s the least bad solution given other constraints then 🤷 |
Yeah I don't think this is needed for nesting afaict. |
Yeah, this isn't about Nesting, it's just a wart I've been worrying about in Syntax since I wrote it. Literally nobody implements the spec as written.
That is, effectively, what Firefox is doing right now anyway, in https://github.com/servo/rust-cssparser/blob/master/src/unicode_range.rs#L43. Once you see a |
I'm sceptical about switching tokenizer modes as a general concept that can happen anywhere, because tokenization is logically separated from parsing (in Blink anyway), and in many cases we tokenize the input in full before looking at any of the tokens (which means it's too late to switch modes by the time we discover the need for that switch). I think the current spec is not too unreasonble, except for the requirement that tokens retain their representation. Thought: implementations must already be able to retain the original string for declarations due to custom properties, can we work with that here?
How would we define |
Yeah, that probably works, tho it would mean that we lose the ability to check the syntax for the property at parse time. Maybe we can do this, but still hardcode a check into the parser that works on the string value of the entire declaration?
Maybe a string, but we could also have really easy used existing syntax. For example, if we'd just used |
Yeah, that was my thinking. We say in prose that (original) strings that don't match the microsyntax are invalid (parse-time). This is arguably "ugly", but at least it's isolated to
If we need unicode ranges in a new property/descriptor in the future, would we use a
By the way, I think we can still implement "parse the original string" (in Blink) even if it's specified like this, as it's not observable what we actually do (I think). I'm just a little worried that we're opening the door to lots "oh, we'll just switch tokenizer mode to solve that" in the future, which may result in a lot of new "Chrome never fixed the bug"-situations. On the other hand we seem to be opening some unwanted door regardless, so if the mode-switch is a substantially nicer spec than the alternatives, then maybe it indeed is the least bad option ... |
Is that really the case? Do custom properties not work if an implementation saves tokens? (Regardless of whether it might prefer a string as it’s more compact in memory) |
Yes, there's a very explicit requirement for saving the original string representation (or enough information to reconstruct it), because of things like putting numeric-ish data into custom properties (for non-CSS purposes) and expecting it to come out unmangled by serialization when they read it later. See https://drafts.csswg.org/css-variables/#serializing-custom-props (particularly the Note at the end of the section) and the discussion at https://logs.csswg.org/irc.w3.org/css/2016-09-19/#e722375. I mean, you can save tokens, so long as each one individually remembers the bit of the string it was parsed from. There's no observable difference between that and saving the entire property value as a unit. |
That's a fair worry! I can assure that, at least so long as I'm around to keep a hand in CSS's syntax, that won't happen. I'm extremely annoyed that we have to do this in the first place; I'm just trying to find the most realistic way to write it that implementations will actually follow, since my first attempt failed (Chrome never updated to it; Firefox does something completely different). But since we do have a "preserve the string exactly" requirement in custom properties (which should probably be at least noted in the Syntax spec, since it does inform how you have to build your parser somewhat), I think leaning on that for unicode-range is probably the most straightforward way to go. (Plus, Firefox's behavior will then match the spec.) I'll put down some text and see how I feel about it. |
Okay, first pass at a new unicode-range approach is checked in. The tokenizer now has a flag that allows or disallows unicode ranges, and the "consume a declaration" explicitly grabs the text of the declarations' value and retokenizes it with the flag on if it sees the declaration's name is "unicode-range". The tokenization rules were taken verbatim from the old version of CSS Syntax, so the Chrome impl of the token is exactly correct still. (Firefox is close but not exact - it'll consume more than 6 digits+question marks, and just fail to produce a token if it does so. This doesn't matter in practice, tho.) The invocation is very obviously clumsy, and there's a note in the algorithm explicitly calling out that this is to fix a legacy syntax mistake, so it should be clear this isn't to be reproduced. I also explicitly introduced the concept that your token stream needs to be able to access the underlying string, and invoke that capability in "consume a declaration" if you're parsing a custom property. That should be the only place it's technically needed. |
I think there are minor differences with trailing whitespaces/comments at the top-level of the declaration value. You also loose the ability to serialize the fallback argument of This may be subjective, but it also seems easier to handle whitespaces disallowed in |
The only part of the "original string" that an+b needed to remember was whether a positive number had a |
Do you think
Check if three code points would start a unicode-range does not allow I do not think |
If |
It can be parsed against |
In the previous versions of Syntax I tried my best to be grammar-agnostic, so you could parse CSS generically, and only throw out invalid rules/declarations as a final step. Implementations actually grammar-checked as they went and didn't add invalid stuff to the stylesheets, but that was meant to be an unobservable performance optimization.
However, we also wanted to fix the "
u+a { color: red; }
is invalid because<urange>
isn't valid in a selector" problem; CSS2 defined<urange>
as a generic token. We discussed this in the mailing list (transferred to #3588), and I ended up solving it by defining a very complex production that required preserving the representation of tokens, which nothing else in the whole spec required. Accordingly, no browser has actually implemented this: Chrome just never fixed the bug, and Firefox implemented it by invoking a different tokenizer.(
<an+b>
is also now a complex production, but aside from preserving whether a positive number had a+
or not, it relied on the standard token values, so wasn't as problematic. Everybody seems to have actually followed the spec text.)However, now that Syntax is required to grammar-check as it parses, we can potentially correct this in a less weird way, by having a tokenizer flag that gets invoked only for the 'unicode-range' descriptor and allows the tokenizer to produce urange tokens again. This'll better match what Firefox is doing, and should make it easy for Chrome to correctly implement as well.
Thoughts?
The text was updated successfully, but these errors were encountered: