[css‑syntax] <string‑token> should probably handle valid surrogate pairs #6352

Open
ExE-Boss opened this issue Jun 5, 2021 · 3 comments


@ExE-Boss
Contributor

ExE-Boss commented Jun 5, 2021

Spec: https://drafts.csswg.org/css-syntax-3/#consume-string-token


The 6-digit escape syntax for <string‑token> was only added to CSS in CSS2, and the CSS Syntax module restricts it so that it cannot produce surrogate code points. Even so, I would expect a string containing a valid escaped surrogate pair to work like the equivalent 6-digit escape, e.g.:

.foo:before {
	/*
	The "CSS1" and "CSS2.1" specifications parse this as:
		- U+D83D (High Surrogate)
		- U+DD25 (Low Surrogate)

	The "CSS Syntax Level 3" specification parses this as:
		- U+FFFD (Replacement Character)
		- U+FFFD (Replacement Character)

	I would expect this to parse as the U+D83D and U+DD25 surrogate code points,
	which would decode to U+1F525 (Fire) at parse time.
	*/
	content: "\D83D\DD25";
}

and

.foo:before {
	/*
	The "CSS1" specification parses this as:
		- U+1F52 (Greek Small Letter Upsilon With Psili And Varia)
		- U+35 (Digit Five)
	The "CSS2.1" and "CSS Syntax Level 3" specifications parse this as:
		- U+1F525 (Fire)
	*/
	content: "\1F525";
}

would be equivalent.
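
For reference, the equivalence being asked for is the standard UTF-16 surrogate pair arithmetic. A minimal sketch follows; the function name `combine_surrogates` is made up for illustration and does not come from any CSS specification.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one scalar value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the first example maps to the code point in the second one.
print(f"U+{combine_surrogates(0xD83D, 0xDD25):X}")  # U+1F525
```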

@tabatkins
Member

Do you have a compat need for this? I'd prefer not to allow it if possible, as it would require us to either allow lone surrogates (the only way these can be produced, as they're otherwise "censored" away during parsing) or add some complication to escaping such that, if you decode a high surrogate, you immediately check if the next characters are an escape for a low surrogate, then decode them together and emit the combined codepoint.

That's possible, I'd just like to avoid it if it's not necessary.
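
The second option (look ahead for a low-surrogate escape right after decoding a high surrogate) might look roughly like the sketch below. It is an illustration under simplifying assumptions, not spec text: `decode_escapes` and `pair_surrogates` are hypothetical names, only hex escapes are handled, and non-hex escapes such as `\"` are left untouched. With `pair_surrogates=False` it follows the CSS Syntax Level 3 rule of replacing zero, surrogates, and out-of-range values with U+FFFD.

```python
import re

# A hex escape: backslash, 1-6 hex digits, then one optional whitespace.
ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})[ \t\n]?")


def decode_escapes(text: str, pair_surrogates: bool = False) -> str:
    out = []
    pos = 0
    while pos < len(text):
        match = ESCAPE.match(text, pos)
        if match is None:
            out.append(text[pos])  # ordinary character, copied as-is
            pos += 1
            continue
        cp = int(match.group(1), 16)
        pos = match.end()
        if pair_surrogates and 0xD800 <= cp <= 0xDBFF:
            # Hypothetical extension: peek for an escaped low surrogate.
            nxt = ESCAPE.match(text, pos)
            if nxt is not None:
                low = int(nxt.group(1), 16)
                if 0xDC00 <= low <= 0xDFFF:
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                    pos = nxt.end()
        # Zero, unpaired surrogates, and out-of-range values become U+FFFD.
        if cp == 0 or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            cp = 0xFFFD
        out.append(chr(cp))
    return "".join(out)


print(decode_escapes(r"\D83D\DD25") == "\uFFFD\uFFFD")                         # True
print(decode_escapes(r"\D83D\DD25", pair_surrogates=True) == "\U0001F525")     # True
```

Note that even in this sketch a lone surrogate is still never emitted: an unpaired high or low surrogate falls through to the U+FFFD replacement.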


I'm not sure what the significance of your comment about CSS1 only allowing 4-digit escapes is, sorry.

@ExE-Boss
Contributor Author

ExE-Boss commented Nov 25, 2021

Well, back when WebKit didn’t correctly support the CSS2.1 syntax (before r114876), they implemented support for surrogate pairs [mailing list] as a workaround.


> I'm not sure what the significance of your comment about CSS1 only allowing 4-digit escapes is, sorry.

The CSS1 syntax for Unicode escapes [CSS1 Appendix B], in case‑insensitive flex [CSS1 ref. 16] notation, is:

unicode		\\[0-9a-f]{1,4}

Which means that anything after the 4th hex digit is not part of the escaped code point in CSS1:

.foo:before {
	/*
	The "CSS1" specification parses this as:
		- U+0000 (Null) → U+FFFD (Replacement Character)
		- U+0034 (Digit Four)
		- U+0031 (Digit One)

	The "CSS2.1" and "CSS Syntax Level 3" specifications parse this as:
		- U+0041 (Latin Capital Letter A)
	*/
	content: "\000041";
}
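
The difference between the two grammars can be reproduced with a pair of regular expressions. This is a simplified sketch: the pattern names and `split_first_escape` are made up for illustration, and the optional whitespace that CSS2.1 allows after an escape is left out.

```python
import re

# CSS1 allows 1-4 hex digits after the backslash, CSS2.1 allows 1-6.
CSS1_ESCAPE = re.compile(r"\\[0-9a-f]{1,4}", re.IGNORECASE)
CSS21_ESCAPE = re.compile(r"\\[0-9a-f]{1,6}", re.IGNORECASE)


def split_first_escape(pattern: re.Pattern, text: str) -> tuple[int, str]:
    """Return (escaped value, remaining text) for the escape at the start."""
    match = pattern.match(text)
    if match is None:
        raise ValueError("text does not start with a hex escape")
    return int(match.group()[1:], 16), text[match.end():]


print(split_first_escape(CSS1_ESCAPE, r"\000041"))   # (0, '41')
print(split_first_escape(CSS21_ESCAPE, r"\000041"))  # (65, '')
print(split_first_escape(CSS1_ESCAPE, r"\1F525"))    # (8018, '5')
```

Here 65 is U+0041 and 8018 is U+1F52, matching the comments in the CSS examples.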

@tabatkins
Member

Sure, impls did all sorts of weird things in the bad old days, but is there a current compat need for this? Do you know of pages that are currently broken with the specified Syntax behavior, but would be fixed if we allowed lone surrogates to be produced by the escape syntax?

> The CSS1 syntax for Unicode escapes [...]

Sure, I'm still just not sure how a CSS1 spec detail is relevant to anything here. CSS2 was first published more than twenty years ago.
