[css-fonts-5] Make `unicode-range` syntax suck less #7921

LeaVerou · 2022-10-19T17:18:58Z

Right now unicode-range accepts everything in terms of codepoints. For example:

/* yen, kanji, hiragana, katakana */
unicode-range: U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F;
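To make the current notation concrete, here is a minimal Python sketch of how a value like the one above expands into codepoint intervals. The function names `parse_urange` and `parse_unicode_range` are made up for this illustration, and the parsing is a simplification of the real CSS grammar, not a conforming implementation.

```python
from __future__ import annotations

import re

MAX_CODEPOINT = 0x10FFFF  # upper limit of the Unicode codespace


def parse_urange(token: str) -> tuple[int, int]:
    """Parse one token such as 'U+A5', 'U+4E00-9FFF' or 'U+30??' into (start, end)."""
    m = re.fullmatch(r"[Uu]\+([0-9A-Fa-f?]{1,6})(?:-([0-9A-Fa-f]{1,6}))?", token.strip())
    if m is None:
        raise ValueError(f"not a unicode-range token: {token!r}")
    first, last = m.group(1), m.group(2)
    if "?" in first:
        # Wildcards are only allowed as trailing digits and without an explicit end.
        if last is not None or not re.fullmatch(r"[0-9A-Fa-f]*\?+", first):
            raise ValueError(f"misplaced wildcard in {token!r}")
        start = int(first.replace("?", "0"), 16)
        end = int(first.replace("?", "F"), 16)
    else:
        start = int(first, 16)
        end = int(last, 16) if last is not None else start
    if start > end or end > MAX_CODEPOINT:
        raise ValueError(f"invalid interval in {token!r}")
    return start, end


def parse_unicode_range(value: str) -> list[tuple[int, int]]:
    """Split a comma-separated descriptor value and parse every token."""
    return [parse_urange(token) for token in value.split(",")]


ranges = parse_unicode_range("U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F")
print([f"U+{s:X}-{e:X}" for s, e in ranges])
# ['U+A5-A5', 'U+4E00-9FFF', 'U+3000-30FF', 'U+FF00-FF9F']
```

Note how `U+30??` silently stands for the 256 codepoints from U+3000 to U+30FF, which is part of what makes the notation hard to read.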

This has several problems:

It's hard to read: even if you only want to specify a few characters, you need to look up their codepoints
Broader character classes (e.g. Japanese letter, Emoji, Digit) need to be specified manually, which is error-prone
There is no exclusion syntax (all characters in the font minus these); the range needs to be tediously constructed by starting from U+0000 and ending at U+FFFF, breaking as needed in between.


LeaVerou · 2022-10-19T17:27:03Z

Ideas and related work:

It's hard to read: even if you only want to specify a few characters, you need to look up their codepoints

For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid:

unicode-range: "&";
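As a rough illustration of the proposed string form, the Python sketch below derives the codepoint set from a string and prints it in today's notation. The helpers `codepoints_of` and `format_as_urange` are hypothetical names for this example; nothing here is specified behavior.

```python
from __future__ import annotations


def codepoints_of(text: str) -> set[int]:
    """Union of all codepoints in the string; duplicates collapse naturally."""
    return {ord(ch) for ch in text}


def format_as_urange(codepoints: set[int]) -> str:
    """Render a codepoint set in the existing U+ notation, one token per codepoint."""
    return ", ".join(f"U+{cp:X}" for cp in sorted(codepoints))


print(format_as_urange(codepoints_of("&")))    # U+26
print(format_as_urange(codepoints_of("¥π¥")))  # U+A5, U+3C0
```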

Broader character classes (e.g. Japanese letter, Emoji, Digit) need to be specified manually, which is error-prone

For this there has already been discussion in #4573 and just needs spec edits.

There is no exclusion syntax (all characters in the font minus these); the range needs to be tediously constructed by starting from U+0000 and ending at U+FFFF, breaking as needed in between.

Perhaps we need some kind of not operator. This can even be just a keyword (not, exclude?) in front of existing value syntax:

<urange> = <urange> | not <urange>

Which would allow things like:

unicode-range: greek, not japanese, not U+A5;

Or a minus operator and a keyword for all characters?

unicode-range: greek except "π";

tabatkins · 2022-10-19T23:10:00Z

Yeah, I think [ not? [ <urange> | string | <script-keyword>] ]# is pretty reasonable, with strings being equivalent to a range that covers all the codepoints of the string. All positive ranges would be added, then all the negative ranges would be subtracted; I don't think there's a real need to subtract from a particular range.

dbaron · 2022-10-20T13:24:37Z

From @tabatkins :

All positive ranges would be added, then all the negative ranges would be subtracted

Presumably if there are no positive ranges, then the starting point would be all characters rather than none.

From @LeaVerou :

unicode-range: greek, not japanese, not U+A5;

Were you expecting that unicode-range: greek, not japanese would do something different from just unicode-range: greek? If so, what? (I'm assuming that the greek and japanese character ranges don't intersect... though maybe that's an incorrect assumption.)
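The evaluation order discussed in the last two comments (add all positive ranges, subtract all negative ones, and start from all characters when there are no positive ranges) can be sketched in Python as follows. This is only an illustration: `resolve` is a made-up helper, and the Greek and kana Unicode blocks are used as stand-ins for the `greek` and `japanese` keywords, whose actual contents have not been defined in this thread.

```python
from __future__ import annotations

from typing import Callable

Range = tuple[int, int]

ALL_CHARACTERS: Range = (0x0, 0x10FFFF)
GREEK: Range = (0x0370, 0x03FF)  # Greek and Coptic block (stand-in for `greek`)
KANA: Range = (0x3040, 0x30FF)   # Hiragana + Katakana blocks (stand-in for `japanese`)


def resolve(terms: list[tuple[bool, Range]]) -> Callable[[int], bool]:
    """terms is a list of (negated, range) pairs; returns a membership test."""
    positives = [r for negated, r in terms if not negated]
    negatives = [r for negated, r in terms if negated]
    if not positives:
        # Assumption from the thread: with no positive ranges, start from everything.
        positives = [ALL_CHARACTERS]

    def contains(cp: int) -> bool:
        in_positive = any(start <= cp <= end for start, end in positives)
        in_negative = any(start <= cp <= end for start, end in negatives)
        return in_positive and not in_negative

    return contains


greek_only = resolve([(False, GREEK)])
greek_not_kana = resolve([(False, GREEK), (True, KANA)])
everything_but_yen = resolve([(True, (0xA5, 0xA5))])

sample = [0x41, 0xA5, 0x3C0, 0x3042]
# Subtracting a disjoint range changes nothing, which is the point raised above.
print(all(greek_only(cp) == greek_not_kana(cp) for cp in sample))  # True
print(everything_but_yen(0x41), everything_but_yen(0xA5))          # True False
```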

LeaVerou · 2022-10-20T13:36:31Z

Were you expecting that unicode-range: greek, not japanese would do something different from just unicode-range: greek? If so, what? (I'm assuming that the greek and japanese character ranges don't intersect... though maybe that's an incorrect assumption.)

I was just trying to show syntax, but I agree that is a poor example. They definitely don't intersect! The fact that I can't easily think of examples that do intersect probably proves that @tabatkins is right and we don't need to subtract from a particular range.

aphillips · 2022-10-23T16:45:01Z

You might want to start by looking at what Unicode and ICU have done in this space. For example, the UnicodeSet class in ICU4J is similar to the kinds of "range selection" you're describing here--one can add characters according to various Unicode properties, classes, and scripts to build up ranges, invert ranges, etc.

I think the descriptions in the thread above need to be tighter. Are greek and japanese supposed to be script names, e.g. equivalent to ISO15924 codes like Grek and Jpan? Or are they meant to describe specific character sets, such as the el (Greek) and ja locale exemplary sets in CLDR (such as this one)? These kinds of sets definitely do intersect in various ways (and most languages use at least some of the "common" script--think punctuation). I'll also call out that Unicode runs all the way to U+10FFFF.

I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet?? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...

r12a · 2022-10-27T16:37:08Z

For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid:

unicode-range: "&";

It may also be useful (i haven't thought it through completely) to allow ranges separated by hyphens, like:

unicode-range: "&¡-§©"

which would include the characters &¡¢£¤¥¦§©. You'd need a way of escaping the - character though.
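A minimal Python sketch of this in-string range idea follows. It assumes a backslash as the escape for a literal hyphen, which is one possible choice and not something settled here; `expand_char_ranges` is a hypothetical name.

```python
from __future__ import annotations


def expand_char_ranges(spec: str) -> set[int]:
    """Expand 'a-z'-style ranges inside a string into a set of codepoints."""
    # First pass: resolve escapes, remembering which characters were escaped.
    items: list[tuple[str, bool]] = []
    i = 0
    while i < len(spec):
        if spec[i] == "\\":
            if i + 1 >= len(spec):
                raise ValueError("dangling escape at end of string")
            items.append((spec[i + 1], True))
            i += 2
        else:
            items.append((spec[i], False))
            i += 1

    # Second pass: an unescaped hyphen between two characters forms a range.
    result: set[int] = set()
    j = 0
    while j < len(items):
        if j + 2 < len(items) and items[j + 1] == ("-", False):
            start, end = ord(items[j][0]), ord(items[j + 2][0])
            if start > end:
                raise ValueError("range start is after range end")
            result.update(range(start, end + 1))
            j += 3
        else:
            result.add(ord(items[j][0]))
            j += 1
    return result


print("".join(chr(cp) for cp in sorted(expand_char_ranges("&¡-§©"))))  # &¡¢£¤¥¦§©
print(sorted(expand_char_ranges("a\\-z")) == sorted(map(ord, "-az")))  # True
```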

(Two other cautions about situations where it may be better to stick with code point numbers:

[1] Using characters instead of code point values may cause some difficulty when specifying RTL character sets. For example in

unicode-range: "ذ-خ", "ى", "a-z", "ب-ت";

the underlying order is not what you see (although it could be worse).

[2] You'll probably still want to use code point values for combining characters and invisible characters, and especially for formatting characters such as RLI/LRI etc which will again make the declaration look odd and hard to edit.
)

tabatkins · 2022-10-27T23:01:34Z

Right, those issues are precisely why I don't think we want to allow string-based ranges, at least not with that syntax. A range(start, end) function could potentially work, if needed. (Tho since all the syntactic characters inside the parens are non-directional it still ends up being very confusingly visibly reordered if viewed in a web-based editor.)

LeaVerou · 2022-10-28T02:04:43Z

I think ranges are useful but obviously the token that indicates this is a range would need to be outside the string. Eg a function like @tabatkins described or even <string> to <string> or <string> - <string>

r12a · 2022-10-28T07:39:51Z

the token that indicates this is a range would need to be outside the string

Not necessarily. On the (probably rare) occasion where - has to be specified as a character it could be escaped (like in regular expressions). In fact, this whole thing sounds very like writing a regular expression, so perhaps that offers an alternative approach to the syntax?

That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.

LeaVerou · 2022-10-28T15:48:48Z

I think that makes it harder to read what the range is. I love regex, but it's not exactly known for its readability 😀

That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.

Not sure I follow. If anything it seems to me that doing ranges with syntax outside the string makes this easier.

svgeesus · 2024-04-15T18:40:27Z

@aphillips

I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet?? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...

That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.

tabatkins · 2024-04-15T18:44:25Z

Yeah, if we'd designed it today it would have sucked a whole lot less. That syntax can drink; that syntax has graduated college; that syntax can rent a car without an additional surcharge.

tabatkins · 2024-04-15T18:53:10Z

or even <string> to <string>

Playing with it a bit myself, unfortunately I think we'd be well-served by using a separator token with strong LTR directionality like to.

If you're trying to denote a range from U+062E (خ) to U+0630 (ذ), you get the following results with a weak directionality vs strong directionality separator:

range("خ" to "ذ")

range("خ", "ذ")

The above two strings are exactly identical save for the separator used, but the bidi algorithm makes the second look like it's in the wrong order.
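The difference between the two separators can be checked with Python's standard `unicodedata` module, which reports the bidi class of each character. The sketch below also expands the range itself; `range_from_strings` and `bidi_classes` are hypothetical helpers, and the Arabic letters are written as escapes so the source is not visually reordered.

```python
from __future__ import annotations

import unicodedata


def range_from_strings(start: str, end: str) -> range:
    """Codepoints covered by a range whose endpoints are single-character strings."""
    if len(start) != 1 or len(end) != 1:
        raise ValueError("each endpoint must be exactly one codepoint")
    lo, hi = ord(start), ord(end)
    if lo > hi:
        raise ValueError("range start is after range end")
    return range(lo, hi + 1)


def bidi_classes(text: str) -> list[str]:
    """Unicode bidirectional class of every character in the text."""
    return [unicodedata.bidirectional(ch) for ch in text]


khah, thal = "\u062e", "\u0630"  # the two endpoints from the example above

print([f"U+{cp:04X}" for cp in range_from_strings(khah, thal)])
# ['U+062E', 'U+062F', 'U+0630']

print(bidi_classes(khah + thal))  # ['AL', 'AL']            strong right-to-left
print(bidi_classes(", "))         # ['CS', 'WS']            no strong direction
print(bidi_classes(" to "))       # ['WS', 'L', 'L', 'WS']  strong left-to-right
```

The comma and space carry no strong direction of their own, so they take on the direction of the Arabic letters around them; the letters in `to` are strongly left-to-right and keep the two endpoints in separate runs.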

svgeesus · 2024-04-15T19:38:21Z

My abject apologies, once again for the unicode-range syntax.

"Put it in for now, Chris, until we come up with something better" -- Håkon Wium Lie, spring 1997

On the other hand, at least it wasn't the worst syntax proposed. Feast your eyes on the hex-encoded BMP bitmap:

unicode-range: 0x02037FBC4571000003100C000000100010000300BDF74300000000000

aphillips · 2024-04-16T14:15:21Z

That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.

I'll remind Martin and Misha that they missed one. 😆

LeaVerou added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. css-fonts-5 labels Oct 19, 2022

w3cbot mentioned this issue Oct 20, 2022

[css-fonts-5] Make unicode-range syntax suck less w3c/i18n-activity#1602

Open

w3c deleted a comment from tabatkins Oct 20, 2022

svgeesus mentioned this issue Apr 15, 2024

[css‑fonts‑4] Create keywords for unicode‑range #4573

Open
