[css-fonts-5] Make unicode-range syntax suck less #7921

Open
LeaVerou opened this issue Oct 19, 2022 · 15 comments
Labels
css-fonts-5 i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@LeaVerou
Member

LeaVerou commented Oct 19, 2022

Right now unicode-range accepts everything in terms of codepoints. For example:

/* yen, kanji, hiragana, katakana */
unicode-range: U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F;

This has several problems:

  1. It's hard to read: even if you only want to specify a few specific characters, you need to look up their codepoints
  2. Broader character classes (e.g. Japanese letter, Emoji, Digit) need to be specified manually, which is error-prone
  3. There is no exclusion syntax (all characters in the font minus these); the range has to be tediously constructed by starting from U+0000 and ending at U+FFFF, breaking as needed in between.
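
To make the existing syntax concrete, here is a minimal Python sketch (not from the thread) that expands a `unicode-range` value into `(start, end)` codepoint pairs, including the `?` wildcard form. The function names `parse_urange_token` and `parse_unicode_range` are hypothetical, and the validation is deliberately simpler than what the CSS specification requires.

```python
import re

MAX_CODEPOINT = 0x10FFFF  # upper bound of the Unicode codespace


def parse_urange_token(token):
    """Return (start, end) for one token such as 'U+A5', 'U+4E00-9FFF' or 'U+30??'."""
    m = re.fullmatch(
        r"[Uu]\+([0-9A-Fa-f?]{1,6})(?:-([0-9A-Fa-f]{1,6}))?", token.strip()
    )
    if m is None:
        raise ValueError(f"not a unicode-range token: {token!r}")
    first, last = m.group(1), m.group(2)
    if "?" in first:
        # Wildcards must be trailing and cannot be combined with an explicit end.
        if last is not None or not re.fullmatch(r"[0-9A-Fa-f]*\?+", first):
            raise ValueError(f"malformed wildcard range: {token!r}")
        start = int(first.replace("?", "0"), 16)
        end = int(first.replace("?", "F"), 16)
    else:
        start = int(first, 16)
        end = int(last, 16) if last is not None else start
    if start > end or end > MAX_CODEPOINT:
        raise ValueError(f"range out of bounds: {token!r}")
    return start, end


def parse_unicode_range(value):
    """Parse a comma-separated unicode-range value into a list of (start, end) pairs."""
    return [parse_urange_token(tok) for tok in value.split(",")]


ranges = parse_unicode_range("U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F")
for start, end in ranges:
    print(f"U+{start:04X}..U+{end:04X}")
```

Note that `U+30??` silently stands for the whole block U+3000 to U+30FF, which is part of why the value is hard to read at a glance.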
@LeaVerou LeaVerou added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. css-fonts-5 labels Oct 19, 2022
@LeaVerou
Member Author

Ideas and related work:

  1. It's hard to read: even if you only want to specify a few specific characters, you need to look up their codepoints

For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid:

unicode-range: "&";

  2. Broader character classes (e.g. Japanese letter, Emoji, Digit) need to be specified manually, which is error-prone

For this, there has already been discussion in #4573; it just needs spec edits.

  3. There is no exclusion syntax (all characters in the font minus these); the range has to be tediously constructed by starting from U+0000 and ending at U+FFFF, breaking as needed in between.

Perhaps we need some kind of not operator. This can even be just a keyword (not, exclude?) in front of existing value syntax:

<urange> = <urange> | not <urange>

Which would allow things like:

unicode-range: greek, not japanese, not U+A5;

Or a minus operator and a keyword for all characters?

unicode-range: greek except "π";
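
As a rough illustration of the string idea (a sketch, not part of the proposal), the Python snippet below converts a string into the equivalent value in today's syntax by taking the union of its codepoints and coalescing adjacent ones into ranges. The helper name `string_to_urange` is hypothetical.

```python
def string_to_urange(text):
    """Return today's unicode-range value covering every codepoint in `text`."""
    codepoints = sorted({ord(ch) for ch in text})  # union, duplicates dropped
    runs = []
    for cp in codepoints:
        if runs and cp == runs[-1][1] + 1:
            runs[-1][1] = cp          # extend the current run
        else:
            runs.append([cp, cp])     # start a new run
    parts = []
    for start, end in runs:
        if start == end:
            parts.append(f"U+{start:X}")
        else:
            parts.append(f"U+{start:X}-{end:X}")
    return ", ".join(parts)


print(string_to_urange("&"))     # U+26
print(string_to_urange("cab&"))  # U+26, U+61-63
```

Under this reading, `unicode-range: "&";` would simply be a more readable spelling of `unicode-range: U+26;`.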

@tabatkins
Member

Yeah, I think [ not? [ <urange> | <string> | <script-keyword> ] ]# is pretty reasonable, with strings being equivalent to a range that covers all the codepoints of the string. All positive ranges would be added, then all the negative ranges would be subtracted; I don't think there's a real need to subtract from a particular range.

@dbaron
Member

dbaron commented Oct 20, 2022

From @tabatkins :

All positive ranges would be added, then all the negative ranges would be subtracted

Presumably if there are no positive ranges, then the starting point would be all characters rather than none.

From @LeaVerou :

unicode-range: greek, not japanese, not U+A5;

Were you expecting that unicode-range: greek, not japanese would do something different from just unicode-range: greek? If so, what? (I'm assuming that the greek and japanese character ranges don't intersect... though maybe that's an incorrect assumption.)
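
The resolution order discussed above (add all positive terms, subtract all negative terms, and start from every character when no positive term is given) can be sketched in Python as follows. This is an illustration under assumptions, not spec text: `resolve` is a hypothetical helper, and the `GREEK` and `JAPANESE` sets are stand-in block ranges chosen for the example, not definitions of any proposed keyword.

```python
ALL_CODEPOINTS = range(0x110000)  # U+0000 through U+10FFFF

# Stand-in sets for illustration only (assumed, not from any spec):
GREEK = range(0x0370, 0x0400)                                         # Greek and Coptic block
JAPANESE = list(range(0x3040, 0x3100)) + list(range(0x4E00, 0xA000))  # kana + CJK ideographs
YEN = [0xA5]


def resolve(terms, universe=ALL_CODEPOINTS):
    """Resolve (negated, codepoints) pairs into a set of codepoints.

    Positive terms are unioned, then negative terms are subtracted.
    With no positive term at all, the starting point is `universe`.
    """
    positive, negative = set(), set()
    saw_positive = False
    for negated, codepoints in terms:
        if negated:
            negative.update(codepoints)
        else:
            positive.update(codepoints)
            saw_positive = True
    if not saw_positive:
        positive = set(universe)
    return positive - negative


# "greek, not japanese, not U+A5" vs. plain "greek"
with_exclusions = resolve([(False, GREEK), (True, JAPANESE), (True, YEN)])
print(with_exclusions == set(GREEK))
```

With these stand-in sets the exclusions remove nothing, because the sets are disjoint; an exclusion only matters when it overlaps a positive term, as in `resolve([(False, GREEK), (True, [ord("π")])])`.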

@LeaVerou
Member Author

LeaVerou commented Oct 20, 2022

Were you expecting that unicode-range: greek, not japanese would do something different from just unicode-range: greek? If so, what? (I'm assuming that the greek and japanese character ranges don't intersect... though maybe that's an incorrect assumption.)

I was just trying to show syntax, but I agree that is a poor example. They definitely don't intersect! The fact that I can't easily think of examples that do intersect probably proves that @tabatkins is right and we don't need to subtract from a particular range.

@w3c w3c deleted a comment from tabatkins Oct 20, 2022
@aphillips
Contributor

aphillips commented Oct 23, 2022

You might want to start by looking at what Unicode and ICU have done in this space. For example, the UnicodeSet class in ICU4J is similar to the kinds of "range selection" you're describing here--one can add characters according to various Unicode properties, classes, and scripts to build up ranges, invert ranges, etc.

I think the descriptions in the thread above need to be tighter. Are greek and japanese supposed to be script names, e.g. equivalent to ISO 15924 codes like Grek and Jpan? Or are they meant to describe specific character sets, such as the el (Greek) and ja locale exemplary sets in CLDR (such as this one)? These kinds of sets definitely do intersect in various ways (and most languages use at least some of the "common" script--think punctuation). I'll also call out that Unicode runs all the way to U+10FFFF.

I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet?? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...

@r12a
Contributor

r12a commented Oct 27, 2022

For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid:

unicode-range: "&";

It may also be useful (i haven't thought it through completely) to allow ranges separated by hyphens, like:

unicode-range: "&¡-§©"

which would include the characters &¡¢£¤¥¦§©. You'd need a way of escaping the - character though.
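
A minimal Python sketch of how such in-string ranges might be expanded is shown below. The backslash escape and the function name `parse_string_ranges` are assumptions made for illustration; the thread does not settle on an escape character.

```python
def parse_string_ranges(text, escape="\\"):
    """Expand a string like '&¡-§©' into a sorted list of codepoints.

    An unescaped '-' between two characters denotes an inclusive range.
    A character preceded by `escape` (assumed here to be a backslash)
    is always taken literally.
    """
    # Pass 1: tokenize into (char, was_escaped) pairs.
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == escape:
            if i + 1 >= len(text):
                raise ValueError("dangling escape at end of string")
            tokens.append((text[i + 1], True))
            i += 2
        else:
            tokens.append((text[i], False))
            i += 1

    # Pass 2: fold "a - b" triples into ranges; everything else is literal.
    result = set()
    j = 0
    while j < len(tokens):
        ch = tokens[j][0]
        if j + 2 < len(tokens) and tokens[j + 1] == ("-", False):
            end = tokens[j + 2][0]
            if ord(end) < ord(ch):
                raise ValueError(f"reversed range: {ch!r}-{end!r}")
            result.update(range(ord(ch), ord(end) + 1))
            j += 3
        else:
            result.add(ord(ch))
            j += 1
    return sorted(result)


print("".join(chr(cp) for cp in parse_string_ranges("&¡-§©")))  # &¡¢£¤¥¦§©
```

In this sketch a leading or trailing hyphen is treated as a literal character, similar to the convention in regular-expression character classes.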

(Two other cautions about situations where it may be better to stick with code point numbers:

[1] Using characters instead of code point values may cause some difficulty when specifying RTL character sets. For example in

unicode-range: "ذ-خ", "ى", "a-z", "ب-ت";

the underlying order is not what you see (although it could be worse).

[2] You'll probably still want to use code point values for combining characters and invisible characters, and especially for formatting characters such as RLI/LRI etc which will again make the declaration look odd and hard to edit.
)

@tabatkins
Member

Right, those issues are precisely why I don't think we want to allow string-based ranges, at least not with that syntax. A range(start, end) function could potentially work, if needed. (Tho since all the syntactic characters inside the parens are non-directional it still ends up being very confusingly visibly reordered if viewed in a web-based editor.)

@LeaVerou
Member Author

LeaVerou commented Oct 28, 2022

I think ranges are useful but obviously the token that indicates this is a range would need to be outside the string. Eg a function like @tabatkins described or even <string> to <string> or <string> - <string>

@r12a
Contributor

r12a commented Oct 28, 2022

the token that indicates this is a range would need to be outside the string

Not necessarily. On the (probably rare) occasion where - has to be specified as a character it could be escaped (like in regular expressions). In fact, this whole thing sounds very like establishing a regular expression, so perhaps that offers an alternative approach to the syntax?

That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.

@LeaVerou
Member Author

LeaVerou commented Oct 28, 2022

I think that makes it harder to read what the range is. I love regex, but it's not exactly known for its readability 😀

That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.

Not sure I follow. If anything it seems to me that doing ranges with syntax outside the string makes this easier.

@svgeesus
Contributor

@aphillips

I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet?? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...

That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.

@tabatkins
Member

Yeah, if we'd designed it today it would have sucked a whole lot less. That syntax can drink; that syntax has graduated college; that syntax can rent a car without an additional surcharge.

@tabatkins
Member

or even <string> to <string>

Playing with it a bit myself, unfortunately I think we'd be well-served by using a separator token with strong LTR directionality like to.

If you're trying to denote a range from U+062E (خ) to U+0630 (ذ), you get the following results with a weak directionality vs strong directionality separator:

range("خ" to "ذ")

range("خ", "ذ")

The above two strings are exactly identical save for the separator used, but the bidi algorithm makes the second look like it's in the wrong order.
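
The difference between the two separators can be checked with Python's standard `unicodedata` module, which reports each character's Unicode bidirectional class. This is only a sketch of the underlying reason: it inspects character classes and does not run the bidi algorithm itself. The helper name `separator_classes` is hypothetical, and the Arabic letters are written as escapes so the source reads the same in any editor.

```python
import unicodedata

KHAH = "\u062E"  # خ, start of the range
THAL = "\u0630"  # ذ, end of the range

with_to = f'range("{KHAH}" to "{THAL}")'
with_comma = f'range("{KHAH}", "{THAL}")'


def separator_classes(text, first, second):
    """Return the bidi classes of the characters between `first` and `second`."""
    start = text.index(first) + len(first)
    end = text.index(second, start)
    return [unicodedata.bidirectional(ch) for ch in text[start:end]]


print(unicodedata.bidirectional(KHAH))             # AL (strong right-to-left)
print(separator_classes(with_to, KHAH, THAL))      # ['ON', 'WS', 'L', 'L', 'WS', 'ON']
print(separator_classes(with_comma, KHAH, THAL))   # ['ON', 'CS', 'WS', 'ON']
```

Only the `to` variant places strong left-to-right characters (class `L`) between the two Arabic letters; the quote, comma, and space are all neutral or weak, so nothing stops the two letters from being laid out as a single right-to-left run.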

@svgeesus
Contributor

My abject apologies, once again for the unicode-range syntax.

[image]

"Put it in for now, Chris, until we come up with something better" -- Håkon Wium Lie, spring 1997

On the other hand, at least it wasn't the worst syntax proposed. Feast your eyes on the hex-encoded BMP bitmap:

unicode-range: 0x02037FBC4571000003100C000000100010000300BDF74300000000000

@aphillips
Contributor

That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.

I'll remind Martin and Misha that they missed one. 😆

6 participants