[css‑fonts‑4] Create keywords for unicode‑range #4573

Open
ExE-Boss opened this issue Dec 8, 2019 · 16 comments
Labels: Agenda+ i18n (Add to agenda for CSS-i18n calls); Agenda+ TPAC; Closed Accepted by CSSWG Resolution; css-fonts-4 (Current Work); i18n-tracker (Group bringing to attention of Internationalization, or tracked by i18n but not needing response); Needs Design / Proposal

Comments

@ExE-Boss
Contributor

ExE-Boss commented Dec 8, 2019

https://drafts.csswg.org/css-fonts/#unicode-range-desc

Inspired by @Crissov’s comment in #2855 (comment):

emoji would indeed make more sense in a font descriptor than as a font family.

@font-face {
  font-family: Twemoji;
  unicode-range: emoji;
} 

emoji would be a new <urange> keyword equivalent to enumerating all the Unicode codepoints where emoji reside.

@ExE-Boss ExE-Boss changed the title [css] [css‑fonts‑4] Add emoji keyword to unicode‑range Dec 8, 2019
@ExE-Boss ExE-Boss changed the title [css‑fonts‑4] Add emoji keyword to unicode‑range [css‑fonts‑4] Add emoji as a keyword to unicode‑range Dec 8, 2019
@Crissov
Contributor

Crissov commented Dec 8, 2019

Alternatively, a list of ISO 15924 script codes could be allowed for the unicode-range font descriptor. The standard provides the special codes Zsye and 993 for emojis. Thatʼs more versatile and less spec maintenance work than keeping a custom list of keywords in CSS Fonts, but it is also less readable.

PS: #1744 was somewhat similar, requesting a language or lang descriptor, but scripts make more sense.

@svgeesus
Contributor

svgeesus commented Dec 9, 2019

Adding script names or language names is a recurring request, and we do need to address it at some point. I agree with @Crissov that scripts make more sense than languages.

For example, this is both cumbersome and fragile against future additions:

@font-face {
  font-family: 'Headings';
  src: url(fonts/Japanese.woff);
  unicode-range: U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F;
  /* yen, kanji, hiragana, katakana */
}
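
To make concrete what that declaration actually covers, here is a minimal Python sketch (not part of the original comment) that expands each unicode-range token into an inclusive code point interval. The function name `parse_urange` is made up for illustration, and the parser is a simplification of the CSS grammar: it only handles single code points, explicit ranges, and trailing `?` wildcards.

```python
import re

# Simplified token grammar: U+XXXX, U+XXXX-YYYY, or hex digits followed by '?' wildcards.
_TOKEN = re.compile(
    r"[Uu]\+(?:(?P<start>[0-9A-Fa-f]{1,6})(?:-(?P<end>[0-9A-Fa-f]{1,6}))?"
    r"|(?P<wild>[0-9A-Fa-f]{0,5}\?{1,6}))"
)

def parse_urange(token):
    """Return the inclusive (start, end) code points for one unicode-range token."""
    m = _TOKEN.fullmatch(token.strip())
    if m is None or (m["wild"] and len(m["wild"]) > 6):
        raise ValueError(f"not a unicode-range token: {token!r}")
    if m["wild"]:
        # Each '?' stands for any hex digit, so it spans 0..F in that position.
        low = int(m["wild"].replace("?", "0"), 16)
        high = int(m["wild"].replace("?", "F"), 16)
        return low, high
    start = int(m["start"], 16)
    end = int(m["end"], 16) if m["end"] else start
    return start, end

DECLARATION = "U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F"
ranges = [parse_urange(token) for token in DECLARATION.split(",")]
for start, end in ranges:
    print(f"U+{start:04X}..U+{end:04X}  ({end - start + 1} code points)")
```

The wildcard `U+30??` expands to the whole span U+3000..U+30FF, which is why a single token is annotated as covering both hiragana and katakana.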

I agree with @Crissov that using an existing list, provided it is well maintained, well documented and readily available, is much better than getting into the business of script or language registries.

It isn't clear to me that ISO 15924:2004 defines which characters are included in each script. I wasn't keen to spend the CHF 68 to find out. Anyone know? Or is that all contained in the registry?

Unicode® Standard Annex #24 Unicode Script Property is online, and freely available, and appears to be a superset of ISO 15924.

I'm happy that the registry is online and that Unicode is the registration authority. That at least means that ISO and Unicode are striving to be in alignment here (with a few exceptions, like Fraktur and Gaelic being distinct in ISO 15924 and unified in Unicode UAX #24).

I plan to reach out to the maintainers of the registry to confirm the exact status.

@svgeesus svgeesus self-assigned this Dec 9, 2019
@svgeesus svgeesus added css-fonts-4 Current Work i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. Needs Data Needs Design / Proposal labels Dec 9, 2019
@svgeesus
Contributor

svgeesus commented Dec 9, 2019

Hmm. From the registry

Hira;410;Hiragana;hiragana;Hiragana;1.1;2004-05-01

That says that Hiragana exists, but not which code points are covered.

@svgeesus
Contributor

svgeesus commented Dec 9, 2019

The complete list is in the Scripts file. For example

# ================================================

3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
309D..309E    ; Hiragana # Lm   [2] HIRAGANA ITERATION MARK..HIRAGANA VOICED ITERATION MARK
309F          ; Hiragana # Lo       HIRAGANA DIGRAPH YORI
1B001..1B11E  ; Hiragana # Lo [286] HIRAGANA LETTER ARCHAIC YE..HENTAIGANA LETTER N-MU-MO-2
1B150..1B152  ; Hiragana # Lo   [3] HIRAGANA LETTER SMALL WI..HIRAGANA LETTER SMALL WO
1F200         ; Hiragana # So       SQUARE HIRAGANA HOKA

# Total code points: 379

# ================================================
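
As a rough illustration of how a script keyword could be resolved from that file, the Python sketch below parses the excerpt quoted above and turns it into a unicode-range value. It is only a sketch: the helper names `script_ranges` and `to_unicode_range` are invented here, the data is just the six lines from this comment rather than the full Scripts.txt, and nothing in the thread says an implementation would work this way.

```python
import re

# The Hiragana lines quoted in the comment above, copied verbatim.
SCRIPTS_EXCERPT = """\
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
309D..309E    ; Hiragana # Lm   [2] HIRAGANA ITERATION MARK..HIRAGANA VOICED ITERATION MARK
309F          ; Hiragana # Lo       HIRAGANA DIGRAPH YORI
1B001..1B11E  ; Hiragana # Lo [286] HIRAGANA LETTER ARCHAIC YE..HENTAIGANA LETTER N-MU-MO-2
1B150..1B152  ; Hiragana # Lo   [3] HIRAGANA LETTER SMALL WI..HIRAGANA LETTER SMALL WO
1F200         ; Hiragana # So       SQUARE HIRAGANA HOKA
"""

# "START..END ; Script" or "CODEPOINT ; Script"; everything after '#' is a comment.
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")

def script_ranges(text, script):
    """Collect the inclusive ranges assigned to `script`, merging adjacent ones."""
    ranges = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m or m.group(3) != script:
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        ranges.append((start, end))
    ranges.sort()
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def to_unicode_range(ranges):
    """Format ranges as a CSS unicode-range value."""
    return ", ".join(f"U+{s:X}" if s == e else f"U+{s:X}-{e:X}" for s, e in ranges)

hiragana = script_ranges(SCRIPTS_EXCERPT, "Hiragana")
total = sum(end - start + 1 for start, end in hiragana)
print(to_unicode_range(hiragana))
print(total)
```

The total matches the "Total code points: 379" line in the excerpt, and the adjacent entries 309D..309E and 309F collapse into a single range.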

@faceless2

Using the Unicode Script property as shorthand for ranges is a really, really good idea. More intuitive, less error prone, less verbose, and it's a public list that's already baked into CSS implementations - the Unicode Script property is already referenced by css-text-3. Thumbs up to this whole issue.

@jfkthame
Contributor

One issue with the Unicode Script property is the characters that have Script=Inherited (generally diacritics) or Script=Common (mostly punctuation)... authors might be surprised at things that don't get included by a naïve Script code because they're actually shared by a couple of scripts and so ended up being assigned Script=Common instead of the "expected" script.

As a trivial example: Script=Devanagari would (perhaps unexpectedly) exclude the punctuation marks DEVANAGARI DANDA and DEVANAGARI DOUBLE DANDA, despite their apparently script-specific names, because Scripts.txt has

0964..0965    ; Common # Po   [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA

So perhaps ranges should also take account of whatever appears in the Unicode ScriptExtensions list, which would handle this:

0964          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DANDA
0965          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Limb Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DOUBLE DANDA

This would be more useful than using just the simple Script property, IMO.
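
The difference between the two lookups can be sketched in a few lines of Python. This is an illustration only: the tables hold just the two danda entries quoted above plus one letter (U+0915 DEVANAGARI LETTER KA, which is not quoted in this thread and was added for contrast), the helper names are made up, and script values are written as ISO 15924 codes ("Zyyy" being the code for Common).

```python
# Script property (sc), as four-letter codes. Only a tiny hand-picked sample.
SCRIPT = {
    0x0915: "Deva",  # DEVANAGARI LETTER KA (added for contrast, not from the thread)
    0x0964: "Zyyy",  # DEVANAGARI DANDA, Script=Common per Scripts.txt
    0x0965: "Zyyy",  # DEVANAGARI DOUBLE DANDA, Script=Common per Scripts.txt
}

# Script_Extensions (scx), copied from the ScriptExtensions.txt lines quoted above.
SCRIPT_EXTENSIONS = {
    0x0964: set("Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Mahj Mlym Nand "
                "Orya Sind Sinh Sylo Takr Taml Telu Tirh".split()),
    0x0965: set("Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Limb Mahj Mlym Nand "
                "Orya Sind Sinh Sylo Takr Taml Telu Tirh".split()),
}

def in_sc(cp, code):
    """True if the code point's Script property equals `code`."""
    return SCRIPT.get(cp) == code

def in_scx(cp, code):
    """True if the code point's Script_Extensions contain `code`.

    A code point without an explicit Script_Extensions entry falls back to
    its Script value.
    """
    return code in SCRIPT_EXTENSIONS.get(cp, {SCRIPT.get(cp)})

for cp in (0x0915, 0x0964, 0x0965):
    print(f"U+{cp:04X}  sc=Deva: {in_sc(cp, 'Deva')}  scx=Deva: {in_scx(cp, 'Deva')}")
```

With the plain Script property the two dandas drop out of a Devanagari range; with Script_Extensions they stay in, and they would also be picked up by every other script in their lists.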

@litherum
Contributor

litherum commented Jan 22, 2020

unicode-range: emoji; is probably not what you want, because modern emoji are combining strings that include code points which aren't actually emoji characters (like ZWJ)

@css-meeting-bot
Member

The CSS Working Group just discussed Add ISO 15924 script codes to unicode-range, and agreed to the following:

  • RESOLVED: we are going to create keywords for unicode ranges
The full IRC log of that discussion
<stantonm> topic: Add ISO 15924 script codes to unicode-range
<astearns> github: https://github.com//issues/4573
<stantonm> myles: unicode-range takes bunch of code-points
<dbaron> the addition of those two agenda items was https://wiki.csswg.org/planning/galicia-2020?do=diff&rev2%5B0%5D=1569210305&rev2%5B1%5D=1570141384&difftype=sidebyside
<stantonm> ... bad for a couple reasons, lots of numbers and not clear what they mean
<stantonm> ... also when adding some like emoji, you can list all unicode points - but it changes over time
<stantonm> ... proposal to add keyword that lets the browsers define the code points
<stantonm> florian: what are the keywords
<stantonm> myles: issue says use pull keywords from ISO
<stantonm> hober: we shouldn't define these things, reference something in unicode
<stantonm> myles: different languages use some common code points
<stantonm> ... keywords shouldn't be a partition, there will be overlaps
<stantonm> ... space character will be in most of them
<stantonm> fantasai: two factors, script extensions list - some of these are assigned to common script
<stantonm> ... we should be looking up script extensions
<stantonm> ... other case is super common things - numbers, space, etc
<stantonm> ... alot of things assigned to common script
<stantonm> ... probably makes sense to include common by default, but have opt out
<stantonm> myles: we should resolve that we would like keywords, but not resolve on the actual keywords
<stantonm> fantasai: we should rely on iso
<stantonm> faceless2: rely on existing registry
<stantonm> astearns: should we have everything in the registry
<stantonm> heycam: do the names in the registry match normal css conventions?
<stantonm> TabAtkins: looks like no?
<stantonm> fantasai: should be a list of keywords 4 chars long
<faceless2> https://www.unicode.org/Public/12.1.0/ucd/Scripts.txt
<astearns> `Zsye 993: Emoji`
<stantonm> TabAtkins: if we're confident they are 4 letters, we can take directly
<stantonm> fantasai: think that should be fine, they need to maintain compat
<faceless2> example values : "Hebrew", "Devanagari", "Common"
<stantonm> myles: we may get it wrong, can we tentatively resolve to try something out first
<stantonm> florian: go with 4 letter name of long name? or not deciding
<stantonm> faceless2: where did four letter name come from?
<stantonm> florian: long name has hyphens, 4 letter is defined somewhere else
<stantonm> TabAtkins: casing shouldn't be important
<dbaron> The 4 letter script codes are always letters and come from ISO15924: https://tools.ietf.org/html/rfc5646#section-2.2.3
<stantonm> astearns: leave it to the fonts editors to define what keywords we pull, don't need to resolve on that now
<stantonm> myles: I'll also contact unicode
<stantonm> jfkthame: should there also be exclusion values?
<stantonm> hober: if you could exclude a range, you could exclude common range
<stantonm> myles: be careful we don't turn this into a full language
<stantonm> chris: even if you do a good job, when unicode adds new values you may unintentionally exclude things
<stantonm> ... shift burden of defining onto external body
<dbaron> also see https://unicode.org/iso15924/iso15924-codes.html
<stantonm> RESOLVED: we are going to create keywords for unicode ranges
<dbaron> "Zsye" is for Emoji, I think :-/
<dbaron> I think that's a little unfortunate.

@Crissov
Contributor

Crissov commented Jan 24, 2020

/cc @markusicu

@markusicu

Hi, I got cc'ed here...

As I think you found, ISO 15924 does not define which characters have which script. Use the Unicode properties sc=Script and scx=Script_Extensions for that. scx=Deva should be implemented as "set of code points whose Script_Extensions contain Deva", see UTS 18 (regex spec).

For emoji, there are several properties you could look at: http://www.unicode.org/reports/tr51/#Emoji_Properties
(Unicode 13 will hoist all of these into the UCD proper.)

Elsewhere in UTS 51 you can also find regexes for well-formed emoji sequences.

ICU has API to get the emoji character properties (per code point, or as a UnicodeSet).

FYI I work on Unicode/CLDR/ICU and am the current 15924 registrar.

@duerst

duerst commented Jan 25, 2020

The CSS Working Group just discussed Add ISO 15924 script codes to unicode-range, and agreed to the following:

  • RESOLVED: we are going to create keywords for unicode ranges

If this is about ranges, it may make sense to consider blocks instead of (or in addition to) scripts. Blocks don't have an ISO standard; they are directly defined by Unicode. There are some overlaps between script names and block names; some regexp engines use e.g. 'hiragana' for the hiragana script, and 'in_hiragana' for the hiragana block. In many cases, there is more than one block for a script. Block data is available in the Blocks file. There are e.g. 8 blocks with the term 'Latin' in their name. There are also cases where characters are not in a block that carries the name of their script. For example, the three blocks
1B000..1B0FF; Kana Supplement
1B100..1B12F; Kana Extended-A
1B130..1B16F; Small Kana Extension
may contain both katakana and hiragana (and other related characters).
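
To put numbers on how blocks and scripts line up here, the following Python sketch intersects the three blocks listed above with the supplementary-plane Hiragana ranges from the Scripts.txt excerpt quoted earlier in this thread. It uses only data that appears in the thread; the helper name `overlap` and the `coverage` table are illustrative.

```python
# Block boundaries as listed in this comment.
BLOCKS = {
    "Kana Supplement": (0x1B000, 0x1B0FF),
    "Kana Extended-A": (0x1B100, 0x1B12F),
    "Small Kana Extension": (0x1B130, 0x1B16F),
}

# Script=Hiragana ranges in this area, from the Scripts.txt excerpt quoted earlier.
HIRAGANA = [(0x1B001, 0x1B11E), (0x1B150, 0x1B152)]

def overlap(a, b):
    """Number of code points shared by two inclusive ranges."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max(0, end - start + 1)

coverage = {}
for name, block in BLOCKS.items():
    size = block[1] - block[0] + 1
    hira = sum(overlap(block, r) for r in HIRAGANA)
    coverage[name] = (hira, size)
    print(f"{name}: {hira} of {size} code points are Script=Hiragana")
```

A single Script=Hiragana run (1B001..1B11E) straddles two blocks, and none of the three blocks consists solely of Hiragana code points, so a block keyword and a script keyword would select noticeably different sets.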

@markusicu

Unicode blocks are usually not very useful. They are an artifact of the character assignment process and history and are not designed to fit any other purpose. Multiple blocks for one script is one problem (and growing). Blocks also include unassigned code points, and sometimes unrelated characters. That's why the Script and Script_Extensions properties are generally recommended and used.

There are also cases where characters are not in a block that carries the name of their script.

FYI Outside of the ISO script code, "Kana" refers to both Hiragana and Katakana. https://en.wikipedia.org/wiki/Kana

@svgeesus
Contributor

From Use CSS to boost the font size of emoji with no extra markup by Terence Eden @edent

Emoji codepoints are complicated - especially when it comes to combining characters. You can see a full list of every sequence in Unicode 15.1. There are currently 3,782 different emoji.

There was some talk of using named ranges but that doesn't seem to have gone anywhere. So, instead, I've extracted all the Emoji codepoints and manually grouped them. It's a pretty long sequence, and I'm sure I've made a few mistakes.

@svgeesus
Contributor

So (re-reading) we have some concrete suggestions

  • Use the Unicode properties sc=Script and scx=Script_Extensions for that. scx=Deva should be implemented as "set of code points whose Script_Extensions contain Deva", see UTS 18 (regex spec).
  • For emoji, there are several properties you could look at: http://www.unicode.org/reports/tr51/#Emoji_Properties
    (Unicode 13 will hoist all of these into the UCD proper.)
  • Include Common by default, but have opt out
  • Scripts are defined in Scripts.txt
  • Script Extensions are defined in ScriptExtensions.txt
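
Taken together, the suggestions above could look roughly like the Python sketch below: membership is decided by Script_Extensions, and Common is added unless the author opts out. This is a hypothetical model, not a proposal from the thread: the function `keyword_code_points`, its `include_common` flag, and the five-entry table are all invented for illustration (the U+0964 list is the one quoted earlier; the other entries are well-known single-script or Common characters).

```python
# Tiny stand-in for per-code-point Script_Extensions data (four-letter codes).
# "Zyyy" is the ISO 15924 code for Common.
SCX = {
    0x0020: {"Zyyy"},  # SPACE
    0x0031: {"Zyyy"},  # DIGIT ONE
    0x05D0: {"Hebr"},  # HEBREW LETTER ALEF
    0x0915: {"Deva"},  # DEVANAGARI LETTER KA
    0x0964: set("Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Mahj Mlym Nand "
                "Orya Sind Sinh Sylo Takr Taml Telu Tirh".split()),  # DEVANAGARI DANDA
}

def keyword_code_points(script, include_common=True):
    """Code points a hypothetical script keyword would select.

    A code point is selected when its Script_Extensions contain `script`,
    or when it is Common and `include_common` is left on.
    """
    selected = set()
    for cp, scripts in SCX.items():
        if script in scripts or (include_common and "Zyyy" in scripts):
            selected.add(cp)
    return selected

with_common = keyword_code_points("Deva")
without_common = keyword_code_points("Deva", include_common=False)
print(sorted(f"U+{cp:04X}" for cp in with_common))
print(sorted(f"U+{cp:04X}" for cp in without_common))
```

In this model, opting out of Common drops the space and the digit but keeps the danda, because the danda is selected through its Script_Extensions rather than through its Script=Common value.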

@jfkthame
Contributor

From Use CSS to boost the font size of emoji with no extra markup by Terence Eden @edent

Emoji codepoints are complicated - especially when it comes to combining characters. You can see a full list of every sequence in Unicode 15.1. There are currently 3,782 different emoji.

There was some talk of using named ranges but that doesn't seem to have gone anywhere. So, instead, I've extracted all the Emoji codepoints and manually grouped them. It's a pretty long sequence, and I'm sure I've made a few mistakes.

Neat, though it's difficult to make it really comprehensive because of dual-use characters that might or might not be part of an emoji, depending on context. E.g. 👨‍❤️‍👨 doesn't currently work in the example because it uses U+2764 HEAVY BLACK HEART to connect the two people emojis, but U+2764 is also a non-emoji dingbat in its own right, so didn't get listed in the emoji unicode-range. Adding it there should make 👨‍❤️‍👨 work, but could also disrupt any non-emoji use of the symbol. Similarly for the U+2620 SKULL AND CROSSBONES used in 🏴‍☠️.

So exactly what "unicode-range: emoji" ought to encompass is tricky -- and maybe not really solvable within a unicode-range model that simply partitions codepoints into "included" vs "excluded".
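
The problem is easy to see by listing the code points of the two sequences mentioned above and checking them against a range. In this Python sketch, `NARROW_EMOJI_RANGES` is a deliberately simplistic, hypothetical stand-in for "an emoji-only unicode-range" (it is not the range from the blog post, nor any real definition of emoji), and the helper name `uncovered` is made up.

```python
# The two ZWJ sequences discussed above, written as escapes.
COUPLE_WITH_HEART = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"
PIRATE_FLAG = "\U0001F3F4\u200D\u2620\uFE0F"

# Hypothetical stand-in for an emoji-only range; real emoji data is far more scattered.
NARROW_EMOJI_RANGES = [(0x1F300, 0x1FAFF)]

def uncovered(text, ranges):
    """List the code points in `text` that fall outside every range."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if not any(start <= ord(ch) <= end for start, end in ranges)
    ]

missing_couple = uncovered(COUPLE_WITH_HEART, NARROW_EMOJI_RANGES)
missing_flag = uncovered(PIRATE_FLAG, NARROW_EMOJI_RANGES)
print(missing_couple)
print(missing_flag)
```

Each sequence contains a ZERO WIDTH JOINER (U+200D), a variation selector (U+FE0F) and a dual-use symbol (U+2764 or U+2620) that the narrow range misses, so a per-code-point range would have to take in those shared characters for the whole sequence to be covered.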

(@edent fyi, you might like to include U+E0060-E007F in your unicode-range, to make more of the flags work in Chrome.)

@svgeesus svgeesus changed the title [css‑fonts‑4] Add emoji as a keyword to unicode‑range [css‑fonts‑4] Create keywords for unicode‑range May 6, 2024
@svgeesus
Contributor

svgeesus commented May 6, 2024

Title changed to reflect the CSS WG resolution and clarify that emoji is just one of the use cases.

This would make a great topic for a TPAC joint session with I18n Core WG
