[css‑fonts‑4] Create keywords for `unicode‑range` #4573

ExE-Boss · 2019-12-08T10:47:10Z

https://drafts.csswg.org/css-fonts/#unicode-range-desc

Inspired by @Crissov’s comment in #2855 (comment):

emoji would indeed make more sense in a font descriptor than as a font family.
@font-face {
  font-family: Twemoji;
  unicode-range: emoji;
} 

emoji would be a new <urange> keyword equivalent to enumerating all the Unicode codepoints where emoji reside.


Crissov · 2019-12-08T11:41:05Z

Alternatively, a list of ISO 15924 script codes could be allowed for the unicode-range font descriptor. The standard provides the special codes Zsye and 993 for emojis. Thatʼs more versatile and less spec maintenance work than keeping a custom list of keywords in CSS Fonts, but it is also less readable.

PS: #1744 was somewhat similar, requesting a language or lang descriptor, but scripts make more sense.

svgeesus · 2019-12-09T18:01:46Z

Adding script names or language names is a recurring request, and we do need to address it at some point. I agree with @Crissov that scripts make more sense than languages.

For example, this is both cumbersome and fragile against future additions:

@font-face {
  font-family: 'Headings';
  src: url(fonts/Japanese.woff);
  unicode-range: U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F;
  /* yen, kanji, hiragana, katakana */
}
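
To make concrete what a list like this expands to, here is a minimal Python sketch that parses a `unicode-range` value into numeric intervals, including the `?` wildcard form. The function name `parse_unicode_range` is hypothetical, and the validation is a simplification rather than the exact grammar from the spec.

```python
import re

# Hypothetical helper, not part of any CSS implementation.
TOKEN = re.compile(r"[Uu]\+([0-9A-Fa-f?]{1,6})(?:-([0-9A-Fa-f]{1,6}))?")

def parse_unicode_range(value: str) -> list[tuple[int, int]]:
    """Parse a unicode-range value into sorted, merged (start, end) intervals."""
    intervals = []
    for token in value.split(","):
        token = token.strip()
        m = TOKEN.fullmatch(token)
        if not m:
            raise ValueError(f"not a valid range token: {token!r}")
        first, last = m.groups()
        if "?" in first:
            # Wildcards must trail the hex digits and cannot be combined with "-".
            if last is not None or not re.fullmatch(r"[0-9A-Fa-f]*\?+", first):
                raise ValueError(f"bad wildcard range: {token!r}")
            start = int(first.replace("?", "0"), 16)
            end = int(first.replace("?", "F"), 16)
        else:
            start = int(first, 16)
            end = int(last, 16) if last else start
        if start > end or end > 0x10FFFF:
            raise ValueError(f"range out of order or out of bounds: {token!r}")
        intervals.append((start, end))

    # Merge overlapping or adjacent intervals.
    intervals.sort()
    merged: list[tuple[int, int]] = []
    for s, e in intervals:
        if merged and s <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

ranges = parse_unicode_range("U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F")
for s, e in ranges:
    print(f"U+{s:04X}..U+{e:04X}")
```

Note that `U+30??` expands to the whole of U+3000 through U+30FF, so the wildcard also pulls in everything below the kana in that run, which is part of why hand-written lists are easy to get subtly wrong.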

I agree with @Crissov that using an existing list, provided it is well maintained, well documented and readily available, is much better than getting into the business of script or language registries.

It isn't clear to me that ISO 15924:2004 defines which characters are included in each script. I wasn't keen to spend the CHF 68 to find out. Anyone know? Or is that all contained in the registry?

Unicode® Standard Annex #24 Unicode Script Property is online, and freely available, and appears to be a superset of ISO 15924.

I'm happy that the registry is online and that Unicode is the registration authority. That at least means that ISO and Unicode are striving to be in alignment here (with a few exceptions, like Fraktur and Gaelic being distinct in ISO 15924 and unified in Unicode UAX 24).

I plan to reach out to the maintainers of the registry to confirm the exact status.

svgeesus · 2019-12-09T18:07:54Z

Hmm. From the registry

Hira;410;Hiragana;hiragana;Hiragana;1.1;2004-05-01

That says that Hiragana exists, but not which code points are covered.

svgeesus · 2019-12-09T18:20:05Z

The complete list is in the Scripts file. For example

# ================================================

3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
309D..309E    ; Hiragana # Lm   [2] HIRAGANA ITERATION MARK..HIRAGANA VOICED ITERATION MARK
309F          ; Hiragana # Lo       HIRAGANA DIGRAPH YORI
1B001..1B11E  ; Hiragana # Lo [286] HIRAGANA LETTER ARCHAIC YE..HENTAIGANA LETTER N-MU-MO-2
1B150..1B152  ; Hiragana # Lo   [3] HIRAGANA LETTER SMALL WI..HIRAGANA LETTER SMALL WO
1F200         ; Hiragana # So       SQUARE HIRAGANA HOKA

# Total code points: 379

# ================================================
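
As a rough illustration of how a script keyword could be expanded from this file, the Python sketch below parses lines in the Scripts.txt format and emits the equivalent `unicode-range` list. The helper names are hypothetical, and the sample data is just the Hiragana excerpt above rather than the full file.

```python
# Excerpt in Scripts.txt format, copied from the Hiragana section quoted above.
SAMPLE = """\
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
309D..309E    ; Hiragana # Lm   [2] HIRAGANA ITERATION MARK..HIRAGANA VOICED ITERATION MARK
309F          ; Hiragana # Lo       HIRAGANA DIGRAPH YORI
1B001..1B11E  ; Hiragana # Lo [286] HIRAGANA LETTER ARCHAIC YE..HENTAIGANA LETTER N-MU-MO-2
1B150..1B152  ; Hiragana # Lo   [3] HIRAGANA LETTER SMALL WI..HIRAGANA LETTER SMALL WO
1F200         ; Hiragana # So       SQUARE HIRAGANA HOKA
"""

def script_ranges(lines, script):
    """Collect merged (start, end) intervals whose Script value equals `script`."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        cps, value = (part.strip() for part in data.split(";"))
        if value != script:
            continue
        first, _, last = cps.partition("..")
        found.append((int(first, 16), int(last or first, 16)))

    found.sort()
    merged = []
    for s, e in found:
        if merged and s <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def to_unicode_range(ranges):
    """Format intervals using the unicode-range descriptor syntax."""
    return ", ".join(
        f"U+{s:X}" if s == e else f"U+{s:X}-{e:X}" for s, e in ranges
    )

hiragana = script_ranges(SAMPLE.splitlines(), "Hiragana")
print(to_unicode_range(hiragana))
print(sum(e - s + 1 for s, e in hiragana), "code points")
```

The total computed from the excerpt agrees with the "Total code points: 379" comment in the file.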

faceless2 · 2019-12-09T19:00:06Z

Using the Unicode Script property as shorthand for ranges is a really, really good idea. More intuitive, less error prone, less verbose, and it's a public list that's already baked into CSS implementations - the Unicode Script property is already referenced by css-text-3. Thumbs up to this whole issue.

jfkthame · 2020-01-22T14:04:57Z

One issue with the Unicode Script property is the characters that have Script=Inherited (generally diacritics) or Script=Common (mostly punctuation)... authors might be surprised at things that don't get included by a naïve Script code because they're actually shared by a couple of scripts and so ended up being assigned Script=Common instead of the "expected" script.

As a trivial example: Script=Devanagari would (perhaps unexpectedly) exclude the punctuation marks DEVANAGARI DANDA and DEVANAGARI DOUBLE DANDA, despite their apparently script-specific names, because Scripts.txt has

0964..0965    ; Common # Po   [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA

So perhaps ranges should also take account of whatever appears in the Unicode ScriptExtensions list, which would handle this:

0964          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DANDA
0965          ; Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Limb Mahj Mlym Nand Orya Sind Sinh Sylo Takr Taml Telu Tirh # Po       DEVANAGARI DOUBLE DANDA

This would be more useful than using just the simple Script property, IMO.
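
A small Python sketch of the difference, using only the DANDA entries quoted above plus two ordinary letters for contrast. The table contents and function names are illustrative assumptions; a real implementation would load the full Scripts.txt and ScriptExtensions.txt data. Code points with no explicit ScriptExtensions entry fall back to their Script value.

```python
# Tiny hand-built tables (assumed sample data, not the full UCD).
SCRIPT = {
    0x0041: "Latn",  # LATIN CAPITAL LETTER A
    0x0915: "Deva",  # DEVANAGARI LETTER KA
    0x0964: "Zyyy",  # DEVANAGARI DANDA, Script=Common
    0x0965: "Zyyy",  # DEVANAGARI DOUBLE DANDA, Script=Common
}

SCRIPT_EXTENSIONS = {
    0x0964: frozenset(
        "Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Mahj Mlym Nand "
        "Orya Sind Sinh Sylo Takr Taml Telu Tirh".split()
    ),
    0x0965: frozenset(
        "Beng Deva Dogr Gong Gonm Gran Gujr Guru Knda Limb Mahj Mlym Nand "
        "Orya Sind Sinh Sylo Takr Taml Telu Tirh".split()
    ),
}

def script_extensions(cp):
    """Explicit Script_Extensions set, defaulting to the single Script value."""
    return SCRIPT_EXTENSIONS.get(cp, frozenset({SCRIPT[cp]}))

def matches_sc(cp, code):
    """Naive test: the Script property equals the requested code."""
    return SCRIPT[cp] == code

def matches_scx(cp, code):
    """Extension-aware test: Script_Extensions contains the requested code."""
    return code in script_extensions(cp)

for cp in (0x0915, 0x0964, 0x0965):
    print(f"U+{cp:04X} sc=Deva: {matches_sc(cp, 'Deva')}  scx=Deva: {matches_scx(cp, 'Deva')}")
```

With the naive Script test, both DANDA characters drop out of a Devanagari range; with the extension-aware test they are included, and they would equally be included for Bengali, Gujarati, and the other listed scripts.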

litherum · 2020-01-22T15:59:55Z

unicode-range: emoji; is probably not what you want, because modern emoji are combining strings that include code points which aren't actually emoji characters (like ZWJ)
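
To illustrate, the Python sketch below breaks one ZWJ sequence into its code points using the standard `unicodedata` module. The choice of sequence (man, heart, man) is only an example picked for illustration.

```python
import unicodedata

# "Couple with heart: man, man", written with escapes so the source stays ASCII.
SEQUENCE = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"

def code_points(text):
    """Return the code points that make up a string, in order."""
    return [ord(ch) for ch in text]

for cp in code_points(SEQUENCE):
    # Fall back to a placeholder if the local Unicode database lacks a name.
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X}  {name}")
```

Of the six code points, only the two U+1F468 are obviously "emoji"; the joiners and the variation selector are general-purpose characters that a font would still need to be matched against for the sequence to render as a single glyph.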

css-meeting-bot · 2020-01-23T15:20:32Z

The CSS Working Group just discussed Add ISO 15924 script codes to unicode-range, and agreed to the following:

RESOLVED: we are going to create keywords for unicode ranges

The full IRC log of that discussion

<stantonm> topic: Add ISO 15924 script codes to unicode-range
<astearns> github: https://github.com//issues/4573
<stantonm> myles: unicode-range takes bunch of code-points
<dbaron> the addition of those two agenda items was https://wiki.csswg.org/planning/galicia-2020?do=diff&rev2%5B0%5D=1569210305&rev2%5B1%5D=1570141384&difftype=sidebyside
<stantonm> ... bad for a couple reasons, lots of numbers and not clear what they mean
<stantonm> ... also when adding some like emoji, you can list all unicode points - but it changes over time
<stantonm> ... proposal to add keyword that lets the browsers define the code points
<stantonm> florian: what are the keywords
<stantonm> myles: issue says use pull keywords from ISO
<stantonm> hober: we shouldn't define these things, reference something in unicode
<stantonm> myles: different languages use some common code points
<stantonm> ... keywords shouldn't be a partition, there will be overlaps
<stantonm> ... space character will be in most of them
<stantonm> fantasai: two factors, script extensions list - some of these are assigned to common script
<stantonm> ... we should be looking up script extensions
<stantonm> ... other case is super common things - numbers, space, etc
<stantonm> ... alot of things assigned to common script
<stantonm> ... probably makes sense to include common by default, but have opt out
<stantonm> myles: we should resolve that we would like keywords, but not resolve on the actual keywords
<stantonm> fantasai: we should rely on iso
<stantonm> faceless2: rely on existing registry
<stantonm> astearns: should we have everything in the registry
<stantonm> heycam: do the names in the registry match normal css conventions?
<stantonm> TabAtkins: looks like no?
<stantonm> fantasai: should be a list of keywords 4 chars long
<faceless2> https://www.unicode.org/Public/12.1.0/ucd/Scripts.txt
<astearns> `Zsye 993: Emoji`
<stantonm> TabAtkins: if we're confident they are 4 letters, we can take directly
<stantonm> fantasai: think that should be fine, they need to maintain compat
<faceless2> example values : "Hebrew", "Devanagari", "Common"
<stantonm> myles: we may get it wrong, can we tentatively resolve to try something out first
<stantonm> florian: go with 4 letter name of long name? or not deciding
<stantonm> faceless2: where did four letter name come from?
<stantonm> florian: long name has hyphens, 4 letter is defined somewhere else
<stantonm> TabAtkins: casing shouldn't be important
<dbaron> The 4 letter script codes are always letters and come from ISO15924: https://tools.ietf.org/html/rfc5646#section-2.2.3
<stantonm> astearns: leave it to the fonts editors to define what keywords we pull, don't need to resolve on that now
<stantonm> myles: I'll also contact unicode
<stantonm> jfkthame: should there also be exclusion values?
<stantonm> hober: if you could exclude a range, you could exclude common range
<stantonm> myles: be careful we don't turn this into a full language
<stantonm> chris: even if you do a good job, when unicode adds new values you may unintentionally exclude things
<stantonm> ... shift burden of defining onto external body
<dbaron> also see https://unicode.org/iso15924/iso15924-codes.html
<stantonm> RESOLVED: we are going to create keywords for unicode ranges
<dbaron> "Zsye" is for Emoji, I think :-/
<dbaron> I think that's a little unfortunate.

Crissov · 2020-01-24T10:52:41Z

/cc @markusicu

markusicu · 2020-01-24T17:49:13Z

Hi, I got cc'ed here...

As I think you found, ISO 15924 does not define which characters have which script. Use the Unicode properties sc=Script and scx=Script_Extensions for that. scx=Deva should be implemented as "set of code points whose Script_Extensions contain Deva", see UTS 18 (regex spec).

For emoji, there are several properties you could look at: http://www.unicode.org/reports/tr51/#Emoji_Properties
(Unicode 13 will hoist all of these into the UCD proper.)

Elsewhere in UTS 51 you can also find regexes for well-formed emoji sequences.

ICU has API to get the emoji character properties (per code point, or as a UnicodeSet).

FYI I work on Unicode/CLDR/ICU and am the current 15924 registrar.

duerst · 2020-01-25T09:24:37Z

The CSS Working Group just discussed Add ISO 15924 script codes to unicode-range, and agreed to the following:

RESOLVED: we are going to create keywords for unicode ranges

If this is about ranges, it may make sense to consider blocks instead of (or in addition to) scripts. Blocks don't have an ISO standard, they are directly defined by Unicode. There are some overlaps between script names and block names; some regexp engines use e.g. 'hiragana' for the hiragana script, and 'in_hiragana' for the hiragana block. In many cases, there is more than one block for a script. Block data is available in the Blocks file. There are e.g. 8 blocks with the term 'Latin' in their name. There are also cases where characters are not in a block that carries the name of their script. For example, the three blocks
1B000..1B0FF; Kana Supplement
1B100..1B12F; Kana Extended-A
1B130..1B16F; Small Kana Extension
may contain both katakana and hiragana (and other related characters).
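
As a quick check of how these blocks line up with script data, the Python sketch below intersects the three block ranges with the two supplementary-plane Hiragana ranges quoted from Scripts.txt earlier in this thread. The names are hypothetical, and the data is limited to what is quoted here.

```python
# Block ranges as listed above.
BLOCKS = {
    "Kana Supplement": (0x1B000, 0x1B0FF),
    "Kana Extended-A": (0x1B100, 0x1B12F),
    "Small Kana Extension": (0x1B130, 0x1B16F),
}

# Supplementary-plane Hiragana ranges from the Scripts.txt excerpt in this thread.
HIRAGANA = [(0x1B001, 0x1B11E), (0x1B150, 0x1B152)]

def overlap(a, b):
    """Number of code points shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def hiragana_per_block():
    """Map block name to (hiragana code points, block size)."""
    result = {}
    for name, block in BLOCKS.items():
        count = sum(overlap(block, r) for r in HIRAGANA)
        result[name] = (count, block[1] - block[0] + 1)
    return result

for name, (count, size) in hiragana_per_block().items():
    print(f"{name}: {count} of {size} code points have Script=Hiragana")
```

On this data, one contiguous Hiragana run straddles two blocks, and none of the three blocks consists purely of Hiragana code points.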

markusicu · 2020-01-27T06:25:25Z

Unicode blocks are usually not very useful. They are an artifact of the character assignment process and history and are not designed to fit any other purpose. Multiple blocks for one script is one problem (and growing). Blocks also include unassigned code points, and sometimes unrelated characters. That's why the Script and Script_Extensions properties are generally recommended and used.

There are also cases where characters are not in a block that carries the name of their script.

FYI Outside of the ISO script code, "Kana" refers to both Hiragana and Katakana. https://en.wikipedia.org/wiki/Kana

svgeesus · 2024-04-15T17:59:55Z

From Use CSS to boost the font size of emoji with no extra markup by Terence Eden @edent

Emoji codepoints are complicated - especially when it comes to combining characters. You can see a full list of every sequence in Unicode 15.1. There are currently 3,782 different emoji.

There was some talk of using named ranges but that doesn't seem to have gone anywhere. So, instead, I've extracted all the Emoji codepoints and manually grouped them. It's a pretty long sequence, and I'm sure I've made a few mistakes.

svgeesus · 2024-04-15T19:02:01Z

So (re-reading) we have some concrete suggestions

Use the Unicode properties sc=Script and scx=Script_Extensions for that. scx=Deva should be implemented as "set of code points whose Script_Extensions contain Deva", see UTS 18 (regex spec).
For emoji, there are several properties you could look at: http://www.unicode.org/reports/tr51/#Emoji_Properties
(Unicode 13 will hoist all of these into the UCD proper.)
Include Common by default, but have opt out
Scripts are defined in Scripts.txt
Script Extensions are defined in ScriptExtensions.txt

jfkthame · 2024-04-15T19:12:00Z

From Use CSS to boost the font size of emoji with no extra markup by Terence Eden @edent

Emoji codepoints are complicated - especially when it comes to combining characters. You can see a full list of every sequence in Unicode 15.1. There are currently 3,782 different emoji.

There was some talk of using named ranges but that doesn't seem to have gone anywhere. So, instead, I've extracted all the Emoji codepoints and manually grouped them. It's a pretty long sequence, and I'm sure I've made a few mistakes.

Neat, though it's difficult to make it really comprehensive because of dual-use characters that might or might not be part of an emoji, depending on context. E.g. 👨‍❤️‍👨 doesn't currently work in the example because it uses U+2764 HEAVY BLACK HEART to connect the two people emojis, but U+2764 is also a non-emoji dingbat in its own right, so didn't get listed in the emoji unicode-range. Adding it there should make 👨‍❤️‍👨 work, but could also disrupt any non-emoji use of the symbol. Similarly for the U+2620 SKULL AND CROSSBONES used in 🏴‍☠️.

So exactly what "unicode-range: emoji" ought to encompass is tricky -- and maybe not really solvable within a unicode-range model that simply partitions codepoints into "included" vs "excluded".
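
A minimal Python sketch of that tension, using a made-up range list (`HYPOTHETICAL_EMOJI_RANGES` is an assumption for illustration, not @edent's actual list nor any proposed definition of an `emoji` keyword):

```python
# Hypothetical "emoji" range: ZWJ, VS16, and one large pictograph run.
HYPOTHETICAL_EMOJI_RANGES = [
    (0x200D, 0x200D),    # ZERO WIDTH JOINER
    (0xFE0F, 0xFE0F),    # VARIATION SELECTOR-16
    (0x1F300, 0x1FAFF),  # broad run containing many pictographs
]

COUPLE = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"  # man, heart, man
PIRATE_FLAG = "\U0001F3F4\u200D\u2620\uFE0F"             # black flag + skull

def uncovered(text, ranges):
    """Code points of `text` that fall outside every (start, end) interval."""
    return [
        cp for cp in map(ord, text)
        if not any(start <= cp <= end for start, end in ranges)
    ]

print([f"U+{cp:04X}" for cp in uncovered(COUPLE, HYPOTHETICAL_EMOJI_RANGES)])
print([f"U+{cp:04X}" for cp in uncovered(PIRATE_FLAG, HYPOTHETICAL_EMOJI_RANGES)])

# Widening the range fixes the sequences, but now a bare U+2764 dingbat
# in running text is also claimed by the emoji font.
WIDENED = HYPOTHETICAL_EMOJI_RANGES + [(0x2620, 0x2620), (0x2764, 0x2764)]
print(uncovered("\u2764", WIDENED) == [])
```

Because the check is made per code point, there is no way in this model to say "U+2764 only when it is part of a sequence".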

(@edent fyi, you might like to include U+E0060-E007F in your unicode-range, to make more of the flags work in Chrome.)

svgeesus · 2024-05-06T19:26:12Z

Title changed to reflect the CSS WG resolution and clarify that emoji is just one of the use cases.

This would make a great topic for a TPAC joint session with I18n Core WG

ExE-Boss changed the title ~~[css]~~ [css‑fonts‑4] Add emoji keyword to unicode‑range Dec 8, 2019

ExE-Boss mentioned this issue Dec 8, 2019

[css-fonts-4] Should emoji be restricted in its range? #2855

Closed

ExE-Boss changed the title ~~[css‑fonts‑4] Add emoji keyword to unicode‑range~~ [css‑fonts‑4] Add emoji as a keyword to unicode‑range Dec 8, 2019

svgeesus self-assigned this Dec 9, 2019

svgeesus added the css-fonts-4, i18n-tracker, Needs Data, and Needs Design / Proposal labels Dec 9, 2019

xfq mentioned this issue Dec 10, 2019

[css‑fonts‑4] Add emoji as a keyword to unicode‑range w3c/i18n-activity#829

Open

svgeesus added the Agenda+ F2F label Jan 22, 2020

mozilla-apprentice mentioned this issue Jan 23, 2020

[css‑fonts‑4] Add emoji as a keyword to unicode‑range mozilla/wg-decisions#181

Closed

astearns removed the Agenda+ F2F label Jan 23, 2020

mozilla-apprentice mentioned this issue Jan 25, 2020

[css‑fonts‑4] Add emoji as a keyword to unicode‑range mozilla/wg-decisions#204

Open

JRaspass mentioned this issue Aug 17, 2020

unicode-range could be minified tdewolff/minify#321

Closed

LeaVerou added the Closed Accepted by CSSWG Resolution and Needs Edits labels and removed the Needs Data and Needs Design / Proposal labels Oct 19, 2022

LeaVerou mentioned this issue Oct 19, 2022

[css-fonts-5] Make unicode-range syntax suck less #7921

Open

svgeesus added a commit that referenced this issue Oct 26, 2022

[css-fonts-4] Add emoji as a keyword to unicode‑range #4573

1853cad

svgeesus added a commit that referenced this issue Oct 26, 2022

Revert "[css-fonts-4] Add emoji as a keyword to unicode‑range #4573"

b23e298

This reverts commit 1853cad.

svgeesus removed the Needs Edits label Sep 19, 2023

svgeesus added the Needs Design / Proposal label Apr 15, 2024

svgeesus changed the title ~~[css‑fonts‑4] Add emoji as a keyword to unicode‑range~~ [css‑fonts‑4] Create keywords for unicode‑range May 6, 2024

svgeesus added the Agenda+ TPAC and Agenda+ i18n labels May 6, 2024

tsm-odoo mentioned this issue Jun 19, 2024

[IMP] mail: message body emoji are bigger odoo/odoo#169699

Closed

brianjlacy mentioned this issue Aug 1, 2024

[css-fonts-4] Suggestion: Support Unicode Character Sequences in unicode-range #10651

Open
