[css-text-4] Refine definition of ideographs #9503

frivoal · 2023-10-20T03:19:25Z

kojiishi · 2023-10-23T04:36:04Z

css-text-4/Overview.bs

+		<dd>Includes all [=typographic character units=] [[CSS-TEXT-3]]
+		whose base character:
+		* belongs to Unicode Letters [L*], Mark [M*], Symbols [S*], or Numbers [N*] [=general category=], and
+		* has the Han, Hiragana, or Katakana [=script property=], and


Characters removed by this definition

Characters added by this definition

From these two lists, I think:

Adding Kana extensions, Hentaigana, etc. is good.

U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed.

Should Circled Katakana probably be removed?

Squared Katakana words, I'm not sure. Ask at JLTF?

> U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed.

In css-text-4, we have this:

script property
Defined in Unicode Standard Annex #24 [UAX24] and given as the Script property in the Unicode Character Database [UAX44]. (UAs must include any ScriptExtensions.txt assignments in this mapping.)

Since U+30FC has the ScriptExtensions property set to Hiragana and Katakana, it is not removed.

But I think that our definition of "script property", which covers not just the Unicode Script property but also the Script Extensions property, is too easy to miss. Maybe, editorially, we should rename it to "script properties" or "script-related properties".
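To make the Script versus Script Extensions distinction concrete, here is a minimal sketch. It assumes small hand-typed excerpts in the layout of the UCD files Scripts.txt and ScriptExtensions.txt (not the full data), and the helper names `parse_ranges`, `script_only`, and `script_with_extensions` are hypothetical. The U+30FC values follow what is stated in this thread.

```python
from typing import List, Set, Tuple

# Hand-typed excerpts in the layout of Scripts.txt / ScriptExtensions.txt.
# Illustrative only: a real implementation would load the full UCD files.
SCRIPTS_EXCERPT = """\
3041..3096    ; Hiragana # Lo
30A1..30FA    ; Katakana # Lo
30FC          ; Common # Lm
"""
SCRIPT_EXTENSIONS_EXCERPT = """\
30FC          ; Hira Kana # Lm
"""

# ScriptExtensions.txt uses short script codes; map the two used here.
ABBREVIATIONS = {"Hira": "Hiragana", "Kana": "Katakana"}

Row = Tuple[int, int, List[str]]


def parse_ranges(text: str) -> List[Row]:
    """Parse 'XXXX..YYYY ; value(s) # comment' lines into (first, last, values)."""
    rows: List[Row] = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code_points, value = (part.strip() for part in line.split(";"))
        first, _, last = code_points.partition("..")
        rows.append((int(first, 16), int(last or first, 16), value.split()))
    return rows


SCRIPTS = parse_ranges(SCRIPTS_EXCERPT)
SCRIPT_EXTENSIONS = parse_ranges(SCRIPT_EXTENSIONS_EXCERPT)


def _lookup(rows: List[Row], cp: int) -> Set[str]:
    for first, last, values in rows:
        if first <= cp <= last:
            return {ABBREVIATIONS.get(v, v) for v in values}
    return set()


def script_only(cp: int) -> Set[str]:
    """Script property alone."""
    return _lookup(SCRIPTS, cp)


def script_with_extensions(cp: int) -> Set[str]:
    """Script Extensions where listed, otherwise fall back to Script."""
    return _lookup(SCRIPT_EXTENSIONS, cp) or _lookup(SCRIPTS, cp)
```

With the Script property alone, U+30FC resolves to Common and would fall outside a Han/Hiragana/Katakana test; with Script Extensions included, it resolves to both Hiragana and Katakana and stays in.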

> Should Circled Katakana probably be removed?

No strong opinion, from me. Why do you think they should be removed?

I agree with @kojiishi that the character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed. It is cl-10 in JLReq and its behavior for spacing is the same as Hiragana, Katakana and Kanji.

I believe circled Katakana should be treated the same as other circled letters, however it is currently a part of Kanji cl-19. We'll discuss this at JLReq TF.

Squared Katakana words are postfixed abbreviations (cl-13) in JLReq. They are part of symbols in the new simplified proposal. We'll revisit this in JLReq TF as well.

> the character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed

It is not being removed, so we're in agreement.

> Since U+30FC has the ScriptExtensions property set to Hiragana and Katakana, it is not removed.

The ScriptExtensions property is a different property from the Script property. If you want to use it, it's better to be explicit. UAX#24 3.2 says it has a list only when the list is "a small, enumerable list," so it's a "be careful when I have to rely on it" type of property.

From an implementer's point of view, it's a lot more expensive performance-wise. It's (probably) the only property that returns a list. It would be great if we can avoid it where possible.

For this specific case, if the Unicode Blocks property works, that's more helpful. Is that ok?

@kojiishi That's what we did originally, but when Unicode was updated, letters were added outside the original blocks. So from the spec point of view, it's better to use the script properties. But from an implementation point of view, you probably want to implement it as Unicode blocks.

As @xfq reported somewhere, the proposal to Unicode is in progress. I'd like to see how it goes.

I agree that the maintenance when Unicode updates is a concern for us. I think we're in consensus to try to avoid/minimize it, only slightly different opinions on how.

Allow me to confirm, here we're talking about the script extension property, not the script property, correct? I'm assuming you meant so when you wrote "the script properties" but please let me know if I read you wrong.

First, let me correct my previous comment; it wasn't correct. It was based on past feedback from Myles and Jonathan. But after talking to UTC experts, they preferred script extensions over blocks. This is partly because, while the script extension property isn't easy for browsers to use in a performant way, it's easier for ICU. If we were going to implement this in ICU, it's possible for ICU to support CSS-defined derived properties, as they did for the line break iterators, but Unicode is a more natural place.

Another point: CSS defining simple derived properties, combining only one or two properties, is fine. But once it goes beyond a certain complexity, the risk of overfitting increases, and with it the likelihood of maintenance work to adopt new Unicode versions, so we may not achieve our original goal. IIRC this was the primary reason for UAX#50, based on feedback from both CSS (John) and Unicode (Eric).

Seeing @frivoal's improved proposal and other feedback, I think this property is likely to exceed that limit. If UTC can accept the proposal, I think it's the best way to go.

kojiishi · 2023-10-23T04:37:31Z

css-text-4/Overview.bs

+		whose base character:
+		* belongs to Unicode Letters [L*], Mark [M*], Symbols [S*], or Numbers [N*] [=general category=], and
+		* has the Han, Hiragana, or Katakana [=script property=], and
+		* is not categorized as East Asian Halfwidth (H) by [[!UAX11]]


This looks good. I think we also need to exclude EAW=H from "non-ideographic letters/numerics" too, right?

seems reasonable, yes.
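The three conditions in the hunk above (general category, script, and not East Asian Halfwidth) can be sketched as a predicate. This is only an illustration: Python's `unicodedata` provides `category` and `east_asian_width` but no Script property, so `ROUGH_SCRIPT_RANGES` is a hypothetical, hand-typed, incomplete stand-in for a real script lookup, and it deliberately ignores Script Extensions.

```python
import unicodedata
from typing import Optional

# Hypothetical stand-in for a real Script lookup: a few hand-typed ranges only.
ROUGH_SCRIPT_RANGES = [
    (0x3041, 0x3096, "Hiragana"),
    (0x30A1, 0x30FA, "Katakana"),
    (0x32D0, 0x32FE, "Katakana"),  # circled Katakana
    (0xFF71, 0xFF9D, "Katakana"),  # halfwidth Katakana letters
    (0x4E00, 0x9FFF, "Han"),
]


def rough_script(ch: str) -> Optional[str]:
    cp = ord(ch)
    for first, last, script in ROUGH_SCRIPT_RANGES:
        if first <= cp <= last:
            return script
    return None


def matches_proposed_definition(ch: str) -> bool:
    """Apply the three bullet points from the diff to one base character."""
    # Letters [L*], Marks [M*], Symbols [S*], or Numbers [N*]
    if unicodedata.category(ch)[0] not in "LMSN":
        return False
    # Han, Hiragana, or Katakana script
    if rough_script(ch) not in {"Han", "Hiragana", "Katakana"}:
        return False
    # Not East Asian Halfwidth (H)
    return unicodedata.east_asian_width(ch) != "H"
```

Under these assumptions, halfwidth Katakana is excluded by the last condition, while circled Katakana (gc=So) is still included, which is the case discussed in the comments that follow.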

I compared the proposed definition with "spacing property" which is under development for jlreq-d. Differences are shown below:

Characters that are in css 'ideograph' but excluded from 'J' class in spacing_property

1. ㋐ CIRCLED KATAKANA (sc=Katakana, gc=So)

2. ㌁ SQUARE KATAKANA and HIRAGANA (sc=Katakana/Hiragana, gc=So)

3. Han radicals (sc=Hani, gc=So)

4. U+16FF0 VIETNAMESE ALTERNATE READING MARK CA and U+16FF1 VIETNAMESE ALTERNATE READING MARK NHAY (gc=Mc; sc=Hani; ea=W)

5. U+16FE3 OLD CHINESE ITERATION MARK (gc=Lm; sc=Hani; ea=W)

Characters that are excluded from css 'ideograph' but in 'J' class in spacing_property
6. 〆、〼 (gc=Lo; sc=Common; ea=W)
7. U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK, U+3031..U+3034 VERTICAL KANA REPEAT MARKs (gc=Lm; sc=Common; ea=W)
8. Ideographic characters other than Han ideographs (Ideographic=Yes, sc=Kits/Nshu/Tang)

Here are comments on each of these based on the discussions done when we developed spacing_property.

1. We believe circled characters should be removed from CSS 'Ideograph'. They were part of cl-19 (Kanji) in JLReq, but it was agreed to treat all circled, double circled, negative circled and square letters as symbols that do not generate aki space with any other characters (class 'O' in the spacing_property). The reasoning is that they are enclosed symbols more than they are numbers, katakana, kanji or alphabet (e.g., ①Ⓐ㉄㉮㊀).

2. We believe squared Katakana / Hiragana should be removed from CSS 'Ideograph' for the same reason as 1.

3. We believe radicals should stay in 'Ideograph', i.e. as-is. We want to update spacing_property to match it.

4. We do not have enough knowledge to tell whether these should or should not be part of 'Ideograph'. Since they are used in combination with a base Kanji, perhaps they do not need to be included?

5. We believe this character should stay in 'Ideograph', i.e. as-is. We want to update spacing_property to match it.

6. Probably 〆〼 should be added to 'Ideograph'. We do not have a strong opinion on them.

7. As agreed, the PROLONGED SOUND MARK should be added. The same goes for the U+3031..U+3034 VERTICAL KANA REPEAT MARKs.

8. Probably all ideographic characters should be added to 'Ideograph' because they all originated from Han ideographs. At the same time, we do not have a strong opinion on them.

I might provide updates as I just asked JLReq TF if they have further comments.

Regarding excluding East Asian Halfwidth (H), I believe we can just say "and is categorised as East Asian Wide (W)" because all characters in this category are "ea=W". What do you think?

If we are to update class "J" of the spacing_property with above recommendations incorporated, it would look like:
[[[[:sc=Hiragana:][:sc=Katakana:][:sc=Common:]][:ideographic:]]&[:gc=L:]&[:ea=W:]]
[[[:sc=Hani:]]&[[:gc=L:][:gc=Nl:][:gc=So:]]]
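As a worked reading of the two set expressions above, the following sketch evaluates them against a few characters whose property values are quoted in this comment. Assumptions: the two lines are read as a union; `gc=L` matches any general category starting with `L`; `Props`, `SAMPLES`, and `in_class_j` are hypothetical names; the `ea` value for ㋐ and the `ideographic=False` defaults are not stated in the thread and are filled in for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Props:
    gc: str                    # General_Category, e.g. "Lm"
    sc: str                    # Script, e.g. "Hani"
    ea: str                    # East_Asian_Width, e.g. "W"
    ideographic: bool = False  # assumed False for every sample below


# Property values as quoted in the comparison lists in this comment.
SAMPLES = {
    0x30FC: Props("Lm", "Common", "W"),    # PROLONGED SOUND MARK
    0x3006: Props("Lo", "Common", "W"),    # IDEOGRAPHIC CLOSING MARK
    0x32D0: Props("So", "Katakana", "W"),  # CIRCLED KATAKANA A (ea assumed)
    0x16FE3: Props("Lm", "Hani", "W"),     # OLD CHINESE ITERATION MARK
    0x16FF0: Props("Mc", "Hani", "W"),     # VIETNAMESE ALTERNATE READING MARK CA
}


def in_class_j(p: Props) -> bool:
    # First line: (Hiragana | Katakana | Common | Ideographic) & gc=L & ea=W
    first = (
        (p.sc in {"Hiragana", "Katakana", "Common"} or p.ideographic)
        and p.gc.startswith("L")
        and p.ea == "W"
    )
    # Second line: sc=Hani & (gc=L | gc=Nl | gc=So)
    second = p.sc == "Hani" and (p.gc.startswith("L") or p.gc in {"Nl", "So"})
    return first or second
```

Read this way, the expressions keep U+30FC, 〆, and U+16FE3, and drop circled Katakana and the Vietnamese reading marks, consistent with recommendations 1, 4, 5, 6, and 7.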

@kidayasuo Your comments look great, but it also looks like the ideal list is more complicated than I would like to add to a CSS spec. How about doing this in 2 steps:

1. We go with some simple class for now, so that browsers can ship earlier.

2. Someone volunteers to add the property to Unicode, so that the future spec can switch to it.

For the purpose of 1, I'm fine with the current definition, or add/modify some more, but I wish to keep it simple.

Another possibility for 1: @nt1m @vitorroriz Is it possible for you to provide the list used by iOS/macOS? If it's simple enough, it might be a good candidate for the purpose of 1 above.

@kidayasuo Are you or someone at JLTF willing to volunteer talking to Unicode?

> Regarding excluding East Asian Halfwidth (H), I believe we can just say "and is categorised as East Asian Wide (W)" because all characters in this category are "ea=W". What do you think?

Whether "ea=W" includes "ea=F" or not is a bit ambiguous. As I read UAX#11, it does, but in ICU it doesn't. I think it's safer to be explicit.
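A small sketch of why being explicit helps, using Python's standard `unicodedata.east_asian_width`, which reports W, F, and H as distinct values. The helper names are hypothetical; this says nothing about how any browser or ICU implements the check.

```python
import unicodedata


def is_wide_only(ch: str) -> bool:
    """'ea=W' read narrowly: Wide only, Fullwidth not included."""
    return unicodedata.east_asian_width(ch) == "W"


def is_wide_or_fullwidth(ch: str) -> bool:
    """'ea=W' read broadly: Wide or Fullwidth, spelled out explicitly."""
    return unicodedata.east_asian_width(ch) in {"W", "F"}


def is_not_halfwidth(ch: str) -> bool:
    """The wording in the diff: anything except Halfwidth (H)."""
    return unicodedata.east_asian_width(ch) != "H"
```

For example, U+FF21 FULLWIDTH LATIN CAPITAL LETTER A is F, so it passes `is_wide_or_fullwidth` but not `is_wide_only`, while `is_not_halfwidth` also admits narrow characters such as ASCII `A` and relies on the other conditions to filter them out.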

/cc @Clqsin45

> We go with some simple class for now, so that browsers can ship earlier.

How much simpler should it be?

For most practical uses, I think the proposed changes plus excluding the ones below would be sufficient.

㋐ CIRCLED KATAKANA (sc=Katakana, gc=So)
㌁ SQUARE KATAKANA and HIRAGANA (sc=Katakana/Hiragana, gc=So)

cc @fantasai

@kidayasuo and I will work on a proposal to Unicode together. This will take 2 steps: first reaching consensus to define a new property, then agreeing on the data. Hopefully the first step won't be too hard.

The proposal at: w3c/csswg-drafts#9503 (comment)

[css-text-4] Refine definition of ideographs

c4d8d2b

See w3c#9501 and w3c#9471

This was referenced Oct 20, 2023

[css-text] The definition of ideographs includes punctuation marks #9501

Open

[css-text][text-autospace] Is halfwidth Kana "non-ideographic letters"? #9471

Open

w3cbot mentioned this pull request Oct 20, 2023

[css-text-4] Refine definition of ideographs w3c/i18n-activity#1784

Open

frivoal mentioned this pull request Oct 23, 2023

CSS text-spacing property and its longhands w3ctag/design-reviews#907

Closed

1 task

kojiishi reviewed Oct 23, 2023

View reviewed changes

kojiishi mentioned this pull request Nov 9, 2023

[css-text][text-spacing] Extra spacing between ideographs and non-fullwidth punctuation/symbols #9479

Open

kidayasuo mentioned this pull request Nov 19, 2023

JLReq TF Meeting Notes - 2023-10-31 w3c/jlreq#382

Open

fantasai marked this pull request as draft November 20, 2023 21:28

This was referenced Feb 20, 2024

[css-text][text-spacing] Extra spacing between Hangul and Hanja #9979

Open

[css-text][text-spacing] Extra spacing between Bopomofo and Chinese characters #9980

Open

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

532f6e7

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi mentioned this pull request Feb 22, 2024

Apply @kidayasuo's proposal kojiishi/unicode-auto-spacing#2

Open

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

fe0f781

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

7556762

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

35f2854

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

c0f38c8

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

157352f

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

b82e9da

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

e6835c2

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

19d2912

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

f14aec8

The proposal at: w3c/csswg-drafts#9503 (comment)

kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024

Apply @kidayasuo's proposal

bc34678

The proposal at: w3c/csswg-drafts#9503 (comment)


frivoal commented Oct 20, 2023

kojiishi Oct 23, 2023

frivoal Oct 23, 2023

kidayasuo Oct 26, 2023

frivoal Oct 26, 2023

kojiishi Oct 27, 2023 •

edited

fantasai Nov 13, 2023

kojiishi Feb 24, 2024

kojiishi Oct 23, 2023

frivoal Oct 23, 2023

kidayasuo Oct 27, 2023

kidayasuo Oct 27, 2023

kidayasuo Oct 27, 2023

kojiishi Nov 8, 2023

kidayasuo Nov 10, 2023

nt1m Nov 10, 2023

kojiishi Nov 11, 2023 •

edited
