[css-text-4] Refine definition of ideographs #9503

Draft: 1 commit to be merged into base: main

Collaborator

@frivoal frivoal commented Oct 20, 2023

See #9501 and #9471

@frivoal frivoal added labels Oct 20, 2023: css-text-4; i18n-tracker (Group bringing to attention of Internationalization, or tracked by i18n but not needing response); Needs Review of Proposed Text; i18n-jlreq (Japanese language enablement); i18n-clreq (Chinese language enablement); i18n-klreq (Korean language enablement)
<dd>Includes all [=typographic character units=] [[CSS-TEXT-3]]
whose base character:
* belongs to Unicode Letters [L*], Mark [M*], Symbols [S*], or Numbers [N*] [=general category=], and
* has the Han, Hiragana, or Katakana [=script property=], and
Contributor

From these two lists, I think:

  • Adding Kana extensions, Hentaigana, etc. is good.
  • U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed.
  • Should Circled Katakana probably be removed?
  • Squared Katakana words, I'm not sure. Ask at JLTF?

Collaborator Author

U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed.

In css-text-4, we have this:

script property
Defined in Unicode Standard Annex #24 [UAX24] and given as the Script property in the Unicode Character Database [UAX44]. (UAs must include any ScriptExtensions.txt assignments in this mapping.)

Since U+30FC has the ScriptExtensions property set to Hiragana and Katakana, it is not removed.

But I think that our definition of "script property", which covers not just the Unicode Script property but also the Script Extensions property, is too easy to miss. Maybe, editorially, we should rename it to "script properties" or "script-related properties".
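
To illustrate the difference this makes, here is a minimal Python sketch. Python's standard library does not expose the Script or Script_Extensions properties, so the two tables below are a tiny hand-copied sample (an assumption for illustration; a real implementation would load Scripts.txt and ScriptExtensions.txt). The names `SCRIPT`, `SCRIPT_EXTENSIONS`, `scripts_of`, and `has_cjk_script` are hypothetical.

```python
# Hand-copied sample of Unicode script data, for illustration only.
# A real implementation would load Scripts.txt and ScriptExtensions.txt.
SCRIPT = {
    0x3042: "Hiragana",  # HIRAGANA LETTER A
    0x30A2: "Katakana",  # KATAKANA LETTER A
    0x6F22: "Han",       # CJK ideograph
    0x30FC: "Common",    # KATAKANA-HIRAGANA PROLONGED SOUND MARK
    0x0041: "Latin",     # LATIN CAPITAL LETTER A
}
SCRIPT_EXTENSIONS = {
    0x30FC: frozenset({"Hiragana", "Katakana"}),
}
TARGET = frozenset({"Han", "Hiragana", "Katakana"})


def scripts_of(cp, use_extensions):
    """Return the set of scripts for a code point in the sample tables."""
    if use_extensions and cp in SCRIPT_EXTENSIONS:
        return SCRIPT_EXTENSIONS[cp]
    return frozenset({SCRIPT[cp]})


def has_cjk_script(cp, use_extensions=True):
    """True if the code point maps to Han, Hiragana, or Katakana."""
    return bool(scripts_of(cp, use_extensions) & TARGET)


if __name__ == "__main__":
    for use_ext in (False, True):
        print("use_extensions =", use_ext, "->", has_cjk_script(0x30FC, use_ext))
```

With the Script property alone, U+30FC is Common and falls outside the set; it is only included once the ScriptExtensions assignments are consulted.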

Should Circled Katakana probably be removed?

No strong opinion, from me. Why do you think they should be removed?


I agree with @kojiishi that the character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed. It is cl-10 in JLReq and its behavior for spacing is the same as Hiragana, Katakana and Kanji.

I believe circled Katakana should be treated the same as other circled letters; however, they are currently part of Kanji (cl-19). We'll discuss this at the JLReq TF.

Squared Katakana words are postfixed abbreviations (cl-13) in JLReq. They are part of symbols in the new simplified proposal. We'll revisit this in the JLReq TF as well.

Collaborator Author

the character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be removed

It is not being removed, so we're in agreement.

Contributor

@kojiishi kojiishi Oct 27, 2023

Since U+30FC has the ScriptExtensions property set to Hiragana and Katakana, it is not removed.

The ScriptExtensions property is a different property from the Script property. If you want to use that, it's better to be explicit. UAX#24 3.2 says it has a list only when the list is "a small, enumerable list," so it's a "be careful when I have to rely on it" kind of property.

From an implementer's point of view, it's a lot more expensive performance-wise. It's (probably) the only property that returns a list. It would be great if we could avoid it when possible.

For this specific case, if the Unicode Blocks property works, that's more helpful. Is that ok?

Collaborator

@kojiishi That's what we did originally, but when Unicode updated, they added letters outside the original blocks. So from the spec point of view, it's better to use the script properties. But from an implementation point of view, you probably want to implement it as Unicode blocks.

Contributor

As @xfq reported somewhere, the proposal to Unicode is in progress. I'd like to see how it goes.

I agree that the maintenance when Unicode updates is a concern for us. I think we're in consensus to try to avoid/minimize it, only slightly different opinions on how.

Allow me to confirm, here we're talking about the script extension property, not the script property, correct? I'm assuming you meant so when you wrote "the script properties" but please let me know if I read you wrong.

First, let me correct my previous comment; it wasn't correct. It was based on past feedback from Myles and Jonathan. But after talking to UTC experts, they preferred script extension over blocks. This is partly because, while the script extension property isn't easy for browsers to use in a performant way, it's easier for ICU. If we were going to implement this in ICU, it's possible for ICU to support CSS-defined derived properties, as they did for the line break iterators, but Unicode is a more natural place.

Another point. CSS defining simple derived properties, combining only one or two properties, is fine, but beyond a certain complexity the risk of overfitting increases, and with it the likelihood of maintenance work to adopt new Unicode versions, so we may not achieve our original goal. IIRC this was the primary reason for UAX#50, based on feedback from both CSS (John) and Unicode (Eric).

Seeing @frivoal's improved proposal and other feedback, I think this property is likely to exceed the limit. If UTC can accept the proposal, I think it's the best way to go.

whose base character:
* belongs to Unicode Letters [L*], Mark [M*], Symbols [S*], or Numbers [N*] [=general category=], and
* has the Han, Hiragana, or Katakana [=script property=], and
* is not categorized as East Asian Halfwidth (H) by [[!UAX11]]
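
As a rough reading aid, the three conditions in this proposed text can be sketched in Python as follows. This is a sketch under stated assumptions, not the spec's algorithm: it treats the base character as a single code point, and because Python's `unicodedata` module has no script data, the caller passes in the script values (Script plus any ScriptExtensions assignments). The function name `is_ideograph_candidate` is hypothetical.

```python
import unicodedata

CJK_SCRIPTS = frozenset({"Han", "Hiragana", "Katakana"})


def is_ideograph_candidate(base_char, scripts):
    """Apply the three proposed conditions to one base character.

    `scripts` is supplied by the caller (assumption: it already merges
    the Script value with any ScriptExtensions assignments).
    """
    # Condition 1: general category is Letter, Mark, Symbol, or Number.
    gc = unicodedata.category(base_char)
    if gc[0] not in "LMSN":
        return False
    # Condition 2: Han, Hiragana, or Katakana script.
    if not (set(scripts) & CJK_SCRIPTS):
        return False
    # Condition 3: not East Asian Halfwidth (H).
    return unicodedata.east_asian_width(base_char) != "H"


if __name__ == "__main__":
    print(is_ideograph_candidate("\u6F22", {"Han"}))       # CJK ideograph
    print(is_ideograph_candidate("\uFF71", {"Katakana"}))  # halfwidth katakana
    print(is_ideograph_candidate("\u32D0", {"Katakana"}))  # circled katakana
```

Under this reading, halfwidth katakana are rejected by the third condition, while circled katakana (gc=So) still pass all three.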
Contributor

This looks good. I think we also need to exclude EAW=H from "non-ideographic letters/numerics" too, right?

Collaborator Author

seems reasonable, yes.


I compared the proposed definition with "spacing property" which is under development for jlreq-d. Differences are shown below:

Characters that are in css 'ideograph' but excluded from 'J' class in spacing_property

  1. ㋐ CIRCLED KATAKANA (sc=Katakana, gc=So)
  2. ㌁ SQUARE KATAKANA and HIRAGANA (sc=Katakana/Hiragana, gc=So)
  3. Han radicals (sc=Hani, gc=So)
  4. U+16FF0 VIETNAMESE ALTERNATE READING MARK CA and U+16FF1 VIETNAMESE ALTERNATE READING MARK NHAY (gc=Mc; sc=Hani; ea=W)
  5. U+16FE3 OLD CHINESE ITERATION MARK (gc=Lm; sc=Hani; ea=W)

Characters that are excluded from css 'ideograph' but in 'J' class in spacing_property
  6. 〆 and 〼 (gc=Lo; sc=Common; ea=W)
  7. U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK, U+3031..U+3034 VERTICAL KANA REPEAT MARKs (gc=Lm; sc=Common; ea=W)
  8. Ideographic characters other than Han Ideograph (Ideographic=Yes, sc=Kits/Nshu/Tang)

Here are comments on each of these based on the discussions done when we developed spacing_property.

  1. We believe circled characters should be removed from CSS 'Ideograph'. They were a part of cl-19 (Kanji) in JLReq, but it was agreed to treat all circled, double circled, negative circled and square letters as symbols that do not generate aki space with any other characters (class 'O' in the spacing_property). The reasoning is that they are enclosed symbols more than they are numbers, katakana, kanji or alphabet (e.g., ①Ⓐ㉄㉮㊀).
  2. We believe squared Katakana / Hiragana should be removed from CSS 'Ideograph' for the same reason as 1.
  3. We believe radicals should stay in 'Ideograph', i.e., as-is. We want to update spacing_property to match it.
  4. We do not have enough knowledge to tell whether these should be a part of 'Ideograph'. Since they are used in combination with base Kanji, perhaps they do not need to be included?
  5. We believe this character should stay in 'Ideograph', i.e., as-is. We want to update spacing_property to match it.
  6. Probably 〆〼 should be added to 'Ideograph'. We do not have a strong opinion on them.
  7. As agreed, the PROLONGED SOUND MARK should be added. The same goes for U+3031..U+3034 VERTICAL KANA REPEAT MARKs.
  8. Probably all ideographic characters should be added to 'Ideograph' because they all originated from Han ideograph. At the same time we do not have a strong opinion on them.

I might provide updates as I just asked JLReq TF if they have further comments.
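
The gc and ea values quoted in the comparison above can be spot-checked with Python's standard `unicodedata` module, which exposes General_Category and East_Asian_Width but not Script. A minimal sketch follows; the sample list and the `describe` helper are illustrative choices, and the results depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# A few of the characters discussed in the comparison above.
SAMPLES = [
    "\u32D0",  # a circled katakana (item 1)
    "\u3301",  # a squared katakana word (item 2)
    "\u3006",  # ideographic closing mark (item 6)
    "\u30FC",  # prolonged sound mark (item 7)
    "\u3031",  # vertical kana repeat mark (item 7)
]


def describe(ch):
    """Return (code point, name, general category, East Asian Width)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )


if __name__ == "__main__":
    for ch in SAMPLES:
        print(describe(ch))
```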


Regarding excluding East Asian Halfwidth (H), I believe we can just say "and is categorised as East Asian Wide (W)" because all characters in this category are "ea=W". What do you think?


If we are to update class "J" of the spacing_property with the above recommendations incorporated, it would look like:
[[[[:sc=Hiragana:][:sc=Katakana:][:sc=Common:]][:ideographic:]]&[:gc=L:]&[:ea=W:]]
[[[:sc=Hani:]]&[[:gc=L:][:gc=Nl:][:gc=So:]]]
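
For readers less familiar with UnicodeSet notation, here is one possible reading of the two expressions above as a Python predicate. It is a sketch under assumptions: the two lines are taken to be unioned, adjacent sets inside brackets are unions, and `&` is intersection. Since Python's standard library has neither the Script nor the Ideographic property, the caller supplies both; the function name `in_class_j` is hypothetical.

```python
import unicodedata


def in_class_j(ch, script, ideographic=False):
    """Evaluate the two UnicodeSet expressions for one character.

    `script` and `ideographic` are supplied by the caller (assumption),
    because the standard library does not expose those properties.
    """
    gc = unicodedata.category(ch)
    ea = unicodedata.east_asian_width(ch)
    # First expression:
    # ((sc=Hiragana | sc=Katakana | sc=Common) | Ideographic) & gc=L & ea=W
    first = (
        (script in {"Hiragana", "Katakana", "Common"} or ideographic)
        and gc.startswith("L")
        and ea == "W"
    )
    # Second expression: sc=Hani & (gc=L | gc=Nl | gc=So)
    second = script == "Han" and (gc.startswith("L") or gc in {"Nl", "So"})
    return first or second


if __name__ == "__main__":
    print(in_class_j("\u30FC", "Common"))    # prolonged sound mark
    print(in_class_j("\u32D0", "Katakana"))  # circled katakana
    print(in_class_j("\u2F00", "Han"))       # Kangxi radical
```

Read this way, the prolonged sound mark and Han radicals fall inside the class, while circled katakana (gc=So, sc=Katakana) fall outside, which matches recommendations 1, 3, and 7.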

Contributor

@kidayasuo Your comments look great, but it also looks like the ideal list is more complicated than I would like to add to a CSS spec. How about doing this in 2 steps:

  1. We go with some simple class for now, so that browsers can ship earlier.
  2. Someone volunteers to add the property to Unicode, so that the future spec can switch to it.

For the purpose of 1, I'm fine with the current definition, or add/modify some more, but I wish to keep it simple.

Another possibility for 1: @nt1m @vitorroriz Is it possible for you to provide the list used by iOS/macOS? If it's simple enough, it might be a good candidate for the purpose of 1 above.

@kidayasuo Are you or someone at JLTF willing to volunteer to talk to Unicode?

Regarding excluding East Asian Halfwidth (H), I believe we can just say "and is categorised as East Asian Wide (W)" because all characters in this category are "ea=W". What do you think?

Whether "ea=W" includes "ea=F" or not is a bit ambiguous. UAX#11 reads to me as though it does, but in ICU it doesn't. I think it's safer to be explicit.

/cc @Clqsin45
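
A small sketch of why being explicit matters: Python's `unicodedata.east_asian_width` reports W, F, and H as distinct values, so "is W" and "is not H" are different tests in general. The helper names below are hypothetical, and this shows only Python's behaviour, not ICU's.

```python
import unicodedata


def is_wide(ch, include_fullwidth=False):
    """True for ea=W, and optionally for ea=F when asked explicitly."""
    ea = unicodedata.east_asian_width(ch)
    return ea == "W" or (include_fullwidth and ea == "F")


def is_not_halfwidth(ch):
    """True for anything except ea=H (includes W, F, Na, A, N)."""
    return unicodedata.east_asian_width(ch) != "H"


if __name__ == "__main__":
    for ch in ("\u3042", "\uFF21", "\uFF71", "A"):
        print(
            f"U+{ord(ch):04X}",
            unicodedata.east_asian_width(ch),
            is_wide(ch),
            is_wide(ch, include_fullwidth=True),
            is_not_halfwidth(ch),
        )
```

The two tests only coincide when every character that passes the other conditions is either W or H, which is the claim made in the quoted comment.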


  • We go with some simple class for now, so that browsers can ship earlier.

How much simpler should it be?

For most practical uses I think the proposed changes, plus excluding the ones below, would be sufficient.

㋐ CIRCLED KATAKANA (sc=Katakana, gc=So)
㌁ SQUARE KATAKANA and HIRAGANA (sc=Katakana/Hiragana, gc=So)

Member

Contributor

@kojiishi kojiishi Nov 11, 2023

@kidayasuo and I will work on a proposal to Unicode together. This will take two steps: first reaching a consensus to define a new property, then working on the data. Hopefully the first step shouldn't be too hard.

@fantasai fantasai marked this pull request as draft November 20, 2023 21:28
kojiishi added a commit to kojiishi/unicode-auto-spacing that referenced this pull request Feb 22, 2024