FYI: Proposals for changes to rules Unicode Standard Annexes 14 and 29 #3255
Comments
@aethanyc or @makotokato, can you take this issue? Probably 1.x priority.
Discussion: Longer term, we would like the upstreamed TOML files to be updated along with the specification, so that ICU4X needs to do nothing more than pull in updates from upstream.
Looking at the TOML files, my impression is that they define a state machine transitioned by code point (that is, a
The following new line breaking rules require more lookahead than that:
These require looking at two code points to the right of the (non-)break, plus any intervening CM (since these are after LB9).
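The lookahead requirement above can be illustrated with a small sketch. This is not ICU4X code: the function name `lookahead_bases` and the plain-string class labels are hypothetical, and the sketch simplifies LB9 to "skip every CM or ZWJ". It only shows why the number of code points examined is unbounded even though just two base classes matter.

```python
# Hypothetical sketch: class labels are plain strings, not ICU4X types.
ATTACHING = {"CM", "ZWJ"}  # classes that LB9 attaches to the preceding base


def lookahead_bases(classes, pos, count=2):
    """Collect the classes of the next `count` base characters at or after `pos`.

    CM and ZWJ entries are skipped. Returns the collected classes and the
    number of code points that had to be examined to find them.
    """
    found = []
    i = pos
    while i < len(classes) and len(found) < count:
        if classes[i] not in ATTACHING:
            found.append(classes[i])
        i += 1
    return found, i - pos
```

For a candidate break at index 1 of `["AL", "QU", "CM", "CM", "SP", "AL"]`, the two relevant classes are `QU` and `SP`, but four code points are examined to find them.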
Henri, this is interesting. In your comment you correctly identified what LB15a and LB15b are trying to do, and why they need to do that (instead of treating Pi as LB=OP and Pf as LB=CL: that would mess with German, Finnish, etc. usage of Pf initially or Pi finally). However, these new rules do not help with the Chinese issue at hand, since there are no spaces (there may visually appear to be space, but that is because U+2018 etc. have ambiguous width; here they are wide). This has recently come to the attention of the Properties and Algorithms Group of the UTC; it may be possible to do something about it in the ID QU ID case.
We still need to update the line segmenter to Unicode 15.1. @makotokato is working on it.
I am experimenting with moving LB8a and LB9 into the code of the line segmenter, as
Hopefully no functional change.

Last time I attempted to look at Unicode 15.1 line breaking, that was made impractical by the need, for every new state X, to add an X_ZWJ state, transitions X CM → X, X ZWJ → X_ZWJ, X_ZWJ CM → X, as well as X_ZWJ Y → Z for every transition X Y → Z, and to add or update rules to prevent breaks after X_ZWJ. Hopefully this will make that upgrade a little more tractable. (Incidentally it makes the state table a bit smaller.)

Tested with 200 000 monkeys (recall that only 200 are checked in).

Related to #3255; see my comment there for the rationale.

Aside: While looking at this, it came to my attention that the `LineBreakStrictness::Anywhere` option does not do what the standard says, cf. https://drafts.csswg.org/css-text-3/#valdef-line-break-anywhere and https://drafts.csswg.org/css-text-3/#typographic-character-unit referenced therein. Of course, we _do_ have a correct implementation of `line-break: anywhere`, since we have a grapheme cluster segmenter.
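The X_ZWJ bookkeeping described in this comment can be sketched as follows, to show how quickly the table grows. This is a toy model, not the ICU4X TOML format: the function name `add_zwj_states` and the dictionary representation are hypothetical, the input is assumed to contain no CM or ZWJ transitions of its own, and the rules that prevent breaks after X_ZWJ are not modelled.

```python
def add_zwj_states(transitions):
    """Expand {(state, cls): next_state} with the per-state ZWJ transitions.

    Only the four transition families named in the comment are generated.
    """
    states = {x for x, _ in transitions} | set(transitions.values())
    expanded = dict(transitions)
    for x in sorted(states):
        x_zwj = f"{x}_ZWJ"
        expanded[(x, "CM")] = x        # X CM -> X
        expanded[(x, "ZWJ")] = x_zwj   # X ZWJ -> X_ZWJ
        expanded[(x_zwj, "CM")] = x    # X_ZWJ CM -> X
    for (x, y), z in transitions.items():
        expanded[(f"{x}_ZWJ", y)] = z  # X_ZWJ Y -> Z for every X Y -> Z
    return expanded
```

With two base transitions over three states, for example `{("OP", "AL"): "AL", ("AL", "CL"): "CL"}`, the expanded table has 13 entries, which gives a feel for why doing this by hand for every new state was impractical.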
The Properties and Algorithms Group plans to recommend the following proposals to Unicode Technical Committee #175 later this month. If they are accepted, the changes would be published as part of Unicode Version 15.1, in September.
UAX #14:
UAX #29:
LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant, where:
Virama=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Virama}]
LinkingConsonant=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Consonant}]
ExtCccZwj=[[\p{gcb=Extend}-\p{ccc=0}] \p{gcb=ZWJ}]
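As a rough illustration of how the proposed UAX #29 rule reads, here is a minimal sketch over hand-assigned labels. Everything in it is hypothetical: the function name `no_break_before`, and the labels `"LC"` (LinkingConsonant), `"V"` (Virama), `"Ext"` (any other ExtCccZwj member, including ZWJ) and `"O"` (anything else). It does not look up Unicode properties. It also assumes that viramas themselves fall within ExtCccZwj, so several may appear in the run between the two consonants.

```python
def no_break_before(labels, i):
    """Return True if the rule forbids a grapheme cluster break before index i.

    Matches: LinkingConsonant ExtCccZwj* Virama ExtCccZwj* x LinkingConsonant
    """
    if i <= 0 or i >= len(labels) or labels[i] != "LC":
        return False
    seen_virama = False
    j = i - 1
    # Walk back over the run of ExtCccZwj members, noting whether a virama occurs.
    while j >= 0 and labels[j] in ("Ext", "V"):
        seen_virama = seen_virama or labels[j] == "V"
        j -= 1
    # The run must contain a virama and be preceded by a linking consonant.
    return seen_virama and j >= 0 and labels[j] == "LC"
```

For a consonant, virama, consonant sequence labelled `["LC", "V", "LC"]`, the function reports no break before index 2. With `["LC", "Ext", "LC"]` the rule does not apply, because no virama intervenes.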