FYI: Proposals for changes to rules Unicode Standard Annexes 14 and 29 #3255

eggrobin · 2023-04-04T13:19:19Z

The Properties and Algorithms Group plans to recommend the following proposals to Unicode Technical Committee #‌175 later this month. If they are accepted, the changes would be published as part of Unicode Version 15.1, in September.

UAX #‌14:

L2/23-063, Line breaking around quotation marks.
L2/23-072, Proposed changes for line breaking on orthographic syllables.
- Note that this involves new property values for the Line_Break property.

UAX #‌29:

(No proposal paper, this will be part of L2/23-079.) Upstream the CLDR root tailoring for grapheme clusters, that is, add a new rule GB9c LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant, where:
- Virama=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Virama}]
- LinkingConsonant=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Consonant}]
- ExtCccZwj=[[\p{gcb=Extend}-\p{ccc=0}] \p{gcb=ZWJ}]
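
As a rough illustration of how GB9c reads, here is a minimal sketch that checks the rule on a sequence of code points that have already been classified. The labels (`LC` for LinkingConsonant, `V` for Virama, `E` for any other member of ExtCccZwj, `O` for everything else) and the function name `gb9c_no_break` are hypothetical and exist only for this example. It assumes every Virama also matches ExtCccZwj, so the stretch between the two consonants is a run of `E` and `V` containing at least one `V`. Real classification needs Script, Indic_Syllabic_Category, gcb, and ccc data, none of which is modelled here.

```python
def gb9c_no_break(classes, i):
    """Return True if GB9c forbids a break before classes[i].

    `classes` is a list of hypothetical labels: "LC", "V", "E", or "O".
    """
    # The code point after the boundary must be a LinkingConsonant.
    if not (0 < i < len(classes)) or classes[i] != "LC":
        return False
    # Walk back over ExtCccZwj* Virama ExtCccZwj*, remembering whether a
    # Virama was seen (assumption: "V" is also a member of ExtCccZwj).
    j = i - 1
    seen_virama = False
    while j >= 0 and classes[j] in ("E", "V"):
        seen_virama = seen_virama or classes[j] == "V"
        j -= 1
    # The run must contain a Virama and be preceded by a LinkingConsonant.
    return seen_virama and j >= 0 and classes[j] == "LC"
```

For example, Devanagari क्ष (KA, VIRAMA, SSA) would be labelled `["LC", "V", "LC"]`, and the sketch reports no break before the final consonant.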


sffc · 2023-04-04T23:07:35Z

@makotokato @aethanyc

sffc · 2023-04-20T18:13:01Z

@aethanyc or @makotokato can you take this issue? Probably for 1.x Priority.

sffc · 2023-05-11T18:11:27Z

Discussion: Longer term, we would like it if the upstreamed TOML files would be updated along with the specification, so that ICU4X does not need to do anything more than pulling in updates from upstream.

eggrobin · 2023-05-17T14:59:26Z

Looking at the TOML files, my impression is that they define a state machine that transitions on each code point: a [[tables]] record defines a transition from its left state to its name state when the next code point has the class right. The break at each step is then determined by the [[rules]] entry whose left matches the current state and whose right matches the class of the next code point, that is, with one code point of lookahead.
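
To make that reading concrete, here is a toy sketch of such a state machine. It is not the ICU4X implementation. The keys `left`, `right`, and `name` follow the description above, while the `break_state` key, the default of "no break" when no rule matches, the fallback to the class itself as the next state, and the tiny rule set are all assumptions made for illustration.

```python
# Hypothetical data in the spirit of the [[tables]] and [[rules]] records.
TABLES = [
    # In state `left`, a code point of class `right` leads to state `name`.
    {"left": "AL", "right": "CM", "name": "AL"},  # AL CM keeps behaving as AL
    {"left": "SP", "right": "CM", "name": "AL"},  # a stray CM behaves as AL
]
RULES = [
    # Whether to break between state `left` and a next class `right`.
    {"left": "SP", "right": "AL", "break_state": True},
    {"left": "SP", "right": "CM", "break_state": True},
]


def break_positions(classes):
    """Return the indices i such that there is a break before classes[i]."""
    transitions = {(t["left"], t["right"]): t["name"] for t in TABLES}
    breaks = {(r["left"], r["right"]): r["break_state"] for r in RULES}
    if not classes:
        return []
    state = classes[0]
    found = []
    for i in range(1, len(classes)):
        right = classes[i]
        # The decision uses only the current state and the next class.
        if breaks.get((state, right), False):
            found.append(i)
        # Without a matching table record, the class itself is the new state.
        state = transitions.get((state, right), right)
    return found
```

The point of the sketch is that each decision sees exactly one class to the right of the candidate break, which is the limitation discussed next.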

The following new line breaking rules require more lookahead than that:

× [\p{Pf}&QU] ( SP | GL | WJ | CL | QU | CP | EX | IS | SY | BK | CR | LF | NL | eot)
(AK | ◌ | AS) × (AK | ◌ | AS) VF

These require looking at two code points to the right of the (non-)break, plus any intervening CM (since these are after LB9).
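
A minimal sketch of the second rule shows where the extra lookahead comes from. The function name `lb28a_no_break` and the string labels are hypothetical, the input is assumed to be pre-classified, and only CM is skipped (ZWJ, which LB9 also absorbs, is left out for brevity).

```python
BASE = {"AK", "◌", "AS"}


def lb28a_no_break(classes, i):
    """True if `(AK | ◌ | AS) × (AK | ◌ | AS) VF` forbids a break before classes[i]."""
    # The code point right after the boundary must be AK, ◌, or AS.
    if not (0 < i < len(classes)) or classes[i] not in BASE:
        return False
    # Left context: skip back over combining marks attached to the base (LB9).
    j = i - 1
    while j > 0 and classes[j] == "CM":
        j -= 1
    if classes[j] not in BASE:
        return False
    # Right context: find the next base after classes[i], skipping any CM.
    # This is the second code point to the right of the boundary, or later.
    k = i + 1
    while k < len(classes) and classes[k] == "CM":
        k += 1
    return k < len(classes) and classes[k] == "VF"
```

The index `k` is always at least `i + 1`, so the decision at position `i` depends on a class that a machine with a single class of lookahead has not seen yet.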

hsivonen · 2023-10-19T06:06:35Z

Gecko bug

eggrobin · 2023-10-19T15:48:01Z

Henri, this is interesting.

In your comment you correctly identified what LB15a and LB15b are trying to do, and why they need to do that (instead of treating Pi as LB=OP and Pf as LB=CL: that would mess with German, Finnish, etc. usage of Pf initially or Pi finally).

However, these new rules do not help with the Chinese issue at hand, since there are no spaces (there may visually appear to be space, but that is because U+2018 etc. have ambiguous width; here they are wide). This has recently come to the attention of the Properties and Algorithms Group of the UTC; it may be possible to do something about it in the ID QU ID case.
I will mention that issue in that discussion. Nothing will happen on that front before Unicode 16.0 in September 2024 though.

aethanyc · 2024-05-17T21:31:57Z

We still need to update the line segmenter to Unicode 15.1. @makotokato is working on it.

eggrobin · 2024-06-04T12:41:49Z

I am experimenting with moving LB8a and LB9 into the code of the line segmenter, as

- the combination of these rules makes the state table extraordinarily painful to maintain (and it makes it large), as every state needs to be replicated: X ZWJ is different from X for most X since there is no break after ZWJ per LB8a, but X ZWJ CM brings you back to the X state, so the X ZWJ states cannot be merged;
- these rules cannot be tailored (so there is no reason to allow for custom data to change their behaviour), and are in practice reasonably stable: they last changed in Unicode 11 (2018), following up on some earlier Unicode 9 (2016) changes for emoji ZWJ sequences; contrast the other rules that have been changing wildly every year.
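
To give a feel for the replication cost, here is a small sketch that expands a toy transition table. It assumes each state X needs an X_ZWJ twin with transitions X CM → X, X ZWJ → X_ZWJ, X_ZWJ CM → X, and a copy X_ZWJ Y → Z of every X Y → Z. The function `add_zwj_states` and the two-state table are hypothetical; the real tables are much larger, and the rules that prevent breaks after X_ZWJ are not modelled.

```python
def add_zwj_states(states, transitions):
    """Return a copy of `transitions` extended with the X_ZWJ bookkeeping.

    `transitions` maps (state, class of next code point) to the new state.
    """
    out = dict(transitions)
    for x in states:
        xz = f"{x}_ZWJ"
        out[(x, "CM")] = x      # X CM behaves as X (LB9)
        out[(x, "ZWJ")] = xz    # X ZWJ is a distinct state (no break after ZWJ)
        out[(xz, "CM")] = x     # X ZWJ CM falls back to X
        # Every transition out of X must be duplicated for X_ZWJ.
        for (left, y), z in transitions.items():
            if left == x:
                out[(xz, y)] = z
    return out


base = {("A", "B"): "B", ("B", "A"): "A"}
expanded = add_zwj_states(["A", "B"], base)
```

In this toy case, two transitions grow to ten, which is the kind of growth that makes handling ZWJ and CM in code rather than in the table attractive.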

Hopefully no functional change.

Last time I attempted to look at Unicode 15.1 line breaking, that was made impractical by the need, for every new state X, to add an X_ZWJ state, transitions X CM → X, X ZWJ → X_ZWJ, X_ZWJ CM → X, as well as X_ZWJ Y → Z for every transition X Y → Z, and to add or update rules to prevent breaks after X_ZWJ. Hopefully this will make that upgrade a little more tractable. (Incidentally it makes the state table a bit smaller.)

Tested with 200 000 monkeys (recall that only 200 are checked in).

Related to #3255; see my comment there for the rationale.

Aside: While looking at this, it came to my attention that the `LineBreakStrictness::Anywhere` option does not do what the standard says, cf. https://drafts.csswg.org/css-text-3/#valdef-line-break-anywhere and https://drafts.csswg.org/css-text-3/#typographic-character-unit referenced therein. Of course, we _do_ have a correct implementation of `line-break: anywhere`, since we have a grapheme cluster segmenter.

eggrobin added the C-segmentation Component: Segmentation label Apr 4, 2023

sffc added S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality labels Apr 4, 2023

aethanyc mentioned this issue Apr 6, 2023

Upgrade segmenter to Unicode 15.0.0 #3273

Merged

sffc added this to the 1.x Priority ⟨P2⟩ milestone May 11, 2023

sffc added the blocked A dependency must be resolved before this is actionable label May 11, 2023

sffc assigned aethanyc May 11, 2023

makotokato mentioned this issue Oct 17, 2023

Support Indic_Syllabic_Category in icu_properties #4170

Closed

eggrobin mentioned this issue Nov 27, 2023

Segmenter does not implement the new Unicode 15.1 extended grapheme cluster definition #4365

Closed

aethanyc assigned makotokato and unassigned aethanyc May 17, 2024

eggrobin mentioned this issue Jun 4, 2024

Move LB8a and LB9 out of the table #5001

Merged

makotokato mentioned this issue Jul 10, 2024

Support Unicode 15.1 for line segmenter #5218

Merged

eggrobin closed this as completed in be4c14d Sep 6, 2024
