Prepare for future normalization data for Unicode 16 #4538

hsivonen · 2024-01-23T17:05:03Z

ICU4X's data representation doesn't have a place for marking the first character of an expansion as a starter that combines backwards.

To accommodate the Unicode 16 characters discussed in section 5 of https://www.unicode.org/L2/L2024/24009-utc178-properties-recs.pdf, I expect the least disruptive change to be the following: a character that can occur as the first character of a composition whose second character may occur as the first character of an expansion decomposition is represented the way non-BMP singleton decompositions are currently represented. This should exclude such characters from NFC passthrough eligibility without having to find a way to mark the second character.

Without this patch, these characters would incorrectly be reported as having a singleton decomposition (to self) in `CanonicalDecomposition::decompose()`.

There are no tests, because we don't actually have data like this yet. The purpose is to let an ICU4X binary that contains this code consume dynamically loaded data from the future.

This change is low-risk in case we don't end up using the data trick anticipated by this patch.
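As a rough illustration of the idea described above, here is a toy model in Rust. It is only a sketch under stated assumptions: the `Entry` type, the `HashMap` table, the `passthrough_eligible` helper, and the private-use code points are all hypothetical and do not reflect ICU4X's actual data layout. Only the three-way shape of the result (default, singleton, expansion) is modeled on what `CanonicalDecomposition::decompose()` reports. The point it shows is that an entry stored in the singleton slot and pointing back at the character itself acts as a marker and should be reported as having no decomposition.

```rust
use std::collections::HashMap;

/// Result shape modeled on what `CanonicalDecomposition::decompose()` reports.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Decomposed {
    Default,
    Singleton(char),
    Expansion(char, char),
}

/// Hypothetical stored entry; NOT ICU4X's real representation.
#[derive(Debug, Clone, Copy)]
enum Entry {
    /// Stored the way a (non-BMP) singleton decomposition would be stored.
    SingletonSlot(char),
    /// A two-character expansion decomposition.
    Expansion(char, char),
}

/// Toy lookup: a singleton-slot entry that points at the character itself
/// is treated as a marker, not as a real singleton decomposition.
fn decompose(table: &HashMap<char, Entry>, c: char) -> Decomposed {
    match table.get(&c) {
        None => Decomposed::Default,
        Some(&Entry::SingletonSlot(target)) if target == c => Decomposed::Default,
        Some(&Entry::SingletonSlot(target)) => Decomposed::Singleton(target),
        Some(&Entry::Expansion(first, second)) => Decomposed::Expansion(first, second),
    }
}

/// Toy rule (an assumption of this sketch): any character that has an entry
/// at all is kept out of the NFC passthrough fast path.
fn passthrough_eligible(table: &HashMap<char, Entry>, c: char) -> bool {
    !table.contains_key(&c)
}

/// Hypothetical data using private-use code points as placeholders.
fn demo_table() -> HashMap<char, Entry> {
    let mut table = HashMap::new();
    // Marker entry: "singleton to self".
    table.insert('\u{F0000}', Entry::SingletonSlot('\u{F0000}'));
    // Ordinary singleton decomposition to a different character.
    table.insert('\u{F0001}', Entry::SingletonSlot('\u{F0002}'));
    // Expansion whose first character is the marked character.
    table.insert('\u{F0003}', Entry::Expansion('\u{F0000}', '\u{F0004}'));
    table
}
```

In this model, `decompose(&demo_table(), '\u{F0000}')` yields `Decomposed::Default` even though the character has an entry, while `passthrough_eligible` still returns `false` for it.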


hsivonen · 2024-01-29T12:33:57Z

The analogous case could be supported for BMP data as well. However, the upcoming characters that need this are non-BMP characters, and a BMP character could be represented in the part of the data intended for non-BMP characters.

eggrobin

Looks plausible to me.

Ultimately though, I do not trust my ability to stare at the code of an optimized normalizer. It would be really nice if we could run the 16.0α conformance tests on this by way of #4602.

sffc · 2024-03-07T19:39:08Z

@hsivonen OK to merge?

hsivonen · 2024-03-11T15:56:39Z

Thanks. It looks like https://unicode-org.atlassian.net/browse/ICU-22586 is still open, so running the data pipeline with Unicode 16 data remains blocked.

markusicu · 2024-03-11T18:56:03Z

> It looks like https://unicode-org.atlassian.net/browse/ICU-22586 is still open

Yes. I plan to get to that before the end of this month. I have a couple of things lined up before that.

hsivonen requested a review from echeran as a code owner January 23, 2024 17:05

hsivonen requested a review from eggrobin January 23, 2024 17:05

eggrobin requested a review from markusicu January 23, 2024 17:06

eggrobin approved these changes Feb 28, 2024

Manishearth approved these changes Mar 7, 2024

hsivonen merged commit 38fe002 into unicode-org:main Mar 11, 2024
29 checks passed

hsivonen deleted the kiratrai branch March 11, 2024 15:54

hsivonen mentioned this pull request May 2, 2024

Make the normalizer work with new Unicode 16 normalization behaviors #4860

Open
