Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare for future normalization data for Unicode 16 #4538

Merged
merged 1 commit into from
Mar 11, 2024

Conversation

hsivonen
Copy link
Member

@hsivonen hsivonen commented Jan 23, 2024

ICU4X's data representation doesn't have a place for marking the first character of an expansion as a starter that combines backwards.

To accommodate the Unicode 16 characters discussed in section 5 of https://www.unicode.org/L2/L2024/24009-utc178-properties-recs.pdf I expect the least disruptive change to be that characters that can occur as the first character of a composition whose second character may occur as the first character of an expansion decomposition be represented the way non-BMP singleton decompositions are currently represented. This should exclude them from NFC passthrough eligibility without having to find a way to mark the second character.

Without this patch, these characters would be incorrectly be claimed as having a singleton decomposition (to self) in
CanonicalDecomposition::decompose().

There are no tests, because we don't actually have data like this, yet. The purpose is to enable a copy of ICU4X binary that has this code to be able to consume dynamically-loaded data from the future.

This change is low-risk in case we don't end up using the data trick anticipated by this patch.

ICU4X's data representation doesn't have a place for marking the
first character of an expansion as a starter that combines backwards.

To accommodate for the Unicode 16 characters discussed in section 5 of
https://www.unicode.org/L2/L2024/24009-utc178-properties-recs.pdf
I expect the least disruptive change to be that characters that
can occur as the first character of a composition whose second
character may occur as the first character of an expansion decomposition
be represented the way non-BMP singleton decompositions are currently
represented. This should exclude them from NFC passthrough eligibility
without having to find a way to mark the second character.

Without this patch, these characters would be incorrectly be claimed
as having a singleton decomposition (to self) in
`CanonicalDecomposition::decompose()`.

There are no tests, because we don't actually have data like this, yet.
The purpose is to enable a copy of ICU4X binary that has this code to
be able to consume dynamically-loaded data from the future.

This change is low-risk in case we don't end up using the data trick
anticipated by this patch.
@hsivonen
Copy link
Member Author

The analogous case could be supported for BMP data as well. However, the upcoming characters that need this are non-BMP characters and a BMP character could be represented in the part of the data intended for non-BMP characters.

Copy link
Member

@eggrobin eggrobin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks plausible to me.

Ultimately though, I do not trust my ability to stare at the code of an optimized normalizer. It would be really nice if we could run the 16.0α conformance tests on this by way of #4602.

@sffc
Copy link
Member

sffc commented Mar 7, 2024

@hsivonen OK to merge?

@hsivonen hsivonen merged commit 38fe002 into unicode-org:main Mar 11, 2024
29 checks passed
@hsivonen hsivonen deleted the kiratrai branch March 11, 2024 15:54
@hsivonen
Copy link
Member Author

Thanks. It looks like https://unicode-org.atlassian.net/browse/ICU-22586 is still open, so running the data pipeline with Unicode 16 data remains blocked.

@markusicu
Copy link
Member

It looks like https://unicode-org.atlassian.net/browse/ICU-22586 is still open

Yes. I plan to get to that before the end of this month. I have a couple of things lined up before that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants