-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prepare for future normalization data for Unicode 16 #4538
Conversation
ICU4X's data representation doesn't have a place for marking the first character of an expansion as a starter that combines backwards. To accommodate for the Unicode 16 characters discussed in section 5 of https://www.unicode.org/L2/L2024/24009-utc178-properties-recs.pdf I expect the least disruptive change to be that characters that can occur as the first character of a composition whose second character may occur as the first character of an expansion decomposition be represented the way non-BMP singleton decompositions are currently represented. This should exclude them from NFC passthrough eligibility without having to find a way to mark the second character. Without this patch, these characters would be incorrectly be claimed as having a singleton decomposition (to self) in `CanonicalDecomposition::decompose()`. There are no tests, because we don't actually have data like this, yet. The purpose is to enable a copy of ICU4X binary that has this code to be able to consume dynamically-loaded data from the future. This change is low-risk in case we don't end up using the data trick anticipated by this patch.
The analogous case could be supported for BMP data as well. However, the upcoming characters that need this are non-BMP characters and a BMP character could be represented in the part of the data intended for non-BMP characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks plausible to me.
Ultimately though, I do not trust my ability to stare at the code of an optimized normalizer. It would be really nice if we could run the 16.0α conformance tests on this by way of #4602.
@hsivonen OK to merge? |
Thanks. It looks like https://unicode-org.atlassian.net/browse/ICU-22586 is still open, so running the data pipeline with Unicode 16 data remains blocked. |
Yes. I plan to get to that before the end of this month. I have a couple of things lined up before that. |
ICU4X's data representation doesn't have a place for marking the first character of an expansion as a starter that combines backwards.
To accommodate the Unicode 16 characters discussed in section 5 of https://www.unicode.org/L2/L2024/24009-utc178-properties-recs.pdf I expect the least disruptive change to be that characters that can occur as the first character of a composition whose second character may occur as the first character of an expansion decomposition be represented the way non-BMP singleton decompositions are currently represented. This should exclude them from NFC passthrough eligibility without having to find a way to mark the second character.
Without this patch, these characters would be incorrectly be claimed as having a singleton decomposition (to self) in
CanonicalDecomposition::decompose()
.There are no tests, because we don't actually have data like this, yet. The purpose is to enable a copy of ICU4X binary that has this code to be able to consume dynamically-loaded data from the future.
This change is low-risk in case we don't end up using the data trick anticipated by this patch.