Update uppercase lowercase to use UnicodeData.txt vs CaseFolding.txt #1611

Merged
merged 5 commits into from Jul 18, 2019

Conversation


@ekrich ekrich commented May 29, 2019

This is a long overdue update. After learning more about Unicode, it became apparent that UnicodeData.txt is a more appropriate source than CaseFolding.txt. We are in the process of improving the transformation code, which is public at https://github.com/ekrich/scala-unicode, and it also lets us update easily to newer versions of Unicode. Unicode 10.0.0, which is used in JDK 11, has been transformed using the same code.

The encoding compression is as follows:

Points    Number    Ranges
Lowers    1096      237
Uppers    1092      232
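
As a rough illustration of how roughly a thousand code points can collapse into a few hundred ranges, here is a minimal sketch. It assumes that a range groups consecutive code points sharing the same offset to their case counterpart. The names `CaseRangeSketch`, `compress`, and `jdkUpperMappings` are hypothetical, and the encoding actually used in this PR may differ (for example, it may treat alternating upper/lower pairs specially).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CaseRangeSketch {
    // Each range is {firstCodePoint, lastCodePoint, delta}, where
    // mapped = codePoint + delta for every code point in the range.
    // Assumption: this layout is illustrative, not the PR's encoding.
    static List<int[]> compress(TreeMap<Integer, Integer> mapping) {
        List<int[]> ranges = new ArrayList<>();
        int[] current = null;
        for (Map.Entry<Integer, Integer> e : mapping.entrySet()) {
            int cp = e.getKey();
            int delta = e.getValue() - cp;
            if (current != null && cp == current[1] + 1 && delta == current[2]) {
                current[1] = cp; // extend the open range by one code point
            } else {
                current = new int[] {cp, cp, delta}; // start a new range
                ranges.add(current);
            }
        }
        return ranges;
    }

    // Hypothetical helper: collect code points in [from, to] that the JDK
    // maps to a different uppercase code point.
    static TreeMap<Integer, Integer> jdkUpperMappings(int from, int to) {
        TreeMap<Integer, Integer> m = new TreeMap<>();
        for (int cp = from; cp <= to; cp++) {
            int up = Character.toUpperCase(cp);
            if (up != cp) {
                m.put(cp, up);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> ascii = jdkUpperMappings(0x00, 0x7F);
        List<int[]> ranges = compress(ascii);
        // ASCII a-z all share the offset -32, so they form a single range.
        System.out.println(ascii.size() + " points, " + ranges.size() + " range(s)");
    }
}
```

Storing one offset per range keeps the lookup table small, at the cost of a search over the ranges on each lookup.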

@ekrich ekrich changed the title [WIP] Update uppercase lowercase to use UnicodeData.txt vs CaseFolding.txt Update uppercase lowercase to use UnicodeData.txt vs CaseFolding.txt May 31, 2019

ekrich commented May 31, 2019

Analysis Output:

toUpperCase cp => sn vs jdk
num-diffs=53
Dž/1c5/453 => same vs DŽ/1c4/452
Lj/1c8/456 => same vs LJ/1c7/455
Nj/1cb/459 => same vs NJ/1ca/458
Dz/1f2/498 => same vs DZ/1f1/497
...

For the other 49 differences, our encoding has an uppercase mapping and the JVM does no casing.

Edit: 2019-07-17

The four cases have no uppercase encoding in the Unicode DB. The JVM performs a TitleCase transform to the correct code point for TitleCase. These code points are not in the SpecialCasing.txt file either, so I think the JVM encoding is incorrect. From my understanding of doing a TitleCase on a String, the first character should be TitleCased or UpperCased, in that order. I do not think you should TitleCase a character by itself as a substitute for UpperCase.

The above is incorrect. These 4 characters are special and have uppercase, lowercase, and titlecase mappings in the UnicodeData file.
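
To make the JDK side of these four cases concrete, here is a small sketch that prints what `Character.toUpperCase`, `Character.toLowerCase`, and `Character.toTitleCase` return for each of them. The class name `TitlecaseDigraphs` and the helper `describe` are made up for illustration; only the `java.lang.Character` calls are real API.

```java
class TitlecaseDigraphs {
    // The four code points from the analysis output above.
    static final int[] TITLECASE = {0x01C5, 0x01C8, 0x01CB, 0x01F2};

    // Hypothetical helper: summarize the three JDK case mappings in hex.
    static String describe(int cp) {
        return String.format("%x => upper %x, lower %x, title %x",
            cp,
            Character.toUpperCase(cp),
            Character.toLowerCase(cp),
            Character.toTitleCase(cp));
    }

    public static void main(String[] args) {
        for (int cp : TITLECASE) {
            boolean isTitle = Character.getType(cp) == Character.TITLECASE_LETTER;
            System.out.println(describe(cp) + ", titlecase letter: " + isTitle);
        }
    }
}
```

On the JDK, each of these is a titlecase letter whose uppercase form is the preceding code point and whose lowercase form is the following one, which is why a mapping that leaves them unchanged shows up as a difference.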

toLowerCase cp => sn vs jdk
num-diffs=49
...

For all differences, our encoding has a lowercase mapping and the JVM does no casing.

The previous encoding had 70 differences for toUpperCase and 63 for toLowerCase so this is definitely an improvement.
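
For readers who want to reproduce this kind of count, the following is a minimal sketch of a comparison loop over every code point. It is not the tool that produced the output above. `CaseDiffSketch`, `diffs`, and `candidateUpper` are hypothetical names, and `candidateUpper` is only a stand-in for an implementation under test: it follows the JDK everywhere except the four titlecase code points, which it leaves unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

class CaseDiffSketch {
    // Compare two code point mappings over the whole Unicode range and
    // report each disagreement as "cp(hex)/cp(dec) => candidate vs reference".
    static List<String> diffs(IntUnaryOperator candidate, IntUnaryOperator reference) {
        List<String> out = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int a = candidate.applyAsInt(cp);
            int b = reference.applyAsInt(cp);
            if (a != b) {
                out.add(String.format("%x/%d => %x vs %x", cp, cp, a, b));
            }
        }
        return out;
    }

    // Stand-in for an implementation under test (an assumption, not the PR's code).
    static int candidateUpper(int cp) {
        switch (cp) {
            case 0x01C5:
            case 0x01C8:
            case 0x01CB:
            case 0x01F2:
                return cp; // leave the titlecase digraphs unchanged
            default:
                return Character.toUpperCase(cp);
        }
    }

    public static void main(String[] args) {
        List<String> d = diffs(CaseDiffSketch::candidateUpper, cp -> Character.toUpperCase(cp));
        System.out.println("num-diffs=" + d.size());
        for (String line : d) {
            System.out.println(line);
        }
    }
}
```

Because the stand-in deviates from the JDK only at those four code points, the loop reports exactly four differences regardless of which Unicode version the running JDK ships.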

Based on these findings, this is ready for review. Edit: clearly, this was not ready for review at this point.


ekrich commented May 31, 2019

Just to confirm for myself that we are really doing well, I used UnicodeData.txt from version 6.3.0 and got the following results.

toUpperCase cp => sn vs jdk
num-diffs=4
Dž/1c5/453 => same vs DŽ/1c4/452
Lj/1c8/456 => same vs LJ/1c7/455
Nj/1cb/459 => same vs NJ/1ca/458
Dz/1f2/498 => same vs DZ/1f1/497
toLowerCase cp => sn vs jdk
num-diffs=0

So the additional differences are due to more code points in Unicode 7.0.0. I can update this PR to use the 6.3.0 data so we more closely match JDK 8, which uses 6.2.0. Then, when we decide to track the next production JDK version, 11, we can upgrade to Unicode 10.0.0.


ekrich commented Jun 1, 2019

Update: the 4 code points above need to be handled the way the JDK handles them, so I will change some code to cover that special case. I think it makes sense to have 100% parity with JDK 8, which we are currently tracking.


ekrich commented Jun 1, 2019

Now we match JDK8 exactly.

toUpperCase cp => sn vs jdk
num-diffs=0

toLowerCase cp => sn vs jdk
num-diffs=0

Generation output

[info] Running org.ekrich.unicode.CaseUpperLower 6.3.0
Path: /6.3.0/UCD/UnicodeData.txt
Total Records: 24434
Full DB: 24434
Upper/Lower: 2090
Lower <compat>: 36
Upper <compat>: 28
Lowers: 1051
Uppers: 1043
lower compress ranges: 233
upper compress ranges: 224
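
The generator reads its case mappings from UnicodeData.txt. As a hedged sketch of what reading one record involves (not the code in scala-unicode), the example below splits a semicolon-separated line and picks out the simple uppercase, lowercase, and titlecase mapping fields at indices 12, 13, and 14. `UnicodeDataLineSketch`, `parse`, and `hex` are hypothetical names, and the two sample records are written out by hand in the UnicodeData.txt layout.

```java
class UnicodeDataLineSketch {
    // Returns {codePoint, upper, lower, title}, using -1 where a mapping
    // field is empty. Field positions follow the UnicodeData.txt layout:
    // 0 = code point, 12 = simple uppercase, 13 = simple lowercase,
    // 14 = simple titlecase.
    static int[] parse(String line) {
        String[] f = line.split(";", -1); // -1 keeps trailing empty fields
        if (f.length != 15) {
            throw new IllegalArgumentException("expected 15 fields, got " + f.length);
        }
        return new int[] {hex(f[0]), hex(f[12]), hex(f[13]), hex(f[14])};
    }

    static int hex(String s) {
        return s.isEmpty() ? -1 : Integer.parseInt(s, 16);
    }

    public static void main(String[] args) {
        String[] samples = {
            "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
            "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;"
                + "<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5"
        };
        for (String s : samples) {
            int[] r = parse(s);
            System.out.println(String.format("%x: upper %x, lower %x, title %x",
                r[0], r[1], r[2], r[3]));
        }
    }
}
```

Note that an empty mapping field prints as `ffffffff` here because -1 is formatted as hex; a real generator would more likely skip such records or treat the code point as mapping to itself.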


ekrich commented Jun 7, 2019

@densh Ready for review 😃


@densh densh left a comment

LGTM, good job @ekrich ! 👍

@densh densh merged commit d1c01d0 into scala-native:master Jul 18, 2019
@ekrich ekrich deleted the topic/unicode branch July 18, 2019 21:47
ekrich added a commit to ekrich/scala-native that referenced this pull request May 21, 2021
…ing.txt (scala-native#1611)

* Update uppercase lowercase to use UnicodeData.txt vs CaseFolding.txt

* Separate and update tests and simplify lookup code for new encoding

* Does not use SpecialCasing.txt

* Full parity with JDK8 using Unicode 6.3.0

* Remove test case above Unicode 6.3 upper/lower case range