Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String to{Lower, Upper}Case handles Unicode special, non-locale dependent, casing rules #2098

Conversation

WojciechMazur
Copy link
Contributor

This PR continues work originated in #2095 Scala.js port, becouse it have not contained handling for Unicode special handling for characters not being locale dpendent, as it is probably implemented natively in JS.
Changes include handling of greek sigma letter, latin I with dot, and other characters specified in SpecialCasing.txt

Before converting from draft PR it needs to be rebased onto changes in #2095.

@WojciechMazur WojciechMazur force-pushed the feature/string_toLUCase_special_handling branch from aedbd20 to 4f0c1d5 Compare December 30, 2020 16:04
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
// Outside word boundary and not found cased
if (c.isWhitespace) false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does isWhitespace come from? I see no reference to whitespace in the Unicode spec for Final_Sigma. However I see the concept of "case-ignorable", which I don't see in the code.

We should stick 100% to the Unicode spec of Final_Sigma, and comment everything with the references that we're using to make the code directly comparable to the spec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it was a kind of simplification used to match with JVM, eg. matching only characters within word boundary. I've replaced it with the correct implementation based on Unicode specification. To make it possible I had to add two additional methods to Characters and associated arrays of range indices.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it was a kind of simplification used to match with JVM, eg. matching only characters within word boundary.

Hum, wait. Do you mean that the JVM's toLowerCase() behaves differently than the Unicode spec? Do you have an example input string where it's not the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, by simplification I've meant simplification on our side, eg. test ```assertEquals("dσς aσς bσc", "DΣΣ AΣΣ BΣC".toLowerCase())`` was created as I've seen that only last sigma in each word should be changed and didn't know or interpretted incorectlly caseIgnorable rule from specification.

In each case I've tested on JVM (for Final_Sigma rule) incompatibilites with Unicode spec comes from implementation of different Unicode version in given JDK. For example there are 181 more "cased" symbols in Unicode 13.0.0 when comparing to 10.0.0 used in JDK 11. I belive it's fixed in newer versions.

@WojciechMazur WojciechMazur force-pushed the feature/string_toLUCase_special_handling branch from 4f0c1d5 to 5827534 Compare January 5, 2021 11:46
@WojciechMazur WojciechMazur marked this pull request as ready for review January 5, 2021 12:05
Copy link
Collaborator

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better now, thank you. I still have one concern over supplementary code points.

javalib/src/main/scala/java/lang/Character.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/Character.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/Character.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/Character.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
@inline
def nextIndex(currentIdx: Int, currentChar: Char): Int = {
val charSize = Character.charCount(currentChar)
if (iterateForward) currentIdx + charSize
else currentIdx - charSize
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work for surrogate pairs, i.e., supplementary code points. You'll need something similar to skipCharsWithCombiningClassOtherThanNoneOrAboveForwards and its backwards counterpart. I suggest you follow the same pattern of having a pair of skipCaseIgnorableCharsForwards and skipCaseIgnorableCharsBackwards, written after the above for CombiningClass, and then use them in a similar way as is done in afterSoftDotted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When applying this change I've found out that we had a bug in skipCharsXY methods. Port from Scala.js have not took into account Strings created using String.substring method. I'm not sure had it works in Scala.js but in Scala Native it uses the same underlying byte array, but with different offset and count values to determinate start or end of string.

Following missing test would fail:

assertEquals("\u0307\u0301I",
                "iíìĩı i\u0307\u0301i".substring(7).toUpperCase(Lithuanian)) // equivalent of "\u0307\u0301i"
assertEquals("I\u0301I",
                "iíìĩı i\u0307\u0301i".substring(6).toUpperCase(Lithuanian))  // equivalent of "i\u0307\u0301i"

I've decided to put fix in this PR are they're strongly related and also main loop of skipCharsXY is used also for case ignorable characters

javalib/src/main/scala/java/lang/StringSpecialCasing.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
// use offset + count instead of length to handle Strings created using subString method
val len = offset + count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length(), and codePointAt() should take care of offset. We shouldn't have to do this here. If we were manipulating value directly, we would have to do this; but when we're using length(), charAt, etc., we must not take offset into account, because those methods already do that. Perhaps codePointAt is wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted changes related to offset and count, I looks that problem with operations on substring was occouring due to some other bug which was already fixed in a meantime. Thanks you for for spotting this.

javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
Copy link
Collaborator

@sjrd sjrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only two trivial comments below. Otherwise LGTM.

javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
javalib/src/main/scala/java/lang/String.scala Outdated Show resolved Hide resolved
@WojciechMazur WojciechMazur merged commit 0d74236 into scala-native:master Jan 7, 2021
@WojciechMazur WojciechMazur deleted the feature/string_toLUCase_special_handling branch February 15, 2021 09:09
ekrich pushed a commit to ekrich/scala-native that referenced this pull request May 21, 2021
…e dependent, casing rules (scala-native#2098)

This PR continues work originated in scala-native#2095 Scala.js port, becouse it have not contained handling for Unicode special handling for characters not being locale dpendent, as it is probably implemented natively in JS.

Changes include handling of greek sigma letter, latin I with dot, and other characters specified in Unicode 13.0.0 SpecialCasing.txt. In order to make this PR fully compatible with Unicode specification there was introduced array of `cased` and `caseIgnorable` character indices used to determinate proper handling.
WojciechMazur added a commit to WojciechMazur/scala-native that referenced this pull request Aug 25, 2021
…e dependent, casing rules (scala-native#2098)

This PR continues work originated in scala-native#2095 Scala.js port, becouse it have not contained handling for Unicode special handling for characters not being locale dpendent, as it is probably implemented natively in JS.

Changes include handling of greek sigma letter, latin I with dot, and other characters specified in Unicode 13.0.0 SpecialCasing.txt. In order to make this PR fully compatible with Unicode specification there was introduced array of `cased` and `caseIgnorable` character indices used to determinate proper handling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants