Fix some details in String and Character #4222

sjrd · 2020-10-04T14:44:33Z

Fix a corner case of Character.toLowerCase
Full compliance of String.compareTo, equalsIgnoreCase and compareToIgnoreCase
Fix compliance of String.trim()
Rework all methods of Character dealing with code units and surrogates
Implement String.format(Locale, ...)
Cleanup

gzm0 · 2020-10-04T14:51:37Z

test-suite/shared/src/test/scala/org/scalajs/testsuite/javalib/lang/CharacterTest.scala

@@ -426,6 +426,26 @@ class CharacterTest {
    assertEquals(0x10ffff, Character.toLowerCase(0x10ffff)) // 􏿿 => 􏿿
  }

+  @Test def toLowerCase_CodePoint_special_cases(): Unit = {
+    assertEquals(0x0069, Character.toLowerCase(0x0130))
+  }


This test seems duplicate. Intentional?

The two test methods are intentional, because they test logically different things. In the case of toLowerCase(), they happen to test the same set of code points. I have added comments to explain the situation.

gzm0 · 2020-10-05T07:25:16Z

javalanglib/src/main/scala/java/lang/Character.scala

+  private final val HighSurrogateAddValue = 0x10000 >> HighSurrogateShift
+
+  @inline def isValidCodePoint(codePoint: Int): scala.Boolean =
+    (codePoint >>> 16) <= 0x10


I'm not sure I understand the point of making bit magic changes here. IMO the previous implementation was very easy to understand and verify. This one is hard to understand.

As far as speed is concerned, I think we need some evidence that this is any faster. For example, clang and GCC produce the exact same assembly for both of these: https://godbolt.org/z/3rKva7. IMHO it is reasonable to assume that a decent JIT will do the same.

A similar comment applies to isBmpCodePoint.
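For readers comparing the two styles, here is a minimal sketch of both variants of each check side by side. The object and method names (`CodePointChecks`, `isValidNaive`, and so on) are hypothetical and only serve the illustration; the expressions themselves are the ones quoted in this thread. The sketch only checks that the variants agree on a handful of boundary values; it says nothing about speed.

```scala
object CodePointChecks {
  // Range-comparison forms (the "easy to understand" variants).
  def isValidNaive(cp: Int): Boolean = cp >= 0 && cp <= 0x10ffff
  def isBmpNaive(cp: Int): Boolean = cp >= 0 && cp <= 0xffff

  // Bit-manipulation forms.
  // An unsigned shift turns any negative input into a value >= 0x8000,
  // so a single comparison rejects both negative and too-large inputs.
  def isValidShift(cp: Int): Boolean = (cp >>> 16) <= 0x10
  // Any bit set above the low 16 bits (including the sign bit) means "not BMP".
  def isBmpMask(cp: Int): Boolean = (cp & ~0xffff) == 0

  // Boundary values on both sides of each range.
  val samples: Seq[Int] =
    Seq(Int.MinValue, -1, 0, 0xffff, 0x10000, 0x10ffff, 0x110000, Int.MaxValue)

  // True if both variants of both checks agree on every sample.
  def allAgree: Boolean = samples.forall { cp =>
    isValidNaive(cp) == isValidShift(cp) && isBmpNaive(cp) == isBmpMask(cp)
  }
}
```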

JS JITs are not as smart as C++ compilers, because they have to trade their own speed against that of the run-time.

This implementation of isValidCodePoint compiles down to:

    0x1c43e1d42c27    87  41c1e810        shrl r8, 16
    0x1c43e1d42c2b    8b  4183f810        cmpl r8,0x10
    0x1c43e1d42c2f    8f  0f878a000000    ja 0x1c43e1d42cbf  <+0x11f>

while the "naive" (codePoint >= 0) && (codePoint <= 0x10ffff) compiles to

    0x30cdf2ac2c27    87  4183f800          cmpl r8,0x0
    0x30cdf2ac2c2b    8b  0f8c97000000      jl 0x30cdf2ac2cc8  <+0x128>
    0x30cdf2ac2c31    91  4181f8ffff1000    cmpl r8,0x10ffff
    0x30cdf2ac2c38    98  0f8f8a000000      jg 0x30cdf2ac2cc8  <+0x128>

There is clearly only one comparison in the former, versus two in the latter.

Similarly, (codePoint & ~0xffff) == 0 compiles to

    0x3a7b5d182d87    87  41f7c00000ffff    testl r8,0xffff0000
    0x3a7b5d182d8e    8e  0f858a000000      jnz 0x3a7b5d182e1e  <+0x11e>

while (codePoint >= 0) && (codePoint <= 0xffff) compiles to

    0x1a2679602c27    87  4183f800          cmpl r8,0x0
    0x1a2679602c2b    8b  0f8c97000000      jl 0x1a2679602cc8  <+0x128>
    0x1a2679602c31    91  4181f8ffff0000    cmpl r8,0xffff
    0x1a2679602c38    98  0f8f8a000000      jg 0x1a2679602cc8  <+0x128>

which is this time one test instruction versus two cmp instructions.

That said, the burden of proof is probably not worth it, so I've reverted to the dumb comparisons.

That's really cool. How did you get these? (I was looking for something like that)

$ node --print-code whatever.js
but you need to make sure that whatever.js executes the thing you want to inspect enough times to actually trigger its compilation. Executing once will just execute it from the bytecode, without JIT compilation.

OK. I saw this on SO. But it looked way too annoying. I guess it is :-/

gzm0 · 2020-10-05T09:07:31Z

ci/checksizes.sh

@@ -46,7 +46,7 @@ case $FULLVER in
    REVERSI_OPT_GZ_EXPECTEDSIZE=33000
    ;;
  2.13.3)
-    REVERSI_PREOPT_EXPECTEDSIZE=686000
+    REVERSI_PREOPT_EXPECTEDSIZE=687000


Do you have an idea what this is caused by?

Yes, this was caused by the change of implementation of j.l.Character.hashCode() = Character.hashCode(charValue()), which adds ~20 characters to the output because of one more intermediate local variable. The size for 2.13.3 happened to be 5 bytes away from the limit, so those 20 characters made it tip over into the next category. But there's nothing to worry about.

It turns out that U+0130 (`İ`) is a special case for `Character.toLowerCase`, because `"İ".toLowerCase()` returns two code points: `"i\u0307"`, i.e., a lower-case ASCII `i` followed by a Dot Above. As demonstrated by the script that generates tests, this is the only code point for which we need a special case.
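The difference between the code-point API and the string API for U+0130 can be sketched as follows. This is an illustration of the behavior described above, assuming a JDK-compliant `java.lang`; the object and member names are hypothetical, and `Locale.ROOT` is passed explicitly so that a Turkish or Azeri default locale does not change the result.

```scala
import java.util.Locale

object DottedCapitalI {
  // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
  val dottedI: Int = 0x0130

  // Code-point API: the result is always a single code point.
  def lowerCodePoint: Int = Character.toLowerCase(dottedI)

  // String API: the result may consist of several code points.
  def lowerString: String =
    new String(Character.toChars(dottedI)).toLowerCase(Locale.ROOT)
}
```

The code-point variant yields `0x0069` (as asserted by the test in this PR), while the string variant yields the two-code-point string `"i\u0307"`.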

The precise result of `String.compareTo` is actually specified, so testing only `< 0` or `> 0` is not good enough. We now test the precise results, and adapt the implementation to conform to them. Likewise, the way `equalsIgnoreCase` and `compareToIgnoreCase` compare characters is specified as a char-by-char normalization. We fix our implementations to conform to the specified comparison and adapt our tests accordingly.
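As a rough sketch of what "precise result" means here: the JDK documentation specifies that `compareTo` returns the difference of the two chars at the first index where the strings differ, or the difference of the lengths if there is no such index, and that the case-insensitive comparisons normalize each char with `Character.toLowerCase(Character.toUpperCase(c))`. The code below is an illustration under those assumptions, not the implementation from this PR; all names in it are hypothetical.

```scala
object CompareSketch {
  // Char-by-char normalization used by the case-insensitive comparisons.
  def foldCase(c: Char): Char =
    Character.toLowerCase(Character.toUpperCase(c))

  // Shared loop: difference of the first differing (normalized) chars,
  // otherwise the difference of the lengths.
  private def compareWith(a: String, b: String, norm: Char => Char): Int = {
    val minLen = Math.min(a.length, b.length)
    var i = 0
    while (i < minLen) {
      val diff = norm(a.charAt(i)) - norm(b.charAt(i))
      if (diff != 0)
        return diff
      i += 1
    }
    a.length - b.length
  }

  def compareToSketch(a: String, b: String): Int =
    compareWith(a, b, c => c)

  def compareToIgnoreCaseSketch(a: String, b: String): Int =
    compareWith(a, b, foldCase)

  def equalsIgnoreCaseSketch(a: String, b: String): Boolean =
    a.length == b.length && compareToIgnoreCaseSketch(a, b) == 0
}
```

For example, `compareToSketch("a", "c")` is exactly `-2`, not merely some negative number, which is why a test that only checks the sign would miss a non-compliant implementation.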

The JDK `trim()` specifies that characters '\u0000' through '\u0020' (' ') and only those are considered whitespace, but the JavaScript `trim()` function has a different definition. It is therefore incorrect to rely on JavaScript's `trim()`. Instead we must implement the function ourselves.
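A hand-written `trim()` following the JDK definition could look roughly like the sketch below. It is only an illustration of the rule stated above (strip chars up to and including U+0020 from both ends, and nothing else); the object and method names are hypothetical and this is not necessarily the code that was merged.

```scala
object TrimSketch {
  def trimSketch(s: String): String = {
    var start = 0
    var end = s.length
    // Skip leading chars in the range U+0000 to U+0020.
    while (start < end && s.charAt(start) <= ' ')
      start += 1
    // Skip trailing chars in the same range.
    while (end > start && s.charAt(end - 1) <= ' ')
      end -= 1
    s.substring(start, end)
  }
}
```

Under this definition a control character such as U+0000 is trimmed, whereas a no-break space (U+00A0) is kept, which is where the JDK definition and the JavaScript one diverge.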

We group all the methods of `Character` that manipulate code points and code units at the beginning of the object (following their order in the JavaDoc). We add the following methods, which complete the set of low-level methods in `Character`:

* `hashCode(Char)`
* `toString(Int)` (added in JDK 11)
* `highSurrogate(Int)`
* `lowSurrogate(Int)`
* `reverseBytes(Char)`
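To give an idea of what some of these low-level methods compute, here is a minimal sketch of `highSurrogate`, `lowSurrogate` and `reverseBytes`. The constant names `HighSurrogateShift` and `HighSurrogateAddValue` appear in the diff under review; their use here, the value `10` for the shift, and every other name are assumptions made for the illustration. The surrogate functions are only meaningful for supplementary code points (U+10000 and above).

```scala
object SurrogateSketch {
  private final val HighSurrogateShift = 10 // assumed: 10 payload bits per surrogate
  private final val MinHighSurrogate = 0xD800
  private final val MinLowSurrogate = 0xDC00
  private final val HighSurrogateAddValue = 0x10000 >> HighSurrogateShift

  // Top 10 bits of (codePoint - 0x10000), offset into the high-surrogate range.
  def highSurrogate(codePoint: Int): Char =
    (MinHighSurrogate - HighSurrogateAddValue +
        (codePoint >> HighSurrogateShift)).toChar

  // Low 10 bits of the code point, offset into the low-surrogate range.
  def lowSurrogate(codePoint: Int): Char =
    (MinLowSurrogate | (codePoint & 0x3ff)).toChar

  // Swap the two bytes of a 16-bit code unit.
  def reverseBytes(ch: Char): Char =
    ((ch << 8) | (ch >> 8)).toChar
}
```

For instance, U+1F600 splits into the pair `0xD83D`, `0xDE00`, and `reverseBytes` maps `0x1234` to `0x3412`.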

Implement `String.format(Locale, ...)`, and simplify the implementation of the existing locale-insensitive `String.format(...)`.

Remove `String.SpecialJSStringOps`. It was only used for `toLowerCase()` and `toUpperCase()` at this point. It seems clearer to simply inline the methods using `js.Dynamic` instead (as already done in some other methods, such as `length()` and `charAt()`).

sjrd requested a review from gzm0 October 4, 2020 14:44

sjrd force-pushed the fix-unicode-details branch from cb6cfa5 to 42cb7ff · October 5, 2020 07:55

gzm0 requested changes Oct 5, 2020

sjrd force-pushed the fix-unicode-details branch from 42cb7ff to 61b2d95 · October 5, 2020 12:02

sjrd added 6 commits October 5, 2020 14:06

Implement String.format(Locale, ...).

d6baf91

And simplify the implementation of the existing locale-insensitive `String.format(...)`.

Remove String.SpecialJSStringOps.

1df3189

It was only used for `toLowerCase()` and `toUpperCase()` at this point. It seems clearer to simply inline the methods using `js.Dynamic` instead (as already done in some other methods, such as `length()` and `charAt()`).

sjrd force-pushed the fix-unicode-details branch from 61b2d95 to 1df3189 · October 5, 2020 12:06

gzm0 approved these changes Oct 5, 2020

gzm0 merged commit face822 into scala-js:master Oct 5, 2020

sjrd deleted the fix-unicode-details branch October 5, 2020 15:14
