Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some details in String and Character #4222

Merged
merged 6 commits into from Oct 5, 2020

Conversation

sjrd
Copy link
Member

@sjrd sjrd commented Oct 4, 2020

  • Fix a corner case of Character.toLowerCase
  • Full compliance of String.compareTo, equalsIgnoreCase and compareToIgnoreCase
  • Fix compliance of String.trim()
  • Rework all methods of Character dealing with code units and surrogates
  • Implement String.format(Locale, ...)
  • Cleanup

@@ -426,6 +426,26 @@ class CharacterTest {
assertEquals(0x10ffff, Character.toLowerCase(0x10ffff)) // 􏿿 => 􏿿
}

@Test def toLowerCase_CodePoint_special_cases(): Unit = {
assertEquals(0x0069, Character.toLowerCase(0x0130))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems duplicate. Intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two test methods are intentional, because they test logically different things. In the case of toLowerCase(), they happen to test the same set of code points. I have added comments to explain the situation.

private final val HighSurrogateAddValue = 0x10000 >> HighSurrogateShift

@inline def isValidCodePoint(codePoint: Int): scala.Boolean =
(codePoint >>> 16) <= 0x10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the point of making bit magic changes here. IMO the previous implementation was very easy to understand and verify. This one is hard to understand.

As far as speed is concerned, I think we need some evidence that this is any faster. For example, clang and GCC produce the exact same assembly for both of these: https://godbolt.org/z/3rKva7. IMHO it is reasonable to assume that a decent JIT will do the same.

A similar comment applies to isBmpCodePoint.

Copy link
Member Author

@sjrd sjrd Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JS JITs are not as smart as C++ compilers, because they have to trade their own speed against that of the run-time.

This implementation of isValidCodePoint compiles down to:

0x1c43e1d42c27    87  41c1e810       shrl r8, 16
0x1c43e1d42c2b    8b  4183f810       cmpl r8,0x10
0x1c43e1d42c2f    8f  0f878a000000   ja 0x1c43e1d42cbf  <+0x11f>

while the "naive" (codePoint >= 0) && (codePoint <= 0x10ffff) compiles to

0x30cdf2ac2c27    87  4183f800       cmpl r8,0x0
0x30cdf2ac2c2b    8b  0f8c97000000   jl 0x30cdf2ac2cc8  <+0x128>
0x30cdf2ac2c31    91  4181f8ffff1000 cmpl r8,0x10ffff
0x30cdf2ac2c38    98  0f8f8a000000   jg 0x30cdf2ac2cc8  <+0x128>

There is clearly only one comparison in the former, versus two in the latter.

Similarly, (codePoint & ~0xffff) == 0 compiles to

0x3a7b5d182d87    87  41f7c00000ffff testl r8,0xffff0000
0x3a7b5d182d8e    8e  0f858a000000   jnz 0x3a7b5d182e1e  <+0x11e>

while (codePoint >= 0) && (codePoint <= 0xffff) compiles to

0x1a2679602c27    87  4183f800       cmpl r8,0x0
0x1a2679602c2b    8b  0f8c97000000   jl 0x1a2679602cc8  <+0x128>
0x1a2679602c31    91  4181f8ffff0000 cmpl r8,0xffff
0x1a2679602c38    98  0f8f8a000000   jg 0x1a2679602cc8  <+0x128>

which is this time one test instruction versus two cmp instructions.

That said, the burden of proof is probably not worth it, so I've reverted to the dumb comparisons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's really cool. How did you get these? (I was looking for something like that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ node --print-code whatever.js
but you need to make sure that whatever.js executes the thing you want to inspect enough times to actually trigger its compilation. Executing once will just execute it from the bytecode, without JIT compilation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I saw this on SO. But it looked way too annoying. I guess it is :-/

@@ -46,7 +46,7 @@ case $FULLVER in
REVERSI_OPT_GZ_EXPECTEDSIZE=33000
;;
2.13.3)
REVERSI_PREOPT_EXPECTEDSIZE=686000
REVERSI_PREOPT_EXPECTEDSIZE=687000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an idea what this is caused by?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was caused the change of implementation of j.l.Character.hashCode() = Character.hashCode(charValue()), which adds ~20 characters in the output because of one more intermediate local variable. The size of 2.13.3 happened to be 5 bytes away from the limit, so those 20 characters made it topple in the next category. But there's nothing to worry about.

It turns out this is a special case because `"İ".toLowerCase()`
returns two code points: `"i\u0307"`, i.e., a lower-case ASCII `i`
followed by a Dot Above.

As demonstrated by the script that generates tests, this is the
only code point for which we need a special case.
The precise result `String.compareTo` is actually specified, so
that testing only `< 0` or `> 0` is not good enough. We now test
the precise results, and adapt the implementation to conform to
them.

Likewise, the way `equalsIgnoreCase` and `compareToIgnoreCase`
compare characters is specified as a char-by-char normalization.
We fix our implementations to conform to the specified comparison
and adapt our tests accordingly.
The JDK `trim()` specifies that characters '\u0000' through
'\u0020' (' ') and only those are considered whitespace, but the
JavaScript `trim()` function has a different definition. It is
therefore incorrect to rely on JavaScript's `trim()`. Instead we
must implement the function ourselves.
We group all the methods of `Character` that manipulate code points
and code units at the beginning of the object (following their
order in the JavaDoc).

We add the following methods, which complete the set of low-level
methods in `Character`:

* `hashCode(Char)`
* `toString(Int)` (added in JDK 11)
* `highSurrogate(Int)`
* `lowSurrogate(Int)`
* `reverseBytes(Char)`
And simplify the implementation of the existing
locale-insensitive `String.format(...)`.
It was only used for `toLowerCase()` and `toUpperCase()` at this
point. It seems clearer to simply inline the methods using
`js.Dynamic` instead (as already done in some other methods, such
as `length()` and `charAt()`).
@gzm0 gzm0 merged commit face822 into scala-js:master Oct 5, 2020
@sjrd sjrd deleted the fix-unicode-details branch October 5, 2020 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants