New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some details in String and Character #4222
Conversation
cb6cfa5
to
42cb7ff
Compare
@@ -426,6 +426,26 @@ class CharacterTest { | |||
assertEquals(0x10ffff, Character.toLowerCase(0x10ffff)) // => | |||
} | |||
|
|||
@Test def toLowerCase_CodePoint_special_cases(): Unit = { | |||
assertEquals(0x0069, Character.toLowerCase(0x0130)) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test seems duplicate. Intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two test methods are intentional, because they test logically different things. In the case of toLowerCase()
, they happen to test the same set of code points. I have added comments to explain the situation.
private final val HighSurrogateAddValue = 0x10000 >> HighSurrogateShift | ||
|
||
@inline def isValidCodePoint(codePoint: Int): scala.Boolean = | ||
(codePoint >>> 16) <= 0x10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the point of making bit magic changes here. IMO the previous implementation was very easy to understand and verify. This one is hard to understand.
As far as speed is concerned, I think we need some evidence that this is any faster. For example, clang and GCC produce the exact same assembly for both of these: https://godbolt.org/z/3rKva7. IMHO it is reasonable to assume that a decent JIT will do the same.
A similar comment applies to isBmpCodePoint
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JS JITs are not as smart as C++ compilers, because they have to trade their own speed against that of the run-time.
This implementation of isValidCodePoint
compiles down to:
0x1c43e1d42c27 87 41c1e810 shrl r8, 16
0x1c43e1d42c2b 8b 4183f810 cmpl r8,0x10
0x1c43e1d42c2f 8f 0f878a000000 ja 0x1c43e1d42cbf <+0x11f>
while the "naive" (codePoint >= 0) && (codePoint <= 0x10ffff)
compiles to
0x30cdf2ac2c27 87 4183f800 cmpl r8,0x0
0x30cdf2ac2c2b 8b 0f8c97000000 jl 0x30cdf2ac2cc8 <+0x128>
0x30cdf2ac2c31 91 4181f8ffff1000 cmpl r8,0x10ffff
0x30cdf2ac2c38 98 0f8f8a000000 jg 0x30cdf2ac2cc8 <+0x128>
There is clearly only one comparison in the former, versus two in the latter.
Similarly, (codePoint & ~0xffff) == 0
compiles to
0x3a7b5d182d87 87 41f7c00000ffff testl r8,0xffff0000
0x3a7b5d182d8e 8e 0f858a000000 jnz 0x3a7b5d182e1e <+0x11e>
while (codePoint >= 0) && (codePoint <= 0xffff)
compiles to
0x1a2679602c27 87 4183f800 cmpl r8,0x0
0x1a2679602c2b 8b 0f8c97000000 jl 0x1a2679602cc8 <+0x128>
0x1a2679602c31 91 4181f8ffff0000 cmpl r8,0xffff
0x1a2679602c38 98 0f8f8a000000 jg 0x1a2679602cc8 <+0x128>
which is this time one test
instruction versus two cmp
instructions.
That said, the burden of proof is probably not worth it, so I've reverted to the dumb comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's really cool. How did you get these? (I was looking for something like that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ node --print-code whatever.js
but you need to make sure that whatever.js
executes the thing you want to inspect enough times to actually trigger its compilation. Executing once will just execute it from the bytecode, without JIT compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I saw this on SO. But it looked way too annoying. I guess it is :-/
@@ -46,7 +46,7 @@ case $FULLVER in | |||
REVERSI_OPT_GZ_EXPECTEDSIZE=33000 | |||
;; | |||
2.13.3) | |||
REVERSI_PREOPT_EXPECTEDSIZE=686000 | |||
REVERSI_PREOPT_EXPECTEDSIZE=687000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an idea what this is caused by?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was caused the change of implementation of j.l.Character.hashCode() = Character.hashCode(charValue())
, which adds ~20 characters in the output because of one more intermediate local variable. The size of 2.13.3 happened to be 5 bytes away from the limit, so those 20 characters made it topple in the next category. But there's nothing to worry about.
42cb7ff
to
61b2d95
Compare
It turns out this is a special case because `"İ".toLowerCase()` returns two code points: `"i\u0307"`, i.e., a lower-case ASCII `i` followed by a Dot Above. As demonstrated by the script that generates tests, this is the only code point for which we need a special case.
The precise result `String.compareTo` is actually specified, so that testing only `< 0` or `> 0` is not good enough. We now test the precise results, and adapt the implementation to conform to them. Likewise, the way `equalsIgnoreCase` and `compareToIgnoreCase` compare characters is specified as a char-by-char normalization. We fix our implementations to conform to the specified comparison and adapt our tests accordingly.
The JDK `trim()` specifies that characters '\u0000' through '\u0020' (' ') and only those are considered whitespace, but the JavaScript `trim()` function has a different definition. It is therefore incorrect to rely on JavaScript's `trim()`. Instead we must implement the function ourselves.
We group all the methods of `Character` that manipulate code points and code units at the beginning of the object (following their order in the JavaDoc). We add the following methods, which complete the set of low-level methods in `Character`: * `hashCode(Char)` * `toString(Int)` (added in JDK 11) * `highSurrogate(Int)` * `lowSurrogate(Int)` * `reverseBytes(Char)`
And simplify the implementation of the existing locale-insensitive `String.format(...)`.
It was only used for `toLowerCase()` and `toUpperCase()` at this point. It seems clearer to simply inline the methods using `js.Dynamic` instead (as already done in some other methods, such as `length()` and `charAt()`).
61b2d95
to
1df3189
Compare
Character.toLowerCase
String.compareTo
,equalsIgnoreCase
andcompareToIgnoreCase
String.trim()
Character
dealing with code units and surrogatesString.format(Locale, ...)