-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8352316: More MergeStoreBench #24108
base: master
Are you sure you want to change the base?
Conversation
👋 Welcome back swen! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
The performance numbers show that putNull_unsafePutInt and putNull_utf16_unsafePutLong perform more than 10 times better. It can be seen that MergeStore is very suitable for these scenarios. Sciptgit remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
git clone 23dba8c52454ae90eab4cb1b0a168c6e7249dd38
make test TEST="micro:vm.compiler.MergeStoreBench.putNull" 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)Benchmark Mode Cnt Score Error Units
MergeStoreBench.putNull_arraycopy avgt 5 6715.041 ± 18.765 ns/op
MergeStoreBench.putNull_getBytes avgt 5 5880.725 ± 12.261 ns/op
MergeStoreBench.putNull_getChars avgt 5 11972.642 ± 24.990 ns/op
MergeStoreBench.putNull_string_builder avgt 5 15643.372 ± 4526.932 ns/op
MergeStoreBench.putNull_unsafePutInt avgt 5 280.570 ± 0.669 ns/op
MergeStoreBench.putNull_utf16_arrayCopy avgt 5 13053.191 ± 24.954 ns/op
MergeStoreBench.putNull_utf16_string_builder avgt 5 16349.747 ± 5029.799 ns/op
MergeStoreBench.putNull_utf16_unsafePutLong avgt 5 579.580 ± 0.710 ns/op
3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)Benchmark Mode Cnt Score Error Units
MergeStoreBench.putNull_arraycopy avgt 5 8029.622 ± 60.856 ns/op
MergeStoreBench.putNull_getBytes avgt 5 7444.635 ± 39.552 ns/op
MergeStoreBench.putNull_getChars avgt 5 16657.442 ± 147.301 ns/op
MergeStoreBench.putNull_string_builder avgt 5 23008.159 ± 6143.167 ns/op
MergeStoreBench.putNull_unsafePutInt avgt 5 235.302 ± 2.004 ns/op
MergeStoreBench.putNull_utf16_arrayCopy avgt 5 18330.317 ± 142.242 ns/op
MergeStoreBench.putNull_utf16_string_builder avgt 5 25843.593 ± 7089.392 ns/op
MergeStoreBench.putNull_utf16_unsafePutLong avgt 5 1860.076 ± 16.703 ns/op 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)Benchmark Mode Cnt Score Error Units
MergeStoreBench.putNull_arraycopy avgt 5 8114.176 ± 36.685 ns/op
MergeStoreBench.putNull_getBytes avgt 5 6171.538 ± 5.845 ns/op
MergeStoreBench.putNull_getChars avgt 5 10432.681 ± 26.401 ns/op
MergeStoreBench.putNull_string_builder avgt 5 21238.753 ± 1428.244 ns/op
MergeStoreBench.putNull_unsafePutInt avgt 5 349.233 ± 1.521 ns/op
MergeStoreBench.putNull_utf16_arrayCopy avgt 5 16063.018 ± 22.127 ns/op
MergeStoreBench.putNull_utf16_string_builder avgt 5 22327.827 ± 414.499 ns/op
MergeStoreBench.putNull_utf16_unsafePutLong avgt 5 863.733 ± 0.693 ns/op
|
Webrevs
|
@wenshao Do you have any insight from this benchmark? What was your motivation for it? I also wonder if an IR test for some of the cases would be helpful. IR tests give us more info about what the compiler produced, and if there is a change in VM behaviour the IR test catches it in regular testing. Benchmarks are not run regularly, and regressions would therefore not be caught. |
I'm a developer of fastjson2. According to third-party benchmarks from https://github.com/fabienrenaud/java-json-benchmark, our library demonstrates the best performance. I would like to contribute some of these optimization techniques to OpenJDK, ideally by having C2 (the JIT compiler) directly support them. Below is an example related to this PR. We have a JavaBean that needs to be serialized to a JSON string:
class Bean {
public int value;
}
{"value":123}
class BeanJSONSerializer {
private static final String name = "\"value\":";
private static final byte[] nameBytes = name.getBytes();
private satic final long nameLong = UNSAFE.getLong(nameBytes, ARRAY_BYTE_BASE_OFFSET);
int writeNameValue0(byte[] bytes, int off, int value) {
name.getBytes(0, 8, bytes, off);
off += 8;
return writeInt32(bytes, off, value);
}
int writeNameValue1(byte[] bytes, int off, int value) {
System.arraycopy(nameBytes, 0, bytes, off, 8);
off += 8;
return writeInt32(bytes, off, value);
}
int writeNameValue2(byte[] bytes, int off, int value) {
UNSAFE.putLong(bytes, ARRAY_BYTE_BASE_OFFSET + off, nameLong);
off += 8;
return writeInt32(bytes, off, value);
}
} We propose that the C2 compiler could optimize cases where the field name length is 4 or 8 bytes by automatically using direct memory operations similar to writeNameValue2. This would eliminate the need for manual unsafe operations in user code and improve serialization performance for common patterns. |
I submitted this benchmark to prove that the performance of System.arraycopy or String.getBytes can be improved by Unsafe.putInt/putLong. I hope C2 can do this optimization automatically. |
Did you check if it does or does not do that? Can you investigate what the generated code is for I'm also not convinced that you are comparing apples to apples here.
This does an array copy, so an array load AND an array store, right? This one even has to do allocations, loads and stores (though you need to investigate and check):
On the other hand, this does NOT have to do an array load or allocations, just a simple store:
Is there actually a benchmark in this series that makes use of individual byte stores that get merged to an int store? Because that is the whole point of MergeStores, right? Do you really need to use Back to this:
Can you investigate what code it generates, and what kinds of optimizations are missing to make it close in performance to the I don't have time to do all the deep investigations myself. But feel free to ask me if you have more questions. |
@wenshao Since we don't seem to be comparing apples to apples here, it would be even more important to leave comments at the benchmarks to say what operations (loads, stores, allocations, etc) are happening. And what we know is optimized, and what we think could be optimized in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On naming:
null
usages confuse me (NULL_STR
et al,putNull
). Why "null" is special? Can you just use an arbitrary 4-byte string?- PR proposes a mix of snake & camel case while the code around uses camel case. Worth considering grouping similar benchmarks into an inner class.
By default, in OpenJDK, COMPACT_STRINGS = true, and the String coder without UTF16 characters is LATIN1, which is implemented using System.arraycopy. However, since String is immutable and System.arraycopy is directly performed on byte[], C2 should have more opportunities for optimization. class String {
@Stable
private final byte[] value;
private final byte coder;
boolean isLatin1() {
return COMPACT_STRINGS && coder == LATIN1;
}
public void getBytes(int srcBegin, int srcEnd, byte[] dst, int dstBegin) {
checkBoundsBeginEnd(srcBegin, srcEnd, length());
Objects.requireNonNull(dst);
checkBoundsOffCount(dstBegin, srcEnd - srcBegin, dst.length);
if (isLatin1()) {
StringLatin1.getBytes(value, srcBegin, srcEnd, dst, dstBegin);
} else {
StringUTF16.getBytes(value, srcBegin, srcEnd, dst, dstBegin);
}
}
}
class StringLatin1 {
public static void getBytes(byte[] value, int srcBegin, int srcEnd, byte[] dst, int dstBegin) {
System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
}
} |
"null" is very common, here, because its length is 4. When coder = LATIN1, the length of byte[] value is 4, and when coder = UTF16, the length of byte[] value is 8, which is easy to compare with Unsafe.putInt/putLong. If the string is not a multiple of 4, we can also use a combination. For example, when the length is 5, we can use the putInt + putByte combination. String str = "a1234";
str.getBytes(bytes, 0, 5, bytes, off); UNSAFE.putInt(bytes, Unsafe.ARRAY_BYTE_BASE_OFFSET + off, 0x33323161); // 0x33323161 is "a123"
USNAFE.putByte(bytes, Unsafe.ARRAY_BYTE_BASE_OFFSET + off + 4, '4');
|
Ok, I don't have anything against a fixed string constant. But existing names ( |
According to @iwanowww's suggestion, I changed the original name of putNull to str4, and added the benchmarks of str5 and str7. The following are the new performance numbers, which show that using ArraySetConst or UnsafePut has better performance. 1. Scipt
2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)
4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
|
I added a new scenario The performance numbers below show that ArraySetConst/StringBuilderUnsafePut/UnsafePut have better performance. These numbers show that Stable Value's arraycopy has great performance optimization potential, which is worth more optimization for C2. 1. Scipt
2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)
4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
|
Added performance tests related to String.getBytes/String.getChars/StringBuilder.append/System.arraycopy in constant scenarios to verify whether MergeStore works
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24108/head:pull/24108
$ git checkout pull/24108
Update a local copy of the PR:
$ git checkout pull/24108
$ git pull https://git.openjdk.org/jdk.git pull/24108/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24108
View PR using the GUI difftool:
$ git pr show -t 24108
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24108.diff
Using Webrev
Link to Webrev Comment