8352316: More MergeStoreBench #24108

Open
wants to merge 9 commits into
base: master

Conversation

wenshao
Contributor

@wenshao wenshao commented Mar 19, 2025

Added benchmarks for String.getBytes, String.getChars, StringBuilder.append, and System.arraycopy with constant inputs, to verify whether MergeStore applies in these scenarios.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24108/head:pull/24108
$ git checkout pull/24108

Update a local copy of the PR:
$ git checkout pull/24108
$ git pull https://git.openjdk.org/jdk.git pull/24108/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24108

View PR using the GUI difftool:
$ git pr show -t 24108

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24108.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Mar 19, 2025

👋 Welcome back swen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 19, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk

openjdk bot commented Mar 19, 2025

@wenshao The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 19, 2025
@wenshao
Contributor Author

wenshao commented Mar 19, 2025

The performance numbers show that putNull_unsafePutInt and putNull_utf16_unsafePutLong are roughly 10 or more times faster than the corresponding arraycopy, getBytes, getChars, and StringBuilder variants. This suggests that MergeStore is well suited to these scenarios.

1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
git checkout 23dba8c52454ae90eab4cb1b0a168c6e7249dd38
make test TEST="micro:vm.compiler.MergeStoreBench.putNull"

2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

Benchmark                                     Mode  Cnt      Score      Error  Units
MergeStoreBench.putNull_arraycopy             avgt    5   6715.041 ±   18.765  ns/op
MergeStoreBench.putNull_getBytes              avgt    5   5880.725 ±   12.261  ns/op
MergeStoreBench.putNull_getChars              avgt    5  11972.642 ±   24.990  ns/op
MergeStoreBench.putNull_string_builder        avgt    5  15643.372 ± 4526.932  ns/op
MergeStoreBench.putNull_unsafePutInt          avgt    5    280.570 ±    0.669  ns/op
MergeStoreBench.putNull_utf16_arrayCopy       avgt    5  13053.191 ±   24.954  ns/op
MergeStoreBench.putNull_utf16_string_builder  avgt    5  16349.747 ± 5029.799  ns/op
MergeStoreBench.putNull_utf16_unsafePutLong   avgt    5    579.580 ±    0.710  ns/op

3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)

Benchmark                                     Mode  Cnt      Score      Error  Units
MergeStoreBench.putNull_arraycopy             avgt    5   8029.622 ±   60.856  ns/op
MergeStoreBench.putNull_getBytes              avgt    5   7444.635 ±   39.552  ns/op
MergeStoreBench.putNull_getChars              avgt    5  16657.442 ±  147.301  ns/op
MergeStoreBench.putNull_string_builder        avgt    5  23008.159 ± 6143.167  ns/op
MergeStoreBench.putNull_unsafePutInt          avgt    5    235.302 ±    2.004  ns/op
MergeStoreBench.putNull_utf16_arrayCopy       avgt    5  18330.317 ±  142.242  ns/op
MergeStoreBench.putNull_utf16_string_builder  avgt    5  25843.593 ± 7089.392  ns/op
MergeStoreBench.putNull_utf16_unsafePutLong   avgt    5   1860.076 ±   16.703  ns/op

4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

Benchmark                                     Mode  Cnt      Score      Error  Units
MergeStoreBench.putNull_arraycopy             avgt    5   8114.176 ±   36.685  ns/op
MergeStoreBench.putNull_getBytes              avgt    5   6171.538 ±    5.845  ns/op
MergeStoreBench.putNull_getChars              avgt    5  10432.681 ±   26.401  ns/op
MergeStoreBench.putNull_string_builder        avgt    5  21238.753 ± 1428.244  ns/op
MergeStoreBench.putNull_unsafePutInt          avgt    5    349.233 ±    1.521  ns/op
MergeStoreBench.putNull_utf16_arrayCopy       avgt    5  16063.018 ±   22.127  ns/op
MergeStoreBench.putNull_utf16_string_builder  avgt    5  22327.827 ±  414.499  ns/op
MergeStoreBench.putNull_utf16_unsafePutLong   avgt    5    863.733 ±    0.693  ns/op

@wenshao wenshao changed the title from "More MergeStoreBench" to "8352316: More MergeStoreBench" Mar 19, 2025
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 19, 2025
@mlbridge

mlbridge bot commented Mar 19, 2025

@eme64
Contributor

eme64 commented Mar 25, 2025

@wenshao Do you have any insight from this benchmark? What was your motivation for it?

I also wonder if an IR test for some of the cases would be helpful. IR tests give us more info about what the compiler produced, and if there is a change in VM behaviour the IR test catches it in regular testing. Benchmarks are not run regularly, and regressions would therefore not be caught.

@wenshao
Contributor Author

wenshao commented Mar 26, 2025

I'm a developer of fastjson2. According to third-party benchmarks from https://github.com/fabienrenaud/java-json-benchmark, our library demonstrates the best performance. I would like to contribute some of these optimization techniques to OpenJDK, ideally by having C2 (the JIT compiler) directly support them.

Below is an example related to this PR. We have a JavaBean that needs to be serialized to a JSON string:

  • JavaBean
class Bean {
	public int value;
}
  • Target JSON Output
{"value":123}
  • CodeGen-Generated JSONSerializer
    fastjson2 uses ASM to generate a serializer class like the following. The methods writeNameValue0, writeNameValue1, and writeNameValue2 are candidate implementations. Among them, writeNameValue2 is the fastest when the field name length is 8, as it leverages UNSAFE.putLong for direct memory operations:
class BeanJSONSerializer {
	private static final String name = "\"value\":";
	private static final byte[] nameBytes = name.getBytes();
	private static final long nameLong = UNSAFE.getLong(nameBytes, ARRAY_BYTE_BASE_OFFSET);

	int writeNameValue0(byte[] bytes, int off, int value) {
		name.getBytes(0, 8, bytes, off);
		off += 8;
		return writeInt32(bytes, off, value);
	}

	int writeNameValue1(byte[] bytes, int off, int value) {
		System.arraycopy(nameBytes, 0, bytes, off, 8);
		off += 8;
		return writeInt32(bytes, off, value);
	}


	int writeNameValue2(byte[] bytes, int off, int value) {
		UNSAFE.putLong(bytes, ARRAY_BYTE_BASE_OFFSET + off, nameLong);
		off += 8;
		return writeInt32(bytes, off, value);
	}
}

We propose that the C2 compiler could optimize cases where the field name length is 4 or 8 bytes by automatically using direct memory operations similar to writeNameValue2. This would eliminate the need for manual unsafe operations in user code and improve serialization performance for common patterns.
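
The equivalence between the arraycopy variant and the single 8-byte store can be sketched without Unsafe. This is a minimal illustration, not fastjson2's code: it assumes a byte-array view VarHandle as a stand-in for UNSAFE.putLong, and the class and method names (NameWriterSketch, writeNameArraycopy, writeNameLong) are hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class NameWriterSketch {
    // The field-name prefix from the example above; exactly 8 Latin-1 bytes: "value":
    static final byte[] NAME_BYTES = "\"value\":".getBytes(StandardCharsets.ISO_8859_1);

    // Views a byte[] as 64-bit values in native byte order (plain get/set may be unaligned).
    static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // The 8 name bytes packed into one long, read once at class initialization.
    static final long NAME_LONG = (long) LONG_VIEW.get(NAME_BYTES, 0);

    // Variant 1: copy the 8 bytes from the constant array (a load plus a store per copy).
    static int writeNameArraycopy(byte[] dst, int off) {
        System.arraycopy(NAME_BYTES, 0, dst, off, 8);
        return off + 8;
    }

    // Variant 2: store the precomputed constant with a single 8-byte write.
    static int writeNameLong(byte[] dst, int off) {
        LONG_VIEW.set(dst, off, NAME_LONG);
        return off + 8;
    }
}
```

Both methods leave the same 8 bytes in dst and return the same offset. Whether C2 compiles the first into the second is exactly the open question in this PR.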

@wenshao
Contributor Author

wenshao commented Mar 27, 2025

@wenshao Do you have any insight from this benchmark? What was your motivation for it?

I also wonder if an IR test for some of the cases would be helpful. IR tests give us more info about what the compiler produced, and if there is a change in VM behaviour the IR test catches it in regular testing. Benchmarks are not run regularly, and regressions would therefore not be caught.

I submitted this benchmark to show that System.arraycopy and String.getBytes with a constant source are slower than equivalent Unsafe.putInt/putLong writes. I hope C2 can do this optimization automatically.

@eme64
Contributor

eme64 commented Mar 27, 2025

@wenshao

I hope C2 can do this optimization automatically.

Did you check if it does or does not do that? Can you investigate what the generated code is for String.getBytes? Does that not create an allocation, which would make things much slower? And it may even do some more complicated encoding things, which is a lot of overhead. So that would explain your performance result, at least partially, right?

I'm also not convinced that you are comparing apples to apples here.

Benchmark                                     Mode  Cnt      Score      Error  Units
MergeStoreBench.putNull_arraycopy             avgt    5   8029.622 ±   60.856  ns/op

This does an array copy, so an array load AND an array store, right?

This one even has to do allocations, loads and stores (though you need to investigate and check):

MergeStoreBench.putNull_getBytes              avgt    5   6171.538 ±    5.845  ns/op

On the other hand, this does NOT have to do an array load or allocations, just a simple store:

MergeStoreBench.putNull_unsafePutInt          avgt    5    235.302 ±    2.004  ns/op

Is there actually a benchmark in this series that makes use of individual byte stores that get merged to an int store? Because that is the whole point of MergeStores, right?

Do you really need to use String.getBytes? I mean maybe with proper escape analysis etc the whole allocation could be avoided. But that would require a much deeper analysis.

Back to this:

I hope C2 can do this optimization automatically.

Can you investigate what code it generates, and what kinds of optimizations are missing to make it close in performance to the Unsafe benchmark?

I don't have time to do all the deep investigations myself. But feel free to ask me if you have more questions.
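
For reference, the shape asked about above, individual constant byte stores to adjacent array elements that could be merged into one int store, can be sketched as follows. The class and method names are hypothetical, and the sketch makes no claim about what C2 actually emits for either method; that depends on the JDK build and flags.

```java
final class ByteStoreSketch {
    // Constant source array for the arraycopy variant.
    static final byte[] NULL_BYTES = {'n', 'u', 'l', 'l'};

    // Four adjacent constant byte stores: stores only, no array load and no allocation.
    static void putNullBytes(byte[] dst, int off) {
        dst[off]     = 'n';
        dst[off + 1] = 'u';
        dst[off + 2] = 'l';
        dst[off + 3] = 'l';
    }

    // Same result via arraycopy: semantically a load from NULL_BYTES plus a store to dst.
    static void putNullArraycopy(byte[] dst, int off) {
        System.arraycopy(NULL_BYTES, 0, dst, off, 4);
    }
}
```

A benchmark of putNullBytes would be the closest like-for-like comparison with the Unsafe.putInt variant, since neither reads from a source array.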

@eme64
Contributor

eme64 commented Mar 27, 2025

@wenshao Since we don't seem to be comparing apples to apples here, it would be even more important to leave comments at the benchmarks to say what operations (loads, stores, allocations, etc) are happening. And what we know is optimized, and what we think could be optimized in the future.

Contributor

@iwanowww iwanowww left a comment

On naming:

  • null usages confuse me (NULL_STR et al, putNull). Why is "null" special? Can you just use an arbitrary 4-byte string?
  • PR proposes a mix of snake & camel case while the code around uses camel case. Worth considering grouping similar benchmarks into an inner class.

@wenshao
Contributor Author

wenshao commented Mar 29, 2025

@wenshao

I hope C2 can do this optimization automatically.

Did you check if it does or does not do that? Can you investigate what the generated code is for String.getBytes? Does that not create an allocation, which would make things much slower? And it may even do some more complicated encoding things, which is a lot of overhead. So that would explain your performance result, at least partially, right?

I'm also not convinced that you are comparing apples to apples here.

Benchmark                                     Mode  Cnt      Score      Error  Units
MergeStoreBench.putNull_arraycopy             avgt    5   8029.622 ±   60.856  ns/op

This does an array copy, so an array load AND an array store, right?

This one even has to do allocations, loads and stores (though you need to investigate and check):

MergeStoreBench.putNull_getBytes              avgt    5   6171.538 ±    5.845  ns/op

On the other hand, this does NOT have to do an array load or allocations, just a simple store:

MergeStoreBench.putNull_unsafePutInt          avgt    5    235.302 ±    2.004  ns/op

Is there actually a benchmark in this series that makes use of individual byte stores that get merged to an int store? Because that is the whole point of MergeStores, right?

Do you really need to use String.getBytes? I mean maybe with proper escape analysis etc the whole allocation could be avoided. But that would require a much deeper analysis.

Back to this:

I hope C2 can do this optimization automatically.

Can you investigate what code it generates, and what kinds of optimizations are missing to make it close in performance to the Unsafe benchmark?

I don't have time to do all the deep investigations myself. But feel free to ask me if you have more questions.

By default in OpenJDK, COMPACT_STRINGS = true, and a String that contains no UTF16 characters has coder LATIN1. In that case getBytes is implemented using System.arraycopy. Since String is immutable and System.arraycopy operates directly on the byte[], C2 should have more opportunities for optimization.

class String {
    @Stable
    private final byte[] value;
    private final byte coder;

    boolean isLatin1() {
        return COMPACT_STRINGS && coder == LATIN1;
    }

    public void getBytes(int srcBegin, int srcEnd, byte[] dst, int dstBegin) {
        checkBoundsBeginEnd(srcBegin, srcEnd, length());
        Objects.requireNonNull(dst);
        checkBoundsOffCount(dstBegin, srcEnd - srcBegin, dst.length);
        if (isLatin1()) {
            StringLatin1.getBytes(value, srcBegin, srcEnd, dst, dstBegin);
        } else {
            StringUTF16.getBytes(value, srcBegin, srcEnd, dst, dstBegin);
        }
    }	
}


class StringLatin1 {
    public static void getBytes(byte[] value, int srcBegin, int srcEnd, byte[] dst, int dstBegin) {
        System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
    }
}
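
As a small check of the point above, the sketch below compares the four-argument String.getBytes, which writes into a caller-supplied array, with a copy that first materializes the bytes through getBytes(Charset), which returns a newly allocated array. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class GetBytesSketch {
    // Writes the low byte of each char straight into dst; no intermediate array is returned.
    @SuppressWarnings("deprecation") // String.getBytes(int, int, byte[], int) is deprecated but still available
    static void copyWithGetBytes(String s, byte[] dst, int off) {
        s.getBytes(0, s.length(), dst, off);
    }

    // Encodes into a fresh byte[] first, then copies it into dst.
    static void copyWithArraycopy(String s, byte[] dst, int off) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.arraycopy(latin1, 0, dst, off, latin1.length);
    }
}
```

For a Latin-1 string such as "null" both methods leave identical bytes in dst, so any timing difference between them comes from how the copy is performed, not from what is written.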

@wenshao
Contributor Author

wenshao commented Mar 29, 2025

On naming:

  • null usages confuse me (NULL_STR et al, putNull). Why is "null" special? Can you just use an arbitrary 4-byte string?

"null" is a very common string, and I chose it here because its length is 4. When coder = LATIN1, the byte[] value has length 4, and when coder = UTF16, it has length 8, which makes it easy to compare with Unsafe.putInt/putLong.

If the string length is not a multiple of 4, we can also combine writes. For example, when the length is 5, we can use putInt + putByte.

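String str = "a1234";
// Copy via String.getBytes(srcBegin, srcEnd, dst, dstBegin):
str.getBytes(0, 5, bytes, off);
// Equivalent writes using Unsafe:
UNSAFE.putInt(bytes, Unsafe.ARRAY_BYTE_BASE_OFFSET + off, 0x33323161); // 0x33323161 is "a123" in little-endian byte order
UNSAFE.putByte(bytes, Unsafe.ARRAY_BYTE_BASE_OFFSET + off + 4, (byte) '4');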
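
The int + byte combination for a 5-byte constant can also be sketched without Unsafe. This assumes a little-endian byte-array view VarHandle in place of UNSAFE.putInt, so the constant 0x33323161 means "a123" regardless of the platform's native byte order. The names Str5Sketch and putA1234 are hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class Str5Sketch {
    // Views a byte[] as little-endian 32-bit values.
    static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Writes "a1234" at dst[off..off+4] as one 4-byte store plus one 1-byte store.
    static int putA1234(byte[] dst, int off) {
        INT_LE.set(dst, off, 0x33323161); // bytes 0x61 0x31 0x32 0x33 = "a123"
        dst[off + 4] = '4';
        return off + 5;
    }
}
```

The same decomposition extends to other lengths, for example 4 + 2 + 1 bytes for a 7-byte constant.
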
  • PR proposes a mix of snake & camel case while the code around uses camel case. Worth considering grouping similar benchmarks into an inner class.

@iwanowww
Contributor

Ok, I don't have anything against a fixed string constant. But existing names (NULL_STR et al, putNull) add confusion IMO (especially, when there's Unsafe in play).

@wenshao
Contributor Author

wenshao commented Mar 29, 2025

Following @iwanowww's suggestion, I renamed putNull to str4 and added benchmarks for str5 and str7. The new performance numbers below show that ArraySetConst and UnsafePut perform better than the other variants.

1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
git checkout a5eb3b98ece8cf1aa6eaa3d1287148e1b0510f4b
make test TEST="micro:vm.compiler.MergeStoreBench.str"

2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

Benchmark                               Mode  Cnt      Score      Error  Units
MergeStoreBench.str4ArraySetConst       avgt    5   1343.148 ±    3.995  ns/op
MergeStoreBench.str4Arraycopy           avgt    5   7293.298 ±   35.868  ns/op
MergeStoreBench.str4GetBytes            avgt    5   6175.505 ±   17.465  ns/op
MergeStoreBench.str4GetChars            avgt    5  13954.105 ± 1561.080  ns/op
MergeStoreBench.str4StringBuilder       avgt    5  15633.944 ± 4579.011  ns/op
MergeStoreBench.str4UnsafePut           avgt    5   1325.916 ±    6.126  ns/op
MergeStoreBench.str4Utf16ArrayCopy      avgt    5  13998.302 ±  938.311  ns/op
MergeStoreBench.str4Utf16ArraySetConst  avgt    5   1514.040 ±    6.774  ns/op
MergeStoreBench.str4Utf16StringBuilder  avgt    5  16382.059 ± 4943.649  ns/op
MergeStoreBench.str4Utf16UnsafePut      avgt    5   1616.452 ±    9.472  ns/op
MergeStoreBench.str5ArraySetConst       avgt    5   2609.046 ±   28.409  ns/op
MergeStoreBench.str5Arraycopy           avgt    5   9519.887 ±   54.364  ns/op
MergeStoreBench.str5GetBytes            avgt    5   5987.410 ±   14.277  ns/op
MergeStoreBench.str5GetChars            avgt    5  13598.285 ±  241.078  ns/op
MergeStoreBench.str5StringBuilder       avgt    5  16556.510 ± 2962.211  ns/op
MergeStoreBench.str5UnsafePut           avgt    5   2431.841 ±   24.299  ns/op
MergeStoreBench.str5Utf16ArrayCopy      avgt    5  21433.158 ±  131.466  ns/op
MergeStoreBench.str5Utf16ArraySetConst  avgt    5   2935.785 ±    3.777  ns/op
MergeStoreBench.str5Utf16StringBuilder  avgt    5  18746.936 ± 3680.162  ns/op
MergeStoreBench.str5Utf16UnsafePut      avgt    5   2878.038 ±   10.055  ns/op
MergeStoreBench.str7ArraySetConst       avgt    5   3594.628 ±   24.397  ns/op
MergeStoreBench.str7Arraycopy           avgt    5  12314.423 ±   81.095  ns/op
MergeStoreBench.str7GetBytes            avgt    5   9014.943 ±  222.911  ns/op
MergeStoreBench.str7GetChars            avgt    5  16866.491 ±  178.543  ns/op
MergeStoreBench.str7StringBuilder       avgt    5  25238.440 ± 2757.460  ns/op
MergeStoreBench.str7UnsafePut           avgt    5   3597.008 ±   26.531  ns/op
MergeStoreBench.str7Utf16ArrayCopy      avgt    5  21325.797 ±  111.975  ns/op
MergeStoreBench.str7Utf16ArraySetConst  avgt    5   3934.164 ±   97.003  ns/op
MergeStoreBench.str7Utf16StringBuilder  avgt    5  19315.320 ± 1960.379  ns/op
MergeStoreBench.str7Utf16UnsafePut      avgt    5   4190.362 ±    8.042  ns/op

3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)

Benchmark                               Mode  Cnt      Score      Error  Units
MergeStoreBench.str4ArraySetConst       avgt    5   1558.348 ±    0.959  ns/op
MergeStoreBench.str4Arraycopy           avgt    5   5837.069 ±    3.166  ns/op
MergeStoreBench.str4GetBytes            avgt    5   5875.195 ±   12.562  ns/op
MergeStoreBench.str4GetChars            avgt    5  12679.307 ±   62.069  ns/op
MergeStoreBench.str4StringBuilder       avgt    5  16588.064 ±   75.515  ns/op
MergeStoreBench.str4UnsafePut           avgt    5   1543.947 ±    4.780  ns/op
MergeStoreBench.str4Utf16ArrayCopy      avgt    5  13973.910 ±  329.196  ns/op
MergeStoreBench.str4Utf16ArraySetConst  avgt    5   2591.923 ±    6.758  ns/op
MergeStoreBench.str4Utf16StringBuilder  avgt    5  17719.390 ± 5016.367  ns/op
MergeStoreBench.str4Utf16UnsafePut      avgt    5   2539.849 ±    8.091  ns/op
MergeStoreBench.str5ArraySetConst       avgt    5   3004.459 ±    9.575  ns/op
MergeStoreBench.str5Arraycopy           avgt    5   7153.397 ±   52.069  ns/op
MergeStoreBench.str5GetBytes            avgt    5   5566.344 ±    4.400  ns/op
MergeStoreBench.str5GetChars            avgt    5  14444.069 ±  224.157  ns/op
MergeStoreBench.str5StringBuilder       avgt    5  18371.573 ±  293.271  ns/op
MergeStoreBench.str5UnsafePut           avgt    5   2879.242 ±    9.412  ns/op
MergeStoreBench.str5Utf16ArrayCopy      avgt    5   4548.225 ±   14.172  ns/op
MergeStoreBench.str5Utf16ArraySetConst  avgt    5   3864.536 ±    4.208  ns/op
MergeStoreBench.str5Utf16StringBuilder  avgt    5  20413.600 ± 1513.652  ns/op
MergeStoreBench.str5Utf16UnsafePut      avgt    5   3858.928 ±    2.923  ns/op
MergeStoreBench.str7ArraySetConst       avgt    5   4658.730 ±    4.558  ns/op
MergeStoreBench.str7Arraycopy           avgt    5  12130.150 ±   13.268  ns/op
MergeStoreBench.str7GetBytes            avgt    5  11941.311 ±  201.509  ns/op
MergeStoreBench.str7GetChars            avgt    5  21081.423 ± 1892.526  ns/op
MergeStoreBench.str7StringBuilder       avgt    5  14661.312 ±  768.749  ns/op
MergeStoreBench.str7UnsafePut           avgt    5   4662.649 ±    2.974  ns/op
MergeStoreBench.str7Utf16ArrayCopy      avgt    5   4973.827 ±    2.841  ns/op
MergeStoreBench.str7Utf16ArraySetConst  avgt    5   5407.768 ±   19.989  ns/op
MergeStoreBench.str7Utf16StringBuilder  avgt    5  25378.418 ± 9377.505  ns/op
MergeStoreBench.str7Utf16UnsafePut      avgt    5   5494.466 ±    5.377  ns/op

4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

Benchmark                               Mode  Cnt      Score       Error  Units
MergeStoreBench.str4ArraySetConst       avgt    5   2232.675 ±     0.858  ns/op
MergeStoreBench.str4Arraycopy           avgt    5   8342.762 ±    22.772  ns/op
MergeStoreBench.str4GetBytes            avgt    5   6988.049 ±    11.874  ns/op
MergeStoreBench.str4GetChars            avgt    5  12363.100 ±    30.414  ns/op
MergeStoreBench.str4StringBuilder       avgt    5  21257.805 ±  1371.310  ns/op
MergeStoreBench.str4UnsafePut           avgt    5   2234.198 ±     1.698  ns/op
MergeStoreBench.str4Utf16ArrayCopy      avgt    5  16381.011 ±   102.719  ns/op
MergeStoreBench.str4Utf16ArraySetConst  avgt    5   3109.010 ±     8.955  ns/op
MergeStoreBench.str4Utf16StringBuilder  avgt    5  22010.040 ±   908.358  ns/op
MergeStoreBench.str4Utf16UnsafePut      avgt    5   2868.544 ±    12.469  ns/op
MergeStoreBench.str5ArraySetConst       avgt    5   3780.322 ±     5.041  ns/op
MergeStoreBench.str5Arraycopy           avgt    5  10649.712 ±    39.440  ns/op
MergeStoreBench.str5GetBytes            avgt    5   6612.562 ±     7.260  ns/op
MergeStoreBench.str5GetChars            avgt    5  15521.451 ±   157.817  ns/op
MergeStoreBench.str5StringBuilder       avgt    5  22938.577 ±  1814.071  ns/op
MergeStoreBench.str5UnsafePut           avgt    5   3769.850 ±     0.524  ns/op
MergeStoreBench.str5Utf16ArrayCopy      avgt    5   5832.413 ±     5.256  ns/op
MergeStoreBench.str5Utf16ArraySetConst  avgt    5   4644.579 ±    41.694  ns/op
MergeStoreBench.str5Utf16StringBuilder  avgt    5  26369.411 ±  8050.710  ns/op
MergeStoreBench.str5Utf16UnsafePut      avgt    5   4497.980 ±    42.817  ns/op
MergeStoreBench.str7ArraySetConst       avgt    5   5913.136 ±    12.055  ns/op
MergeStoreBench.str7Arraycopy           avgt    5  14427.669 ±    80.229  ns/op
MergeStoreBench.str7GetBytes            avgt    5  11712.364 ±    13.206  ns/op
MergeStoreBench.str7GetChars            avgt    5  21309.046 ±   519.416  ns/op
MergeStoreBench.str7StringBuilder       avgt    5  18882.777 ±  2659.525  ns/op
MergeStoreBench.str7UnsafePut           avgt    5   5926.995 ±    11.841  ns/op
MergeStoreBench.str7Utf16ArrayCopy      avgt    5   6362.405 ±     5.381  ns/op
MergeStoreBench.str7Utf16ArraySetConst  avgt    5   4339.133 ±     2.066  ns/op
MergeStoreBench.str7Utf16StringBuilder  avgt    5  30761.366 ± 13408.497  ns/op
MergeStoreBench.str7Utf16UnsafePut      avgt    5   6345.575 ±   128.697  ns/op

@wenshao
Contributor Author

wenshao commented Mar 29, 2025

I added a new scenario, StringBuilderUnsafePut, which uses Unsafe to modify the StringBuilder directly in order to append constants.

The performance numbers below show that ArraySetConst, StringBuilderUnsafePut, and UnsafePut perform better than the other variants.

These numbers suggest that arraycopy from a stable value has significant optimization potential and is worth further optimization work in C2.

1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
git checkout cd1d8fb3b137a741446c894d1893e7180535ce8f
make test TEST="micro:vm.compiler.MergeStoreBench.str"

2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

Benchmark                                         Mode  Cnt      Score       Error  Units
MergeStoreBench.str4ArraySetConst                 avgt    5   1338.414 ±     3.209  ns/op
MergeStoreBench.str4Arraycopy                     avgt    5   7271.203 ±    19.400  ns/op
MergeStoreBench.str4GetBytes                      avgt    5   6154.684 ±     9.910  ns/op
MergeStoreBench.str4GetChars                      avgt    5  14078.790 ±    59.175  ns/op
MergeStoreBench.str4StringBuilder                 avgt    5  15766.528 ±  4634.119  ns/op
MergeStoreBench.str4StringBuilderAppendChar       avgt    5  41388.364 ±  9871.409  ns/op
MergeStoreBench.str4StringBuilderUnsafePut        avgt    5   1575.792 ±     4.102  ns/op
MergeStoreBench.str4UnsafePut                     avgt    5   1326.499 ±     2.400  ns/op
MergeStoreBench.str4Utf16ArrayCopy                avgt    5  13949.307 ±  1045.255  ns/op
MergeStoreBench.str4Utf16ArraySetConst            avgt    5   1511.967 ±     5.250  ns/op
MergeStoreBench.str4Utf16StringBuilder            avgt    5  18030.261 ±  1656.463  ns/op
MergeStoreBench.str4Utf16StringBuilderAppendChar  avgt    5  35047.855 ± 16674.635  ns/op
MergeStoreBench.str4Utf16StringBuilderUnsafePut   avgt    5   2785.792 ±     5.571  ns/op
MergeStoreBench.str4Utf16UnsafePut                avgt    5   1613.812 ±     1.249  ns/op
MergeStoreBench.str5ArraySetConst                 avgt    5   2599.310 ±     8.667  ns/op
MergeStoreBench.str5Arraycopy                     avgt    5   9487.926 ±    29.234  ns/op
MergeStoreBench.str5GetBytes                      avgt    5   5972.453 ±    16.035  ns/op
MergeStoreBench.str5GetChars                      avgt    5  13516.943 ±    10.978  ns/op
MergeStoreBench.str5StringBuilder                 avgt    5  16539.070 ±  3097.339  ns/op
MergeStoreBench.str5StringBuilderAppendChar       avgt    5  50506.770 ± 11536.414  ns/op
MergeStoreBench.str5StringBuilderUnsafePut        avgt    5   2653.493 ±     7.397  ns/op
MergeStoreBench.str5UnsafePut                     avgt    5   2431.003 ±    10.690  ns/op
MergeStoreBench.str5Utf16ArrayCopy                avgt    5  20949.585 ±  1128.737  ns/op
MergeStoreBench.str5Utf16ArraySetConst            avgt    5   2933.045 ±     5.864  ns/op
MergeStoreBench.str5Utf16StringBuilder            avgt    5  21769.670 ±  4910.378  ns/op
MergeStoreBench.str5Utf16StringBuilderAppendChar  avgt    5  47491.137 ± 15262.349  ns/op
MergeStoreBench.str5Utf16StringBuilderUnsafePut   avgt    5   2652.690 ±     5.348  ns/op
MergeStoreBench.str5Utf16UnsafePut                avgt    5   2871.860 ±     5.845  ns/op
MergeStoreBench.str7ArraySetConst                 avgt    5   3583.059 ±    22.359  ns/op
MergeStoreBench.str7Arraycopy                     avgt    5  12289.685 ±    14.769  ns/op
MergeStoreBench.str7GetBytes                      avgt    5   8968.316 ±    34.194  ns/op
MergeStoreBench.str7GetChars                      avgt    5  16792.196 ±    72.787  ns/op
MergeStoreBench.str7StringBuilder                 avgt    5  25231.342 ±  2851.998  ns/op
MergeStoreBench.str7StringBuilderAppendChar       avgt    5  67351.162 ±    51.074  ns/op
MergeStoreBench.str7StringBuilderUnsafePut        avgt    5   3397.856 ±     7.576  ns/op
MergeStoreBench.str7UnsafePut                     avgt    5   3578.465 ±     3.344  ns/op
MergeStoreBench.str7Utf16ArrayCopy                avgt    5  21314.607 ±   117.545  ns/op
MergeStoreBench.str7Utf16ArraySetConst            avgt    5   3915.540 ±     7.042  ns/op
MergeStoreBench.str7Utf16StringBuilder            avgt    5  21113.390 ±  1452.353  ns/op
MergeStoreBench.str7Utf16StringBuilderAppendChar  avgt    5  79597.044 ±   176.197  ns/op
MergeStoreBench.str7Utf16StringBuilderUnsafePut   avgt    5   6413.179 ±    11.302  ns/op
MergeStoreBench.str7Utf16UnsafePut                avgt    5   4180.867 ±     7.475  ns/op

3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)

Benchmark                                         Mode  Cnt      Score       Error  Units
MergeStoreBench.str4ArraySetConst                 avgt    5   1558.502 ±     2.989  ns/op
MergeStoreBench.str4Arraycopy                     avgt    5   5855.148 ±    10.116  ns/op
MergeStoreBench.str4GetBytes                      avgt    5   5874.873 ±     3.767  ns/op
MergeStoreBench.str4GetChars                      avgt    5  12674.479 ±   103.618  ns/op
MergeStoreBench.str4StringBuilder                 avgt    5  16564.323 ±   229.666  ns/op
MergeStoreBench.str4StringBuilderAppendChar       avgt    5  39590.870 ± 14968.244  ns/op
MergeStoreBench.str4StringBuilderUnsafePut        avgt    5   1797.398 ±     3.972  ns/op
MergeStoreBench.str4UnsafePut                     avgt    5   1547.226 ±     1.950  ns/op
MergeStoreBench.str4Utf16ArrayCopy                avgt    5  13984.076 ±   332.735  ns/op
MergeStoreBench.str4Utf16ArraySetConst            avgt    5   2592.408 ±     5.338  ns/op
MergeStoreBench.str4Utf16StringBuilder            avgt    5  18244.127 ±  2436.822  ns/op
MergeStoreBench.str4Utf16StringBuilderAppendChar  avgt    5  36861.665 ± 10735.884  ns/op
MergeStoreBench.str4Utf16StringBuilderUnsafePut   avgt    5   3103.648 ±     0.809  ns/op
MergeStoreBench.str4Utf16UnsafePut                avgt    5   2539.181 ±    11.556  ns/op
MergeStoreBench.str5ArraySetConst                 avgt    5   3006.719 ±     4.606  ns/op
MergeStoreBench.str5Arraycopy                     avgt    5   7152.151 ±    27.593  ns/op
MergeStoreBench.str5GetBytes                      avgt    5   5572.568 ±     9.664  ns/op
MergeStoreBench.str5GetChars                      avgt    5  14478.429 ±   597.483  ns/op
MergeStoreBench.str5StringBuilder                 avgt    5  18249.007 ±   359.685  ns/op
MergeStoreBench.str5StringBuilderAppendChar       avgt    5  48156.310 ± 21354.806  ns/op
MergeStoreBench.str5StringBuilderUnsafePut        avgt    5   3039.131 ±     5.040  ns/op
MergeStoreBench.str5UnsafePut                     avgt    5   2885.440 ±     4.323  ns/op
MergeStoreBench.str5Utf16ArrayCopy                avgt    5   4648.957 ±   115.805  ns/op
MergeStoreBench.str5Utf16ArraySetConst            avgt    5   3862.566 ±     3.036  ns/op
MergeStoreBench.str5Utf16StringBuilder            avgt    5  24592.386 ±  6936.461  ns/op
MergeStoreBench.str5Utf16StringBuilderAppendChar  avgt    5  44162.880 ± 36224.171  ns/op
MergeStoreBench.str5Utf16StringBuilderUnsafePut   avgt    5   3042.734 ±     9.256  ns/op
MergeStoreBench.str5Utf16UnsafePut                avgt    5   3858.479 ±     2.273  ns/op
MergeStoreBench.str7ArraySetConst                 avgt    5   4656.166 ±     3.053  ns/op
MergeStoreBench.str7Arraycopy                     avgt    5  12139.304 ±    10.065  ns/op
MergeStoreBench.str7GetBytes                      avgt    5  11909.980 ±    14.371  ns/op
MergeStoreBench.str7GetChars                      avgt    5  20885.722 ±  3159.820  ns/op
MergeStoreBench.str7StringBuilder                 avgt    5  14813.587 ±   354.177  ns/op
MergeStoreBench.str7StringBuilderAppendChar       avgt    5  61647.309 ±   153.877  ns/op
MergeStoreBench.str7StringBuilderUnsafePut        avgt    5   4256.645 ±     1.095  ns/op
MergeStoreBench.str7UnsafePut                     avgt    5   4662.482 ±     2.893  ns/op
MergeStoreBench.str7Utf16ArrayCopy                avgt    5   4939.354 ±    12.117  ns/op
MergeStoreBench.str7Utf16ArraySetConst            avgt    5   5401.214 ±     5.342  ns/op
MergeStoreBench.str7Utf16StringBuilder            avgt    5  25070.599 ±  8313.323  ns/op
MergeStoreBench.str7Utf16StringBuilderAppendChar  avgt    5  84853.104 ±   210.843  ns/op
MergeStoreBench.str7Utf16StringBuilderUnsafePut   avgt    5   5290.793 ±    21.012  ns/op
MergeStoreBench.str7Utf16UnsafePut                avgt    5   5502.576 ±    11.820  ns/op

4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

Benchmark                                         Mode  Cnt       Score      Error  Units
MergeStoreBench.str4ArraySetConst                 avgt    5    2229.455 ±    2.024  ns/op
MergeStoreBench.str4Arraycopy                     avgt    5    8323.527 ±   60.470  ns/op
MergeStoreBench.str4GetBytes                      avgt    5    7008.143 ±    6.658  ns/op
MergeStoreBench.str4GetChars                      avgt    5   12343.528 ±    6.584  ns/op
MergeStoreBench.str4StringBuilder                 avgt    5   21238.814 ± 1410.339  ns/op
MergeStoreBench.str4StringBuilderAppendChar       avgt    5   68667.406 ±  720.511  ns/op
MergeStoreBench.str4StringBuilderUnsafePut        avgt    5    2281.267 ±    1.324  ns/op
MergeStoreBench.str4UnsafePut                     avgt    5    2230.367 ±    0.626  ns/op
MergeStoreBench.str4Utf16ArrayCopy                avgt    5   16338.896 ±   74.446  ns/op
MergeStoreBench.str4Utf16ArraySetConst            avgt    5    3098.749 ±   35.606  ns/op
MergeStoreBench.str4Utf16StringBuilder            avgt    5   21491.710 ± 2598.145  ns/op
MergeStoreBench.str4Utf16StringBuilderAppendChar  avgt    5   67748.629 ± 2224.953  ns/op
MergeStoreBench.str4Utf16StringBuilderUnsafePut   avgt    5    3840.268 ±    2.786  ns/op
MergeStoreBench.str4Utf16UnsafePut                avgt    5    2858.839 ±   46.434  ns/op
MergeStoreBench.str5ArraySetConst                 avgt    5    3769.990 ±    2.877  ns/op
MergeStoreBench.str5Arraycopy                     avgt    5   10604.229 ±   85.266  ns/op
MergeStoreBench.str5GetBytes                      avgt    5    6604.073 ±    4.599  ns/op
MergeStoreBench.str5GetChars                      avgt    5   15499.577 ±  166.819  ns/op
MergeStoreBench.str5StringBuilder                 avgt    5   22817.332 ± 1330.696  ns/op
MergeStoreBench.str5StringBuilderAppendChar       avgt    5   86993.698 ±  419.806  ns/op
MergeStoreBench.str5StringBuilderUnsafePut        avgt    5    3803.737 ±    0.974  ns/op
MergeStoreBench.str5UnsafePut                     avgt    5    3765.698 ±    1.774  ns/op
MergeStoreBench.str5Utf16ArrayCopy                avgt    5    5691.730 ±    4.200  ns/op
MergeStoreBench.str5Utf16ArraySetConst            avgt    5    4620.050 ±   73.237  ns/op
MergeStoreBench.str5Utf16StringBuilder            avgt    5   26974.200 ± 9799.822  ns/op
MergeStoreBench.str5Utf16StringBuilderAppendChar  avgt    5   84214.630 ± 1770.595  ns/op
MergeStoreBench.str5Utf16StringBuilderUnsafePut   avgt    5    3803.749 ±    2.164  ns/op
MergeStoreBench.str5Utf16UnsafePut                avgt    5    4463.146 ±   94.255  ns/op
MergeStoreBench.str7ArraySetConst                 avgt    5    5905.221 ±   17.324  ns/op
MergeStoreBench.str7Arraycopy                     avgt    5   14400.712 ±   68.866  ns/op
MergeStoreBench.str7GetBytes                      avgt    5   11693.448 ±   11.413  ns/op
MergeStoreBench.str7GetChars                      avgt    5   21262.620 ±  393.963  ns/op
MergeStoreBench.str7StringBuilder                 avgt    5   21559.944 ±   97.469  ns/op
MergeStoreBench.str7StringBuilderAppendChar       avgt    5  120774.017 ±  927.175  ns/op
MergeStoreBench.str7StringBuilderUnsafePut        avgt    5    5520.405 ±    5.431  ns/op
MergeStoreBench.str7UnsafePut                     avgt    5    5918.814 ±    8.237  ns/op
MergeStoreBench.str7Utf16ArrayCopy                avgt    5    6348.146 ±    2.766  ns/op
MergeStoreBench.str7Utf16ArraySetConst            avgt    5    4333.009 ±    1.980  ns/op
MergeStoreBench.str7Utf16StringBuilder            avgt    5   29406.714 ± 9703.134  ns/op
MergeStoreBench.str7Utf16StringBuilderAppendChar  avgt    5  117801.880 ±  811.216  ns/op
MergeStoreBench.str7Utf16StringBuilderUnsafePut   avgt    5    6684.164 ±   16.496  ns/op
MergeStoreBench.str7Utf16UnsafePut                avgt    5    6286.796 ±  316.658  ns/op
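
For readers mapping the benchmark names above to code shapes, here is a minimal sketch of three of the variants (`ArraySetConst`, `GetBytes`, `Arraycopy`) for a 4-character constant. It is an illustration under assumptions, not the benchmark source from this PR: the class name `MergeStoreSketch`, the constant `"null"`, and the method bodies are hypothetical, and the JMH harness is omitted. All three paths produce the same bytes; what the benchmark measures is whether the JIT turns each of them into a single merged store.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MergeStoreSketch {
    // Hypothetical constant; the strings used by the real benchmark are not shown in this excerpt.
    static final String STR4 = "null";
    static final byte[] STR4_BYTES = STR4.getBytes(StandardCharsets.ISO_8859_1);

    // Shape of an "ArraySetConst" variant: four constant byte stores to adjacent slots,
    // the pattern a store-merging optimization can combine into one wider store.
    static byte[] arraySetConst() {
        byte[] buf = new byte[4];
        buf[0] = 'n';
        buf[1] = 'u';
        buf[2] = 'l';
        buf[3] = 'l';
        return buf;
    }

    // Shape of a "GetBytes" variant: String.getBytes(int, int, byte[], int) is deprecated
    // but still present; it copies the low-order byte of each char.
    @SuppressWarnings("deprecation")
    static byte[] getBytes() {
        byte[] buf = new byte[4];
        STR4.getBytes(0, 4, buf, 0);
        return buf;
    }

    // Shape of an "Arraycopy" variant: copy from a precomputed constant byte array.
    static byte[] arraycopy() {
        byte[] buf = new byte[4];
        System.arraycopy(STR4_BYTES, 0, buf, 0, 4);
        return buf;
    }

    public static void main(String[] args) {
        // The three variants are functionally equivalent; only the generated code differs.
        System.out.println(Arrays.equals(arraySetConst(), getBytes()));
        System.out.println(Arrays.equals(arraySetConst(), arraycopy()));
    }
}
```

A real measurement would wrap each method in a JMH `@Benchmark` and run it inside a loop, as the `ns/op` scores above suggest the PR's benchmark does.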

Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review
3 participants