Description
Before Creating the Enhancement Request
- I have confirmed that this should be classified as an enhancement rather than a bug/feature.
Summary
In Netty, there are two EventExecutorChoosers, GenericEventExecutorChooser and PowerOfTwoEventExecutorChooser. When the number of threads is set to 3, GenericEventExecutorChooser will be used, which will cause performance to decrease as the number of calls increases,This is because the delay of the AND instruction is 1 nanosecond, while the delay of IDIV is 60-80 nanoseconds (as described in the Intel Performance Optimization Manual).
At the same time, in warmMapedFile, there may be a problem where the C2 compiler optimizes the for loop to CountedLoop during OSR, which makes it impossible to detect thread safety points and prevents VMthread from STW. Therefore, the for loop value is modified to long, and a strong conversion is performed during internal put to avoid this issue. Only by breaking the pre detection condition of CountedLoop: For requires triple detection, it can meet the requirements
Motivation
- Removing the IDIV instruction significantly improves Netty's speed in selecting the next thread
- Remove the 'for' loop from warmAppedFile and set the number of loops to long to avoid forced conversion
The following code is a test on the x86 platform, and the test results show that as the number of cycles increases, the performance significantly decreases:
public class PowerOfTwoTest {
public static Object[] powerOfTwoObjects = new Object[16];
public static Object[] nonPowerOfTwoObjects = new Object[15];
public static long powerOfTwoObjectsIdx;
public static long nonPowerOfTwoObjectsIdx;
public static long loopCount = 9999999999L;
public static final int powerOfTwoObjectsMask = powerOfTwoObjects.length - 1;
public static Object testPowerOfTwo() {
return powerOfTwoObjects[(int) powerOfTwoObjectsIdx++ & powerOfTwoObjectsMask];
}
public static Object testNonPowerOfTwoObjects() {
return nonPowerOfTwoObjects[(int) Math.abs(nonPowerOfTwoObjectsIdx++ % nonPowerOfTwoObjects.length)];
}
public static void main(String[] args) {
MyUtils.Watch watch = MyUtils.newWatch();
watch.start();
for (long i = 0; i < loopCount; i++) {
testPowerOfTwo();
}
watch.stop();
watch.printUseTime();
watch.start();
for (long i = 0; i < loopCount; i++) {
testNonPowerOfTwoObjects();
}
watch.stop();
watch.printUseTime();
}
}
result:
4312ms
19811ms
For the OSR issue, I have extracted it into the following code, which can be directly tested and reproduced:
public class OSRDemo {
static int counter_1;
public static void main(String[] var0) throws Exception {
System.out.println("main start");
startBusinessThread();
startProblemThread();
Thread.sleep(1000L);
System.out.println("start gc");
System.gc(); // Trigger STW
System.out.println("main end");
}
// Possible reason for not detecting thread safety points due to OSR
public static void startProblemThread() {
new Thread(() -> {
// Here, you can separate int i and j, or set them to short to complete semantics
for (int i = 0; i < 8888888; i++) {
for (int j = 0; j < 8888888; j++) {
counter_1 += counter_1 % 3 % 33 % 3333 % 33333;
counter_1 += counter_1 % 3 % 33 % 3333 % 33333;
}
}
}).start();
}
public static void startBusinessThread() {
(new Thread(() -> {
System.out.println("Business Thread - 1 start");
while (true) {
System.out.println("Execute Business 1");
try {
Thread.sleep(1000L);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
})).start();
(new Thread(() -> {
System.out.println("Business Thread - 2 start");
while (true) {
System.out.println("Execute Business 2");
try {
Thread.sleep(1000L);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
})).start();
}
}
The following code is used by the C2 compiler to optimize OSR and determine whether to generate thread safety points. In the first judgment, variables can be extracted to avoid being optimized as CountedLoop and removing the code for detecting safety points
bool PhaseIdealLoop::is_counted_loop( Node *x, IdealLoopTree *loop ) {
if (x->in(LoopNode::Self) == NULL || x->req() != 3 || loop->_irreducible) { // Extracting the initialization list of for variables will break the second condition
return false;
}
// omit...
if (cmp_op != Op_CmpI) // Setting it as' long 'will break the conditions here, but it will also disrupt the original computational semantics and avoid strong conversions, resulting in performance loss
return false;
}
Describe the Solution You'd Like
// Removing the IDIV instruction
public class NettyServerConfig implements Cloneable {
private int serverSelectorThreads = 4;
}
// Remove the 'for' loop from warmAppedFile
public void warmMappedFile(FlushDiskType type, int pages) {
this.mappedByteBufferAccessCountSinceLastSwap++;
long beginTime = System.currentTimeMillis();
ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
long flush = 0;
int i = 0, j = 0; // Breaking the triple check of C2 compiler CountedLoop optimization to avoid thread safety issues,reference JVM source code:PhaseIdealLoop::is_counted_loop
for (; i < this.fileSize; i += DefaultMappedFile.OS_PAGE_SIZE, j++) {
byteBuffer.put(i, (byte) 0);
// force flush when flush disk type is sync
if (type == FlushDiskType.SYNC_FLUSH) {
if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
flush = i;
mappedByteBuffer.force();
}
}
}
}
Describe Alternatives You've Considered
From the source code, it can be seen that only when the comparison value is int, it will be optimized to CountedLoop. If the number of loops is within the range of short, the initialization list parameter of for can also be modified to short type:
for (short i = 0, j = 0; i < this.fileSize; i += DefaultMappedFile.OS_PAGE_SIZE, j++) {
byteBuffer.put((int) i, (byte) 0);
// force flush when flush disk type is sync
if (type == FlushDiskType.SYNC_FLUSH) {
if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
flush = i;
mappedByteBuffer.force();
}
}
}
bool PhaseIdealLoop::is_counted_loop( Node *x, IdealLoopTree *loop ) {
if (x->in(LoopNode::Self) == NULL || x->req() != 3 || loop->_irreducible) { // Extracting the initialization list of for variables will break the second condition
return false;
}
// omit...
if (cmp_op != Op_CmpI) // Setting it as' long 'will break the conditions here, but it will also disrupt the original computational semantics and avoid strong conversions, resulting in performance loss
return false;
}
Additional Context
No response