[Enhancement] Optimize Netty thread count and warmMappedFile OSR Problem

### Before Creating the Enhancement Request

- [x] I have confirmed that this should be classified as an enhancement rather than a bug/feature.


### Summary

In Netty, there are two EventExecutorChoosers, GenericEventExecutorChooser and PowerOfTwoEventExecutorChooser. When the number of threads is set to 3, GenericEventExecutorChooser will be used, which will cause performance to decrease as the number of calls increases，This is because the delay of the AND instruction is 1 nanosecond, while the delay of IDIV is 60-80 nanoseconds (as described in the Intel Performance Optimization Manual).

At the same time, in warmMapedFile, there may be a problem where the C2 compiler optimizes the for loop to CountedLoop during OSR, which makes it impossible to detect thread safety points and prevents VMthread from STW. Therefore, the for loop value is modified to long, and a strong conversion is performed during internal put to avoid this issue. Only by breaking the pre detection condition of CountedLoop: For requires triple detection, it can meet the requirements

### Motivation

1. Removing the IDIV instruction significantly improves Netty's speed in selecting the next thread
2. Remove the 'for' loop from warmAppedFile and set the number of loops to long to avoid forced conversion

The following code is a test on the x86 platform, and the test results show that as the number of cycles increases, the performance significantly decreases：

public class PowerOfTwoTest {
    public static Object[] powerOfTwoObjects = new Object[16];
    public static Object[] nonPowerOfTwoObjects = new Object[15];
    public static long powerOfTwoObjectsIdx;
    public static long nonPowerOfTwoObjectsIdx;
    public static long loopCount = 9999999999L;

    public static final int powerOfTwoObjectsMask = powerOfTwoObjects.length - 1;

    public static Object testPowerOfTwo() {
        return powerOfTwoObjects[(int) powerOfTwoObjectsIdx++ & powerOfTwoObjectsMask];
    }

    public static Object testNonPowerOfTwoObjects() {
        return nonPowerOfTwoObjects[(int) Math.abs(nonPowerOfTwoObjectsIdx++ % nonPowerOfTwoObjects.length)];
    }


    public static void main(String[] args) {
        MyUtils.Watch watch = MyUtils.newWatch();
        watch.start();
        for (long i = 0; i < loopCount; i++) {
            testPowerOfTwo();
        }
        watch.stop();
        watch.printUseTime();


        watch.start();
        for (long i = 0; i < loopCount; i++) {
            testNonPowerOfTwoObjects();
        }
        watch.stop();
        watch.printUseTime();
    }
}

result： 
4312ms
19811ms

For the OSR issue, I have extracted it into the following code, which can be directly tested and reproduced：

public class OSRDemo {
    static int counter_1;


    public static void main(String[] var0) throws Exception {
        System.out.println("main start");

        startBusinessThread();
        startProblemThread();

        Thread.sleep(1000L);

        System.out.println("start gc");
        System.gc(); // Trigger STW
        System.out.println("main end");
    }

    // Possible reason for not detecting thread safety points due to OSR
    public static void startProblemThread() {
        new Thread(() -> {
            // Here, you can separate int i and j, or set them to short to complete semantics
            for (int i = 0; i < 8888888; i++) {
                for (int j = 0; j < 8888888; j++) {
                    counter_1 += counter_1 % 3 % 33 % 3333 % 33333;
                    counter_1 += counter_1 % 3 % 33 % 3333 % 33333;
                }
            }

        }).start();
    }

    public static void startBusinessThread() {
        (new Thread(() -> {
            System.out.println("Business Thread - 1 start");

            while (true) {
                System.out.println("Execute Business 1");

                try {
                    Thread.sleep(1000L);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        })).start();

        (new Thread(() -> {
            System.out.println("Business Thread - 2 start");

            while (true) {
                System.out.println("Execute Business 2");

                try {
                    Thread.sleep(1000L);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        })).start();
    }
}




The following code is used by the C2 compiler to optimize OSR and determine whether to generate thread safety points. In the first judgment, variables can be extracted to avoid being optimized as CountedLoop and removing the code for detecting safety points

bool PhaseIdealLoop::is_counted_loop( Node *x, IdealLoopTree *loop ) { 
  if (x->in(LoopNode::Self) == NULL || x->req() != 3 || loop->_irreducible) { // Extracting the initialization list of for variables will break the second condition
    return false;
  }
  // omit...
  if (cmp_op != Op_CmpI)  // Setting it as' long 'will break the conditions here, but it will also disrupt the original computational semantics and avoid strong conversions, resulting in performance loss
    return false;        
}

### Describe the Solution You'd Like

// Removing the IDIV instruction 
public class NettyServerConfig implements Cloneable {
        private int serverSelectorThreads = 4;
}



// Remove the 'for' loop from warmAppedFile
public void warmMappedFile(FlushDiskType type, int pages) {
        this.mappedByteBufferAccessCountSinceLastSwap++;

        long beginTime = System.currentTimeMillis();
        ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
        long flush = 0;
        int i = 0, j = 0; // Breaking the triple check of C2 compiler CountedLoop optimization to avoid thread safety issues，reference JVM source code：PhaseIdealLoop::is_counted_loop
        for (; i < this.fileSize; i += DefaultMappedFile.OS_PAGE_SIZE, j++) {
            byteBuffer.put(i, (byte) 0);
            // force flush when flush disk type is sync
            if (type == FlushDiskType.SYNC_FLUSH) {
                if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                    flush = i;
                    mappedByteBuffer.force();
                }
            }
        }
}

### Describe Alternatives You've Considered

From the source code, it can be seen that only when the comparison value is int, it will be optimized to CountedLoop. If the number of loops is within the range of short, the initialization list parameter of for can also be modified to short type：

    for (short i = 0, j = 0; i < this.fileSize; i += DefaultMappedFile.OS_PAGE_SIZE, j++) {
            byteBuffer.put((int) i, (byte) 0);
            // force flush when flush disk type is sync
            if (type == FlushDiskType.SYNC_FLUSH) {
                if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                    flush = i;
                    mappedByteBuffer.force();
                }
            }
        }



bool PhaseIdealLoop::is_counted_loop( Node *x, IdealLoopTree *loop ) { 
  if (x->in(LoopNode::Self) == NULL || x->req() != 3 || loop->_irreducible) { // Extracting the initialization list of for variables will break the second condition
    return false;
  }
  // omit...
  if (cmp_op != Op_CmpI)  // Setting it as' long 'will break the conditions here, but it will also disrupt the original computational semantics and avoid strong conversions, resulting in performance loss
    return false;        
}

### Additional Context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Enhancement] Optimize Netty thread count and warmMappedFile OSR Problem #9449

Before Creating the Enhancement Request

Summary

Motivation

Describe the Solution You'd Like

Describe Alternatives You've Considered

Additional Context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Enhancement] Optimize Netty thread count and warmMappedFile OSR Problem #9449

Description

Before Creating the Enhancement Request

Summary

Motivation

Describe the Solution You'd Like

Describe Alternatives You've Considered

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions