
jit: Implement runtime instruction profiling #189

Closed
qwe661234 opened this issue Aug 13, 2023 · 4 comments · Fixed by #333

qwe661234 commented Aug 13, 2023

The existing JIT compilation process depends on heuristics that utilize fixed thresholds (#159) to determine when to transition from interpretation to JIT compilation when executing RISC-V instructions. This approach lacks flexibility and leads to inconsistent performance patterns. Consequently, there is a clear need for a more pragmatic method that involves gathering profiling data during interpretation. Furthermore, we require a defined strategy for making this transition based on the sampled data rather than relying on predetermined thresholds.

Let's consider the Java virtual machine (JVM), particularly HotSpot, which has a crucial objective: generating efficient machine code while minimizing runtime costs. To accomplish this, HotSpot employs a range of strategies, including tiered compilation, dynamic profiling, speculation, deoptimization, and various compiler optimizations, both architecture-specific and architecture-independent.

Typically, the execution of a method begins in the interpreter, which is the simplest and most cost-effective means available to HotSpot for executing code. During method execution, whether through interpretation or in compiled form, the dynamic profile of the method is collected through instrumentation. This profile is then used by several heuristics to make decisions, such as whether the method should be compiled, recompiled at a different optimization level, and which optimizations should be applied.

When an application starts, the JVM initially interprets all bytecode while gathering profiling information about it. The JIT compiler then leverages this collected profiling information to identify hotspots. Initially, the JIT compiler compiles frequently executed code sections with C1 to rapidly achieve native code performance. Later, as more profiling information becomes available, C2 comes into play. C2 recompiles the code with more aggressive and time-intensive optimizations to further enhance performance.
[Figure: tiered compilation]

Another advantageous aspect of tiered compilation is the acquisition of more accurate profiling information. Prior to tiered compilation, the JVM collected profiling information only during interpretation. However, with tiered compilation enabled, the JVM also gathers profiling information on the code compiled with C1. As the compiled code delivers better performance, it allows the JVM to accumulate more profiling samples.

Reference:

@jserv jserv changed the title jit: Implement a runtime instruction profiling tool jit: Implement runtime instruction profiling Nov 30, 2023
jserv commented Dec 12, 2023

Reference:

qwe661234 commented

Currently, we launch the tier-1 JIT compiler (T1C) when a block's execution frequency exceeds 4096, but we still need a profiler to detect hotspots correctly.

After implementing block chaining, a graph-like IR, the LFU block cache, and the tier-1 JIT compiler, we can collect the following profiling information while executing in interpreter mode:

  1. Execution frequency of each block
  2. The number of IR instructions within a chained block collection
  3. Whether a basic block contains a backward jump
  4. Whether the chained block collection contains a loop
  5. The number of times the machine code generated by the tier-1 JIT compiler is invoked
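The per-block profiling data above might be represented roughly as follows (a hypothetical sketch; the struct and field names are illustrative and do not match the actual rv32emu source):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block profile record mirroring the five items above. */
typedef struct {
    uint32_t freq;            /* 1. execution frequency of the block          */
    uint32_t n_ir;            /* 2. IR instructions in the chained collection */
    bool has_backward_jump;   /* 3. basic block ends with a backward jump     */
    bool has_loop;            /* 4. chained block collection contains a loop  */
    uint32_t t1c_invocations; /* 5. calls into T1C-generated machine code     */
} block_profile_t;
```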

We collect profiling information from the benchmarks:

From the profiling information, we observe that certain machine code is invoked frequently, indicating genuine hotspots. It is advisable to offload these hotspots to T2C to generate highly optimized machine code. To pinpoint the true hotspots, we select machine code whose invocation count exceeds 4096 as the target for extracting useful profiling information.

Based on our observations, a high percentage of true hotspots involve loops or backward jumps, but the number of IR instructions varies widely among them. Therefore, we believe our profiler can use three indices to detect hotspots:

  1. Backward jump
  2. Loop
  3. Used frequency

Nevertheless, the number of IR instructions remains useful information for T2C. We can launch the T2C process only when the number of IR instructions exceeds a certain threshold.
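The resulting tiering policy could be sketched as follows (a minimal sketch: the function names, struct, and the T2C IR-count gate of 64 are assumptions; only the 4096 threshold comes from the discussion above):

```c
#include <stdbool.h>
#include <stdint.h>

#define T1C_FREQ_THRESHOLD 4096 /* frequency threshold from the discussion */
#define T2C_MIN_IR 64           /* hypothetical IR-count gate for T2C      */

typedef struct {
    uint32_t freq;          /* execution frequency of the block        */
    uint32_t n_ir;          /* IR instructions in the chained set      */
    bool has_backward_jump; /* index 1: backward jump                  */
    bool has_loop;          /* index 2: loop                           */
} block_profile_t;

/* A block is a hotspot candidate if it loops (or jumps backward)
 * and is executed frequently enough (index 3: used frequency). */
static bool is_hotspot(const block_profile_t *p)
{
    return (p->has_loop || p->has_backward_jump) &&
           p->freq > T1C_FREQ_THRESHOLD;
}

/* Offload to T2C only when there is enough IR for the more
 * expensive optimizations to pay off. */
static bool should_launch_t2c(const block_profile_t *p)
{
    return is_hotspot(p) && p->n_ir >= T2C_MIN_IR;
}
```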


jserv commented Dec 31, 2023

> From the profiling information, we observe that certain machine code is invoked frequently, indicating genuine hotspots. It is advisable to offload these hotspots to T2C for the generation of highly optimized machine code. To pinpoint the true hotspots, we select machine code with an invoked times exceeding 4096 as the target for extracting useful profiling information.
> [..]
> Nevertheless, the number of IR remains useful information for T2C. We can launch the T2C process only when the number of IR exceeds a certain number.

Are we ready to build a tier-2 compiler (T2C) that operates as a separate thread or child process, generating optimized code from profiling data? Alongside it, we need to develop a stable, race-free compilation queue for T2C: if the main thread requests compilation of a specific block while another block covering it is already queued, the covering block should be prioritized instead. This approach minimizes the number of blocks compiled.

Each block entering the compilation queue for T2C will have an associated counter, initially set to 1. This counter increases each time the main thread flags the block for compilation. This system ensures prioritization of blocks in highest demand, rather than simply compiling newer blocks that might not be executed again. This method prioritizes necessity and efficiency in the compilation process, ensuring the most critical blocks are addressed first.
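The queue policy described above could be sketched as follows (single-threaded for clarity; the function names, fixed capacity, and linear covering check are all assumptions, not the actual implementation):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start, end; /* guest PC range covered by the block       */
    uint32_t demand;     /* times the main thread flagged this block  */
} queue_entry_t;

#define QUEUE_CAP 64
static queue_entry_t queue[QUEUE_CAP];
static size_t queue_len;

/* Enqueue a compilation request. If an already-queued block covers the
 * requested range, bump its counter instead of adding a new entry, so
 * covering blocks are compiled once and gain priority. */
void t2c_request(uint32_t start, uint32_t end)
{
    for (size_t i = 0; i < queue_len; i++) {
        if (queue[i].start <= start && end <= queue[i].end) {
            queue[i].demand++;
            return;
        }
    }
    if (queue_len < QUEUE_CAP)
        queue[queue_len++] = (queue_entry_t){start, end, 1};
}

/* Pop the entry in highest demand rather than the newest one, so the
 * most critical blocks are compiled first. */
bool t2c_pop(queue_entry_t *out)
{
    if (!queue_len)
        return false;
    size_t best = 0;
    for (size_t i = 1; i < queue_len; i++)
        if (queue[i].demand > queue[best].demand)
            best = i;
    *out = queue[best];
    queue[best] = queue[--queue_len]; /* swap-remove */
    return true;
}
```

In a multithreaded version, the counter updates and swap-remove would need to be guarded by a lock or made atomic; the sketch deliberately defers that, matching the staged plan discussed below.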

Thus, we have the opportunity to develop T2C, building upon our previous experiments and using Clang as the base. Later, we can transition to using llvm-c, similar to the approach taken by wasm-micro-runtime.

qwe661234 commented

> From the profiling information, we observe that certain machine code is invoked frequently, indicating genuine hotspots. It is advisable to offload these hotspots to T2C for the generation of highly optimized machine code. To pinpoint the true hotspots, we select machine code with an invoked times exceeding 4096 as the target for extracting useful profiling information.
> [..]
> Nevertheless, the number of IR remains useful information for T2C. We can launch the T2C process only when the number of IR exceeds a certain number.
>
> Are we ready to build a tier-2 compiler (T2C) that operates as a separate thread or child process, generating optimized code from profiling data? Alongside, we need to develop a stable, race-free compilation queue for T2C. This means if the main thread requests compilation of a specific block, but another covering block is already queued, the covering block should be prioritized instead. This approach minimizes the number of blocks compiled.
>
> Each block entering the compilation queue for T2C will have an associated counter, initially set to 1. This counter increases each time the main thread flags the block for compilation. This system ensures prioritization of blocks in highest demand, rather than simply compiling newer blocks that might not be executed again. This method prioritizes necessity and efficiency in the compilation process, ensuring the most critical blocks are addressed first.
>
> Thus, we have the opportunity to develop T2C, building upon our previous experiments and using Clang as the base. Later, we can transition to using llvm-c, similar to the approach taken by wasm-micro-runtime.

Perhaps we can develop a runtime profiler and a tier-2 compiler (T2C) initially without a multithreading mechanism. Once the implementation is complete and the performance improvement is deemed acceptable, we can then proceed with integrating a multithreading mechanism.

qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Jan 21, 2024
Based on our observation, a high percentage of true hotspots involve
loops or backward jumps, but the number of IR is unstable within these
true hotspots.

Therefore, we believe our profiler can use three indices to detect
hotspots:

1. Backward jump
2. Loop
3. Used frequency

Close: sysprog21#189
@qwe661234 qwe661234 mentioned this issue Jan 21, 2024
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Jan 27, 2024
HenryChaing pushed a commit to HenryChaing/rv32emu that referenced this issue Feb 5, 2024