jit: Implement runtime instruction profiling #189
Reference:
Currently, we launch the tier-1 JIT compiler (T1C) when the execution frequency of a block exceeds 4096, but we still need a profiler to correctly detect hotspots. After implementing block chaining, the graph-like IR, the LFU block cache, and the tier-1 JIT compiler, we already have some profiling information when executing in interpreter mode:
We collect profiling information from the benchmarks. From this information, we observe that certain machine code is invoked frequently, indicating genuine hotspots. It is advisable to offload these hotspots to T2C for the generation of highly optimized machine code. To pinpoint the true hotspots, we select machine code whose invocation count exceeds 4096 as the target for extracting useful profiling information. Based on our observations, a high percentage of true hotspots involve loops or backward jumps, but the IR count varies widely within these true hotspots. Therefore, we believe our profiler can use three indices to detect hotspots:
1. Backward jump
2. Loop
3. Used frequency
Nevertheless, the IR count remains useful information for T2C: we can launch the T2C process only when the number of IR instructions exceeds a certain threshold.
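To make this concrete, here is a minimal sketch of such a hotspot test in C. All names here (`block_t` and its fields, the two threshold macros) are hypothetical illustrations, not identifiers from the actual rv32emu source:

```c
#include <stdbool.h>
#include <stdint.h>

#define HOT_FREQ_THRESHOLD 4096 /* invocation count used by T1C today */
#define T2C_IR_THRESHOLD 16     /* gate T2C on a minimum IR count */

typedef struct {
    uint32_t freq;      /* how many times the block has been entered */
    bool has_loop;      /* block contains a loop */
    bool backward_jump; /* block ends with a backward (negative-offset) jump */
    uint32_t n_ir;      /* number of IR instructions in the block */
} block_t;

/* A block is considered a true hotspot when it is frequently executed
 * AND exhibits loop-like control flow (loop or backward jump). */
static bool is_hotspot(const block_t *blk)
{
    if (blk->freq <= HOT_FREQ_THRESHOLD)
        return false;
    return blk->has_loop || blk->backward_jump;
}

/* Offload to T2C only when the block is a hotspot and large enough that
 * the heavier optimization pipeline can pay for itself. */
static bool should_offload_to_t2c(const block_t *blk)
{
    return is_hotspot(blk) && blk->n_ir >= T2C_IR_THRESHOLD;
}
```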
Are we ready to build a tier-2 compiler (T2C) that operates as a separate thread or child process, generating optimized code from profiling data? Alongside it, we need to develop a stable, race-free compilation queue for T2C. This means that if the main thread requests compilation of a specific block while another block covering it is already queued, the covering block should be prioritized instead, minimizing the number of blocks compiled. Each block entering the T2C compilation queue carries an associated counter, initially set to 1, which is incremented each time the main thread flags the block for compilation. This ensures we prioritize the blocks in highest demand rather than simply compiling newer blocks that might never be executed again, so the most critical blocks are addressed first. Thus, we have the opportunity to develop T2C, building upon our previous experiments and using Clang as the base. Later, we can transition to llvm-c, similar to the approach taken by wasm-micro-runtime.
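A sketch of how such a demand-counting queue could look, under hypothetical names and single-threaded for brevity (a mutex around enqueue/dequeue would be needed once T2C runs in its own thread):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 256

typedef struct {
    uint32_t start_pc, end_pc; /* address range the block covers */
} block_t;

typedef struct {
    block_t *blk;
    uint32_t demand; /* starts at 1, bumped on every re-request */
} queue_entry_t;

static queue_entry_t queue[QUEUE_CAP];
static size_t queue_len = 0;

/* Does "outer" fully cover "inner"? */
static bool covers(const block_t *outer, const block_t *inner)
{
    return outer->start_pc <= inner->start_pc &&
           outer->end_pc >= inner->end_pc;
}

/* Main thread flags a block for T2C compilation. */
void t2c_request(block_t *blk)
{
    for (size_t i = 0; i < queue_len; i++) {
        /* A covering (or identical) block is already queued: raise its
         * demand instead of enqueuing a redundant, smaller block. */
        if (covers(queue[i].blk, blk)) {
            queue[i].demand++;
            return;
        }
    }
    if (queue_len < QUEUE_CAP)
        queue[queue_len++] = (queue_entry_t){.blk = blk, .demand = 1};
}

/* T2C worker pops the entry in highest demand, not the newest one. */
block_t *t2c_pop(void)
{
    if (!queue_len)
        return NULL;
    size_t best = 0;
    for (size_t i = 1; i < queue_len; i++)
        if (queue[i].demand > queue[best].demand)
            best = i;
    block_t *blk = queue[best].blk;
    queue[best] = queue[--queue_len]; /* swap-remove */
    return blk;
}
```

The linear scans keep the sketch short; with many in-flight blocks, a hash map keyed by address range plus a max-heap on the demand counter would serve the same purpose.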
Perhaps we can develop the runtime profiler and tier-2 compiler (T2C) initially without a multithreading mechanism. Once the implementation is complete and the performance improvement is deemed acceptable, we can then proceed with integrating multithreading.
Based on our observation, a high percentage of true hotspots involve loops or backward jumps, but the IR count is unstable within these true hotspots. Therefore, we believe our profiler can use three indices to detect hotspots:
1. Backward jump
2. Loop
3. Used frequency

Close: sysprog21#189
The existing JIT compilation process depends on heuristics with fixed thresholds (#159) to decide when to transition from interpretation to JIT compilation while executing RISC-V instructions. This approach lacks flexibility and leads to inconsistent performance. Consequently, there is a clear need for a more pragmatic method that gathers profiling data during interpretation, together with a defined strategy for making the transition based on the sampled data rather than predetermined thresholds.
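Schematically, the fixed-threshold heuristic being replaced looks like the following (the names are illustrative, not the actual rv32emu API): every dispatch into a block bumps a counter, and crossing one hard-coded bound triggers compilation regardless of what the block contains.

```c
#include <stdint.h>

typedef struct {
    uint32_t freq; /* times this block has been entered */
    /* ... decoded instructions, translated code, etc. ... */
} block_t;

extern void jit_compile(block_t *blk); /* hand the block to T1C */

#define JIT_THRESHOLD 4096 /* hard-coded bound: the crux of the problem */

/* One-shot tiering decision driven purely by a fixed counter value. */
static inline void block_enter(block_t *blk)
{
    if (++blk->freq == JIT_THRESHOLD)
        jit_compile(blk);
}
```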
Let's consider the Java virtual machine (JVM), particularly HotSpot, which has a crucial objective: generating efficient machine code while minimizing runtime costs. To accomplish this, HotSpot employs a range of strategies, including tiered compilation, dynamic profiling, speculation, deoptimization, and various compiler optimizations, both architecture-specific and architecture-independent.
Typically, the execution of a method begins in the interpreter, which is the simplest and most cost-effective means available to HotSpot for executing code. During method execution, whether interpreted or compiled, the dynamic profile of the method is collected through instrumentation. This profile is then used by several heuristics to make decisions, such as whether the method should be compiled or recompiled at a different optimization level, and which optimizations should be applied.
When an application starts, the JVM initially interprets all bytecode while gathering profiling information about it. The JIT compiler then leverages this collected profiling information to identify hotspots. Initially, the JIT compiler compiles frequently executed code sections with C1 to rapidly achieve native code performance. Later, as more profiling information becomes available, C2 comes into play. C2 recompiles the code with more aggressive and time-intensive optimizations to further enhance performance.
Another advantageous aspect of tiered compilation is the acquisition of more accurate profiling information. Prior to tiered compilation, the JVM collected profiling information only during interpretation. However, with tiered compilation enabled, the JVM also gathers profiling information on the code compiled with C1. As the compiled code delivers better performance, it allows the JVM to accumulate more profiling samples.