jit: Implement runtime instruction profiling #189
Reference:
Currently, we launch the tier-1 JIT compiler (T1C) when the execution frequency of a block exceeds 4096, but we still need a profiler to correctly detect hotspots. After implementing block chaining, the graph-like IR, the LFU block cache, and the tier-1 JIT compiler, we already have some profiling information when executing in interpreter mode:
We collect profiling information from the benchmarks. From this information, we observe that certain machine code is invoked frequently, indicating genuine hotspots. It is advisable to offload these hotspots to T2C for the generation of highly optimized machine code. To pinpoint the true hotspots, we select machine code whose invocation count exceeds 4096 as the target for extracting useful profiling information. Based on our observations, a high percentage of true hotspots involve loops or backward jumps, but the IR count varies widely within these true hotspots. Therefore, we believe our profiler can use three indices to detect hotspots:
1. Backward jump
2. Loop
3. Used frequency
Nevertheless, the IR count remains useful information for T2C: we can launch the T2C process only when the number of IR instructions exceeds a certain threshold.
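To make this concrete, here is a minimal sketch of such a hotspot test in C. All names here (`block_t` and its fields, the two threshold macros) are hypothetical illustrations, not identifiers from the actual rv32emu source:

```c
#include <stdbool.h>
#include <stdint.h>

#define HOT_FREQ_THRESHOLD 4096 /* invocation count used by T1C today */
#define T2C_IR_THRESHOLD 16     /* gate T2C on a minimum IR count */

typedef struct {
    uint32_t freq;      /* how many times the block has been entered */
    bool has_loop;      /* block contains a loop */
    bool backward_jump; /* block ends with a backward (negative-offset) jump */
    uint32_t n_ir;      /* number of IR instructions in the block */
} block_t;

/* A block is considered a true hotspot when it is frequently executed
 * AND exhibits loop-like control flow (loop or backward jump). */
static bool is_hotspot(const block_t *blk)
{
    if (blk->freq <= HOT_FREQ_THRESHOLD)
        return false;
    return blk->has_loop || blk->backward_jump;
}

/* Offload to T2C only when the block is a hotspot and large enough that
 * the heavier optimization pipeline can pay for itself. */
static bool should_offload_to_t2c(const block_t *blk)
{
    return is_hotspot(blk) && blk->n_ir >= T2C_IR_THRESHOLD;
}
```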
Are we ready to build a tier-2 compiler (T2C) that operates as a separate thread or child process, generating optimized code from profiling data? Alongside it, we need to develop a stable, race-free compilation queue for T2C. This means that if the main thread requests compilation of a specific block while another block covering it is already queued, the covering block should be prioritized instead, minimizing the number of blocks compiled. Each block entering the T2C compilation queue carries an associated counter, initially set to 1, which is incremented each time the main thread flags the block for compilation. This ensures we prioritize the blocks in highest demand rather than simply compiling newer blocks that might never be executed again, so the most critical blocks are addressed first. Thus, we have the opportunity to develop T2C, building upon our previous experiments and using Clang as the base. Later, we can transition to llvm-c, similar to the approach taken by wasm-micro-runtime.
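A sketch of how such a demand-counting queue could look, under hypothetical names and single-threaded for brevity (a mutex around enqueue/dequeue would be needed once T2C runs in its own thread):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 256

typedef struct {
    uint32_t start_pc, end_pc; /* address range the block covers */
} block_t;

typedef struct {
    block_t *blk;
    uint32_t demand; /* starts at 1, bumped on every re-request */
} queue_entry_t;

static queue_entry_t queue[QUEUE_CAP];
static size_t queue_len = 0;

/* Does "outer" fully cover "inner"? */
static bool covers(const block_t *outer, const block_t *inner)
{
    return outer->start_pc <= inner->start_pc &&
           outer->end_pc >= inner->end_pc;
}

/* Main thread flags a block for T2C compilation. */
void t2c_request(block_t *blk)
{
    for (size_t i = 0; i < queue_len; i++) {
        /* A covering (or identical) block is already queued: raise its
         * demand instead of enqueuing a redundant, smaller block. */
        if (covers(queue[i].blk, blk)) {
            queue[i].demand++;
            return;
        }
    }
    if (queue_len < QUEUE_CAP)
        queue[queue_len++] = (queue_entry_t){.blk = blk, .demand = 1};
}

/* T2C worker pops the entry in highest demand, not the newest one. */
block_t *t2c_pop(void)
{
    if (!queue_len)
        return NULL;
    size_t best = 0;
    for (size_t i = 1; i < queue_len; i++)
        if (queue[i].demand > queue[best].demand)
            best = i;
    block_t *blk = queue[best].blk;
    queue[best] = queue[--queue_len]; /* swap-remove */
    return blk;
}
```

The linear scans keep the sketch short; with many in-flight blocks, a hash map keyed by address range plus a max-heap on the demand counter would serve the same purpose.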
Perhaps we can develop the runtime profiler and tier-2 compiler (T2C) initially without a multithreading mechanism. Once the implementation is complete and the performance improvement is deemed acceptable, we can then proceed with integrating multithreading.
Based on our observation, a high percentage of true hotspots involve loops or backward jumps, but the IR count is unstable within these true hotspots. Therefore, we believe our profiler can use three indices to detect hotspots:
1. Backward jump
2. Loop
3. Used frequency

Close: sysprog21#189
The existing JIT compilation process depends on heuristics with fixed thresholds (#159) to decide when to transition from interpretation to JIT compilation while executing RISC-V instructions. This approach lacks flexibility and leads to inconsistent performance. Consequently, there is a clear need for a more pragmatic method that gathers profiling data during interpretation, together with a defined strategy for making the transition based on the sampled data rather than predetermined thresholds.
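Schematically, the fixed-threshold heuristic being replaced looks like the following (the names are illustrative, not the actual rv32emu API): every dispatch into a block bumps a counter, and crossing one hard-coded bound triggers compilation regardless of what the block contains.

```c
#include <stdint.h>

typedef struct {
    uint32_t freq; /* times this block has been entered */
    /* ... decoded instructions, translated code, etc. ... */
} block_t;

extern void jit_compile(block_t *blk); /* hand the block to T1C */

#define JIT_THRESHOLD 4096 /* hard-coded bound: the crux of the problem */

/* One-shot tiering decision driven purely by a fixed counter value. */
static inline void block_enter(block_t *blk)
{
    if (++blk->freq == JIT_THRESHOLD)
        jit_compile(blk);
}
```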
Let's consider the Java virtual machine (JVM), particularly HotSpot, which has a crucial objective: generating efficient machine code while minimizing runtime costs. To accomplish this, HotSpot employs a range of strategies, including tiered compilation, dynamic profiling, speculation, deoptimization, and various compiler optimizations, both architecture-specific and architecture-independent.
Typically, the execution of a method begins in the interpreter, which is the simplest and most cost-effective means available to HotSpot for executing code. During method execution, whether interpreted or compiled, the dynamic profile of the method is collected through instrumentation. This profile is then used by several heuristics to make decisions, such as whether the method should be compiled or recompiled at a different optimization level, and which optimizations should be applied.
When an application starts, the JVM initially interprets all bytecode while gathering profiling information about it. The JIT compiler then leverages this collected profiling information to identify hotspots. Initially, the JIT compiler compiles frequently executed code sections with C1 to rapidly achieve native code performance. Later, as more profiling information becomes available, C2 comes into play. C2 recompiles the code with more aggressive and time-intensive optimizations to further enhance performance.
Another advantageous aspect of tiered compilation is the acquisition of more accurate profiling information. Prior to tiered compilation, the JVM collected profiling information only during interpretation. However, with tiered compilation enabled, the JVM also gathers profiling information on the code compiled with C1. As the compiled code delivers better performance, it allows the JVM to accumulate more profiling samples.