Enable `-Oft=min` on CUDA to decrease compile-time memory and time pressure #171
Conversation
I paste here for posterity the report of my other findings (analysis of ptxas verbose output, PTX code, and the disassembled CUBIN binary) compiled by ChatGPT:

NVRTC Optimization Investigation

Objective: Investigate the large discrepancy between compilation time and runtime performance under NVRTC default optimization versus the fast-compile modes.

Observed:

1. PTX Size and Structural Metrics

- PTX size
- PTX structure
- Interpretation
Conclusion: the dominant effect is inlining and specialization within functions, not simple cloning of additional functions.

2. Function Structure Differences

Diffing function declarations shows:
Interpretation
This directly explains the PTX growth: logic is duplicated across many contexts instead of reused.

3. Function Size Analysis

- Largest functions (default)
- Largest functions (
Description
This addresses the very high memory usage observed when building kernels on GPU.
I ran some tests with different configurations, including:
- `-Ofast-compile` options
- `__noinline__` on the functions called from the mega-kernel `track_line`
- `-Xptxas -O N` options

Below are the results:
Setup: LHC line, 1000 particles, 10 turns
GPU: Nvidia H100 NVL 47 GB, NVRTC 12.9, CuPy 14.0.1
Host: Intel Xeon Platinum 8468, 56 GB RAM
- `__noinline__` only
- `--Ofast-compile=min`
- `--Ofast-compile=min` + noinline
- `--Ofast-compile=mid` (min)
- `--Ofast-compile=mid` + noinline (min + noinline)
- `--Ofast-compile=max`
- `--Ofast-compile=max` + noinline
- `--Ofast-compile=max` + PTX O0 + no-expensive + noinline
- `-Xptxas -O0` (no fast front-end)

Adding more `__noinline__` statements to other functions in the `track_magnet` path yielded a further compile-time speedup (up to 4x), but at a significant cost in tracking time. (On the other hand, very minimal noinlining can provide a tangible benefit for OpenCL; see below.) It seems that whatever the difference between `-Oft=0` and `-Oft=min`, the optimisations it yields are negligible for us, at a tragic expense in tracking time (150x slower). (It seems setting `-Oft` disables cloning, which we presumably require a lot of; `min` does not even disable "expensive optimizations"...) In the meantime, I did not manage to accomplish a similar gain with OpenCL:
There is `-cl-nv-opt-level`; however, it seems to be ignored by the compiler in the CUDA version under test. The general flag `-cl-opt-disable` is a nuclear option that produces unacceptably slow code, and is therefore a no-go. Perhaps this is not an interesting case in practice anyway: who'd use Pyopencl on an Nvidia card in the wild? Still, `__attribute__((noinline))` on the main per-element tracking functions improved the build time to 36 s, so in this context it could be explored.

Checklist
Mandatory:
Optional: