
Enable -Oft=min on CUDA to decrease compile-time mem and time pressure#171

Merged
szymonlopaciuk merged 1 commit into xsuite:main from szymonlopaciuk:cuda-enable-min-fast-compile
Apr 28, 2026

Conversation

@szymonlopaciuk
Contributor

@szymonlopaciuk szymonlopaciuk commented Apr 27, 2026

Description

This addresses the very high memory usage observed when building kernels on GPU.

I ran some tests with different configurations, including:

  • -Ofast-compile options
  • __noinline__ on the functions called from the mega-kernel track_line
  • -Xptxas -O N options
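As a reference for how these options reach the compiler, here is a minimal sketch of building an NVRTC option tuple and passing it through CuPy's `RawModule`. The helper function, its name and its defaults are illustrative, not the actual xobjects plumbing; whether a given backend accepts `-Xptxas` pass-through depends on the compiler in use.

```python
# Sketch: selecting CUDA compile options (illustrative helper, not xsuite's API).

def cuda_compile_options(fast_compile="min", ptxas_opt=None):
    """Build a compile-option tuple; fast_compile in {None, 'min', 'mid', 'max'}."""
    options = []
    if fast_compile is not None:
        options.append(f"-Ofast-compile={fast_compile}")
    if ptxas_opt is not None:
        # Forward an optimisation level to ptxas (tested above as -Xptxas -O N).
        options.append(f"-Xptxas -O{ptxas_opt}")
    return tuple(options)

# Usage with CuPy (requires a GPU, hence commented out):
# import cupy as cp
# mod = cp.RawModule(code=kernel_source, options=cuda_compile_options())
```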

Below are the results:

Setup: LHC line, 1000 particles, 10 turns
GPU: Nvidia H100 NVL 47 GB, NVRTC 12.9, CuPy 14.0.1
Host: Intel Xeon Platinum 8468, 56 GB RAM

| Config | Build time | Track time | vs baseline build | vs baseline track | Notes |
|---|---|---|---|---|---|
| Baseline | 1168 s | 1.306 s | — | — | Reference |
| `__noinline__` only | 978 s | 1.372 s | −16% | +5% | Not worth the complexity on its own |
| `--Ofast-compile=min` | 8.6 s | 1.543 s | −99.3% | +18% | Best overall: fast build, near-baseline tracking speed |
| `--Ofast-compile=min` + noinline | 7.0 s | 1.628 s | −99.4% | +25% | Noinline saves 1.5 s of build but hurts tracking |
| `--Ofast-compile=mid` | 8.7 s | 1.659 s | −99.3% | +27% | Good tradeoff, slightly worse tracking than min |
| `--Ofast-compile=mid` + noinline | 7.2 s | 1.758 s | −99.4% | +35% | Same pattern as min + noinline |
| `--Ofast-compile=max` | 5.4 s | 23.240 s | −99.5% | +18× | Fast front-end skips all ptxas optimisation; tragic in tracking |
| `--Ofast-compile=max` + noinline | 5.5 s | 23.002 s | −99.5% | +18× | Not meaningfully different from the above |
| `--Ofast-compile=max` + PTX O0 + no-expensive + noinline | 5.5 s | 22.970 s | −99.5% | +18× | Additional flags irrelevant |
| `-Xptxas -O0` (no fast front-end) | OOM | — | — | — | Consumes more memory than the baseline(?) |

Adding more __noinline__ statements to other functions in the track_magnet path yielded a further compile-time speedup (up to 4×), but at a significant cost in tracking time. (On the other hand, very minimal noinlining can provide a tangible benefit for OpenCL; see below.)
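For concreteness, this is roughly where the attribute goes in the kernel source. The function name follows the PR discussion; the signature and body are placeholders, and the snippet is kept as a Python string since the sources are assembled on the Python side:

```python
# Sketch: applying __noinline__ to a per-element helper in the CUDA source.
# Illustrative only: the real signature and body live in the xsuite C sources.
KERNEL_SNIPPET = r"""
// Prevent this helper from being inlined into the track_line mega-kernel,
// trading tracking speed for compile time (measured above: -16% build,
// +5% track when used on its own).
__device__ __noinline__
void track_magnet_body_single_particle(/* ... */) {
    /* ... */
}
"""
```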

It seems that whatever extra optimisation -Oft=min retains over -Oft=max (or -Xptxas -O0), giving it up saves a negligible amount of build time at a tragic expense in tracking time (~18× slower, per the table). (Setting -Oft appears to disable cloning, which we presumably rely on heavily: min does not even disable "expensive-optimizations"...)

In the meantime, I did not manage to accomplish a similar gain with OpenCL:

  • On Nvidia's OpenCL there is a flag -cl-nv-opt-level; however, it seems to be ignored by the compiler in the CUDA version under test. The generic flag -cl-opt-disable is a nuclear option that produces unacceptably slow code, and is therefore a no-go. Perhaps this is not an interesting case in practice: who would use ContextPyopencl on an Nvidia card in the wild?
  • On AMD OpenCL (a Radeon VII was used for testing) the problem is nowhere near as bad as on CUDA: memory consumption is unremarkable, and the compile time, while suboptimal at ~2 min, is still 10× less than 20 min, with a tracking time of ~3 s in the same scenario as above.
    • I could not identify other compiler tweaks that speed it up similarly to -Oft.
    • __attribute__((noinline)) on the main per-element tracking functions improved the build time to 36 s, so it could be worth exploring in this context.
    • The tracking-time discrepancy is surely explained by the hardware difference (Radeon VII vs H100).

Checklist

Mandatory:

  • I have added tests to cover my changes
  • All the tests are passing, including my new ones
  • I described my changes in this PR description

Optional:

  • The code I wrote follows good style practices (see PEP 8 and PEP 20).
  • I have updated the docs in relation to my changes, if applicable
  • I have tested also GPU contexts

@szymonlopaciuk
Contributor Author

szymonlopaciuk commented Apr 28, 2026

I paste here, for posterity, the report of my other findings (analysis of the ptxas verbose output, the PTX code, and the disassembled CUBIN binary), compiled by ChatGPT:

NVRTC Optimization Investigation

Objective

Investigate the large discrepancy between compilation time and runtime performance when using NVRTC default optimization versus --Ofast-compile=min.

Observed:

  • default compile: ~20 minutes
  • --Ofast-compile=min: ~8 seconds
  • runtime slowdown (min): ~18%

1. PTX Size and Structural Metrics

PTX size

default.ptx: 32,321,462 bytes
ofc_min.ptx:  3,088,884 bytes
ratio: ~10.5×

PTX structure

Function count:
  default: 97
  ofc_min: 106

call instructions:
  default: 6432
  ofc_min: 730     (~8.8× fewer)

branch instructions:
  default: 53822
  ofc_min: 7286    (~7.4× fewer)

Interpretation

  • The number of functions is comparable.
  • The major difference lies in function body size and internal complexity.
  • Default optimization produces significantly more control flow and call sites.

Conclusion: the dominant effect is inlining and specialization within functions, not simple cloning of additional functions.
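The structural metrics above can be reproduced with a simple text scan of the PTX dump; the same counting works on SASS obtained via `cuobjdump -sass`. A rough sketch (the regexes are heuristics over common PTX spellings like `call.uni` and `bra.uni`, not a proper parser):

```python
import re

def ptx_metrics(ptx_text):
    """Count function definitions, call sites and branches in a PTX dump.

    Heuristic: matches .entry/.func declarations, and the call/bra opcodes
    (including suffixed forms such as call.uni, bra.uni).
    """
    return {
        "functions": len(re.findall(r"\.(?:entry|func)\b", ptx_text)),
        "calls": len(re.findall(r"\bcall\b", ptx_text)),
        "branches": len(re.findall(r"\bbra\b", ptx_text)),
    }

# Tiny synthetic example:
sample = """
.visible .entry my_kernel()
{
    call.uni helper;
    bra LOOP;
    bra.uni END;
}
"""
print(ptx_metrics(sample))  # {'functions': 1, 'calls': 1, 'branches': 2}
```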

2. Function Structure Differences

Diffing function declarations shows:

  • Default:

    • Fewer standalone helpers
    • Many _with_transformations variants embedded
  • --Ofast-compile=min:

    • More explicit helper functions:

      track_magnet_drift_single_particle
      track_rf_kick_single_particle
      track_misalignment_*
      ...
      

Interpretation

  • ofc_min preserves a modular call graph
  • Default absorbs helper logic into callers

This directly explains the PTX growth: logic is duplicated across many contexts instead of reused.

3. Function Size Analysis

Largest functions (default)

track_magnet_body_single_particle              ~1.8 MB
ThickSliceCavity_track_local_particle*         ~1.5 MB
Cavity_track_local_particle*                   ~1.5 MB
track_line                                     ~1.1 MB

Largest functions (ofc_min)

track_line                                     ~0.43 MB
other functions mostly 40–130 KB

Kernel-level examples

Cavity_track_particles:
  default: 1,379,893 bytes
  ofc_min:    13,191 bytes

DriftSlice_track_particles:
  default: 2,698,330 bytes
  ofc_min:   175,451 bytes

track_line:
  default: 1,118,348 bytes
  ofc_min:   434,724 bytes

Interpretation

Default optimization performs large-scale duplication of execution paths inside kernels, not just minor specialization.

4. SASS (Machine Code) Comparison

SASS lines:
  default: 3,529,434
  ofc_min:   107,709
  ratio: ~32.7×

CALL:
  default: 70,753
  ofc_min:  1,587   (~44×)

BRA:
  default: 124,295
  ofc_min:  2,754   (~45×)

Interpretation

  • Code expansion propagates fully to machine code.
  • Default produces vastly more executable instructions.

This confirms that the compile-time cost corresponds to actual emitted code, not just analysis overhead.

5. Register and Stack Usage

Default (selected kernels)

track_line                    255 regs, 1616 stack
RBend_track_particles         255 regs, 1216 stack
Bend_track_particles          255 regs, 1184 stack
Quadrupole_track_particles    255 regs, 1104 stack
...

--Ofast-compile=min

track_line                    254 regs, 2048 stack
RBend_track_particles         254 regs, 1496 stack
Quadrupole_track_particles    254 regs, 1368 stack
...

Observations

  • Both modes reach the register limit (~255).
  • ofc_min often has higher stack usage.
  • Default sometimes has slightly better stack/register balance.

Interpretation

  • Default optimization does not reduce register usage significantly.
  • Some runtime benefit likely comes from better control-flow simplification and instruction scheduling, not reduced register pressure.
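The per-kernel register and stack figures above come from the ptxas verbose output; a small parser along these lines can tabulate them (the sample line format is assumed from typical `ptxas -v` output and may differ between CUDA versions):

```python
import re

def parse_ptxas_usage(line):
    """Extract (registers, stack bytes) from a ptxas -v info line.

    The line format is an assumption based on typical ptxas output;
    returns None for fields that are absent.
    """
    regs = re.search(r"(\d+) registers", line)
    stack = re.search(r"(\d+) bytes (?:cumulative )?stack", line)
    return (int(regs.group(1)) if regs else None,
            int(stack.group(1)) if stack else None)

line = "ptxas info    : Used 255 registers, 1616 bytes cumulative stack size"
print(parse_ptxas_usage(line))  # (255, 1616)
```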

6. Compile-Time Scaling Behavior

From subset experiments:

compile time ≈ linear in number of elements
~1 minute per element type
~20 minutes for full kernel

Interpretation

  • The cost is not exponential cloning
  • Instead, each element contributes a roughly fixed optimization cost
  • Shared leaf helpers are repeatedly re-specialized in different contexts
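The linear scaling observed in the subset experiments can be written as a one-line model (the per-element cost and element count are the approximate figures quoted above, and the zero intercept is an assumption):

```python
def estimate_full_build_s(n_element_types, per_element_s=60.0, overhead_s=0.0):
    """Linear compile-time model from the subset experiments:
    roughly one minute per element type, ~20 minutes for the full kernel.
    per_element_s and overhead_s are illustrative fit parameters."""
    return overhead_s + per_element_s * n_element_types

print(estimate_full_build_s(20) / 60)  # 20.0 minutes for ~20 element types
```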

7. Role of Leaf Helpers

Leaf functions such as:

track_magnet_particles
track_rf_particles

have:

  • many parameters
  • many optional code paths

These rely heavily on:

  • constant propagation
  • branch elimination
  • dead code elimination

Experiment

Manual __noinline__ on these helpers:

  • worse runtime than ofc_min
  • still non-trivial compile cost

Interpretation

  • These helpers are where optimization is actually valuable
  • Preventing their inlining removes critical simplifications

8. Synthesis

What default optimization does:

  • aggressive inlining of helpers
  • context-specific specialization
  • branch elimination
  • duplication of large code regions across element types

Result

  • massive PTX expansion (~10×)
  • massive SASS expansion (~30×)
  • large compile-time cost (~150×)
  • moderate runtime improvement (~18%)

Key insight

The optimizer is not wasting effort entirely; it is producing real performance gains. However, it achieves this through large-scale duplication of logic across similar contexts.

9. Implications

  • --Ofast-compile=min removes most duplication but loses ~18% performance.
  • Default optimization is too expensive to use indiscriminately in dynamic compilation scenarios.
  • Naïve suppression strategies (noinline) degrade runtime without providing much benefit for compilation.

@szymonlopaciuk szymonlopaciuk merged commit 35d14c9 into xsuite:main Apr 28, 2026
3 checks passed
