
Enable -Oft=min on CUDA to decrease compile-time mem and time pressure#171

Merged
szymonlopaciuk merged 1 commit into xsuite:main from szymonlopaciuk:cuda-enable-min-fast-compile
Apr 28, 2026

Conversation

@szymonlopaciuk
Contributor

@szymonlopaciuk szymonlopaciuk commented Apr 27, 2026

Description

This addresses the very high memory usage observed when building kernels on GPU.

I ran some tests with different configurations, including:

  • -Ofast-compile options
  • __noinline__ on the functions called from the mega-kernel track_line
  • -Xptxas -O N options
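As a reference for how these options reach the compiler, here is a minimal sketch of building an NVRTC option tuple and passing it through CuPy's `RawModule`. The helper function, its name and its defaults are illustrative, not the actual xobjects plumbing; whether a given backend accepts `-Xptxas` pass-through depends on the compiler in use.

```python
# Sketch: selecting CUDA compile options (illustrative helper, not xsuite's API).

def cuda_compile_options(fast_compile="min", ptxas_opt=None):
    """Build a compile-option tuple; fast_compile in {None, 'min', 'mid', 'max'}."""
    options = []
    if fast_compile is not None:
        options.append(f"-Ofast-compile={fast_compile}")
    if ptxas_opt is not None:
        # Forward an optimisation level to ptxas (tested above as -Xptxas -O N).
        options.append(f"-Xptxas -O{ptxas_opt}")
    return tuple(options)

# Usage with CuPy (requires a GPU, hence commented out):
# import cupy as cp
# mod = cp.RawModule(code=kernel_source, options=cuda_compile_options())
```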

Below are the results:

Setup: LHC line, 1000 particles, 10 turns
GPU: Nvidia H100 NVL 47 GB, NVRTC 12.9, CuPy 14.0.1
Host: Intel Xeon Platinum 8468, 56 GB RAM

| Config | Build time | Track time | vs baseline build | vs baseline track | Notes |
|---|---|---|---|---|---|
| Baseline | 1168 s | 1.306 s | — | — | Reference |
| `__noinline__` only | 978 s | 1.372 s | −16% | +5% | Not worth the complexity on its own |
| `--Ofast-compile=min` | 8.6 s | 1.543 s | −99.3% | +18% | Best overall: fast build, near-baseline tracking speed |
| `--Ofast-compile=min` + noinline | 7.0 s | 1.628 s | −99.4% | +25% | Noinline saves 1.5 s of build but hurts tracking |
| `--Ofast-compile=mid` | 8.7 s | 1.659 s | −99.3% | +27% | Good tradeoff, slightly worse tracking than min |
| `--Ofast-compile=mid` + noinline | 7.2 s | 1.758 s | −99.4% | +35% | Same pattern as min + noinline |
| `--Ofast-compile=max` | 5.4 s | 23.240 s | −99.5% | +18× | Fast front-end skips all ptxas optimisation; tragic in tracking |
| `--Ofast-compile=max` + noinline | 5.5 s | 23.002 s | −99.5% | +18× | Not meaningfully different from the above |
| `--Ofast-compile=max` + PTX O0 + no-expensive + noinline | 5.5 s | 22.970 s | −99.5% | +18× | Additional flags irrelevant |
| `-Xptxas -O0` (no fast front-end) | OOM | — | — | — | Consumes more memory than the baseline(?) |

Adding more __noinline__ statements to other functions in the track_magnet path yielded a further compile-time speedup (up to 4×), but at a significant cost in tracking time. (On the other hand, very minimal noinlining can provide a tangible benefit for OpenCL; see below.)
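For concreteness, this is roughly where the attribute goes in the kernel source. The function name follows the PR discussion; the signature and body are placeholders, and the snippet is kept as a Python string since the sources are assembled on the Python side:

```python
# Sketch: applying __noinline__ to a per-element helper in the CUDA source.
# Illustrative only: the real signature and body live in the xsuite C sources.
KERNEL_SNIPPET = r"""
// Prevent this helper from being inlined into the track_line mega-kernel,
// trading tracking speed for compile time (measured above: -16% build,
// +5% track when used on its own).
__device__ __noinline__
void track_magnet_body_single_particle(/* ... */) {
    /* ... */
}
"""
```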

It seems that whatever extra optimisation -Oft=min retains over -Oft=max (or -Xptxas -O0), giving it up saves a negligible amount of build time at a tragic expense in tracking time (~18× slower, per the table). (Setting -Oft appears to disable cloning, which we presumably rely on heavily: min does not even disable "expensive-optimizations"...)

In the meantime, I did not manage to accomplish a similar gain with OpenCL:

  • On Nvidia's OpenCL there is a flag -cl-nv-opt-level; however, it seems to be ignored by the compiler in the CUDA version under test. The generic flag -cl-opt-disable is a nuclear option that produces unacceptably slow code, and is therefore a no-go. Perhaps this is not an interesting case in practice: who would use ContextPyopencl on an Nvidia card in the wild?
  • On AMD OpenCL (a Radeon VII was used for testing) the problem is nowhere near as bad as on CUDA: memory consumption is unremarkable, and the compile time, while suboptimal at ~2 min, is still 10× less than 20 min, with a tracking time of ~3 s in the same scenario as above.
    • I could not identify other compiler tweaks that speed it up similarly to -Oft.
    • __attribute__((noinline)) on the main per-element tracking functions improved the build time to 36 s, so it could be worth exploring in this context.
    • The tracking-time discrepancy is surely explained by the hardware difference (Radeon VII vs H100).

Checklist

Mandatory:

  • I have added tests to cover my changes
  • All the tests are passing, including my new ones
  • I described my changes in this PR description

Optional:

  • The code I wrote follows good style practices (see PEP 8 and PEP 20).
  • I have updated the docs in relation to my changes, if applicable
  • I have tested also GPU contexts

@szymonlopaciuk
Contributor Author

szymonlopaciuk commented Apr 28, 2026

I paste here, for posterity, the report of my other findings (analysis of the ptxas verbose output, the PTX code, and the disassembled CUBIN binary), compiled by ChatGPT:

NVRTC Optimization Investigation

Objective

Investigate the large discrepancy between compilation time and runtime performance when using NVRTC default optimization versus --Ofast-compile=min.

Observed:

  • default compile: ~20 minutes
  • --Ofast-compile=min: ~8 seconds
  • runtime slowdown (min): ~18%

1. PTX Size and Structural Metrics

PTX size

default.ptx: 32,321,462 bytes
ofc_min.ptx:  3,088,884 bytes
ratio: ~10.5×

PTX structure

Function count:
  default: 97
  ofc_min: 106

call instructions:
  default: 6432
  ofc_min: 730     (~8.8× fewer)

branch instructions:
  default: 53822
  ofc_min: 7286    (~7.4× fewer)

Interpretation

  • The number of functions is comparable.
  • The major difference lies in function body size and internal complexity.
  • Default optimization produces significantly more control flow and call sites.

Conclusion: the dominant effect is inlining and specialization within functions, not simple cloning of additional functions.
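The structural metrics above can be reproduced with a simple text scan of the PTX dump; the same counting works on SASS obtained via `cuobjdump -sass`. A rough sketch (the regexes are heuristics over common PTX spellings like `call.uni` and `bra.uni`, not a proper parser):

```python
import re

def ptx_metrics(ptx_text):
    """Count function definitions, call sites and branches in a PTX dump.

    Heuristic: matches .entry/.func declarations, and the call/bra opcodes
    (including suffixed forms such as call.uni, bra.uni).
    """
    return {
        "functions": len(re.findall(r"\.(?:entry|func)\b", ptx_text)),
        "calls": len(re.findall(r"\bcall\b", ptx_text)),
        "branches": len(re.findall(r"\bbra\b", ptx_text)),
    }

# Tiny synthetic example:
sample = """
.visible .entry my_kernel()
{
    call.uni helper;
    bra LOOP;
    bra.uni END;
}
"""
print(ptx_metrics(sample))  # {'functions': 1, 'calls': 1, 'branches': 2}
```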

2. Function Structure Differences

Diffing function declarations shows:

  • Default:

    • Fewer standalone helpers
    • Many _with_transformations variants embedded
  • --Ofast-compile=min:

    • More explicit helper functions:

      track_magnet_drift_single_particle
      track_rf_kick_single_particle
      track_misalignment_*
      ...
      

Interpretation

  • ofc_min preserves a modular call graph
  • Default absorbs helper logic into callers

This directly explains the PTX growth: logic is duplicated across many contexts instead of reused.

3. Function Size Analysis

Largest functions (default)

track_magnet_body_single_particle              ~1.8 MB
ThickSliceCavity_track_local_particle*         ~1.5 MB
Cavity_track_local_particle*                   ~1.5 MB
track_line                                     ~1.1 MB

Largest functions (ofc_min)

track_line                                     ~0.43 MB
other functions mostly 40–130 KB

Kernel-level examples

Cavity_track_particles:
  default: 1,379,893 bytes
  ofc_min:    13,191 bytes

DriftSlice_track_particles:
  default: 2,698,330 bytes
  ofc_min:   175,451 bytes

track_line:
  default: 1,118,348 bytes
  ofc_min:   434,724 bytes

Interpretation

Default optimization performs large-scale duplication of execution paths inside kernels, not just minor specialization.

4. SASS (Machine Code) Comparison

SASS lines:
  default: 3,529,434
  ofc_min:   107,709
  ratio: ~32.7×

CALL:
  default: 70,753
  ofc_min:  1,587   (~44×)

BRA:
  default: 124,295
  ofc_min:  2,754   (~45×)

Interpretation

  • Code expansion propagates fully to machine code.
  • Default produces vastly more executable instructions.

This confirms that the compile-time cost corresponds to actual emitted code, not just analysis overhead.

5. Register and Stack Usage

Default (selected kernels)

track_line                    255 regs, 1616 stack
RBend_track_particles         255 regs, 1216 stack
Bend_track_particles          255 regs, 1184 stack
Quadrupole_track_particles    255 regs, 1104 stack
...

--Ofast-compile=min

track_line                    254 regs, 2048 stack
RBend_track_particles         254 regs, 1496 stack
Quadrupole_track_particles    254 regs, 1368 stack
...

Observations

  • Both modes reach the register limit (~255).
  • ofc_min often has higher stack usage.
  • Default sometimes has slightly better stack/register balance.

Interpretation

  • Default optimization does not reduce register usage significantly.
  • Some runtime benefit likely comes from better control-flow simplification and instruction scheduling, not reduced register pressure.
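The per-kernel register and stack figures above come from the ptxas verbose output; a small parser along these lines can tabulate them (the sample line format is assumed from typical `ptxas -v` output and may differ between CUDA versions):

```python
import re

def parse_ptxas_usage(line):
    """Extract (registers, stack bytes) from a ptxas -v info line.

    The line format is an assumption based on typical ptxas output;
    returns None for fields that are absent.
    """
    regs = re.search(r"(\d+) registers", line)
    stack = re.search(r"(\d+) bytes (?:cumulative )?stack", line)
    return (int(regs.group(1)) if regs else None,
            int(stack.group(1)) if stack else None)

line = "ptxas info    : Used 255 registers, 1616 bytes cumulative stack size"
print(parse_ptxas_usage(line))  # (255, 1616)
```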

6. Compile-Time Scaling Behavior

From subset experiments:

compile time ≈ linear in number of elements
~1 minute per element type
~20 minutes for full kernel

Interpretation

  • The cost is not exponential cloning
  • Instead, each element contributes a roughly fixed optimization cost
  • Shared leaf helpers are repeatedly re-specialized in different contexts
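The linear scaling observed in the subset experiments can be written as a one-line model (the per-element cost and element count are the approximate figures quoted above, and the zero intercept is an assumption):

```python
def estimate_full_build_s(n_element_types, per_element_s=60.0, overhead_s=0.0):
    """Linear compile-time model from the subset experiments:
    roughly one minute per element type, ~20 minutes for the full kernel.
    per_element_s and overhead_s are illustrative fit parameters."""
    return overhead_s + per_element_s * n_element_types

print(estimate_full_build_s(20) / 60)  # 20.0 minutes for ~20 element types
```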

7. Role of Leaf Helpers

Leaf functions such as:

track_magnet_particles
track_rf_particles

have:

  • many parameters
  • many optional code paths

These rely heavily on:

  • constant propagation
  • branch elimination
  • dead code elimination

Experiment

Manual __noinline__ on these helpers:

  • worse runtime than ofc_min
  • still non-trivial compile cost

Interpretation

  • These helpers are where optimization is actually valuable
  • Preventing their inlining removes critical simplifications

8. Synthesis

What default optimization does:

  • aggressive inlining of helpers
  • context-specific specialization
  • branch elimination
  • duplication of large code regions across element types

Result

  • massive PTX expansion (~10×)
  • massive SASS expansion (~30×)
  • large compile-time cost (~150×)
  • moderate runtime improvement (~18%)

Key insight

The optimizer is not wasting effort entirely; it is producing real performance gains. However, it achieves this through large-scale duplication of logic across similar contexts.

9. Implications

  • --Ofast-compile=min removes most duplication but loses ~18% performance.
  • Default optimization is too expensive to use indiscriminately in dynamic compilation scenarios.
  • Naïve suppression strategies (noinline) degrade runtime without providing much benefit for compilation.

@szymonlopaciuk szymonlopaciuk merged commit 35d14c9 into xsuite:main Apr 28, 2026
3 checks passed
