
Benchmark CI crashing due to RuntimeError: Triton Error [CUDA]: misaligned address after autotuning finds the best config #630

@yf225

Description


Example job: https://github.com/pytorch/helion/actions/runs/17839047302/job/50723523136

 64%|██████▍   | 9/14 [29:20<19:36, 235.22s/it]W0918 21:09:35.299000 51876 torch/_dynamo/utils.py:1915] ChromiumEventLogger: Start event not in stack, ignoring

...

One can hardcode the best config and skip autotuning with:
    @helion.kernel(config=helion.Config(block_sizes=[1], indexing='block_ptr', num_stages=7, num_warps=16, pid_type='flat', range_flattens=[None], range_multi_buffers=[None], range_num_stages=[0], range_unroll_factors=[0], range_warp_specializes=[], reduction_loops=[32]))


 64%|██████▍   | 9/14 [33:15<18:28, 221.72s/it]
Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 979, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
                                                  ^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 964, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
                   ^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1358, in _do_bench
    metrics.latency = do_bench_wrapper(
                      ^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 413, in do_bench_wrapper
    raise e
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 403, in do_bench_wrapper
    times=bench_fn(
          ^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 192, in _do_bench_profiler
    estimate_ms = benchmarker.benchmark_gpu(fn, estimation_iters=5, benchmark_iters=10)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 250, in benchmark_gpu
    _callable()
  File "/__w/helion/helion/examples/sum.py", line 64, in compute_sum
    result = sum_kernel(x_2d)
             ^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/runtime/kernel.py", line 285, in __call__
    return self.bind(args)(*args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/runtime/kernel.py", line 608, in __call__
    return self._run(*args)
           ^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_root/62/c62kstawhc3f3u5h2njxdpek2fgehlo5thwlidup324nfw7zmbuj.py", line 31, in sum_kernel
    _launcher(_helion_sum_kernel, (1,), x, out, _REDUCTION_BLOCK_1, num_warps=16, num_stages=7)
  File "/__w/helion/helion/helion/runtime/__init__.py", line 63, in default_launcher
    return triton_kernel.run(
           ^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 699, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 712, in __call__
    self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, self.launch_pdl,
RuntimeError: Triton Error [CUDA]: misaligned address

The crash happens after autotuning has finished (i.e. after the "One can hardcode the best config ..." message is printed), so the failing launch is the post-autotune benchmark run, not a tuning trial. Likely an exception from the selected config is leaking out of autotuning somehow.
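To illustrate the suspected failure mode, here is a minimal pure-Python sketch (not Helion's actual autotuner; `autotune` and `flaky_run` are hypothetical names): if launch errors are swallowed per-config during the search, a config that only faults intermittently can still be selected as "best" and then crash on the subsequent benchmark run.

```python
# Sketch of the hypothesis: tuning-time errors are caught and the config is
# skipped, but the winning config can still fault on a later re-run.

def autotune(configs, run):
    """Pick the fastest config, silently skipping any that raise."""
    best, best_time = None, float("inf")
    for cfg in configs:
        try:
            t = run(cfg)          # a launch error here is caught...
        except RuntimeError:
            continue              # ...and the config is simply skipped
        if t < best_time:
            best, best_time = cfg, t
    return best

calls = {"count": 0}

def flaky_run(cfg):
    # Config "b" succeeds during tuning but faults afterwards, mimicking
    # a kernel launch that only hits the misaligned address later.
    calls["count"] += 1
    if cfg == "b" and calls["count"] > 2:
        raise RuntimeError("Triton Error [CUDA]: misaligned address")
    return {"a": 2.0, "b": 1.0}[cfg]

best = autotune(["a", "b"], flaky_run)   # selects "b" (faster during tuning)
try:
    flaky_run(best)                      # post-autotune run: now it faults
except RuntimeError as e:
    print(f"benchmark run failed with: {e}")
```

If this is what is happening, a fix might re-validate the chosen config with an extra launch before autotuning returns, or surface the error with the config attached instead of letting it escape into the benchmark harness.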
