Benchmark CI crashing due to `CUDA error: misaligned address` during autotuning #634

This seems to affect the `jsd` and `kl_div` kernels.

Example jobs:
- https://github.com/pytorch/helion/actions/runs/17843976215/job/50739578258
- https://github.com/pytorch/helion/actions/runs/17843976215/job/50739578257
```
============================================================
Kernel: jsd
============================================================

Running jsd benchmark with Helion implementation...


  0%|          | 0/6 [00:00<?, ?it/s]W0918 23:42:06.815000 4121 torch/_dynamo/utils.py:1915] ChromiumEventLogger: Start event not in stack, ignoring
[0s] Starting DifferentialEvolutionSearch with population=40, generations=20, crossover_rate=0.8
[24s] Initial population: failed=35 min=0.3087 mid=1.6106 max=115.1351 best=Config(block_sizes=[1024], indexing='pointer', num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[])

  0%|          | 0/6 [00:41<?, ?it/s]
Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 979, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
                                                  ^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 964, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
                   ^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1358, in _do_bench
    metrics.latency = do_bench_wrapper(
                      ^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 413, in do_bench_wrapper
    raise e
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 403, in do_bench_wrapper
    times=bench_fn(
          ^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 192, in _do_bench_profiler
    estimate_ms = benchmarker.benchmark_gpu(fn, estimation_iters=5, benchmark_iters=10)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 250, in benchmark_gpu
    _callable()
  File "/__w/helion/helion/examples/jsd.py", line 314, in <lambda>
    return lambda: helion_jsd(log_q, log_p)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/examples/jsd.py", line 193, in forward
    loss, dX = jsd_forward(
               ^^^^^^^^^^^^
  File "/__w/helion/helion/helion/runtime/kernel.py", line 285, in __call__
    return self.bind(args)(*args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/runtime/kernel.py", line 602, in __call__
    self.autotune(args)
  File "/__w/helion/helion/helion/runtime/kernel.py", line 492, in autotune
    config = self.settings.autotuner_fn(self, args, **kwargs).autotune()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/base_cache.py", line 165, in autotune
    config = self.autotuner.autotune()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 229, in autotune
    best = self._autotune()
           ^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 98, in _autotune
    replaced = self.evolve_population()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 83, in evolve_population
    for i, candidate in self.iter_candidates():
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 76, in iter_candidates
    self.parallel_benchmark_flat(
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 345, in parallel_benchmark_flat
    to_check, configs, self.parallel_benchmark(configs), strict=True
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 200, in parallel_benchmark
    [
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 172, in start_precompile_and_check_for_hangs
    fn(*self.args, _launcher=extract_launcher)
  File "/tmp/torchinductor_root/t6/ct6rmz4h5qsoloqq53dxme2ua5o247g4vgf7vkzcms6czgsfpph3.py", line 133, in jsd_forward
    loss = torch.zeros(_input.shape, dtype=torch.float32, device=_input.device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: misaligned address
Search for `cudaErrorMisalignedAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Update:
- `jsd` repro: https://gist.github.com/yf225/69efcbfdf091539a5997a4218e7630b1 (fails with `CUDA error: misaligned address`). Minimal Triton repro: https://gist.github.com/yf225/4ea83e61777b85227ad4937c466b591c. Potential fix: https://github.com/pytorch/helion/commit/5ae337b1c82878978d7cb3cfba8761daeab13a69.
- `kl_div` repro: https://gist.github.com/yf225/3d4373b2afc0be08ac42afbb37f8597d (fails with `CUDA error: unspecified launch failure`). The error is likely caused by the 1x128 TMA shape being unsupported by CUDA. Potential fix: https://github.com/pytorch/helion/commit/d2007aec7e969149c5dc78c774241b023e82f918.
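
For readers unfamiliar with the error class: CUDA reports a misaligned address when a memory access lands on a byte address that is not a multiple of the required alignment for that access. The gists above hold the actual repros. The snippet below is only a generic, GPU-free illustration of that arithmetic, and the function name, the 16-byte default, and the example offsets are assumptions for illustration, not values taken from the failing kernels.

```python
def is_aligned(base_address: int, element_offset: int, itemsize: int,
               alignment: int = 16) -> bool:
    """Return True if the byte address of an element is a multiple of `alignment`.

    Hypothetical helper: `alignment=16` is an illustrative default (a 16-byte
    vectorized access), not a value read from the jsd or kl_div kernels.
    """
    byte_address = base_address + element_offset * itemsize
    return byte_address % alignment == 0


# A float32 buffer (4 bytes per element) whose base address is 16-byte aligned.
base = 0x1000
print(is_aligned(base, element_offset=4, itemsize=4))  # 16 bytes past the base
print(is_aligned(base, element_offset=1, itemsize=4))  # 4 bytes past the base
```

A check like this can be applied to the pointer offsets printed by `HELION_PRINT_OUTPUT_CODE=1` when trying to narrow down which access in a generated kernel is suspect.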

Repro configs: https://github.com/pytorch/helion/commit/fe48b3d2a8d3a9ca77baa03b6d87c54da98979b4#diff-52fb42eb93861175c8cad84e97d9a0aff002f6e5b735c57e9c4f6817760a1cd0R41. To reproduce on the Helion main branch, apply those configs and run `rm -rf /tmp/torchinductor_willfeng/ && HELION_PRINT_OUTPUT_CODE=1 CUDA_LAUNCH_BLOCKING=1 HELION_AUTOTUNE_RANDOM_SEED=4201 python benchmarks/run.py --op <kernel_name> --num-inputs 1 --metrics speedup --latency-measure-mode profiler --exit-on-exception`.
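
To run the same command for both affected kernels, it may help to build it programmatically. This is a minimal sketch: the helper names `build_repro` and `format_command` are hypothetical, while the environment variables and flags are copied from the command above. The cache-clearing `rm -rf` step is left out because its path is specific to one user's machine.

```python
import shlex


def build_repro(kernel_name: str, seed: int = 4201) -> tuple[dict[str, str], list[str]]:
    """Return (environment overrides, argv) for the repro command quoted above."""
    env = {
        "HELION_PRINT_OUTPUT_CODE": "1",           # dump the generated kernel code
        "CUDA_LAUNCH_BLOCKING": "1",               # report CUDA errors at the failing launch
        "HELION_AUTOTUNE_RANDOM_SEED": str(seed),  # seed used in the report
    }
    argv = [
        "python", "benchmarks/run.py",
        "--op", kernel_name,
        "--num-inputs", "1",
        "--metrics", "speedup",
        "--latency-measure-mode", "profiler",
        "--exit-on-exception",
    ]
    return env, argv


def format_command(env: dict[str, str], argv: list[str]) -> str:
    """Render the overrides and argv as a single shell command line."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    for kernel in ("jsd", "kl_div"):
        print(format_command(*build_repro(kernel)))
```

To execute rather than print, the pair can be passed to `subprocess.run(argv, env={**os.environ, **env})` from the repository root.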
