
Benchmark CI crashing due to CUDA error: misaligned address during autotuning #634

@yf225

Description

The crash seems to affect the jsd and kl_div kernels.

Example jobs

============================================================
Kernel: jsd
============================================================

Running jsd benchmark with Helion implementation...


  0%|          | 0/6 [00:00<?, ?it/s]W0918 23:42:06.815000 4121 torch/_dynamo/utils.py:1915] ChromiumEventLogger: Start event not in stack, ignoring
[0s] Starting DifferentialEvolutionSearch with population=40, generations=20, crossover_rate=0.8
[24s] Initial population: failed=35 min=0.3087 mid=1.6106 max=115.1351 best=Config(block_sizes=[1024], indexing='pointer', num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[])

  0%|          | 0/6 [00:41<?, ?it/s]
Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 979, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
                                                  ^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 964, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
                   ^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1358, in _do_bench
    metrics.latency = do_bench_wrapper(
                      ^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 413, in do_bench_wrapper
    raise e
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 403, in do_bench_wrapper
    times=bench_fn(
          ^^^^^^^^^
  File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 192, in _do_bench_profiler
    estimate_ms = benchmarker.benchmark_gpu(fn, estimation_iters=5, benchmark_iters=10)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 250, in benchmark_gpu
    _callable()
  File "/__w/helion/helion/examples/jsd.py", line 314, in <lambda>
    return lambda: helion_jsd(log_q, log_p)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/examples/jsd.py", line 193, in forward
    loss, dX = jsd_forward(
               ^^^^^^^^^^^^
  File "/__w/helion/helion/helion/runtime/kernel.py", line 285, in __call__
    return self.bind(args)(*args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/runtime/kernel.py", line 602, in __call__
    self.autotune(args)
  File "/__w/helion/helion/helion/runtime/kernel.py", line 492, in autotune
    config = self.settings.autotuner_fn(self, args, **kwargs).autotune()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/base_cache.py", line 165, in autotune
    config = self.autotuner.autotune()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 229, in autotune
    best = self._autotune()
           ^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 98, in _autotune
    replaced = self.evolve_population()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 83, in evolve_population
    for i, candidate in self.iter_candidates():
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 76, in iter_candidates
    self.parallel_benchmark_flat(
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 345, in parallel_benchmark_flat
    to_check, configs, self.parallel_benchmark(configs), strict=True
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 200, in parallel_benchmark
    [
  File "/__w/helion/helion/helion/autotuner/base_search.py", line 172, in start_precompile_and_check_for_hangs
    fn(*self.args, _launcher=extract_launcher)
  File "/tmp/torchinductor_root/t6/ct6rmz4h5qsoloqq53dxme2ua5o247g4vgf7vkzcms6czgsfpph3.py", line 133, in jsd_forward
    loss = torch.zeros(_input.shape, dtype=torch.float32, device=_input.device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: misaligned address
Search for `cudaErrorMisalignedAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
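
Since CUDA reports kernel errors asynchronously, the `torch.zeros` frame at the bottom of the traceback is likely just where the error surfaced, not the launch that caused it. Besides setting `CUDA_LAUNCH_BLOCKING=1`, one way to narrow this down in-process is to synchronize right after each suspect call. The sketch below is a hypothetical helper (`call_and_sync` is not part of Helion or PyTorch); it only assumes `torch.cuda.synchronize`, which blocks until queued GPU work finishes and raises any pending error.

```python
def call_and_sync(fn, *args, synchronize=None, **kwargs):
    """Call fn, then synchronize so a pending CUDA error is raised here.

    `synchronize` defaults to torch.cuda.synchronize; it is injectable
    so the helper can be exercised without a GPU.
    """
    if synchronize is None:
        import torch  # imported lazily so the helper loads without torch

        synchronize = torch.cuda.synchronize
    result = fn(*args, **kwargs)
    # Any asynchronous error from the launch above is raised by this call,
    # so the traceback points at fn rather than at a later, unrelated API call.
    synchronize()
    return result
```

Wrapping a single kernel launch this way (for example `call_and_sync(helion_jsd, log_q, log_p)`) should make the exception appear at that call if that launch is the one that faults.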

Update:

Repro configs: fe48b3d#diff-52fb42eb93861175c8cad84e97d9a0aff002f6e5b735c57e9c4f6817760a1cd0R41

Then run the following on the Helion main branch:

rm -rf /tmp/torchinductor_willfeng/ && HELION_PRINT_OUTPUT_CODE=1 CUDA_LAUNCH_BLOCKING=1 HELION_AUTOTUNE_RANDOM_SEED=4201 python benchmarks/run.py --op <kernel_name> --num-inputs 1 --metrics speedup --latency-measure-mode profiler --exit-on-exception
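
To run the repro for both affected kernels without retyping the command, something like the following could be used. It is a minimal sketch: `build_repro` and `format_command` are hypothetical helper names, while the environment variables, flags, and seed are copied from the command above. It does not clear the `/tmp/torchinductor_*` cache, so that step still has to be done separately.

```python
import shlex

# Kernels reported as failing in this issue.
KERNELS = ["jsd", "kl_div"]


def build_repro(kernel, seed=4201):
    """Return (env overrides, argv) for the repro run of one kernel."""
    env = {
        "HELION_PRINT_OUTPUT_CODE": "1",
        "CUDA_LAUNCH_BLOCKING": "1",
        "HELION_AUTOTUNE_RANDOM_SEED": str(seed),
    }
    argv = [
        "python", "benchmarks/run.py",
        "--op", kernel,
        "--num-inputs", "1",
        "--metrics", "speedup",
        "--latency-measure-mode", "profiler",
        "--exit-on-exception",
    ]
    return env, argv


def format_command(env, argv):
    """Render the env overrides and argv as one shell command line."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    for kernel in KERNELS:
        print(format_command(*build_repro(kernel)))
```

The returned `env` can also be merged into `os.environ` and passed to `subprocess.run(argv, env=...)` from the repository root instead of printing the command.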
