This seems to happen to the `jsd` and `kl_div` kernels.
Example jobs:
- https://github.com/pytorch/helion/actions/runs/17843976215/job/50739578258
- https://github.com/pytorch/helion/actions/runs/17843976215/job/50739578257
```
============================================================
Kernel: jsd
============================================================
Running jsd benchmark with Helion implementation...
0%| | 0/6 [00:00<?, ?it/s]W0918 23:42:06.815000 4121 torch/_dynamo/utils.py:1915] ChromiumEventLogger: Start event not in stack, ignoring
[0s] Starting DifferentialEvolutionSearch with population=40, generations=20, crossover_rate=0.8
[24s] Initial population: failed=35 min=0.3087 mid=1.6106 max=115.1351 best=Config(block_sizes=[1024], indexing='pointer', num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[])
0%| | 0/6 [00:41<?, ?it/s]
Caught exception, terminating early with partial results
Traceback (most recent call last):
File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 979, in run
y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 964, in _reduce_benchmarks
acc[bm_name] = self._do_bench(
^^^^^^^^^^^^^^^
File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1358, in _do_bench
metrics.latency = do_bench_wrapper(
^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 413, in do_bench_wrapper
raise e
File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 403, in do_bench_wrapper
times=bench_fn(
^^^^^^^^^
File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/components/do_bench/run.py", line 192, in _do_bench_profiler
estimate_ms = benchmarker.benchmark_gpu(fn, estimation_iters=5, benchmark_iters=10)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/benchmarking.py", line 250, in benchmark_gpu
_callable()
File "/__w/helion/helion/examples/jsd.py", line 314, in <lambda>
return lambda: helion_jsd(log_q, log_p)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/examples/jsd.py", line 193, in forward
loss, dX = jsd_forward(
^^^^^^^^^^^^
File "/__w/helion/helion/helion/runtime/kernel.py", line 285, in __call__
return self.bind(args)(*args)
^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/helion/runtime/kernel.py", line 602, in __call__
self.autotune(args)
File "/__w/helion/helion/helion/runtime/kernel.py", line 492, in autotune
config = self.settings.autotuner_fn(self, args, **kwargs).autotune()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/helion/autotuner/base_cache.py", line 165, in autotune
config = self.autotuner.autotune()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/helion/autotuner/base_search.py", line 229, in autotune
best = self._autotune()
^^^^^^^^^^^^^^^^
File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 98, in _autotune
replaced = self.evolve_population()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 83, in evolve_population
for i, candidate in self.iter_candidates():
^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/helion/autotuner/differential_evolution.py", line 76, in iter_candidates
self.parallel_benchmark_flat(
File "/__w/helion/helion/helion/autotuner/base_search.py", line 345, in parallel_benchmark_flat
to_check, configs, self.parallel_benchmark(configs), strict=True
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/__w/helion/helion/helion/autotuner/base_search.py", line 200, in parallel_benchmark
[
File "/__w/helion/helion/helion/autotuner/base_search.py", line 172, in start_precompile_and_check_for_hangs
fn(*self.args, _launcher=extract_launcher)
File "/tmp/torchinductor_root/t6/ct6rmz4h5qsoloqq53dxme2ua5o247g4vgf7vkzcms6czgsfpph3.py", line 133, in jsd_forward
loss = torch.zeros(_input.shape, dtype=torch.float32, device=_input.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: misaligned address
Search for `cudaErrorMisalignedAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Update:

`jsd`
- Repro: https://gist.github.com/yf225/69efcbfdf091539a5997a4218e7630b1
- Error: CUDA error: misaligned address. Minimal Triton repro: https://gist.github.com/yf225/4ea83e61777b85227ad4937c466b591c
- Potential fix: 5ae337b

`kl_div`
- Repro: https://gist.github.com/yf225/3d4373b2afc0be08ac42afbb37f8597d
- Error: CUDA error: unspecified launch failure. The error is likely due to a 1x128 TMA shape being unsupported by CUDA.
- Potential fix: d2007ae
Repro configs: fe48b3d#diff-52fb42eb93861175c8cad84e97d9a0aff002f6e5b735c57e9c4f6817760a1cd0R41, then on the Helion main branch run:
`rm -rf /tmp/torchinductor_willfeng/ && HELION_PRINT_OUTPUT_CODE=1 CUDA_LAUNCH_BLOCKING=1 HELION_AUTOTUNE_RANDOM_SEED=4201 python benchmarks/run.py --op <kernel_name> --num-inputs 1 --metrics speedup --latency-measure-mode profiler --exit-on-exception`
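For anyone iterating on a fix: since `HELION_AUTOTUNE_RANDOM_SEED` only makes the failing config reachable through a full search, it can be faster to pin the config printed in the log above and skip autotuning entirely. Below is a minimal sketch, assuming the `helion.Config` / `@helion.kernel(config=...)` API shown in the Helion README; the kernel body is a hypothetical stand-in, since the full `Config(...)` repr from the log (including the `range_*` fields) only matches the real jsd kernel's loop structure.

```python
import torch
import helion
import helion.language as hl

# Pin a fixed config so the kernel compiles and launches deterministically,
# bypassing the differential-evolution search. The kernel below is a
# hypothetical stand-in; substitute examples/jsd.py's jsd_forward and the
# exact Config(...) repr printed during the failing search.
@helion.kernel(
    config=helion.Config(
        block_sizes=[1024],  # shape-independent fields from the failing config
        indexing="pointer",
        num_stages=5,
        num_warps=8,
        pid_type="flat",
    )
)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):  # one tiled loop -> one block_sizes entry
        out[tile] = x[tile] + y[tile]
    return out

# With the config pinned, no autotuning runs on the first call.
x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
print(add(x, y).shape)
```

`HELION_PRINT_OUTPUT_CODE=1` should still apply with a pinned config, so the generated Triton for the failing config can be inspected directly.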