Conversation

@jansel jansel commented Sep 22, 2025

jansel added a commit that referenced this pull request Sep 22, 2025
stack-info: PR: #654, branch: jansel/stack/146
This was referenced Sep 22, 2025
@meta-cla meta-cla bot added the CLA Signed label Sep 22, 2025

@oulgen oulgen left a comment

Did you check how much this increases benchmarking time versus how much the results improve?

@jansel jansel changed the base branch from jansel/stack/145 to main September 22, 2025 22:23
jansel added a commit that referenced this pull request Sep 22, 2025
stack-info: PR: #654, branch: jansel/stack/146
@jansel jansel changed the base branch from main to jansel/stack/144 September 22, 2025 22:23
jansel added a commit that referenced this pull request Sep 23, 2025
stack-info: PR: #654, branch: jansel/stack/146

jansel commented Sep 23, 2025

Did you check how much this increases benchmarking time versus how much the results improve?

This change is somehow causing misaligned memory addresses on my local GPU when autotuning matmul. I am a bit puzzled by how the PR could be causing that since it doesn't touch codegen. So I am still debugging. Ideas welcome!

@jansel jansel marked this pull request as draft September 23, 2025 00:37
jansel added a commit that referenced this pull request Sep 23, 2025
stack-info: PR: #654, branch: jansel/stack/146
@jansel jansel changed the base branch from jansel/stack/144 to main September 23, 2025 00:38

oulgen commented Sep 23, 2025

I am a bit puzzled by how the PR could be causing that since it doesn't touch codegen. So I am still debugging. Ideas welcome!

  1. Did you verify that the same config does not cause misaligned memory on the previous rev? Maybe you updated Triton as well?
  2. @yf225 noticed that sometimes we get misaligned memory because the relaxed ordering of pipelining results in a read before a write, or vice versa.
  3. If both of these fail, you can grab the generated Triton code and run it in triton_interp mode to see if that helps (see the sketch below).

In general, I think we need to prune the set of configs more for autotuning, because we are hitting similar CUDA errors in CI benchmarking.
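A minimal sketch of option 3, assuming "triton_interp mode" refers to Triton's interpreter (enabled via the TRITON_INTERPRET environment variable, set before triton is imported); the toy kernel below is only illustrative, not the generated matmul kernel:

    import os
    os.environ["TRITON_INTERPRET"] = "1"  # must be set before triton is imported

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
        # Simple masked copy; in interpreter mode, out-of-bounds or misaligned
        # accesses tend to surface as clearer Python-level errors.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

    # Assumption: the interpreter emulates the kernel on the host, so CPU tensors work here.
    x = torch.arange(1000, dtype=torch.float32)
    y = torch.empty_like(x)
    copy_kernel[(triton.cdiv(1000, 128),)](x, y, 1000, BLOCK=128)
    assert torch.equal(x, y)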

yf225 commented Sep 23, 2025

Yeah, +1, I've seen it in #634 and #630. I wonder whether the "misaligned memory address" errors are due to autotuning now exploring a different part of the config space.

#649 (merged) should allow us to see the full kernel config for reproducing the misaligned-memory issues. I can help with debugging some of the issues (I've seen two so far, both related to multi-stage pipelining: 5ae337b#diff-cb3b5c8f9dd5a38792e17c09e227adfe5346bb85e3ff62ddecadc28c085b1cecR264 and 89488e3, although I am not 100% confident in the root cause yet).

jansel commented Sep 23, 2025

Yeah, it's a bit strange, because the configs pass on the first run, then the same config (not even recompiled, same fn) fails on the rerun.

yf225 commented Sep 23, 2025

@jansel I found that this can deterministically repro the matmul "misaligned address" issue:

# rm -rf /tmp/torchinductor_${USER}/ && HELION_AUTOTUNE_RANDOM_SEED=2011902841 CUDA_LAUNCH_BLOCKING=1 python examples/matmul.py

Testing helion correctness...
[0s] Set autotune random seed to 2011902841
[0s] Starting DifferentialEvolutionSearch with population=40, generations=20, crossover_rate=0.8
Traceback (most recent call last):
  File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 129, in benchmark_function
    fn(*self.args)  # make sure the kernel is compiled
    ^^^^^^^^^^^^^^
  File "/tmp/torchinductor_willfeng/6o/c6o7gxb4etif4stiuvwlyzavwmvtigmpty7whaklj5wbf4vkymag.py", line 60, in matmul
    _launcher(_helion_matmul, (_NUM_SM,), x, y, out, _NUM_SM, _BLOCK_SIZE_1, 1, _BLOCK_SIZE_2, num_warps=1, num_stages=4)
  File "/data/users/willfeng/helion/helion/runtime/__init__.py", line 63, in default_launcher
    return triton_kernel.run(
           ^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/miniconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 757, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/home/willfeng/local/miniconda3/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 712, in __call__
    self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, self.launch_pdl,
RuntimeError: Triton Error [CUDA]: misaligned address

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/users/willfeng/helion/examples/matmul.py", line 182, in <module>
    main()
  File "/data/users/willfeng/helion/examples/matmul.py", line 177, in main
    check(1024, 1024, 1024)
  File "/data/users/willfeng/helion/examples/matmul.py", line 95, in check
    run_example(matmul, torch.matmul, (x, y))
  File "/data/users/willfeng/helion/helion/_testing.py", line 461, in run_example
    func(*args).to(torch.float32),
    ^^^^^^^^^^^
  File "/data/users/willfeng/helion/helion/runtime/kernel.py", line 285, in __call__
    return self.bind(args)(*args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/helion/helion/runtime/kernel.py", line 617, in __call__
    self.autotune(args)
  File "/data/users/willfeng/helion/helion/runtime/kernel.py", line 506, in autotune
    config = self.settings.autotuner_fn(self, args, **kwargs).autotune()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/helion/helion/autotuner/base_cache.py", line 168, in autotune
    config = self.autotuner.autotune()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 253, in autotune
    best = self._autotune()
           ^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/helion/helion/autotuner/differential_evolution.py", line 96, in _autotune
    self.initial_two_generations()
  File "/data/users/willfeng/helion/helion/autotuner/differential_evolution.py", line 59, in initial_two_generations
    self.parallel_benchmark_flat(
  File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 376, in parallel_benchmark_flat
    to_check, configs, self.parallel_benchmark(configs), strict=True
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 237, in parallel_benchmark
    results.append((config, fn, self.benchmark_function(config, fn)))
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 146, in benchmark_function
    raise exc.TritonError(
helion.exc.TritonError: Error running generated Triton program:
@helion.kernel(config=helion.Config(block_sizes=[1, 16, 16], indexing='tensor_descriptor', l2_groupings=[2], loop_orders=[[0, 1]], num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4]), static_shapes=True)
RuntimeError: Triton Error [CUDA]: misaligned address

This config can trigger the issue: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 16], indexing='tensor_descriptor', l2_groupings=[2], loop_orders=[[0, 1]], num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4]), static_shapes=True). Still looking into why it causes the misaligned address (a sketch of pinning this config for a standalone repro is below).
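For reference, a minimal sketch of pinning that exact config on the kernel so the failure can be reproduced without running the autotuner at all; the kernel body is only an approximation modeled on examples/matmul.py, not a verbatim copy:

    import torch
    import helion
    import helion.language as hl

    @helion.kernel(
        config=helion.Config(
            block_sizes=[1, 16, 16],
            indexing="tensor_descriptor",
            l2_groupings=[2],
            loop_orders=[[0, 1]],
            num_stages=4,
            num_warps=1,
            pid_type="persistent_blocked",
            range_flattens=[True, True],
            range_multi_buffers=[False, True],
            range_num_stages=[0, 1],
            range_unroll_factors=[0, 4],
        ),
        static_shapes=True,
    )
    def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Body adapted (approximately) from examples/matmul.py.
        m, k = x.size()
        k2, n = y.size()
        assert k == k2, "size mismatch"
        out = torch.empty([m, n], dtype=torch.float32, device=x.device)
        for tile_m, tile_n in hl.tile([m, n]):
            acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
            for tile_k in hl.tile(k):
                acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
            out[tile_m, tile_n] = acc
        return out

    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
    y = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
    # With the config pinned, no autotuning runs; this either executes or hits
    # the Triton error for this config directly.
    matmul(x, y)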

yf225 commented Sep 23, 2025

@jansel Without this PR, I also found that rm -rf /tmp/torchinductor_${USER}/ && HELION_AUTOTUNE_RANDOM_SEED=2011902841 CUDA_LAUNCH_BLOCKING=1 python examples/matmul.py still causes the "misaligned address" issue with the same config. So this PR probably doesn't introduce a regression, and we could land it.

oulgen commented Sep 23, 2025

@yf225 Does adding a tl.debug_barrier between the store and load fix it?

yf225 commented Sep 23, 2025

@yf225 Does adding a tl.debug_barrier between the store and load fix it?

@oulgen I believe the matmul error is due to the TMA tile size being too small for the matmul instructions; I just opened a fix PR at #662.

For the kl_div kernel I believe it's the store-then-load pattern; tl.debug_barrier could work, but I also wonder whether it affects performance, or whether there is another way to just skip the bad configs (a minimal sketch of the barrier placement is below).
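For the store-then-load case, a toy Triton sketch of the barrier placement oulgen suggests (not the generated kl_div code; the scratch buffer and names are illustrative). tl.debug_barrier() makes every thread in the block finish the store before any thread loads the data back:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def store_then_load(x_ptr, scratch_ptr, out_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        vals = tl.load(x_ptr + offs, mask=mask)
        tl.store(scratch_ptr + offs, vals * 2.0, mask=mask)
        tl.debug_barrier()  # force the store to complete before reading it back
        back = tl.load(scratch_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, back + 1.0, mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda")
    scratch = torch.empty_like(x)
    out = torch.empty_like(x)
    store_then_load[(triton.cdiv(n, 256),)](x, scratch, out, n, BLOCK=256)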

jansel commented Sep 23, 2025

Thanks! Let me do some more testing with your fix; I still want to confirm this actually makes results more stable (which I wasn't able to do before because of that error).

@jansel jansel changed the base branch from jansel/stack/149 to main September 26, 2025 01:45
@jansel jansel changed the base branch from main to jansel/stack/149 September 26, 2025 01:45
@jansel jansel changed the base branch from jansel/stack/149 to main September 26, 2025 05:12
@jansel jansel force-pushed the jansel/stack/146 branch 2 times, most recently from eb406f4 to 1320c51 on September 26, 2025 05:13
@jansel jansel changed the base branch from main to jansel/stack/150 September 26, 2025 05:13
@jansel jansel changed the base branch from jansel/stack/150 to main September 26, 2025 16:00
@jansel jansel force-pushed the jansel/stack/146 branch 2 times, most recently from 992ebe3 to 7d77fcc on September 26, 2025 16:00
@jansel jansel changed the base branch from main to jansel/stack/150 September 26, 2025 16:00
@jansel jansel changed the base branch from jansel/stack/150 to main September 26, 2025 18:48
@jansel jansel changed the base branch from main to jansel/stack/153 September 26, 2025 18:48
jansel added a commit that referenced this pull request Sep 26, 2025
stack-info: PR: #654, branch: jansel/stack/146
jansel added a commit that referenced this pull request Sep 26, 2025
stack-info: PR: #654, branch: jansel/stack/146
@jansel jansel changed the base branch from jansel/stack/153 to main September 27, 2025 00:26
@jansel jansel force-pushed the jansel/stack/146 branch 2 times, most recently from 3650824 to 707faa8 on September 27, 2025 18:51
stack-info: PR: #654, branch: jansel/stack/146
@jansel jansel merged commit 68811be into main Sep 28, 2025
13 checks passed
yf225 pushed a commit that referenced this pull request Sep 30, 2025