Rebenchmark configs to avoid noise #654
Conversation
stack-info: PR: #654, branch: jansel/stack/146
Did you check how much this increases benchmarking time versus how much the results improve?
This change is somehow causing misaligned memory addresses on my local GPU when autotuning matmul. I am a bit puzzled by how the PR could be causing that, since it doesn't touch codegen, so I am still debugging. Ideas welcome!
In general, I think we need to prune the set of configs more for autotuning, because we are hitting similar CUDA errors in CI benchmarking.
Yeah, +1. I've seen it in #634 and #630. I wonder whether the "misaligned memory address" errors are due to autotuning now exploring a different part of the config space. #649 (merged) should allow us to see the full kernel config for reproing the misaligned memory issues. I can help with debugging some of the issues (I've seen two so far, both related to multi-stage pipelining: 5ae337b#diff-cb3b5c8f9dd5a38792e17c09e227adfe5346bb85e3ff62ddecadc28c085b1cecR264 and 89488e3, although I am not 100% confident on the root cause yet).
Yeah, it's a bit strange because the configs pass on the first run, then the same config (not even recompiled, the same fn) fails on the rerun.
@jansel I found that this can deterministically repro the matmul "misaligned address" issue:
# rm -rf /tmp/torchinductor_${USER}/ && HELION_AUTOTUNE_RANDOM_SEED=2011902841 CUDA_LAUNCH_BLOCKING=1 python examples/matmul.py
Testing helion correctness...
[0s] Set autotune random seed to 2011902841
[0s] Starting DifferentialEvolutionSearch with population=40, generations=20, crossover_rate=0.8
Traceback (most recent call last):
File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 129, in benchmark_function
fn(*self.args) # make sure the kernel is compiled
^^^^^^^^^^^^^^
File "/tmp/torchinductor_willfeng/6o/c6o7gxb4etif4stiuvwlyzavwmvtigmpty7whaklj5wbf4vkymag.py", line 60, in matmul
_launcher(_helion_matmul, (_NUM_SM,), x, y, out, _NUM_SM, _BLOCK_SIZE_1, 1, _BLOCK_SIZE_2, num_warps=1, num_stages=4)
File "/data/users/willfeng/helion/helion/runtime/__init__.py", line 63, in default_launcher
return triton_kernel.run(
^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/miniconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 757, in run
kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
File "/home/willfeng/local/miniconda3/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 712, in __call__
self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, self.launch_pdl,
RuntimeError: Triton Error [CUDA]: misaligned address
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/users/willfeng/helion/examples/matmul.py", line 182, in <module>
main()
File "/data/users/willfeng/helion/examples/matmul.py", line 177, in main
check(1024, 1024, 1024)
File "/data/users/willfeng/helion/examples/matmul.py", line 95, in check
run_example(matmul, torch.matmul, (x, y))
File "/data/users/willfeng/helion/helion/_testing.py", line 461, in run_example
func(*args).to(torch.float32),
^^^^^^^^^^^
File "/data/users/willfeng/helion/helion/runtime/kernel.py", line 285, in __call__
return self.bind(args)(*args)
^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/helion/helion/runtime/kernel.py", line 617, in __call__
self.autotune(args)
File "/data/users/willfeng/helion/helion/runtime/kernel.py", line 506, in autotune
config = self.settings.autotuner_fn(self, args, **kwargs).autotune()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/helion/helion/autotuner/base_cache.py", line 168, in autotune
config = self.autotuner.autotune()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 253, in autotune
best = self._autotune()
^^^^^^^^^^^^^^^^
File "/data/users/willfeng/helion/helion/autotuner/differential_evolution.py", line 96, in _autotune
self.initial_two_generations()
File "/data/users/willfeng/helion/helion/autotuner/differential_evolution.py", line 59, in initial_two_generations
self.parallel_benchmark_flat(
File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 376, in parallel_benchmark_flat
to_check, configs, self.parallel_benchmark(configs), strict=True
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 237, in parallel_benchmark
results.append((config, fn, self.benchmark_function(config, fn)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/helion/helion/autotuner/base_search.py", line 146, in benchmark_function
raise exc.TritonError(
helion.exc.TritonError: Error running generated Triton program:
@helion.kernel(config=helion.Config(block_sizes=[1, 16, 16], indexing='tensor_descriptor', l2_groupings=[2], loop_orders=[[0, 1]], num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4]), static_shapes=True)
RuntimeError: Triton Error [CUDA]: misaligned address

This config can trigger the issue.
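For reference, a minimal Python sketch of the same deterministic repro as the shell command at the top of this comment (a hypothetical wrapper, assuming it is run from the helion repo root and that the environment variables are read at process start):

import os
import shutil
import subprocess

# Hypothetical wrapper around the repro command quoted above.
user = os.environ.get("USER", "")
shutil.rmtree(f"/tmp/torchinductor_{user}", ignore_errors=True)  # clear the inductor cache
env = dict(
    os.environ,
    HELION_AUTOTUNE_RANDOM_SEED="2011902841",  # pin the autotuner's random seed
    CUDA_LAUNCH_BLOCKING="1",  # surface the CUDA error at the offending launch
)
subprocess.run(["python", "examples/matmul.py"], env=env, check=True)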
@jansel Without this PR, I also found that
@yf225 Does adding a tl.debug_barrier between the store and load fix it? |
@oulgen I believe the matmul error is due to the TMA tile size being too small for the matmul instructions - I just opened a fix PR at #662. For the kl_div kernel I believe it's the store-then-load pattern -
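For illustration, a minimal standalone Triton sketch (not the generated kl_div kernel) of the store-then-load pattern with a tl.debug_barrier() inserted between the store and the dependent load, as suggested above:

import torch
import triton
import triton.language as tl

@triton.jit
def store_then_load(ptr, n, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(ptr + offs, mask=mask)
    tl.store(ptr + offs, x + 1, mask=mask)  # store ...
    tl.debug_barrier()  # ... make the store visible before the dependent load
    y = tl.load(ptr + offs, mask=mask)
    tl.store(ptr + offs, y * 2, mask=mask)

buf = torch.arange(16, device="cuda", dtype=torch.float32)
store_then_load[(1,)](buf, buf.numel(), BLOCK=16)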
Thanks! Let me do some more testing with your fix; I still want to confirm this actually makes results more stable. (Which I wasn't able to do before because of that error.)
Stacked PRs (oldest at bottom):
Rebenchmark configs to avoid noise
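For context, a rough hypothetical sketch of the general "rebenchmark to avoid noise" idea the title refers to, not this PR's actual implementation (benchmark_once is a stand-in for Helion's benchmarking helper): re-time only the leading candidates a few times and rank them by a noise-resistant statistic, so one lucky measurement cannot win outright.

import statistics
from typing import Callable, Sequence

def pick_best(configs: Sequence, benchmark_once: Callable, top_k: int = 5, repeats: int = 3):
    # First pass: a single (noisy) timing per candidate config.
    ranked = sorted(configs, key=benchmark_once)
    # Second pass: re-benchmark only the leaders and take the median timing.
    def stable_time(cfg):
        return statistics.median(benchmark_once(cfg) for _ in range(repeats))
    return min(ranked[:top_k], key=stable_time)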