[float8 moe training] make using triton kernels for per-group scaling configurable #2405


Merged: 5 commits merged into main on Jun 18, 2025

Conversation

@danielvegamyhre (Contributor) commented on Jun 18, 2025

Summary

  • Make using triton kernels for per-group scaling configurable, so we can measure the speedup from avoiding the d2h sync when benchmarking e2e training and validate the expected speedup (a minimal sketch follows this list). We can remove this before graduating out of prototype.
  • Improve the benchmarking script and make compile configurable via an argument (I discovered a bug with torch.compile and tune_scaled_grouped_mm, so I want the benchmark script to be usable as a repro for debugging).
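
A minimal sketch of the configurable flag from the first bullet, assuming a wrapper tensor subclass along the lines of the one touched in this PR. The class name and the two scaling helpers below are hypothetical placeholders; only the use_triton_for_per_group_scales flag name comes from the diff shown later in this thread.

import torch


def _triton_per_group_scales(t: torch.Tensor) -> torch.Tensor:
    # Placeholder for the triton kernel path (the real kernel avoids a d2h sync).
    return t.abs().amax(dim=-1, keepdim=True)


def _torch_per_group_scales(t: torch.Tensor) -> torch.Tensor:
    # Placeholder for the plain PyTorch fallback path.
    return t.abs().amax(dim=-1, keepdim=True)


class GroupScaledTensor(torch.Tensor):
    # Hypothetical wrapper subclass carrying a per-instance triton toggle.

    def __new__(cls, data: torch.Tensor, use_triton_for_per_group_scales: bool = True):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device
        )

    def __init__(self, data: torch.Tensor, use_triton_for_per_group_scales: bool = True):
        self._data = data
        # Stored on the instance (not the class) so later code can read it
        # from whichever argument is the subclass.
        self.use_triton_for_per_group_scales = use_triton_for_per_group_scales

    def per_group_scales(self) -> torch.Tensor:
        fn = (
            _triton_per_group_scales
            if self.use_triton_for_per_group_scales
            else _torch_per_group_scales
        )
        return fn(self._data)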

@danielvegamyhre added the topic: not user facing label (use this tag if you don't want this PR to show up in release notes) on Jun 18, 2025

pytorch-bot (bot) commented on Jun 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2405

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 3 Pending, 1 Unrelated Failure

As of commit a285fc8 with merge base 8b12ddf:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 18, 2025
@danielvegamyhre (Contributor Author)

@drisspg @vkuzo for review


def __init__(self, data: torch.Tensor):
def __new__(cls, data: torch.Tensor, use_triton_for_per_group_scales: bool = True):
cls.use_triton_for_per_group_scales = use_triton_for_per_group_scales
Contributor

This looks weird; I think you should just let it fall through to __init__ and then set it on the instance.

@danielvegamyhre (Contributor Author) Jun 18, 2025

I agree. I did that at first, but then in __torch_function__ I don't have access to the instance because it's a classmethod. I'd like to make it less weird though, and am open to ideas.

Technically, since this is just temporary for benchmarking comparison, I could just condition on an env var and avoid plumbing this through everywhere, but that seemed weird as well. In retrospect I actually think it would be cleaner, though. What do you think?
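
For reference, the env-var alternative mentioned here could look roughly like the following; the variable name is hypothetical and this is not what the PR ended up doing:

import os

# Hypothetical module-level toggle; defaults to the triton path being on.
_USE_TRITON_FOR_PER_GROUP_SCALES = (
    os.environ.get("USE_TRITON_FOR_PER_GROUP_SCALES", "1") == "1"
)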

Contributor

One of the inputs to __torch_function__ has to be one of these subclasses, right? So can't you just grab it from whichever input is the subclass instance?

Contributor Author

True, that should work - updated.
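
A rough sketch of the reviewer's suggestion, not the PR's exact code: since at least one input to __torch_function__ must be an instance of the subclass, the flag can be read from that instance rather than from a class attribute. The helper name below is hypothetical, and it assumes the flag is stored on instances as use_triton_for_per_group_scales.

from torch.utils._pytree import tree_flatten


def _use_triton_from_args(args, kwargs, subclass_type, default: bool = True) -> bool:
    # Hypothetical helper for use inside the classmethod __torch_function__:
    # flatten args/kwargs, find the first instance of the tensor subclass,
    # and return its per-instance flag, falling back to `default`.
    flat_args, _ = tree_flatten((args, kwargs or {}))
    for a in flat_args:
        if isinstance(a, subclass_type):
            return a.use_triton_for_per_group_scales
    return default

Inside __torch_function__, something along these lines replaces reading cls.use_triton_for_per_group_scales, which is what made the original version awkward.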

@danielvegamyhre merged commit 101c039 into main on Jun 18, 2025
18 of 19 checks passed
xiaowangintel pushed a commit to xiaowangintel/ao that referenced this pull request on Jun 24, 2025
… configurable (pytorch#2405)

* improve moe training benchmarking

* lint

* readability improvements

* grab use_triton from args instead of class attribute

* add comment