Fix backward return count mismatch in _Float8GroupedMM #3956

Merged

danielvegamyhre merged 1 commit into pytorch:main from xiaobochen-amd:dev_fix on Feb 27, 2026

Conversation

@xiaobochen-amd
Contributor

No description provided.
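For context on the bug the title describes: a minimal sketch (illustrative only, not torchao's actual `_Float8GroupedMM` code) of the invariant involved. A `torch.autograd.Function`'s `backward()` must return exactly one value per `forward()` argument (excluding `ctx`), with `None` for non-differentiable arguments; returning the wrong number raises a runtime error during the backward pass.

```python
import torch

# Hypothetical example (names are illustrative): a custom autograd Function
# whose forward takes three arguments, so backward must return three values.
class ScaledMM(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b, scale):
        ctx.save_for_backward(a, b)
        ctx.scale = scale
        return (a @ b) * scale

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        grad_a = (grad_out @ b.t()) * ctx.scale
        grad_b = (a.t() @ grad_out) * ctx.scale
        # forward() took 3 args (a, b, scale), so backward() returns 3
        # gradients; scale is a plain float, so its slot is None. Returning
        # only (grad_a, grad_b) here would be a return count mismatch and
        # fail at runtime.
        return grad_a, grad_b, None

a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(3, 4, requires_grad=True)
ScaledMM.apply(a, b, 2.0).sum().backward()
```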

@pytorch-bot

pytorch-bot bot commented Feb 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3956

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 84d572b with merge base 4e18d87:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 27, 2026
@danielvegamyhre danielvegamyhre self-requested a review February 27, 2026 02:09
@danielvegamyhre
Contributor

surprised 1xh100 CI tests didn't catch this

@xiaobochen-amd
Contributor Author

Before my previous PR, the counts here were mismatched.

@danielvegamyhre
Contributor

i think i see the issue, we need to unskip this test:

@danielvegamyhre danielvegamyhre added module: training quantize_ api training flow moe labels Feb 27, 2026
@xiaobochen-amd
Contributor Author

> i think i see the issue, we need to unskip this test:

When I tested locally, the test was unskipped and it passed. I just noticed that #3788 hasn’t been fixed, so I left it as is.

@danielvegamyhre
Contributor

> i think i see the issue, we need to unskip this test:
>
> When I tested locally, it was set to unskip and the test passed. I just noticed that #3788 hasn’t been fixed, so I left it as is.

I see, let me test on h100 if the issue still exists, i think it may have been a transient env/build issue we never root caused

@danielvegamyhre
Contributor

@xiaobochen-amd I tested, and the original cuBLAS error is resolved, but there is a new error (a numerics mismatch over the threshold). Forward pass outputs are identical under torch.equal; however, the gradients differ slightly, with most columns identical but some columns requiring atol/rtol=1 to pass.

I'll create an issue for this on the CUDA side. It doesn't block this PR; we can leave the test skipped.
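The kind of tolerance comparison described above can be sketched with `torch.testing.assert_close` (the tensors and deviation here are synthetic stand-ins, not the actual test data):

```python
import torch

# Illustrative only: a reference gradient vs. a copy with a per-column
# deviation, mimicking "most columns identical but some columns requiring
# atol/rtol=1 to pass".
ref = torch.randn(8, 8)
other = ref.clone()
other[:, 3] += 0.5  # inject a deviation into one column

# Exact equality fails on the perturbed column...
assert not torch.equal(ref, other)
# ...while a loose tolerance passes: assert_close checks
# |actual - expected| <= atol + rtol * |expected| elementwise.
torch.testing.assert_close(other, ref, atol=1.0, rtol=1.0)
```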

@danielvegamyhre danielvegamyhre merged commit 9bdc0ca into pytorch:main Feb 27, 2026
19 of 22 checks passed
