
Conversation


@fegin fegin commented Sep 3, 2025

Stack from ghstack (oldest at bottom):

The original implementation set beta to 1, which caused the out tensor (C) to be added to the output. Thus, if the output was not initialized to zero beforehand, the result could be incorrect.

Removing the alpha and beta fixes the issue.

Thanks @ngimel for figuring out the root cause.
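To see the failure mode: a GEMM epilogue follows BLAS semantics, `out = alpha * (A @ B) + beta * C`, so with `beta=1` whatever is already in the output buffer is added in. A pure-Python sketch of those semantics (illustrative only, not the actual cuBLAS path):

```python
def gemm(A, B, C, alpha=1.0, beta=1.0):
    """BLAS GEMM semantics: out[i][j] = alpha * sum_k A[i][k]*B[k][j] + beta * C[i][j]."""
    n, k, m = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][x] * B[x][j] for x in range(k)) + beta * C[i][j]
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
garbage = [[100.0, 100.0], [100.0, 100.0]]  # stands in for an uninitialized output buffer

# With beta=1, the stale buffer contents leak into the result.
print(gemm(A, B, garbage, beta=1.0))  # [[119.0, 122.0], [143.0, 150.0]]
# With beta=0 (equivalent to dropping alpha/beta), the buffer contents are ignored.
print(gemm(A, B, garbage, beta=0.0))  # [[19.0, 22.0], [43.0, 50.0]]
```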

cc @H-Huang @awgu @wanchaol @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim


pytorch-bot bot commented Sep 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162040


✅ No Failures

As of commit 063094d with merge base 1e0656f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Sep 3, 2025
@fegin fegin added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 3, 2025

ngimel commented Sep 3, 2025

If in the tests you set

```python
torch.utils.deterministic.fill_uninitialized_memory = True
torch.use_deterministic_algorithms(True)
```

you should be able to reliably repro the issue.


fegin commented Sep 3, 2025

@ngimel You are right, I can always reproduce the error with the two options you mentioned. But that requires CUBLAS_WORKSPACE_CONFIG=:4096:8 to be exported before running the test, so I guess we cannot add that to the test.


ngimel commented Sep 4, 2025

huh, cc @eqy: exporting the CUBLAS_WORKSPACE_CONFIG=:4096:8 env var is no longer necessary for deterministic mode. Are you getting any errors without it? If it's just warnings, you can ignore them.


fegin commented Sep 4, 2025

@ngimel I get a hard error. So unfortunately, I cannot ignore it :(


ngimel commented Sep 4, 2025

Ok, this requirement will be removed once #161749 is landed


ngimel commented Sep 4, 2025

You probably can do `torch.set_deterministic_debug_mode("warn")` + `torch.utils.deterministic.fill_uninitialized_memory = True`; that way you'd still get the tests to fail, but it won't require the env variable even today.
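Sketched as test setup code, the env-var-free combination would look something like this (illustrative wiring; the PR's actual test code may differ):

```python
import torch

# Fill fresh allocations with a known pattern, so a beta=1 read of an
# uninitialized output buffer produces a deterministic, visibly wrong result.
torch.utils.deterministic.fill_uninitialized_memory = True

# "warn" only warns (rather than erroring) on ops without deterministic
# kernels, so CUBLAS_WORKSPACE_CONFIG does not need to be exported.
torch.set_deterministic_debug_mode("warn")
```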


fegin commented Sep 5, 2025

> You probably can do `torch.set_deterministic_debug_mode("warn")` + `torch.utils.deterministic.fill_uninitialized_memory = True` and that way you'd still get the tests to fail, but it won't require env variable even today

This works. I added this to the test and am landing the PR.


fegin commented Sep 8, 2025

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2025
The async TP result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the discrepancy comes from numerical error due to low precision.

Pull Request resolved: #162041
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
ghstack dependencies: #162040
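For context on "adjust the allclose threshold": the allclose criterion is elementwise `|actual - expected| <= atol + rtol * |expected|`, so loosening `rtol` admits small low-precision rounding error. A pure-Python sketch of that rule (illustrative tolerances, not the PR's actual values):

```python
def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Same elementwise rule torch.allclose uses: |x - y| <= atol + rtol * |y|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

exact = [1.0, 2.0, 3.0]
low_precision = [1.001, 1.999, 3.002]  # ~1e-3 relative error, typical of low-precision rounding

print(allclose(exact, low_precision))             # False at the default rtol=1e-5
print(allclose(exact, low_precision, rtol=1e-2))  # True once the threshold is loosened
```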
@github-actions github-actions bot deleted the gh/fegin/311/head branch October 9, 2025 02:10