
Conversation


@fegin fegin commented Sep 3, 2025

Stack from ghstack (oldest at bottom):

The original implementation set beta to 1, which caused the out tensor (C) to be added to the output. Thus, if the output was not initialized to zero beforehand, the result could be incorrect.

Removing the alpha and beta fixes the issue.

Thanks @ngimel for figuring out the root cause.
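To see the failure mode: a GEMM epilogue follows BLAS semantics, `out = alpha * (A @ B) + beta * C`, so with `beta=1` whatever is already in the output buffer is added in. A pure-Python sketch of those semantics (illustrative only, not the actual cuBLAS path):

```python
def gemm(A, B, C, alpha=1.0, beta=1.0):
    """BLAS GEMM semantics: out[i][j] = alpha * sum_k A[i][k]*B[k][j] + beta * C[i][j]."""
    n, k, m = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][x] * B[x][j] for x in range(k)) + beta * C[i][j]
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
garbage = [[100.0, 100.0], [100.0, 100.0]]  # stands in for an uninitialized output buffer

# With beta=1, the stale buffer contents leak into the result.
print(gemm(A, B, garbage, beta=1.0))  # [[119.0, 122.0], [143.0, 150.0]]
# With beta=0 (equivalent to dropping alpha/beta), the buffer contents are ignored.
print(gemm(A, B, garbage, beta=0.0))  # [[19.0, 22.0], [43.0, 50.0]]
```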

cc @H-Huang @awgu @wanchaol @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim


pytorch-bot bot commented Sep 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162040


✅ No Failures

As of commit 063094d with merge base 1e0656f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Sep 3, 2025
@fegin fegin added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 3, 2025

ngimel commented Sep 3, 2025

If in the tests you set

```python
torch.utils.deterministic.fill_uninitialized_memory = True
torch.use_deterministic_algorithms(True)
```

you should be able to reliably repro the issue.


fegin commented Sep 3, 2025

@ngimel You are right, I can always reproduce the error with the two options you mentioned. But that requires CUBLAS_WORKSPACE_CONFIG=:4096:8 to be exported before running the test, so I guess we cannot add that to the test.


ngimel commented Sep 4, 2025

huh, cc @eqy: exporting the CUBLAS_WORKSPACE_CONFIG=:4096:8 env var is no longer necessary for deterministic mode. Are you getting any errors without it? If it's just warnings, you can ignore them.


fegin commented Sep 4, 2025

@ngimel I get a hard error. So unfortunately, I cannot ignore it :(


ngimel commented Sep 4, 2025

Ok, this requirement will be removed once #161749 is landed


ngimel commented Sep 4, 2025

You probably can do `torch.set_deterministic_debug_mode("warn")` + `torch.utils.deterministic.fill_uninitialized_memory = True`; that way you'd still get the tests to fail, but it won't require the env variable even today.
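Sketched as test setup code, the env-var-free combination would look something like this (illustrative wiring; the PR's actual test code may differ):

```python
import torch

# Fill fresh allocations with a known pattern, so a beta=1 read of an
# uninitialized output buffer produces a deterministic, visibly wrong result.
torch.utils.deterministic.fill_uninitialized_memory = True

# "warn" only warns (rather than erroring) on ops without deterministic
# kernels, so CUBLAS_WORKSPACE_CONFIG does not need to be exported.
torch.set_deterministic_debug_mode("warn")
```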


fegin commented Sep 5, 2025

> You probably can do `torch.set_deterministic_debug_mode("warn")` + `torch.utils.deterministic.fill_uninitialized_memory = True` and that way you'd still get the tests to fail, but it won't require env variable even today

This works. I added this to the test and am landing the PR.


fegin commented Sep 8, 2025

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2025
The async TP result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the discrepancy comes from numerical error due to low precision.

Pull Request resolved: #162041
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
ghstack dependencies: #162040
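For context on "adjust the allclose threshold": the allclose criterion is elementwise `|actual - expected| <= atol + rtol * |expected|`, so loosening `rtol` admits small low-precision rounding error. A pure-Python sketch of that rule (illustrative tolerances, not the PR's actual values):

```python
def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Same elementwise rule torch.allclose uses: |x - y| <= atol + rtol * |y|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

exact = [1.0, 2.0, 3.0]
low_precision = [1.001, 1.999, 3.002]  # ~1e-3 relative error, typical of low-precision rounding

print(allclose(exact, low_precision))             # False at the default rtol=1e-5
print(allclose(exact, low_precision, rtol=1e-2))  # True once the threshold is loosened
```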
@github-actions github-actions bot deleted the gh/fegin/311/head branch October 9, 2025 02:10