[c10d] add bfloat16 support for NAN check #131131
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131131
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit fa482bf with merge base 8bf0be7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Cool. This makes me think we should have the same check in GLOO as well.
@pytorchbot merge -f "all checks passed"
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #131131
Approved by: https://github.com/wconstab, https://github.com/XilunWu
(cherry picked from commit 406f510)
Stack from ghstack (oldest at bottom):
Summary:
Need another dispatcher macro to support more data types.
Test Plan:
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (86fcae11)]$ python test/distributed/test_c10d_nccl.py -k test_nan_assert_bfloat16
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18: checkForNaN: block: [0,0,0], thread: [85,0,0] Assertion `!isnan(data[i])` failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18: checkForNaN: block: [0,0,0], thread: [18,0,0] Assertion `!isnan(data[i])` failed.
NCCL version 2.21.5+cuda12.0
devgpu009:1193787:1193787 [0] init.cc:1773 NCCL WARN Cuda failure 'device-side assert triggered'
.
----------------------------------------------------------------------
Ran 1 test in 9.416s

OK
Reviewers:
Subscribers:
Tasks:
Tags:
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
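To make the intent concrete, below is a minimal, self-contained sketch of a bfloat16-aware device-side NaN check. This is not the actual c10d Utils.cu code: the kernel name `check_for_nan`, the launch configuration, and the widen-to-float-before-`isnan` approach are illustrative assumptions, and the real implementation would pick the element type through a PyTorch dtype dispatch macro rather than hard-coding `__nv_bfloat16`.

```cuda
// Hypothetical sketch (not the actual c10d Utils.cu): a grid-stride kernel
// that asserts no element of a bfloat16 buffer is NaN, mirroring the
// `!isnan(data[i])` device assertion seen in the test output above.
// Assumed build: CUDA 12, e.g. `nvcc -arch=sm_80 nan_check_sketch.cu`.
#include <cuda_bf16.h>
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void check_for_nan(const __nv_bfloat16* data, size_t n) {
  size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t stride = (size_t)blockDim.x * gridDim.x;
  for (size_t i = idx; i < n; i += stride) {
    // Widen bf16 to float so the standard device isnan() applies.
    assert(!isnan(__bfloat162float(data[i])));
  }
}

int main() {
  const size_t n = 1024;
  std::vector<__nv_bfloat16> host(n, __float2bfloat16(1.0f));
  host[123] = __float2bfloat16(NAN);  // inject one NaN to trip the assert

  __nv_bfloat16* dev = nullptr;
  cudaMalloc(&dev, n * sizeof(__nv_bfloat16));
  cudaMemcpy(dev, host.data(), n * sizeof(__nv_bfloat16), cudaMemcpyHostToDevice);

  check_for_nan<<<4, 256>>>(dev, n);
  // The device-side assert surfaces at synchronization as
  // "device-side assert triggered", like the NCCL WARN line in the test plan.
  cudaError_t err = cudaDeviceSynchronize();
  printf("kernel status: %s\n", cudaGetErrorString(err));

  cudaFree(dev);
  return 0;
}
```

In the PR itself the element type is only known at runtime, which is why an additional dispatch macro covering bfloat16 (and other reduced-precision types) is needed before a kernel like this can be instantiated for those dtypes.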