[Gradient Compression] divide by world size before all_reduce to avoid overflow #57410
Conversation
[Gradient Compression] divide by world size before all_reduce to avoid overflow

FP16 gradient compression may run into an 'inf' issue; switching to division before the allreduce avoids this problem.

Differential Revision: [D28128628](https://our.internmc.facebook.com/intern/diff/D28128628/)

[ghstack-poisoned]
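A standalone illustration (not code from this PR) of the failure mode: FP16 tops out at 65504, so summing per-rank gradients across the group before dividing by the world size can push intermediate values to `inf`, while dividing first keeps them in range.

```python
import torch

world_size = 8
grad = torch.full((1,), 10000.0, dtype=torch.float16)

# Old order: all_reduce(SUM) first, divide after.
# Emulating the sum across 8 ranks overflows fp16 (max ~65504).
summed_then_divided = (grad * world_size) / world_size
print(summed_then_divided)  # tensor([inf], dtype=torch.float16)

# New order: divide by world_size first, then sum.
# Every intermediate value stays within fp16 range.
divided_then_summed = (grad / world_size) * world_size
print(divided_then_summed)  # tensor([10000.], dtype=torch.float16)
```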
💊 CI failures summary and remediations

As of commit eaa2897 (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failure does not appear to be due to an upstream breakage: pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test1 (1/1), step "Run tests".
[Gradient Compression] divide by world size before all_reduce to avoid overflow

FP16 gradient compression may run into an 'inf' issue; switching to division before the allreduce avoids this problem.

Differential Revision: [D28128628](https://our.internmc.facebook.com/intern/diff/D28128628/)

ghstack-source-id: 127877083
Pull Request resolved: #57410
LGTM!
This pull request has been merged in c07babb.
Update the documentation to be consistent with #57410.

Differential Revision: [D28388160](https://our.internmc.facebook.com/intern/diff/D28388160/)

[ghstack-poisoned]
Update the documentation to be consistent with #57410.

Differential Revision: [D28388160](https://our.internmc.facebook.com/intern/diff/D28388160/)

ghstack-source-id: 128797174
Pull Request resolved: #58168
[Gradient Compression] divide by world size before all_reduce to avoid overflow (pytorch#57410)

Summary: Pull Request resolved: pytorch#57410

FP16 gradient compression may run into an 'inf' issue; switching to division before the allreduce avoids this problem.

ghstack-source-id: 127877083

Test Plan: before change: f268909897; after change: f270950609. If you still see 'grad_norm = inf' after enabling the fp16 hook, you can resume the training with the hook turned off.

Reviewed By: SciPioneer

Differential Revision: D28128628

fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0
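For context, the hook in question is enabled by registering it on a DDP model. A minimal usage sketch, assuming the process group is already initialized and that `model` and `rank` are placeholders supplied by the surrounding training script:

```python
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group(...) has already run and that
# `model` and `rank` are defined by the training script (placeholders here).
ddp_model = DDP(model, device_ids=[rank])

# Register the built-in FP16 compression hook: gradients are cast to
# float16 and allreduced; after this diff, the division by world size
# happens before the allreduce.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```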
Update the documentation to be consistent with pytorch#57410 (pytorch#58168)

Summary: Pull Request resolved: pytorch#58168

Update the documentation to be consistent with pytorch#57410.

ghstack-source-id: 128797174

Test Plan: N/A

Reviewed By: agolynski, zhengwy888

Differential Revision: D28388160

fbshipit-source-id: 6ba13ad9f9d7b4d003cdc112545573e452df8b65
Stack from ghstack:
FP16 gradient compression may run into an 'inf' issue; switching to division before the allreduce avoids this problem (a sketch of the reordered hook follows below).
Differential Revision: D28128628
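For illustration, a minimal sketch of the hook with the reordered division, written against the public DDP comm-hook API. This paraphrases the shape of `fp16_compress_hook` rather than reproducing the exact diff; in particular, the bucket accessor is `bucket.buffer()` in newer releases (older ones exposed `bucket.get_tensor()`):

```python
import torch
import torch.distributed as dist

def fp16_compress_hook(process_group, bucket):
    """DDP comm hook: cast gradients to fp16, pre-divided by world size."""
    group_to_use = process_group if process_group is not None else dist.group.WORLD
    world_size = group_to_use.size()

    # Divide BEFORE the allreduce so each rank contributes a scaled-down
    # fp16 tensor; the summed result is then already the average and stays
    # within fp16 range. (Dividing after the allreduce lets the
    # intermediate sum overflow to 'inf'.)
    compressed = bucket.buffer().to(torch.float16).div_(world_size)

    fut = dist.all_reduce(
        compressed, group=group_to_use, async_op=True
    ).get_future()

    def decompress(fut):
        # Copy the averaged fp16 result back into the bucket's
        # full-precision gradient tensor.
        out = bucket.buffer()
        out.copy_(fut.value()[0])
        return out

    return fut.then(decompress)
```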