Skip to content

Conversation

wayi1
Copy link
Contributor

@wayi1 wayi1 commented Jun 7, 2021

Stack from ghstack:

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.

Differential Revision: D28941327

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.

Differential Revision: [D28941327](https://our.internmc.facebook.com/intern/diff/D28941327/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Jun 7, 2021

💊 CI failures summary and remediations

As of commit b22840c (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

3 failures not recognized by patterns:

Job Step Action
GitHub Actions Label PRs & Issues / auto-label-rocm Unknown 🔁 rerun
GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / test Post Checkout PyTorch 🔁 rerun
GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / render_test_results Unknown 🔁 rerun

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue cla signed labels Jun 7, 2021
wayi1 pushed a commit that referenced this pull request Jun 7, 2021
If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.

Differential Revision: [D28941327](https://our.internmc.facebook.com/intern/diff/D28941327/)

ghstack-source-id: 130754510
Pull Request resolved: #59576
Copy link
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stamp because the revert was due to an issue in below diff which has been resolved.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 2b398d0.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
…pytorch#59576)

Summary:
Pull Request resolved: pytorch#59576

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
ghstack-source-id: 130754510

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28941327

fbshipit-source-id: 932e8ddbdb2bfd609a78943f6dc390d3d6ca333f
@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/145/head branch June 12, 2021 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants