
[Gradient Compression] divide by world size before all_reduce to avoid overflow #57410

Closed
wants to merge 1 commit

Conversation

@wayi1 (Contributor) commented May 1, 2021

Stack from ghstack:

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.

Differential Revision: D28128628
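
For illustration, a minimal sketch of the idea (hypothetical helper name, not the PR's actual hook code, which operates on DDP gradient buckets): dividing each rank's FP16 gradient by the world size before the all-reduce keeps the summed values within FP16 range, whereas summing first and dividing afterwards can overflow to inf.

```python
# Sketch only: assumes torch.distributed is already initialized on every rank.
import torch
import torch.distributed as dist

def allreduce_mean_fp16(grad_fp16: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    # Divide BEFORE the all_reduce: the sum is then bounded by the largest
    # per-rank gradient instead of growing by a factor of world_size,
    # which is what can push FP16 values to inf.
    grad_fp16 = grad_fp16 / world_size
    dist.all_reduce(grad_fp16, op=dist.ReduceOp.SUM)
    return grad_fp16
```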

[Gradient Compression] divide by world size before all_reduce to avoid overflow

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.

Differential Revision: [D28128628](https://our.internmc.facebook.com/intern/diff/D28128628/)

[ghstack-poisoned]
@facebook-github-bot (Contributor) commented May 1, 2021

💊 CI failures summary and remediations

As of commit eaa2897 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test1 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

May 01 05:48:50 AssertionError: False is not tr...lowed difference with rtol=0 and atol=0 is only 0!
May 01 05:48:50 ----------------------------------------------------------------------
May 01 05:48:50 Traceback (most recent call last):
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 374, in wrapper
May 01 05:48:50     self._join_processes(fn)
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 566, in _join_processes
May 01 05:48:50     self._check_return_codes(elapsed_time)
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 619, in _check_return_codes
May 01 05:48:50     i, first_process.exitcode, p.exitcode
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1397, in assertEqual
May 01 05:48:50     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
May 01 05:48:50 AssertionError: False is not true : Scalars failed to compare as equal! Comparing 1 and -11 gives a difference of 12, but the allowed difference with rtol=0 and atol=0 is only 0!
May 01 05:48:50 Expect process 1 exit code to match Process 0 exit code of -11, but got 1
May 01 05:48:50 
May 01 05:48:50 ----------------------------------------------------------------------
May 01 05:48:50 Ran 199 tests in 213.489s
May 01 05:48:50 
May 01 05:48:50 FAILED (failures=1, skipped=119)
May 01 05:48:50 
May 01 05:48:50 Generating XML reports...
May 01 05:48:50 Generated XML report: test-reports/dist-nccl/distributed.test_distributed_fork/TEST-TestDistBackendWithFork-20210501054517.xml
May 01 05:48:50 Traceback (most recent call last):

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

May 01 08:17:50 unknown file: Failure
May 01 08:17:50 [       OK ] Kernel.Softmax2D (26 ms)
May 01 08:17:50 [ RUN      ] Kernel.Softmax3D
May 01 08:17:50 [       OK ] Kernel.Softmax3D (142 ms)
May 01 08:17:50 [ RUN      ] Kernel.Softmax4D
May 01 08:17:50 [       OK ] Kernel.Softmax4D (192 ms)
May 01 08:17:50 [ RUN      ] Kernel.ConstantTensors
May 01 08:17:50 [       OK ] Kernel.ConstantTensors (19 ms)
May 01 08:17:50 [ RUN      ] Kernel.ConstantTensorsNonContiguous
May 01 08:17:50 [       OK ] Kernel.ConstantTensorsNonContiguous (18 ms)
May 01 08:17:50 [ RUN      ] Kernel.RunFast
May 01 08:17:50 unknown file: Failure
May 01 08:17:50 C++ exception with description "SimpleIREvaluator::call_raw is not implemented yet" thrown in the test body.
May 01 08:17:50 [  FAILED  ] Kernel.RunFast (5 ms)
May 01 08:17:50 [----------] 14 tests from Kernel (616 ms total)
May 01 08:17:50 
May 01 08:17:50 [----------] 140 tests from LoopNest
May 01 08:17:50 [ RUN      ] LoopNest.ExprSimple01
May 01 08:17:50 [       OK ] LoopNest.ExprSimple01 (1 ms)
May 01 08:17:50 [ RUN      ] LoopNest.ExprLower01
May 01 08:17:50 [       OK ] LoopNest.ExprLower01 (0 ms)
May 01 08:17:50 [ RUN      ] LoopNest.ExprSimple02

This comment was automatically generated by Dr. CI.

@facebook-github-bot added the cla signed and oncall: distributed labels May 1, 2021
wayi1 pushed a commit that referenced this pull request May 1, 2021
[Gradient Compression] divide by world size before all_reduce to avoid overflow

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.

Differential Revision: [D28128628](https://our.internmc.facebook.com/intern/diff/D28128628/)

ghstack-source-id: 127877083
Pull Request resolved: #57410
@rohan-varma (Member) left a comment:


LGTM!

@facebook-github-bot (Contributor) commented:

This pull request has been merged in c07babb.

@facebook-github-bot deleted the gh/SciPioneer/118/head branch May 11, 2021 14:16
wayi1 pushed a commit that referenced this pull request May 12, 2021
Update the documentation to be consistent with #57410.

Differential Revision: [D28388160](https://our.internmc.facebook.com/intern/diff/D28388160/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request May 12, 2021
Update the documentation to be consistent with #57410.

Differential Revision: [D28388160](https://our.internmc.facebook.com/intern/diff/D28388160/)

ghstack-source-id: 128797174
Pull Request resolved: #58168
facebook-github-bot pushed a commit that referenced this pull request May 12, 2021
…8168)

Summary:
Pull Request resolved: #58168

Update the documentation to be consistent with #57410.
ghstack-source-id: 128797174

Test Plan: N/A

Reviewed By: agolynski, zhengwy888

Differential Revision: D28388160

fbshipit-source-id: 6ba13ad9f9d7b4d003cdc112545573e452df8b65
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
[Gradient Compression] divide by world size before all_reduce to avoid overflow (pytorch#57410)

Summary:
Pull Request resolved: pytorch#57410

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.
ghstack-source-id: 127877083

Test Plan:
before change:

f268909897

after change:
f270950609

If you still see 'grad_norm = inf' after enabling the FP16 hook, you can resume the training and turn off the hook (see the usage sketch below).

Reviewed By: SciPioneer

Differential Revision: D28128628

fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0
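
For completeness, a hedged usage sketch (assuming a NCCL process group and a CUDA device are already set up, and using a placeholder model) of how the built-in FP16 compression hook discussed here is attached to a DDP model; "turning off the hook" simply means resuming from a checkpoint without registering it.

```python
# Usage sketch, not code from this PR: register the built-in FP16 compression hook.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

ddp_model = DDP(nn.Linear(1024, 1024).cuda())  # placeholder model
# Compress gradients to FP16 for communication; omit this call to train
# without the hook (e.g. when resuming after hitting grad_norm = inf).
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```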
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
…torch#58168)

Summary:
Pull Request resolved: pytorch#58168

Update the documentation to be consistent with pytorch#57410.
ghstack-source-id: 128797174

Test Plan: N/A

Reviewed By: agolynski, zhengwy888

Differential Revision: D28388160

fbshipit-source-id: 6ba13ad9f9d7b4d003cdc112545573e452df8b65
Labels: cla signed, Merged, oncall: distributed