
[Gradient Compression] divide by world size before all_reduce to avoid overflow #57410

Closed
wants to merge 1 commit

Conversation

@wayi1 (Contributor) commented May 1, 2021

Stack from ghstack:

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.

Differential Revision: D28128628
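
For illustration, a minimal sketch of the idea (hypothetical helper name, not the PR's actual hook code, which operates on DDP gradient buckets): dividing each rank's FP16 gradient by the world size before the all-reduce keeps the summed values within FP16 range, whereas summing first and dividing afterwards can overflow to inf.

```python
# Sketch only: assumes torch.distributed is already initialized on every rank.
import torch
import torch.distributed as dist

def allreduce_mean_fp16(grad_fp16: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    # Divide BEFORE the all_reduce: the sum is then bounded by the largest
    # per-rank gradient instead of growing by a factor of world_size,
    # which is what can push FP16 values to inf.
    grad_fp16 = grad_fp16 / world_size
    dist.all_reduce(grad_fp16, op=dist.ReduceOp.SUM)
    return grad_fp16
```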

[Gradient Compression] divide by world size before all_reduce to avoid overflow

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.

Differential Revision: [D28128628](https://our.internmc.facebook.com/intern/diff/D28128628/)

[ghstack-poisoned]
@facebook-github-bot (Contributor) commented May 1, 2021

💊 CI failures summary and remediations

As of commit eaa2897 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test1 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

May 01 05:48:50 AssertionError: False is not tr...lowed difference with rtol=0 and atol=0 is only 0!
May 01 05:48:50 ----------------------------------------------------------------------
May 01 05:48:50 Traceback (most recent call last):
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 374, in wrapper
May 01 05:48:50     self._join_processes(fn)
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 566, in _join_processes
May 01 05:48:50     self._check_return_codes(elapsed_time)
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 619, in _check_return_codes
May 01 05:48:50     i, first_process.exitcode, p.exitcode
May 01 05:48:50   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1397, in assertEqual
May 01 05:48:50     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
May 01 05:48:50 AssertionError: False is not true : Scalars failed to compare as equal! Comparing 1 and -11 gives a difference of 12, but the allowed difference with rtol=0 and atol=0 is only 0!
May 01 05:48:50 Expect process 1 exit code to match Process 0 exit code of -11, but got 1
May 01 05:48:50 
May 01 05:48:50 ----------------------------------------------------------------------
May 01 05:48:50 Ran 199 tests in 213.489s
May 01 05:48:50 
May 01 05:48:50 FAILED (failures=1, skipped=119)
May 01 05:48:50 
May 01 05:48:50 Generating XML reports...
May 01 05:48:50 Generated XML report: test-reports/dist-nccl/distributed.test_distributed_fork/TEST-TestDistBackendWithFork-20210501054517.xml
May 01 05:48:50 Traceback (most recent call last):

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

May 01 08:17:50 unknown file: Failure
May 01 08:17:50 [       OK ] Kernel.Softmax2D (26 ms)
May 01 08:17:50 [ RUN      ] Kernel.Softmax3D
May 01 08:17:50 [       OK ] Kernel.Softmax3D (142 ms)
May 01 08:17:50 [ RUN      ] Kernel.Softmax4D
May 01 08:17:50 [       OK ] Kernel.Softmax4D (192 ms)
May 01 08:17:50 [ RUN      ] Kernel.ConstantTensors
May 01 08:17:50 [       OK ] Kernel.ConstantTensors (19 ms)
May 01 08:17:50 [ RUN      ] Kernel.ConstantTensorsNonContiguous
May 01 08:17:50 [       OK ] Kernel.ConstantTensorsNonContiguous (18 ms)
May 01 08:17:50 [ RUN      ] Kernel.RunFast
May 01 08:17:50 unknown file: Failure
May 01 08:17:50 C++ exception with description "SimpleIREvaluator::call_raw is not implemented yet" thrown in the test body.
May 01 08:17:50 [  FAILED  ] Kernel.RunFast (5 ms)
May 01 08:17:50 [----------] 14 tests from Kernel (616 ms total)
May 01 08:17:50 
May 01 08:17:50 [----------] 140 tests from LoopNest
May 01 08:17:50 [ RUN      ] LoopNest.ExprSimple01
May 01 08:17:50 [       OK ] LoopNest.ExprSimple01 (1 ms)
May 01 08:17:50 [ RUN      ] LoopNest.ExprLower01
May 01 08:17:50 [       OK ] LoopNest.ExprLower01 (0 ms)
May 01 08:17:50 [ RUN      ] LoopNest.ExprSimple02

This comment was automatically generated by Dr. CI.

@facebook-github-bot added the cla signed and oncall: distributed labels May 1, 2021
wayi1 pushed a commit that referenced this pull request May 1, 2021
[Gradient Compression] divide by world size before all_reduce to avoid overflow

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.

Differential Revision: [D28128628](https://our.internmc.facebook.com/intern/diff/D28128628/)

ghstack-source-id: 127877083
Pull Request resolved: #57410
@rohan-varma (Member) left a comment:


LGTM!

@facebook-github-bot (Contributor) commented:

This pull request has been merged in c07babb.

@facebook-github-bot deleted the gh/SciPioneer/118/head branch May 11, 2021 14:16
wayi1 pushed a commit that referenced this pull request May 12, 2021
Update the documentation to be consistent with #57410.

Differential Revision: [D28388160](https://our.internmc.facebook.com/intern/diff/D28388160/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request May 12, 2021
Update the documentation to be consistent with #57410.

Differential Revision: [D28388160](https://our.internmc.facebook.com/intern/diff/D28388160/)

ghstack-source-id: 128797174
Pull Request resolved: #58168
facebook-github-bot pushed a commit that referenced this pull request May 12, 2021
…8168)

Summary:
Pull Request resolved: #58168

Update the documentation to be consistent with #57410.
ghstack-source-id: 128797174

Test Plan: N/A

Reviewed By: agolynski, zhengwy888

Differential Revision: D28388160

fbshipit-source-id: 6ba13ad9f9d7b4d003cdc112545573e452df8b65
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
[Gradient Compression] divide by world size before all_reduce to avoid overflow (pytorch#57410)

Summary:
Pull Request resolved: pytorch#57410

FP16 gradient compression may run into an 'inf' issue. Switching to division before the allreduce avoids this problem.
ghstack-source-id: 127877083

Test Plan:
before change:

f268909897

after change:
f270950609

If you still see 'grad_norm = inf' after enabling the FP16 hook, you can resume the training and turn off the hook (see the usage sketch below).

Reviewed By: SciPioneer

Differential Revision: D28128628

fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0
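
For completeness, a hedged usage sketch (assuming a NCCL process group and a CUDA device are already set up, and using a placeholder model) of how the built-in FP16 compression hook discussed here is attached to a DDP model; "turning off the hook" simply means resuming from a checkpoint without registering it.

```python
# Usage sketch, not code from this PR: register the built-in FP16 compression hook.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

ddp_model = DDP(nn.Linear(1024, 1024).cuda())  # placeholder model
# Compress gradients to FP16 for communication; omit this call to train
# without the hook (e.g. when resuming after hitting grad_norm = inf).
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```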
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
…torch#58168)

Summary:
Pull Request resolved: pytorch#58168

Update the documentation to be consistent with pytorch#57410.
ghstack-source-id: 128797174

Test Plan: N/A

Reviewed By: agolynski, zhengwy888

Differential Revision: D28388160

fbshipit-source-id: 6ba13ad9f9d7b4d003cdc112545573e452df8b65
Labels: cla signed, Merged, oncall: distributed