[Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that makes a vanilla allreduce future #51094

Closed
wants to merge 4 commits into from


@wayi1 wayi1 commented Jan 26, 2021

Stack from ghstack:

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D26070147


facebook-github-bot commented Jan 26, 2021

💊 CI failures summary and remediations

As of commit 2ca21e5 (more details on the Dr. CI page):


  • 9/9 failures introduced in this PR

🕵️ 9 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jan 29 00:53:48 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in
Jan 29 00:53:48     #7 0x560ac09b870b in PyEval_EvalCode /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:731
Jan 29 00:53:48     #8 0x560ac0a38573 in run_mod /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:1025
Jan 29 00:53:48     #9 0x560ac0a3860c in PyRun_StringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:949
Jan 29 00:53:48     #10 0x560ac0a3866e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:445
Jan 29 00:53:48     #11 0x560ac0a3c472 in run_command /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:301
Jan 29 00:53:48     #12 0x560ac0a3c472 in Py_Main /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:749
Jan 29 00:53:48     #13 0x560ac090643d in main /tmp/build/80754af9/python_1599604603603/work/Programs/python.c:69
Jan 29 00:53:48     #14 0x7ffa1d8e283f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291
Jan 29 00:53:48     #15 0x560ac09e5d0a in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103
Jan 29 00:53:48 
Jan 29 00:53:48 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in 
Jan 29 00:53:48 + retcode=1
Jan 29 00:53:48 + set -e
Jan 29 00:53:48 + return 1
Jan 29 00:53:48 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]]
Jan 29 00:53:48 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]]
Jan 29 00:53:48 + '[' -n https://github.com/pytorch/pytorch/pull/51094 ']'
Jan 29 00:53:48 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 != *coverage* ]]
Jan 29 00:53:48 ++ mktemp
Jan 29 00:53:48 + DETERMINE_FROM=/tmp/tmp.fpKEWaHwLo
Jan 29 00:53:48 + file_diff_from_base /tmp/tmp.fpKEWaHwLo

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (2/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jan 29 00:52:06 sccache: error: couldn't connect to server
Jan 29 00:52:06 +++ eval 'extract_trap_cmd '
Jan 29 00:52:06 ++++ extract_trap_cmd
Jan 29 00:52:06 ++++ printf '%s\n' ''
Jan 29 00:52:06 +++ printf '%s\n' cleanup
Jan 29 00:52:06 ++ trap -- '
Jan 29 00:52:06 cleanup' EXIT
Jan 29 00:52:06 ++ [[ pytorch-linux-bionic-py3.6-clang9-test != *pytorch-win-* ]]
Jan 29 00:52:06 ++ which sccache
Jan 29 00:52:06 ++ sccache --stop-server
Jan 29 00:52:06 Stopping sccache server...
Jan 29 00:52:06 sccache: error: couldn't connect to server
Jan 29 00:52:06 sccache: caused by: Connection refused (os error 111)
Jan 29 00:52:06 ++ true
Jan 29 00:52:06 ++ rm /var/lib/jenkins/sccache_error.log
Jan 29 00:52:06 ++ [[ pytorch-linux-bionic-py3.6-clang9-test == *rocm* ]]
Jan 29 00:52:06 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jan 29 00:52:06 ++ SCCACHE_IDLE_TIMEOUT=1200
Jan 29 00:52:06 ++ RUST_LOG=sccache::server=error
Jan 29 00:52:06 ++ sccache --start-server
Jan 29 00:52:06 sccache: Starting the server...
Jan 29 00:52:06 ++ sccache --zero-stats

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/9)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (4/9)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (5/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jan 29 01:12:31 sccache: error: couldn't connect to server
Jan 29 01:12:31 +++ eval 'extract_trap_cmd '
Jan 29 01:12:31 ++++ extract_trap_cmd
Jan 29 01:12:31 ++++ printf '%s\n' ''
Jan 29 01:12:31 +++ printf '%s\n' cleanup
Jan 29 01:12:31 ++ trap -- '
Jan 29 01:12:31 cleanup' EXIT
Jan 29 01:12:31 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test2 != *pytorch-win-* ]]
Jan 29 01:12:31 ++ which sccache
Jan 29 01:12:31 ++ sccache --stop-server
Jan 29 01:12:31 Stopping sccache server...
Jan 29 01:12:31 sccache: error: couldn't connect to server
Jan 29 01:12:31 sccache: caused by: Connection refused (os error 111)
Jan 29 01:12:31 ++ true
Jan 29 01:12:31 ++ rm /var/lib/jenkins/sccache_error.log
Jan 29 01:12:31 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test2 == *rocm* ]]
Jan 29 01:12:31 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jan 29 01:12:31 ++ SCCACHE_IDLE_TIMEOUT=1200
Jan 29 01:12:31 ++ RUST_LOG=sccache::server=error
Jan 29 01:12:31 ++ sccache --start-server
Jan 29 01:12:31 sccache: Starting the server...
Jan 29 01:12:31 ++ sccache --zero-stats

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test1 (6/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jan 29 00:53:45 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in
Jan 29 00:53:45     #7 0x56041871170b in PyEval_EvalCode /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:731
Jan 29 00:53:45     #8 0x560418791573 in run_mod /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:1025
Jan 29 00:53:45     #9 0x56041879160c in PyRun_StringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:949
Jan 29 00:53:45     #10 0x56041879166e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:445
Jan 29 00:53:45     #11 0x560418795472 in run_command /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:301
Jan 29 00:53:45     #12 0x560418795472 in Py_Main /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:749
Jan 29 00:53:45     #13 0x56041865f43d in main /tmp/build/80754af9/python_1599604603603/work/Programs/python.c:69
Jan 29 00:53:45     #14 0x7fa9a753183f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291
Jan 29 00:53:45     #15 0x56041873ed0a in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103
Jan 29 00:53:45 
Jan 29 00:53:45 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in 
Jan 29 00:53:45 + retcode=1
Jan 29 00:53:45 + set -e
Jan 29 00:53:45 + return 1
Jan 29 00:53:45 + [[ pytorch-linux-xenial-py3-clang5-asan-test1 == *-NO_AVX-* ]]
Jan 29 00:53:45 + [[ pytorch-linux-xenial-py3-clang5-asan-test1 == *-NO_AVX2-* ]]
Jan 29 00:53:45 + '[' -n https://github.com/pytorch/pytorch/pull/51094 ']'
Jan 29 00:53:45 + [[ pytorch-linux-xenial-py3-clang5-asan-test1 != *coverage* ]]
Jan 29 00:53:45 ++ mktemp
Jan 29 00:53:45 + DETERMINE_FROM=/tmp/tmp.q1TOnBC06b
Jan 29 00:53:45 + file_diff_from_base /tmp/tmp.q1TOnBC06b

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (7/9)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (8/9)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

RuntimeError: distributed/test_c10d failed!
  File "C:\Users\circleci\project\build\win_tmp\build\torch\distributed\algorithms\ddp_comm_hooks\__init__.py", line 7, in <module>
    from . import (
  File "C:\Users\circleci\project\build\win_tmp\build\torch\distributed\algorithms\ddp_comm_hooks\powerSGD_hook.py", line 7, in <module>
    import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default
AttributeError: module 'torch.distributed.algorithms' has no attribute 'ddp_comm_hooks'
Traceback (most recent call last):
  File "run_test.py", line 922, in <module>
    main()
  File "run_test.py", line 901, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_c10d failed!

(base) C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1 
+ cleanup
+ retcode=1
+ set +x


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (9/9)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jan 29 01:12:12 sccache: error: couldn't connect to server
Jan 29 01:12:12 +++ eval 'extract_trap_cmd '
Jan 29 01:12:12 ++++ extract_trap_cmd
Jan 29 01:12:12 ++++ printf '%s\n' ''
Jan 29 01:12:12 +++ printf '%s\n' cleanup
Jan 29 01:12:12 ++ trap -- '
Jan 29 01:12:12 cleanup' EXIT
Jan 29 01:12:12 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test1 != *pytorch-win-* ]]
Jan 29 01:12:12 ++ which sccache
Jan 29 01:12:12 ++ sccache --stop-server
Jan 29 01:12:12 Stopping sccache server...
Jan 29 01:12:12 sccache: error: couldn't connect to server
Jan 29 01:12:12 sccache: caused by: Connection refused (os error 111)
Jan 29 01:12:12 ++ true
Jan 29 01:12:12 ++ rm /var/lib/jenkins/sccache_error.log
Jan 29 01:12:12 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test1 == *rocm* ]]
Jan 29 01:12:12 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jan 29 01:12:12 ++ SCCACHE_IDLE_TIMEOUT=1200
Jan 29 01:12:12 ++ RUST_LOG=sccache::server=error
Jan 29 01:12:12 ++ sccache --start-server
Jan 29 01:12:12 sccache: Starting the server...
Jan 29 01:12:12 ++ sccache --zero-stats

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

wayi1 pushed a commit that referenced this pull request Jan 26, 2021
… by creating a util function that make a vanilla allreduce future

Pull Request resolved: #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120376248

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)
wayi added 2 commits January 28, 2021 16:02
…SGD_hook.py by creating a util function that make a vanilla allreduce future"


Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 29, 2021
… by creating a util function that make a vanilla allreduce future

Pull Request resolved: #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)
@facebook-github-bot

This pull request has been merged in e7b3496.

@facebook-github-bot

This pull request has been reverted by 5a406c0.


izdeby commented Jan 29, 2021

Reverting due to broken CI.

@rohan-varma

@SciPioneer Looks like the failures on this PR were legit:

Jan 29 01:19:21 Traceback (most recent call last):
Jan 29 01:19:21   File "distributed/test_c10d.py", line 21, in <module>
Jan 29 01:19:21     import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default
Jan 29 01:19:21   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/algorithms/ddp_comm_hooks/__init__.py", line 7, in <module>
Jan 29 01:19:21     from . import (
Jan 29 01:19:21   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py", line 7, in <module>
Jan 29 01:19:21     import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default
Jan 29 01:19:21 AttributeError: module 'torch.distributed.algorithms' has no attribute 'ddp_comm_hooks'
Jan 29 01:19:21 Traceback (most recent call last):
Jan 29 01:19:21   File "test/run_test.py", line 922, in <module>
Jan 29 01:19:21     main()
Jan 29 01:19:21   File "test/run_test.py", line 901, in main
Jan 29 01:19:21     raise RuntimeError(err_message)
Jan 29 01:19:21 RuntimeError: distributed/test_c10d failed!
Jan 29 01:19:22 + cleanup
Jan 29 01:19:22 + retcode=1
Jan 29 01:19:22 + set +x
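
The traceback above is consistent with a circular-import pitfall: `powerSGD_hook.py` runs `import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default` while the `ddp_comm_hooks` package's own `__init__.py` is still executing. On Python 3.6 (the interpreter shown in the log path) that form looks up `ddp_comm_hooks` as an attribute of the parent package, and the attribute is only set once the package has finished importing. CPython 3.7 relaxed this behaviour. The sketch below is an illustration under assumptions, not the change made in the resubmission: it builds a small throwaway package (all names such as `demo_algorithms`, `comm_hooks`, and `average` are hypothetical) and shows that a relative `from . import default_hooks as default` resolves inside a package that is still initialising.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical layout that mirrors the shape of the traceback:
#   demo_algorithms/comm_hooks/__init__.py      imports its submodules
#   demo_algorithms/comm_hooks/default_hooks.py shared helper
#   demo_algorithms/comm_hooks/powersgd.py      needs default_hooks
FILES = {
    "demo_algorithms/__init__.py": "",
    "demo_algorithms/comm_hooks/__init__.py": "from . import default_hooks, powersgd\n",
    "demo_algorithms/comm_hooks/default_hooks.py": textwrap.dedent('''
        def average(values):
            """Stand-in helper; not the real allreduce utility."""
            return sum(values) / len(values)
    '''),
    "demo_algorithms/comm_hooks/powersgd.py": textwrap.dedent('''
        # Relative import: resolved against the package that is currently
        # being initialised, with no attribute lookup from the top-level package.
        from . import default_hooks as default

        def hook(values):
            return default.average(values)
    '''),
}


def build_and_import(root):
    """Write the demo package under `root` and import it."""
    for rel, src in FILES.items():
        path = Path(root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(src)
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        return importlib.import_module("demo_algorithms.comm_hooks")
    finally:
        sys.path.remove(str(root))


with tempfile.TemporaryDirectory() as tmp:
    hooks = build_and_import(tmp)
    result = hooks.powersgd.hook([2.0, 4.0])
    print(result)  # 3.0
```

Replacing the relative import in the hypothetical `powersgd.py` with `import demo_algorithms.comm_hooks.default_hooks as default` would be expected to raise a similar `AttributeError` on Python 3.6, while importing cleanly on 3.7 and later.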

wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…s.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future"

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future

Pull Request resolved: #51400

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120715333

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)
wayi1 pushed a commit that referenced this pull request Jan 31, 2021
…s.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future"

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/49/head branch February 1, 2021 15:19
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)

Summary:
Pull Request resolved: #51400

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue Reverted