
[Gloo] Support work-level timeouts in ProcessGroupGloo #40948

Closed
wants to merge 7 commits

Conversation

osalpekar
Member

@osalpekar osalpekar commented Jul 2, 2020

Stack from ghstack:


Add work-level timeouts to ProcessGroupGloo. This uses the timeout support in `waitSend` and `waitRecv` functions from Gloo's `unbound_buffer` construct.

Differential Revision: [D22173763](https://our.internmc.facebook.com/intern/diff/D22173763/)

@dr-ci

dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 0c14c90 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/3)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

```
Jul 15 20:05:30 FAIL [0.032s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
Jul 15 20:05:28   test_sparse_default_std (__main__.TestNNInit) ... ok (0.007s)
Jul 15 20:05:28   test_sparse_only_works_on_2d_inputs (__main__.TestNNInit) ... ok (0.001s)
Jul 15 20:05:29   test_trunc_normal (__main__.TestNNInit) ... ok (0.703s)
Jul 15 20:05:30   test_uniform (__main__.TestNNInit) ... ok (0.904s)
Jul 15 20:05:30   test_xavier_normal (__main__.TestNNInit) ... ok (0.105s)
Jul 15 20:05:30   test_xavier_normal_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.001s)
Jul 15 20:05:30   test_xavier_uniform (__main__.TestNNInit) ... ok (0.079s)
Jul 15 20:05:30   test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.001s)
Jul 15 20:05:30
Jul 15 20:05:30 ======================================================================
Jul 15 20:05:30 FAIL [0.032s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
Jul 15 20:05:30 ----------------------------------------------------------------------
Jul 15 20:05:30 Traceback (most recent call last):
Jul 15 20:05:30   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
Jul 15 20:05:30     method(*args, **kwargs)
Jul 15 20:05:30   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
Jul 15 20:05:30     method(*args, **kwargs)
Jul 15 20:05:30   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 674, in efail_fn_no_device
Jul 15 20:05:30     return efail_fn(slf, None, *args, **kwargs)
Jul 15 20:05:30   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 665, in efail_fn
Jul 15 20:05:30     slf.fail('expected a non-deterministic error, but it was not raised')
```

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (2/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

```
AssertionError: expected a non-deterministic error, but it was not raised
  test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.014s)

======================================================================
FAIL [0.026s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device
    return efail_fn(slf, None, *args, **kwargs)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn
    slf.fail('expected a non-deterministic error, but it was not raised')
AssertionError: expected a non-deterministic error, but it was not raised

----------------------------------------------------------------------
Ran 1800 tests in 729.033s

FAILED (failures=1, skipped=90, expected failures=4)

Generating XML reports...
Generated XML report: test-reports\python-unittest\TEST-PackedSequenceTest-20200715200851.xml
Generated XML report: test-reports\python-unittest\TEST-TestAddRelu-20200715200851.xml
Generated XML report: test-reports\python-unittest\TEST-TestAvgPool-20200715200851.xml
```

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (3/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

```
AssertionError: expected a non-deterministic error, but it was not raised
  test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.016s)

======================================================================
FAIL [0.025s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device
    return efail_fn(slf, None, *args, **kwargs)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn
    slf.fail('expected a non-deterministic error, but it was not raised')
AssertionError: expected a non-deterministic error, but it was not raised

----------------------------------------------------------------------
Ran 1800 tests in 654.907s

FAILED (failures=1, skipped=90, expected failures=4)

Generating XML reports...
Generated XML report: test-reports\python-unittest\TEST-PackedSequenceTest-20200715193432.xml
Generated XML report: test-reports\python-unittest\TEST-TestAddRelu-20200715193432.xml
Generated XML report: test-reports\python-unittest\TEST-TestAvgPool-20200715193432.xml
```

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your pull requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 23 times.

osalpekar added a commit that referenced this pull request Jul 2, 2020
Pull Request resolved: #40948

ghstack-source-id: 107095290
osalpekar added a commit that referenced this pull request Jul 6, 2020
Pull Request resolved: #40948

ghstack-source-id: 107187566
Contributor

@mrshenli mrshenli left a comment


Shall we add a test for this?

@osalpekar
Member Author

> Shall we add a test for this?

Added tests in the next PR in this stack (#41265)

@facebook-github-bot
Contributor

This pull request has been merged in b979129.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/52/head branch July 20, 2020 14:18