@pritamdamania87 pritamdamania87 commented Feb 22, 2021

Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails, the parent prints only the following:

```
Process 0 exited with error code 10
```

The child process also logs its own exception, but it is cumbersome to go
through the logs and track it down.

To alleviate this, I've added a pipe for each child process so that
the child process writes the error to the pipe before exiting and the parent
process can read the corresponding error from the pipe and display it.

The new output printed by the parent is as follows:


```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
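
For readers who want to see the mechanism in isolation, here is a minimal sketch of the pipe-per-child idea described above. It is not the code from this PR: `run_in_children`, `_child_main`, and `TEST_ERROR_EXIT_CODE` are hypothetical names, the standard-library `multiprocessing` module stands in for the test harness, and the `fork` start method is assumed (POSIX only) so the sketch stays self-contained.

```python
import multiprocessing
import sys
import traceback

# Hypothetical constant, chosen to match the "error code 10" shown above.
TEST_ERROR_EXIT_CODE = 10

# Assumption: "fork" start method (POSIX only), so targets need no pickling.
_ctx = multiprocessing.get_context("fork")


def _child_main(rank, test_fn, error_writer):
    """Run test_fn in the child; on failure, send the traceback to the parent."""
    try:
        test_fn(rank)
    except Exception:
        # Write the formatted traceback to the pipe before exiting.
        error_writer.send(traceback.format_exc())
        error_writer.close()
        sys.exit(TEST_ERROR_EXIT_CODE)


def run_in_children(test_fn, world_size):
    """Start world_size children and raise one combined error if any fail."""
    children = []
    for rank in range(world_size):
        # One one-way pipe per child: the child writes, the parent reads.
        reader, writer = _ctx.Pipe(duplex=False)
        proc = _ctx.Process(target=_child_main, args=(rank, test_fn, writer))
        proc.start()
        writer.close()  # the parent keeps only the read end
        children.append((rank, proc, reader))

    errors = []
    for rank, proc, reader in children:
        # Joining before reading assumes the traceback fits in the pipe buffer.
        proc.join()
        if proc.exitcode != 0:
            message = f"Process {rank} exited with error code {proc.exitcode}"
            try:
                message += " and exception:\n" + reader.recv()
            except EOFError:
                pass  # the child died without reporting an exception
            errors.append(message)
        reader.close()

    if errors:
        raise RuntimeError("\n".join(errors))
```

Calling `run_in_children(fn, 4)` with an `fn` that raises in every rank produces a single `RuntimeError` whose message has the same shape as the output above: one "Process N exited with error code 10 and exception:" header per rank, each followed by that child's traceback.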

Differential Revision: [D26589274](https://our.internmc.facebook.com/intern/diff/D26589274/)

facebook-github-bot commented Feb 22, 2021

💊 CI failures summary and remediations

As of commit dd3c043 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-scanned failure(s)

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

pritamdamania87 pushed a commit that referenced this pull request Feb 22, 2021
ghstack-source-id: 122239783
Pull Request resolved: #52632

@rohan-varma rohan-varma left a comment

Awesome, thanks for adding this!

pritamdamania87 pushed a commit that referenced this pull request Feb 23, 2021
Pull Request resolved: #52632

ghstack-source-id: 122273793

Differential Revision: [D26589274](https://our.internmc.facebook.com/intern/diff/D26589274/)

codecov bot commented Feb 23, 2021

Codecov Report

Merging #52632 (dd3c043) into gh/pritamdamania87/204/base (ee04cd9) will decrease coverage by 0.30%.
The diff coverage is 37.50%.

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/204/base   #52632      +/-   ##
===============================================================
- Coverage                        80.77%   80.46%   -0.31%     
===============================================================
  Files                             1969     1969              
  Lines                           216063   216077      +14     
===============================================================
- Hits                            174515   173875     -640     
- Misses                           41548    42202     +654     

@facebook-github-bot

This pull request has been merged in 1c63cb2.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/204/head branch February 27, 2021 15:16
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
Pull Request resolved: pytorch#52632

ghstack-source-id: 122273793

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26589274

fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021