gcramer23 (Contributor) commented May 3, 2021

DDP with NCCL AllReduce for the entire model experiment from Quip https://fb.quip.com/iQUtAeKIxWpF

I have been testing this on the AI cluster. There seem to be some connection problems with RPC when using multiple trainers or parameter servers.
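As a rough illustration of what the AllReduce step named above computes: in DDP, every trainer holds gradients for the same parameters, and an averaging allreduce leaves each trainer with the element-wise mean of those gradients. The sketch below simulates that in plain Python lists. It does not use NCCL or `torch.distributed`, and `allreduce_mean` and the sample gradient values are hypothetical, introduced only for illustration.

```python
def allreduce_mean(per_rank_grads):
    """Simulate an averaging allreduce over a list of per-rank gradient lists.

    Hypothetical helper: every rank receives its own copy of the
    element-wise mean across all ranks.
    """
    world_size = len(per_rank_grads)
    length = len(per_rank_grads[0])
    # Sum each gradient element across ranks.
    summed = [sum(g[i] for g in per_rank_grads) for i in range(length)]
    # Divide by the number of ranks to get the average.
    averaged = [s / world_size for s in summed]
    # Each rank ends up with an identical copy of the result.
    return [list(averaged) for _ in range(world_size)]


# Made-up gradients for two trainers (not from the benchmark run).
grads = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
reduced = allreduce_mean(grads)
```

After the call, both entries of `reduced` hold the same averaged gradient, which is why all trainers can apply identical optimizer updates.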

```
Namespace(bconfig_id='3', dconfig_id='DummyData', mconfig_id='DummyModel', pconfig_id='None', tconfig_id='DdpNcclTrainer')

benchmark warmup done

metrics for trainer=0
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.972    | 0.097122   | 0.311644  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.58138 | 4.31439  | 0.00229848 | 0.0479424 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.4807  | 0.222566 | 0.0555432  | 0.235676  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.191488 | 3.54099 | 3.11694  | 0.557106   | 0.746395  |
+-----------------------------------+----------+---------+----------+------------+-----------+
metrics for trainer=1
+-----------------------------------+----------+---------+----------+-------------+------------+
| name                              |      min |     max |     mean |    variance |      stdev |
+===================================+==========+=========+==========+=============+============+
| backward_metric,backward          | 2.4617   | 2.59174 | 2.51196  | 0.000938276 | 0.0306313  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| batch_level_metric,batch_all      | 4.22605  | 4.71757 | 4.27921  | 0.00468424  | 0.0684415  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| foward_metric,forward_pass        | 0.807936 | 1.50118 | 0.846008 | 0.00601693  | 0.0775688  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| hook_future_metric,nccl_allreduce | 0.108544 | 0.1536  | 0.11222  | 2.16726e-05 | 0.00465538 |
+-----------------------------------+----------+---------+----------+-------------+------------+
metrics for all trainer
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.24198  | 0.584391   | 0.764455  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.71757 | 4.2968   | 0.00378467 | 0.0615197 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.50118 | 0.534287 | 0.128284   | 0.358167  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.108544 | 3.54099 | 1.61458  | 2.5456     | 1.59549   |
+-----------------------------------+----------+---------+----------+------------+-----------+
```

Stack from ghstack:

Differential Revision: D28296175

facebook-github-bot added the `oncall: distributed` and `cla signed` labels May 3, 2021
facebook-github-bot (Contributor) commented May 3, 2021

💊 CI failures summary and remediations

As of commit 55b5867 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/android_definitions.py
Auto-merging .circleci/cimodel/data/simple/android_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_definitions.py
Auto-merging .circleci/cimodel/data/binary_build_definitions.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/android_definitions.py
Auto-merging .circleci/cimodel/data/simple/android_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_definitions.py
Auto-merging .circleci/cimodel/data/binary_build_definitions.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1



gcramer23 added a commit that referenced this pull request May 3, 2021
ghstack-source-id: 68908f5
Pull Request resolved: #57454
gcramer23 added a commit that referenced this pull request May 3, 2021
ghstack-source-id: 28403cc
Pull Request resolved: #57454
gcramer23 marked this pull request as ready for review May 3, 2021 17:09
gcramer23 added a commit that referenced this pull request May 5, 2021
ghstack-source-id: ab78ab1
Pull Request resolved: #57454
gcramer23 added a commit that referenced this pull request May 6, 2021
ghstack-source-id: 58add1e
Pull Request resolved: #57454
DDP with NCCL AllReduce for the entire model experiment from Quip https://fb.quip.com/iQUtAeKIxWpF

I have been testing this on the AI cluster. There seem to be some connection problems with RPC when using multiple trainers or parameter servers.

```
Namespace(bconfig_id='3', dconfig_id='DummyData', mconfig_id='DummyModel', pconfig_id='None', tconfig_id='DdpNcclTrainer')

benchmark warmup done

metrics for trainer=0
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.972    | 0.097122   | 0.311644  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.58138 | 4.31439  | 0.00229848 | 0.0479424 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.4807  | 0.222566 | 0.0555432  | 0.235676  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.191488 | 3.54099 | 3.11694  | 0.557106   | 0.746395  |
+-----------------------------------+----------+---------+----------+------------+-----------+
metrics for trainer=1
+-----------------------------------+----------+---------+----------+-------------+------------+
| name                              |      min |     max |     mean |    variance |      stdev |
+===================================+==========+=========+==========+=============+============+
| backward_metric,backward          | 2.4617   | 2.59174 | 2.51196  | 0.000938276 | 0.0306313  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| batch_level_metric,batch_all      | 4.22605  | 4.71757 | 4.27921  | 0.00468424  | 0.0684415  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| foward_metric,forward_pass        | 0.807936 | 1.50118 | 0.846008 | 0.00601693  | 0.0775688  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| hook_future_metric,nccl_allreduce | 0.108544 | 0.1536  | 0.11222  | 2.16726e-05 | 0.00465538 |
+-----------------------------------+----------+---------+----------+-------------+------------+
metrics for all trainer
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.24198  | 0.584391   | 0.764455  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.71757 | 4.2968   | 0.00378467 | 0.0615197 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.50118 | 0.534287 | 0.128284   | 0.358167  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.108544 | 3.54099 | 1.61458  | 2.5456     | 1.59549   |
+-----------------------------------+----------+---------+----------+------------+-----------+
```




[ghstack-poisoned]
DDP with NCCL AllReduce for the entire model experiment from Quip https://fb.quip.com/iQUtAeKIxWpF

I have been testing this on the AI cluster. There seem to be some connection problems with RPC when using multiple trainers or parameter servers.

```
Namespace(bconfig_id='3', dconfig_id='DummyData', mconfig_id='DummyModel', pconfig_id='None', tconfig_id='DdpNcclTrainer')

benchmark warmup done

metrics for trainer=0
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.972    | 0.097122   | 0.311644  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.58138 | 4.31439  | 0.00229848 | 0.0479424 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.4807  | 0.222566 | 0.0555432  | 0.235676  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.191488 | 3.54099 | 3.11694  | 0.557106   | 0.746395  |
+-----------------------------------+----------+---------+----------+------------+-----------+
metrics for trainer=1
+-----------------------------------+----------+---------+----------+-------------+------------+
| name                              |      min |     max |     mean |    variance |      stdev |
+===================================+==========+=========+==========+=============+============+
| backward_metric,backward          | 2.4617   | 2.59174 | 2.51196  | 0.000938276 | 0.0306313  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| batch_level_metric,batch_all      | 4.22605  | 4.71757 | 4.27921  | 0.00468424  | 0.0684415  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| foward_metric,forward_pass        | 0.807936 | 1.50118 | 0.846008 | 0.00601693  | 0.0775688  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| hook_future_metric,nccl_allreduce | 0.108544 | 0.1536  | 0.11222  | 2.16726e-05 | 0.00465538 |
+-----------------------------------+----------+---------+----------+-------------+------------+
metrics for all trainer
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.24198  | 0.584391   | 0.764455  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.71757 | 4.2968   | 0.00378467 | 0.0615197 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.50118 | 0.534287 | 0.128284   | 0.358167  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.108544 | 3.54099 | 1.61458  | 2.5456     | 1.59549   |
+-----------------------------------+----------+---------+----------+------------+-----------+
```




gcramer23 added a commit that referenced this pull request May 6, 2021
ghstack-source-id: 9738599
Pull Request resolved: #57454

@mrshenli mrshenli left a comment


I left some minor comments. Stamping to unblock. Please consider addressing them before landing. I will create GitHub issues if I discover anything after landing. Thanks for working on this!

@gcramer23

@gcramer23 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.



Differential Revision: [D28296175](https://our.internmc.facebook.com/intern/diff/D28296175)

gcramer23 added a commit that referenced this pull request May 7, 2021
ghstack-source-id: de7ae0a
Pull Request resolved: #57454

@facebook-github-bot

@gcramer23 merged this pull request in bc2540f.

@gcramer23 gcramer23 deleted the gh/gcramer23/5/head branch May 8, 2021 03:19
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#57454

DDP with NCCL AllReduce for the entire model experiment from Quip https://fb.quip.com/iQUtAeKIxWpF

I have been testing this on the AI cluster. There seem to be some connection problems with RPC when using multiple trainers or parameter servers.

```
Namespace(bconfig_id='3', dconfig_id='DummyData', mconfig_id='DummyModel', pconfig_id='None', tconfig_id='DdpNcclTrainer')

benchmark warmup done

metrics for trainer=0
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.972    | 0.097122   | 0.311644  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.58138 | 4.31439  | 0.00229848 | 0.0479424 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.4807  | 0.222566 | 0.0555432  | 0.235676  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.191488 | 3.54099 | 3.11694  | 0.557106   | 0.746395  |
+-----------------------------------+----------+---------+----------+------------+-----------+
metrics for trainer=1
+-----------------------------------+----------+---------+----------+-------------+------------+
| name                              |      min |     max |     mean |    variance |      stdev |
+===================================+==========+=========+==========+=============+============+
| backward_metric,backward          | 2.4617   | 2.59174 | 2.51196  | 0.000938276 | 0.0306313  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| batch_level_metric,batch_all      | 4.22605  | 4.71757 | 4.27921  | 0.00468424  | 0.0684415  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| foward_metric,forward_pass        | 0.807936 | 1.50118 | 0.846008 | 0.00601693  | 0.0775688  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| hook_future_metric,nccl_allreduce | 0.108544 | 0.1536  | 0.11222  | 2.16726e-05 | 0.00465538 |
+-----------------------------------+----------+---------+----------+-------------+------------+
metrics for all trainer
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.24198  | 0.584391   | 0.764455  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.71757 | 4.2968   | 0.00378467 | 0.0615197 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.50118 | 0.534287 | 0.128284   | 0.358167  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.108544 | 3.54099 | 1.61458  | 2.5456     | 1.59549   |
+-----------------------------------+----------+---------+----------+------------+-----------+
```

Test Plan: Imported from OSS

Reviewed By: H-Huang, ngimel

Differential Revision: D28296175

Pulled By: gcramer23

fbshipit-source-id: 5dd208fc86f8b5558d7c8860d685bb25c2e09fe7
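
For readers checking the "metrics for all trainer" table against the two per-trainer tables: each combined mean equals the plain average of the two per-trainer means, which is what you would expect if both trainers recorded the same number of samples. The sketch below pools per-trainer summaries into a combined mean, variance, and stdev. It is not the benchmark's own code. The helper name `pool` and the sample count `N = 100` are assumptions; the PR does not state the count, and 100 is used only because it reproduces the printed variance when the variance is treated as a sample variance (dividing by n - 1).

```python
import math


def pool(summaries):
    """Combine per-group (n, mean, sample_variance) summaries into one.

    Hypothetical helper for illustration; not part of the benchmark.
    """
    total_n = sum(n for n, _, _ in summaries)
    grand_mean = sum(n * mean for n, mean, _ in summaries) / total_n
    # Total squared deviation = within-group part + between-group part.
    within = sum((n - 1) * var for n, _, var in summaries)
    between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in summaries)
    variance = (within + between) / (total_n - 1)
    return grand_mean, variance, math.sqrt(variance)


# Assumed samples per trainer; not stated anywhere in the PR.
N = 100

# backward_metric,backward rows copied from the trainer=0 and trainer=1 tables.
backward = [(N, 3.972, 0.097122), (N, 2.51196, 0.000938276)]

mean, variance, stdev = pool(backward)
print(f"mean={mean:.5f} variance={variance:.6f} stdev={stdev:.6f}")
```

Under these assumptions the result agrees with the combined `backward_metric,backward` row to the printed precision. The large combined variance comes almost entirely from the gap between the two trainers' means rather than from jitter within either trainer.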
Labels: cla signed, Merged, oncall: distributed