Add benchmark for torch.distributed.pipeline.sync.Pipe #49577
Conversation
Repurposing the benchmarking from https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py and pulling in a stripped-down version of the benchmark into PyTorch.

Sample output:

```
Running benchmark with args: Namespace(batch_size=8, checkpoint='never', chunks=4, host='localhost', max_batch=10, num_decoder_layers=10, num_devices=4)
Number of parameters for model: 292833040
| batch 1 | wps 3593.07 | loss 25.98 | ppl 192556591553.37
| batch 2 | wps 4405.16 | loss 19.36 | ppl 256201548.33
| batch 3 | wps 4404.98 | loss 23.56 | ppl 17111244076.37
| batch 4 | wps 4413.25 | loss 27.11 | ppl 594561327825.83
| batch 5 | wps 4408.53 | loss 25.92 | ppl 181277705101.33
| batch 6 | wps 4385.64 | loss 24.92 | ppl 66592883598.50
| batch 7 | wps 4434.11 | loss 24.75 | ppl 56113635884.68
| batch 8 | wps 4441.25 | loss 24.88 | ppl 63666024212.82
| batch 9 | wps 4425.49 | loss 25.35 | ppl 101959669008.98
| batch 10 | wps 4421.05 | loss 25.34 | ppl 101597621863.94
Peak memory usage for GPUs: cuda:0: 2.38GiB, cuda:1: 3.04GiB, cuda:2: 3.04GiB, cuda:3: 3.67GiB,
```

Differential Revision: [D25628721](https://our.internmc.facebook.com/intern/diff/D25628721/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25628721/)!

[ghstack-poisoned]
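For context, a minimal sketch of how a model is typically wrapped in `torch.distributed.pipeline.sync.Pipe` and timed; this is not the benchmark script itself. The two-stage split, layer sizes, worker name, port, and timing loop are illustrative assumptions, while `chunks=4` and `checkpoint="never"` mirror the Namespace in the sample output above.

```python
import os
import time

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe is built on the RPC framework, so RPC must be initialized even for a
# single-process, multi-GPU run (see the discussion further down this page).
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker", rank=0, world_size=1)

# Each top-level child of the Sequential becomes one pipeline stage and must
# already live on its device; two GPUs are assumed here.
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4, checkpoint="never")

batch = torch.randn(8, 1024, device="cuda:0")
start = time.time()
out = model(batch)  # may be an RRef depending on the release; call .local_value() then
torch.cuda.synchronize("cuda:1")  # wait for the last stage before reading the clock
print(f"forward pass took {time.time() - start:.4f}s")

rpc.shutdown()
```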
💊 CI failures summary and remediations

As of commit 978ef60 (more details on the Dr. CI page):

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
ghstack-source-id: 118864113
Pull Request resolved: #49577
@@ -0,0 +1,284 @@
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
Nit: Not sure what the guidelines are around these copyright headers, but it seems like only the pipeline files have them, and no other files in OSS PyTorch (i.e. the caffe2/torch directory) do.
I think we probably need it only for the original torchgpipe files; I'll remove it for this one.
num_params = reduce(operator.add, (reduce(operator.mul, x.size()) for x in model.parameters()))  # total element count across all parameter tensors
logging.info(f"training model, #params = {num_params}")
vocab_size = 10000  # FIXME
Nit: What needs to be fixed here?
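An aside on the parameter-count expression in the snippet above: an equivalent and more idiomatic count (a suggestion, not part of the PR) uses `Tensor.numel()`, and should reproduce the "Number of parameters for model: 292833040" line from the sample output.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Same quantity as the nested reduce(operator.add, ...) expression above:
    # the total number of elements across all parameter tensors.
    return sum(p.numel() for p in model.parameters())
```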
cur_loss = total_loss / log_interval
elapsed = time.time() - start_time
print(
    "| batch {:5d} | wps {:5.2f} | loss {:5.2f} | ppl {:8.2f}".format(
Curious, what's the interpretation of pps here?
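For context (a hedged reading from the code and the sample output): wps here appears to be words per second and ppl perplexity, i.e. exp of the average loss; the sample output is consistent with this (exp(25.98) ≈ 1.9e11). A minimal sketch of the logging math, where `nwords` (tokens processed since the last log) and `batch_number` are assumed names and only `cur_loss`, `elapsed`, `total_loss`, `log_interval`, and `start_time` appear in the snippet itself:

```python
import math
import time

def log_progress(batch_number, total_loss, log_interval, nwords, start_time):
    cur_loss = total_loss / log_interval  # average loss over the last log_interval batches
    elapsed = time.time() - start_time    # wall-clock seconds since the previous log
    wps = nwords / elapsed                # words (tokens) processed per second
    ppl = math.exp(cur_loss)              # perplexity = exp(average cross-entropy loss)
    print("| batch {:5d} | wps {:5.2f} | loss {:5.2f} | ppl {:8.2f}".format(
        batch_number, wps, cur_loss, ppl))
```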
os.environ.update({"MASTER_ADDR" : args.host}) | ||
os.environ.update({"MASTER_PORT" : "10638"}) | ||
|
||
rpc.init_rpc( |
Why do we need to initialize RPC with a world_size of 1 if there's no RPC/cross-host communication being used here?
LGTM, although I'm curious why we need to init RPC here
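For context, a hedged note on the questions above: `torch.distributed.pipeline.sync.Pipe` is built on top of the distributed RPC framework and refuses to construct unless RPC is initialized, even when the whole pipeline runs in a single process on local GPUs; `world_size=1` simply registers the one local worker. A minimal sketch of that setup, with the worker name as an assumption and the port taken from the snippet above:

```python
import os

import torch.distributed.rpc as rpc

def init_rpc_for_pipe(host: str = "localhost") -> None:
    # Pipe requires the RPC framework to be up before construction, even for a
    # purely local, single-process pipeline; world_size=1 means "just this worker".
    os.environ.setdefault("MASTER_ADDR", host)
    os.environ.setdefault("MASTER_PORT", "10638")  # port used in the snippet above
    rpc.init_rpc("worker", rank=0, world_size=1)

# ... build the nn.Sequential stages, wrap them in Pipe, run the benchmark,
# then call rpc.shutdown() when done.
```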
Pull Request resolved: #49577 (ghstack-source-id: 118939686)
This pull request has been merged in 159de1f.
Summary: Pull Request resolved: pytorch#49577

Repurposing the benchmarking from https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py and pulling in a stripped-down version of the benchmark into PyTorch. Sample output: same as in the PR description above.

ghstack-source-id: 118939686

Test Plan: sentinel

Reviewed By: rohan-varma

Differential Revision: D25628721

fbshipit-source-id: 41c788eed4f852aef019aec18a84cb25ad254f3a