Conversation

@pritamdamania87 (Contributor) commented Dec 18, 2020

Stack from ghstack:

This PR repurposes the benchmark from
https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py,
pulling a stripped-down version of it into PyTorch.

Sample output:
```
Running benchmark with args: Namespace(batch_size=8, checkpoint='never', chunks=4, host='localhost', max_batch=10, num_decoder_layers=10, num_devices=4)
Number of parameters for model: 292833040
| batch     1 | wps 3593.07 | loss 25.98 | ppl 192556591553.37
| batch     2 | wps 4405.16 | loss 19.36 | ppl 256201548.33
| batch     3 | wps 4404.98 | loss 23.56 | ppl 17111244076.37
| batch     4 | wps 4413.25 | loss 27.11 | ppl 594561327825.83
| batch     5 | wps 4408.53 | loss 25.92 | ppl 181277705101.33
| batch     6 | wps 4385.64 | loss 24.92 | ppl 66592883598.50
| batch     7 | wps 4434.11 | loss 24.75 | ppl 56113635884.68
| batch     8 | wps 4441.25 | loss 24.88 | ppl 63666024212.82
| batch     9 | wps 4425.49 | loss 25.35 | ppl 101959669008.98
| batch    10 | wps 4421.05 | loss 25.34 | ppl 101597621863.94
Peak memory usage for GPUs: cuda:0: 2.38GiB, cuda:1: 3.04GiB, cuda:2: 3.04GiB, cuda:3: 3.67GiB,
```
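
For context, here is a minimal sketch of how a model is wrapped in `torch.distributed.pipeline.sync.Pipe` with the `chunks=4` and `checkpoint='never'` settings from the run above. This is not the benchmark script itself; the layer sizes, worker name, and port are illustrative, and it assumes two visible GPUs:

```python
# Minimal illustrative sketch, NOT the benchmark script.
import os

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"   # illustrative port
# Pipe uses the RPC framework internally (its forward returns an RRef),
# so RPC must be initialized even for a single-process run.
rpc.init_rpc("worker", rank=0, world_size=1)

# Each pipeline stage is a module placed on its own device.
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4, checkpoint="never")

x = torch.randn(8, 1024, device="cuda:0")   # batch_size=8, as in the run above
out = model(x).local_value()                # forward returns an RRef
print(out.shape)

rpc.shutdown()
```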

Differential Revision: [D25628721](https://our.internmc.facebook.com/intern/diff/D25628721/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25628721/)!

[ghstack-poisoned]
@facebook-github-bot added the `cla signed` and `oncall: distributed` labels Dec 18, 2020
@facebook-github-bot (Contributor) commented Dec 18, 2020

💊 CI failures summary and remediations

As of commit 978ef60 (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 1/5 non-CircleCI failure(s)

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.

```
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 978ef60b46 Update on "Add benchmark for torch.distributed.pipeline.sync.Pipe"
+ git reset --hard 978ef60b46178247f819f50b9b51d09651132c4e
HEAD is now at 978ef60b46 Update on "Add benchmark for torch.distributed.pipeline.sync.Pipe"
+ git merge --allow-unrelated-histories --no-edit --no-ff fb755ad33ef84844e4f4c2d285b71737316814f8
CONFLICT (add/add): Merge conflict in torch/testing/_internal/distributed/pipe_with_ddp_test.py
Auto-merging torch/testing/_internal/distributed/pipe_with_ddp_test.py
Auto-merging test/distributed/test_c10d.py
Automatic merge failed; fix conflicts and then commit the result.

Exited with code exit status 1
```

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (2/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.

```
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 978ef60b46 Update on "Add benchmark for torch.distributed.pipeline.sync.Pipe"
+ git reset --hard 978ef60b46178247f819f50b9b51d09651132c4e
HEAD is now at 978ef60b46 Update on "Add benchmark for torch.distributed.pipeline.sync.Pipe"
+ git merge --allow-unrelated-histories --no-edit --no-ff fb755ad33ef84844e4f4c2d285b71737316814f8
CONFLICT (add/add): Merge conflict in torch/testing/_internal/distributed/pipe_with_ddp_test.py
Auto-merging torch/testing/_internal/distributed/pipe_with_ddp_test.py
Auto-merging test/distributed/test_c10d.py
Automatic merge failed; fix conflicts and then commit the result.

Exited with code exit status 1
```

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (3/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

RuntimeError: test_nn failed!
```
  test_ReflectionPad2d_alert_nondeterministic_cuda (__main__.TestNN) ... ok (0.015s)
  test_ReflectionPad2d_cuda (__main__.TestNN) ... ok (0.014s)
  test_ReplicationPad1d (__main__.TestNN) ... ok (0.064s)
  test_ReplicationPad1d_alert_nondeterministic_cuda (__main__.TestNN) ... ok (0.010s)
  test_ReplicationPad1d_cuda (__main__.TestNN) ... No data to combine
Traceback (most recent call last):
  File "run_test.py", line 905, in <module>
    main()
  File "run_test.py", line 888, in main
    raise RuntimeError(err_message)
RuntimeError: test_nn failed!

(base) circleci@PACKER-5FD865C5 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1
+ cleanup
+ retcode=1
+ set +x

Exited with code exit status 1
```

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (4/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.

```
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 978ef60b46 Update on "Add benchmark for torch.distributed.pipeline.sync.Pipe"
+ git reset --hard 978ef60b46178247f819f50b9b51d09651132c4e
HEAD is now at 978ef60b46 Update on "Add benchmark for torch.distributed.pipeline.sync.Pipe"
+ git merge --allow-unrelated-histories --no-edit --no-ff fb755ad33ef84844e4f4c2d285b71737316814f8
CONFLICT (add/add): Merge conflict in torch/testing/_internal/distributed/pipe_with_ddp_test.py
Auto-merging torch/testing/_internal/distributed/pipe_with_ddp_test.py
Auto-merging test/distributed/test_c10d.py
Automatic merge failed; fix conflicts and then commit the result.

Exited with code exit status 1
```


Extra GitHub checks: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

This comment has been revised 10 times.

pritamdamania87 pushed a commit that referenced this pull request Dec 18, 2020
ghstack-source-id: 118864113
Pull Request resolved: #49577
@pritamdamania87 changed the title from "Add benchmakr for torch.distributed.pipeline.sync.Pipe" to "Add benchmark for torch.distributed.pipeline.sync.Pipe" Dec 18, 2020
```
@@ -0,0 +1,284 @@
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
```
Contributor (reviewer):

Nit: Not sure what the guidelines around these copyrights are, but it seems like only the pipeline files have this, and no other files in OSS PyTorch (i.e., the caffe2/torch directory).

Contributor Author (@pritamdamania87):

I think we probably need it only for the original torchgpipe files; will remove it for this one.


```python
num_params = reduce(operator.add, (reduce(operator.mul, x.size()) for x in model.parameters()))
logging.info(f"training model, #prams = {num_params}")
vocab_size = 10000 # FIXME
```
Contributor (reviewer):

Nit: What needs to be fixed here?
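
As an aside on the snippet above, the nested `reduce` expression is an element count over all model parameters; the `numel()`-based form below is equivalent. The stand-in `nn.Linear` module is purely illustrative (the benchmark builds a much larger decoder model):

```python
import operator
from functools import reduce

import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in module, just for illustration

# Style used in the snippet above: multiply each parameter's dims, then sum.
num_params = reduce(operator.add, (reduce(operator.mul, x.size()) for x in model.parameters()))
# Equivalent, more idiomatic form.
assert num_params == sum(p.numel() for p in model.parameters())
print(f"Number of parameters for model: {num_params}")
```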

```python
cur_loss = total_loss / log_interval
elapsed = time.time() - start_time
print(
    "| batch {:5d} | wps {:5.2f} | loss {:5.2f} | ppl {:8.2f}".format(
```
Contributor (reviewer):

Curious, what's the interpretation of pps here?
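
For reference, the format string in the context above prints wps and ppl (there is no pps column). In fairscale-style language-model benchmarks, wps is usually words (tokens) processed per second and ppl is perplexity, i.e. exp(loss), which is consistent with the sample output (exp(25.98) is roughly 1.9e11). A minimal sketch of how such a log line is typically produced; the variable names and values here are illustrative:

```python
import math
import time

# Illustrative values; in the benchmark these come from the training loop.
log_interval = 1
total_loss = 25.98
word_counter = 28672             # tokens processed since the last log line
start_time = time.time() - 7.98  # pretend ~8 seconds elapsed

cur_loss = total_loss / log_interval
elapsed = time.time() - start_time
print(
    "| batch {:5d} | wps {:5.2f} | loss {:5.2f} | ppl {:8.2f}".format(
        1, word_counter / elapsed, cur_loss, math.exp(cur_loss)
    )
)
```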

```python
os.environ.update({"MASTER_ADDR" : args.host})
os.environ.update({"MASTER_PORT" : "10638"})

rpc.init_rpc(
```
Contributor (reviewer):

Why do we need to initialize RPC with a world_size of 1 if there's no RPC/cross host communication being used here?
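
One likely reason, for context: Pipe is built on top of the RPC framework (its forward returns an RRef), so RPC has to be initialized even when everything runs in a single process on a single host. A minimal sketch of that boilerplate, mirroring the snippet above; the worker name is illustrative and the port matches the hard-coded value in the context:

```python
import os

from torch.distributed import rpc

os.environ["MASTER_ADDR"] = "localhost"  # args.host in the benchmark
os.environ["MASTER_PORT"] = "10638"

# Single process, single host: rank 0 out of a world of 1.
rpc.init_rpc("worker", rank=0, world_size=1)

# ... build the Pipe model and run the training loop here ...

rpc.shutdown()
```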

@rohan-varma (Contributor) left a comment:

LGTM, although I'm curious why we need to init RPC here

pritamdamania87 pushed a commit that referenced this pull request Dec 18, 2020
Pull Request resolved: #49577

ghstack-source-id: 118939686

@facebook-github-bot (Contributor):
This pull request has been merged in 159de1f.

@facebook-github-bot deleted the gh/pritamdamania87/193/head branch December 22, 2020 15:17
hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 6, 2021
Summary:
Pull Request resolved: pytorch#49577

ghstack-source-id: 118939686

Test Plan: sentinel

Reviewed By: rohan-varma

Differential Revision: D25628721

fbshipit-source-id: 41c788eed4f852aef019aec18a84cb25ad254f3a