Add benchmark for torch.distributed.pipeline.sync.Pipe #49577
Conversation
Repurposing the benchmarking from https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py and pulling in a stripped-down version of the benchmark into PyTorch.

Sample output:

```
Running benchmark with args: Namespace(batch_size=8, checkpoint='never', chunks=4, host='localhost', max_batch=10, num_decoder_layers=10, num_devices=4)
Number of parameters for model: 292833040
| batch 1 | wps 3593.07 | loss 25.98 | ppl 192556591553.37
| batch 2 | wps 4405.16 | loss 19.36 | ppl 256201548.33
| batch 3 | wps 4404.98 | loss 23.56 | ppl 17111244076.37
| batch 4 | wps 4413.25 | loss 27.11 | ppl 594561327825.83
| batch 5 | wps 4408.53 | loss 25.92 | ppl 181277705101.33
| batch 6 | wps 4385.64 | loss 24.92 | ppl 66592883598.50
| batch 7 | wps 4434.11 | loss 24.75 | ppl 56113635884.68
| batch 8 | wps 4441.25 | loss 24.88 | ppl 63666024212.82
| batch 9 | wps 4425.49 | loss 25.35 | ppl 101959669008.98
| batch 10 | wps 4421.05 | loss 25.34 | ppl 101597621863.94
Peak memory usage for GPUs: cuda:0: 2.38GiB, cuda:1: 3.04GiB, cuda:2: 3.04GiB, cuda:3: 3.67GiB,
```

Differential Revision: [D25628721](https://our.internmc.facebook.com/intern/diff/D25628721/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25628721/)!

[ghstack-poisoned]
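For context, a minimal sketch of how a model is typically wrapped in `torch.distributed.pipeline.sync.Pipe` and timed; this is not the benchmark script itself. The two-stage split, layer sizes, worker name, port, and timing loop are illustrative assumptions, while `chunks=4` and `checkpoint="never"` mirror the Namespace in the sample output above.

```python
import os
import time

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe is built on the RPC framework, so RPC must be initialized even for a
# single-process, multi-GPU run (see the discussion further down this page).
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker", rank=0, world_size=1)

# Each top-level child of the Sequential becomes one pipeline stage and must
# already live on its device; two GPUs are assumed here.
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4, checkpoint="never")

batch = torch.randn(8, 1024, device="cuda:0")
start = time.time()
out = model(batch)  # may be an RRef depending on the release; call .local_value() then
torch.cuda.synchronize("cuda:1")  # wait for the last stage before reading the clock
print(f"forward pass took {time.time() - start:.4f}s")

rpc.shutdown()
```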
💊 CI failures summary and remediations

As of commit 978ef60 (more details on the Dr. CI page):

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
ghstack-source-id: 118864113
Pull Request resolved: #49577
@@ -0,0 +1,284 @@
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
Nit: Not sure what the guidelines are around these copyright headers, but it seems like only the pipeline files have them, and no other files in OSS PyTorch (i.e. the caffe2/torch directory) do.
I think we probably need it only for the original torchgpipe files; I'll remove it for this one.
num_params = reduce(operator.add, (reduce(operator.mul, x.size()) for x in model.parameters()))  # total element count across all parameter tensors
logging.info(f"training model, #params = {num_params}")
vocab_size = 10000  # FIXME
Nit: What needs to be fixed here?
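An aside on the parameter-count expression in the snippet above: an equivalent and more idiomatic count (a suggestion, not part of the PR) uses `Tensor.numel()`, and should reproduce the "Number of parameters for model: 292833040" line from the sample output.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Same quantity as the nested reduce(operator.add, ...) expression above:
    # the total number of elements across all parameter tensors.
    return sum(p.numel() for p in model.parameters())
```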
cur_loss = total_loss / log_interval
elapsed = time.time() - start_time
print(
    "| batch {:5d} | wps {:5.2f} | loss {:5.2f} | ppl {:8.2f}".format(
Curious, what's the interpretation of pps here?
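For context (a hedged reading from the code and the sample output): wps here appears to be words per second and ppl perplexity, i.e. exp of the average loss; the sample output is consistent with this (exp(25.98) ≈ 1.9e11). A minimal sketch of the logging math, where `nwords` (tokens processed since the last log) and `batch_number` are assumed names and only `cur_loss`, `elapsed`, `total_loss`, `log_interval`, and `start_time` appear in the snippet itself:

```python
import math
import time

def log_progress(batch_number, total_loss, log_interval, nwords, start_time):
    cur_loss = total_loss / log_interval  # average loss over the last log_interval batches
    elapsed = time.time() - start_time    # wall-clock seconds since the previous log
    wps = nwords / elapsed                # words (tokens) processed per second
    ppl = math.exp(cur_loss)              # perplexity = exp(average cross-entropy loss)
    print("| batch {:5d} | wps {:5.2f} | loss {:5.2f} | ppl {:8.2f}".format(
        batch_number, wps, cur_loss, ppl))
```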
os.environ.update({"MASTER_ADDR" : args.host}) | ||
os.environ.update({"MASTER_PORT" : "10638"}) | ||
|
||
rpc.init_rpc( |
Why do we need to initialize RPC with a world_size of 1 if there's no RPC/cross-host communication being used here?
LGTM, although I'm curious why we need to init RPC here
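For context, a hedged note on the questions above: `torch.distributed.pipeline.sync.Pipe` is built on top of the distributed RPC framework and refuses to construct unless RPC is initialized, even when the whole pipeline runs in a single process on local GPUs; `world_size=1` simply registers the one local worker. A minimal sketch of that setup, with the worker name as an assumption and the port taken from the snippet above:

```python
import os

import torch.distributed.rpc as rpc

def init_rpc_for_pipe(host: str = "localhost") -> None:
    # Pipe requires the RPC framework to be up before construction, even for a
    # purely local, single-process pipeline; world_size=1 means "just this worker".
    os.environ.setdefault("MASTER_ADDR", host)
    os.environ.setdefault("MASTER_PORT", "10638")  # port used in the snippet above
    rpc.init_rpc("worker", rank=0, world_size=1)

# ... build the nn.Sequential stages, wrap them in Pipe, run the benchmark,
# then call rpc.shutdown() when done.
```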
Pull Request resolved: #49577 (ghstack-source-id: 118939686)
This pull request has been merged in 159de1f.
Summary: Pull Request resolved: pytorch#49577

Repurposing the benchmarking from https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py and pulling in a stripped-down version of the benchmark into PyTorch. Sample output: same as in the PR description above.

ghstack-source-id: 118939686

Test Plan: sentinel

Reviewed By: rohan-varma

Differential Revision: D25628721

fbshipit-source-id: 41c788eed4f852aef019aec18a84cb25ad254f3a