Benchmark combining Distributed Data Parallel and Distributed RPC #46993
Conversation
This pull request was exported from Phabricator. Differential Revision: D24409230
💊 CI failures summary and remediations: as of commit 9d75306 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI.)
Awesome stuff, the overall benchmark looks great! I have a few minor comments inline.
@@ -0,0 +1,68 @@
# Benchmark combining Distributed Data Parallel and Distributed RPC
nit: since we might add other files to the benchmark directory, maybe we should create a ddp_rpc directory under this folder and put the README and the benchmark file under that folder.
7) Finally, the [Distributed Optimizer](https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim) is used to update all parameters.

## Example Benchmark output:
Probably add an example here of how to actually run the benchmark?
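As a rough illustration of what such a run example could cover, here is a minimal launcher sketch; the entry point, world size, and worker naming below are assumptions for illustration, not the benchmark's actual interface.

```python
# Hypothetical launcher sketch (not the benchmark's real entry point): spawn one
# process per role and initialize the RPC framework in each of them.
import os
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

WORLD_SIZE = 4  # e.g. 1 master + trainers + parameter servers (assumed)

def run_worker(rank):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=WORLD_SIZE)
    # The master (rank 0) would create the embedding tables and drive training;
    # the other ranks simply serve RPC requests until shutdown.
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(run_worker, nprocs=WORLD_SIZE, join=True)
```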
measurements = []
# Include warm-up cycles during training
for epoch in range(100 + WARMUP_CYCLES):
How long does this benchmark take to run? I'd suggest maybe 1000 iterations and 100 WARMUP_CYCLES as the default if it doesn't take too long to run.
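For context, a sketch of the warm-up pattern being discussed; the iteration counts, timing source, and percentile reporting below are illustrative assumptions, not the file's actual code.

```python
# Time each iteration, then drop the warm-up cycles before reporting percentiles.
import time
import numpy as np

WARMUP_CYCLES = 100
NUM_ITERATIONS = 1000  # suggested default under discussion

measurements = []
for epoch in range(NUM_ITERATIONS + WARMUP_CYCLES):
    start = time.time()
    # ... one training iteration (forward, backward, optimizer step) ...
    measurements.append(time.time() - start)

steady_state = measurements[WARMUP_CYCLES:]  # exclude warm-up from the report
p50, p75, p90, p95 = np.percentile(steady_state, [50, 75, 90, 95])
print(f"p50: {p50:.3f}s p75: {p75:.3f}s p90: {p90:.3f}s p95: {p95:.3f}s")
```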
------------------ PyTorch Distributed Benchmark (DDP and RPC) ---------------------
sec/iter ex/sec sec/iter ex/sec sec/iter ex/sec sec/iter ex/sec
These should be secs/epoch and epochs/sec, right?
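For clarity on how the two reported columns relate, the throughput figure is just the batch size divided by the per-iteration latency; the batch size below is an assumed value, not necessarily the benchmark's default.

```python
# Relationship between the latency and throughput columns (BATCH_SIZE is assumed).
BATCH_SIZE = 70
sec_per_iter = 0.377                          # e.g. a p50 iteration latency
examples_per_sec = BATCH_SIZE / sec_per_iter  # ~185 examples/sec
```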
        super(HybridModel, self).__init__()
        self.emb_rref_list = emb_rref_list
        self.fc = DDP(
            torch.nn.Linear(NUM_PS * EMBEDDING_DIM, 8).cuda(device), device_ids=[device]
It would be better to have Linear layers of constant size and then reshape the output from `torch.cat` to match the first linear layer's shape.
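A rough sketch of this suggestion, assuming each parameter server returns a lookup of shape (batch, EMBEDDING_DIM); the fixed layer width FC_IN and the tensor sizes are made-up values for illustration.

```python
# Keep the first Linear layer at a fixed width and reshape the concatenated
# lookups to match it, instead of sizing the layer as NUM_PS * EMBEDDING_DIM.
import torch
import torch.nn as nn

NUM_PS, EMBEDDING_DIM, FC_IN, BATCH = 8, 16, 64, 32  # assumed sizes
first_fc = nn.Linear(FC_IN, 8)  # constant size, independent of NUM_PS

# One (BATCH, EMBEDDING_DIM) lookup result per parameter server.
emb_lookups = [torch.randn(BATCH, EMBEDDING_DIM) for _ in range(NUM_PS)]
cat_out = torch.cat(emb_lookups, dim=1)  # (BATCH, NUM_PS * EMBEDDING_DIM)
fc_in = cat_out.reshape(-1, FC_IN)       # fold the extra width into the batch dim
out = first_fc(fc_in)                    # (BATCH * NUM_PS * EMBEDDING_DIM / FC_IN, 8)
```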
        super(HybridModel, self).__init__()
        self.emb_rref_list = emb_rref_list
        self.fc = DDP(
            torch.nn.Linear(NUM_PS * EMBEDDING_DIM, 8).cuda(device), device_ids=[device]
We should also have maybe 3-5 linear layers stacked one after the other to simulate an MLP.
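A minimal sketch of what that could look like, assuming the same input width as the current snippet; the hidden layer sizes are arbitrary.

```python
# Stack a few Linear + ReLU layers to simulate an MLP instead of a single FC layer.
import torch.nn as nn

NUM_PS, EMBEDDING_DIM = 8, 16  # defaults from the snippet above

mlp = nn.Sequential(
    nn.Linear(NUM_PS * EMBEDDING_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 8),
)
# In the HybridModel this would replace the single Linear, e.g.:
# self.fc = DDP(mlp.cuda(device), device_ids=[device])
```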
NUM_EMBEDDINGS = 100
EMBEDDING_DIM = 16
Maybe keep the default as 1000000 x 64 since this is a benchmark and not a test.
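For scale, a sketch of an embedding table with the suggested defaults; the choice of `nn.EmbeddingBag` and its options is an assumption about the benchmark's setup, only the 1000000 x 64 sizing comes from the comment above.

```python
# ~1M rows x 64 dims => roughly 64M parameters (~256 MB in fp32) per table.
import torch.nn as nn

NUM_EMBEDDINGS = 1_000_000  # suggested benchmark-scale default
EMBEDDING_DIM = 64

emb = nn.EmbeddingBag(NUM_EMBEDDINGS, EMBEDDING_DIM, mode="sum", sparse=True)
```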
Benchmark combining Distributed Data Parallel and Distributed RPC (pytorch#46993)

Summary:
Pull Request resolved: pytorch#46993

Introducing benchmark that combines Distributed Data Parallelism with Distributed Model Parallelism. The benchmark measures distributed training iteration time. The number of trainer nodes and parameter servers are configurable. The default setup has 8 trainers, 1 master node and 8 parameter servers.

The training process is executed as follows:
1) The master creates embedding tables on each of the 8 Parameter Servers and holds an RRef to it.
2) The master, then kicks off the training loop on the 8 trainers and passes the embedding table RRef to the trainers.
3) The trainers create a `HybridModel` which performs embedding lookups in all 8 Parameter Servers using the embedding table RRef provided by the master and then executes the FC layer which is wrapped and replicated via DDP (DistributedDataParallel).
4) The trainer executes the forward pass of the model and uses the loss to execute the backward pass using Distributed Autograd.
5) As part of the backward pass, the gradients for the FC layer are computed first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients to the parameter servers, where the gradients for the embedding table are updated.
7) Finally, the Distributed Optimizer is used to update all parameters.

Test Plan: waitforbuildbot

Benchmark output:

---------- Info ---------
* PyTorch version: 1.7.0
* CUDA version: 9.2.0

---------- nvidia-smi topo -m ---------
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0     X    NV2   NV1   NV2   NV1   NODE  NODE  NODE  0-19,40-59
GPU1    NV2    X    NV2   NV1   NODE  NV1   NODE  NODE  0-19,40-59
GPU2    NV1   NV2    X    NV1   NODE  NODE  NV2   NODE  0-19,40-59
GPU3    NV2   NV1   NV1    X    NODE  NODE  NODE  NV2   0-19,40-59
GPU4    NV1   NODE  NODE  NODE   X    NV2   NV1   NV2   0-19,40-59
GPU5    NODE  NV1   NODE  NODE  NV2    X    NV2   NV1   0-19,40-59
GPU6    NODE  NODE  NV2   NODE  NV1   NV2    X    NV1   0-19,40-59
GPU7    NODE  NODE  NODE  NV2   NV2   NV1   NV1    X    0-19,40-59

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

------------------ PyTorch Distributed Benchmark (DDP and RPC) ---------------------
              sec/iter   ex/sec        sec/iter   ex/sec        sec/iter   ex/sec        sec/iter   ex/sec
Trainer0: p50:  0.376s    185/s   p75:  0.384s    182/s   p90:  0.390s    179/s   p95:  0.396s    176/s
Trainer1: p50:  0.377s    204/s   p75:  0.384s    200/s   p90:  0.389s    197/s   p95:  0.393s    195/s
Trainer2: p50:  0.377s    175/s   p75:  0.384s    172/s   p90:  0.390s    169/s   p95:  0.395s    166/s
Trainer3: p50:  0.377s    161/s   p75:  0.384s    158/s   p90:  0.390s    156/s   p95:  0.393s    155/s
Trainer4: p50:  0.377s    172/s   p75:  0.383s    169/s   p90:  0.389s    166/s   p95:  0.395s    164/s
Trainer5: p50:  0.377s    180/s   p75:  0.383s    177/s   p90:  0.389s    174/s   p95:  0.395s    172/s
Trainer6: p50:  0.377s    204/s   p75:  0.384s    200/s   p90:  0.390s    197/s   p95:  0.394s    195/s
Trainer7: p50:  0.377s    185/s   p75:  0.384s    182/s   p90:  0.389s    179/s   p95:  0.394s    177/s
     All: p50:  0.377s   1470/s   p75:  0.384s   1443/s   p90:  0.390s   1421/s   p95:  0.396s   1398/s

Differential Revision: D24409230

fbshipit-source-id: a9da9cfec013af6209f036dccca0361cc9305b0c
This pull request was exported from Phabricator. Differential Revision: D24409230
5fe4158 to 9d75306 (Compare)
Codecov Report
@@ Coverage Diff @@
## master #46993 +/- ##
=======================================
Coverage 60.81% 60.81%
=======================================
Files 2749 2749
Lines 254098 254098
=======================================
+ Hits 154522 154533 +11
+ Misses 99576 99565 -11
Hey @BorisValkov5, thanks for adding this. I have two more comments:
- Shall we combine this `torch/distributed/benchmark` folder with the `benchmarks/distributed/` folder? It might be a little confusing to have two different benchmark folders.
- Is it possible to add tests for the benchmarks, so that CI can help us make sure that our benchmarks won't be broken by future changes?
Thanks @mrshenli for your comments! Land in progress - will address them in a subsequent change.
This pull request has been merged in 6c5a1c5.
Summary:
Introducing benchmark that combines Distributed Data Parallelism with Distributed Model Parallelism. The benchmark measures distributed training iteration time. The number of trainer nodes and parameter servers are configurable. The default setup has 8 trainers, 1 master node and 8 parameter servers.
The training process is executed as follows:
1) The master creates embedding tables on each of the 8 Parameter Servers and holds an RRef to it.
2) The master then kicks off the training loop on the 8 trainers and passes the embedding table RRef to the trainers.
3) The trainers create a `HybridModel` which performs embedding lookups in all 8 Parameter Servers using the embedding table RRef provided by the master and then executes the FC layer which is wrapped and replicated via DDP (DistributedDataParallel).
4) The trainer executes the forward pass of the model and uses the loss to execute the backward pass using Distributed Autograd.
5) As part of the backward pass, the gradients for the FC layer are computed first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients to the parameter servers, where the gradients for the embedding table are updated.
7) Finally, the Distributed Optimizer is used to update all parameters.
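To make the steps above concrete, here is a condensed editorial sketch of the trainer side; the module structure, shapes, loss, and optimizer choice are illustrative assumptions rather than the benchmark's exact code.

```python
# Sketch of steps 3)-7): remote embedding lookups via RPC, a DDP-wrapped dense
# part, distributed autograd for the backward pass, and a DistributedOptimizer.
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.distributed.optim import DistributedOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

EMBEDDING_DIM = 16  # assumed per-server embedding width


class HybridModel(torch.nn.Module):
    def __init__(self, emb_rref_list, device):
        super().__init__()
        self.emb_rref_list = emb_rref_list  # one RRef per parameter server
        # The dense FC layer is replicated across trainers via DDP.
        self.fc = DDP(
            torch.nn.Linear(len(emb_rref_list) * EMBEDDING_DIM, 8).cuda(device),
            device_ids=[device],
        )
        self.device = device

    def forward(self, indices, offsets):
        # Step 3: embedding lookups run remotely on each parameter server.
        lookups = [
            rref.rpc_sync().forward(indices, offsets) for rref in self.emb_rref_list
        ]
        return self.fc(torch.cat(lookups, dim=1).cuda(self.device))


def make_optimizer(model, remote_param_rrefs):
    # Step 7: one DistributedOptimizer over RRefs to all parameters,
    # both the remote embedding tables and the local DDP parameters.
    param_rrefs = list(remote_param_rrefs)
    param_rrefs += [rpc.RRef(p) for p in model.fc.parameters()]
    return DistributedOptimizer(torch.optim.SGD, param_rrefs, lr=0.05)


def train_step(model, opt, indices, offsets, target):
    # Steps 4-6: forward pass, then distributed autograd drives the backward pass
    # (DDP allreduce for the FC gradients, RPC to reach the embedding tables).
    with dist_autograd.context() as context_id:
        output = model(indices, offsets)
        loss = torch.nn.functional.cross_entropy(output, target)
        dist_autograd.backward(context_id, [loss])
        opt.step(context_id)
```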
Test Plan:
waitforbuildbot
Benchmark output:
---------- Info ---------
---------- nvidia-smi topo -m ---------
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
------------------ PyTorch Distributed Benchmark (DDP and RPC) ---------------------
Differential Revision: D24409230