
Benchmark combining Distributed Data Parallel and Distributed RPC #46993

Closed

Conversation

BorisValkov5 (Contributor):

Summary:
Introducing a benchmark that combines Distributed Data Parallelism with Distributed Model Parallelism. The benchmark measures distributed training iteration time. The numbers of trainer nodes and parameter servers are configurable. The default setup has 8 trainers, 1 master node, and 8 parameter servers.

The training process is executed as follows (a rough code sketch of the trainer side appears after this list):

  1. The master creates an embedding table on each of the 8 Parameter Servers and holds an RRef to each of them.
  2. The master then kicks off the training loop on the 8 trainers and passes the embedding table RRefs to the trainers.
  3. The trainers create a HybridModel, which performs embedding lookups on all 8 Parameter Servers using the embedding table RRefs provided by the master and then executes the FC layer, which is wrapped and replicated via DDP (DistributedDataParallel).
  4. The trainer executes the forward pass of the model and uses the loss to
    execute the backward pass using Distributed Autograd.
  5. As part of the backward pass, the gradients for the FC layer are computed
    first and synced to all trainers via allreduce in DDP.
  6. Next, Distributed Autograd propagates the gradients to the parameter servers,
    where the gradients for the embedding table are updated.
  7. Finally, the Distributed Optimizer is used to update all parameters.
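
The steps above map onto the public torch.distributed.rpc, DistributedDataParallel, Distributed Autograd, and DistributedOptimizer APIs. The following is a rough trainer-side sketch of that flow, modeled on the public PyTorch DDP+RPC tutorial rather than the benchmark's actual code; NUM_PS, EMBEDDING_DIM, NUM_EMBEDDINGS, BATCH_SIZE, the random batch construction, and the parameter-RRef helper are illustrative assumptions.

```python
# Rough trainer-side sketch (not the benchmark's code). Assumes the trainer
# process group for DDP has already been created with init_process_group and
# that init_rpc has been called; emb_rref_list holds RRefs to nn.EmbeddingBag
# modules living on the parameter servers.
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.nn as nn
import torch.optim as optim
from torch.distributed.optim import DistributedOptimizer
from torch.distributed.rpc import RRef
from torch.nn.parallel import DistributedDataParallel as DDP

NUM_PS, NUM_EMBEDDINGS, EMBEDDING_DIM, BATCH_SIZE = 8, 100, 16, 32  # assumed defaults


class HybridModel(nn.Module):
    """Embedding lookups over RPC on the parameter servers; dense FC layer under DDP."""

    def __init__(self, emb_rref_list, device):
        super().__init__()
        self.emb_rref_list = emb_rref_list
        self.device = device
        # Only the dense part is replicated and allreduced by DDP.
        self.fc = DDP(
            nn.Linear(NUM_PS * EMBEDDING_DIM, 8).cuda(device), device_ids=[device]
        )

    def forward(self, indices, offsets):
        # One remote lookup per parameter server, concatenated locally.
        lookups = [
            emb_rref.rpc_sync().forward(indices, offsets)
            for emb_rref in self.emb_rref_list
        ]
        return self.fc(torch.cat(lookups, dim=1).cuda(self.device))


def _retrieve_parameter_rrefs(emb_rref):
    # Runs on the parameter server that owns the embedding table.
    return [RRef(p) for p in emb_rref.local_value().parameters()]


def _run_trainer(emb_rref_list, rank):
    model = HybridModel(emb_rref_list, rank)

    # RRefs to every parameter (remote embedding tables + local FC weights) so a
    # single DistributedOptimizer.step() updates all of them (step 7).
    param_rrefs = []
    for emb_rref in emb_rref_list:
        param_rrefs.extend(
            rpc.rpc_sync(emb_rref.owner(), _retrieve_parameter_rrefs, args=(emb_rref,))
        )
    param_rrefs.extend(RRef(p) for p in model.fc.parameters())
    opt = DistributedOptimizer(optim.SGD, param_rrefs, lr=0.05)
    criterion = nn.CrossEntropyLoss()

    for _ in range(10):  # a handful of iterations, just to illustrate the loop
        indices = torch.randint(0, NUM_EMBEDDINGS, (BATCH_SIZE * 4,))
        offsets = torch.arange(0, BATCH_SIZE * 4, 4)
        target = torch.randint(0, 8, (BATCH_SIZE,)).cuda(rank)
        with dist_autograd.context() as context_id:
            loss = criterion(model(indices, offsets), target)
            # Steps 4-6: local backward (DDP allreduces the FC grads), then
            # Distributed Autograd follows the RPC edges back to the servers.
            dist_autograd.backward(context_id, [loss])
            opt.step(context_id)
```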

Test Plan:
waitforbuildbot

Benchmark output:

---------- Info ---------

  • PyTorch version: 1.7.0
  • CUDA version: 9.2.0

---------- nvidia-smi topo -m ---------

GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU     Affinity
GPU0     X      NV2     NV1     NV2     NV1     NODE    NODE    NODE    0-19,40-59
GPU1    NV2      X      NV2     NV1     NODE    NV1     NODE    NODE    0-19,40-59
GPU2    NV1     NV2      X      NV1     NODE    NODE    NV2     NODE    0-19,40-59
GPU3    NV2     NV1     NV1      X      NODE    NODE    NODE    NV2     0-19,40-59
GPU4    NV1     NODE    NODE    NODE     X      NV2     NV1     NV2     0-19,40-59
GPU5    NODE    NV1     NODE    NODE    NV2      X      NV2     NV1     0-19,40-59
GPU6    NODE    NODE    NV2     NODE    NV1     NV2      X      NV1     0-19,40-59
GPU7    NODE    NODE    NODE    NV2     NV2     NV1     NV1      X      0-19,40-59

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

------------------ PyTorch Distributed Benchmark (DDP and RPC) ---------------------

                sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
Trainer0:  p50:  0.376s     185/s  p75:  0.384s     182/s  p90:  0.390s     179/s  p95:  0.396s     176/s
Trainer1:  p50:  0.377s     204/s  p75:  0.384s     200/s  p90:  0.389s     197/s  p95:  0.393s     195/s
Trainer2:  p50:  0.377s     175/s  p75:  0.384s     172/s  p90:  0.390s     169/s  p95:  0.395s     166/s
Trainer3:  p50:  0.377s     161/s  p75:  0.384s     158/s  p90:  0.390s     156/s  p95:  0.393s     155/s
Trainer4:  p50:  0.377s     172/s  p75:  0.383s     169/s  p90:  0.389s     166/s  p95:  0.395s     164/s
Trainer5:  p50:  0.377s     180/s  p75:  0.383s     177/s  p90:  0.389s     174/s  p95:  0.395s     172/s
Trainer6:  p50:  0.377s     204/s  p75:  0.384s     200/s  p90:  0.390s     197/s  p95:  0.394s     195/s
Trainer7:  p50:  0.377s     185/s  p75:  0.384s     182/s  p90:  0.389s     179/s  p95:  0.394s     177/s
     All:  p50:  0.377s    1470/s  p75:  0.384s    1443/s  p90:  0.390s    1421/s  p95:  0.396s    1398/s

Differential Revision: D24409230

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D24409230

@dr-ci (bot) commented Oct 28, 2020:

💊 CI failures summary and remediations

As of commit 9d75306 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@pritamdamania87 (Contributor) left a comment:

Awesome stuff, the overall benchmark looks great! I have a few minor comments inline.

@@ -0,0 +1,68 @@
# Benchmark combining Distributed Data Parallel and Distributed RPC
Contributor:

nit: since we might add other files to the benchmark directory, maybe we should create a ddp_rpc directory under this folder and put the README and the benchmark file under that folder.

7) Finally, the [Distributed Optimizer](https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim) is used to update all parameters.


## Example Benchmark output:
Contributor:

Probably add an example here of how to actually run the benchmark?


measurements = []
# Include warm-up cycles during training
for epoch in range(100 + WARMUP_CYCLES):
Contributor:

How long does this benchmark take to run? I'd suggest maybe 1000 iterations and 100 WARMUP_CYCLES as the default if it doesn't take too long to run.
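
For reference, a minimal sketch of the measurement pattern being discussed, with warm-up iterations excluded before percentiles are computed; WARMUP_CYCLES, NUM_ITERS, BATCH_SIZE, and the per-iteration stub are assumed names, not the benchmark's actual code:

```python
# Minimal timing sketch (assumed names, not the benchmark's actual code).
import time
import numpy as np

WARMUP_CYCLES, NUM_ITERS, BATCH_SIZE = 100, 1000, 32


def run_one_training_iteration():
    # Stand-in for forward pass + distributed backward + optimizer step.
    time.sleep(0.01)


measurements = []
for it in range(NUM_ITERS + WARMUP_CYCLES):
    start = time.time()
    run_one_training_iteration()
    if it >= WARMUP_CYCLES:  # drop warm-up iterations from the statistics
        measurements.append(time.time() - start)

for p in (50, 75, 90, 95):
    sec_per_iter = float(np.percentile(measurements, p))
    print(f"p{p}: {sec_per_iter:.3f}s  {BATCH_SIZE / sec_per_iter:.0f}/s")
```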


------------------ PyTorch Distributed Benchmark (DDP and RPC) ---------------------

sec/iter ex/sec sec/iter ex/sec sec/iter ex/sec sec/iter ex/sec
Contributor:

These should be sec/epoch and epochs/sec, right?

super(HybridModel, self).__init__()
self.emb_rref_list = emb_rref_list
self.fc = DDP(
torch.nn.Linear(NUM_PS * EMBEDDING_DIM, 8).cuda(device), device_ids=[device]
Contributor:

It would be better to have Linear layers of constant size and then reshape the output from torch.cat to match the first linear layer's shape.

super(HybridModel, self).__init__()
self.emb_rref_list = emb_rref_list
self.fc = DDP(
torch.nn.Linear(NUM_PS * EMBEDDING_DIM, 8).cuda(device), device_ids=[device]
Contributor:

We should also have maybe 3-5 linear layers stacked one after the other to simulate a MLP.
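
A hedged sketch of how both suggestions above could look together: constant-size Linear layers fed by a reshape of the concatenated lookups, stacked into a small MLP. EMBEDDING_DIM, HIDDEN_DIM, the layer count, and the mean aggregation over per-server outputs are illustrative choices, not the merged code.

```python
# Sketch of the suggested dense head (assumed sizes, not the merged code).
import torch
import torch.nn as nn

EMBEDDING_DIM = 64   # assumed per-server embedding width
HIDDEN_DIM = 256     # hypothetical constant layer width


class DenseHead(nn.Module):
    def __init__(self, num_layers=4, num_classes=8):
        super().__init__()
        layers = [nn.Linear(EMBEDDING_DIM, HIDDEN_DIM), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers += [nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.ReLU()]
        layers.append(nn.Linear(HIDDEN_DIM, num_classes))
        self.mlp = nn.Sequential(*layers)

    def forward(self, cat_embeddings):
        # cat_embeddings: (batch, NUM_PS * EMBEDDING_DIM). Fold the PS dimension
        # into the batch so the first layer's input size stays EMBEDDING_DIM
        # regardless of how many parameter servers are configured.
        batch = cat_embeddings.shape[0]
        out = self.mlp(cat_embeddings.reshape(-1, EMBEDDING_DIM))
        # Un-fold and average the per-server outputs (one illustrative choice).
        return out.reshape(batch, -1, out.shape[-1]).mean(dim=1)
```

Inside HybridModel this head would then be wrapped the same way as the single Linear layer, e.g. self.fc = DDP(DenseHead().cuda(device), device_ids=[device]).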

Comment on lines 26 to 27
NUM_EMBEDDINGS = 100
EMBEDDING_DIM = 16
Contributor:

Maybe keep the default as 1000000 x 64 since this is a benchmark and not a test.
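
For scale (rough arithmetic, assuming fp32 weights): a 1,000,000 x 64 embedding table is about 1,000,000 * 64 * 4 bytes ≈ 256 MB per parameter server, whereas the 100 x 16 default is only about 6 KB, so the larger default better reflects a benchmark workload.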

…torch#46993)

Pull Request resolved: pytorch#46993
Differential Revision: D24409230
fbshipit-source-id: a9da9cfec013af6209f036dccca0361cc9305b0c
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D24409230

@codecov (bot) commented Nov 3, 2020:

Codecov Report

Merging #46993 into master will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #46993   +/-   ##
=======================================
  Coverage   60.81%   60.81%           
=======================================
  Files        2749     2749           
  Lines      254098   254098           
=======================================
+ Hits       154522   154533   +11     
+ Misses      99576    99565   -11     

@mrshenli (Contributor) left a comment:

Hey @BorisValkov5, thanks for adding this. I have two more comments:

  1. Shall we combine this torch/distributed/benchmark folder with the benchmarks/distributed/ folder? It might be a little confusing to have two different benchmark folders.
  2. Is it possible to add tests for the benchmarks, so that CI can help us make sure that they won't be broken by future changes?

@BorisValkov5 (Contributor, Author):

Thanks @mrshenli for your comments! Land in progress - will address them in a subsequent change.

@facebook-github-bot (Contributor):

This pull request has been merged in 6c5a1c5.

Labels: cla signed, fb-exported, Merged, oncall: distributed