DDP: coalescing many little broadcasts to improve performance #4978
Conversation
I'm curious how much time this coalesced broadcast saves, and what the overhead of packing/unpacking the tensors is. Do they have to be moved into a new contiguous block of memory? I'm testing DDP on a single node because of the GIL in DP, but failed to get a reasonable speedup. Could you please share your system setup for running DDP?
Force-pushed from c3adf44 to 85aab24
@@ -188,6 +197,34 @@ def train(self, mode=True):
        for module in self._module_copies[1:]:
            module.train(mode)

    def _dist_broadcast_coalesced(self, tensors, buffer_size):
@Stonesjtu For about 100 MB of broadcast across two nodes, I did a rough perf evaluation of a single coalesced broadcast vs. the original many small broadcasts, and we can nearly double the performance: 0.0268 sec (with one coalesced broadcast) vs. 0.0436 sec (with the original logic).
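For readers wondering what is actually being compared here, this is a minimal sketch of the two paths, assuming an already-initialized torch.distributed process group and the flatten/unflatten helpers in torch._utils; it is an illustration, not the PR's exact code.

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def broadcast_one_by_one(tensors, src=0):
    # Original path: one broadcast call per tensor (many small messages).
    for tensor in tensors:
        dist.broadcast(tensor, src)

def broadcast_coalesced_once(tensors, src=0):
    # Coalesced path: pack everything into one contiguous buffer, issue a
    # single broadcast, then copy the received values back out.
    # Assumes all tensors share a dtype, as the flatten helper requires.
    flat = _flatten_dense_tensors(tensors)
    dist.broadcast(flat, src)
    for tensor, synced in zip(tensors, _unflatten_dense_tensors(flat, tensors)):
        tensor.copy_(synced)
```

The flatten/unflatten steps are local copies, so their cost is typically small compared with the per-call latency of issuing many tiny broadcasts, which is presumably where the roughly 2x gain above comes from.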
LGTM, but the code could be simpler and more robust if we used _take_tensors
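For reference, a rough sketch of what the reviewer's suggestion might look like, using torch._utils._take_tensors to form the buckets instead of hand-rolled size accounting; the function name and exact structure here are assumptions, not the PR's final code.

```python
import torch.distributed as dist
from torch._utils import (_take_tensors, _flatten_dense_tensors,
                          _unflatten_dense_tensors)

def dist_broadcast_coalesced(tensors, buffer_size, src=0):
    # _take_tensors yields groups of same-typed tensors whose combined byte
    # size stays under buffer_size; each group is flattened and broadcast once.
    for bucket in _take_tensors(tensors, buffer_size):
        flat = _flatten_dense_tensors(bucket)
        dist.broadcast(flat, src)
        # Scatter the received values back into the original tensors.
        for tensor, synced in zip(bucket, _unflatten_dense_tensors(flat, bucket)):
            tensor.copy_(synced)
```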
torch/nn/parallel/distributed.py
Outdated
        # Here we will coalesce all the parameters and buffers in several big
        # flat tensors and broadcast them out to reduce the number of broadcasts
        # as well as improve performance, even though this function is
        # only called one time.
torch/nn/parallel/distributed.py
Outdated
                tensors_bucket.append([])
                cur_bucket_size = 0
            tensors_bucket[-1].append(tensor)
            cur_bucket_size += tensor.numel() * tensor.element_size()
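On its own the excerpt above is hard to follow; a self-contained reconstruction of the bucketing loop it appears to belong to might look like the following. The surrounding control flow is inferred, not copied from the PR.

```python
def bucket_tensors(tensors, buffer_size):
    # Group tensors into buckets whose total byte size stays under buffer_size,
    # so each bucket can later be flattened and broadcast in one call.
    tensors_bucket = [[]]
    cur_bucket_size = 0
    for tensor in tensors:
        tensor_size = tensor.numel() * tensor.element_size()
        # Start a new bucket once the current one would exceed the limit
        # (the exact boundary handling here is an assumption).
        if tensors_bucket[-1] and cur_bucket_size + tensor_size > buffer_size:
            tensors_bucket.append([])
            cur_bucket_size = 0
        tensors_bucket[-1].append(tensor)
        cur_bucket_size += tensor_size
    return tensors_bucket
```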
@teng-li Is it possible to simply flatten the storage of all parameters (assuming they are static for the duration of training)? As long as you do in-place operations on the parameters, which I think PyTorch does, it seems like you'd be able to copy the entire model with just one broadcast.
We can't do this transformation in general, because there are cases where users want to flatten the parameters as they like, and we shouldn't interfere.
In the constructor, when we broadcast module 0 on node 0's entire module state to all other nodes, we currently broadcast each state tensor one by one. This is very inefficient and has also previously caused NCCL/Gloo deadlocks (even though that is a separate issue).
This PR coalesces the entire module state into a series of big tensors and broadcasts them out in one shot, just like _sync_buffers() does at each forward() call.
Memory usage is bounded by the bucket size, which is 10MB by default.
This also reduces the chance of multi-process, cross-node NCCL deadlocks, which happen only after a long series of broadcasts. So far, after this change, I haven't seen any deadlocks.
Tested on ResNet50 training: I printed the entire module state on both nodes before and after the broadcast, and verified that after the coalesced broadcast, node 0 and node 1 have exactly the same state and node 0's state hasn't changed either. Thus, this PR should be safe to land.
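To make the constructor-time flow concrete, here is a hedged sketch of how the described sync could be invoked. The sync_module_states name and the state_dict-based collection are illustrative assumptions; only the 10MB default bucket size comes from the description above, and dist_broadcast_coalesced refers to the sketch earlier in this thread.

```python
BROADCAST_BUCKET_SIZE = 10 * 1024 * 1024  # 10 MB default mentioned above

def sync_module_states(module):
    # Broadcast all of module 0 / node 0's parameters and buffers to the
    # other processes in coalesced buckets instead of one call per tensor.
    module_states = list(module.state_dict().values())
    if module_states:
        dist_broadcast_coalesced(module_states, BROADCAST_BUCKET_SIZE, src=0)
```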