DDP: 10% of NCCL backend perf improvements with mixed-prec support #5064
Conversation
torch/nn/parallel/distributed.py (Outdated):
    for grad, reduced in \
            zip(all_grads[0],
                _unflatten_dense_tensors(all_grads_coalesced[0],
                                         all_grads[0])):
torch/nn/parallel/distributed.py (Outdated):
    all_grads_coalesced = []

    # Coalesce all the gradients
    # TODO: Add mixed precision support here
torch/nn/parallel/distributed.py (Outdated):
    # Adding the gradients for reduction
    all_grads[idx].append(param.grad.data)
    with torch.cuda.device(self.device_ids[idx]):
        dev_grads_coalesced = _flatten_dense_tensors(all_grads[idx])
torch/nn/parallel/distributed.py (Outdated):
            zip(all_grads[0],
                _unflatten_dense_tensors(all_grads_coalesced[0],
                                         all_grads[0])):
        grad.copy_(reduced)
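The excerpts above follow a coalesce / all-reduce / scatter-back pattern. A minimal standalone sketch of that pattern (not the PR's code; it assumes dist.init_process_group has already been called, and the averaging step is an optional addition):

import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def allreduce_coalesced(grads, group=None):
    # Flatten the per-parameter gradients into one contiguous buffer,
    # all-reduce that buffer once, then copy the reduced values back
    # into the original gradient tensors.
    flat = _flatten_dense_tensors(grads)
    if group is None:
        dist.all_reduce(flat)
    else:
        dist.all_reduce(flat, group=group)
    flat.div_(dist.get_world_size())  # average across ranks
    for grad, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        grad.copy_(synced)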
torch/nn/parallel/distributed.py (Outdated):
    for p in module.parameters():
        if p.requires_grad:
            def allreduce_hook(*unused):
                Variable._execution_engine.\
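This excerpt registers a per-parameter hook and lets the autograd engine run the actual reduction once the backward pass finishes. A hedged sketch of the same idea using the public register_hook API (register_allreduce_hooks and allreduce_params are illustrative names, not the PR's):

from torch.autograd import Variable

def register_allreduce_hooks(module, allreduce_params):
    # A shared flag so the reduction callback is queued only once per
    # backward pass, even though a hook fires for every parameter.
    state = {'queued': False}

    def reduce_once():
        state['queued'] = False
        allreduce_params()

    def make_hook():
        def allreduce_hook(grad):
            if not state['queued']:
                state['queued'] = True
                # queue_callback defers reduce_once until the whole
                # backward pass has produced every gradient.
                Variable._execution_engine.queue_callback(reduce_once)
            return grad
        return allreduce_hook

    for p in module.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())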
torch/nn/parallel/distributed.py (Outdated):
    dist.all_reduce_multigpu(all_grads_coalesced,
                             group=self.nccl_reduction_group_id)

    # Now only work on the first lead GPU
@pytorchbot retest this please
Just out of curiosity, have you tested this with NCCL2 but without IB? It might no longer be the fastest way to do things in that environment, and only a few people have IB.
torch/nn/parallel/distributed.py (Outdated):
    for grad, reduced in zip(mst_dev_grads, grads_reduced):
        grad.copy_(reduced)

    # Now register the reduction function in the execution engine
torch/nn/parallel/distributed.py (Outdated):
        grad.copy_(reduced)

    # Now register the reduction function in the execution engine
    for module in self._module_copies:
torch/nn/parallel/distributed.py (Outdated):
    all_grads_coalesced = \
        [[] for _ in range(len(mst_dev_grads_buckets))]

    for bkt_idx, dev_grads in enumerate(dev_grads_buckets):
torch/nn/parallel/distributed.py (Outdated):
    dev_id = self.device_ids[dev_idx]
    with torch.cuda.device(dev_id):
        dev_grads_coalesced = _flatten_dense_tensors(dev_grads)
    all_grads_coalesced[bkt_idx].append(dev_grads_coalesced)
torch/nn/parallel/distributed.py (Outdated):
    # Reduce all the gradients first
    # This single op will do all-reduce on all GPUs, utilizing multiple
    # all-reduce rings when we have more than one fast IB interface.
@apaszke I haven't had a chance to test on Ethernet yet. But for the NCCL backend this should be the same, since in the other code path we also use a single-bucket reduction, without any thread overlapping.
I'm sorry, why should it be the same? Ethernet is likely to have much lower throughput than the 4x IB setup you tested with, and interleaving the communication with backward might still be beneficial in that case.
@apaszke I meant that we don't currently support multi-threading for NCCL. Remember that the gradient buckets of each process on different nodes need to be executed in the exact same order. So even if we use the old code path, we still limit the number of buckets to 1. I will test the Ethernet perf further, which is on my to-do list anyway, but this should not block this PR for now.
Is this true, even if they are using different groups? I thought that it doesn't matter as long as they use different communicators, right?
@apaszke Right, they are using different communicators. But the NCCL call order (among all the reduction threads) needs to be maintained across all the nodes, and the order in which gradients become available for each bucket cannot be guaranteed to be the same on different nodes.
Does it have to be the same, even when they are on different communicators? I think the whole purpose of keeping multiple comms was to allow concurrent operations to execute independently |
@apaszke According to @csarofeen, the order needs to be maintained.
I believe if your communicators overlap, you can get in trouble if the order is not correct. I don't believe you can have 2 communicators on 2 GPUs and have:
Aren't these situations the reason why ncclGroupStart/ncclGroupEnd were introduced?
@apaszke comments addressed
    # (1) intra-node reduce to lead GPU, followed by
    # (2) inter-node allreduce for all the first lead GPUs in all nodes
    dist.all_reduce_multigpu(grads_batch_coalesced,
                             group=self.nccl_reduction_group_id)
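For context, a minimal sketch of how this single call is driven (assuming torch.distributed is initialized with the NCCL backend and each tensor in the list is a coalesced gradient buffer living on a different local GPU; the function name is illustrative, not the PR's):

import torch.distributed as dist

def allreduce_across_all_gpus(flat_grads_per_gpu, group=None):
    # One call covers both the intra-node reduction and the inter-node
    # all-reduce: afterwards every buffer in the list holds the sum
    # over all GPUs of all processes.
    if group is None:
        dist.all_reduce_multigpu(flat_grads_per_gpu)
    else:
        dist.all_reduce_multigpu(flat_grads_per_gpu, group=group)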
@apaszke ncclGroupStart and ncclGroupEnd were introduced for the case where a single thread submits NCCL calls for different GPUs (the way DataParallel operates). In NCCL2 that would deadlock unless wrapped in ncclGroupStart/ncclGroupEnd. For a single GPU per process, groupStart/groupEnd is a no-op.
Got an answer from the NCCL team: "calling collectives on different comms is to be avoided, and if there is no other solution it should be properly ordered". Apparently, the purpose of different comms was not to use them.
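To make the ordering constraint concrete, a small sketch (not from the PR) of what "properly ordered" means at the torch.distributed level; group_a and group_b would come from dist.new_group, and the bucket tensors are placeholders:

import torch.distributed as dist

def reduce_two_buckets(bucket_a, bucket_b, group_a, group_b):
    # Every rank must enqueue the collective on group_a before the one
    # on group_b. Letting the order depend on which gradients happen to
    # become ready first can deadlock or corrupt results with NCCL.
    dist.all_reduce(bucket_a, group=group_a)
    dist.all_reduce(bucket_b, group=group_b)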
@apaszke The latest commit has been tested with ResNet50, reaching the correct 76% accuracy on two nodes. It should be safe to land.
@teng-li can you please fix the conflict?
@apaszke fixed
This PR lets the NCCL backend directly enqueue the NCCL reduction kernels from the same thread, since the call is asynchronous by nature, and everything now goes to the default stream. This essentially gets rid of the overhead of Python thread synchronization, stream synchronization, and bucketing map lookups.
For DDP with multiple GPUs (the default use case), instead of doing a two-step reduction, we use the new NCCL backend API to all-reduce across all GPUs at once. On DGX-1s this gives roughly 4x the all-reduce throughput compared to the single-GPU all-reduce plus the intra-node reduction overhead.
As a result, we see the following perf improvements.
For DDP with 8 GPUs (the default use case), on two DGX-1s with 8 V100s, 256 batch size per process, one process per node:
ResNet50: 0.139 sec/iter, down from 0.154 sec/iter (about a 10% improvement)
ResNet101: 0.247 sec/iter, down from 0.272 sec/iter (about a 10% improvement)
For DDP with 1 GPU (the multi-process use case), on a single DGX-1 with 8 V100s, ResNet50, 32 batch size per GPU/process, 8-process distributed training:
0.109 sec/iter, down from 0.116 sec/iter (about a 6% improvement)
In addition, this PR adds bucketing to limit memory usage and adds mixed-precision support.
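Since the coalescing helpers flatten tensors of a single dtype, mixed-precision support implies grouping gradients by dtype (and by a size cap to bound memory). A minimal illustrative sketch of such bucketing, not the PR's actual implementation; bucket_by_dtype and the 10 MB cap are assumptions:

from collections import defaultdict

def bucket_by_dtype(grads, bucket_cap_bytes=10 * 1024 * 1024):
    # Split gradients into single-dtype buckets of at most roughly
    # bucket_cap_bytes each, so every bucket can be flattened and
    # all-reduced on its own.
    open_buckets, open_sizes = defaultdict(list), defaultdict(int)
    buckets = []
    for g in grads:
        nbytes = g.numel() * g.element_size()
        if open_buckets[g.dtype] and open_sizes[g.dtype] + nbytes > bucket_cap_bytes:
            buckets.append(open_buckets[g.dtype])
            open_buckets[g.dtype], open_sizes[g.dtype] = [], 0
        open_buckets[g.dtype].append(g)
        open_sizes[g.dtype] += nbytes
    buckets.extend(b for b in open_buckets.values() if b)
    return buckets

Each bucket can then be passed through the flatten / all-reduce / copy-back routine sketched earlier in the thread.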