DDP: 10% of NCCL backend perf improvements with mixed-prec support #5064
Conversation
torch/nn/parallel/distributed.py (Outdated):
    for grad, reduced in \
            zip(all_grads[0],
                _unflatten_dense_tensors(all_grads_coalesced[0],
                                         all_grads[0])):
torch/nn/parallel/distributed.py (Outdated):
    all_grads_coalesced = []

    # Coalesce all the gradients
    # TODO: Add mixed precision support here
torch/nn/parallel/distributed.py (Outdated):
    # Adding the gradients for reduction
    all_grads[idx].append(param.grad.data)
    with torch.cuda.device(self.device_ids[idx]):
        dev_grads_coalesced = _flatten_dense_tensors(all_grads[idx])
torch/nn/parallel/distributed.py (Outdated):
            zip(all_grads[0],
                _unflatten_dense_tensors(all_grads_coalesced[0],
                                         all_grads[0])):
        grad.copy_(reduced)
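The excerpts above follow a coalesce / all-reduce / scatter-back pattern. A minimal standalone sketch of that pattern (not the PR's code; it assumes dist.init_process_group has already been called, and the averaging step is an optional addition):

import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def allreduce_coalesced(grads, group=None):
    # Flatten the per-parameter gradients into one contiguous buffer,
    # all-reduce that buffer once, then copy the reduced values back
    # into the original gradient tensors.
    flat = _flatten_dense_tensors(grads)
    if group is None:
        dist.all_reduce(flat)
    else:
        dist.all_reduce(flat, group=group)
    flat.div_(dist.get_world_size())  # average across ranks
    for grad, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        grad.copy_(synced)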
torch/nn/parallel/distributed.py (Outdated):
    for p in module.parameters():
        if p.requires_grad:
            def allreduce_hook(*unused):
                Variable._execution_engine.\
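This excerpt registers a per-parameter hook and lets the autograd engine run the actual reduction once the backward pass finishes. A hedged sketch of the same idea using the public register_hook API (register_allreduce_hooks and allreduce_params are illustrative names, not the PR's):

from torch.autograd import Variable

def register_allreduce_hooks(module, allreduce_params):
    # A shared flag so the reduction callback is queued only once per
    # backward pass, even though a hook fires for every parameter.
    state = {'queued': False}

    def reduce_once():
        state['queued'] = False
        allreduce_params()

    def make_hook():
        def allreduce_hook(grad):
            if not state['queued']:
                state['queued'] = True
                # queue_callback defers reduce_once until the whole
                # backward pass has produced every gradient.
                Variable._execution_engine.queue_callback(reduce_once)
            return grad
        return allreduce_hook

    for p in module.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())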
torch/nn/parallel/distributed.py (Outdated):
    dist.all_reduce_multigpu(all_grads_coalesced,
                             group=self.nccl_reduction_group_id)

    # Now only work on the first lead GPU
@pytorchbot retest this please
Just out of curiosity, have you tested this with NCCL2 but without IB? It might no longer be the fastest way to do things in that environment, and only a few people have IB.
torch/nn/parallel/distributed.py (Outdated):
    for grad, reduced in zip(mst_dev_grads, grads_reduced):
        grad.copy_(reduced)

    # Now register the reduction function in the execution engine
torch/nn/parallel/distributed.py (Outdated):
        grad.copy_(reduced)

    # Now register the reduction function in the execution engine
    for module in self._module_copies:
torch/nn/parallel/distributed.py (Outdated):
    all_grads_coalesced = \
        [[] for _ in range(len(mst_dev_grads_buckets))]

    for bkt_idx, dev_grads in enumerate(dev_grads_buckets):
torch/nn/parallel/distributed.py (Outdated):
    dev_id = self.device_ids[dev_idx]
    with torch.cuda.device(dev_id):
        dev_grads_coalesced = _flatten_dense_tensors(dev_grads)
    all_grads_coalesced[bkt_idx].append(dev_grads_coalesced)
torch/nn/parallel/distributed.py (Outdated):
    # Reduce all the gradients first
    # This single op will do all-reduce on all GPUs, utilizing multiple
    # all-reduce rings when we have more than one fast IB interface.
@apaszke I haven't had a chance to test on Ethernet yet. But for the NCCL backend this should be the same, since in the other code path we also use a single-bucket reduction, without any thread overlapping.
I'm sorry, why should it be the same? Ethernet is likely to have much lower throughput than the 4x IB setup you tested with, and interleaving the communication with backward might still be beneficial in that case.
@apaszke I meant that we don't currently support multi-threading for NCCL. Remember that the gradient buckets of each process on different nodes need to be executed in the exact same order. So even if we use the old code path, we still limit the number of buckets to 1. I will test the Ethernet perf further, which is on my to-do list anyway, but this should not block this PR for now.
Is this true, even if they are using different groups? I thought that it doesn't matter as long as they use different communicators, right?
@apaszke Right, they are using different communicators. But the NCCL call order (among all the reduction threads) needs to be maintained across all the nodes, and the order in which gradients become available for each bucket cannot be guaranteed to be the same on different nodes.
Does it have to be the same, even when they are on different communicators? I think the whole purpose of keeping multiple comms was to allow concurrent operations to execute independently |
@apaszke According to @csarofeen, the order needs to be maintained.
I believe if your communicators overlap, you can get in trouble if the order is not correct. I don't believe you can have 2 communicators on 2 GPUs and have:
Aren't these situations the reason why ncclGroupStart/ncclGroupEnd were introduced?
@apaszke comments addressed
    # (1) intra-node reduce to lead GPU, followed by
    # (2) inter-node allreduce for all the first lead GPUs in all nodes
    dist.all_reduce_multigpu(grads_batch_coalesced,
                             group=self.nccl_reduction_group_id)
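For context, a minimal sketch of how this single call is driven (assuming torch.distributed is initialized with the NCCL backend and each tensor in the list is a coalesced gradient buffer living on a different local GPU; the function name is illustrative, not the PR's):

import torch.distributed as dist

def allreduce_across_all_gpus(flat_grads_per_gpu, group=None):
    # One call covers both the intra-node reduction and the inter-node
    # all-reduce: afterwards every buffer in the list holds the sum
    # over all GPUs of all processes.
    if group is None:
        dist.all_reduce_multigpu(flat_grads_per_gpu)
    else:
        dist.all_reduce_multigpu(flat_grads_per_gpu, group=group)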
@apaszke ncclGroupStart and ncclGroupEnd were introduced for the case where a single thread submits NCCL calls for different GPUs (the way DataParallel operates). In NCCL2 that would deadlock unless wrapped in ncclGroupStart/ncclGroupEnd. For a single GPU per process, groupStart/groupEnd is a no-op.
Got an answer from the NCCL team: "calling collectives on different comms is to be avoided, and if there is no other solution it should be properly ordered". Apparently, the purpose of different comms was not to use them.
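To make the ordering constraint concrete, a small sketch (not from the PR) of what "properly ordered" means at the torch.distributed level; group_a and group_b would come from dist.new_group, and the bucket tensors are placeholders:

import torch.distributed as dist

def reduce_two_buckets(bucket_a, bucket_b, group_a, group_b):
    # Every rank must enqueue the collective on group_a before the one
    # on group_b. Letting the order depend on which gradients happen to
    # become ready first can deadlock or corrupt results with NCCL.
    dist.all_reduce(bucket_a, group=group_a)
    dist.all_reduce(bucket_b, group=group_b)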
@apaszke The latest commit has been tested with ResNet50, reaching the correct 76% accuracy on two nodes. It should be safe to land.
@teng-li can you please fix the conflict?
@apaszke fixed
This PR lets the NCCL backend directly enqueue the NCCL reduction kernels from the same thread, since the call is asynchronous by nature, and everything now goes to the default stream. This essentially gets rid of the overhead of Python thread synchronization, stream synchronization, and bucketing map lookups.
For DDP with multiple GPUs (the default use case), instead of doing a two-step reduction, we use the new NCCL backend API to all-reduce across all GPUs at once. On DGX-1s this gives roughly 4x the all-reduce throughput compared to the single-GPU all-reduce plus the intra-node reduction overhead.
As a result, we see the following perf improvements.
For DDP with 8 GPUs (the default use case), on two DGX-1s with 8 V100s, 256 batch size per process, one process per node:
ResNet50: 0.139 sec/iter, down from 0.154 sec/iter (about a 10% improvement)
ResNet101: 0.247 sec/iter, down from 0.272 sec/iter (about a 10% improvement)
For DDP with 1 GPU (the multi-process use case), on a single DGX-1 with 8 V100s, ResNet50, 32 batch size per GPU/process, 8-process distributed training:
0.109 sec/iter, down from 0.116 sec/iter (about a 6% improvement)
In addition, this PR adds bucketing to limit memory usage and adds mixed-precision support.
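Since the coalescing helpers flatten tensors of a single dtype, mixed-precision support implies grouping gradients by dtype (and by a size cap to bound memory). A minimal illustrative sketch of such bucketing, not the PR's actual implementation; bucket_by_dtype and the 10 MB cap are assumptions:

from collections import defaultdict

def bucket_by_dtype(grads, bucket_cap_bytes=10 * 1024 * 1024):
    # Split gradients into single-dtype buckets of at most roughly
    # bucket_cap_bytes each, so every bucket can be flattened and
    # all-reduced on its own.
    open_buckets, open_sizes = defaultdict(list), defaultdict(int)
    buckets = []
    for g in grads:
        nbytes = g.numel() * g.element_size()
        if open_buckets[g.dtype] and open_sizes[g.dtype] + nbytes > bucket_cap_bytes:
            buckets.append(open_buckets[g.dtype])
            open_buckets[g.dtype], open_sizes[g.dtype] = [], 0
        open_buckets[g.dtype].append(g)
        open_sizes[g.dtype] += nbytes
    buckets.extend(b for b in open_buckets.values() if b)
    return buckets

Each bucket can then be passed through the flatten / all-reduce / copy-back routine sketched earlier in the thread.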