DDP: 10% of NCCL backend perf improvements with mixed-prec support #5064

Merged: 4 commits into pytorch:master on Feb 21, 2018

Conversation

@teng-li (Contributor) commented Feb 5, 2018

This PR lets the NCCL backend directly enqueue the NCCL reduction kernels from the same thread, since the call is asynchronous by nature, and everything now goes onto the default stream. This essentially gets rid of the overhead of Python thread synchronization and stream synchronization, as well as the bucketing map lookup overhead.

For DDP with multiple GPUs (the default use case), instead of doing a two-step reduction, we use the new NCCL backend API to all-reduce across all GPUs in a single call. On DGX-1s this is roughly 4x the all-reduce throughput of the single-GPU all-reduce plus the intra-node reduce overhead.

As a result, we see the following perf improvements.

For DDP with 8 GPUs (default use case):

On two DGX-1s with 8 V100s each, batch size 256 per process, one process per node:
ResNet50: 0.139 sec/iter, down from 0.154 sec/iter (about a 10 percent improvement)
ResNet101: 0.247 sec/iter, down from 0.272 sec/iter (about a 10 percent improvement)

For DDP with 1 GPU (multi-process use case):

On a single DGX-1 with 8 V100s, ResNet50, batch size 32 per GPU and process, 8-process distributed training:
0.109 sec/iter, down from 0.116 sec/iter (about a 6 percent improvement)

In addition, this PR adds bucketing to limit memory usage and adds mixed precision support.
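
To make the reduction path concrete, here is a minimal sketch of the approach described above: coalesce each device replica's gradients into one flat buffer, issue a single multi-GPU all-reduce, then scatter the averaged values back to the lead device's gradients. This is not the PR's exact code; the function name, the module_copies/device_ids arguments, and the averaging step are illustrative assumptions.

import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def reduce_coalesced_grads(module_copies, device_ids):
    # Flatten each device replica's gradients into one contiguous buffer,
    # so a single NCCL launch covers all parameters on that device.
    grads_per_device = []
    coalesced = []
    for dev_idx, module in enumerate(module_copies):
        dev_grads = [p.grad.data for p in module.parameters()
                     if p.requires_grad and p.grad is not None]
        grads_per_device.append(dev_grads)
        with torch.cuda.device(device_ids[dev_idx]):
            coalesced.append(_flatten_dense_tensors(dev_grads))

    # One call reduces across every GPU of every process with the NCCL
    # backend; the intra-node and inter-node steps happen inside NCCL.
    dist.all_reduce_multigpu(coalesced)

    # DDP averages gradients, so divide by the number of processes and copy
    # the reduced values back into the lead (first) device's gradients.
    coalesced[0] /= dist.get_world_size()
    for grad, reduced in zip(grads_per_device[0],
                             _unflatten_dense_tensors(coalesced[0],
                                                      grads_per_device[0])):
        grad.copy_(reduced)

As the description notes, the actual change additionally buckets the flattened gradients to cap peak memory and handles mixed-precision parameters; both are omitted from this sketch.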

@teng-li changed the title from "DDP: 10% of NCCL backend performance improvements" to "DDP: 10% of NCCL backend perf improvements with mixed-prec support" on Feb 7, 2018
@teng-li force-pushed the nccl2_ddp branch 2 times, most recently from 5979a95 to 5855d66 on February 8, 2018
@apaszke (Contributor) commented Feb 12, 2018

@pytorchbot retest this please

@apaszke (Contributor) left a review comment

Just out of curiosity, have you tested this using NCCL2 but without IB? It might no longer be the fastest way to do things in that environment, and only a few people have IB.

@teng-li (Contributor, Author) left a review comment

@apaszke I haven't had a chance to test on Ethernet yet. But for the NCCL backend this should be the same, since in the other code path we are also using a single-bucket reduction, without any overlap between threads.

@apaszke (Contributor) commented Feb 13, 2018

I'm sorry, but why should it be the same? Ethernet is likely to have much lower throughput than the 4x IB setup you tested with, and interleaving the communication with the backward pass might still be beneficial in that case.

@teng-li (Contributor, Author) commented Feb 13, 2018

@apaszke I meant that we don't currently support multi-threading for NCCL. Remember that the gradient buckets of each process on different nodes need to be executed in exactly the same order. So even if we use the old code path, we still limit the number of buckets to 1.

I will test the Ethernet perf further, which is on my to-do list anyway, but that should not block this PR for now.

@apaszke (Contributor) commented Feb 14, 2018

Is this true even if they are using different groups? I thought it doesn't matter as long as they use different communicators, right?

@teng-li (Contributor, Author) commented Feb 14, 2018

@apaszke Right, they are using different communicators. But the NCCL call order (among all the reduction threads) needs to be the same across all the nodes, and the order in which gradients become available for each bucket cannot be guaranteed to match across nodes.

@apaszke (Contributor) commented Feb 15, 2018

Does it have to be the same even when they are on different communicators? I thought the whole purpose of keeping multiple comms was to allow concurrent operations to execute independently.

@teng-li (Contributor, Author) commented Feb 15, 2018

@apaszke according to @csarofeen, the order needs to be maintained.

@csarofeen (Contributor) commented

I believe that if your communicators overlap, you can get into trouble when the order is not consistent. I don't believe you can have 2 communicators on 2 GPUs and have:
GPU 0 call on comm 0 then comm 1,
and
GPU 1 call on comm 1 then comm 0.

@apaszke (Contributor) commented Feb 16, 2018

Aren't these situations the reason why ncclGroupStart and ncclGroupEnd were introduced?

@teng-li (Contributor, Author) commented Feb 16, 2018

@apaszke comments addressed

@ngimel (Collaborator) commented Feb 16, 2018

@apaszke ncclGroupStart and ncclGroupEnd were introduced for the case where a single thread submits NCCL calls for different GPUs (the way DataParallel operates). In NCCL2 that would deadlock unless the calls are wrapped in groupStart/groupEnd. For a single GPU per process, groupStart/groupEnd is a no-op.
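
For reference, a minimal sketch of that single-thread, DataParallel-style submission pattern, assuming a NCCL-enabled build and at least two GPUs; torch.cuda.nccl is expected to take care of grouping the per-device calls (conceptually the ncclGroupStart/ncclGroupEnd wrapping) so they cannot deadlock against each other:

import torch
import torch.cuda.nccl as nccl

if torch.cuda.device_count() >= 2:
    # One tensor per GPU, all submitted from this single thread.
    tensors = [torch.ones(4).cuda(d) * (d + 1)
               for d in range(torch.cuda.device_count())]
    if nccl.is_available(tensors):
        # A single call issues the per-device all-reduce kernels together;
        # the reduction happens in place with the default sum op.
        nccl.all_reduce(tensors)
        print(tensors[0])  # every device now holds the same summed values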

@ngimel (Collaborator) commented Feb 16, 2018

Got an answer from the NCCL team: "calling collectives on different comms is to be avoided, and if there is no other solution it should be properly ordered". Apparently, the purpose of having different comms was not for them to run concurrently.
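
To make the "properly ordered" requirement concrete, here is a toy sketch (the bucket and group names are illustrative, not from the PR): every rank issues its collectives over the same buckets, on the same groups, in the same fixed order, even though the groups correspond to different communicators.

import torch.distributed as dist

def reduce_buckets_in_fixed_order(buckets, groups):
    # buckets: the same ordered list of flat gradient tensors on every rank.
    # groups: one process group (i.e. one communicator) per bucket.
    # Issuing the collectives strictly in bucket order keeps every rank's
    # NCCL call sequence identical; a mirrored order across ranks risks a
    # hang, per the discussion above.
    for flat, group in zip(buckets, groups):
        dist.all_reduce(flat, group=group)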

@teng-li (Contributor, Author) commented Feb 20, 2018

@apaszke The latest commit has been tested with ResNet50, reaching the expected 76% accuracy on two nodes. It should be safe to land.

@apaszke (Contributor) commented Feb 21, 2018

@teng-li can you please fix the conflict?

@teng-li (Contributor, Author) commented Feb 21, 2018

@apaszke fixed

@apaszke merged commit 579de82 into pytorch:master on Feb 21, 2018