
DDP: coalescing many little broadcasts to improve performance #4978


Merged
merged 4 commits into pytorch:master from dist_perf on Feb 12, 2018

Conversation

teng-li
Contributor

@teng-li teng-li commented Feb 1, 2018

In the constructor, when we broadcast node 0's entire module state to all other nodes, we currently broadcast each state tensor one by one. This is very inefficient and has also triggered NCCL/Gloo deadlocks (previously observed, even though that is a separate issue).

This PR coalesces the entire module state into a series of big tensors and broadcasts them in one shot, just like _sync_buffers() does at each forward() call.

The memory overhead is bounded by the bucket size, which is 10MB by default.

This also reduces the chance of cross-node NCCL deadlocks, which happen only after a long series of broadcasts. Since this change, I haven't seen any deadlocks.

Tested on ResNet50 training: I printed the entire module state on both nodes before and after the broadcast, and verified that after the coalesced broadcast node 0 and node 1 have exactly the same state and that node 0's state hasn't changed. Thus, this PR should be safe to land.
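
For readers skimming the thread, here is a minimal sketch of the bucketed coalesced broadcast described above. The helper name and defaults are illustrative rather than the PR's exact code, and for simplicity it assumes all tensors share one dtype (grouping by type is what _take_tensors, mentioned later in the review, takes care of); _flatten_dense_tensors, _unflatten_dense_tensors, and dist.broadcast are existing PyTorch APIs.

import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

MB = 1024 * 1024

def broadcast_coalesced(tensors, src=0, buffer_size=10 * MB):
    # Group tensors into buckets of at most buffer_size bytes each.
    buckets, cur_bucket, cur_bytes = [], [], 0
    for t in tensors:
        nbytes = t.numel() * t.element_size()
        if cur_bucket and cur_bytes + nbytes > buffer_size:
            buckets.append(cur_bucket)
            cur_bucket, cur_bytes = [], 0
        cur_bucket.append(t)
        cur_bytes += nbytes
    if cur_bucket:
        buckets.append(cur_bucket)

    # One flatten + one broadcast + one unflatten per bucket, instead of
    # one broadcast per tensor.
    for bucket in buckets:
        flat = _flatten_dense_tensors(bucket)
        dist.broadcast(flat, src)
        for t, synced in zip(bucket, _unflatten_dense_tensors(flat, bucket)):
            t.copy_(synced)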

@Stonesjtu
Contributor

Stonesjtu commented Feb 1, 2018

I'm curious how much time this coalesced broadcast saves, and what the overhead of packing/unpacking the tensors is. Do they have to be moved into new contiguous memory?

I'm testing DDP on a single node because of the GIL in DP, but failed to get a reasonable speedup. Could you please share your system setup for running DDP?

@teng-li teng-li changed the title DDP: coalescing many little broadcasts into one to improve performance DDP: coalescing many little broadcasts to improve performance Feb 1, 2018
@teng-li teng-li force-pushed the dist_perf branch 2 times, most recently from c3adf44 to 85aab24, on February 1, 2018 23:16
@@ -188,6 +197,34 @@ def train(self, mode=True):
for module in self._module_copies[1:]:
module.train(mode)

def _dist_broadcast_coalesced(self, tensors, buffer_size):


@teng-li
Contributor Author

teng-li commented Feb 1, 2018

@Stonesjtu For about 100MB of broadcast across two nodes, I did a rough perf evaluation of a single coalesced broadcast vs. the original many broadcasts: we nearly double the performance, 0.0268 sec (with one coalesced broadcast) vs. 0.0436 sec (with the original logic).
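
Taking those numbers at face value (and reading 100MB as 100e6 bytes, an assumption), the implied effective bandwidth works out roughly as follows:

payload = 100e6                        # bytes broadcast in the rough test above
bw_coalesced = payload / 0.0268 / 1e9  # ~3.7 GB/s with one coalesced broadcast
bw_original = payload / 0.0436 / 1e9   # ~2.3 GB/s with the original per-tensor broadcasts
speedup = 0.0436 / 0.0268              # ~1.6x faster end to end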

Contributor

@apaszke apaszke left a comment


LGTM, but the code could be simpler and more robust if we used _take_tensors
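
A rough sketch of what that simplification might look like, assuming torch._utils._take_tensors(tensors, size_limit), which yields buckets of tensors grouped by type with each bucket's total size under the limit; this is a sketch of the suggestion, not the code that was merged:

import torch.distributed as dist
from torch._utils import (_take_tensors, _flatten_dense_tensors,
                          _unflatten_dense_tensors)

def _dist_broadcast_coalesced(tensors, buffer_size, src=0):
    # _take_tensors handles both bucketing by size and grouping by tensor
    # type, so no manual bucket bookkeeping is needed.
    for bucket in _take_tensors(tensors, buffer_size):
        flat = _flatten_dense_tensors(bucket)
        dist.broadcast(flat, src)
        for tensor, synced in zip(bucket, _unflatten_dense_tensors(flat, bucket)):
            tensor.copy_(synced)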

# Here we coalesce all the parameters and buffers into several big
# flat tensors and broadcast them out, to reduce the number of broadcasts
# and improve performance, even though this function is
# only called once.


tensors_bucket.append([])  # start a new bucket
cur_bucket_size = 0
tensors_bucket[-1].append(tensor)  # add the tensor to the current bucket
cur_bucket_size += tensor.numel() * tensor.element_size()  # track bucket size in bytes


@apaszke apaszke merged commit d7b6a61 into pytorch:master Feb 12, 2018
@ebetica
Contributor

ebetica commented Feb 14, 2018

@teng-li Is it possible to simply flatten the storage of all parameters (assuming they are static for the duration of training)? As long as you do in-place operations on parameters, which I think PyTorch does, it seems like you'd be able to copy the entire model with just one broadcast.
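
A hypothetical sketch of that idea (helper name invented here; not something this PR implements, and it assumes all parameters share one dtype and device): pack every parameter into one contiguous buffer and re-point each parameter at a view of it, so a single broadcast covers the whole model, provided nothing later replaces the parameters' storage.

import torch
import torch.distributed as dist

def flatten_and_broadcast_parameters(module, src=0):
    params = list(module.parameters())
    total = sum(p.numel() for p in params)
    # One contiguous buffer holding every parameter.
    flat = torch.empty(total, dtype=params[0].dtype, device=params[0].device)
    offset = 0
    for p in params:
        flat[offset:offset + p.numel()].copy_(p.data.reshape(-1))
        # Make the parameter a view into the shared buffer; in-place updates
        # to p.data now also update flat, and vice versa.
        p.data = flat[offset:offset + p.numel()].view_as(p.data)
        offset += p.numel()
    # A single broadcast now synchronizes the entire model.
    dist.broadcast(flat, src)
    return flat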

@teng-li
Contributor Author

teng-li commented Feb 14, 2018

@ebetica Hmm, yeah, good thought, but this enforces a static assumption, and I'm not too sure how feasible that would be. cc @apaszke

@apaszke
Contributor

apaszke commented Feb 14, 2018

We can't do this transformation in general, because there are cases where users want to flatten the parameters as they like, and we shouldn't interfere.
