DDP: coalescing many little broadcasts to improve performance #4978
Conversation
I'm curious how much time this coalesced broadcast saves, and what the overhead of packing/unpacking the tensors is. Do they have to be moved into a new contiguous block of memory? I'm testing DDP on a single node because of the GIL in DP, but failed to get a reasonable speedup. Could you please share your system setup for running DDP?
Force-pushed from c3adf44 to 85aab24
@@ -188,6 +197,34 @@ def train(self, mode=True):
        for module in self._module_copies[1:]:
            module.train(mode)

    def _dist_broadcast_coalesced(self, tensors, buffer_size):
@Stonesjtu For about 100 MB of broadcast across two nodes, I did a rough perf evaluation of a single coalesced broadcast vs. the original many small broadcasts, and we can nearly double the performance: 0.0268 sec (with one coalesced broadcast) vs. 0.0436 sec (with the original logic).
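For readers wondering what is actually being compared here, this is a minimal sketch of the two paths, assuming an already-initialized torch.distributed process group and the flatten/unflatten helpers in torch._utils; it is an illustration, not the PR's exact code.

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def broadcast_one_by_one(tensors, src=0):
    # Original path: one broadcast call per tensor (many small messages).
    for tensor in tensors:
        dist.broadcast(tensor, src)

def broadcast_coalesced_once(tensors, src=0):
    # Coalesced path: pack everything into one contiguous buffer, issue a
    # single broadcast, then copy the received values back out.
    # Assumes all tensors share a dtype, as the flatten helper requires.
    flat = _flatten_dense_tensors(tensors)
    dist.broadcast(flat, src)
    for tensor, synced in zip(tensors, _unflatten_dense_tensors(flat, tensors)):
        tensor.copy_(synced)
```

The flatten/unflatten steps are local copies, so their cost is typically small compared with the per-call latency of issuing many tiny broadcasts, which is presumably where the roughly 2x gain above comes from.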
LGTM, but the code could be simpler and more robust if we used _take_tensors
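For reference, a rough sketch of what the reviewer's suggestion might look like, using torch._utils._take_tensors to form the buckets instead of hand-rolled size accounting; the function name and exact structure here are assumptions, not the PR's final code.

```python
import torch.distributed as dist
from torch._utils import (_take_tensors, _flatten_dense_tensors,
                          _unflatten_dense_tensors)

def dist_broadcast_coalesced(tensors, buffer_size, src=0):
    # _take_tensors yields groups of same-typed tensors whose combined byte
    # size stays under buffer_size; each group is flattened and broadcast once.
    for bucket in _take_tensors(tensors, buffer_size):
        flat = _flatten_dense_tensors(bucket)
        dist.broadcast(flat, src)
        # Scatter the received values back into the original tensors.
        for tensor, synced in zip(bucket, _unflatten_dense_tensors(flat, bucket)):
            tensor.copy_(synced)
```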
torch/nn/parallel/distributed.py
Outdated
        # Here we will coalesce all the parameters and buffers in several big
        # flat tensors and broadcast them out to reduce the number of broadcasts
        # as well as improve performance, even though this function is
        # only called one time.
torch/nn/parallel/distributed.py
Outdated
                tensors_bucket.append([])
                cur_bucket_size = 0
            tensors_bucket[-1].append(tensor)
            cur_bucket_size += tensor.numel() * tensor.element_size()
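On its own the excerpt above is hard to follow; a self-contained reconstruction of the bucketing loop it appears to belong to might look like the following. The surrounding control flow is inferred, not copied from the PR.

```python
def bucket_tensors(tensors, buffer_size):
    # Group tensors into buckets whose total byte size stays under buffer_size,
    # so each bucket can later be flattened and broadcast in one call.
    tensors_bucket = [[]]
    cur_bucket_size = 0
    for tensor in tensors:
        tensor_size = tensor.numel() * tensor.element_size()
        # Start a new bucket once the current one would exceed the limit
        # (the exact boundary handling here is an assumption).
        if tensors_bucket[-1] and cur_bucket_size + tensor_size > buffer_size:
            tensors_bucket.append([])
            cur_bucket_size = 0
        tensors_bucket[-1].append(tensor)
        cur_bucket_size += tensor_size
    return tensors_bucket
```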
@teng-li Is it possible to simply flatten the storage of all parameters (assuming they are static for the duration of training)? As long as you do in-place operations on the parameters, which I think PyTorch does, it seems like you'd be able to copy the entire model with just one broadcast.
We can't do this transformation in general, because there are cases where users want to flatten the parameters as they like, and we shouldn't interfere.
In the constructor, when we broadcast module 0 on node 0's entire module state to all other nodes, we currently broadcast each state tensor one by one. This is very inefficient and has also previously caused NCCL/Gloo deadlocks (even though that is a separate issue).
This PR coalesces the entire module state into a series of big tensors and broadcasts them out in one shot, just like _sync_buffers() does at each forward() call.
Memory usage is bounded by the bucket size, which is 10MB by default.
This also reduces the chance of multi-process, cross-node NCCL deadlocks, which happen only after a long series of broadcasts. So far, after this change, I haven't seen any deadlocks.
Tested on ResNet50 training: I printed the entire module state on both nodes before and after the broadcast, and verified that after the coalesced broadcast, node 0 and node 1 have exactly the same state and node 0's state hasn't changed either. Thus, this PR should be safe to land.
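To make the constructor-time flow concrete, here is a hedged sketch of how the described sync could be invoked. The sync_module_states name and the state_dict-based collection are illustrative assumptions; only the 10MB default bucket size comes from the description above, and dist_broadcast_coalesced refers to the sketch earlier in this thread.

```python
BROADCAST_BUCKET_SIZE = 10 * 1024 * 1024  # 10 MB default mentioned above

def sync_module_states(module):
    # Broadcast all of module 0 / node 0's parameters and buffers to the
    # other processes in coalesced buckets instead of one call per tensor.
    module_states = list(module.state_dict().values())
    if module_states:
        dist_broadcast_coalesced(module_states, BROADCAST_BUCKET_SIZE, src=0)
```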