DDP fp16_compress_hook communication hook increases peak memory #45968
Comments
I guess this might be because we create a new gradient tensor within the hook but still keep the old tensor around, and thus occupy an extra gradient-tensor-size chunk of memory. cc @pritamdamania87 @SciPioneer, is there anything we could do about this, such as making the old bucket tensors point to this new tensor and thereby freeing that memory?
@rohan-varma I think the additional memory overhead should be temporary since we clear out the bucket views here each time: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L916.
@david-macleod Which synchronization are you referring to here? Looking at the implementation, it seems like we would allocate additional memory during the backward pass, but it would be cleared out in the next backward pass. Are you referring to the fact that we don't clear out this memory at the end of the backward pass instead?
Sorry, my original message wasn't very clear. I meant that during the backward pass, after each bucket is reduced, the memory allocated for that particular bucket would ideally be released, or the original buffers would be reused as @rohan-varma suggested. The big issue for me is that peak memory usage has gone from 3*G to 4*G (where G is the size of the gradients), and that limits the size of the model I can fit in memory.
To resolve this issue, I think I can add a clear() method to the GradBucket class for clearing the tensors, and call clear() right after the compression line:
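A rough sketch of where such a `clear()` call would sit, assuming the general shape of `fp16_compress_hook` as described in this thread; `bucket.clear()` is the hypothetical method proposed above, and the hook body below is an illustration rather than the actual PyTorch source:

```python
import torch
import torch.distributed as dist

def fp16_compress_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    # The compression line: .to(torch.float16) allocates a new half-precision
    # tensor while the original fp32 bucket tensor stays alive.
    compressed = bucket.get_tensors()[0].to(torch.float16).div_(world_size)
    # bucket.clear()  # proposed: release the fp32 bucket tensors here
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        # Casting back to fp32 here creates yet another full-size copy.
        return [fut.value()[0].to(torch.float32)]

    return fut.then(decompress)
```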
…of fp16 compression provided by ddp comm hook

The peak memory usage of the DDP comm hook has increased due to an extra copy of the gradient tensors. To reduce memory usage, decompress the fp16 tensor in place of the tensor stored in the gradient bucket.

Closes: #45968
Pull Request resolved: #46078
Differential Revision: [D24178118](https://our.internmc.facebook.com/intern/diff/D24178118/)
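For reference, a hedged sketch of the in-place decompression idea from the PR (not the literal merged diff): inside the hook's completion callback, the all-reduced fp16 result is copied back into the bucket's existing fp32 tensor instead of materializing a new fp32 copy:

```python
def decompress(fut):
    # `bucket` is the GradBucket passed to the hook; reusing its fp32
    # storage avoids the extra full-size gradient copy.
    decompressed = bucket.get_tensors()[0]
    decompressed.copy_(fut.value()[0])  # fp16 -> fp32 cast happens in copy_
    return [decompressed]
```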
@david-macleod Could you patch this PR to verify the fix? I verified the fix on my end already.
Works great, thanks!
🐛 Bug
When using the new fp16_compress_hook in a torch 1.7 nightly, the peak memory usage increases by an amount equal to the (in-memory) size of the gradient tensors. In the example below the 32-bit grads are 8MB, so the expected peak allocated memory is ~24MB (parameter tensors + gradient tensors + DDP buffers), and I have confirmed this is the case in the standard setting.
However, when using the hook this increased to 32MB, presumably caused by the
`bucket.get_tensors()[0].to(torch.float16)`
copy. I would have expected this memory to be released after the synchronization was complete, but that does not seem to be the case. Is there a way to avoid this extra copy? It becomes an issue when dealing with very large models.

To Reproduce
Steps to reproduce the behavior:
Without fp16 hook
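A minimal sketch of the baseline run, with assumed details (single process, NCCL group, a toy `nn.Linear` sized so its fp32 gradients total ~8MB); the reporter's exact script may differ:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

# 2048 * 1024 = 2M float32 params -> 8MB of parameters and 8MB of gradients.
model = torch.nn.Linear(2048, 1024, bias=False).cuda()
ddp = DDP(model, device_ids=[0])

ddp(torch.randn(16, 2048, device="cuda")).sum().backward()
torch.cuda.synchronize()
# Expected: roughly params + grads + DDP buckets, i.e. ~24MB at peak.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MB")
```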
With fp16 hook
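The same sketch with the hook registered. The import path matches the 1.7-era nightlies, but the registration method name is an assumption: some nightlies exposed it only as the private `_register_comm_hook`.

```python
# Continues the baseline script above, before the backward pass.
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumed public name; older nightlies may need ddp._register_comm_hook.
ddp.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

ddp(torch.randn(16, 2048, device="cuda")).sum().backward()
torch.cuda.synchronize()
# Observed in this report: peak rises to ~32MB, one extra gradient-sized copy.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MB")
```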
Expected behavior
The peak memory usage to be the same with and without the hook
Environment
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski