Change AccumulateGrad to yield .grads that match weights' memory layout #34904
Conversation
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Just adding the BC-breaking tag to make sure it is tracked.
DO NOT MERGE before the corresponding distributed fix, this will break DP/DDP.
💊 CI failures summary and remediations — as of commit 679480f (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI and has been revised 125 times.)
Requesting changes following the comment above.
@ngimel do you have a reference for these other changes? And more information on how this would break DP/DDP?
@albanD DDP changes are being added to this PR currently.
OK! Good to know.
For DDP allreduces as currently written, the fix must be in DDP's Reducer, because Reducer performs its own bucket flattening/unflattening (and those are the points where my fixes are applied). Interestingly, to perform the param broadcast during construction, DDP does not use its own bucketing logic. Instead it calls into BroadcastWork, a handy self-contained broadcaster in torch/csrc/distributed/c10d/comm.cpp. BroadcastWork takes a list of tensors, creates flat buckets, broadcasts the buckets, and copies the broadcasted data back into the original tensors. It handles NHWC-contiguous tensors just fine. Creating an analogous AllreduceWork class for Reducer to call is worth considering, but probably out of scope for today.
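For readers following along, here is a minimal Python-level sketch of that flatten → broadcast → unflatten pattern using public `torch.distributed` APIs. It illustrates the idea only, not the actual C++ `BroadcastWork`; the helper name `broadcast_params` is made up for this example, and an initialized process group is assumed.

```python
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def broadcast_params(tensors, src=0):
    # Flatten into one contiguous bucket, broadcast the bucket, then copy the
    # broadcasted values back into the original tensors (whatever their strides).
    flat = _flatten_dense_tensors(tensors)
    dist.broadcast(flat, src=src)
    for t, synced in zip(tensors, _unflatten_dense_tensors(flat, tensors)):
        t.copy_(synced)  # elementwise copy, so NHWC destinations still get correct values
```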
…P gradient and DDP changes to see what CI thinks of raw collective changes alone (compare 6dfb14f to 68ef85e)
LGTM :)
```cpp
if (!global_unused) {
  if (!grad.defined()) {
    grad = at::empty(bucket_view.sizes(), bucket_view.options());
    // Creates grad according to the "Gradient Layout Contract"
```
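As a rough Python-level rendering of what "create the grad according to the Gradient Layout Contract" means (illustrative only; the actual logic lives in the C++ autograd/DDP code, and `new_grad_like` is a made-up helper name): if the param is non-overlapping and dense, give the grad the param's strides, otherwise fall back to rowmajor contiguous. `torch.preserve_format` already encodes that rule.

```python
import torch

def new_grad_like(param: torch.Tensor) -> torch.Tensor:
    # preserve_format keeps param's strides when param is non-overlapping and
    # dense, and falls back to rowmajor contiguous otherwise.
    return torch.zeros_like(param, memory_format=torch.preserve_format)

p = torch.randn(8, 3, 4, 4).contiguous(memory_format=torch.channels_last)
g = new_grad_like(p)
assert g.stride() == p.stride()  # grad layout matches the (channels-last) param
```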
@mrshenli are the diffs here ok? (again, grad might not be rowmajor contiguous here). Also under what circumstances does the grad need to be "written back" and what does "written back" entail? The grad is modified in place, so references to it elsewhere don't need to be explicitly modified (although I guess copies in other processes would be)?
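A tiny self-contained illustration of the in-place point above, within a single process (cross-process copies are a separate question, as noted):

```python
import torch

p = torch.randn(3, requires_grad=True)
p.sum().backward()

alias = p.grad         # some other reference held elsewhere
p.grad.mul_(0.5)       # in-place modification of the grad
assert torch.equal(alias, p.grad)             # the alias sees the change
assert alias.data_ptr() == p.grad.data_ptr()  # same underlying storage
```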
Also, why isn't there an equivalent `runGradCallbackForVariable` in `finalize_bucket_sparse`? Was that an oversight in the `dist_autograd` PR, or was that intentionally omitted?
> are the diffs here ok?

This looks OK to me.

> Also under what circumstances does the grad need to be "written back" and what does "written back" entail?

If you are referring to the "written back" comment below, I think that comment is inaccurate. IIUC, it should return true when we are certain that the grad value is final, and we can now launch cbs on it.

> Also, why isn't there an equivalent runGradCallbackForVariable in finalize_bucket_sparse? Was that an oversight in the dist_autograd PR, or was that intentionally omitted?

I guess this is because #37998 didn't mean to support sparse tensors for RPC + DDP. @pritamdamania87 if this is the case, should we also remove the `runGradCallbackForVariable` in `mark_variable_ready_sparse` to make it consistent?
I think this was probably an oversight in #37998; we should probably call `runGradCallbackForVariable` in `finalize_bucket_sparse` as well. I'll fix this in a separate diff.
SGTM. I wrote my other thoughts about sparse grads in the other thread https://github.com/pytorch/pytorch/pull/34904/files#r439117597
```cpp
return false;
replica.contents.div_(process_group_->getSize());
// The grad is modified in place and needs to be written back.
return true;
```
@mrshenli are the diffs here ok? In `mark_variable_ready_dense`, `mul_out` averages the pre-allreduce gradients (which doesn't affect grad itself, because the result is written to the allreduce bucket). For `mark_variable_ready_sparse` to mirror that control flow, I added `div_` here. For sparse gradients, `replica.contents = grad`, though, so grad itself IS affected in place. Therefore, I also changed `return false` to `return true`. Admittedly I don't fully understand the implications of doing so.
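A small demonstration of that aliasing point, sketched with public APIs (the divisor 4 is just an assumed world size for illustration): for a sparse grad, dividing "the bucket contents" in place is dividing `param.grad` itself.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, sparse=True)
emb(torch.tensor([1, 2, 3])).sum().backward()   # produces a sparse weight grad

before = emb.weight.grad.to_dense().clone()
contents = emb.weight.grad   # stand-in for replica.contents aliasing the grad
contents.div_(4)             # in-place predivision by an assumed world_size of 4

after = emb.weight.grad.to_dense()
assert torch.allclose(after, before / 4)   # .grad itself was scaled in place
```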
Maybe the `div_` + `return true` here actually fixes a bug with the `dist_autograd` changes. In current master, the bucket predivision is carried out in `mark_variable_ready`. For sparse gradients, the predivision alters the grad in place, but there's no mention at that point of `runGradCallbackForVariable`, so that change isn't communicated anywhere. Again, admittedly I have no idea what's going on.
IIUC, the criterion for returning true is that we want to trigger cbs on that grad. So the prior comments (return true when the grad is modified) might not be accurate.
We want to trigger those cbs when the grad is ready, i.e., after the allreduce sync done by DDP. So I think we should still return false here?
@pritamdamania87 please correct me if I am wrong.
I think for consistency with local autograd it's better to return true here (since for local autograd `.grad` would change, so for dist_autograd the grads in the map should change too). Although, is it possible to do the division at the end, in `finalize_backward`? That way we modify the grad in only one place.
Wait a minute: since for sparse grads the grad is the bucket, it makes no sense for anything external (local or remote) to look at its values between when the allreduce hook triggers and when the allreduce is finalized. From the perspective of external code in this thread, the allreduce hook carries out a `div_` and immediately kicks off an allreduce which spends a while updating grad's values in place. Any access to the grad before `finalize_backward` is racy.
We could make `div_` an out-of-place `div` for sparse gradients. Then `variable.grad` itself will remain untouched until `finalize_bucket_sparse` (which makes sparse grad control flow more consistent with dense grads, which use separate memory for their allreduce buckets). However, as I said earlier, `finalize_bucket_sparse` makes no reference to `runGradCallbackForVariable` anyway, so `dist_autograd`'s "map" is never informed even after the allreduce completes and grad is safe to access. Do we need to add `runGradCallbackForVariable` to `finalize_bucket_sparse`?
> is it possible to do the division at the end, in finalize_backward?

That's a reasonable idea but I'm wary of it. For some networks that allreduce FP16 gradients at scale, we've noticed that post-allreduce averaging caused nonconvergence, but pre-allreduce averaging was fine. I'm not aware off the top of my head of any cases where we observed the reverse, so I prefer the predivision that DDP implements currently.
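For concreteness, here is what the two orderings look like with public `torch.distributed` calls (a sketch assuming an initialized process group; it illustrates the ordering only and makes no claim about any particular network's convergence). One plausible mechanism for why predivision behaves better with FP16 is that dividing first keeps the summed magnitudes smaller, though the comment above only reports the empirical observation.

```python
import torch
import torch.distributed as dist

def average_predivide_(grad: torch.Tensor) -> None:
    # Divide first, then sum across ranks: partial values shrink before the
    # reduction, so intermediate sums stay smaller (relevant for FP16 grads).
    grad.div_(dist.get_world_size())
    dist.all_reduce(grad)

def average_postdivide_(grad: torch.Tensor) -> None:
    # Sum first, then divide: the intermediate sum can grow by a factor of
    # world_size before the division.
    dist.all_reduce(grad)
    grad.div_(dist.get_world_size())
```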
Yes, it is true that nothing should access the grad between the allreduce hooks and `finalize_backward`. Although, my point was mostly from a consistency standpoint. In the local autograd engine case, we are modifying the grad here, and technically anyone looking at the variable's `.grad` after the `div_` operation would see a change. So if we're updating `.grad` here, we should update it in distributed autograd's map as well. The other option is to update neither `.grad` nor the dist autograd map. Although, I'd prefer the in-place `div_` to avoid creating a new tensor.

> Do we need to add runGradCallbackForVariable to finalize_bucket_sparse?

Yes, we do, and this was a bug in #37998.
```cpp
auto wrapped = c10::scalar_to_tensor(double(1.)/process_group_->getSize());
wrapped.unsafeGetTensorImpl()->set_wrapped_number(true);
// Divides while copying into the bucket view.
at::native::mul_out(bucket_view, grad, wrapped);
```
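At the Python level, the divide-while-copying idea corresponds to something like the following (illustrative only; the actual change is the C++ `mul_out` call above, and `world_size`, `grad`, and `bucket_view` here are stand-in values):

```python
import torch

world_size = 4                           # assumed for illustration
grad = torch.randn(8, 3, 4, 4).contiguous(memory_format=torch.channels_last)
bucket_view = torch.empty(grad.shape)    # stand-in for a flat-bucket slice

# One fused op: scale by 1/world_size while writing into the bucket view,
# instead of copying first and dividing separately. grad itself is unchanged.
torch.mul(grad, 1.0 / world_size, out=bucket_view)
```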
@mrshenli are the diffs here ok? grad is now divided while being copied into bucket_view, but it's still not being modified, so I think it's ok.
A more subtle thing is that with this PR's other diffs, grad may no longer be rowmajor contiguous. I'm not sure how that affects `dist_autograd` in general if the callback ends up routing into the `dist_autograd` context.
This LGTM.
Curious, what difference does `set_wrapped_number` make? Is it that `mul_out` only supports scalar tensors, or is there a shortcut for this?
Regarding dist autograd: since DDP + RPC is a very new feature that will be released as beta, it still needs more tests and apps to verify it and try it out. I think even if it does not work with this PR, it should not block us from landing this. We can fix that in follow-up PRs.
Hold on. Let me think again about dist autograd.
`at::native::mul_out` only has an overload that accepts three Tensor args, so I need to convert the scalar to a tensor.
I'm not sure exactly why `set_wrapped_number` in particular is needed. I don't think it's needed for the TensorIterator kernels inside `mul_out`, because the lambda capturing of CPU scalars for GPU kernels is based on `TensorIterator::is_cpu_scalar`, which does not rely on `Tensor::is_wrapped_number`. I included `set_wrapped_number` because it seems to be the standard practice when converting scalars to Tensors (e.g. in BinaryOps.cpp and python_arg_parser.cpp).
@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@jeffdaily I'm seeing timeouts for my DDP tests on ROCm. Is it expected that DDP tests with … pass?
I'm decorating these with skip_if_rocm in the meantime, but if it's expected that they pass we should figure out why they failed.
This is what I see in the CI log:
It looks to me like MIOpen has encountered kernels that it hasn't compiled yet. The error creating the MIOpen lock file indicates that two or more processes raced to compile the same kernel (the race was resolved when one couldn't acquire the lock). However, this has likely led to a timeout: one process got the MIOpen lock, compiled the kernel, then released the lock to the other process, which may or may not re-compile the same kernel (depending on timing). One solution might be to increase the timeout, which might get the test to pass for ROCm. Another solution will be available starting in ROCm 3.5, where we can install precompiled kernels as a separate deb package, but that isn't available to you yet. Skipping them for now to unblock your work is understandable for the short term. This wouldn't be the first time we had a timing issue caused by first-use compilation of MIOpen kernels.
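The race being described follows a familiar pattern: processes serialize first-use compilation on a lock file, and the loser waits for (or redoes) the work, which can blow past test timeouts. A rough sketch of that pattern in Python, purely to illustrate the timing issue (this is not MIOpen's actual implementation; `compile_fn` and the cache path are hypothetical):

```python
import fcntl
import os

def get_compiled_kernel(cache_path: str, compile_fn) -> str:
    # Two processes wanting the same kernel serialize on a lock file. The loser
    # blocks here while the winner compiles the kernel.
    lock_path = cache_path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(cache_path):   # loser may find the winner's output
                compile_fn(cache_path)           # slow first-use compile
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return cache_path
```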
@jeffdaily Thanks for the quick reply, just wanted to make sure you were aware.
@albanD is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…memory layout (#40129)

Summary: #34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)). This PR reverts the revert, and adds diffs that should repair the misconfigured test.

Pull Request resolved: #40129
Differential Revision: D22079377
Pulled By: albanD
fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
…yout (pytorch#34904)

Summary:

Currently, whether `AccumulateGrad` [steals](https://github.com/pytorch/pytorch/blob/67cb0184625ca3c30f44e02cc21ebfa7382c75c5/torch/csrc/autograd/functions/accumulate_grad.h#L42) or [clones](https://github.com/pytorch/pytorch/blob/67cb0184625ca3c30f44e02cc21ebfa7382c75c5/torch/csrc/autograd/functions/accumulate_grad.h#L80) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal, but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (the ["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads. This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again, as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout). The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes. I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general:

- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates the autograd docs to describe the potential BC-breaking changes below.

## BC notes

@ngimel @albanD @gchanan

#### BC-breaking

In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change. Any user code that was accustomed to `view(-1)`ing these grads will break.

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed. In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params. Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous. IMO this is a mild BC breakage. Param backward hooks still see grads come in with whatever format the backward kernel gave them. The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`. Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches

At @albanD's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place. Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout. After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`. This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.

One limitation (present before this PR and unchanged by this PR): presetting `param.grad` does not ensure in-place accumulation all the time. For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.

----------------------------

I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:

1. Make sure Reducer's ops sync with AccumulateGrad streams.
2. ~~To reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~~ The PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately. Without cat+div fusion, div-while-copying is the best we can do.
3. pytorch#38942 (`local_used_maps_dev_` when `find_unused_param=False`).

Pull Request resolved: pytorch#34904
Differential Revision: D20496044
Pulled By: albanD
fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
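To make the "BC escape hatches" paragraph above concrete, here is an illustrative user-level snippet (not code from the PR): preset `param.grad` to a zeroed tensor with the desired strides, and in the common in-place accumulation case that layout is kept.

```python
import torch

param = torch.randn(8, 3, 4, 4, requires_grad=True)

# Escape hatch: preset .grad with the desired (channels-last) strides so that
# in-place accumulation keeps that layout.
param.grad = torch.zeros_like(param).contiguous(memory_format=torch.channels_last)

(param * 2).sum().backward()   # common case: accumulates into the preset .grad in place
assert param.grad.is_contiguous(memory_format=torch.channels_last)
```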