Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas #44778
Conversation
💊 Dr. CI failures summary: as of commit f84274d, the commit was recently pushed; waiting for builds. (Automated comment, revised 132 times.)
Needs a test that covers all edge cases (sparse grads, grads that aren't non-overlapping and dense, grads on different devices, grads with different dtypes).
…o change points of use.
…ach passes aside from minor precision mismatch in addcdiv and addcmul
const auto inv_scale_val = *inv_scale_ptr; // Every thread accesses inv_scale, but it will hit in cache.
return static_cast<scalar_t>(inv_scale_val == 1.f ? fval : fval * inv_scale_val);
});
using opmath_t = get_opmath_t<scalar_t>::opmath_t;
what's wrong with using acc_type<scalar_t, true>? That's what's used in all other places.
If it does the mapping we want, that makes sense; I'll double-check the behavior. get_opmath_t doesn't do anything for integer types, while acc_type might. I'm not sure whether we do or don't want any pre/post-op casting to occur for integer types.
gpu_kernel(iter,
  [found_inf_ptr, inv_scale_ptr] GPU_LAMBDA (scalar_t val_in) -> scalar_t {
    opmath_t val = static_cast<opmath_t>(val_in);
    if (!isfinite_ensure_cuda_math(val)) {
time goes on, nature heals, maybe std::isfinite works now?
@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
std::vector<at::Tensor> vec_res;
vec_res.reserve(tensors.size());
Should we do the same for tensor_lists? Here and everywhere else.
I think I've already done so; ctrl+F "reserve(". Lmk if you spot any location I missed.
use_c10_dispatcher: full
device_guard: False
why does this have device_guard: False? (and, like, _amp_update_scale doesn't)
I'm imitating existing foreach functions, all of which use device_guard: False. That's a good point though; I should be more explicit about device guarding in my functions.
Now that you mention it, existing foreach functors might be dropping the ball on device guarding as well (they should do so manually, since they're all codegenned with device_guard: False). I'll double-check those as well.
multi_tensor_apply guards onto the first tensor in its lists.

My fallback single-tensor path (_amp_non_finite_check_and_unscale_cuda_), for tensors MTA can't handle, uses gpu_kernel. It's not obvious to me that gpu_kernel always guards onto its argument, so I added an explicit guard in the fallback.

Also, the tests cover incoming scaled_grads on different GPUs, forcing both the MTA path and the fallback path to execute on two devices (the fallback path is forced by creating some inputs as slices, so they're not non-overlapping and dense and therefore can't be handled by MTA).
TORCH_CHECK(scaled_grad.is_cuda(), "scaled_grad must be a CUDA tensor.");
// The only way we reach this function is through _amp_foreach_non_finite_check_and_unscale_cuda_, so no input checks.

// It's not obvious gpu_kernel always guards onto its argument. Guarding here just in case. |
it doesn't, only gpu_kernel_with_scalars does
Amp gradient unscaling is a great use case for multi-tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for torch.cuda.amp. @izdeby your interface was clean and straightforward to use, great work!

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from _amp_non_finite_check_and_unscale_ to _amp_foreach_non_finite_check_and_unscale_.

The PR also modifies Unary/Binary/Pointwise Functors to take the op as an ordinary argument rather than a template template parameter (template<class> class Op). This allows calling code to pass lambdas.

Open question: as written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops. Alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer.