Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas #44778

Closed
wants to merge 38 commits

Conversation

mcarilli
Collaborator

@mcarilli mcarilli commented Sep 16, 2020

Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for torch.cuda.amp. @izdeby your interface was clean and straightforward to use, great work!

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from _amp_non_finite_check_and_unscale_ to _amp_foreach_non_finite_check_and_unscale_.
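For readers unfamiliar with what the op computes, the per-element work is roughly the following standalone CUDA sketch (illustrative names like unscale_and_check, not the PR's actual multi_tensor_apply functor; compile with nvcc):

// Standalone sketch of the per-element unscale+infcheck work: each thread scales its gradient
// element by inv_scale and raises a shared found_inf flag if the element is Inf/NaN.
#include <cfloat>
#include <cmath>
#include <cstdio>

__global__ void unscale_and_check(float* grad, int n, const float* inv_scale, float* found_inf) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = grad[i];
    // !(|v| <= FLT_MAX) is true for Inf and NaN, sidestepping isfinite portability questions.
    if (!(fabsf(v) <= FLT_MAX)) {
      *found_inf = 1.f;             // benign race: every writer stores the same value
    }
    grad[i] = v * (*inv_scale);     // every thread reads inv_scale; it stays hot in cache
  }
}

int main() {
  const int n = 4;
  float h_grad[n] = {2.f, 4.f, INFINITY, 8.f};
  float h_inv_scale = 0.5f, h_found_inf = 0.f;
  float *d_grad, *d_inv_scale, *d_found_inf;
  cudaMalloc(&d_grad, n * sizeof(float));
  cudaMalloc(&d_inv_scale, sizeof(float));
  cudaMalloc(&d_found_inf, sizeof(float));
  cudaMemcpy(d_grad, h_grad, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_inv_scale, &h_inv_scale, sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_found_inf, &h_found_inf, sizeof(float), cudaMemcpyHostToDevice);
  unscale_and_check<<<1, 64>>>(d_grad, n, d_inv_scale, d_found_inf);
  cudaMemcpy(&h_found_inf, d_found_inf, sizeof(float), cudaMemcpyDeviceToHost);
  printf("found_inf = %.1f\n", h_found_inf);  // 1.0, because of the Inf entry
  return 0;
}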

The PR also modifies Unary/Binary/Pointwise Functors to

  • do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision (and throughput, on some architectures!) and has no downside for the ops we care about.
  • accept an instantiated op functor rather than an op functor template (template<class> class Op). This allows calling code to pass lambdas (a standalone sketch of both points follows below).
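As a standalone sketch of both bullets (hypothetical names opmath_type, apply_op, and MulBy rather than ATen's get_opmath_t / ForeachFunctors machinery; compile with nvcc), casting FP16 storage up to FP32 for the op's internal math and accepting an already-instantiated callable looks like this:

#include <cuda_fp16.h>
#include <cstdio>

// Map low-precision storage types to the type used for the op's internal math.
template <typename T> struct opmath_type { using type = T; };
template <> struct opmath_type<__half> { using type = float; };

// Fill, apply-op, and read-back kernels, so no host-side __half conversions are needed.
__global__ void fill(__half* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = __float2half(float(i));
}

template <typename scalar_t, typename op_t>
__global__ void apply_op(scalar_t* data, int n, op_t op) {
  using opmath_t = typename opmath_type<scalar_t>::type;
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    opmath_t v = static_cast<opmath_t>(data[i]);   // pre-cast to fp32
    data[i] = static_cast<scalar_t>(op(v));        // op runs in fp32, post-cast back to fp16
  }
}

__global__ void to_float(const __half* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __half2float(in[i]);
}

// An already-instantiated functor; the PR lets callers pass GPU_LAMBDAs in the same position.
struct MulBy {
  float scale;
  __device__ float operator()(float v) const { return v * scale; }
};

int main() {
  const int n = 8;
  __half* d_h;  float* d_f;  float h_out[n];
  cudaMalloc(&d_h, n * sizeof(__half));
  cudaMalloc(&d_f, n * sizeof(float));
  fill<<<1, 64>>>(d_h, n);
  apply_op<<<1, 64>>>(d_h, n, MulBy{0.5f});
  to_float<<<1, 64>>>(d_h, n, d_f);
  cudaMemcpy(h_out, d_f, n * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; ++i) printf("%g ", h_out[i]);
  printf("\n");
  return 0;
}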

Open question: as written, the PR has the MTA functors handle the casting themselves: FP16/bfloat16 inputs are cast to FP32 before the op runs and the results are cast back afterward. Alternatively, the pre- and post-math casting could be written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer.

@dr-ci

dr-ci bot commented Sep 16, 2020

💊 CI failures summary and remediations

As of commit f84274d (more details on the Dr. CI page):


Commit f84274d was recently pushed. Waiting for builds...



Contributor

@definitelynotmcarilli definitelynotmcarilli left a comment

Needs a test that covers all edge cases (sparse grads, grads that aren't non-overlapping and dense, grads on different devices, grads with different dtypes).

@mcarilli mcarilli changed the title from "[WIP] Use MTA for amp grad unscaling, and fix op math type in MTA functors" to "[WIP] Use MTA for amp grad unscaling, and enforce op math type in MTA functors" on Sep 17, 2020
const auto inv_scale_val = *inv_scale_ptr; // Every thread accesses inv_scale, but it will hit in cache.
return static_cast<scalar_t>(inv_scale_val == 1.f ? fval : fval*inv_scale_val);
});
using opmath_t = get_opmath_t<scalar_t>::opmath_t;
Collaborator

what's wrong with using acc_type<scalar_t, true>? That's what's used in all other places.

Collaborator Author

If it does the mapping we want, that makes sense. I'll double check the behavior. get_opmath_t doesn't do anything for integer types, while acc_type might. I'm not sure if we do or don't want any pre/post op casting to occur for integer types.
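For concreteness, the kind of difference being discussed would look like the following standalone sketch (hypothetical trait names, not ATen's actual get_opmath_t or acc_type): an opmath-style mapping that only widens half to float, versus an accumulate-style mapping that also widens integral types.

#include <cuda_fp16.h>
#include <cstdint>
#include <type_traits>

// Opmath-style mapping: identity except for low-precision floating point.
template <typename T> struct opmath_for { using type = T; };
template <> struct opmath_for<__half> { using type = float; };

// Accumulate-style mapping: additionally widens integral types.
template <typename T> struct acc_for { using type = T; };
template <> struct acc_for<__half>   { using type = float; };
template <> struct acc_for<int32_t>  { using type = int64_t; };

static_assert(std::is_same<opmath_for<int32_t>::type, int32_t>::value,
              "opmath-style mapping leaves integer types alone");
static_assert(std::is_same<acc_for<int32_t>::type, int64_t>::value,
              "accumulate-style mapping widens them");

int main() { return 0; }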

Review thread on aten/src/ATen/native/cuda/AmpKernels.cu (resolved, outdated)
Review thread on aten/src/ATen/native/cuda/AmpKernels.cu (resolved)
gpu_kernel(iter,
[found_inf_ptr, inv_scale_ptr] GPU_LAMBDA (scalar_t val_in) -> scalar_t {
opmath_t val = static_cast<opmath_t>(val_in);
if (!isfinite_ensure_cuda_math(val)) {
Collaborator

time goes on, nature heals, maybe std::isfinite works now?

Review thread on aten/src/ATen/native/cuda/AmpKernels.cu (resolved)
Review thread on aten/src/ATen/native/cuda/AmpKernels.cu (resolved)
Review thread on aten/src/ATen/native/cuda/ForeachFunctors.cuh (resolved, outdated)
Review thread on aten/src/ATen/native/native_functions.yaml (resolved, outdated)
@mcarilli mcarilli changed the title from "[WIP] Use MTA for amp grad unscaling, and enforce op math type in MTA functors" to "[WIP] Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas" on Sep 17, 2020
Contributor

@facebook-github-bot facebook-github-bot left a comment

@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

std::vector<at::Tensor> vec_res;
vec_res.reserve(tensors.size());
Contributor

Should we do the same for tensor_lists? Here and everywhere else

Collaborator Author

@mcarilli mcarilli Sep 25, 2020

I think I've already done so; Ctrl+F "reserve(". Let me know if you spot any location I missed.

Review thread on tools/codegen/model.py (resolved, outdated)
use_c10_dispatcher: full
device_guard: False
Contributor

Why does this have device_guard: False? (_amp_update_scale, for example, doesn't.)

Collaborator Author

@mcarilli mcarilli Sep 29, 2020

I'm imitating existing foreach functions, all of which use device_guard: False. That's a good point, though; I should be more explicit about device guarding in my functions.

Collaborator Author

Now that you mention it, existing foreach functors might be dropping the ball on device guarding as well (they should do so manually, since they're all codegenned with device_guard: False). I'll double-check those as well.

Collaborator Author

@mcarilli mcarilli Sep 29, 2020

multi_tensor_apply guards onto the first tensor in its lists.
My fallback single-tensor path (_amp_non_finite_check_and_unscale_cuda_), for tensors MTA can't handle, uses gpu_kernel. It's not obvious to me that gpu_kernel always guards onto its argument, so I added an explicit guard in the fallback.

Also, the tests cover incoming scaled_grads on different GPUs, forcing both the MTA path and the fallback path to execute on two devices (the fallback path is forced by creating some inputs as slices, so they're not non-overlapping and dense and therefore can't be handled by MTA).
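For reference, "guarding onto" a device boils down to the following CUDA runtime pattern; this is an illustrative RAII helper, not c10's actual CUDAGuard:

#include <cuda_runtime.h>
#include <cstdio>

// Minimal RAII device guard: set the target device on construction, restore the old one on exit.
struct DeviceGuardSketch {
  int prev = 0;
  explicit DeviceGuardSketch(int target) {
    cudaGetDevice(&prev);
    cudaSetDevice(target);
  }
  ~DeviceGuardSketch() { cudaSetDevice(prev); }
};

__global__ void touch(float* p) { if (threadIdx.x == 0) *p = 1.f; }

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  int target = (count > 1) ? 1 : 0;   // prefer a second GPU when one exists
  {
    DeviceGuardSketch guard(target);  // allocations and launches below now target that device
    float* buf = nullptr;
    cudaMalloc(&buf, sizeof(float));
    touch<<<1, 32>>>(buf);
    cudaDeviceSynchronize();
    cudaFree(buf);
  }                                   // previous current device restored here
  printf("ran on device %d\n", target);
  return 0;
}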

Contributor

@facebook-github-bot facebook-github-bot left a comment

@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

TORCH_CHECK(scaled_grad.is_cuda(), "scaled_grad must be a CUDA tensor.");
// The only way we reach this function is through _amp_foreach_non_finite_check_and_unscale_cuda_, so no input checks.

// It's not obvious gpu_kernel always guards onto its argument. Guarding here just in case.
Collaborator

it doesn't, only gpu_kernel_with_scalars does

Contributor

@facebook-github-bot facebook-github-bot left a comment

@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@izdeby merged this pull request in 72bc3d9.

Labels
Merged
module: bc-breaking (Related to a BC-breaking change)
open source
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

8 participants