Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas #44778
Conversation
💊 Dr. CI failures summary: as of commit f84274d, the commit was recently pushed; waiting for builds. (Automated comment, revised 132 times.)
Needs a test that covers all edge cases (sparse grads, grads that aren't non-overlapping and dense, grads on different devices, grads with different dtypes).
…o change points of use.
…ach passes aside from minor precision mismatch in addcdiv and addcmul
const auto inv_scale_val = *inv_scale_ptr; // Every thread accesses inv_scale, but it will hit in cache.
return static_cast<scalar_t>(inv_scale_val == 1.f ? fval : fval * inv_scale_val);
});
using opmath_t = get_opmath_t<scalar_t>::opmath_t;
what's wrong with using acc_type<scalar_t, true>? That's what's used in all other places.
If it does the mapping we want, that makes sense; I'll double-check the behavior. get_opmath_t doesn't do anything for integer types, while acc_type might. I'm not sure whether we do or don't want any pre/post-op casting to occur for integer types.
gpu_kernel(iter,
  [found_inf_ptr, inv_scale_ptr] GPU_LAMBDA (scalar_t val_in) -> scalar_t {
    opmath_t val = static_cast<opmath_t>(val_in);
    if (!isfinite_ensure_cuda_math(val)) {
time goes on, nature heals, maybe std::isfinite works now?
@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
std::vector<at::Tensor> vec_res;
vec_res.reserve(tensors.size());
Should we do the same for tensor_lists? Here and everywhere else.
I think I've already done so; ctrl+F "reserve(". Lmk if you spot any location I missed.
use_c10_dispatcher: full
device_guard: False
why does this have device_guard: False? (and, like, _amp_update_scale doesn't)
I'm imitating existing foreach functions, all of which use device_guard: False. That's a good point though; I should be more explicit about device guarding in my functions.
Now that you mention it, existing foreach functors might be dropping the ball on device guarding as well (they should do so manually, since they're all codegenned with device_guard: False). I'll double-check those as well.
multi_tensor_apply guards onto the first tensor in its lists.

My fallback single-tensor path (_amp_non_finite_check_and_unscale_cuda_), for tensors MTA can't handle, uses gpu_kernel. It's not obvious to me that gpu_kernel always guards onto its argument, so I added an explicit guard in the fallback.

Also, the tests cover incoming scaled_grads on different GPUs, forcing both the MTA path and the fallback path to execute on two devices (the fallback path is forced by creating some inputs as slices, so they're not non-overlapping and dense and therefore can't be handled by MTA).
TORCH_CHECK(scaled_grad.is_cuda(), "scaled_grad must be a CUDA tensor.");
// The only way we reach this function is through _amp_foreach_non_finite_check_and_unscale_cuda_, so no input checks.

// It's not obvious gpu_kernel always guards onto its argument. Guarding here just in case. |
it doesn't, only gpu_kernel_with_scalars does
Amp gradient unscaling is a great use case for multi-tensor apply (in fact it's the first case I wrote it for). This PR adds an MTA unscale+infcheck functor. Really excited to have it for torch.cuda.amp. @izdeby your interface was clean and straightforward to use, great work!

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from _amp_non_finite_check_and_unscale_ to _amp_foreach_non_finite_check_and_unscale_.

The PR also modifies Unary/Binary/Pointwise Functors to take the op as an ordinary argument rather than a template template parameter (template<class> class Op). This allows calling code to pass lambdas.

Open question: as written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops. Alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control. I can easily rewrite it that way if you prefer.