Add support for non-affine batch norm with float stats and half inputs #22750

Closed · wants to merge 13 commits

Conversation

@ptrblck (Collaborator) commented Jul 11, 2019

This PR adds support for non-affine batch norm with float running estimates and half inputs.
Changes were made similar to #16735.

I couldn't find a specific test for SyncBatchNorm, so I used this code (https://gist.github.com/ptrblck/ab45bfcde6df55ac28a7be18531f4718) to test it.
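For reference, a minimal sketch of the case this PR enables (module choice, shapes, and the printed dtypes are illustrative, not taken from the linked gist):

```python
import torch
import torch.nn as nn

# Non-affine batch norm: no learnable weight/bias, but running stats are tracked.
# The running estimates stay in float32 while the input and output are float16.
bn = nn.BatchNorm2d(3, affine=False, track_running_stats=True).cuda()
x = torch.randn(8, 3, 16, 16, device='cuda', dtype=torch.half)

out = bn(x)
print(out.dtype)              # torch.float16
print(bn.running_mean.dtype)  # torch.float32
```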

cc @ngimel

@pytorchbot added the module: cuda, module: nn, and module: operators labels on Jul 11, 2019
@ptrblck (Collaborator Author) commented Jul 11, 2019

CC @jjsjann123

@ptrblck (Collaborator Author) commented Jul 11, 2019

Lint issue seems to come from here:
#21323 (comment)

@jjsjann123 (Collaborator) commented:

Great work! Looks like we are doing the right thing everywhere.

Thanks very much for taking care of this!

@ptrblck (Collaborator Author) commented Jul 12, 2019

Thanks for the review @jjsjann123!

@ptrblck (Collaborator Author) commented Jul 12, 2019

Please don't merge it yet, as I would like to run some additional tests first.

@jerryzh168 added the triaged label on Jul 13, 2019
@jerryzh168 requested a review from li-roy on July 13, 2019 at 00:55
@ptrblck (Collaborator Author) commented Jul 16, 2019

Sorry for blocking this PR.
I've added support for backward calls, so that it can be reviewed now.

Thanks @jjsjann123 for the support! 🙂
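A quick sketch of the backward path mentioned above (shapes and the reduction are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

# Backward through non-affine batch norm with float32 running stats and a
# half input: the input gradient now comes back in fp16 as well.
bn = nn.BatchNorm2d(3, affine=False).cuda()
x = torch.randn(4, 3, 8, 8, device='cuda', dtype=torch.half, requires_grad=True)

bn(x).sum().backward()
print(x.grad.dtype)  # torch.float16
```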

scalar_t inp = input[batch][plane][x];
accscalar_t proj = (inp - mean) * proj_scale;
grad_input[batch][plane][x] = static_cast<scalar_t>((go - proj - grad_mean) * grad_scale);
input_scalar_t inp = input[batch][plane][x];
Review comment (Collaborator):

Shouldn't inp be stat_accscalar_t, since it is cast right after?
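For context on why the kernels carry a separate fp32 accumulation type at all, a standalone illustration (not code from this diff):

```python
import torch

# fp16 tops out around 65504 and its spacing grows with magnitude, so
# per-channel sums of many activations can overflow or stop growing if kept
# in fp16. The kernels therefore accumulate statistics in a float32
# "accscalar" type and only cast back at the end.
print(torch.tensor(204800.0).half())                 # tensor(inf, dtype=torch.float16)
print(torch.tensor(2048.0, dtype=torch.half) + 1.0)  # tensor(2048., dtype=torch.float16)
```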

stat_accscalar_t m_c = mean[plane];
stat_accscalar_t m_dy_c = mean_dy[plane];
stat_accscalar_t factor_1_c = invstd[plane];
stat_accscalar_t factor_2_c = weight.size(0) > 0 ? static_cast<stat_accscalar_t>(weight[plane]) : static_cast<stat_accscalar_t>(1);
Review comment (Collaborator):

The last one could just be stat_accscalar_t(1)

@@ -765,22 +765,22 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> batch_norm_backward_reduce_cuda_templ
mean_dy_ = at::empty_like(mean_);
mean_dy_xmu_ = at::empty_like(mean_);
}
- auto grad_options = grad_out_.options();
+ auto grad_options = mean_.options();
Review comment (Collaborator):

Not sure if we discussed it here before.
IIRC, mean_ is passed in here as calculated by batch_norm_gather_stats_with_counts:

mean_dy, mean_dy_xmu, grad_weight, grad_bias = torch.batch_norm_backward_reduce(
grad_output,
saved_input,
mean,
invstd,
self.needs_input_grad[0],
self.needs_input_grad[1],
self.needs_input_grad[2]
)

If we have input and weight both in fp16, will this break?
We might have to pass in an optional variable (either weight or bias) here to make the decision; it cannot be deduced from the given information.
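To make the concern concrete at the module level, a sketch of the scenario under discussion (this uses the public nn.BatchNorm2d rather than the internal torch.batch_norm_backward_reduce call):

```python
import torch
import torch.nn as nn

# An fp16 layer with an fp16 input: grad_weight and grad_bias come back in the
# weight dtype (fp16). In the SyncBatchNorm path the gathered mean_ is float32
# after this PR, so allocating the grad buffers from mean_.options() would give
# them the wrong dtype; they need to follow the (optional) weight instead.
bn = nn.BatchNorm2d(3).cuda().half()   # fp16 weight, bias, and running stats
x = torch.randn(4, 3, 8, 8, device='cuda', dtype=torch.half, requires_grad=True)

bn(x).sum().backward()
print(bn.weight.grad.dtype)  # torch.float16
print(x.grad.dtype)          # torch.float16
```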

Review comment (Collaborator):

But I could imagine that we must have an fp16 layer with fp16 input. If the unit test is fine, I can take another look at this with a clearer head tomorrow morning :)

Reply (Collaborator Author):

This makes sense, and I'm now passing the weight tensor to this method. Since it can be None, I'm now calling weight_.options() for grad_weight_ and grad_bias_ separately.

Review comment (Collaborator):

Looks good. Thanks for the hard work.

@jjsjann123 (Collaborator) commented:

Looks good to me except for some code cleanup and that grad_options thing I mentioned there.

@soumith (Member) commented Jul 22, 2019

@pytorchbot rebase this please

@ptrblck (Collaborator Author) commented Aug 2, 2019

Any pointers on this failing test:

00:53:35 FAIL: test_wrong_cuda_fork (__main__.TestMultiprocessing)
00:53:35 ----------------------------------------------------------------------
00:53:35 Traceback (most recent call last):
00:53:35   File "test_multiprocessing.py", line 496, in test_wrong_cuda_fork
00:53:35     you must use the 'spawn' start method")
00:53:35 AssertionError: Regex didn't match: "Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" not found in 'Traceback (most recent call last):\r\n  File "<string>", line 1, in <module>\r\n  File "C:\\Jenkins\\Miniconda3\\lib\\multiprocessing\\spawn.py", line 105, in spawn_main\r\n    exitcode = _main(fd)\r\n  File "C:\\Jenkins\\Miniconda3\\lib\\multiprocessing\\spawn.py", line 115, in _main\r\n    self = reduction.pickle.load(from_parent)\r\nAttributeError: Can\'t get attribute \'run\' on <module \'__main__\' (built-in)>\r\nTraceback (most recent call last):\r\n  File "<string>", line 1, in <module>\r\n  File "C:\\Jenkins\\Miniconda3\\lib\\multiprocessing\\spawn.py", line 105, in spawn_main\r\n    exitcode = _main(fd)\r\n  File "C:\\Jenkins\\Miniconda3\\lib\\multiprocessing\\spawn.py", line 115, in _main\r\n    self = reduction.pickle.load(from_parent)\r\nAttributeError: Can\'t get attribute \'run\' on <module \'__main__\' (built-in)>\r\n'

Is this a valid failure?

@ptrblck (Collaborator Author) commented Aug 6, 2019

@pytorchbot retest this please

@ptrblck (Collaborator Author) commented Aug 7, 2019

@pytorchbot retest this please

@ptrblck (Collaborator Author) commented Aug 7, 2019

@pytorchbot rebase this please

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 29, 2019
Add support for non-affine batch norm with float stats and half inputs (#22750)

Summary:
This PR adds support for non-affine batch norm with float running estimates and half inputs.
Changes were made similar to pytorch/pytorch#16735.

I couldn't find a specific test for `SyncBatchNorm`, so I used [this code](https://gist.github.com/ptrblck/ab45bfcde6df55ac28a7be18531f4718) to test it.

cc ngimel
Pull Request resolved: pytorch/pytorch#22750

Differential Revision: D17119965

Pulled By: ezyang

fbshipit-source-id: 2e8c5d63fc3c636b8a1338c43c9c101a0f5e9b22
@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 8640aef.

Labels: Merged · module: cuda (Related to torch.cuda, and CUDA support in general) · module: nn (Related to torch.nn) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
8 participants