
test_spectral_norm: Backward is not reentrant #13818

Open
ezyang opened this issue Nov 11, 2018 · 6 comments
Labels
module: autograd (Related to torch.autograd, and the autograd engine in general)
module: data parallel
module: norms and normalization
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Nov 11, 2018

I got this on a run:

Nov 09 21:56:23 ======================================================================
Nov 09 21:56:23 ERROR: test_spectral_norm (__main__.TestNN)
Nov 09 21:56:23 ----------------------------------------------------------------------
Nov 09 21:56:23 Traceback (most recent call last):
Nov 09 21:56:23   File "/var/lib/jenkins/workspace/test/common_utils.py", line 116, in wrapper
Nov 09 21:56:23     fn(*args, **kwargs)
Nov 09 21:56:23   File "test_nn.py", line 1864, in test_spectral_norm
Nov 09 21:56:23     torch.autograd.gradcheck(fn, (input.clone().requires_grad_(),))
Nov 09 21:56:23   File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 208, in gradcheck
Nov 09 21:56:23     return fail_test('Backward is not reentrant, i.e., running backward with same '
Nov 09 21:56:23   File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 185, in fail_test
Nov 09 21:56:23     raise RuntimeError(msg)
Nov 09 21:56:23 RuntimeError: Backward is not reentrant, i.e., running backward with same input and grad_output multiple times gives different values, although analytical gradient matches numerical gradient
Nov 09 21:56:23 

https://circleci.com/gh/pytorch/pytorch/215442?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano
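For context, here is a minimal sketch (not the actual torch/autograd/gradcheck.py source) of what the failing check boils down to, based on the error message: gradcheck runs backward twice with the same grad_output and requires the two input gradients to match exactly.

```python
# Minimal sketch of the reentrancy check, assuming a function `fn` of a single
# tensor input; the real gradcheck does this per input/output while building
# the analytical Jacobian.
import torch

def backward_is_reentrant(fn, inp):
    out = fn(inp)
    grad_out = torch.ones_like(out)
    g1, = torch.autograd.grad(out, inp, grad_out, retain_graph=True)
    g2, = torch.autograd.grad(out, inp, grad_out, retain_graph=True)
    # gradcheck compares the two runs exactly (no tolerance), which is what
    # trips on non-deterministic execution such as multi-GPU reductions.
    return torch.equal(g1, g2)
```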

@ssnl
Collaborator

ssnl commented Nov 12, 2018

@t-vi saw this on one of his PRs too, but I couldn't reproduce it. Well, if no one's looking at it by the time I get back to China and get myself a GPU machine, I can try taking a look.

@t-vi
Collaborator

t-vi commented Nov 12, 2018

I'll try to run this a few times to see if I can reproduce it.

@t-vi
Collaborator

t-vi commented Nov 12, 2018

So this is apparently the "multiple forward" bit (the last gradcheck in test_spectral_norm).

  • What we (I, at least) don't know is which apply_dp/requires_grad configuration the failure occurs in.
  • My first impression is that the failure is "multi-GPU" or "CPU" only. When I change the multi-GPU branch
    to run on a single GPU, it seems to succeed every time. (I'm on CUDA 9.2 currently; I have yet to try 8, if I manage to compile that.)
  • Would we believe that there are determinism issues for multi-GPU here? I think the spectral norm parameters are only updated on the first GPU; does that work correctly here (i.e., are all GPUs reset)? See the sketch below for the kind of setup in question.
  • The two instances I've seen are on CUDA 8. Do we think there might be non-determinism issues with CUDA 8?
  • @ezyang Do you have an overview of what the actual failure rate is? Is it all on CUDA 8? Is that a test run with multiple GPUs, and is it the only one?

Unfortunately, I still haven't bought a computer large enough to put my second GPU in...
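As a hypothetical reconstruction (not the actual test_nn.py code), the failing "multiple forward" gradcheck exercises roughly this kind of setup: a spectral_norm'd layer replicated with nn.DataParallel, with several forwards per gradcheck evaluation. Device ids, sizes, and the helper name `fn` below are illustrative only.

```python
import torch
import torch.nn as nn

if torch.cuda.device_count() >= 2:
    # Double precision, as gradcheck requires.
    m = nn.utils.spectral_norm(nn.Linear(5, 7).double().cuda())
    dp = nn.DataParallel(m, device_ids=[0, 1])  # power-iteration state lives on device 0
    x = torch.randn(4, 5, dtype=torch.double, device="cuda:0", requires_grad=True)

    def fn(inp):
        # Two forwards per evaluation -- the "multiple forward" case.
        # (The real test also manages spectral_norm's power-iteration state; omitted here.)
        return dp(inp) + dp(inp)

    torch.autograd.gradcheck(fn, (x.clone().requires_grad_(),))
```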

@ezyang
Contributor Author

ezyang commented Nov 12, 2018

Another one: https://circleci.com/gh/pytorch/pytorch/221279?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

I haven't seen very many failures, but our sample size of three (with the one above) only had CUDA 8 failures. So it might be a CUDA 8-only thing; nondeterminism also sounds likely to me. I don't know what the implementation looks like.

Re apply_dp versus requires_grad, it might be a good start to split up the test into its various configurations, so we can get that information.

@avmgithub
Contributor

I've seen this multiple times on my ppc64le builds. This is the latest: https://powerci.osuosl.org/user/avmgithub/my-views/view/PyTorch/job/pytorch-linux-cuda92-cudnn7-py3-mpi-build-test-gpu/165/console (Ubuntu 16.04, CUDA 9.2, cuDNN 7).

@colesbury
Member

Yeah, this isn't CUDA 8 specific. There is some non-determinism due to the summing of DataParallel outputs. I imagine it has something to do with the thread-per-GPU design: threads can add work to each other's queues, which can lead to a non-deterministic execution order.

We should relax gradcheck to allow for some small non-determinism in this case, either as an optional argument to gradcheck or just by default.
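As a toy illustration of that point (plain Python, nothing PyTorch-specific): floating-point addition is not associative, so if the per-replica contributions are reduced in a different order on different runs, the summed result, and hence the backward result, can differ in the last bits even though both are "correct".

```python
# Floating-point addition is not associative, so reduction order matters.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```

An exact-equality reentrancy check will therefore flag such runs even when the gradients agree to within floating-point error, which is the motivation for relaxing gradcheck here.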

colesbury added a commit to colesbury/pytorch that referenced this issue Nov 13, 2018
facebook-github-bot pushed a commit that referenced this issue Nov 13, 2018
Summary:
See #13818 for suggestions about a long-term fix
Pull Request resolved: #13908

Differential Revision: D13047262

Pulled By: colesbury

fbshipit-source-id: 0f29bd5b659bb97826381abbc305fb8a25b131ed
facebook-github-bot pushed a commit that referenced this issue May 29, 2019
…20980)

Summary:
gradcheck currently includes a determinism check (though it only runs the backward twice and checks whether the results match).
This can lead to flaky tests, e.g. in #20971, but also #13818.
This adds nondet_tol for both gradcheck and gradgradcheck. It does not change or re-enable any tests yet.
Pull Request resolved: #20980

Differential Revision: D15530129

Pulled By: soumith

fbshipit-source-id: 04d7f85b5b59cd62867820c74b064ba14f4fa7f8
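For reference, a small usage sketch of the nondet_tol argument this change adds (the function and tolerance values are illustrative, not taken from the PyTorch test suite): tests whose backward goes through non-deterministic paths such as DataParallel reductions can pass a small tolerance instead of requiring exact agreement between repeated backward runs.

```python
import torch
from torch.autograd import gradcheck, gradgradcheck

x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)

def fn(t):
    return (t * t).sum()

# Allow a tiny difference between repeated backward runs instead of
# failing with "Backward is not reentrant".
gradcheck(fn, (x,), nondet_tol=1e-10)
gradgradcheck(fn, (x,), nondet_tol=1e-10)
```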
@heitorschueroff added the module: norms and normalization, module: autograd, and triaged labels May 17, 2021
7 participants