
test_spectral_norm: Backward is not reentrant #13818

Open
ezyang opened this issue Nov 11, 2018 · 6 comments
Labels
module: autograd (Related to torch.autograd, and the autograd engine in general)
module: data parallel
module: norms and normalization
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Nov 11, 2018

I got this on a run:

Nov 09 21:56:23 ======================================================================
Nov 09 21:56:23 ERROR: test_spectral_norm (__main__.TestNN)
Nov 09 21:56:23 ----------------------------------------------------------------------
Nov 09 21:56:23 Traceback (most recent call last):
Nov 09 21:56:23   File "/var/lib/jenkins/workspace/test/common_utils.py", line 116, in wrapper
Nov 09 21:56:23     fn(*args, **kwargs)
Nov 09 21:56:23   File "test_nn.py", line 1864, in test_spectral_norm
Nov 09 21:56:23     torch.autograd.gradcheck(fn, (input.clone().requires_grad_(),))
Nov 09 21:56:23   File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 208, in gradcheck
Nov 09 21:56:23     return fail_test('Backward is not reentrant, i.e., running backward with same '
Nov 09 21:56:23   File "/opt/conda/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 185, in fail_test
Nov 09 21:56:23     raise RuntimeError(msg)
Nov 09 21:56:23 RuntimeError: Backward is not reentrant, i.e., running backward with same input and grad_output multiple times gives different values, although analytical gradient matches numerical gradient
Nov 09 21:56:23 

https://circleci.com/gh/pytorch/pytorch/215442?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano
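For context, here is a minimal sketch (not the actual torch/autograd/gradcheck.py source) of what the failing check boils down to, based on the error message: gradcheck runs backward twice with the same grad_output and requires the two input gradients to match exactly.

```python
# Minimal sketch of the reentrancy check, assuming a function `fn` of a single
# tensor input; the real gradcheck does this per input/output while building
# the analytical Jacobian.
import torch

def backward_is_reentrant(fn, inp):
    out = fn(inp)
    grad_out = torch.ones_like(out)
    g1, = torch.autograd.grad(out, inp, grad_out, retain_graph=True)
    g2, = torch.autograd.grad(out, inp, grad_out, retain_graph=True)
    # gradcheck compares the two runs exactly (no tolerance), which is what
    # trips on non-deterministic execution such as multi-GPU reductions.
    return torch.equal(g1, g2)
```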

@ssnl
Collaborator

ssnl commented Nov 12, 2018

@t-vi saw this on one of his PRs too, but I couldn't reproduce it. Well, if no one's looking at it by the time I get back to China and get myself a GPU machine, I can try taking a look.

@t-vi
Collaborator

t-vi commented Nov 12, 2018

I'll try to run this a few times to see if I can reproduce it.

@t-vi
Collaborator

t-vi commented Nov 12, 2018

So this is apparently the "multiple forward" bit (the last gradcheck in test_spectral_norm).

  • What we (I, at least) don't know is which apply_dp/requires_grad configuration the failure occurs in.
  • My first impression is that the failure is "multi-GPU" or "CPU" only. When I change the multi-GPU branch
    to run on a single GPU, it seems to succeed every time. (I'm on CUDA 9.2 currently; I have yet to try 8, if I manage to compile that.)
  • Would we believe that there are determinism issues for multi-GPU here? I think the spectral norm parameters are only updated on the first GPU; does that work correctly here (i.e., are all GPUs reset)? See the sketch below for the kind of setup in question.
  • The two instances I've seen are on CUDA 8. Do we think there might be non-determinism issues with CUDA 8?
  • @ezyang Do you have an overview of what the actual failure rate is? Is it all on CUDA 8? Is that a test run with multiple GPUs, and is it the only one?

Unfortunately, I still haven't bought a computer large enough to put my second GPU in...
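As a hypothetical reconstruction (not the actual test_nn.py code), the failing "multiple forward" gradcheck exercises roughly this kind of setup: a spectral_norm'd layer replicated with nn.DataParallel, with several forwards per gradcheck evaluation. Device ids, sizes, and the helper name `fn` below are illustrative only.

```python
import torch
import torch.nn as nn

if torch.cuda.device_count() >= 2:
    # Double precision, as gradcheck requires.
    m = nn.utils.spectral_norm(nn.Linear(5, 7).double().cuda())
    dp = nn.DataParallel(m, device_ids=[0, 1])  # power-iteration state lives on device 0
    x = torch.randn(4, 5, dtype=torch.double, device="cuda:0", requires_grad=True)

    def fn(inp):
        # Two forwards per evaluation -- the "multiple forward" case.
        # (The real test also manages spectral_norm's power-iteration state; omitted here.)
        return dp(inp) + dp(inp)

    torch.autograd.gradcheck(fn, (x.clone().requires_grad_(),))
```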

@ezyang
Contributor Author

ezyang commented Nov 12, 2018

Another one: https://circleci.com/gh/pytorch/pytorch/221279?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

I haven't seen very many failures, but our sample size of three (with the one above) only had CUDA 8 failures. So it might be a CUDA 8-only thing; nondeterminism also sounds likely to me. I don't know what the implementation looks like.

Re apply_dp versus requires_grad, it might be a good start to split up the test into its various configurations, so we can get that information.

@avmgithub
Contributor

I've seen this multiple times on my ppc64le builds. This is the latest: https://powerci.osuosl.org/user/avmgithub/my-views/view/PyTorch/job/pytorch-linux-cuda92-cudnn7-py3-mpi-build-test-gpu/165/console (Ubuntu 16.04, CUDA 9.2, cuDNN 7).

@colesbury
Member

Yeah, this isn't CUDA 8 specific. There is some non-determinism due to the summing of DataParallel outputs. I imagine it has something to do with the thread-per-GPU design: threads can add work to each other's queues, which can lead to a non-deterministic execution order.

We should relax gradcheck to allow for some small non-determinism in this case, either as an optional argument to gradcheck or just by default.
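As a toy illustration of that point (plain Python, nothing PyTorch-specific): floating-point addition is not associative, so if the per-replica contributions are reduced in a different order on different runs, the summed result, and hence the backward result, can differ in the last bits even though both are "correct".

```python
# Floating-point addition is not associative, so reduction order matters.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```

An exact-equality reentrancy check will therefore flag such runs even when the gradients agree to within floating-point error, which is the motivation for relaxing gradcheck here.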

colesbury added a commit to colesbury/pytorch that referenced this issue Nov 13, 2018
facebook-github-bot pushed a commit that referenced this issue Nov 13, 2018
Summary:
See #13818 for suggestions about a long-term fix
Pull Request resolved: #13908

Differential Revision: D13047262

Pulled By: colesbury

fbshipit-source-id: 0f29bd5b659bb97826381abbc305fb8a25b131ed
facebook-github-bot pushed a commit that referenced this issue May 29, 2019
…20980)

Summary:
gradcheck currently includes a determinism check (though it only runs the backward twice and checks whether the results match).
This can lead to flaky tests, e.g. in #20971, but also #13818.
This adds nondet_tol for both gradcheck and gradgradcheck. It does not change or re-enable any tests yet.
Pull Request resolved: #20980

Differential Revision: D15530129

Pulled By: soumith

fbshipit-source-id: 04d7f85b5b59cd62867820c74b064ba14f4fa7f8
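For reference, a small usage sketch of the nondet_tol argument this change adds (the function and tolerance values are illustrative, not taken from the PyTorch test suite): tests whose backward goes through non-deterministic paths such as DataParallel reductions can pass a small tolerance instead of requiring exact agreement between repeated backward runs.

```python
import torch
from torch.autograd import gradcheck, gradgradcheck

x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)

def fn(t):
    return (t * t).sum()

# Allow a tiny difference between repeated backward runs instead of
# failing with "Backward is not reentrant".
gradcheck(fn, (x,), nondet_tol=1e-10)
gradgradcheck(fn, (x,), nondet_tol=1e-10)
```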
@heitorschueroff added the module: norms and normalization, module: autograd, and triaged labels May 17, 2021
7 participants