test_spectral_norm: Backward is not reentrant #13818
Comments
@t-vi saw this on one of his PRs too, but I couldn't reproduce it. If no one is looking at it by the time I get back to China and get myself a GPU machine, I can try taking a look.
I'll try to run this a few times to see if I can reproduce it.
So this is the "multiple forward" bit (the last gradcheck in test_spectral_norm) apparently.
Unfortunately, I still didn't buy a computer large enough to put my second GPU in...
I haven't seen very many failures, but our sample size of three (with the one above) only had cuda8 failures. So it might be a cuda8-only thing, though nondeterminism also sounds likely to me. I don't know what the implementation looks like. Re
I've seen this multiple times on my ppc64le builds. This is the latest: https://powerci.osuosl.org/user/avmgithub/my-views/view/PyTorch/job/pytorch-linux-cuda92-cudnn7-py3-mpi-build-test-gpu/165/console (Ubuntu 16.04, CUDA 9.2, cuDNN 7)
Test is flaky on CUDA 8. See pytorch#13818
Yeah, this isn't CUDA 8 specific. There is some non-determinism due to the summing of DataParallel outputs. I imagine it has something to do with the thread-per-GPU. Threads can add work to each other's queues, which I imagine can lead to a non-deterministic execution order. We should relax
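To illustrate why execution order matters here: floating-point addition is not associative, so summing the same partial results in a different order can change the last bits of the total. The snippet below is a minimal pure-Python sketch, not the DataParallel code path; `sum_in_order` and the `partials` values are hypothetical stand-ins for per-GPU outputs.

```python
def sum_in_order(values, order):
    """Accumulate values[i] for i in order, left to right."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Hypothetical stand-ins for per-GPU partial results.
partials = [0.1, 0.2, 0.3]

# Same numbers, two different accumulation orders.
first_run = sum_in_order(partials, [0, 1, 2])
second_run = sum_in_order(partials, [2, 1, 0])

print(first_run == second_run)      # the two totals are not bit-identical
print(abs(first_run - second_run))  # the gap is on the order of 1e-16
```

An exact-equality comparison between two such runs fails even though both results are numerically fine, which is the kind of mismatch a strict determinism check would flag.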
…20980) Summary: gradcheck currently includes a determinism check (although only trying twice and seeing if results match). This can lead to flaky tests, e.g. in #20971, but also #13818. This adds nondet_tol for both gradcheck and gradgradcheck. It does not change / reenable any tests yet. Pull Request resolved: #20980 Differential Revision: D15530129 Pulled By: soumith fbshipit-source-id: 04d7f85b5b59cd62867820c74b064ba14f4fa7f8
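The idea behind a non-determinism tolerance can be sketched as follows: run the backward twice and accept the result if the two runs agree within a tolerance instead of exactly. This is a pure-Python illustration under assumptions, not PyTorch's gradcheck implementation; `backward_is_reentrant` and `FlippingBackward` are hypothetical names, and only the parameter name `nondet_tol` comes from the commit message above.

```python
class FlippingBackward:
    """Hypothetical backward whose summation order flips on every call."""

    def __init__(self, partials):
        self.partials = list(partials)
        self.calls = 0

    def __call__(self, grad_output):
        # Even calls sum left to right, odd calls sum right to left.
        if self.calls % 2 == 0:
            order = self.partials
        else:
            order = list(reversed(self.partials))
        self.calls += 1
        total = 0.0
        for p in order:
            total += p
        return [grad_output * total]


def backward_is_reentrant(backward_fn, grad_output, nondet_tol=0.0):
    """Call backward twice; pass if every gradient differs by at most nondet_tol."""
    first = backward_fn(grad_output)
    second = backward_fn(grad_output)
    return all(abs(a - b) <= nondet_tol for a, b in zip(first, second))


# Exact comparison (tolerance 0.0) rejects the order-dependent backward.
strict_ok = backward_is_reentrant(FlippingBackward([0.1, 0.2, 0.3]), 1.0)

# A small tolerance accepts the same backward.
tolerant_ok = backward_is_reentrant(
    FlippingBackward([0.1, 0.2, 0.3]), 1.0, nondet_tol=1e-12
)

print(strict_ok, tolerant_ok)
```

The tolerance should stay well below the `atol`/`rtol` used for the gradient comparison itself, so that it only absorbs rounding noise and not real errors.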
I got this on a run:
https://circleci.com/gh/pytorch/pytorch/215442?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano