Updated derivative rules for complex svd and pinverse #47761
Conversation
💊 CI failures summary and remediations

As of commit cb7590e (more details on the Dr. CI page):

🕵️ 5 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
pytorch_linux_xenial_py3_6_gcc5_4_test (1/5), Step: "Report results" (full log | diagnosis details | 🔁 rerun)
This pull request has been reverted by f5b68e7.
Hey @IvanYashchuk, bad news / interesting news. I had to unland this because it caused a significant increase in test timings, like adding 30 minutes to some test builds. That's the bad news. The interesting news is that now we get to have a discussion about testing thoroughness vs. cost. cc @malfet, @ngimel, @albanD, @kurtamohler

For example, here's a CI build after this PR:

And here's the same build before it:

Is this discrepancy really caused by this PR? Perhaps surprisingly: yes. While most of the tests this PR adds run quickly, there are some standouts:
For just CUDA grad and gradgrad testing of SVD, that's almost 20 minutes! Other tests and the pinverse tests are much faster. Here are some of the tests from above running on CPU, for example:
While these tests aren't super fast, they take less than a minute, compared to the corresponding CUDA tests that take over six minutes. A few things here:
Some options for being smarter:
I'd like to suggest starting with the first two options. Maybe we should stop testing the CUDA grad and gradgrad of other functions whose backward is a composite operation, too? Separately, we should follow up on CUDA SVD performance. Luckily @xwang233 already has a PR that may address some of these issues: #48436.
That seems hard to do reliably. Normally you'd notice this anyway when adding the tests, since you're running only those tests while developing. So perhaps it's best to do a combination of setting expectations and, once in a while, cleaning up outliers? Easy to find outliers:
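One hypothetical way to surface such outliers, assuming JUnit-style XML test reports like the ones the CI jobs upload (the report path below is an assumption, not a command taken from this thread), is to sort per-test runtimes from the report files:

```python
# Hypothetical outlier finder: sort test cases in JUnit-style XML reports by runtime.
import glob
import xml.etree.ElementTree as ET

def slowest_tests(report_glob="test/test-reports/**/*.xml", top=20):
    cases = []
    for path in glob.glob(report_glob, recursive=True):
        for case in ET.parse(path).getroot().iter("testcase"):
            cases.append((float(case.get("time", 0.0)),
                          case.get("classname", ""), case.get("name", "")))
    return sorted(cases, reverse=True)[:top]

if __name__ == "__main__":
    for seconds, classname, name in slowest_tests():
        print(f"{seconds:8.1f}s  {classname}.{name}")
```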
I don't know if it is so challenging. What if we diffed the test files produced for new tests, for example, summed their total time, and reported if it was over a minute?
You will have the same problem that codecov has: the baseline you diff against may have been run under different circumstances, or may be far from the branch point of your PR, so it can be wrong or nonexistent for lots of reasons. I'd expect the false positive rate to be high.
@mruberry thank you very much for the revert. Please note that adding all those tests pushed the pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test total test time beyond the 5h limit. We can always shard it into test1 and test2, but it would be good to understand why those tests need to be so slow.
I do agree with Mike that for such cases, it is ok to skip the grad and gradgrad check on cuda.
You mean that right now, when an op is both a method and a function, we (grad)gradcheck both?
Yes. We currently re-run the grad and gradgrad checks for the function variant and the method variant of each operation. Seems like we can stop running them on the method variant?
For all the ops that are automatically bound to python via codegen from
@ngimel and I looked at the current test timings for SVD CUDA backward, and the existing tests take about 5 minutes. Creating an OpInfo gives a 4x expansion of these tests ({function, method} x {double, cdouble}). Removing the method variants would still leave a 2x expansion. @ngimel mentioned she'd like the opportunity to review SVD's CUDA backward before we decide on a course of action.
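For context, each of those variants essentially puts the op through `torch.autograd.gradcheck` and `gradgradcheck`. A rough, illustrative sketch of the expansion (not the actual OpInfo test harness, which draws inputs from the op's sample-input function):

```python
import torch
from torch.autograd import gradcheck, gradgradcheck

# Illustrative sketch of the test expansion an OpInfo creates for svd:
# {function, method} variants x {double, cdouble} dtypes, each run through
# gradcheck and gradgradcheck (requires a CUDA device as written).
variants = {"function": torch.svd, "method": lambda x: x.svd()}

for name, op in variants.items():
    for dtype in (torch.double, torch.cdouble):
        a = torch.randn(5, 5, dtype=dtype, device="cuda", requires_grad=True)
        gradcheck(op, (a,))       # first-order: numerical vs. analytical Jacobian
        gradgradcheck(op, (a,))   # second-order check; by far the most expensive part
```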
Thanks to @ngimel and @albanD we're going to stop testing method grad and gradgrad for the moment, and suggest we stop testing CUDA svd gradgrad. @IvanYashchuk, sorry to ask you to make another revision of this PR, but would you add skips for svd's CUDA gradgrad tests? With these changes we expect svd's CUDA gradient test time to be lower than in today's builds.
@mruberry I added skips for cuda gradchecks. I also marked gradgrad tests as slow.
# cuda gradchecks are very slow
# see discussion https://github.com/pytorch/pytorch/pull/47761#issuecomment-747316775
SkipInfo('TestGradients', 'test_fn_gradgrad', device_type='cuda'),
SkipInfo('TestGradients', 'test_fn_grad', device_type='cuda'))),
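For readers unfamiliar with the OpInfo mechanism, here is a rough sketch of how skip entries like these sit inside svd's `OpInfo` in `common_methods_invocations.py` (field names and the sample-input function are shown from memory for illustration and may not match the diff exactly):

```python
import torch
from torch.testing._internal.common_methods_invocations import OpInfo, SkipInfo

# Sketch only (not the exact diff): SkipInfo entries attach to the op's OpInfo
# via its `skips` field in common_methods_invocations.py.
svd_opinfo = OpInfo(
    'svd',
    dtypes=(torch.double, torch.cdouble),   # simplified; the real entry lists more dtypes
    sample_inputs_func=sample_inputs_svd,   # hypothetical name for illustration
    skips=(
        # cuda gradchecks are very slow
        # see discussion https://github.com/pytorch/pytorch/pull/47761#issuecomment-747316775
        SkipInfo('TestGradients', 'test_fn_gradgrad', device_type='cuda'),
    ))
```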
I think we can keep the grad checks going, let's just exclude the gradgrad checks on CUDA for now
Should the CUDA gradgrad check always be skipped, or tested only with PYTORCH_TEST_WITH_SLOW?
Let's just unconditionally skip them on svd for now. @albanD and I are going to look at our test matrix more closely in January.
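For reference, the alternative raised above, gating on PYTORCH_TEST_WITH_SLOW rather than skipping outright, generally looks like this in PyTorch's test suite (a generic sketch using the `slowTest` decorator; the test class and test name here are made up for illustration):

```python
import torch
from torch.autograd import gradgradcheck
from torch.testing._internal.common_utils import TestCase, run_tests, slowTest

class TestSvdGradGrad(TestCase):  # hypothetical test class, for illustration only
    @slowTest  # skipped unless PYTORCH_TEST_WITH_SLOW=1 is set in the environment
    def test_svd_gradgrad_cuda_cdouble(self):
        a = torch.randn(5, 5, dtype=torch.cdouble, device="cuda", requires_grad=True)
        self.assertTrue(gradgradcheck(torch.svd, (a,)))

if __name__ == "__main__":
    run_tests()
```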
New test timings look good. Let's keep CUDA grad for now and just skip CUDA gradgrad. Then let's merge this.
Done. Let's try merging it.
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
- Updated `svd_backward` to work correctly for complex-valued inputs.
- Updated `common_methods_invocations.py` to take dtype and device arguments for input construction.
- Removed `test_pinverse` from `test_autograd.py`; it is replaced by entries in `common_methods_invocations.py`.
- Added `svd` and `pinverse` to the list of complex tests.

References for complex-valued SVD differentiation:
- https://giggleliu.github.io/2019/04/02/einsumbp.html
- https://arxiv.org/abs/1909.02659

The derived rules assume gauge invariance of loss functions, so the result would not be correct for loss functions that are not gauge invariant (https://re-ra.xyz/Gauge-Problem-in-Automatic-Differentiation/). The same rule is implemented in TensorFlow and [BackwardsLinalg.jl](https://github.com/GiggleLiu/BackwardsLinalg.jl).

Ref. pytorch#33152

Pull Request resolved: pytorch#47761
Reviewed By: izdeby
Differential Revision: D25574962
Pulled By: mruberry
fbshipit-source-id: 832b61303e883ad3a451b84850ccf0f36763a6f6
Summary:
- Updated `svd_backward` to work correctly for complex-valued inputs.
- Updated `common_methods_invocations.py` to take dtype and device arguments for input construction.
- Removed `test_pinverse` from `test_autograd.py`; it is replaced by entries in `common_methods_invocations.py`.
- Added `svd` and `pinverse` to the list of complex tests.

References for complex-valued SVD differentiation:
- https://giggleliu.github.io/2019/04/02/einsumbp.html
- https://arxiv.org/abs/1909.02659

The derived rules assume gauge invariance of loss functions, so the result would not be correct for loss functions that are not gauge invariant (https://re-ra.xyz/Gauge-Problem-in-Automatic-Differentiation/). The same rule is implemented in TensorFlow and [BackwardsLinalg.jl](https://github.com/GiggleLiu/BackwardsLinalg.jl).

Ref. pytorch#33152

Pull Request resolved: pytorch#47761
Reviewed By: ngimel
Differential Revision: D25658897
Pulled By: mruberry
fbshipit-source-id: ba33ecbbea3f592238c01e62c7f193daf22a9d01
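To make the gauge-invariance caveat above concrete, here is a small sketch (not part of the PR's test suite): a loss that is insensitive to the arbitrary per-column phase of the singular vectors, such as a reconstruction from U, S, V, is covered by the derived rule, while a loss that reads off the phase of U directly is not.

```python
import torch

a = torch.randn(4, 4, dtype=torch.cdouble, requires_grad=True)
u, s, v = torch.svd(a)

# Gauge-invariant loss: multiplying a column of u by exp(i*phi) and the matching
# column of v by the same phase leaves this reconstruction unchanged, so the
# complex svd_backward rule applies and the gradient below is meaningful.
recon = u @ torch.diag(s).to(u.dtype) @ v.conj().transpose(-2, -1)
loss = (recon - a.detach()).abs().pow(2).sum()
loss.backward()

# Not gauge-invariant: this quantity changes under a per-column phase rotation
# of u, so gradients through it fall outside the rule's assumptions.
# phase_sensitive = u.real.sum()
```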