Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add instructional error message for cudnn RNN double backward workaround #33884

Closed
wants to merge 10 commits into from

Conversation

zou3519
Copy link
Contributor

@zou3519 zou3519 commented Feb 27, 2020

Stack from ghstack:

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:

  • added some tests to check the error message

Differential Revision: D20143544

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

[ghstack-poisoned]
@zou3519 zou3519 requested a review from albanD February 27, 2020 18:16
albanD
albanD previously approved these changes Feb 27, 2020
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only minor nit, thanks!

std::string("the derivative for '") + name + "' is not implemented");
template <typename T>
T not_implemented_base(const char* name, const char* reason) {
std::string msg = std::string("the derivative for '") + name + "' is not implemented.";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I usually do this with c10::str("the derivative for '", name, "' is not implemented."); as it looks slightly better.

…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Feb 27, 2020
Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

ghstack-source-id: d5eb3c1852d2729ff9a8d9cd5d813db82682140d
Pull Request resolved: #33884
@dr-ci
Copy link

dr-ci bot commented Feb 27, 2020

💊 CircleCI build failures summary and remediations

As of commit 8907802 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_custom_build_static (1/2)

Step: "Build" (full log | pattern match details)

Mar 09 18:51:38 caused by: Connection refused (os error 111)
Mar 09 18:51:38 +++ eval 'extract_trap_cmd ' 
Mar 09 18:51:38 ++++ extract_trap_cmd 
Mar 09 18:51:38 ++++ printf '%s\n' '' 
Mar 09 18:51:38 +++ printf '%s\n' cleanup 
Mar 09 18:51:38 ++ trap -- ' 
Mar 09 18:51:38 cleanup' EXIT 
Mar 09 18:51:38 ++ which sccache 
Mar 09 18:51:38 ++ sccache --stop-server 
Mar 09 18:51:38 Stopping sccache server... 
Mar 09 18:51:38 error: couldn't connect to server 
Mar 09 18:51:38 caused by: Connection refused (os error 111) 
Mar 09 18:51:38 ++ true 
Mar 09 18:51:38 ++ rm /var/lib/jenkins/sccache_error.log 
Mar 09 18:51:38 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Mar 09 18:51:38 ++ true 
Mar 09 18:51:38 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Mar 09 18:51:38 ++ SCCACHE_IDLE_TIMEOUT=1200 
Mar 09 18:51:38 ++ RUST_LOG=sccache::server=error 
Mar 09 18:51:38 ++ sccache --start-server 
Mar 09 18:51:38 Starting sccache server... 
Mar 09 18:51:38 ++ sccache --zero-stats 

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (2/2)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 7 failures>

ERROR: test_cross_device_reentrant_autograd_cuda (__main__.TestAutogradDeviceTypeCUDA)
test_where_broadcast_all_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
test_where_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
test_where_functional_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
test_where_scalar_broadcast_mask_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
test_where_scalar_broadcast_non_mask_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
test_where_scalar_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
test_zero__cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
test_zero__scalar_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok 
 
====================================================================== 
ERROR: test_cross_device_reentrant_autograd_cuda (__main__.TestAutogradDeviceTypeCUDA) 
---------------------------------------------------------------------- 
Traceback (most recent call last): 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 207, in instantiated_test 
    return test(self, device_arg) 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 426, in only_fn 
    return fn(slf, device, *args, **kwargs) 
  File "test_autograd.py", line 4628, in test_cross_device_reentrant_autograd 
    out.sum().backward() 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\tensor.py", line 198, in backward 
    torch.autograd.backward(self, gradient, retain_graph, create_graph) 

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 28 times.

…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Mar 9, 2020
Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

ghstack-source-id: 4e847612b16b2a0135298c8af2eb131adbb9140b
Pull Request resolved: #33884
…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Jan 13, 2021
Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

ghstack-source-id: 7fecbc66391d91a7f1287a396864e5ffd750141f
Pull Request resolved: #33884
@zou3519
Copy link
Contributor Author

zou3519 commented Jan 13, 2021

test_cross_device_reentrant_autograd_cuda seems to consistently fail on the CUDA job for me, but I can't reproduce it locally or while ssh-ing into a CircleCI machine. This will need more investigation.

@zou3519
Copy link
Contributor Author

zou3519 commented Jan 13, 2021

Edit: I figured out how to repro the error locally. It required running two tests in a specific order. https://gist.github.com/zou3519/17aa029509c83470aa299666ceab1d6f.

It looks like what happening is that raising the double backward error message permanently messes with the "checkpoint state" of the engine. Will dig into this a bit more later.

@albanD
Copy link
Collaborator

albanD commented Jan 13, 2021

@zou3519 yes there is a known issue that is state is not reset properly: #37874
Is the test that leave the engine in a bad state the one you added? You might want to modify it to avoid doing so (at least until that other issue is fixed).

@zou3519
Copy link
Contributor Author

zou3519 commented Jan 13, 2021

#37874

Ah, that makes sense, thanks for pointing it out! Yes, the tests I added check for the existence of the error message... we can remove the tests for now as a workaround (leaving this PR untested, lol)

@albanD
Copy link
Collaborator

albanD commented Jan 13, 2021

Yes this bug is pretty nasty for testing...

I think you can work around this by running the double backward that fails by hand with a .backward() that way it will still raise but will also leave the state clean.

…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Jan 15, 2021
Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

ghstack-source-id: 41af886bf3e66c6817041d791a5132fabdeffa32
Pull Request resolved: #33884
@zou3519
Copy link
Contributor Author

zou3519 commented Jan 15, 2021

Yes this bug is pretty nasty for testing...

I think you can work around this by running the double backward that fails by hand with a .backward() that way it will still raise but will also leave the state clean.

Thanks! That worked out great

@zou3519 zou3519 requested a review from albanD January 15, 2021 15:54
@zou3519 zou3519 dismissed albanD’s stale review January 15, 2021 15:55

re-requesting review

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated test!

…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Jan 15, 2021
Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

ghstack-source-id: ea38bf6b2baac59315d9a40e9857edf0e5f8d64a
Pull Request resolved: #33884
…ard workaround"

Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some tests to check the error message

Differential Revision: [D20143544](https://our.internmc.facebook.com/intern/diff/D20143544)

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Jan 18, 2021
Mitigates #5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.

Test Plan:
- added some testing to check the error message

ghstack-source-id: 52a85733add28a80fff68af86d6fc8bf8dd99e58
Pull Request resolved: #33884
@codecov
Copy link

codecov bot commented Jan 19, 2021

Codecov Report

Merging #33884 (74a0333) into gh/zou3519/238/base (7f3a407) will decrease coverage by 0.00%.
The diff coverage is 62.50%.

@@                   Coverage Diff                   @@
##           gh/zou3519/238/base   #33884      +/-   ##
=======================================================
- Coverage                80.66%   80.66%   -0.01%     
=======================================================
  Files                     1913     1913              
  Lines                   208058   208064       +6     
=======================================================
- Hits                    167833   167825       -8     
- Misses                   40225    40239      +14     

@facebook-github-bot
Copy link
Contributor

@zou3519 merged this pull request in 1154a85.

@facebook-github-bot facebook-github-bot deleted the gh/zou3519/238/head branch January 23, 2021 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants