Simple C++ custom autograd function code throws error "CUDA error: driver shutting down" #35736
Comments
cc @wanchaol, who changed that code recently.
Thanks, I'm looking into it.
Update: I noticed that if we initialize a CUDA tensor (e.g. `torch::randn({3, 4}, torch::kCUDA)`) before running the C++ custom autograd function, the whole thing passes and there is no error.
This probably has something to do with initialization of CUDA happening inside the autograd engine, and then the destructor gets called too early at destruction time, before the autograd destructors happen. I wonder if there's a way for the autograd engine to "promote" the destructor so that it stays live until autograd is done shutting down.
Update: I obtained the following backtrace:
Also, @wanchaol and I found that this bug only reproduces when an external C++ program is linked to libtorch, not when the test script itself is in our test suite (i.e. running it as the only test in …).
Thanks for the backtrace. Yeah, I am trying to link an external cpp file with the test, but it failed with a CUDA library linking failure; I will need to figure out the build issue. Do you know what the difference is between linking in our internal test suite and external linking? Apart from this problem, I think we should make our internal test suite identical to external linking so that we can catch this problem directly in the tests.
Yes, I agree. I think this is the CMakeLists.txt file we use to build the internal test suite: https://github.com/pytorch/pytorch/blob/master/test/cpp/api/CMakeLists.txt. I suspect this block is what makes the internal test suite work: pytorch/test/cpp/api/CMakeLists.txt, lines 44 to 50 in 9650f46
@yf225 I tried downloading libtorch cu10 from pytorch.org and locally built the cpp application; I still could not repro. It works fine on my side, with no exception.
Hi, I think this is a workaround for this problem.
Just calling a simple CUDA-related function before initializing a tensor makes everything OK, even though the tensor does not need CUDA support.
I am using LibTorch 1.6.0 with CUDA 10.1. The LibTorch was downloaded from https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.6.0%2Bcu101.zip. The CMakeLists.txt is below:
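The commenter's actual CMakeLists.txt did not survive this capture. As a stand-in, here is a minimal sketch of the usual way an external program is linked against a downloaded libtorch, following the standard `find_package(Torch)` flow (the project and target names here are placeholders, not the commenter's):

```cmake
cmake_minimum_required(VERSION 3.10)
project(repro)

# Point CMake at the unpacked libtorch directory, e.g.
#   cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
find_package(Torch REQUIRED)

add_executable(repro main.cpp)
target_link_libraries(repro "${TORCH_LIBRARIES}")
set_property(TARGET repro PROPERTY CXX_STANDARD 14)
```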
@yf225 do you know if this is still an issue in the latest versions?
I am currently experiencing what seems to be the same issue via the Rust bindings for the C++ API. Versions (Arch Linux packages):
Code:

```rust
use tch::{Device, Kind, Tensor};

fn main() {
    let x = Tensor::zeros(&[1], (Kind::Float, Device::Cpu)).requires_grad_(true);
    x.backward();
}
```

Output:
And adding …
Still not fixed with the latest libtorch 1.9.0 with CUDA 10.2; a simple back-prop on the CPU causes this error.
🐛 Bug
Running the following code in CUDA-enabled libtorch throws the error "CUDA error: driver shutting down", even though the code doesn't use CUDA. Running the same code in CPU-only libtorch doesn't throw any error.
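The repro snippet itself is missing from this capture. As an assumption (not the original code), a minimal C++ custom autograd function of the kind the issue title describes would look like the following; it needs a CUDA build of libtorch to reproduce the reported crash at exit.

```cpp
#include <torch/torch.h>
#include <iostream>

using torch::autograd::AutogradContext;
using torch::autograd::Function;
using torch::autograd::tensor_list;

// Doubles its input; backward scales the incoming gradient by 2.
struct Double : public Function<Double> {
  static torch::Tensor forward(AutogradContext* /*ctx*/, torch::Tensor x) {
    return x * 2;
  }
  static tensor_list backward(AutogradContext* /*ctx*/, tensor_list grad_out) {
    return {grad_out[0] * 2};
  }
};

int main() {
  auto x = torch::ones({3}, torch::requires_grad());
  Double::apply(x).sum().backward();  // triggers the autograd engine
  std::cout << x.grad() << std::endl; // a tensor of 2s
  // On a CUDA build of libtorch, the process may then die at exit with
  // "CUDA error: driver shutting down", as reported in this issue.
}
```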
Error:
Better backtrace:
Update: I noticed that if we initialize a CUDA tensor, e.g.

```cpp
auto cuda_tensor = torch::randn({3, 4}, torch::kCUDA);
std::cout << cuda_tensor << std::endl;
```

before running the C++ custom autograd function, the whole thing passes and there is no error.

Expected behavior
It should just work without throwing any error.
Environment
Latest libtorch nightly
cc @ezyang @ssnl @albanD @zou3519 @gqchen @yf225