CUDA Error in batchNorm #42588
Comments
Reproduction? PyTorch version?
Please run …
@gchanan
@malfet
I'm sorry, I can't share the model because it belongs to my company. I used some custom layers before applying the batch norm.
I have the same problem as you. Did you find any solution?
I have the same problem again. Strangely, if I run it on CPU, there is no problem.
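Since several people report the same CPU-vs-CUDA split, here is a minimal sketch of the general pattern involved (this is not any reporter's model; the module, shapes, and the permute are illustrative assumptions, and whether it actually errors depends on the build):

```python
import torch
import torch.nn as nn

# Illustrative pattern only: a CUDA BatchNorm whose incoming gradient
# ends up with non-default strides because of a downstream view op.
bn = nn.BatchNorm2d(8).cuda()
x = torch.randn(2, 8, 4, 4, device="cuda", requires_grad=True)

y = bn(x)
# permute() returns a non-contiguous view, so the gradient flowing back
# into batch norm need not be contiguous in the default layout.
loss = y.permute(0, 2, 3, 1).sum()
loss.backward()  # can trip the cuDNN contiguity check on affected builds
```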
Now I can provide a small example to reproduce the problem. I'm using GCC 10.2 on Ubuntu 20.04 with a libtorch nightly build and CUDA 10.2. In the following code, the structure of DTNet is: conv - 4 layers of TGCNSABlock - linear layer. I use a …
Observation:
cmake output:
CMakeLists.txt:
DTNet.hpp:
src.cpp:
Error:
This is a smaller failing example. However, if I change the line …
DTNet.hpp:
src.cpp:
A very small failing example:
src.cpp
It looks like the problem is about the …
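If the problem is indeed the gradient's layout, a possible stopgap (a sketch, not an official API; the class name ContiguousGrad is made up here) is an identity autograd function that forces the gradient to be contiguous before it reaches the batch-norm backward:

```python
import torch

class ContiguousGrad(torch.autograd.Function):
    """Identity in forward; in backward, hands batch norm a gradient
    that is contiguous in the default memory format."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_out):
        # A default-contiguous tensor satisfies the
        # is_contiguous(suggest_memory_format()) check.
        return grad_out.contiguous()

# Usage sketch: wrap the batch-norm output so this backward runs
# immediately before the batch-norm backward:
#   y = ContiguousGrad.apply(bn(x))
```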
Awesome, thanks for the repro. I can reproduce; we'll take a look.
🐛 Bug
```
RuntimeError                              Traceback (most recent call last)
<ipython-input-…> in <module>
----> 1 loss.backward()

/opt/conda/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    183                 products. Defaults to ``False``.
    184         """
--> 185         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    186
    187     def register_hook(self, hook):

/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    125     Variable._execution_engine.run_backward(
    126         tensors, grad_tensors, retain_graph, create_graph,
--> 127         allow_unreachable=True)  # allow_unreachable flag
    128
    129
RuntimeError: Expected grad_output->is_contiguous(grad_output->suggest_memory_format()) to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Exception raised from cudnn_batch_norm_backward at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/ATen/native/cudnn/BatchNorm.cpp:249 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f993197377d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x25b2 (0x7f9932a79db2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd1150a (0x7f9932aea50a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd3fa3b (0x7f9932b18a3b in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1ef (0x7f9964cff10f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: + 0x2b59cff (0x7f9966946cff in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: + 0x2b6b21b (0x7f996695821b in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1ef (0x7f9964cff10f in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::generated::CudnnBatchNormBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x42c (0x7f99668a9fec in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x30d1017 (0x7f9966ebe017 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f9966eb9860 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f9966eba401 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f9966eb2579 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f996b1e199a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: + 0xc819d (0x7f99aa6fa19d in /opt/conda/bin/../lib/libstdc++.so.6)
frame #15: + 0x76db (0x7f99adb356db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #16: clone + 0x3f (0x7f99ad85e88f in /lib/x86_64-linux-gnu/libc.so.6)
```
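For context on what the assertion checks: contiguity in PyTorch is relative to a memory format, so the same tensor can fail the default check while passing the channels-last one; suggest_memory_format() picks the format the strides look closest to. A small, purely demonstrative illustration:

```python
import torch

# An NCHW-shaped tensor whose strides come from an NHWC buffer,
# i.e. a channels-last tensor.
t = torch.randn(2, 4, 4, 8).permute(0, 3, 1, 2)  # shape (2, 8, 4, 4)

print(t.is_contiguous())                                   # False
print(t.is_contiguous(memory_format=torch.channels_last))  # True
```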
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
On Ubuntu, with CUDA 10.1 and Python 3.7.
Please copy and paste the output from our environment collection script (or fill out the checklist below manually). You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

How you installed PyTorch (conda, pip, source):

Additional context
cc @ngimel @csarofeen @ptrblck @xwang233