
Runtime error in torchvision nms in Linux. Windows works fine #1705

Open
adizhol opened this issue Dec 29, 2019 · 5 comments

Comments

@adizhol

adizhol commented Dec 29, 2019

Hi,

I'm getting a runtime error when running torchvision\ops\boxes.nms.
torchvision 0.4.0
pytorch 1.2.0 (GPU)

RuntimeError: Trying to create tensor with negative dimension -532064992: [-532064992] (check_size_nonnegative at /pytorch/aten/src/ATen/native/TensorFactories.h:64)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fd8e1d79273 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xb76 (0x7fd7d85e8fb6 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x3f7da58 (0x7fd7d6f7fa58 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::VariableType::empty(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x3fa (0x7fd7d69fa75a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: + 0x7f661 (0x7fd87b3fd661 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: nms_cuda(at::Tensor const&, at::Tensor const&, float) + 0x430 (0x7fd87b3fe042 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: nms(at::Tensor const&, at::Tensor const&, float) + 0x172 (0x7fd87b3c1cf9 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: + 0x65115 (0x7fd87b3e3115 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: + 0x62304 (0x7fd87b3e0304 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: + 0x5dc45 (0x7fd87b3dbc45 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
frame #10: + 0x5ded2 (0x7fd87b3dbed2 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)
frame #11: + 0x4f2e7 (0x7fd87b3cd2e7 in /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so)

To Reproduce
code:

nms(transformed_anchors, scores, iou_threshold=0.7)

transformed_anchors is [490698, 4]
scores is [490698]

transformed_anchors.min() = 0
transformed_anchors.max() = 2560
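
For completeness, a self-contained sketch of the call above; the boxes here are random stand-ins for the real anchors (which are not included in the report), chosen only to match the reported shapes and value range of roughly [0, 2560]:

import torch
from torchvision.ops import nms

N = 490698
xy1 = torch.rand(N, 2, device="cuda") * 2500             # top-left corners
wh = torch.rand(N, 2, device="cuda") * 60                # widths / heights
transformed_anchors = torch.cat([xy1, xy1 + wh], dim=1)  # [N, 4]
scores = torch.rand(N, device="cuda")                    # [N]

# Raises the negative-dimension RuntimeError on the affected versions (CUDA only)
keep = nms(transformed_anchors, scores, iou_threshold=0.7)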

Environment

[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.16.4
[pip] torch==1.2.0
[pip] torchtext==0.4.0
[pip] torchvision==0.4.0
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] torch 1.2.0 pypi_0 pypi
[conda] torchtext 0.4.0 pypi_0 pypi
[conda] torchvision 0.4.0 pypi_0 pypi

Additional context
The same code on Windows runs as expected.

The same issue was opened in pytorch/pytorch.

@fmassa
Member

fmassa commented Jan 3, 2020

Thanks for the bug report!

I believe the issue is that we use int in

int dets_num = dets.size(0);
const int col_blocks = at::cuda::ATenCeilDiv(dets_num, threadsPerBlock);

instead of long. That said, I'm not sure the current strategy will work for such large tensors, as it might require a bit too much memory.

EDIT: I tried a quick fix by replacing int with int64_t, but I get CUDA out of memory errors. So for this use-case, the current implementation is not enough and a new implementation might be required.
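
To see why the allocated size goes negative, here is a rough back-of-the-envelope check (a sketch; it assumes threadsPerBlock = 64 and that the kernel allocates a bitmask of dets_num * col_blocks entries, per the snippet above):

import math

dets_num = 490698
threads_per_block = 64                                 # assumed kernel block size
col_blocks = math.ceil(dets_num / threads_per_block)   # 7668

mask_size = dets_num * col_blocks                      # 3762672264
print(mask_size > 2**31 - 1)                           # True: exceeds int32 max

# Interpreted as a signed 32-bit int, the size wraps around to a negative value
wrapped = (mask_size + 2**31) % 2**32 - 2**31
print(wrapped)                                         # -532295032

The wrapped value is in the same ballpark as the -532064992 in the report (the exact box count in the failing run may have differed slightly), which is consistent with the int overflow explanation.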

@shuangshuangguo

Hi @adizhol @fmassa, did you solve the problem?

@senarvi

senarvi commented Aug 4, 2021

Even if the implementation doesn't scale up to larger sizes, it would be great to get a better error message.
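
Until the kernel is fixed, a user-side workaround along the following lines can keep the call within int32 range by pre-filtering on score; this is only an illustrative sketch (safe_nms and pre_nms_top_n are made-up names, and threadsPerBlock = 64 is assumed from the CUDA kernel), not torchvision API:

import torch
from torchvision.ops import nms

INT32_MAX = 2**31 - 1
THREADS_PER_BLOCK = 64  # assumed kernel block size

def safe_nms(boxes, scores, iou_threshold, pre_nms_top_n=100_000):
    n = boxes.shape[0]
    col_blocks = (n + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    if n * col_blocks > INT32_MAX:
        # The CUDA mask allocation would overflow int32; keep only the
        # highest-scoring boxes so the call stays within range.
        scores, idx = scores.topk(pre_nms_top_n)
        boxes = boxes[idx]
        keep = nms(boxes, scores, iou_threshold)
        return idx[keep]
    return nms(boxes, scores, iou_threshold)

Pre-filtering by score only changes the result with respect to the low-scoring boxes that are dropped, which is usually acceptable for proposal pruning.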

@fmassa
Member

fmassa commented Aug 9, 2021

Hi @senarvi,

I agree that this error message is not great. We should at least make the error message more meaningful by replacing the int with int64_t.

@abhiagwl4262

abhiagwl4262 commented Nov 13, 2021

@adizhol

The range of int32 is -2147483648 to 2147483647, so why does the error occur in your case with only 490698 boxes?

sidharth5n added a commit to sidharth5n/Neural-CTRLF that referenced this issue Jan 25, 2022
NMS implementation has issues when the number of proposals is large.
pytorch/vision#1705