
RuntimeError: cuda runtime error (7) : too many resources requested for launch at /pytorch/torch/lib/THCUNN/im2col.h:120 #7680

Closed
ShreyasSkandan opened this issue May 18, 2018 · 16 comments · Fixed by #7779
Labels
todo Not as important as medium or high priority tasks, but we will work on these.

Comments


ShreyasSkandan commented May 18, 2018

Issue description

I'm trying to run a variant of ERFNet on an NVIDIA TX-2 running Jetpack 3.2 (CUDA 9.0 and CuDNN 7).

I get the following error:
RuntimeError: cuda runtime error (7) : too many resources requested for launch at ../../../pytorch/torch/lib/THCUNN/im2col.h:120

Does this error indicate that the model plus overhead is too large for the GPU? But this is roughly a 700 MB model performing inference on a single 640x512 grayscale image, on a GPU with roughly 6.5 GB of free memory. I even tried training a new model on images at half that resolution and got the same error.

Any tips/feedback is appreciated.

  • PyTorch or Caffe2: PyTorch
  • How you installed PyTorch (conda, pip, source): Source
  • Build command you used (if compiling from source): Followed jetson-reinforcement github
  • OS: Ubuntu 16.04 on Nvidia TX2
  • PyTorch version: 0.3.0
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 9.0 , CUDNN 7
  • GPU models and configuration: Tegra X2
  • GCC version (if compiling from source): 5.4
  • CMake version: 3.5.1
  • Versions of any other relevant libraries: https://github.com/Eromera/erfnet_pytorch

soumith commented May 18, 2018

we should fix the launch parameters. I presume we can't use as many threads per block on the TX2 as we use on desktop GPUs.
@ngimel can you tell us what the limits of TX2 GPUs are for the fix to im2col


ngimel commented May 18, 2018

Threads per block and maximum blocks in the grid are actually the same for the TX2 as they are for desktop GPUs (https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications); only the number of registers is smaller. @dusty-nv, do you know what might be causing this?
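For reference, the relevant limits can be compared directly. This is a minimal sketch; the numbers follow the CUDA compute-capability tables (sm_61 for desktop Pascal, sm_62 for the TX2) and are meant as illustration, not as an authoritative spec:

```python
# Illustrative per-compute-capability limits, taken from NVIDIA's
# compute-capability tables (sm_61 = desktop Pascal, sm_62 = Tegra X2).
LIMITS = {
    "sm_61": {"max_threads_per_block": 1024, "max_regs_per_block": 64 * 1024},
    "sm_62": {"max_threads_per_block": 1024, "max_regs_per_block": 32 * 1024},
}

# Thread-count limits are identical on both chips...
assert (LIMITS["sm_61"]["max_threads_per_block"]
        == LIMITS["sm_62"]["max_threads_per_block"])
# ...but the TX2 allows only half as many registers per thread block.
assert (LIMITS["sm_62"]["max_regs_per_block"]
        == LIMITS["sm_61"]["max_regs_per_block"] // 2)
```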

ShreyasSkandan (Author) commented

@soumith: as @ngimel said, the number of threads per block is constant across different NVIDIA GPUs, the Tegra series included.

Is it possible that it was compiled to require more registers than are available on the TX2, and maybe the kernel invocation of im2col requires some sort of launch_bounds() qualifier?


soumith commented May 20, 2018

@ShreyasSkandan The TX2 is an arm64 platform, so I presume PyTorch was compiled from source. In that case, I don't think there's a chance it was compiled to require more registers than are available on the TX2. Seeing the build log would be helpful.

ShreyasSkandan (Author) commented

@soumith thanks for the quick response. I will try to dig up the build log tomorrow and post it here.


ngimel commented May 20, 2018

@soumith, if there are no launch bounds, it is in fact possible that a kernel is compiled to request more registers than are available. At compile time, the compiler does not know how many threads the kernel will be launched with, so it can potentially use too many registers per thread to satisfy runtime requirements later: e.g. launching with 512 or 1024 threads could fail (not even a single block can be placed on an SM), whereas launching with 256 would succeed.
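The failure mode described here is simple arithmetic: a launch fails when registers-per-thread times threads-per-block exceeds the per-block register limit. A hedged sketch of that check (the 40-registers-per-thread figure is a made-up example, not a value measured from im2col):

```python
def launch_fits(regs_per_thread: int, threads_per_block: int,
                max_regs_per_block: int) -> bool:
    """True if a single block of this size can be scheduled at all."""
    return regs_per_thread * threads_per_block <= max_regs_per_block

REGS = 40  # hypothetical compiler choice when no launch bounds are given

# On a desktop GPU (64K registers per block) a 1024-thread launch fits...
assert launch_fits(REGS, 1024, 64 * 1024)
# ...on the TX2 (32K registers per block) the same launch fails (error 7)...
assert not launch_fits(REGS, 1024, 32 * 1024)
# ...while a 256-thread launch of the very same kernel binary succeeds.
assert launch_fits(REGS, 256, 32 * 1024)
```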

@zou3519 zou3519 added the todo Not as important as medium or high priority tasks, but we will work on these. label May 21, 2018

soumith commented May 21, 2018

@ngimel is there a way we can audit these GPU constraints from our server setup (i.e. without actually sitting down and compiling PyTorch on a TX2)?


ngimel commented May 21, 2018

Adding launch_bounds with the maximum number of threads the kernel is going to be launched with will keep the compiler from overusing registers. We had to do this, e.g., for the interp kernels when CUDA 9 started using more registers.
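In other words, `__launch_bounds__(maxThreadsPerBlock)` on a kernel tells the compiler that blocks may be as large as `maxThreadsPerBlock`, so it must keep per-thread register use low enough for a full-size block to fit. The implied ceiling can be sketched as (a simplification that ignores register allocation granularity; the 32K/64K limits are the per-block register limits of sm_62 vs. desktop Pascal):

```python
def reg_cap_per_thread(max_regs_per_block: int,
                       max_threads_per_block: int) -> int:
    """Upper bound on registers/thread implied by
    __launch_bounds__(max_threads_per_block), ignoring granularity."""
    return max_regs_per_block // max_threads_per_block

# With __launch_bounds__(1024), the compiler must stay within:
assert reg_cap_per_thread(32 * 1024, 1024) == 32   # TX2 (sm_62)
assert reg_cap_per_thread(64 * 1024, 1024) == 64   # desktop Pascal (sm_61)
```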

ShreyasSkandan (Author) commented

Ah, that's what I suspected, @ngimel. Thank you for clearing this up.
@soumith @ngimel do you think a fix will be released in the near future (within a week)?

Thanks for all the help

ShreyasSkandan (Author) commented

Works now, thanks!

ababycat commented

@ngimel hello, I hit the same error on a TX2 with PyTorch 0.3.0 compiled from source, but in a different file: 'cuda runtime error (7): too many resources requested for launch at ...../pytorch/torch/lib/THCUNN/generic/SpatialDilatedMaxPooling.cu'. Since this file is different from 'VolumetricUpSamplingTrilinear.cu', do I need to add 'launch_bounds(1024)' to every kernel in 'SpatialDilatedMaxPooling.cu'?
Thank you!

andrewssobral commented

Hello @ngimel @soumith,

I am also facing a similar issue:

RuntimeError: cuda runtime error (7) : too many resources requested for launch at /home/nvidia/Downloads/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu:66

Here's my setup:

  • How you installed PyTorch (conda, pip, source): Source
  • Build command you used (if compiling from source): Followed jetson-reinforcement github
  • OS: Ubuntu 16.04 on Nvidia TX2
  • PyTorch version: 0.5.0a0+a24163a (torchvision 0.2.1)
  • Python version: 3.5
  • CUDA/cuDNN version: CUDA 9.0 , CUDNN 7
  • GPU models and configuration: Tegra X2
  • GCC version (if compiling from source): 5.4
  • CMake version: 3.12.2

Source code:
https://github.com/andrewssobral/deep-learning-pytorch/blob/master/segmentation/train_binseg.py

Full log:

CUDA_ENABLED:  True
/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py:225: UserWarning: nn.UpsamplingBilinear2d is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.UpsamplingBilinear2d is deprecated. Use nn.functional.interpolate instead.")
/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
THCudaCheck FAIL file=/home/nvidia/Downloads/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu line=66 error=7 : too many resources requested for launch
Traceback (most recent call last):
  File "train_binseg.py", line 73, in <module>
    outputs = model(inputs)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nvidia/Downloads/deep-learning-pytorch/segmentation/networks/SegNet.py", line 73, in forward
    enc5 = self.enc5(dec5)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nvidia/Downloads/deep-learning-pytorch/segmentation/networks/SegNet.py", line 33, in forward
    return self.encode(x)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py", line 226, in forward
    return super(UpsamplingBilinear2d, self).forward(input)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/modules/upsampling.py", line 123, in forward
    return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners)
  File "/home/nvidia/.local/lib/python3.5/site-packages/torch/nn/functional.py", line 1985, in interpolate
    return torch._C._nn.upsample_bilinear2d(input, _output_size(2), align_corners)
RuntimeError: cuda runtime error (7) : too many resources requested for launch at /home/nvidia/Downloads/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu:66
nvidia@tegra-ubuntu:~/Downloads/deep-learning-pytorch/segmentation$

Do you know what it could be?

andrewssobral commented

I know that 'SpatialUpSamplingBilinear.cu' without launch_bounds(1024) leads to this error, but I don't know how to fix it...

I tried the solution in #8103, but it is still not working (after recompilation).

andrewssobral commented

My issue was solved by following #8103 (comment)


MrLinNing commented Dec 27, 2018

@ngimel @ShreyasSkandan Hi, I have the same problem. Can it be solved by cloning the latest PyTorch, changing CUDA_NUM_THREADS = 256 in the two files shown in the attached screenshot, and recompiling?
[screenshot of the two files]
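Lowering CUDA_NUM_THREADS trades smaller blocks for a larger grid: the kernel covers the same number of elements with more, cheaper blocks. A rough sketch of the trade-off (REGS_PER_THREAD = 40 is a hypothetical figure, and 32K is the TX2's per-block register limit):

```python
def grid_size(n_elements: int, threads_per_block: int) -> int:
    """Blocks needed to cover n_elements, one element per thread."""
    return (n_elements + threads_per_block - 1) // threads_per_block

N = 640 * 512            # e.g. one 640x512 grayscale image
REGS_PER_THREAD = 40     # made-up figure for illustration

# 256-thread blocks need 4x as many blocks as 1024-thread blocks...
assert grid_size(N, 256) == 4 * grid_size(N, 1024)
# ...but each block's register demand drops under the TX2's 32K limit,
# which the 1024-thread configuration exceeds.
assert REGS_PER_THREAD * 256 <= 32 * 1024 < REGS_PER_THREAD * 1024
```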

dusty-nv commented

@MrLinNing that appears to fix some of the functions, but perhaps not all - I'm not sure. For more info, see:

#8103 (comment)
#8103 (comment)
