Error: CUDA: cuLaunchKernel failed: CUDA_ERROR_INVALID_VALUE #5
This error usually indicates that the kernel launch failed because it requested too many resources (too much shared memory or an invalid thread-block size). In most cases where I encountered this error, the compiler was unable to determine tight bounds for the scheduler. Most such applications had a lot of boundary conditions, and the bounds derived during scheduling differed from the ones actually needed in the end, so it was quite difficult to fix.
Thanks for the quick reply! Here is the build command and the debug output:
g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/ -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi -frtti -I./bin -Wall -O3 process.cpp bin/deepcamera.a bin/deepcamera_auto_schedule.a bin/deepcamera_simple_auto_schedule.a bin/deepcamera_auto_schedule_store.a bin/deepcamera_auto_schedule_no_fus.a -o bin/process -ldl -lpthread -lz -ltinfo -lpng16 -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Entering Pipeline deepcamera_auto_schedule
Input Buffer input: buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
Output Buffer output: buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
CUDA: halide_cuda_initialize_kernels (user_context: 0x0, state_ptr: 0x562fc7acdb48, ptx_src: 0x562fc7897c20, size: 42052
load_libcuda (user_context: 0x0)
Loaded CUDA runtime library: libcuda.so
Got device 0
Tesla V100-SXM2-16GB
total memory: 4095 MB
max threads per block: 1024
warp size: 32
max block size: 1024 1024 64
max grid size: 2147483647 65535 65535
max shared memory per block: 49152
max constant memory per block: 65536
compute capability 7.0
cuda cores: 80 x 0 = 0
cuCtxCreate 0 -> 0x562fc7ca01b0(3020)
cuModuleLoadData 0x562fc7897c20, 42052 -> 0x562fc842efb0
Time: 2.834470e-01 ms
halide_copy_to_device validating input buffer: buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
halide_device_malloc validating input buffer: buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
halide_device_malloc: target device interface 0x562fc7ac6108
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x7ffe912f9890)
allocating buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
cuMemAlloc 7864320 -> 0x7f6ce7800000
Time: 1.600770e-01 ms
halide_copy_to_device 0x7ffe912f9890 host is dirty
c.extent[0] = 1536
c.extent[1] = 2560
CUDA: halide_cuda_buffer_copy (user_context: 0x0, src: 0x7ffe912f9890, dst: 0x7ffe912f9890)
from host to device, 0x7f6d0931c080 -> 0x7f6ce7800000, 7864320 bytes
cuMemcpyHtoD(0x7f6ce7800000, 0x7f6d0931c080, 7864320)
Time: 9.949730e-01 ms
halide_copy_to_device validating input buffer: buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
halide_device_malloc validating input buffer: buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
halide_device_malloc: target device interface 0x562fc7ac6108
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x7ffe912f9920)
allocating buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
cuMemAlloc 8388608 -> 0x7f6cd2000000
Time: 1.766050e-01 ms
CUDA: halide_cuda_run (user_context: 0x0, entry: kernel_output_s0_y_y_o___block_id_y, blocks: 34x171x1, threads: 2048x2048x1, shmem: 19006078
Got context.
Got module 0x562fc842efb0
Got function 0x562fc843cee0
halide_cuda_run 0 4 [0x0 ...] 0
halide_cuda_run 1 4 [0x60000000000 ...] 0
halide_cuda_run 2 4 [0x80000000600 ...] 0
halide_cuda_run 3 4 [0x9ff00000800 ...] 0
halide_cuda_run 4 4 [0x9ff ...] 0
halide_cuda_run 5 4 [0x60000000000 ...] 0
halide_cuda_run 6 4 [0xa0000000600 ...] 0
halide_cuda_run 7 8 [0x7f6ce7800000 ...] 1
halide_cuda_run 8 8 [0x7f6cd2000000 ...] 1
halide_cuda_run translated arg7 [0x7f6ce7800000 ...]
halide_cuda_run translated arg8 [0x7f6cd2000000 ...]
Error: CUDA: cuLaunchKernel failed: CUDA_ERROR_INVALID_VALUE
Makefile:49: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)
The application I am compiling is here: https://github.com/dillonhuff/HalideAutoGPU/tree/dhuff_experiments/TACO_Benchmarks/deepcamera
So for some reason the thread-block dimensions appear to be equal to the entire image. In any case, I tried to debug this issue on the app you linked, and there are two quick fixes:
Or in the above application replacing gaussian_blur with something like this:
In both cases the code seems to run fine, although I cannot really test its functionality.
@savsiout I tried that and it worked. Thanks!
@savsiout I've been trying to run some larger pipelines and have seen the following error several times, where the auto-scheduler runs to completion, but then the CUDA runtime seems to crash:
Have you seen this error before and do you have any suggestions about how I could fix it? I can provide more details about the applications that crash if that would be helpful.