Add CudaDeviceGuard to recvImplFromLoop #264

mrshenli · 2021-01-12T18:46:05Z

Without this change, I am seeing the following error when running PyTorch's `test_device_map_gpu_non_default` test:

```
[E thread_pool.cpp:112] Exception in thread pool task: CUDA error: invalid device ordinal
Exception raised from exchangeDevice at ../c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb1de41c9ac in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fb1de3e022c in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc29798 (0x7fb1f32f6798 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xc19d91 (0x7fb1f32e6d91 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x295 (0x7fb1de40b955 in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #5: <unknown function> + 0xc819d (0x7fb1fece419d in /raid/shenli/miniconda/envs/torchdev/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x9609 (0x7fb21c06e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fb21bf95293 in /lib/x86_64-linux-gnu/libc.so.6)
```
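For readers unfamiliar with the pattern named in the title: a device guard is typically an RAII object that makes a given device current on construction and restores the previously current device on destruction. The sketch below illustrates only that pattern. It is not the code from this PR; `currentDevice`, `DeviceGuard`, and `recvOnDevice` are hypothetical names, and a thread-local integer stands in for the CUDA runtime's per-thread current device (which real code would change with `cudaSetDevice`).

```cpp
#include <cassert>

// Hypothetical stand-in for the CUDA runtime's per-thread current device.
thread_local int currentDevice = 0;

// RAII guard: switches to `device` on construction and restores the
// previously current device on destruction, even on early exit.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) : previous_(currentDevice) {
    assert(device >= 0); // a negative index is never a valid device here
    currentDevice = device;
  }
  ~DeviceGuard() {
    currentDevice = previous_;
  }
  // Copying a guard would restore the device twice, so forbid it.
  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  int previous_;
};

// Hypothetical receive step that must run with `device` current.
// Returns the device that was current while the body executed.
int recvOnDevice(int device) {
  DeviceGuard guard(device);
  return currentDevice;
}
```

With a guard in place, the body of the function sees the requested device, and the calling thread's device is unchanged once the function returns.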

facebook-github-bot

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-01-12T18:48:00Z

@mrshenli has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-01-12T19:30:46Z

@mrshenli merged this pull request in ac98f40.

Summary:
Without this change, I am seeing the following error when running PyTorch's `test_device_map_gpu_non_default` test:

```
[E thread_pool.cpp:112] Exception in thread pool task: CUDA error: invalid device ordinal
Exception raised from exchangeDevice at ../c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb1de41c9ac in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fb1de3e022c in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc29798 (0x7fb1f32f6798 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xc19d91 (0x7fb1f32e6d91 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x295 (0x7fb1de40b955 in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #5: <unknown function> + 0xc819d (0x7fb1fece419d in /raid/shenli/miniconda/envs/torchdev/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x9609 (0x7fb21c06e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fb21bf95293 in /lib/x86_64-linux-gnu/libc.so.6)
```

Pull Request resolved: #264

Reviewed By: beauby

Differential Revision: D25887193

Pulled By: mrshenli

fbshipit-source-id: 47fa6b22e88f0a2800c6b20451e74c880715d637

facebook-github-bot added the cla signed label Jan 12, 2021

facebook-github-bot reviewed Jan 12, 2021

fix lint

0d65bce

facebook-github-bot reviewed Jan 12, 2021

mrshenli requested a review from beauby January 12, 2021 18:49

facebook-github-bot closed this in ac98f40 Jan 12, 2021

facebook-github-bot added the Merged label Jan 12, 2021

beauby deleted the mrshenli-patch-1 branch April 12, 2021 01:28
