This repository has been archived by the owner on Jul 1, 2023. It is now read-only.
Add CudaDeviceGuard to recvImplFromLoop (#264)
Summary:
Without this change, I am seeing the following error when running PyTorch's `test_device_map_gpu_non_default` test:

```
[E thread_pool.cpp:112] Exception in thread pool task: CUDA error: invalid device ordinal
Exception raised from exchangeDevice at ../c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb1de41c9ac in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fb1de3e022c in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc29798 (0x7fb1f32f6798 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xc19d91 (0x7fb1f32e6d91 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x295 (0x7fb1de40b955 in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #5: <unknown function> + 0xc819d (0x7fb1fece419d in /raid/shenli/miniconda/envs/torchdev/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x9609 (0x7fb21c06e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fb21bf95293 in /lib/x86_64-linux-gnu/libc.so.6)
```

Pull Request resolved: #264

Reviewed By: beauby

Differential Revision: D25887193

Pulled By: mrshenli

fbshipit-source-id: 47fa6b22e88f0a2800c6b20451e74c880715d637
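
The diff itself is not shown here, so the following is only a generic sketch of the RAII device-guard pattern that the commit title refers to, not the actual change to `recvImplFromLoop`. All names (`mock::getDevice`, `mock::setDevice`, `DeviceGuard`, `recvOnDevice`) are hypothetical; the `mock` functions stand in for the CUDA runtime's `cudaGetDevice` and `cudaSetDevice` so that the example builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for cudaGetDevice / cudaSetDevice, so this sketch
// compiles and runs without the CUDA toolkit.
namespace mock {
constexpr int kDeviceCount = 2;
thread_local int currentDevice = 0;

int getDevice() {
  return currentDevice;
}

void setDevice(int device) {
  if (device < 0 || device >= kDeviceCount) {
    throw std::runtime_error("invalid device ordinal");
  }
  currentDevice = device;
}
} // namespace mock

// RAII guard (hypothetical name): remembers the calling thread's current
// device, switches to the requested one, and restores the original device
// when the guard goes out of scope.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) : previous_(mock::getDevice()) {
    mock::setDevice(device);
  }
  ~DeviceGuard() {
    mock::currentDevice = previous_;
  }
  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  int previous_;
};

// Hypothetical receive step: device-specific work runs while the guard is
// alive. Returns the device that was current during that work.
int recvOnDevice(int device) {
  DeviceGuard guard(device);
  return mock::getDevice();
}
```

With a guard in place, the receive path selects its target device explicitly and leaves the thread's device unchanged for whatever runs on that thread afterwards.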