This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Add CudaDeviceGuard to recvImplFromLoop (#264)
Summary:
Without this change, I am seeing the following error when running PyTorch's `test_device_map_gpu_non_default` test:

```
[E thread_pool.cpp:112] Exception in thread pool task: CUDA error: invalid device ordinal
Exception raised from exchangeDevice at ../c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb1de41c9ac in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fb1de3e022c in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc29798 (0x7fb1f32f6798 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xc19d91 (0x7fb1f32e6d91 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x295 (0x7fb1de40b955 in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #5: <unknown function> + 0xc819d (0x7fb1fece419d in /raid/shenli/miniconda/envs/torchdev/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x9609 (0x7fb21c06e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fb21bf95293 in /lib/x86_64-linux-gnu/libc.so.6)
```

Pull Request resolved: #264

Reviewed By: beauby

Differential Revision: D25887193

Pulled By: mrshenli

fbshipit-source-id: 47fa6b22e88f0a2800c6b20451e74c880715d637
mrshenli authored and facebook-github-bot committed Jan 12, 2021
1 parent ea17890 commit ac98f40
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions tensorpipe/channel/cuda_ipc/channel_impl.cc
@@ -154,6 +154,10 @@ void ChannelImpl::recvImplFromLoop(
     TDescriptor descriptor,
     CudaBuffer buffer,
     TRecvCallback callback) {
+  // Need to guard otherwise some op on the receiver will crash.
+  // TODO: figure out which CUDA op crashed and replace this with a
+  // more precise fix.
+  CudaDeviceGuard guard(cudaDeviceForPointer(buffer.ptr));
   recvOperations_.emplace_back(
       sequenceNumber, buffer.ptr, buffer.stream, buffer.length);
   auto& op = recvOperations_.back();
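
For readers unfamiliar with the pattern, here is a minimal sketch of how an RAII device guard of this kind typically behaves: it switches the calling thread to the buffer's device on construction and restores the previous device when the scope ends. This is not TensorPipe's actual `CudaDeviceGuard`. The `fake_cuda` namespace, `DeviceGuardSketch`, and `deviceSeenDuringRecv` are hypothetical names, and the per-thread integer is only a stand-in for the device that real code would query and set through the CUDA runtime, so the sketch builds without the CUDA toolkit.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the CUDA runtime's per-thread "current device".
// Real code would call the CUDA runtime instead of touching this variable.
namespace fake_cuda {
constexpr int kDeviceCount = 2; // assumed number of devices for the sketch
thread_local int currentDevice = 0;

int getDevice() {
  return currentDevice;
}

void setDevice(int device) {
  if (device < 0 || device >= kDeviceCount) {
    throw std::invalid_argument("invalid device ordinal");
  }
  currentDevice = device;
}
} // namespace fake_cuda

// Illustrative RAII guard (an assumption, not the library's implementation):
// remembers the device that was current, switches to the requested one, and
// switches back when it goes out of scope.
class DeviceGuardSketch {
 public:
  explicit DeviceGuardSketch(int device) : previous_(fake_cuda::getDevice()) {
    fake_cuda::setDevice(device);
  }

  // Restore by direct assignment so the destructor cannot throw.
  ~DeviceGuardSketch() {
    fake_cuda::currentDevice = previous_;
  }

  DeviceGuardSketch(const DeviceGuardSketch&) = delete;
  DeviceGuardSketch& operator=(const DeviceGuardSketch&) = delete;

 private:
  int previous_;
};

// Returns the device that is current while work for a buffer living on
// `bufferDevice` would run, mirroring where the guard sits in the diff.
int deviceSeenDuringRecv(int bufferDevice) {
  DeviceGuardSketch guard(bufferDevice);
  return fake_cuda::getDevice();
}
```

In this sketch, a thread whose current device is 0 sees device 1 inside `deviceSeenDuringRecv(1)` and is back on device 0 once the call returns. That is the property the commit relies on: the operations that follow the guard in `recvImplFromLoop` run with the buffer's device selected, regardless of which device the calling thread was using.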