This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Add CudaDeviceGuard to recvImplFromLoop #264

Closed
wants to merge 2 commits into from

Conversation

mrshenli
Contributor

Without this change, I am seeing the following error when running PyTorch's `test_device_map_gpu_non_default` test:

```
[E thread_pool.cpp:112] Exception in thread pool task: CUDA error: invalid device ordinal
Exception raised from exchangeDevice at ../c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb1de41c9ac in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fb1de3e022c in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc29798 (0x7fb1f32f6798 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xc19d91 (0x7fb1f32e6d91 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x295 (0x7fb1de40b955 in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #5: <unknown function> + 0xc819d (0x7fb1fece419d in /raid/shenli/miniconda/envs/torchdev/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x9609 (0x7fb21c06e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fb21bf95293 in /lib/x86_64-linux-gnu/libc.so.6)
```
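
For readers unfamiliar with the pattern named in the title, here is a minimal sketch of an RAII device guard. It is not the actual `CudaDeviceGuard` or `recvImplFromLoop` code from this PR (the diff is not shown in this conversation). The `mock` namespace, `DeviceGuard`, and `recvOnDevice` are hypothetical names, and the CUDA runtime's per-thread current device is simulated with a `thread_local` variable so the example builds without CUDA. Real code would presumably call `cudaGetDevice` and `cudaSetDevice` instead.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the CUDA runtime's per-thread "current device".
// Assumption: two devices exist; real code would query the runtime.
namespace mock {
constexpr int kDeviceCount = 2;
thread_local int currentDevice = 0;

int getDevice() { return currentDevice; }

void setDevice(int device) {
  if (device < 0 || device >= kDeviceCount) {
    throw std::runtime_error("invalid device ordinal");
  }
  currentDevice = device;
}
} // namespace mock

// RAII guard: switches to `device` on construction and restores the
// previously current device on destruction, even if the scope exits
// through an exception.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) : previous_(mock::getDevice()) {
    mock::setDevice(device);
  }
  // previous_ was already validated when it was set, so restore it directly
  // and keep the destructor non-throwing.
  ~DeviceGuard() { mock::currentDevice = previous_; }

  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  int previous_;
};

// Hypothetical receive step: runs with `targetDevice` as the current device
// and returns the device that was current while the work ran.
int recvOnDevice(int targetDevice) {
  DeviceGuard guard(targetDevice);
  return mock::getDevice();
}
```

In this sketch, calling `recvOnDevice(1)` from a thread whose current device is 0 runs the body on device 1 and leaves the thread on device 0 afterwards, so later work on the same thread does not inherit the changed device.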
@facebook-github-bot facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
@mrshenli has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
@mrshenli merged this pull request in ac98f40.

@beauby beauby deleted the mrshenli-patch-1 branch April 12, 2021 01:28
lw pushed a commit that referenced this pull request Sep 20, 2021
Summary:
Without this change, I am seeing the following error when running PyTorch's `test_device_map_gpu_non_default` test:

```
[E thread_pool.cpp:112] Exception in thread pool task: CUDA error: invalid device ordinal
Exception raised from exchangeDevice at ../c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb1de41c9ac in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fb1de3e022c in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc29798 (0x7fb1f32f6798 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xc19d91 (0x7fb1f32e6d91 in /raid/shenli/pytorch/torch/lib/libtorch_python.so)
frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x295 (0x7fb1de40b955 in /raid/shenli/pytorch/torch/lib/libc10.so)
frame #5: <unknown function> + 0xc819d (0x7fb1fece419d in /raid/shenli/miniconda/envs/torchdev/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x9609 (0x7fb21c06e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fb21bf95293 in /lib/x86_64-linux-gnu/libc.so.6)
```

Pull Request resolved: #264

Reviewed By: beauby

Differential Revision: D25887193

Pulled By: mrshenli

fbshipit-source-id: 47fa6b22e88f0a2800c6b20451e74c880715d637