
[CUDNN] PoolWindow::reserve crash, vector out of range. Race condition #19394

Closed
xsacha opened this issue Apr 18, 2019 · 12 comments
Labels
high priority · module: cudnn (Related to torch.backends.cudnn and cuDNN support) · module: multithreading (Related to issues that occur when running on multiple CPU threads) · module: windows (Windows support for PyTorch) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@xsacha
Contributor

xsacha commented Apr 18, 2019

🐛 Bug

When I use a JIT model from two threads at the same time, the first concurrent use crashes here.

PoolWindow::reserve appears to access a vector out of range.

The responsible code was added in #14861

Backtrace:

 	msvcp140d.dll!std::_Debug_message(const wchar_t * message, const wchar_t * file, unsigned int line) Line 9	C++
>	caffe2_gpu.dll!std::vector<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,cudnnContext * __ptr64> > > >,std::allocator<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,cudnnContext * __ptr64> > > > > >::operator[](const unsigned __int64 _Pos) Line 1796	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::_Vec_lo(unsigned __int64 _Bucket) Line 822	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::_Begin(unsigned __int64 _Bucket) Line 841	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::lower_bound(const int & _Keyval) Line 647	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::find(const int & _Keyval) Line 630	C++
 	caffe2_gpu.dll!at::native::`anonymous namespace'::PoolWindow::reserve(int device) Line 88	C++
 	caffe2_gpu.dll!at::native::getCudnnHandle() Line 149	C++
 	caffe2_gpu.dll!at::native::setCuDNNStreamToCurrent() Line 13	C++
 	caffe2_gpu.dll!at::native::cudnn_convolution(const at::Tensor & input_t, const at::Tensor & weight_t, const at::Tensor & bias_t, c10::ArrayRef<__int64> padding, c10::ArrayRef<__int64> stride, c10::ArrayRef<__int64> dilation, __int64 groups, bool benchmark, bool deterministic) Line 930	C++
 	caffe2_gpu.dll!at::CUDAFloatType::cudnn_convolution(const at::Tensor & self, const at::Tensor & weight, const at::Tensor & bias, c10::ArrayRef<__int64> padding, c10::ArrayRef<__int64> stride, c10::ArrayRef<__int64> dilation, __int64 groups, bool benchmark, bool deterministic) Line 5315	C++

To Reproduce

Steps to reproduce the behavior:

  1. Load a JIT model (once, so on one thread)
  2. After the model has loaded, run forward on it from two separate threads simultaneously
  3. Experience the crash shown in the backtrace above

Called from two threads at the same time:

    static std::once_flag model_flag;
    std::call_once(model_flag, [&modelFile]() {
        model = torch::jit::load(modelFile, torch::kCUDA);
    });
    // All works fine up until here
    model->forward({torch::randn({1, 3, 1024, 1024}, torch::kCUDA)});
    // Crashes when called from a different thread than the one the model was created on.
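
For completeness, here is a self-contained sketch of what I'm doing (the model path, thread setup, and globals here are placeholders on my side; the load/forward calls are the same as above):

    // Minimal two-thread repro sketch, assuming the libtorch 1.1-era C++ API.
    #include <torch/script.h>
    #include <mutex>
    #include <string>
    #include <thread>

    static std::shared_ptr<torch::jit::script::Module> model;
    static std::once_flag model_flag;

    void run_forward(const std::string& modelFile) {
        // Load the model exactly once, on whichever thread gets here first.
        std::call_once(model_flag, [&modelFile]() {
            model = torch::jit::load(modelFile, torch::kCUDA);
        });
        // Concurrent forward() calls from a thread other than the loading
        // thread are what trigger the crash in PoolWindow::reserve.
        model->forward({torch::randn({1, 3, 1024, 1024}, torch::kCUDA)});
    }

    int main() {
        const std::string modelFile = "model.pt";  // placeholder path
        std::thread t1(run_forward, modelFile);
        std::thread t2(run_forward, modelFile);
        t1.join();
        t2.join();
        return 0;
    }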

Expected behavior

Forward should work when the model is loaded from one thread and run from others, without having to wait for some arbitrary time period to avoid the race condition.

Environment

  • PyTorch Version (e.g., 1.0): 1.1.0-pre
  • OS (e.g., Linux): Windows 64-bit
  • How you installed PyTorch (conda, pip, source): nightly
  • Build command you used (if compiling from source): N/A
  • Python version: 3.7
  • CUDA/cuDNN version: 10.0/7.5
  • GPU models and configuration: GTX1060

Workarounds

Other models that I run through a dedicated model thread (which batches their inputs) work fine, since that thread ensures they never run simultaneously.
This specific model takes tensors of different sizes, which I am unable to batch together.

@ezyang
Contributor

ezyang commented Apr 18, 2019

This sounds like getCudnnHandle is not thread safe.

@ezyang
Contributor

ezyang commented Apr 18, 2019

Yep. It's not thread safe:

  cudnnHandle_t reserve(int device)
  {
    // If this thread already has a handle for this device, return it
    if(my_handles.find(device) != my_handles.end())
      return my_handles[device];

    // otherwise, either grab a handle from the pool if one is available,
    // or if not, create a new one.
    std::lock_guard<std::mutex> guard(mutex);

We're not allowed to poke my_handles without the mutex unless we can guarantee that there isn't a mutation going on, which we can't.

cc @mcarilli

I wonder if this explains the Windows race that plagued #14861. Maybe not.

@nairbv added the high priority, module: cudnn, module: multithreading, and triaged labels on Apr 18, 2019
@mcarilli
Collaborator

mcarilli commented Apr 18, 2019

Yep. It's not thread safe.

my_handles is a thread-local object, though (it's a member of thread_local PoolWindow myPoolWindow;), so I'm confused why this would be a problem.

@ezyang
Contributor

ezyang commented Apr 18, 2019

Oops, I guess I spoke too soon :) Is it possible that thread_local doesn't work the way we think it does on Windows?

@ezyang
Contributor

ezyang commented Apr 18, 2019

@xsacha This is not a true fix, but if you move the mutex to guard the my_handles read (even though, as @mcarilli says, it should not be necessary), does that eliminate the crash?
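
Roughly something like this, i.e. just hoisting the existing guard above the my_handles lookup (a sketch only, reusing the member names from the excerpt above; the rest of the body is unchanged):

  cudnnHandle_t reserve(int device)
  {
    // Experiment: take the pool mutex before touching my_handles, so the
    // per-thread lookup is serialized as well.
    std::lock_guard<std::mutex> guard(mutex);

    // If this thread already has a handle for this device, return it.
    auto it = my_handles.find(device);
    if (it != my_handles.end())
      return it->second;

    // ...the rest of the existing body (grab a handle from the pool, or
    // create a new one) runs unchanged under this same guard, without
    // re-acquiring it.
  }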

@xsacha
Contributor Author

xsacha commented Apr 18, 2019

I figure that would fix the issue, because the code that crashes would then be behind the mutex lock, but I'm not sure.
It's the Easter holidays here, so I'm not able to confirm.
Were you able to reproduce?

@xsacha
Contributor Author

xsacha commented Apr 19, 2019

I'm actually finding it really hard to not reproduce this issue, so I was wondering if you know of a workaround to reserve the cuDNN handles in advance, or to set up the pool in advance, without actually loading the model in advance?
I've tried wrapping everything in blocking calls, adding tiny thread sleeps, etc., but it still reproduces.

@HapeMask
Contributor

HapeMask commented Jun 11, 2019

@mcarilli @ezyang a coworker shared this with me; it might be relevant re: "TLS doesn't work the way we think it does on Windows": https://developercommunity.visualstudio.com/content/problem/124121/thread-local-variables-fail-to-be-initialized-when.html

They do say that it only affects "those that also use LoadLibrary DLLs and create threads before loading the DLL," which I wasn't sure was the case here. Edit: actually yes, at least in our case described below, because our code is dynamically loaded by Unity, which has already created the thread before loading us.

FWIW, we are seeing the same crash in getCudnnHandle(), with a 100% repro rate, when running a loaded JIT module that makes cuDNN convolution calls inside a Unity render thread on Windows.
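
To make the failure mode in that link concrete, here is a hypothetical, stripped-down two-file illustration (none of this is PyTorch code, and whether it reproduces depends on the MSVC/CRT version): a thread_local with a dynamic initializer lives in a DLL that is loaded with LoadLibrary after a thread already exists, and that pre-existing thread then reads it without the constructor ever having run.

    // plugin.cpp -- build as plugin.dll (hypothetical demo, not PyTorch code)
    #include <cstdio>

    namespace {
    struct Tracker {
      int value;
      Tracker() : value(42) { std::printf("Tracker constructed on this thread\n"); }
    };
    // thread_local with a dynamic (non-constant) initializer -- the case the
    // linked MSVC report describes.
    thread_local Tracker tls_tracker;
    }  // namespace

    extern "C" __declspec(dllexport) int read_tls() {
      return tls_tracker.value;
    }

    // main.cpp -- build as the host .exe
    #include <windows.h>
    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
      std::atomic<int (*)()> fn{nullptr};

      // The worker thread is created BEFORE the DLL is loaded.
      std::thread worker([&fn] {
        while (fn.load() == nullptr) { /* spin until the DLL is loaded */ }
        // On an affected toolchain the thread_local's constructor never runs on
        // this pre-existing thread, so this can print 0 instead of 42.
        std::printf("value on pre-existing thread = %d\n", fn.load()());
      });

      HMODULE dll = LoadLibraryA("plugin.dll");  // error handling omitted
      fn = reinterpret_cast<int (*)()>(GetProcAddress(dll, "read_tls"));

      worker.join();
      FreeLibrary(dll);
      return 0;
    }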

@ezyang
Contributor

ezyang commented Jun 11, 2019 via email

@HapeMask
Contributor

Not yet. Based on the comment from MSFT in the linked thread, though, it may not be something a library developer can fix, given that you can't control when a consumer decides to load your library (if they do it dynamically) relative to thread construction.

Of course, one option is to not use thread_local, or to conditionally not use it on Windows, but I don't know if that is something you can or want to do.

On our end, we are going to try explicitly loading caffe2_gpu.dll before loading our plugin, which links against libtorch. I'll post an update on whether or not it works, but that won't help fix the root cause, unfortunately (if this is indeed the issue, which seems likely).
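
Concretely, the preload we're going to try looks roughly like this (our plugin's name is a placeholder; error handling trimmed):

    #include <windows.h>

    // Force caffe2_gpu.dll into the process early, before loading our own
    // libtorch-linked plugin. Whether this helps depends on whether the
    // threads that later call into torch already exist at this point.
    bool PreloadTorchGpuRuntime() {
      HMODULE caffe2_gpu = LoadLibraryA("caffe2_gpu.dll");
      if (caffe2_gpu == nullptr)
        return false;  // in practice, log GetLastError()
      HMODULE plugin = LoadLibraryA("our_torch_plugin.dll");  // placeholder name
      return plugin != nullptr;
    }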

@xsacha
Contributor Author

xsacha commented Jun 11, 2019

I have worked around the issue by ensuring I never call forward on a model from a different thread than the one I loaded it on. That is, I have one model thread per model.

This is required for batching anyway, but it doesn't make sense in situations where the tensors are different sizes and can't be batched (which is how I came across this issue).
Other deep learning frameworks do not exhibit this limitation.

All my torch code and the torch/CUDA libraries are dynamically loaded together at first use (with LoadLibrary) to keep memory utilisation low in processes that don't use them.
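
Roughly what the per-model thread looks like on my side (a simplified sketch, not the real code; batching, error handling, and shutdown details are omitted):

    #include <torch/script.h>
    #include <condition_variable>
    #include <functional>
    #include <future>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // A single worker thread owns the module; every forward() goes through its
    // queue, so the module is only ever touched from the thread it was loaded on.
    class ModelWorker {
     public:
      explicit ModelWorker(const std::string& modelFile) {
        worker_ = std::thread([this, modelFile] { Run(modelFile); });
      }

      ~ModelWorker() {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
      }

      // Callable from any thread; the actual forward happens on the worker thread.
      std::future<torch::jit::IValue> Forward(torch::Tensor input) {
        auto task = std::make_shared<std::packaged_task<torch::jit::IValue()>>(
            [this, input] { return model_->forward({input}); });
        auto result = task->get_future();
        {
          std::lock_guard<std::mutex> lock(mutex_);
          tasks_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
      }

     private:
      void Run(const std::string& modelFile) {
        // Load on the worker thread, so forward() is never called from a
        // different thread than the one the model was loaded on.
        model_ = torch::jit::load(modelFile, torch::kCUDA);
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();
        }
      }

      std::shared_ptr<torch::jit::script::Module> model_;
      std::mutex mutex_;
      std::condition_variable cv_;
      std::queue<std::function<void()>> tasks_;
      bool stop_ = false;
      std::thread worker_;  // declared last; started in the constructor body
    };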

@ezyang
Contributor

ezyang commented Jun 11, 2019

Well, we could always rewrite the code in #15080 to use some sort of lock-free structure (or we could even just take out a big honking lock; the critical region isn't big). @mcarilli, thoughts? (Also, would you be interested in doing this?)

@ezyang added the module: windows label on Jun 11, 2019
zdevito pushed a commit to zdevito/ATen that referenced this issue Jul 2, 2019
Summary:
Fixes pytorch/pytorch#19394

See https://developercommunity.visualstudio.com/content/problem/124121/thread-local-variables-fail-to-be-initialized-when.html for context.
Pull Request resolved: pytorch/pytorch#22405

Differential Revision: D16090822

Pulled By: ezyang

fbshipit-source-id: 9fdd2c272fa7723fb62b906336d2e2620411b12b
facebook-github-bot pushed a commit that referenced this issue Mar 25, 2020
Summary:
The Windows + MSVC-specific bug discussed here: #19394 and fixed here: #22405 still appears in C10's warning handler class. This results in a crash if a user attempts to run code which would print a warning when that code is running inside a thread created by a DLL. This PR applies a similar fix to that of #22405.
Pull Request resolved: #34822

Test Plan:
* Tested locally by running CodecverseWorkbench Unity app with patched build.
* CI

Differential Revision: D20627971

Pulled By: HapeMask

fbshipit-source-id: 64dfca531ed7eebbe9e0ecac3d3d4d025c683883