
[CUDNN] PoolWindow::reserve crash, vector out of range. Race condition #19394

Closed
xsacha opened this issue Apr 18, 2019 · 12 comments
Labels
high priority · module: cudnn (Related to torch.backends.cudnn and cuDNN support) · module: multithreading (Related to issues that occur when running on multiple CPU threads) · module: windows (Windows support for PyTorch) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@xsacha
Contributor

xsacha commented Apr 18, 2019

🐛 Bug

When I use a JIT model from two threads at the same time, the first concurrent use crashes here.

PoolWindow::reserve appears to access a vector out of range.

The responsible code was added in #14861

Backtrace:

 	msvcp140d.dll!std::_Debug_message(const wchar_t * message, const wchar_t * file, unsigned int line) Line 9	C++
>	caffe2_gpu.dll!std::vector<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,cudnnContext * __ptr64> > > >,std::allocator<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,cudnnContext * __ptr64> > > > > >::operator[](const unsigned __int64 _Pos) Line 1796	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::_Vec_lo(unsigned __int64 _Bucket) Line 822	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::_Begin(unsigned __int64 _Bucket) Line 841	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::lower_bound(const int & _Keyval) Line 647	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::find(const int & _Keyval) Line 630	C++
 	caffe2_gpu.dll!at::native::`anonymous namespace'::PoolWindow::reserve(int device) Line 88	C++
 	caffe2_gpu.dll!at::native::getCudnnHandle() Line 149	C++
 	caffe2_gpu.dll!at::native::setCuDNNStreamToCurrent() Line 13	C++
 	caffe2_gpu.dll!at::native::cudnn_convolution(const at::Tensor & input_t, const at::Tensor & weight_t, const at::Tensor & bias_t, c10::ArrayRef<__int64> padding, c10::ArrayRef<__int64> stride, c10::ArrayRef<__int64> dilation, __int64 groups, bool benchmark, bool deterministic) Line 930	C++
 	caffe2_gpu.dll!at::CUDAFloatType::cudnn_convolution(const at::Tensor & self, const at::Tensor & weight, const at::Tensor & bias, c10::ArrayRef<__int64> padding, c10::ArrayRef<__int64> stride, c10::ArrayRef<__int64> dilation, __int64 groups, bool benchmark, bool deterministic) Line 5315	C++

To Reproduce

Steps to reproduce the behavior:

  1. Load a JIT model (once, so on one thread)
  2. After the model has loaded, run forward on it from two separate threads simultaneously
  3. Experience the crash shown in the backtrace above

Called from two threads at the same time:

    static std::once_flag model_flag;
    std::call_once(model_flag, [&modelFile]() {
        model = torch::jit::load(modelFile, torch::kCUDA);
    });
    // All works fine up until here
    model->forward({torch::randn({1, 3, 1024, 1024}, torch::kCUDA)});
    // Crashes when called from a different thread than the one the model was created on.
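
For completeness, here is a self-contained sketch of what I'm doing (the model path, thread setup, and globals here are placeholders on my side; the load/forward calls are the same as above):

    // Minimal two-thread repro sketch, assuming the libtorch 1.1-era C++ API.
    #include <torch/script.h>
    #include <mutex>
    #include <string>
    #include <thread>

    static std::shared_ptr<torch::jit::script::Module> model;
    static std::once_flag model_flag;

    void run_forward(const std::string& modelFile) {
        // Load the model exactly once, on whichever thread gets here first.
        std::call_once(model_flag, [&modelFile]() {
            model = torch::jit::load(modelFile, torch::kCUDA);
        });
        // Concurrent forward() calls from a thread other than the loading
        // thread are what trigger the crash in PoolWindow::reserve.
        model->forward({torch::randn({1, 3, 1024, 1024}, torch::kCUDA)});
    }

    int main() {
        const std::string modelFile = "model.pt";  // placeholder path
        std::thread t1(run_forward, modelFile);
        std::thread t2(run_forward, modelFile);
        t1.join();
        t2.join();
        return 0;
    }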

Expected behavior

Forward should work when the model is loaded from one thread and run from others, without having to wait for some arbitrary time period to avoid the race condition.

Environment

  • PyTorch Version (e.g., 1.0): 1.1.0-pre
  • OS (e.g., Linux): Windows 64-bit
  • How you installed PyTorch (conda, pip, source): nightly
  • Build command you used (if compiling from source): N/A
  • Python version: 3.7
  • CUDA/cuDNN version: 10.0/7.5
  • GPU models and configuration: GTX1060

Workarounds

Other models that I run through a dedicated model thread (which batches their inputs) work fine, since that thread ensures they never run simultaneously.
This specific model takes tensors of different sizes, which I am unable to batch together.

@ezyang
Contributor

ezyang commented Apr 18, 2019

This sounds like getCudnnHandle is not thread safe.

@ezyang
Contributor

ezyang commented Apr 18, 2019

Yep. It's not thread safe:

  cudnnHandle_t reserve(int device)
  {
    // If this thread already has a handle for this device, return it
    if(my_handles.find(device) != my_handles.end())
      return my_handles[device];

    // otherwise, either grab a handle from the pool if one is available,
    // or if not, create a new one.
    std::lock_guard<std::mutex> guard(mutex);

We're not allowed to poke my_handles without the mutex unless we can guarantee that there isn't a mutation going on, which we can't.

cc @mcarilli

I wonder if this explains the Windows race that plagued #14861. Maybe not.

@nairbv added the high priority, module: cudnn, module: multithreading, and triaged labels on Apr 18, 2019
@mcarilli
Collaborator

mcarilli commented Apr 18, 2019

Yep. It's not thread safe.

my_handles is a thread-local object, though (it's a member of thread_local PoolWindow myPoolWindow;), so I'm confused why this would be a problem.

@ezyang
Contributor

ezyang commented Apr 18, 2019

Oops, I guess I spoke too soon :) Is it possible that thread_local doesn't work the way we think it does on Windows?

@ezyang
Contributor

ezyang commented Apr 18, 2019

@xsacha This is not a true fix, but if you move the mutex to guard the my_handles read (even though, as @mcarilli says, it should not be necessary), does that eliminate the crash?
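
Roughly something like this, i.e. just hoisting the existing guard above the my_handles lookup (a sketch only, reusing the member names from the excerpt above; the rest of the body is unchanged):

  cudnnHandle_t reserve(int device)
  {
    // Experiment: take the pool mutex before touching my_handles, so the
    // per-thread lookup is serialized as well.
    std::lock_guard<std::mutex> guard(mutex);

    // If this thread already has a handle for this device, return it.
    auto it = my_handles.find(device);
    if (it != my_handles.end())
      return it->second;

    // ...the rest of the existing body (grab a handle from the pool, or
    // create a new one) runs unchanged under this same guard, without
    // re-acquiring it.
  }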

@xsacha
Contributor Author

xsacha commented Apr 18, 2019

I figure that would fix the issue, because the code that crashes would then be behind the mutex lock, but I'm not sure.
It's the Easter holidays here, so I'm not able to confirm.
Were you able to reproduce?

@xsacha
Contributor Author

xsacha commented Apr 19, 2019

I'm actually finding it really hard to not reproduce this issue, so I was wondering if you know of a workaround to reserve the cuDNN handles in advance, or to set up the pool in advance, without actually loading the model in advance?
I've tried wrapping everything in blocking calls, adding tiny thread sleeps, etc., but it still reproduces.

@HapeMask
Contributor

HapeMask commented Jun 11, 2019

@mcarilli @ezyang a coworker shared this with me; it might be relevant re: "TLS doesn't work the way we think it does on Windows": https://developercommunity.visualstudio.com/content/problem/124121/thread-local-variables-fail-to-be-initialized-when.html

They do say that it only affects "those that also use LoadLibrary DLLs and create threads before loading the DLL," which I wasn't sure was the case here. Edit: actually yes, at least in our case described below, because our code is dynamically loaded by Unity, which has already created the thread before loading us.

FWIW, we are seeing the same crash in getCudnnHandle(), with a 100% repro rate, when running a loaded JIT module that makes cuDNN convolution calls inside a Unity render thread on Windows.
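
To make the failure mode in that link concrete, here is a hypothetical, stripped-down two-file illustration (none of this is PyTorch code, and whether it reproduces depends on the MSVC/CRT version): a thread_local with a dynamic initializer lives in a DLL that is loaded with LoadLibrary after a thread already exists, and that pre-existing thread then reads it without the constructor ever having run.

    // plugin.cpp -- build as plugin.dll (hypothetical demo, not PyTorch code)
    #include <cstdio>

    namespace {
    struct Tracker {
      int value;
      Tracker() : value(42) { std::printf("Tracker constructed on this thread\n"); }
    };
    // thread_local with a dynamic (non-constant) initializer -- the case the
    // linked MSVC report describes.
    thread_local Tracker tls_tracker;
    }  // namespace

    extern "C" __declspec(dllexport) int read_tls() {
      return tls_tracker.value;
    }

    // main.cpp -- build as the host .exe
    #include <windows.h>
    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
      std::atomic<int (*)()> fn{nullptr};

      // The worker thread is created BEFORE the DLL is loaded.
      std::thread worker([&fn] {
        while (fn.load() == nullptr) { /* spin until the DLL is loaded */ }
        // On an affected toolchain the thread_local's constructor never runs on
        // this pre-existing thread, so this can print 0 instead of 42.
        std::printf("value on pre-existing thread = %d\n", fn.load()());
      });

      HMODULE dll = LoadLibraryA("plugin.dll");  // error handling omitted
      fn = reinterpret_cast<int (*)()>(GetProcAddress(dll, "read_tls"));

      worker.join();
      FreeLibrary(dll);
      return 0;
    }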

@ezyang
Contributor

ezyang commented Jun 11, 2019 via email

@HapeMask
Contributor

Not yet. Based on the comment from MSFT in the linked thread, though, it may not be something a library developer can fix, given that you can't control when a consumer decides to load your library (if they do it dynamically) relative to thread construction.

Of course, one option is to not use thread_local, or to conditionally not use it on Windows, but I don't know if that is something you can or want to do.

On our end, we are going to try explicitly loading caffe2_gpu.dll before loading our plugin, which links against libtorch. I'll post an update on whether or not it works, but that won't help fix the root cause, unfortunately (if this is indeed the issue, which seems likely).
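
Concretely, the preload we're going to try looks roughly like this (our plugin's name is a placeholder; error handling trimmed):

    #include <windows.h>

    // Force caffe2_gpu.dll into the process early, before loading our own
    // libtorch-linked plugin. Whether this helps depends on whether the
    // threads that later call into torch already exist at this point.
    bool PreloadTorchGpuRuntime() {
      HMODULE caffe2_gpu = LoadLibraryA("caffe2_gpu.dll");
      if (caffe2_gpu == nullptr)
        return false;  // in practice, log GetLastError()
      HMODULE plugin = LoadLibraryA("our_torch_plugin.dll");  // placeholder name
      return plugin != nullptr;
    }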

@xsacha
Contributor Author

xsacha commented Jun 11, 2019

I have worked around the issue by ensuring I never call forward on a model from a different thread than the one I loaded it on. That is, I have one model thread per model.

This is required for batching anyway, but it doesn't make sense in situations where the tensors are different sizes and can't be batched (which is how I came across this issue).
Other deep learning frameworks do not exhibit this limitation.

All my torch code and the torch/CUDA libraries are dynamically loaded together at first use (with LoadLibrary) to keep memory utilisation low in processes that don't use them.
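
Roughly what the per-model thread looks like on my side (a simplified sketch, not the real code; batching, error handling, and shutdown details are omitted):

    #include <torch/script.h>
    #include <condition_variable>
    #include <functional>
    #include <future>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // A single worker thread owns the module; every forward() goes through its
    // queue, so the module is only ever touched from the thread it was loaded on.
    class ModelWorker {
     public:
      explicit ModelWorker(const std::string& modelFile) {
        worker_ = std::thread([this, modelFile] { Run(modelFile); });
      }

      ~ModelWorker() {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
      }

      // Callable from any thread; the actual forward happens on the worker thread.
      std::future<torch::jit::IValue> Forward(torch::Tensor input) {
        auto task = std::make_shared<std::packaged_task<torch::jit::IValue()>>(
            [this, input] { return model_->forward({input}); });
        auto result = task->get_future();
        {
          std::lock_guard<std::mutex> lock(mutex_);
          tasks_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
      }

     private:
      void Run(const std::string& modelFile) {
        // Load on the worker thread, so forward() is never called from a
        // different thread than the one the model was loaded on.
        model_ = torch::jit::load(modelFile, torch::kCUDA);
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();
        }
      }

      std::shared_ptr<torch::jit::script::Module> model_;
      std::mutex mutex_;
      std::condition_variable cv_;
      std::queue<std::function<void()>> tasks_;
      bool stop_ = false;
      std::thread worker_;  // declared last; started in the constructor body
    };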

@ezyang
Contributor

ezyang commented Jun 11, 2019

Well, we could always rewrite the code in #15080 to use some sort of lock-free structure (or we could even just take out a big honking lock; the critical region isn't big). @mcarilli, thoughts? (Also, would you be interested in doing this?)

@ezyang added the module: windows label on Jun 11, 2019
zdevito pushed a commit to zdevito/ATen that referenced this issue Jul 2, 2019
Summary:
Fixes pytorch/pytorch#19394

See https://developercommunity.visualstudio.com/content/problem/124121/thread-local-variables-fail-to-be-initialized-when.html for context.
Pull Request resolved: pytorch/pytorch#22405

Differential Revision: D16090822

Pulled By: ezyang

fbshipit-source-id: 9fdd2c272fa7723fb62b906336d2e2620411b12b
facebook-github-bot pushed a commit that referenced this issue Mar 25, 2020
Summary:
The Windows + MSVC-specific bug discussed here: #19394 and fixed here: #22405 still appears in C10's warning handler class. This results in a crash if a user attempts to run code which would print a warning when that code is running inside a thread created by a DLL. This PR applies a similar fix to that of #22405.
Pull Request resolved: #34822

Test Plan:
* Tested locally by running CodecverseWorkbench Unity app with patched build.
* CI

Differential Revision: D20627971

Pulled By: HapeMask

fbshipit-source-id: 64dfca531ed7eebbe9e0ecac3d3d4d025c683883