[CUDNN] PoolWindow::reserve crash, vector out of range. Race condition #19394
Comments
This sounds like
Yep. It's not thread safe:
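A minimal illustrative sketch of the pattern under discussion (not the actual PyTorch source; all names here are placeholders): a process-wide handle pool shared by every thread, plus a `thread_local` window that hands handles out. The shared vector indexed by device is where a "vector out of range" can surface.

```cpp
// Illustrative sketch only -- not the actual PyTorch source. It shows the
// shape of the #14861 design: a process-wide cudnn handle pool shared by
// all threads, plus a thread_local PoolWindow that checks handles out.
#include <cstddef>
#include <mutex>
#include <vector>
#include <cudnn.h>

struct HandlePool {
  std::mutex mutex;
  // One free-list of handles per CUDA device, sized lazily.
  std::vector<std::vector<cudnnHandle_t>> available;
};

static HandlePool pool;  // shared, mutable state

struct PoolWindow {
  // Hands out a handle for `device`, creating one if none is free.
  // If this thread_local object is never constructed (which is what the
  // MSVC LoadLibrary TLS bug causes), its members are garbage, and
  // indexing into the pool's vectors is the "vector out of range" site.
  cudnnHandle_t reserve(int device) {
    std::lock_guard<std::mutex> guard(pool.mutex);
    if (pool.available.size() <= static_cast<std::size_t>(device)) {
      pool.available.resize(device + 1);
    }
    auto& free_list = pool.available[device];
    if (!free_list.empty()) {
      cudnnHandle_t handle = free_list.back();
      free_list.pop_back();
      return handle;
    }
    cudnnHandle_t handle;
    cudnnCreate(&handle);  // status checks elided in this sketch
    return handle;
  }
};

static thread_local PoolWindow myPoolWindow;  // one window per thread
```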
We're not allowed to poke
cc @mcarilli, I wonder if this explains the Windows race that plagued #14861. Maybe not.
Oops, I guess I spoke too soon :) Is it possible that
I figure that would fix the issue, because the code that crashes would then be behind the mutex lock, but I'm not sure.
I'm actually finding it really hard not to reproduce this issue, so I was wondering if you knew a workaround: is there a way to reserve the cudnn handles, or set up the pool, in advance without actually loading the model in advance?
@mcarilli @ezyang A coworker shared this with me; it might be relevant. Re: "TLS doesn't work the way we think it does on Windows": https://developercommunity.visualstudio.com/content/problem/124121/thread-local-variables-fail-to-be-initialized-when.html

They do say that it only affects "those that also use LoadLibrary DLLs and create threads before loading the DLL," which I'm not sure is the case here though?

Edit: Actually yes, at least in our case described below, because our code is dynamically loaded by Unity, which has already created the thread before loading us.

FWIW, we are seeing the same crash in `getCudnnHandle()` with a 100% repro when running a loaded JIT module that makes cudnn convolution calls inside a Unity render thread on Windows.
Yes, this does look very relevant. Do you know a way to work around the issue?
Not yet. Based on the comment from MSFT in the linked thread, though, it may not be something a library developer can fix, given that you can't control when a consumer decides to load your library (if they do it dynamically) relative to thread construction. Of course, one option is to not use `thread_local` at all.

On our end, we are going to try explicitly loading caffe2_gpu.dll before loading our plugin, which links against libtorch. I'll post an update on whether or not it works, but that won't help fix the root cause, unfortunately (if this is the issue, which seems likely).
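For concreteness, a hypothetical sketch of that load-ordering experiment; the module names are placeholders and this is an experiment, not a confirmed fix:

```cpp
// Hypothetical sketch of the experiment described above: explicitly load
// the torch GPU DLL before the plugin that links against libtorch, so it
// is already resident when the plugin's imports are resolved.
// Module names are placeholders; this is not a confirmed fix.
#include <windows.h>

bool preloadTorchBeforePlugin() {
  HMODULE torch_gpu = LoadLibraryW(L"caffe2_gpu.dll");
  if (torch_gpu == nullptr) {
    return false;  // inspect GetLastError() for diagnostics
  }
  HMODULE plugin = LoadLibraryW(L"my_plugin.dll");  // links against libtorch
  return plugin != nullptr;
}
```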
I have worked around the issue by ensuring I never call a model's forward from a different thread than the one I loaded the model from; that is, I have a model thread per model. This is required for batching anyway, but it doesn't make sense in situations where the tensors are different sizes and can't be batched (which is how I came across the issue). All my torch code and the torch/cuda libraries are dynamically loaded together at first use (with LoadLibrary) to ensure low memory utilisation in processes that do not use them.
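A minimal sketch of that per-model worker-thread pattern, assuming a simple job queue; the structure and names are illustrative, not the poster's actual code. The model would be loaded and run only from inside `run()`:

```cpp
// One worker thread owns a model: loading and every forward() call happen
// on this thread, serialized through a job queue, so the model is never
// touched concurrently. Illustrative sketch, not the poster's actual code.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class ModelThread {
 public:
  ModelThread() : worker_([this] { run(); }) {}
  ~ModelThread() {
    { std::lock_guard<std::mutex> g(m_); done_ = true; }
    cv_.notify_one();
    worker_.join();
  }

  // Enqueue work; the job runs on the single worker thread.
  void post(std::function<void()> job) {
    { std::lock_guard<std::mutex> g(m_); jobs_.push(std::move(job)); }
    cv_.notify_one();
  }

 private:
  void run() {
    // Load the model here, so loading and inference share one thread.
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
        if (done_ && jobs_.empty()) return;
        job = std::move(jobs_.front());
        jobs_.pop();
      }
      job();  // run outside the lock so posters are never blocked
    }
  }

  std::mutex m_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> jobs_;
  bool done_ = false;
  std::thread worker_;  // declared last so members above exist when it starts
};
```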
Summary: Fixes pytorch/pytorch#19394. See https://developercommunity.visualstudio.com/content/problem/124121/thread-local-variables-fail-to-be-initialized-when.html for context.

Pull Request resolved: pytorch/pytorch#22405
Differential Revision: D16090822
Pulled By: ezyang
fbshipit-source-id: 9fdd2c272fa7723fb62b906336d2e2620411b12b
Summary: The Windows + MSVC-specific bug discussed here: #19394 and fixed here: #22405 still appears in C10's warning handler class. This results in a crash if a user attempts to run code which would print a warning when that code is running inside a thread created by a DLL. This PR applies a similar fix to that of #22405.

Pull Request resolved: #34822
Test Plan:
* Tested locally by running CodecverseWorkbench Unity app with patched build.
* CI

Differential Revision: D20627971
Pulled By: HapeMask
fbshipit-source-id: 64dfca531ed7eebbe9e0ecac3d3d4d025c683883
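For reference, the usual way to dodge the MSVC TLS-initializer bug is to avoid dynamic initialization of `thread_local` objects entirely: a trivially zero-initialized `thread_local` pointer is valid even on threads created before the DLL loads, so the object can be constructed lazily on first use. A sketch of that pattern, assuming this is the general shape of the fixes above; it is not the literal code from #22405 or #34822:

```cpp
// Sketch of the lazy-construction pattern for MSVC's LoadLibrary TLS bug.
// Dynamic thread_local initializers may never run on threads that already
// existed when the DLL was loaded, but zero-initialized TLS slots are
// always valid, so keep the slot trivial and construct on first use.
struct PoolWindow { /* per-thread handle bookkeeping elided */ };

PoolWindow& myPoolWindow() {
  static thread_local PoolWindow* window = nullptr;  // trivial zero-init
  if (window == nullptr) {
    // Constructed on the calling thread whenever it first gets here;
    // intentionally leaked so no TLS destructor must run at thread exit.
    window = new PoolWindow();
  }
  return *window;
}
```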
🐛 Bug
When using a JIT model twice at the same time (on its first use), I get a crash here.
PoolWindow::reserve tries to access a vector out of range.
The responsible code was added in #14861.
Backtrace:
To Reproduce
Steps to reproduce the behavior:
Called from two threads at the same time:
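The original snippet isn't preserved, so here is a hypothetical repro sketch matching the description, with a placeholder model path and input shape, using the current libtorch C++ API:

```cpp
// Hypothetical repro sketch: load a TorchScript module once, then call
// forward() from two threads at the same time on Windows.
// "model.pt" and the input shape are placeholders.
#include <thread>
#include <vector>
#include <torch/script.h>

int main() {
  auto module = torch::jit::load("model.pt");
  module.to(torch::kCUDA);

  auto run = [&] {
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::randn({1, 3, 224, 224}, torch::kCUDA));
    module.forward(inputs);  // first concurrent call crashes in
                             // PoolWindow::reserve on the reporter's setup
  };

  std::thread t1(run), t2(run);
  t1.join();
  t2.join();
}
```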
Expected behavior
It should work just as it does when everything runs on one thread, without having to wait for some random time period to avoid the race condition.
Environment

How you installed PyTorch (conda, pip, source): nightly

Workarounds
Other models that I run through a dedicated model thread that batches the inputs work fine, as that thread ensures they never run simultaneously.
This specific model has tensors of different sizes which I am unable to batch together.