`Engine::~Engine()` should wait for non-reentrant threads to shutdown #34529
Conversation
💊 CircleCI build failures summary and remediations

As of commit 2d89490 (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following build failures do not appear to be due to upstream breakages:
- pytorch_linux_xenial_py3_6_gcc5_4_test (1/1), Step: "Test" (full log | pattern match details)
Thanks for the PR.
So the problem is just that we finish destroying the engine before the shutdown tasks are processed?
Not that there is still a backward running and the leaked threads won't have access to the engine?
@albanD The OS could theoretically never schedule the detached threads before …
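A minimal standalone sketch of the hazard being described here (illustrative only, not PyTorch code): once a thread is detached, nothing prevents the object it captured from being destroyed before the OS ever schedules that thread.

```cpp
#include <chrono>
#include <thread>

// Minimal illustration: a detached thread capturing `this` may be
// scheduled only after the enclosing object has been destroyed.
struct Worker {
  int state = 0;

  void start() {
    std::thread([this] {
      // The OS may delay this thread arbitrarily; by the time it runs,
      // the Worker below may already be gone.
      state = 42;  // potential use-after-free
    }).detach();
  }
};

int main() {
  {
    Worker w;
    w.start();
  }  // ~Worker() can complete before the detached thread ever runs.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```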
Sounds good.
Thanks!
@albanD, any objection if I explicitly delete …
@malfet Given that we should only ever have a single instance of the Engine, it sounds OK.
Given the CI errors, I guess it deadlocks on deletion?
This should be fixed.
Force-pushed from 413fb82 to cf9ec9a
@albanD modified …
Force-pushed from cf9ec9a to 7df08fa
Thanks for catching and fixing this. A few concerns:
- How does this interact with fork (i.e. multiprocessing)? We have code that explicitly calls the engine destructor after fork. At that point all the threads are dead, so they cannot decrement the counter:

  pytorch/torch/csrc/autograd/python_engine.cpp, lines 96 to 108 in 6f8a8e4:

  ```cpp
  static void _maybe_reinitialize_engine_after_fork() {
    // This is "probably" thread-safe because the flag is set in a fork handler
    // before any threads are created, and this function is only called with the
    // GIL held. However, using fork + threads is playing with fire so this is
    // more of a "best effort" thing. For example, if the fork occurs while the
    // backwards threads hold a lock, we'll probably deadlock in the engine
    // destructor.
    if (_reinitialize_engine) {
      engine.~PythonEngine();
      new (&engine) torch::autograd::python::PythonEngine();
      _reinitialize_engine = false;
    }
  }
  ```
- I would prefer we avoid active spinning. We can either remember the threads and call `thread.join()`, or use a `condition_variable` + `mutex` with `non_reentrant_thread_count_` (see the sketch after this list).
- What happens in the case where a backwards is running (i.e. `noBackward` is false)? @albanD says we're not handling that case in this PR. That's fine -- at least this is addressing the common case, but that case seems particularly susceptible to this bug.
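A minimal sketch of the `condition_variable` + `mutex` approach suggested above. The counter name echoes the PR's `non_reentrant_thread_count_`; the surrounding struct and method names are illustrative assumptions, not the actual PyTorch implementation:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct EngineSketch {
  std::mutex non_reentrant_thread_count_mutex_;
  std::condition_variable non_reentrant_thread_count_cv_;
  uint32_t non_reentrant_thread_count_ = 0;  // guarded by the mutex above

  // Called at the end of each non-reentrant worker thread.
  void decrement_non_reentrant_thread_count() {
    std::lock_guard<std::mutex> guard(non_reentrant_thread_count_mutex_);
    --non_reentrant_thread_count_;
    non_reentrant_thread_count_cv_.notify_one();
  }

  // The destructor blocks until every worker has exited,
  // instead of spinning on an atomic counter.
  ~EngineSketch() {
    std::unique_lock<std::mutex> lock(non_reentrant_thread_count_mutex_);
    non_reentrant_thread_count_cv_.wait(
        lock, [this] { return non_reentrant_thread_count_ == 0; });
  }
};
```

With this pattern the counter no longer needs to be atomic, since every access happens under the mutex.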
Another random concern: how did our internal version of the unit tests in ovrsource catch this, but none of the fbcode or CircleCI tests?
7df08fa
to
a396352
Compare
@colesbury thank you for the detailed review. Comments are below.
Force-pushed from a396352 to 231953e
I think there is some confusion in the latest version of the PR: the thread pool in `thread_pool_shared_` does not contain the worker threads.
In particular, this pool should remain empty as long as we don't need it.
Also, since this data structure is unrelated to the actual worker threads we are trying to shut down, this might not be the best place to put the counter.
Force-pushed from 231953e to ec7be86
OK, moved the …
Looks good!
```diff
@@ -266,6 +269,12 @@ struct TORCH_API Engine {
   std::shared_ptr<ThreadPoolShared> thread_pool_shared_;

  private:
   // Number of non-reentrant threads
   std::atomic<uint32_t> non_reentrant_thread_count_;
```
nit: this does not need to be atomic, as all the changes are either in `std::call_once` or protected by the lock below.
Force-pushed from d4772a7 to 78cb2f5
Force-pushed from 53dde6f to 99506c1
Because `this` must be valid while `Engine::main_thread` is running, at least for non-reentrant worker threads.

Test Plan: Run `test_api --gtest-filter=ModulesTest.InstanceNorm1d` in a loop.
Force-pushed from 99506c1 to 2d89490
Finally figured out what is causing the deadlock on Windows: the CRT destroys all DLL-created threads before running global destructors. I.e., the wait logic should be skipped if PyTorch was compiled as a shared library on Windows.
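Continuing the earlier sketch, its destructor could skip the wait along these lines. The `BUILD_SHARED_LIB` macro name is an assumption for illustration; the real build flag may differ:

```cpp
// Replaces the destructor body of the EngineSketch shown earlier.
~EngineSketch() {
  // ... signal worker threads to shut down ...
#if !defined(_WIN32) || !defined(BUILD_SHARED_LIB)  // hypothetical guard
  // On Windows DLL builds the CRT terminates the worker threads before
  // global destructors run, so they can never decrement the counter and
  // waiting here would deadlock. On other platforms, wait for them to exit.
  std::unique_lock<std::mutex> lock(non_reentrant_thread_count_mutex_);
  non_reentrant_thread_count_cv_.wait(
      lock, [this] { return non_reentrant_thread_count_ == 0; });
#endif
}
```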
@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.