
Engine::~Engine() should wait for non-reentrant threads to shutdown #34529

Closed. malfet wants to merge 1 commit from the fix-autograd-destructor-race branch.

Conversation

@malfet (Contributor) commented Mar 10, 2020

Because `this` must be valid while `Engine::main_thread` is running, at least for non-reentrant worker threads.

Test Plan: Run `test_api --gtest-filter=ModulesTest.InstanceNorm1d` in a loop.

@dr-ci (bot) commented Mar 10, 2020

💊 CircleCI build failures summary and remediations

As of commit 2d89490 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Test" (full log | pattern match details)

Mar 20 05:02:00 caused by: Connection refused (os error 111)
Mar 20 05:02:00 +++ eval 'extract_trap_cmd ' 
Mar 20 05:02:00 ++++ extract_trap_cmd 
Mar 20 05:02:00 ++++ printf '%s\n' '' 
Mar 20 05:02:00 +++ printf '%s\n' cleanup 
Mar 20 05:02:00 ++ trap -- ' 
Mar 20 05:02:00 cleanup' EXIT 
Mar 20 05:02:00 ++ which sccache 
Mar 20 05:02:00 ++ sccache --stop-server 
Mar 20 05:02:00 Stopping sccache server... 
Mar 20 05:02:00 error: couldn't connect to server 
Mar 20 05:02:00 caused by: Connection refused (os error 111) 
Mar 20 05:02:00 ++ true 
Mar 20 05:02:00 ++ rm /var/lib/jenkins/sccache_error.log 
Mar 20 05:02:00 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Mar 20 05:02:00 ++ SCCACHE_IDLE_TIMEOUT=1200 
Mar 20 05:02:00 ++ RUST_LOG=sccache::server=error 
Mar 20 05:02:00 ++ sccache --start-server 
Mar 20 05:02:00 Starting sccache server... 
Mar 20 05:02:00 ++ sccache --zero-stats 
Mar 20 05:02:00 Compile requests                 0 
Mar 20 05:02:00 Compile requests executed        0 


@albanD (Collaborator) left a comment

Thanks for the PR.
So the problem is just that we finish destroying the engine before the shutdown tasks are processed?
Not that there is still a backward running and the leaked threads won't have access to the engine?

@EscapeZero (Contributor) commented

@albanD `this` is being passed into a child thread that is then detached, at which point you have no way of ensuring the thread is finished with `this` before it is destructed.

In the current state, the OS could theoretically never schedule the detached threads before the Engine is destructed.
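
For illustration, here is a minimal, self-contained sketch of the race being described (hypothetical names, not the actual PyTorch code): the owner detaches a worker that captures `this`, so nothing prevents the owner from being destroyed while the worker still dereferences it.

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the autograd Engine; not the real class.
struct Owner {
  int state = 42;
  void start() {
    // The worker captures `this` and is detached, so its lifetime is no
    // longer tied to the Owner object.
    std::thread([this] {
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      std::cout << state << "\n";  // may read freed memory if Owner is gone
    }).detach();
  }
};

int main() {
  {
    Owner o;
    o.start();
  }  // Owner destroyed here; the detached thread may not have run yet.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
}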

@albanD (Collaborator) left a comment

Sounds good.
Thanks!

@malfet (Contributor, Author) commented Mar 10, 2020

> So the problem is just that we finish destroying the engine before the shutdown tasks are processed?

Yes, there is no guarantee about the order, and I can reproduce the failure pretty reliably by running multiple instances of the test concurrently.

> Not that there is still a backward running and the leaked threads won't have access to the engine?

Is there a reason why I shouldn't wait for those as well?

@albanD, any objection if I explicitly delete the Engine class's copy constructor?
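
For reference, deleting the copy operations on a singleton-style class is a one-liner; a hedged sketch (illustrative only, not the actual torch::autograd::Engine declaration):

struct Engine {
  Engine() = default;
  // The Engine is meant to be a process-wide singleton, so copying it is
  // disallowed at compile time.
  Engine(const Engine&) = delete;
  Engine& operator=(const Engine&) = delete;
};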

@albanD (Collaborator) left a comment

@malfet Given that we should only ever have a single instance of the Engine, that sounds ok.

Given the CI errors, I guess it deadlocks on deletion?
This should be fixed.

@malfet force-pushed the fix-autograd-destructor-race branch from 413fb82 to cf9ec9a on March 11, 2020 at 05:17
@malfet requested a review from albanD on March 11, 2020 at 05:20
@malfet (Contributor, Author) commented Mar 11, 2020

@albanD modified the Engine destructor to simply wait for the non-reentrant threads to decrement an atomic counter, which still achieves the same goal but avoids the deadlock in PythonEngine, where a worker thread cannot be joined while native code is running.
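
A rough sketch of that approach, assuming hypothetical helper names around the PR's non_reentrant_thread_count_ (the counter is decremented by each worker as the last thing it does with `this`):

#include <atomic>
#include <cstdint>
#include <thread>

// Illustrative sketch only; not the actual torch::autograd::Engine.
struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};

  // Each non-reentrant worker calls this as the very last thing it does
  // with `this` before returning.
  void worker_done() { non_reentrant_thread_count_--; }

  ~EngineSketch() {
    // Wait until every non-reentrant worker has finished touching `this`.
    // (This spin-wait is what a reviewer later asks to replace with a
    // condition variable.)
    while (non_reentrant_thread_count_.load() > 0) {
      std::this_thread::yield();
    }
  }
};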

@malfet force-pushed the fix-autograd-destructor-race branch from cf9ec9a to 7df08fa on March 11, 2020 at 05:25
@colesbury (Member) left a comment

Thanks for catching and fixing this. A few concerns:

  1. How does this interact with fork (i.e. multiprocessing)? We have code that explicitly calls the engine destructor after fork. At that point all the threads are dead, so they cannot decrement the counter:

static void _maybe_reinitialize_engine_after_fork() {
  // This is "probably" thread-safe because the flag is set in a fork handler
  // before any threads are created, and this function is only called with the
  // GIL held. However, using fork + threads is playing with fire so this is
  // more of a "best effort" thing. For example, if the fork occurs while the
  // backwards threads hold a lock, we'll probably deadlock in the engine
  // destructor.
  if (_reinitialize_engine) {
    engine.~PythonEngine();
    new (&engine) torch::autograd::python::PythonEngine();
    _reinitialize_engine = false;
  }
}

  2. I would prefer we avoid active spinning. We can either remember the threads and call thread.join() or use a condition_variable + mutex with non_reentrant_thread_count_ (a minimal sketch of that pattern follows after this list).

  3. What happens in the case where a backwards is running (i.e. noBackward is false)? @albanD says we're not handling that case in this PR. That's fine -- at least this is addressing the common case, but that case seems particularly susceptible to this bug.
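
A minimal sketch of the condition-variable alternative from point 2 (non_reentrant_thread_count_ mirrors the PR; the other names are illustrative assumptions):

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};
  std::mutex non_reentrant_task_mutex_;
  std::condition_variable non_reentrant_thread_finish_;

  // Called by each non-reentrant worker right before it exits.
  void decrement_non_reentrant_thread_count() {
    std::lock_guard<std::mutex> lock(non_reentrant_task_mutex_);
    --non_reentrant_thread_count_;
    non_reentrant_thread_finish_.notify_one();
  }

  ~EngineSketch() {
    // Block until every worker has signalled, instead of busy-waiting.
    std::unique_lock<std::mutex> lock(non_reentrant_task_mutex_);
    non_reentrant_thread_finish_.wait(
        lock, [this] { return non_reentrant_thread_count_.load() == 0; });
  }
};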

@EscapeZero (Contributor) commented

Another random concern: how did our internal version of the unit tests in ovrsource catch this, while none of the fbcode or CircleCI tests did?

@malfet force-pushed the fix-autograd-destructor-race branch from 7df08fa to a396352 on March 12, 2020 at 15:01
@malfet (Contributor, Author) commented Mar 12, 2020

@colesbury thank you for the detailed review. Comments are below.

  1. How does this interact with fork (i.e. multiprocessing)? We have code that explicitly calls the engine destructor after fork. At that point all the threads are dead, so they cannot decrement the counter:
    Updated the commit by adding a release_workers() method, which is called from _maybe_reinitialize_engine_after_fork (a rough sketch follows after this list).
  2. I would prefer we avoid active spinning. We can either remember the threads and call thread.join() or use a condition_variable + mutex with non_reentrant_thread_count_.
    The initial version of the change used std::thread::join(), but that caused a deadlock with the Python runtime.
  3. What happens in the case where a backwards is running (i.e. noBackward is false)? @albanD says we're not handling that case in this PR. That's fine -- at least this is addressing the common case, but that case seems particularly susceptible to this bug.
    I plan to address that as well, but perhaps in a separate commit, since it requires much more rigorous testing.
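
As a rough illustration of the fork-handling point in item 1 (a sketch only; the real release_workers() may differ), the idea is to reset the counter once the post-fork path knows the worker threads no longer exist:

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};
  std::mutex non_reentrant_task_mutex_;
  std::condition_variable non_reentrant_thread_finish_;

  // After fork() only the forking thread survives, so the dead workers can
  // never decrement the counter themselves. Reset it explicitly so that the
  // destructor's wait does not block forever.
  void release_workers() {
    std::lock_guard<std::mutex> lock(non_reentrant_task_mutex_);
    non_reentrant_thread_count_ = 0;
    non_reentrant_thread_finish_.notify_one();
  }
};

// Hypothetical call order mirroring _maybe_reinitialize_engine_after_fork:
//   engine.release_workers();
//   engine.~PythonEngine();
//   new (&engine) torch::autograd::python::PythonEngine();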

@malfet requested a review from albanD on March 12, 2020 at 15:06
@malfet force-pushed the fix-autograd-destructor-race branch from a396352 to 231953e on March 12, 2020 at 15:13
@albanD (Collaborator) left a comment

I think there is some confusion in the latest version of the PR: the thread pool in thread_pool_shared_ does not contain the worker threads.
In particular, this pool should remain empty as long as we don't need it.

Also, since this data structure is unrelated to the actual worker threads we are trying to shut down, it might not be the best place to put the counter.

@malfet force-pushed the fix-autograd-destructor-race branch from 231953e to ec7be86 on March 12, 2020 at 18:26
@malfet (Contributor, Author) commented Mar 12, 2020

OK, moved the non_reentrant_thread_count, condvar, and mutex to private members of the Engine class.

@malfet requested a review from albanD on March 12, 2020 at 18:27
@albanD (Collaborator) left a comment

Looks good!

@@ -266,6 +269,12 @@ struct TORCH_API Engine {
  std::shared_ptr<ThreadPoolShared> thread_pool_shared_;

 private:
  // Number of non-reentrant threads
  std::atomic<uint32_t> non_reentrant_thread_count_;
A Collaborator left a comment on this diff:

nit: this does not need to be atomic, as all the changes are either in std::call_once or protected by the lock below.

@malfet force-pushed the fix-autograd-destructor-race branch 3 times, most recently from d4772a7 to 78cb2f5 on March 18, 2020 at 22:36
@malfet force-pushed the fix-autograd-destructor-race branch 3 times, most recently from 53dde6f to 99506c1 on March 19, 2020 at 03:25
@malfet force-pushed the fix-autograd-destructor-race branch from 99506c1 to 2d89490 on March 20, 2020 at 03:43
@malfet (Contributor, Author) commented Mar 20, 2020

Finally figured out what is causing the deadlock on Windows: the CRT destroys all DLL-created threads before running global destructors. That is, the wait logic should be skipped if PyTorch was compiled as a shared library on Windows.
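
A sketch of that guard, using a hypothetical macro TORCH_SKETCH_BUILT_AS_DLL to stand in for "PyTorch compiled as a shared library on Windows" (the actual condition in the PR may be expressed differently):

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};
  std::mutex non_reentrant_task_mutex_;
  std::condition_variable non_reentrant_thread_finish_;

  ~EngineSketch() {
#if defined(_WIN32) && defined(TORCH_SKETCH_BUILT_AS_DLL)
    // The Windows CRT terminates DLL-created threads before global
    // destructors run, so the workers can never signal completion and the
    // wait below would deadlock. Skip it in this configuration.
#else
    std::unique_lock<std::mutex> lock(non_reentrant_task_mutex_);
    non_reentrant_thread_finish_.wait(
        lock, [this] { return non_reentrant_thread_count_.load() == 0; });
#endif
  }
};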

@facebook-github-bot (Contributor) left a comment

@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented

@malfet merged this pull request in 1c958f8.
