
Engine::~Engine() should wait for non-reentrant threads to shutdown #34529

Closed. malfet wants to merge 1 commit from the fix-autograd-destructor-race branch.

Conversation

@malfet (Contributor) commented Mar 10, 2020

Because `this` must be valid while `Engine::main_thread` is running, at least for non-reentrant worker threads.

Test Plan: Run `test_api --gtest-filter=ModulesTest.InstanceNorm1d` in a loop.

@dr-ci (bot) commented Mar 10, 2020

💊 CircleCI build failures summary and remediations

As of commit 2d89490 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Test" (full log | pattern match details)

Mar 20 05:02:00 caused by: Connection refused (os error 111)
Mar 20 05:02:00 +++ eval 'extract_trap_cmd ' 
Mar 20 05:02:00 ++++ extract_trap_cmd 
Mar 20 05:02:00 ++++ printf '%s\n' '' 
Mar 20 05:02:00 +++ printf '%s\n' cleanup 
Mar 20 05:02:00 ++ trap -- ' 
Mar 20 05:02:00 cleanup' EXIT 
Mar 20 05:02:00 ++ which sccache 
Mar 20 05:02:00 ++ sccache --stop-server 
Mar 20 05:02:00 Stopping sccache server... 
Mar 20 05:02:00 error: couldn't connect to server 
Mar 20 05:02:00 caused by: Connection refused (os error 111) 
Mar 20 05:02:00 ++ true 
Mar 20 05:02:00 ++ rm /var/lib/jenkins/sccache_error.log 
Mar 20 05:02:00 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Mar 20 05:02:00 ++ SCCACHE_IDLE_TIMEOUT=1200 
Mar 20 05:02:00 ++ RUST_LOG=sccache::server=error 
Mar 20 05:02:00 ++ sccache --start-server 
Mar 20 05:02:00 Starting sccache server... 
Mar 20 05:02:00 ++ sccache --zero-stats 
Mar 20 05:02:00 Compile requests                 0 
Mar 20 05:02:00 Compile requests executed        0 


@albanD (Collaborator) left a comment

Thanks for the PR.
So the problem is just that we finish destroying the engine before the shutdown tasks are processed?
Not that there is still a backward running and the leaked threads won't have access to the engine?

@EscapeZero (Contributor) commented

@albanD `this` is being passed into a child thread that is then detached, at which point you have no way of ensuring the thread is finished with `this` before it is destructed.

In the current state, the OS could theoretically never schedule the detached threads before the Engine is destructed.
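
For illustration, here is a minimal, self-contained sketch of the race being described (hypothetical names, not the actual PyTorch code): the owner detaches a worker that captures `this`, so nothing prevents the owner from being destroyed while the worker still dereferences it.

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the autograd Engine; not the real class.
struct Owner {
  int state = 42;
  void start() {
    // The worker captures `this` and is detached, so its lifetime is no
    // longer tied to the Owner object.
    std::thread([this] {
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      std::cout << state << "\n";  // may read freed memory if Owner is gone
    }).detach();
  }
};

int main() {
  {
    Owner o;
    o.start();
  }  // Owner destroyed here; the detached thread may not have run yet.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
}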

@albanD (Collaborator) left a comment

Sounds good.
Thanks!

@malfet (Contributor, Author) commented Mar 10, 2020

> So the problem is just that we finish destroying the engine before the shutdown tasks are processed?

Yes, there is no guarantee about the order, and I can reproduce the failure pretty reliably by running multiple instances of the test concurrently.

> Not that there is still a backward running and the leaked threads won't have access to the engine?

Is there a reason why I shouldn't wait for those as well?

@albanD, any objection if I explicitly delete the Engine class's copy constructor?
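
For reference, deleting the copy operations on a singleton-style class is a one-liner; a hedged sketch (illustrative only, not the actual torch::autograd::Engine declaration):

struct Engine {
  Engine() = default;
  // The Engine is meant to be a process-wide singleton, so copying it is
  // disallowed at compile time.
  Engine(const Engine&) = delete;
  Engine& operator=(const Engine&) = delete;
};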

@albanD (Collaborator) left a comment

@malfet Given that we should only ever have a single instance of the Engine, that sounds ok.

Given the CI errors, I guess it deadlocks on deletion?
This should be fixed.

@malfet force-pushed the fix-autograd-destructor-race branch from 413fb82 to cf9ec9a on March 11, 2020 at 05:17
@malfet requested a review from albanD on March 11, 2020 at 05:20
@malfet (Contributor, Author) commented Mar 11, 2020

@albanD modified the Engine destructor to simply wait for the non-reentrant threads to decrement an atomic counter, which still achieves the same goal but avoids the deadlock in PythonEngine, where a worker thread cannot be joined while native code is running.
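
A rough sketch of that approach, assuming hypothetical helper names around the PR's non_reentrant_thread_count_ (the counter is decremented by each worker as the last thing it does with `this`):

#include <atomic>
#include <cstdint>
#include <thread>

// Illustrative sketch only; not the actual torch::autograd::Engine.
struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};

  // Each non-reentrant worker calls this as the very last thing it does
  // with `this` before returning.
  void worker_done() { non_reentrant_thread_count_--; }

  ~EngineSketch() {
    // Wait until every non-reentrant worker has finished touching `this`.
    // (This spin-wait is what a reviewer later asks to replace with a
    // condition variable.)
    while (non_reentrant_thread_count_.load() > 0) {
      std::this_thread::yield();
    }
  }
};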

@malfet force-pushed the fix-autograd-destructor-race branch from cf9ec9a to 7df08fa on March 11, 2020 at 05:25
@colesbury (Member) left a comment

Thanks for catching and fixing this. A few concerns:

  1. How does this interact with fork (i.e. multiprocessing)? We have code that explicitly calls the engine destructor after fork. At that point all the threads are dead, so they cannot decrement the counter:

static void _maybe_reinitialize_engine_after_fork() {
  // This is "probably" thread-safe because the flag is set in a fork handler
  // before any threads are created, and this function is only called with the
  // GIL held. However, using fork + threads is playing with fire so this is
  // more of a "best effort" thing. For example, if the fork occurs while the
  // backwards threads hold a lock, we'll probably deadlock in the engine
  // destructor.
  if (_reinitialize_engine) {
    engine.~PythonEngine();
    new (&engine) torch::autograd::python::PythonEngine();
    _reinitialize_engine = false;
  }
}

  2. I would prefer we avoid active spinning. We can either remember the threads and call thread.join() or use a condition_variable + mutex with non_reentrant_thread_count_ (a minimal sketch of that pattern follows after this list).

  3. What happens in the case where a backwards is running (i.e. noBackward is false)? @albanD says we're not handling that case in this PR. That's fine -- at least this is addressing the common case, but that case seems particularly susceptible to this bug.
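
A minimal sketch of the condition-variable alternative from point 2 (non_reentrant_thread_count_ mirrors the PR; the other names are illustrative assumptions):

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};
  std::mutex non_reentrant_task_mutex_;
  std::condition_variable non_reentrant_thread_finish_;

  // Called by each non-reentrant worker right before it exits.
  void decrement_non_reentrant_thread_count() {
    std::lock_guard<std::mutex> lock(non_reentrant_task_mutex_);
    --non_reentrant_thread_count_;
    non_reentrant_thread_finish_.notify_one();
  }

  ~EngineSketch() {
    // Block until every worker has signalled, instead of busy-waiting.
    std::unique_lock<std::mutex> lock(non_reentrant_task_mutex_);
    non_reentrant_thread_finish_.wait(
        lock, [this] { return non_reentrant_thread_count_.load() == 0; });
  }
};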

@EscapeZero (Contributor) commented

Another random concern: how did our internal version of the unit tests in ovrsource catch this, while none of the fbcode or CircleCI tests did?

@malfet force-pushed the fix-autograd-destructor-race branch from 7df08fa to a396352 on March 12, 2020 at 15:01
@malfet (Contributor, Author) commented Mar 12, 2020

@colesbury thank you for the detailed review. Comments are below.

  1. How does this interact with fork (i.e. multiprocessing)? We have code that explicitly calls the engine destructor after fork. At that point all the threads are dead, so they cannot decrement the counter:
    Updated the commit by adding a release_workers() method, which is called from _maybe_reinitialize_engine_after_fork (a rough sketch follows after this list).
  2. I would prefer we avoid active spinning. We can either remember the threads and call thread.join() or use a condition_variable + mutex with non_reentrant_thread_count_.
    The initial version of the change used std::thread::join(), but that caused a deadlock with the Python runtime.
  3. What happens in the case where a backwards is running (i.e. noBackward is false)? @albanD says we're not handling that case in this PR. That's fine -- at least this is addressing the common case, but that case seems particularly susceptible to this bug.
    I plan to address that as well, but perhaps in a separate commit, since it requires much more rigorous testing.
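
As a rough illustration of the fork-handling point in item 1 (a sketch only; the real release_workers() may differ), the idea is to reset the counter once the post-fork path knows the worker threads no longer exist:

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};
  std::mutex non_reentrant_task_mutex_;
  std::condition_variable non_reentrant_thread_finish_;

  // After fork() only the forking thread survives, so the dead workers can
  // never decrement the counter themselves. Reset it explicitly so that the
  // destructor's wait does not block forever.
  void release_workers() {
    std::lock_guard<std::mutex> lock(non_reentrant_task_mutex_);
    non_reentrant_thread_count_ = 0;
    non_reentrant_thread_finish_.notify_one();
  }
};

// Hypothetical call order mirroring _maybe_reinitialize_engine_after_fork:
//   engine.release_workers();
//   engine.~PythonEngine();
//   new (&engine) torch::autograd::python::PythonEngine();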

@malfet requested a review from albanD on March 12, 2020 at 15:06
@malfet force-pushed the fix-autograd-destructor-race branch from a396352 to 231953e on March 12, 2020 at 15:13
@albanD (Collaborator) left a comment

I think there is some confusion in the latest version of the PR: the thread pool in thread_pool_shared_ does not contain the worker threads.
In particular, this pool should remain empty as long as we don't need it.

Also, since this data structure is unrelated to the actual worker threads we are trying to shut down, it might not be the best place to put the counter.

@malfet force-pushed the fix-autograd-destructor-race branch from 231953e to ec7be86 on March 12, 2020 at 18:26
@malfet (Contributor, Author) commented Mar 12, 2020

OK, moved the non_reentrant_thread_count, condvar, and mutex to private members of the Engine class.

@malfet requested a review from albanD on March 12, 2020 at 18:27
@albanD (Collaborator) left a comment

Looks good!

@@ -266,6 +269,12 @@ struct TORCH_API Engine {
  std::shared_ptr<ThreadPoolShared> thread_pool_shared_;

 private:
  // Number of non-reentrant threads
  std::atomic<uint32_t> non_reentrant_thread_count_;
A Collaborator left a comment on this diff:

nit: this does not need to be atomic, as all the changes are either in std::call_once or protected by the lock below.

@malfet force-pushed the fix-autograd-destructor-race branch 3 times, most recently from d4772a7 to 78cb2f5 on March 18, 2020 at 22:36
@malfet force-pushed the fix-autograd-destructor-race branch 3 times, most recently from 53dde6f to 99506c1 on March 19, 2020 at 03:25
@malfet force-pushed the fix-autograd-destructor-race branch from 99506c1 to 2d89490 on March 20, 2020 at 03:43
@malfet (Contributor, Author) commented Mar 20, 2020

Finally figured out what is causing the deadlock on Windows: the CRT destroys all DLL-created threads before running global destructors. That is, the wait logic should be skipped if PyTorch was compiled as a shared library on Windows.
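
A sketch of that guard, using a hypothetical macro TORCH_SKETCH_BUILT_AS_DLL to stand in for "PyTorch compiled as a shared library on Windows" (the actual condition in the PR may be expressed differently):

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct EngineSketch {
  std::atomic<uint32_t> non_reentrant_thread_count_{0};
  std::mutex non_reentrant_task_mutex_;
  std::condition_variable non_reentrant_thread_finish_;

  ~EngineSketch() {
#if defined(_WIN32) && defined(TORCH_SKETCH_BUILT_AS_DLL)
    // The Windows CRT terminates DLL-created threads before global
    // destructors run, so the workers can never signal completion and the
    // wait below would deadlock. Skip it in this configuration.
#else
    std::unique_lock<std::mutex> lock(non_reentrant_task_mutex_);
    non_reentrant_thread_finish_.wait(
        lock, [this] { return non_reentrant_thread_count_.load() == 0; });
#endif
  }
};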

@facebook-github-bot (Contributor) left a comment

@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented

@malfet merged this pull request in 1c958f8.
