Don't leak threads on exit #21438
Conversation
Nicely done. Looking forward to the tests!
for (size_t i = 0; i < static_cast<size_t>(c10::DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES); i++) {
  auto* impl = c10::impl::device_guard_impl_registry[i].load();
  if (impl && device < impl->deviceCount()) {
    guards[i].reset_device(at::Device(static_cast<c10::DeviceType>(i), device));
What was this code for? Why is it safe to remove it now?
This was for setting the CUDA device. I have changed it now to do this without using guards.
FunctionTask task = queue->pop();
if (task.isShutdownTask_) {
  C10_LOG_API_USAGE_ONCE("worker shutting down");
  break;
When using reentrant autograd, the `thread_main` function is called again, with a specific graph_task. So the shutdown task could be caught by a `thread_main` from the reentrant autograd and not by the original one that keeps the thread alive.
Or do you assume that no backward is currently running when the engine is destroyed? If so, this should be checked in some way.
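For readers following along, here is a minimal, self-contained sketch of the loop shape being discussed. Only the names visible in the diff above (`FunctionTask`, `isShutdownTask_`) are taken from it; the simplified queue, the loop condition, and everything else are invented scaffolding so the sketch compiles on its own — it is not the actual engine code.

```cpp
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>

// Invented stand-ins for the real engine types (illustration only).
struct FunctionTask {
  bool isShutdownTask_ = false;
};

struct GraphTask {
  int outstanding_tasks_ = 0;
};

struct ReadyQueue {
  std::deque<FunctionTask> tasks_;
  std::mutex mutex_;
  std::condition_variable not_empty_;

  void push(FunctionTask t) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(t);
    }
    not_empty_.notify_one();
  }

  FunctionTask pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !tasks_.empty(); });
    FunctionTask t = tasks_.front();
    tasks_.pop_front();
    return t;
  }
};

// Shape of thread_main: the long-lived worker calls it with graph_task ==
// nullptr and loops forever; reentrant backwards re-enters it with a non-null
// graph_task and loops only until that graph_task is done.
void thread_main(ReadyQueue& queue, GraphTask* graph_task) {
  while (!graph_task || graph_task->outstanding_tasks_ > 0) {
    FunctionTask task = queue.pop();
    if (task.isShutdownTask_) {
      // If a reentrant ("child") invocation pops this, only the inner loop
      // exits; the outer ("parent") loop that owns the thread keeps running.
      std::printf("worker shutting down\n");
      break;
    }
    // evaluate_function(task) would run here and may re-enter thread_main.
  }
}
```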
We expect this to work only when no backward is running while the engine is destroyed. This might not always be the case though and it needs a separate fix. Could you clarify what needs to be checked?
I guess checking that all ready queues are empty will make sure that no backward is currently running.
My previous comment wasn't entirely correct. I think it's okay for a backward to be running if it's not from a reentrant autograd, so the queue may not be empty. Should we only process the shutdown task in those cases?
What should be the behavior if the Engine is destroyed while a backward is running? Wait for it to finish? Or interrupt it?
I guess that will change the priority of the task you add: minimal if you want to wait for it to finish, and maximal otherwise.
In both cases, I would say that another shutdown task should be queued if !graph_task: inside reentrant autograd, (optionally stop the current processing) and notify the other `thread_main` that it should exit.
@albanD My hope was to pretend this problem didn't exist :)
I think the correct thing to do is to stop executing work as soon as possible, throwing exceptions if necessary to speed things up. We just have to be careful to catch these exceptions properly and then swallow them (don't report errors to users here). If you queue another task on the queue, you'll wait for the task which made the reentrant backwards call to finish executing the rest of its stuff before shutting down; if we are actually going to fix this edge case, let's fix it properly :)
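A rough sketch of the interruption scheme described here, with an invented `ShutdownError` type. This was only a suggestion in the discussion and was not what the PR ended up doing; none of the names below are real engine code.

```cpp
#include <stdexcept>

// Invented marker exception, thrown inside task execution once shutdown has
// been requested, purely to unwind the worker quickly.
struct ShutdownError : std::runtime_error {
  ShutdownError() : std::runtime_error("autograd engine is shutting down") {}
};

// In the worker loop, shutdown-driven unwinding is caught and swallowed so it
// is never reported to the user as a backward error; real errors still
// propagate to the normal error-handling path.
template <typename Task, typename EvalFn>
void run_task_swallowing_shutdown(Task& task, EvalFn evaluate_function) {
  try {
    evaluate_function(task);
  } catch (const ShutdownError&) {
    // Swallow: orderly interruption, not a user-visible failure.
  }
}
```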
In that case what happens is that the "parent" `thread_main` is stuck in its `evaluate_function`.
The "child" `thread_main` is executing stuff.
So in case of reentrant, the "child" will always get the shutdown task first.
That will interrupt it and force it to exit early.
Now the "parent" exits its `evaluate_function` and will start running other stuff.
If another shutdown task was not queued, it will try to finish its backward and the thread will never be destroyed.
Maybe preventing destruction when a backward is running is too much? Or could be done as future work if it is actually needed.
Checking that all queues are empty before queuing the shutdown task seems ok to me, and we could raise an error saying that the Engine cannot be destroyed while it's running stuff. That would be simpler.
@ezyang does that sound ok with you?
Yes, SGTM. Actually I wouldn't raise an error, because I am not sure if we run destructors when you C-C a PyTorch process (I don't know exactly what Python's signal handler does...)
I'm pretty sure Python catches C-C, turns it into a `KeyboardInterrupt`, and then terminates the program "cleanly" (in the sense that the destructors will always run). I have no idea which thread will be responsible for the cleanup though...
It wouldn't surprise me if all of this handling happened on the main thread.
Don't leak threads on exit gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
Don't forget to dismiss reviews when you want to get a new review ;)
I think preventing destruction if there is a running backward seems like a simple and easy fix.
Edward
Don't leak threads on exit gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
test/test_autograd.py
Outdated
                                 env=env)
        return pipes.communicate()[1].decode('ascii')

    @unittest.skipIf(IS_WINDOWS, "Skip for windows")
Prefer to mention exactly why you are skipping on Windows (in this case, it doesn't work; it might be worth filing a bug for it and mentioning it here).
test/test_autograd.py
Outdated
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE,
                                 env=env)
        return pipes.communicate()[1].decode('ascii')
Since this got copy-pasted from `test_logging.py`, maybe it should get moved to the common test code? That way, we don't have to copy-paste. It would also be a good opportunity to name the function appropriately to mention that it's all about `PYTORCH_API_USAGE_STDERR`.
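As an illustration of that suggestion, a shared helper could look roughly like this. The function name and exact signature are only a sketch, not the final API; only the `PYTORCH_API_USAGE_STDERR` environment variable and the subprocess pattern come from the snippet quoted above.

```python
# Hypothetical helper for the common test code, sketched from the snippet
# quoted above.
import os
import subprocess
import sys


def run_and_get_api_usage_stderr(code):
    """Run `code` in a fresh interpreter with PYTORCH_API_USAGE_STDERR=1 and
    return whatever the child process wrote to stderr."""
    env = os.environ.copy()
    env["PYTORCH_API_USAGE_STDERR"] = "1"
    pipes = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=env,
    )
    return pipes.communicate()[1].decode("ascii")
```

A test could then assert that the returned string contains `worker shutting down` after running a small backward in the child process.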
Engine::~Engine() {
  bool noBackward = true;
  for (auto& queue: ready_queues_) {
    std::lock_guard<std::mutex> lock(queue->mutex_);
Taking out a lock on a mutex in a destructor makes me nervous for a few reasons:
- Taking out a lock could throw an exception. Throwing exceptions in a destructor is bad juju, because you won't actually get an exception; you'll just forcibly terminate the program.
- It could lead to a deadlock at program shutdown (no one's idea of a happy time). I think in this case it's not possible, because the scope where we acquire `mutex_` is not very large and we don't ever take out other locks.

It seems to me that all we need is a "best effort" test to see if we're running a backwards or not. Perhaps a single, global atomic counter on `Engine` itself would be good enough? Increment it when you start a backwards, decrement it when you finish (probably want to do this with RAII to make sure exceptions are handled correctly). Then look at that counter to decide if a backwards is running or not.
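A minimal sketch of that suggestion. The counter and guard names are invented, and this is what was proposed in the review, not what the PR actually implements.

```cpp
#include <atomic>

// Invented global counter plus RAII guard: increment when a backward starts,
// decrement when it finishes, even if an exception unwinds the stack.
namespace {
std::atomic<int> num_running_backwards{0};

struct BackwardScopeGuard {
  BackwardScopeGuard() { ++num_running_backwards; }
  ~BackwardScopeGuard() { --num_running_backwards; }
};
}  // namespace

// At the entry point of a backward pass (e.g. wherever the engine starts one):
//   BackwardScopeGuard guard;
//   ... run the backward ...
//
// In the Engine destructor, a best-effort, lock-free check:
//   bool noBackward = (num_running_backwards.load() == 0);
```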
Why do we take the lock at every loop iteration instead of acquiring it once before the loop?
Also, are we sure that the destruction is guaranteed not to run on any of the worker threads? Like, what happens if one of them calls `os.exit`? Wouldn't that lead to case 2?
The lock has to be taken out on each iteration of the loop because we don't have a global lock; it's a per-ready-queue lock, so there isn't "one" lock to take out.
Also, are we sure that the destruction is guaranteed not to run on any of the worker threads? Like, what happens if one of them calls os.exit?
I think you're right. If a worker thread calls `std::exit`, that thread will handle destructing static objects, and we'd have to detect that case in the destructor.
This is all a bit moot right now, because this PR got disabled again when we landed the fix for infinite recursion. But if we do revive this patch it would be good to get this part right too.
Most of the comments are nits, but let's try not to take out a lock in the destructor.
@malvika2147 convinced me that we can't get rid of the lock, because we have to push messages on the queues and those require locking. So I rescind that comment; only nits left.
Don't leak threads on exit gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
This pull request has been merged in f308b07.
      queue->pushShutdownTask();
    }
  }
  // Otherwise threads are leaked
What was the issue with leaking threads? This is only a best-effort measure anyway, right?
Stack from ghstack:
Summary: When the Engine destructs, it sends a shutdown task to all the threads with the highest priority.
Test Plan: Verified the change by temporarily having each thread print a message while shutting down. Running with REL_WITH_DEB_INFO=1 should print the shutdown message (only once); expected output: `PYTORCH_API_USAGE worker shutting down`. Added a test for that.
Differential Revision: D15738856
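Putting the diff fragments quoted in this review together, the shutdown path looks roughly like the following. Only the names that appear in the quoted snippets (`noBackward`, `ready_queues_`, `mutex_`, `pushShutdownTask`, the leaked-threads comment) come from the PR; the rest is scaffolding added here so the sketch compiles on its own, and it may not match the merged code exactly.

```cpp
#include <deque>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in types so the sketch is self-contained (not the real engine types).
struct FunctionTask {
  bool isShutdownTask_ = false;
};

struct ReadyQueue {
  std::mutex mutex_;
  std::deque<FunctionTask> tasks_;

  bool empty() const { return tasks_.empty(); }
  void pushShutdownTask() {
    FunctionTask task;
    task.isShutdownTask_ = true;
    tasks_.push_front(task);  // highest priority: workers see it first
  }
};

struct Engine {
  std::vector<std::shared_ptr<ReadyQueue>> ready_queues_;
  ~Engine();
};

Engine::~Engine() {
  // Best-effort check: only shut workers down if no backward appears to be
  // running, i.e. every ready queue is empty.
  bool noBackward = true;
  for (auto& queue : ready_queues_) {
    std::lock_guard<std::mutex> lock(queue->mutex_);
    noBackward = noBackward && queue->empty();
  }
  if (noBackward) {
    for (auto& queue : ready_queues_) {
      std::lock_guard<std::mutex> lock(queue->mutex_);
      // Each worker pops the shutdown task, logs "worker shutting down"
      // once, and breaks out of its thread_main loop.
      queue->pushShutdownTask();
    }
  }
  // Otherwise threads are leaked rather than torn down mid-backward.
}
```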