
Conversation

malvika2147

@malvika2147 malvika2147 commented Jun 5, 2019

Stack from ghstack:

Summary: When the Engine is destructed, it sends a highest-priority shutdown task to each worker thread.

Test Plan: Verified the change by temporarily having each thread print a message while shutting down. Running with REL_WITH_DEB_INFO=1 should print the shutdown message only once
(expected output: PYTORCH_API_USAGE worker shutting down).
Added a test for that.

Differential Revision: D15738856

@pytorchbot pytorchbot added the module: autograd Related to torch.autograd, and the autograd engine in general label Jun 5, 2019
@malvika2147 malvika2147 requested a review from ezyang June 5, 2019 22:59
@malvika2147 malvika2147 self-assigned this Jun 5, 2019
ezyang
ezyang previously requested changes Jun 6, 2019
Contributor

@ezyang ezyang left a comment

Nicely done. Looking forward to the tests!

@ezyang ezyang requested review from albanD, apaszke and colesbury June 6, 2019 15:00
albanD
albanD previously requested changes Jun 6, 2019
for (size_t i = 0; i < static_cast<size_t>(c10::DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES); i++) {
auto* impl = c10::impl::device_guard_impl_registry[i].load();
if (impl && device < impl->deviceCount()) {
guards[i].reset_device(at::Device(static_cast<c10::DeviceType>(i), device));
Collaborator

What was this code for? Why is it safe to remove it now?

Author

This was for setting the CUDA device. I have changed it to set the device directly, without using guards.

FunctionTask task = queue->pop();
if (task.isShutdownTask_) {
C10_LOG_API_USAGE_ONCE("worker shutting down");
break;
Collaborator

When using reentrant autograd, the thread_main function is called again, with a specific graph_task. So the shutdown task could be caught by a thread_main from the reentrant autograd and not the original one that keeps the thread alive.
Or do you assume that no backward is currently running when the engine is destroyed? If so, this should be checked in some way.

Author

@malvika2147 malvika2147 Jun 6, 2019

We expect this to work only when no backward is running while the engine is destroyed. This might not always be the case though and it needs a separate fix. Could you clarify what needs to be checked?

Collaborator

I guess checking that all ready queues are empty would ensure that no backward is currently running.

Author

My previous comment wasn't entirely correct. I think it's okay for a backward to be running if it's not from a reentrant autograd, so the queue may not be empty. Should we only process the shutdown task in those cases?

Collaborator

What should be the behavior if the Engine is destroyed while a backward is running? Wait for it to finish? Or interrupt it?
I guess that will change the priority of the task you add: minimal if you want to wait for it to finish, and maximal otherwise.
In both cases, I would say that another shutdown task should be queued if !graph_task: inside reentrant autograd, (optionally) stop the current processing and notify the other thread_main that it should exit.
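
To make that hand-off concrete, here is a tiny standalone sketch (the std::deque and the boolean flag are illustrative stand-ins for the engine's ReadyQueue and the reentrant-call distinction, not the real types):

    #include <deque>

    struct Task {
      bool isShutdown = false;
    };

    // `reentrant` stands in for a nested thread_main call made by reentrant
    // autograd, as opposed to the worker's outermost loop.
    void thread_main(std::deque<Task>& queue, bool reentrant) {
      while (!queue.empty()) {
        Task t = queue.front();
        queue.pop_front();
        if (t.isShutdown) {
          if (reentrant) {
            // The nested call caught the shutdown: put it back so the outer
            // thread_main (which keeps the thread alive) also sees it and exits.
            queue.push_back(t);
            return;
          }
          break;  // outermost call: the worker thread can now exit
        }
        // ... execute t ...
      }
    }

    int main() {
      std::deque<Task> queue;
      queue.push_back(Task{true});
      thread_main(queue, /*reentrant=*/true);   // re-queues the shutdown task
      thread_main(queue, /*reentrant=*/false);  // outer call exits
    }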

Contributor

@albanD My hope was to pretend this problem didn't exist :)

I think the correct thing to do is to stop executing work as soon as possible, throwing exceptions if necessary to speed things up. We just have to be careful to catch these exceptions properly and then swallow them (don't report errors to users here). If you queue another task on the queue, you'll wait for the task which made the reentrant backwards call to finish executing the rest of its stuff before shutting down; if we are actually going to fix this edge case, let's fix it properly :)
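
As a rough, standalone illustration of that idea (not something this PR implements): work could periodically check a shutdown flag, throw a dedicated exception, and the worker loop would catch and swallow it rather than surface it to the user. All names below are hypothetical:

    #include <atomic>
    #include <exception>

    std::atomic<bool> engine_shutting_down{false};

    // Hypothetical exception type used only to unwind workers during shutdown.
    struct WorkerShutdown : std::exception {};

    void run_one_task() {
      // Long-running work checks the flag and bails out early.
      if (engine_shutting_down.load()) {
        throw WorkerShutdown();
      }
      // ... do the actual work ...
    }

    void worker_loop() {
      for (;;) {
        try {
          run_one_task();
        } catch (const WorkerShutdown&) {
          // Expected during teardown: swallow it, never report it as a user error.
          break;
        }
      }
    }

    int main() {
      engine_shutting_down = true;
      worker_loop();
    }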

Collaborator

In that case what happens is that the "parent" thread_main is stuck in its evaluate_function.
The "child" thread_main is executing stuff.
So in the reentrant case, the "child" will always get the shutdown task first.
That will interrupt it and force it to exit early.
Now the "parent" exits its evaluate_function and will start running other stuff.
If another shutdown task was not queued, it will try to finish its backward and the thread will never be destroyed.

Maybe preventing destruction when a backward is running is too much? Or could be done as future work if it is actually needed.
Checking that all queues are empty before queuing the shutdown task seems ok to me, and we could raise an error saying that the Engine cannot be destroyed while it's running stuff. That would be simpler.
@ezyang does that sound ok to you?

Contributor

Yes, SGTM. Actually I wouldn't raise an error, because I am not sure if we run destructors when you Ctrl-C a PyTorch process (I don't know exactly what Python's signal handler does...)
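
A standalone sketch of the shape being agreed on here: check, best-effort, that nothing is queued, then push a shutdown task to every ready queue so each worker's loop exits. Everything below uses only the standard library; Task, ReadyQueue, Engine, and the logging line are stand-ins for the engine's real types, not the actual implementation:

    #include <condition_variable>
    #include <deque>
    #include <iostream>
    #include <memory>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Task {
      bool isShutdown = false;
    };

    class ReadyQueue {
     public:
      void push(Task t) {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          queue_.push_back(t);
        }
        cv_.notify_one();
      }
      Task pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !queue_.empty(); });
        Task t = queue_.front();
        queue_.pop_front();
        return t;
      }
      bool empty() {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.empty();
      }
     private:
      std::mutex mutex_;
      std::condition_variable cv_;
      std::deque<Task> queue_;
    };

    struct Engine {
      std::vector<std::unique_ptr<ReadyQueue>> ready_queues_;
      std::vector<std::thread> workers_;

      explicit Engine(int num_workers) {
        for (int i = 0; i < num_workers; ++i) {
          ready_queues_.emplace_back(new ReadyQueue());
          ReadyQueue* q = ready_queues_.back().get();
          workers_.emplace_back([q] {
            for (;;) {
              Task t = q->pop();
              if (t.isShutdown) {
                std::cerr << "worker shutting down\n";  // stands in for C10_LOG_API_USAGE_ONCE
                break;
              }
              // ... run the task ...
            }
          });
        }
      }

      ~Engine() {
        // Best-effort check that no backward is running (all queues drained).
        bool idle = true;
        for (auto& q : ready_queues_) {
          if (!q->empty()) { idle = false; break; }
        }
        if (idle) {
          for (auto& q : ready_queues_) {
            q->push(Task{/*isShutdown=*/true});
          }
          for (auto& w : workers_) {
            w.join();  // threads are no longer leaked on exit
          }
        } else {
          for (auto& w : workers_) {
            w.detach();  // fall back to the old behaviour and leak the threads
          }
        }
      }
    };

    int main() {
      Engine engine(2);
    }  // destructor shuts both workers down

The join/detach handling above exists only to make the model self-contained; the fragments quoted further down show the actual check-and-push shape used in the PR.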

Contributor

I'm pretty sure Python catches Ctrl-C, turns it into a KeyboardInterrupt and then terminates the program "cleanly" (in the sense that the destructors will always run). I have no idea which thread will be responsible for the cleanup though...

Contributor

It wouldn't surprise me if all of this handling happened on the main thread.

mal added 4 commits June 6, 2019 16:14
Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
@ezyang
Contributor

ezyang commented Jun 7, 2019

Don't forget to dismiss reviews when you want to get a new review ;)

@ezyang
Contributor

ezyang commented Jun 7, 2019 via email

Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
@malvika2147 malvika2147 dismissed albanD’s stale review June 7, 2019 18:07

made changes

Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
env=env)
return pipes.communicate()[1].decode('ascii')

@unittest.skipIf(IS_WINDOWS, "Skip for windows")
Contributor

Prefer to mention exactly why you are skipping on Windows (in this case, it doesn't work; it might be worth filing a bug for it and mentioning it here).

stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
env=env)
return pipes.communicate()[1].decode('ascii')
Contributor

Since this got copy-pasted from test_logging.py, maybe it should get moved to the common test code? That way, we don't have to copy paste. It would also be a good opportunity to name the function appropriately to mention that it's all about PYTORCH_API_USAGE_STDERR.

Engine::~Engine() {
bool noBackward = true;
for (auto& queue: ready_queues_) {
std::lock_guard<std::mutex> lock(queue->mutex_);
Contributor

Taking out a lock on a mutex in a destructor makes me nervous for a few reasons:

  1. Taking out a lock could throw an exception. Throwing exceptions in a destructor is bad juju, because you won't actually get an exception; you'll just forcibly terminate the program.
  2. It could lead to a deadlock at program shutdown (no one's idea of a happy time). I think in this case it's not possible, because the scope in which we acquire mutex_ is not very large and we don't ever take out other locks.

It seems to me that all we need is a "best effort" test to see if we're running a backwards or not. Perhaps a single, global atomic counter on Engine itself would be good enough? Increment it when you start a backwards, decrement it when you finish (probably want to do this with RAII to make sure exceptions are handled correctly). Then look at that counter to decide if a backwards is running or not.
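
A minimal standalone sketch of that counter-plus-RAII idea (the names are illustrative, not part of the PR):

    #include <atomic>
    #include <cassert>

    // Global count of backward passes currently executing.
    std::atomic<int> backwards_in_flight{0};

    // RAII guard: increments on entry and decrements on every exit path,
    // including when an exception propagates out of the backward pass.
    struct BackwardScope {
      BackwardScope()  { backwards_in_flight.fetch_add(1, std::memory_order_relaxed); }
      ~BackwardScope() { backwards_in_flight.fetch_sub(1, std::memory_order_relaxed); }
    };

    void execute_backward() {
      BackwardScope scope;
      // ... run the backward pass; may throw ...
    }

    // Best-effort check the destructor can make without taking any lock.
    bool safe_to_shut_down() {
      return backwards_in_flight.load(std::memory_order_relaxed) == 0;
    }

    int main() {
      execute_backward();
      assert(safe_to_shut_down());
    }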

Contributor

Why do we take the lock at every loop iteration instead of acquiring it once before the loop?

Contributor

Also, are we sure that the destruction is guaranteed not to run on any of the worker threads? Like, what happens if one of them calls os.exit? Wouldn't that lead to case 2?

Contributor

The lock has to be taken out on each iteration of the loop, because we don't have a global lock; it's a per-ready-queue lock, and so there isn't "one" lock to take out.

Also, are we sure that the destruction is guaranteed not to run on any of the worker threads? Like, what happens if one of them calls os.exit?

I think you're right. If a worker thread calls std::exit, that thread will handle destructing static objects, and we'd have to detect that case in the destructor.

This is all a bit moot right now, because this PR got disabled again when we landed the fix for infinite recursion. But if we do revive this patch it would be good to get this part right too.
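
One way to detect that case (purely a sketch, not something this PR does) is a thread_local flag that each worker sets when it starts, which the destructor can consult:

    #include <thread>

    // Set to true at the top of each worker's thread_main.
    thread_local bool is_engine_worker = false;

    void thread_main() {
      is_engine_worker = true;
      // ... worker loop ...
    }

    struct Engine {
      ~Engine() {
        if (is_engine_worker) {
          // Static destruction is running on one of our own workers
          // (e.g. std::exit was called from that thread). Waiting for the
          // workers here could deadlock, so skip the shutdown handshake.
          return;
        }
        // ... normal shutdown path: check queues, push shutdown tasks ...
      }
    };

    int main() {
      Engine engine;
      std::thread worker(thread_main);
      worker.join();
    }  // Engine destructor runs on the main thread here, so the flag is false.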

Contributor

@ezyang ezyang left a comment

Most of the comments are nits, but let's try not to take out a lock in the destructor.

Contributor

@ezyang ezyang left a comment

@malvika2147 convinced me that we can't get rid of the lock, because we have to push messages onto the queues and those require locking. So I rescind that comment; only nits left.

Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
@pytorchbot pytorchbot added the module: tests Issues related to tests (not the torch.testing module) label Jun 7, 2019
Don't leak threads on exit

gh-metadata: pytorch pytorch 21438 gh/mal2147/2/head
@zou3519 zou3519 deleted the gh/mal2147/2/head branch June 10, 2019 16:16
@facebook-github-bot
Contributor

This pull request has been merged in f308b07.


queue->pushShutdownTask();
}
}
// Otherwise threads are leaked
Contributor

What was the issue with leaking threads? This is only a best-effort measure anyway, right?

