🐛 Bug
In our supported versions of Python (3.5-3.8), the autograd engine threads never fully exit. This is probably fine, but it's worth noting in the context of issue #38230, and because the engine has code to shut down its threads when it is destructed (#21438).
The threads stall in the destructor of `pybind11::gil_scoped_release`:

pytorch/torch/csrc/autograd/python_engine.cpp, lines 49 to 56 in 6f396e1

    void PythonEngine::thread_init(int device, const std::shared_ptr<ReadyQueue>& ready_queue) {
      // Create a PyThreadState, but release the GIL. This lets pybind11::gil_scoped_acquire calls
      // inside thread_main acquire the GIL without having to create a new
      // PyThreadState each time.
      pybind11::gil_scoped_acquire gil;
      pybind11::gil_scoped_release no_gil;
      Engine::thread_init(device, ready_queue);
    }
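The ordering matters here because C++ destroys local objects in reverse order of construction, so the release guard's destructor (the one that re-acquires the GIL) runs first when `thread_init` returns. Below is a minimal sketch of that ordering. `FakeAcquire`, `FakeRelease`, and `fake_thread_init` are hypothetical stand-ins that only record events; they are not the pybind11 classes and do not touch an interpreter.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for pybind11::gil_scoped_acquire: records events only.
struct FakeAcquire {
    std::vector<std::string>& log;
    explicit FakeAcquire(std::vector<std::string>& l) : log(l) {
        log.push_back("acquire ctor");
    }
    ~FakeAcquire() { log.push_back("acquire dtor"); }
};

// Hypothetical stand-in for pybind11::gil_scoped_release: records events only.
struct FakeRelease {
    std::vector<std::string>& log;
    explicit FakeRelease(std::vector<std::string>& l) : log(l) {
        log.push_back("release ctor");
    }
    // In the real guard, this is the point where the GIL is re-acquired.
    ~FakeRelease() { log.push_back("release dtor (re-acquire)"); }
};

// Mirrors the shape of thread_init: acquire, then release, then the thread body.
std::vector<std::string> fake_thread_init() {
    std::vector<std::string> log;
    {
        FakeAcquire gil(log);
        FakeRelease no_gil(log);
        log.push_back("thread body");
    }  // Locals are destroyed in reverse order: no_gil first, then gil.
    return log;
}
```

Calling `fake_thread_init()` records the release guard's destructor immediately after the thread body and before the acquire guard's destructor, which is the step where the real engine threads would block.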
The `gil_scoped_release` destructor calls `PyEval_RestoreThread` to re-acquire the GIL. However, by this time, Python has fully finalized and the GIL remains locked. (The behavior is different in Python 3.9.0a6; see issue #38230.) The threads stall on acquiring the mutex. Eventually, the main thread exits and the process terminates anyway.
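The stall can be modeled without Python at all: a worker tries to take a lock that its holder never gives back. The sketch below is only an analogy under stated assumptions. `fake_gil` and `worker_reacquires` are hypothetical names, a `std::timed_mutex` stands in for the GIL, and a timeout replaces the indefinite wait so the example terminates.

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// Returns true if the worker obtained the lock within `timeout`.
// `holder_releases` models whether the "main thread" gives the lock back
// before the worker tries to take it.
bool worker_reacquires(bool holder_releases, std::chrono::milliseconds timeout) {
    std::timed_mutex fake_gil;  // hypothetical stand-in for the GIL
    bool acquired = false;

    fake_gil.lock();  // the "main thread" holds the lock
    if (holder_releases) {
        fake_gil.unlock();
    }

    std::thread worker([&] {
        // Analogous to the blocking re-acquire in the guard's destructor,
        // except that it gives up after `timeout` instead of waiting forever.
        if (fake_gil.try_lock_for(timeout)) {
            acquired = true;
            fake_gil.unlock();
        }
    });
    worker.join();  // join() makes the write to `acquired` visible here

    if (!holder_releases) {
        fake_gil.unlock();  // unlock on the owning thread before destruction
    }
    return acquired;
}
```

With `holder_releases` set to `false`, the worker times out and the function returns `false`. In the real engine there is no timeout, so the thread simply stays blocked until the process exits.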
In PyTorch 1.4, this code used `AutoNoGIL` instead of `gil_scoped_release`; that had the same issue.
You can verify this behavior by adding printf statements, or with gdb, some breakpoints, and `set scheduler-locking on`.
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @seemethere @malfet @walterddr @pytorch/pytorch-dev-infra @ssnl