Description
🐛 Describe the bug
When running this test, e.g. via `python test_autograd.py TestAutograd.test_thread_shutdown`, it may fail because of a race condition. The sequence of events is:
- The test expects "PYTORCH_API_USAGE torch.autograd.thread_shutdown" in the output, at Line 4556 in d561aa9: `self.assertRegex(s, "PYTORCH_API_USAGE torch.autograd.thread_shutdown")`
- This should be logged by pytorch/torch/csrc/autograd/engine.cpp, Line 445 in d561aa9: `C10_LOG_API_USAGE_ONCE("torch.autograd.thread_shutdown");`
- That in turn is only done via a "ShutdownTask" which (on a successful test) will be added via `Engine::~Engine()` calling `stop();` (pytorch/torch/csrc/autograd/engine.cpp, Line 259 in d561aa9) and `stop()` calling `queue->pushShutdownTask();` (pytorch/torch/csrc/autograd/engine.cpp, Line 283 in d561aa9).
However, that ShutdownTask is only pushed when all queues are empty, see pytorch/torch/csrc/autograd/engine.cpp, Line 279 in d561aa9:
`noBackward = noBackward && queue->empty();`
As the test code is very simple, it is possible that the "run_backward" task pushed at the end has not yet been picked up by a worker thread and hence is still in a queue when the Engine destructor runs during program shutdown. In that case the ShutdownTask will never be pushed and the test fails with:
FAIL: test_thread_shutdown (__main__.TestAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_autograd.py", line 4360, in test_thread_shutdown
    self.assertRegex(s, "PYTORCH_API_USAGE torch.autograd.thread_shutdown")
AssertionError: Regex didn't match: 'PYTORCH_API_USAGE torch.autograd.thread_shutdown' not found in 'PYTORCH_API_USAGE torch.python.import\nPYTORCH_API_USAGE c10d.python.import\nPYTORCH_API_USAGE tensor.create\n'
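
To illustrate the race described above, here is a minimal Python sketch of the same pattern. It is a toy model, not PyTorch code: `ToyEngine`, `run_scenario`, and the `gate` event are hypothetical names introduced only for this illustration, and the event is used to force the two orderings deterministically instead of relying on timing.

```python
import queue
import threading

SHUTDOWN = object()  # stand-in for the ShutdownTask (hypothetical)


class ToyEngine:
    """Toy model of a worker thread with a ready queue (not the real Engine)."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.log = []
        # The gate lets the demo decide when the worker starts consuming tasks.
        self.gate = threading.Event()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        self.gate.wait()
        while True:
            task = self.tasks.get()
            if task is SHUTDOWN:
                # Stand-in for logging "torch.autograd.thread_shutdown".
                self.log.append("thread_shutdown")
                self.tasks.task_done()
                return
            self.log.append(task)
            self.tasks.task_done()

    def stop(self):
        # Mirrors the check in the issue: only push the shutdown task
        # when the queue is empty at the time stop() is called.
        no_backward = self.tasks.empty()
        if no_backward:
            self.tasks.put(SHUTDOWN)
        return no_backward


def run_scenario(worker_picked_up_task):
    engine = ToyEngine()
    engine.tasks.put("run_backward")
    if worker_picked_up_task:
        engine.gate.set()
        engine.tasks.join()  # wait until "run_backward" has been processed
    pushed = engine.stop()
    engine.gate.set()        # let the worker run in either case
    engine.tasks.join()      # wait until everything queued has been processed
    return pushed, list(engine.log)


if __name__ == "__main__":
    print(run_scenario(True))   # worker drained the queue before stop()
    print(run_scenario(False))  # task still queued when stop() ran
```

In the second scenario the shutdown entry never reaches the log, which corresponds to the missing `torch.autograd.thread_shutdown` line in the failing assertion above.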
Versions
Noticed with PyTorch 1.9.0 on an Intel Cascade Lake system with V100 GPUs: easybuilders/easybuild-easyconfigs#16233
The issue may still apply, as the relevant code is unchanged; see the parts of current master referenced above.
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7