TestAutograd.test_thread_shutdown may fail sometimes #85259

@Flamefire

🐛 Describe the bug

When running this test, e.g. via python test_autograd.py TestAutograd.test_thread_shutdown, it may fail intermittently. This is a race condition, with the following sequence of events:

However, that ShutdownTask is only pushed when all queues are empty; see:

noBackward = noBackward && queue->empty();

Because the test code is very simple, the "run_backward" task pushed at the end may not yet have been picked up by a worker thread, and so may still be in a queue when the Engine destructor runs during program shutdown. In that case the ShutdownTask is never pushed, and the test fails with:

FAIL: test_thread_shutdown (__main__.TestAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_autograd.py", line 4360, in test_thread_shutdown
    self.assertRegex(s, "PYTORCH_API_USAGE torch.autograd.thread_shutdown")
AssertionError: Regex didn't match: 'PYTORCH_API_USAGE torch.autograd.thread_shutdown' not found in 'PYTORCH_API_USAGE torch.python.import\nPYTORCH_API_USAGE c10d.python.import\nPYTORCH_API_USAGE tensor.create\n'
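
The race described above can be illustrated with a small, deterministic toy model. This is only a sketch, not PyTorch code: ToyEngine, submit, drain, stop and simulate are hypothetical names invented for illustration, and the real engine uses worker threads rather than an explicit drain() call. The only behaviour taken from the issue is that the shutdown task is pushed only when every queue is empty.

```python
from collections import deque


class ToyEngine:
    """Hypothetical stand-in for the autograd engine (not PyTorch code)."""

    def __init__(self, num_queues=1):
        self.queues = [deque() for _ in range(num_queues)]
        self.log = []  # stands in for the PYTORCH_API_USAGE output

    def submit(self, task, queue_index=0):
        # Enqueue a task without running it yet.
        self.queues[queue_index].append(task)

    def drain(self):
        # Models worker threads picking up every pending task.
        for queue in self.queues:
            while queue:
                task = queue.popleft()
                if task == "shutdown":
                    self.log.append("thread_shutdown")
                else:
                    self.log.append(f"ran {task}")

    def stop(self):
        # Mirrors: noBackward = noBackward && queue->empty();
        no_backward = True
        for queue in self.queues:
            no_backward = no_backward and not queue
        if no_backward:
            # Only in this case is the shutdown task pushed and handled.
            for queue in self.queues:
                queue.append("shutdown")
            self.drain()
        return no_backward


def simulate(worker_ran_before_stop):
    """Run one backward task, optionally letting the worker handle it first."""
    engine = ToyEngine()
    engine.submit("run_backward")
    if worker_ran_before_stop:
        engine.drain()
    engine.stop()
    return engine.log


print(simulate(True))   # ['ran run_backward', 'thread_shutdown']
print(simulate(False))  # []
```

In the second call the pending task keeps the queue non-empty, so the shutdown entry is never logged, which corresponds to the missing torch.autograd.thread_shutdown line in the failure above.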

Versions

Noticed with PyTorch 1.9.0 on an Intel Cascade Lake system with V100 GPUs: easybuilders/easybuild-easyconfigs#16233

The issue may still apply, as the relevant code is unchanged; see the parts referenced above from current master.

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7

Labels

module: autograd (Related to torch.autograd, and the autograd engine in general)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
