Fix deadlock for multi-output forward AD #67995

albanD · 2021-11-08T12:31:44Z

This hides some of the issues from #67367.
It at least lets us run gradcheck until that issue is fixed.

For more context, the deadlock happens when we (wrongly) set a forward grad that itself has a forward grad at the same level.
In particular, when exiting the level from

pytorch/torch/csrc/autograd/forward_grad.cpp

Line 23 in 191b48b

void ForwardADLevel::release_idx(uint64_t idx) {

we take the all_forward_levels_mutex_ lock and then delete the level at

pytorch/torch/csrc/autograd/forward_grad.cpp

Line 29 in 191b48b

all_forward_levels_.pop_back();

(nothing else usually references this object, so it gets deleted as soon as it gets removed from the vector). Note that, at this point, we still have the lock!
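To make the "destructor runs while the lock is held" point concrete, here is a minimal sketch. All names in it (`Level`, `registry`, `registry_mutex`, `release_last_naive`) are hypothetical stand-ins and not the PyTorch types; the `registry_locked` flag exists only so the single-threaded demo can observe the state from inside the destructor.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for illustration only (not the PyTorch classes).
static std::mutex registry_mutex;
static bool registry_locked = false;       // true only while registry_mutex is held (single-threaded demo)
static bool destroyed_under_lock = false;  // recorded by ~Level

struct Level {
  // Record whether the registry lock was held when this object died.
  ~Level() { destroyed_under_lock = registry_locked; }
};

static std::vector<std::shared_ptr<Level>> registry;

// Mirrors the pattern described above: lock, then pop_back().
void release_last_naive() {
  std::lock_guard<std::mutex> lock(registry_mutex);
  registry_locked = true;
  // The vector held the last reference, so ~Level runs right here,
  // before the lock_guard goes out of scope.
  registry.pop_back();
  registry_locked = false;
}

// Returns true if the Level was destroyed while the lock was held.
bool demo_destructor_runs_under_lock() {
  registry.push_back(std::make_shared<Level>());
  destroyed_under_lock = false;
  release_last_naive();
  return destroyed_under_lock;
}
```

Anything the destructor calls therefore runs with the mutex already owned by the current thread.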

In the level destructor in

pytorch/torch/csrc/autograd/forward_grad.cpp

Line 55 in 191b48b

it = grads_.erase(it);

we delete the forward grad, which triggers the deletion of the grad Tensor and everything it holds (assuming nothing else references it).
But in the (bad) case where this Tensor also has a forward grad for this level, the autograd meta clears the fw grads:

pytorch/torch/csrc/autograd/forward_grad.h

Line 124 in 191b48b

void clear() {

While clearing, we access the level (to de-register this forward grad) via

pytorch/torch/csrc/autograd/forward_grad.h

Line 139 in 191b48b

auto level = ForwardADLevel::try_get_by_idx(l_idx);

But this tries to access the level again in

pytorch/torch/csrc/autograd/forward_grad.cpp

Line 39 in 191b48b

std::shared_ptr<ForwardADLevel> ForwardADLevel::try_get_by_idx(uint64_t idx) {

which needs the all_forward_levels_mutex_ lock that we are still holding, and so it deadlocks.
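A common way to break this kind of cycle is to keep the object alive until after the lock has been released, so that its destructor (and anything it calls) is free to take the mutex. The sketch below illustrates that idea under stated assumptions: the names (`sketch::Level`, `try_get`, `release_last`) are hypothetical, and it is not claimed to be the exact change made in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical names, for illustration only

struct Level;
std::mutex levels_mutex;
std::vector<std::shared_ptr<Level>> levels;
bool lookup_ran = false;  // set when the destructor's lookup completes

// Analogue of a by-index lookup that takes the same mutex as release.
std::shared_ptr<Level> try_get(std::size_t idx) {
  std::lock_guard<std::mutex> lock(levels_mutex);
  lookup_ran = true;
  return idx < levels.size() ? levels[idx] : nullptr;
}

struct Level {
  std::size_t idx;
  explicit Level(std::size_t i) : idx(i) {}
  // Re-entrant lookup during destruction: this would block forever
  // if the current thread still owned levels_mutex.
  ~Level() { try_get(idx); }
};

void release_last() {
  std::unique_lock<std::mutex> lock(levels_mutex);
  // Take ownership so that pop_back() does not destroy the Level.
  std::shared_ptr<Level> keep_alive = std::move(levels.back());
  levels.pop_back();
  lock.unlock();
  // The Level is destroyed here, after the mutex has been released.
  keep_alive.reset();
}

}  // namespace sketch
```

With this ordering the lookup inside the destructor acquires the mutex normally and simply finds that the level is gone.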

pytorch-probot · 2021-11-08T12:31:46Z

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/albanD/pytorch/blob/040b18d3ec6a6e64d4040ecdd6682706a98af55e/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows	Labels (bold enabled)	Status
Triggered Workflows
linux-bionic-py3.6-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/noarch`, `ciflow/xla`	✅ triggered
linux-vulkan-bionic-py3.6-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/vulkan`	✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`	✅ triggered
linux-xenial-py3-clang5-mobile-build	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`	✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-dynamic	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`	✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`	✅ triggered
linux-xenial-py3.6-clang7-asan	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/sanitizers`	✅ triggered
linux-xenial-py3.6-clang7-onnx	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/onnx`	✅ triggered
linux-xenial-py3.6-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
linux-xenial-py3.6-gcc7	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
linux-xenial-py3.6-gcc7-bazel-test	`ciflow/all`, `ciflow/bazel`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
win-vs2019-cpu-py3	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/win`	✅ triggered
win-vs2019-cuda11.3-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/win`	✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`	🚫 skipped
docker-builds	`ciflow/all`	🚫 skipped
ios-12-5-1-arm64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-custom-ops	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-full-jit	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-metal	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-x86-64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-x86-64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-x86-64-full-jit	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`	🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`	🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/slow`	🚫 skipped
linux-xenial-py3-clang5-mobile-code-analysis	`ciflow/all`, `ciflow/linux`, `ciflow/mobile`	🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`	🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`, `ciflow/slow`, `ciflow/slow-gradcheck`	🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-win-vs2019-cuda11.1-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win`	🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:

# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

facebook-github-bot · 2021-11-08T12:31:49Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/67995
📄 Preview docs built from this PR
📄 Preview C++ docs built from this PR
🔧 Opt-in to CIFlow to control what jobs run on your PRs

💊 CI failures summary and remediations

As of commit 040b18d (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


IvanYashchuk

This is great!

torch/csrc/autograd/forward_grad.cpp

facebook-github-bot · 2021-11-08T17:22:03Z

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

soulitzer

Nice!

lezcano

Makes sense to me. Thanks for the write-up!

facebook-github-bot · 2021-11-09T09:34:18Z

@albanD merged this pull request in f9422e1.

Fix deadlock for multi-output forward AD

040b18d

albanD requested review from lezcano, soulitzer and IvanYashchuk November 8, 2021 12:31

pytorch-probot bot added the ciflow/default label Nov 8, 2021

facebook-github-bot added the cla signed label Nov 8, 2021

IvanYashchuk approved these changes Nov 8, 2021


soulitzer approved these changes Nov 9, 2021


lezcano approved these changes Nov 9, 2021


facebook-github-bot closed this in f9422e1 Nov 9, 2021

facebook-github-bot added the Merged label Nov 9, 2021

lezcano mentioned this pull request Nov 13, 2021

test_forward_mode_AD hangs for nn.functional.cosine_embedding_loss #67463

Open
