Skip to content

Conversation

@lw
Copy link
Contributor

@lw lw commented Jul 16, 2020

Stack from ghstack:

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call rpc.shutdown()). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474

Differential Revision: D22582779

Differential Revision: D22582779

Addresses this bug report: #41474

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)

[ghstack-poisoned]
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch!

…ing down RPC"

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779)

[ghstack-poisoned]
@dr-ci
Copy link

dr-ci bot commented Jul 16, 2020

💊 CI failures summary and remediations

As of commit 411dffd (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 14 times.

@mrshenli
Copy link
Contributor

Do we also need to add the shutdown call to other tests, e.g., test_register_rpc_backend_and_set_and_start_rpc_backend and test_duplicate_name?

…ing down RPC"

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779)

[ghstack-poisoned]
lw added a commit that referenced this pull request Jul 16, 2020
Pull Request resolved: #41558

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474
ghstack-source-id: 107947863

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)
@lw
Copy link
Contributor Author

lw commented Jul 16, 2020

Added it to test_register_rpc_backend_and_set_and_start_rpc_backend (that required adding a dummy shutdown method to the stub agent, and do an ungraceful shutdown to avoid having to support _wait_all_workers()). I didn't add it to test_duplicate_name because in that test the creation of the agent is supposed to throw and thus the agent is never even set (which would then cause rpc.shutdown() to fail).

…ing down RPC"

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779)

[ghstack-poisoned]
lw added a commit that referenced this pull request Jul 17, 2020
Pull Request resolved: #41558

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474
ghstack-source-id: 107995887

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)
@lw
Copy link
Contributor Author

lw commented Jul 17, 2020

Actually, test_register_rpc_backend_and_set_and_start_rpc_backend doesn't need rpc.shutdown() either because it mocks _set_and_start_rpc_agent so it doesn't actually set the global agent.

…ing down RPC"

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779)

[ghstack-poisoned]
lw added a commit that referenced this pull request Jul 22, 2020
Pull Request resolved: #41558

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes #41474
ghstack-source-id: 108231453

Differential Revision: [D22582779](https://our.internmc.facebook.com/intern/diff/D22582779/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in fced54a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants