Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rclcpp_action server could cause deadlock with its wrapper's mutex #1285

Closed
daisukes opened this issue Aug 26, 2020 · 3 comments
Closed

rclcpp_action server could cause deadlock with its wrapper's mutex #1285

daisukes opened this issue Aug 26, 2020 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@daisukes
Copy link
Contributor

Bug report

Required Info:

  • Operating System:
    • Ubuntu Focal
  • Installation type:
    • Foxy release src build
  • Version or commit hash:
  • DDS implementation:
    • tried with Fast-RTPS, cyclonedds
  • Client library (if applicable):
    • Navigation 2

Steps to reproduce issue

The original issue is here and I think this is rclcpp issue
ros-navigation/navigation2#1961

- launch tb3 simulation and make it active
- repeat sending goal action and waiting for the result in a single loop, and wait for the deadlock 
https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d
  - I could reproduce the deadlock with this code usually in a few minutes

Expected behavior

  • does not stop, keep working

Actual behavior

  • the server will not accept goals when the deadlock happens

Problem

  • user callbacks are called in the reentlant_mutex_ lock context

    • in the callbacks, the wrapper will lock its mutex
  • in another thread, the wrapper first lock its mutex

    • then calls rclcpp_action method, which tries to lock reentlant_mutex_

This inconsistent locking order cause deadlocks.
So I think rclcpp_action needs to call user callbacks outside of reentlant_mutex_ lock context.

Additional information

Here is gdb information
Thread 9 and 14 are trying to hold two mutex in different order
https://gist.github.com/daisukes/712316a97832f5d9ab851ad47c77ad98

Thread 9: server thread
Frame# 6 nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose, rclcpp::Node>::handle_goal
Frame# 14 rclcpp_action::ServerBase::execute_goal_request_received()

Thread 14: working thread started here
Frame# 6 rclcpp_action::ServerBase::notify_goal_terminal_state()
Frame# 11 nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose, rclcpp::Node>::succeeded_current

@SteveMacenski
Copy link
Collaborator

SteveMacenski commented Aug 26, 2020

And to be clear, this isn't just a specific issue with Navigation2, any users that have mutex locks on their resources will run into this problem. We typically don't run into this issue because we don't in quick succession call an action server immediately after returning a result. But clearly there are cases where that would be expected behavior.

If you were to use actions as workers though, you'd absolutely run into this issue more frequently.

@clalancette clalancette added the bug Something isn't working label Sep 10, 2020
@clalancette clalancette self-assigned this Sep 10, 2020
@SteveMacenski
Copy link
Collaborator

SteveMacenski commented Sep 25, 2020

@clalancette any thoughts? correction, I see the PR has had some motion, nothing to see here folks...

clalancette pushed a commit that referenced this issue Jan 14, 2021
* unlock action_server_reentrant_mutex_ before calling user callback functions
add an additional lock to keep previous behavior broken by deadlock fix

Also add a test case to reproduce deadlock situation in rclcpp_action

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
daisukes pushed a commit to daisukes/rclcpp that referenced this issue Jan 15, 2021
Signed-off-by: Daisuke Sato <daisueks@cmu.edu>
daisukes added a commit to daisukes/rclcpp that referenced this issue Jan 16, 2021
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@ivanpauno
Copy link
Member

Closing as it seems to have been fixed by #1313 (and #1518 will fix it for Foxy, where the bug was reported), feel free to reopen if that's not the case.

jacobperron pushed a commit that referenced this issue Jan 20, 2021
* Add missing locking to the rclcpp_action::ServerBase. (#1421)

This patch actually does 4 related things:

1.  Renames the recursive mutex in the ServerBaseImpl class
to action_server_reentrant_mutex_, which makes it a lot
clearer what it is meant to lock.
2.  Adds some additional error checking where checks were missed.
3.  Adds a lock to publish_status so that the action_server
structure is protected.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>

* [backport] Fix action server deadlock (#1285, #1313)

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>

* revert comment

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>

Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
Karsten1987 pushed a commit that referenced this issue Mar 2, 2021
* unlock action_server_reentrant_mutex_ before calling user callback functions
add an additional lock to keep previous behavior broken by deadlock fix

Also add a test case to reproduce deadlock situation in rclcpp_action

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
ylhStore pushed a commit to ylhStore/rclcpp_eloquent_patch that referenced this issue Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants