Fix action server deadlock (#1285) #1313
Conversation
rclcpp_action/src/server.cpp (Outdated)

```cpp
// Publish a status message any time a goal handle changes state
publish_status();
// notify base so it can recalculate the expired goal timer
notify_goal_terminal_state();
```
I think the problem is the lock order, not the locking time window, so I believe the deadlock can still occur with this fix (in case, right after `rclcpp_action::ServerGoalHandle::succeed` is issued, `rclcpp_action::ServerBase::execute_goal_request_received` is fired in another thread, which then calls the user callback `handle_goal`).
I think `reentrant_mutex_` must be used internally everywhere except around user callbacks.
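For readers following the discussion, here is a minimal, self-contained sketch of the AB/BA lock inversion being described; the mutex and function names are illustrative stand-ins, not the actual rclcpp_action internals:

```cpp
#include <mutex>
#include <thread>

std::recursive_mutex action_mutex;  // stands in for reentrant_mutex_
std::mutex user_mutex;              // a lock owned by user code

// User worker thread: holds the user lock, then succeed() needs the
// action server lock (order: user -> action).
void user_worker() {
  std::lock_guard<std::mutex> user_lock(user_mutex);
  std::lock_guard<std::recursive_mutex> action_lock(action_mutex);
}

// Executor thread: holds the action server lock, then the handle_goal
// callback needs the user lock (order: action -> user).
void server_executor() {
  std::lock_guard<std::recursive_mutex> action_lock(action_mutex);
  std::lock_guard<std::mutex> user_lock(user_mutex);
}

int main() {
  std::thread t1(user_worker);
  std::thread t2(server_executor);
  t1.join();  // with unlucky timing this never returns: classic AB/BA deadlock
  t2.join();
}
```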
Yeah, I agree with this. This only solved my test case.

> I think reentrant_mutex_ must be used internally except user callback.

I changed the fix; is this what you had in mind?
I think that this works, but I'd like to hear from others.
Does this target the correct branch? This bug probably exists on the primary branch. Also, are you able to create a unit test out of your test code and add that to rclcpp_action's tests? I'm about to add a bunch of tests with #1290 and I haven't encountered this issue. It would be awesome to add a unit test that would prevent this bug from reoccurring once this PR is merged.
My project is based on foxy, so I made a PR for foxy, but yes, I think this can target the primary branch as well.
Okay, I will try it. My test code is based on Navigation2, but I think I can reproduce its mutex behavior.
@brawner I think I could reproduce the error with a unit test. How do you want me to commit the unit test?
If you need the infrastructure in rclcpp_action-test-coverage then you'll have to target that branch. If, however, you can just include the test with this PR, then that's ideal because it would make backporting to foxy a lot easier.
I put the test here to prove my patch can fix the issue. I changed the order of the commits to make testing before/after the fix easier.
rclcpp_action/src/server.cpp (Outdated)

```cpp
lock.unlock();

// Call user's callback, getting the user's response and a ros message to send back
auto response_pair = call_handle_goal_callback(uuid, message);

lock.lock();
```
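As a self-contained sketch, the pattern this diff introduces looks roughly like the following; the surrounding function body is simplified and `handle_goal` stands in for the user callback, so treat it as an outline rather than the actual implementation:

```cpp
#include <functional>
#include <mutex>

std::recursive_mutex reentrant_mutex;

void execute_goal_request_received(const std::function<bool()> & handle_goal) {
  std::unique_lock<std::recursive_mutex> lock(reentrant_mutex);
  // ... take the goal request via rcl while holding the lock ...

  lock.unlock();
  // Run the user's callback without holding the server lock, so a user
  // mutex held in another thread cannot invert the lock order.
  const bool accepted = handle_goal();
  lock.lock();

  // ... send the goal response via rcl while holding the lock again ...
  (void)accepted;
}
```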
These types of lock/unlock sequences are often problematic. They often lead to race conditions where two threads come in here simultaneously, then race to the `lock.lock()`. Only one will win (and the other will have to wait), but what if T1 comes in first, calls the callback, but then T2 is the one that reacquires the lock first? Is that situation OK?
Can you explain more about the original deadlock situation? I'm wondering if there are other solutions we can use here.
> if T1 comes in first, calls the callback, but then T2 is the one that reacquires the lock first? Is that situation OK?
There are three `call_*_callback` functions, and they do not call any `rcl` functions that need to be protected by the lock (`pimpl_->reentrant_mutex_`):

- `call_handle_goal_callback`: uses only local variables
- `call_goal_accepted_callback` and `call_handle_cancel_callback`: these access `goal_handles_`, but that is also protected by `goal_handles_mutex_`

So I believe these three functions are thread-safe as long as the user callbacks are thread-safe, which we don't need to worry about.
> Can you explain more about the original deadlock situation? I'm wondering if there are other solutions we can use here.
I think explaining with my test code is the best way. The user code has its own mutex, which is used in the callback functions and also in its working thread, which calls the succeed function.
rclcpp/rclcpp_action/test/test_server.cpp, Lines 994 to 1003 in 938f973

```cpp
send_goal_request(node_, uuid1_);
// this will lock wrapper's mutex and intentionally wait 100ms for calling succeed
// to try to acquire the lock of rclcpp_action mutex
std::thread t(&TestDeadlockServer::GoalSucceeded, this);
// after the wrapper's mutex is locked and before succeed is called
rclcpp::sleep_for(std::chrono::milliseconds(50));
// call next goal request to intentionally reproduce deadlock
// this first locks rclcpp_action mutex and then call callback to lock wrapper's mutex
send_goal_request(node_, uuid2_);
t.join();
```
- user code: locks the user mutex and calls succeed, which eventually locks the rclcpp_action mutex (L997)
- rclcpp_action: locks the rclcpp_action mutex and calls the callbacks, which lock the user mutex (L1002)
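A condensed sketch of the user-side wrapper pattern exercised by this test; `GoalHandle`, `UserWrapper`, and the method names are illustrative placeholders rather than the real nav2 or rclcpp_action types:

```cpp
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>

// Stand-in for rclcpp_action's goal handle; succeed() internally takes
// the action server's reentrant mutex.
struct GoalHandle {
  void succeed() { /* locks the rclcpp_action mutex */ }
};

class UserWrapper {
  std::mutex user_mutex_;  // the wrapper's own lock
  std::shared_ptr<GoalHandle> current_ = std::make_shared<GoalHandle>();

public:
  // Invoked from the executor while rclcpp_action already holds its mutex.
  void handle_goal() {
    std::lock_guard<std::mutex> lock(user_mutex_);  // action -> user order
  }

  // Invoked from the wrapper's worker thread.
  void goal_succeeded() {
    std::lock_guard<std::mutex> lock(user_mutex_);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    current_->succeed();  // user -> action order: the inversion
  }
};
```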
I think the most complicated case is the following. As a precondition, assume the goals are always accepted and there are only two threads.

1. (T-1) `goal_request_ready_` is ready, then `execute_goal_request_received` is called with `reentrant_mutex_` locked
2. (T-1) unlocks `reentrant_mutex_`, then calls `call_handle_goal_callback` (accepted)
3. (T-2) `goal_request_ready_` is ready, then `execute_goal_request_received` is called with `reentrant_mutex_` locked
4. (T-2) unlocks `reentrant_mutex_`, then calls `call_handle_goal_callback` (accepted)
5. (T-2) locks `reentrant_mutex_`, calls `rcl_action_send_goal_response` and the rest
6. (T-2) unlocks `reentrant_mutex_`
7. (T-1) locks `reentrant_mutex_`, calls `rcl_action_send_goal_response` and the rest
8. (T-1) unlocks `reentrant_mutex_`

I believe this is a new problem introduced by this fix, because the goal order between acceptance and execution is not consistent. (It is really rare, but possible.)
If that order should be guaranteed, we can put another mutex which is only used at the beginning of `execute_goal_request_received`.
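A sketch of that suggestion, assuming a second mutex held for the whole of `execute_goal_request_received` so goal N's response is sent before goal N+1 starts; the names are illustrative and this is not the final patch:

```cpp
#include <mutex>

std::recursive_mutex reentrant_mutex;     // protects the rcl calls
std::recursive_mutex goal_request_mutex;  // serializes whole goal requests

void execute_goal_request_received() {
  // Taken first and held to the end: a second thread cannot begin handling
  // the next goal request until this one has sent its response, preserving
  // "first come, first accepted & executed".
  std::lock_guard<std::recursive_mutex> order(goal_request_mutex);

  std::unique_lock<std::recursive_mutex> lock(reentrant_mutex);
  // ... rcl_action_take_goal_request ...
  lock.unlock();
  // ... user handle_goal callback runs with no server lock held ...
  lock.lock();
  // ... rcl_action_send_goal_response ...
}
```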
@fujitatomoya I don't think the potential issue is critical, because it only happens when the action server is executed by a multithreaded executor.
Yes, obviously that is one of the cases. I think the `reentrant_mutex_` is there to provide exclusive access to `execute_goal_request_received`, since that was considered an issue. IMO, it would be better not to change the behavior. For example, a user application may expect first come, first accepted & executed, which I guess is really common; this cannot be guaranteed by the fix. I'd like to hear from others.
I added another lock to keep the previous behavior.
I have an application that is frequently changing controller goals and I was running into this deadlock in nearly every single mission. I applied the diff in this PR to my server.cpp file locally and haven't been able to reproduce the deadlock.
@clalancette @fujitatomoya
I cannot think of any other way to support the current behavior at this moment, but before moving forward, it may be better to discuss this and reach a consensus, since it introduces an additional mutex.
rclcpp_action/src/server.cpp (Outdated)

```diff
@@ -45,6 +45,7 @@ class ServerBaseImpl

   // Lock everything except user callbacks
   std::recursive_mutex reentrant_mutex_;
+  std::recursive_mutex goal_request_mutex_;
```
I think this can be a private member of `ServerBase`.
Do you mean both mutexes or only `goal_request_mutex_`? Which do you prefer?
Hi @clalancette, okay, I will rebase them on your change.
So I want to back way up here and look at the original problem. I think the most instructive thing to look at is the gist at https://gist.github.com/daisukes/712316a97832f5d9ab851ad47c77ad98 . In there, I see the deadlock that caused @daisukes the problem in the first place. Exactly as he says, there is a lock inversion problem.
However, looking more closely at that code shows me something else that seems weird. https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L128 seems to show that a new thread is spawned to do the actual work, so the callback should return (and release the user's lock) without blocking. In other words, I think that implementation in nav2_util's simple_action_server shouldn't be able to deadlock in the way described. Relatedly, can you give detailed instructions on how to reproduce? I tried your reproducer program at https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d , but so far I haven't been able to reproduce it.
@clalancette Here is the test code to check that the deadlock can happen, which I committed within this PR. Maybe my explanation was bad, but basically, it happens with two consecutive goal requests.
Hello @clalancette, I have reread your comment.
I remember that I could reproduce the lock with this snippet only with my configuration, which has a larger map compared to the nav2 default turtle map. So I wrote the test code which I mentioned in the previous comment.
Yes, but my point is a little different. If this was truly an independent thread, then it would kick off the new thread, then return from the callback back to the rclcpp_action code, releasing the user's lock along the way. So my question is: why is that not happening?
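A sketch of the expectation being voiced here, assuming a nav2-style wrapper that kicks off a worker thread from the accepted-goal callback; the class and names are hypothetical:

```cpp
#include <memory>
#include <mutex>
#include <thread>

struct ServerGoalHandle { /* stand-in for rclcpp_action's handle */ };

class Wrapper {
  std::mutex user_mutex_;
  std::thread work_thread_;

public:
  // Runs inside the rclcpp_action callback path.
  void handle_accepted(std::shared_ptr<ServerGoalHandle> handle) {
    std::lock_guard<std::mutex> lock(user_mutex_);
    if (work_thread_.joinable()) {
      work_thread_.detach();  // let the previous worker finish on its own
    }
    work_thread_ = std::thread([handle]() {
      // ... long-running execution; succeed()/abort() at the end ...
    });
    // Returning releases user_mutex_ before control goes back to
    // rclcpp_action, so the inversion window should, in theory, be gone.
  }
};
```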
This deadlock problem is very tricky, as I wrote in issue #1285; the deadlock happens in situations like the one described there. I'm not sure, but from my experience, higher CPU load causes more deadlocks.
This PR solved my issue. Thanks.
Hi @daisukes and @justinIRBT, I'm developing an application using the new Navigation2 Dynamic Object Following feature. The publish rate of the goal update is 20Hz, and the RateController in my BT file is also 20Hz. Thanks to this PR, my application no longer gets a deadlock. However, I encounter a new issue: the bt_action_node of navigate_to_pose reports a server_timeout. I've tried increasing the server_timeout from the default 10ms to 100ms. 10 minutes later, everything seems to be fine, no more server_timeout happens, but the controller server begins yelling about missing its control loop rate. Do you guys encounter the same issues, like the server_timeout or the missed control loop rate?
@QQting can you email me a copy of that video (stevenmacenski@gmail.com)? I'd love to use that in our documentation, it's hilarious.
I think that it would be nice to keep track of this in a new issue.
I'm not sure about creating a new issue, because my situation is based on this PR, which hasn't been merged yet. And I've also tested the latest rclcpp_action on the master branch; the deadlock still happens in my application.
okay, so maybe after this PR is merged then 👍
rclcpp/rclcpp_action/src/server.cpp Lines 46 to 47 in dd0f97f
rclcpp/rclcpp_action/src/server.cpp Lines 236 to 237 in dd0f97f
rclcpp/rclcpp_action/src/server.cpp Line 353 in dd0f97f
rclcpp/rclcpp_action/src/server.cpp Line 289 in dd0f97f
Maybe we might just unlock the mutex at these points before calling the user callbacks.
Yes, that's what @daisukes suggests.
@clalancette what's blocking this now?
I'm still not totally convinced that this is the problem. I'll have to go back to look at it again, but when I last looked at it, https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L128 should not have been holding the lock, and thus the deadlock should not be happening. I don't think we have an adequate explanation for why that is the case. Additionally, doing the unlock/relock around the user callbacks has its own problems, as pointed out earlier. Overall, I don't think we understand the problem enough yet to put in a solution.
@clalancette would it be worth getting on a call? Ticket #1285 clearly explains the issue. Sure, it appears in Nav2's simple action server, but that could be anyone's arbitrary wrapper around an action server. At https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d there's a snippet provided by @daisukes showing that this fixes a real problem with frequently running goals and user locks. Justin also reported this, and I don't think it was in a nav2 context. I don't disagree with you that the lock/unlock pattern isn't ideal.
All right, I've spent some time and taken another look at this problem. Thanks for bearing with me while I found the time to do this. First of all, I spent the time to create a small reproducer. If you clone https://github.com/clalancette/deadlock, you should be able to build it in a workspace (along with this PR's branch).
Terminal 1:
Terminal 2:
This will usually deadlock within a few minutes (at least, on my machine). Having this minimal reproducer is helpful for looking at the stack traces.

And restating what we've already said above, the deadlock is of the AB/BA type. Inside of the server, in thread 1 we end up calling the user's callback while holding the action server's mutex, and that callback then waits on the user's own lock; in thread 2, the user's code holds its own lock and calls back into the action server, which waits on the action server's mutex.

Thread 1:
Thread 2:

So that's our deadlock. I'm actually going to ignore the part I don't quite understand about why the nav2-style wrapper is holding its lock at that point; the lock inversion is enough to explain the hang.

Now that I've looked at the code in https://github.com/ros2/rclcpp/blob/master/rclcpp_action/src/server.cpp , I actually agree with the direction that this PR is heading in. The main reason I agree with it is that the server should not be holding its internal mutex while calling into user callbacks.

That being said, here are the issues I have with how the PR is implemented:
@daisukes If you can make those changes, and make sure all of the tests pass, I'm happy to then run CI on it and look at merging it. Thanks again for your patience here.
…lback functions" This reverts commit d2bd783. Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@clalancette Thank you for taking the time to look into the problem.
A couple of changes inline.
Also, I believe we need to take the mutex in `execute_result_request_received` and in `execute_check_expired_goals` to be totally thread-safe.
rclcpp_action/src/server.cpp (Outdated)

```cpp
{
  std::lock_guard<std::recursive_mutex> lock(pimpl_->action_server_reentrant_mutex_);
  pimpl_->goal_handles_[uuid] = handle;
}
```
I don't think we need this particular lock here. At least, this lock is meant to protect the `action_server`, and that is not being accessed here.
rclcpp/rclcpp_action/src/server.cpp, Line 526 in 6feedbe

```cpp
pimpl_->goal_handles_.erase(uuid);
```

`goal_handles_` can be changed simultaneously, so I put a lock guard here too.
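To make the hazard concrete, here is a minimal sketch of why the concurrent insert and erase need a mutex; the names mirror the discussion but the types are simplified stand-ins (the real key is a GoalUUID with a custom hash):

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct GoalHandle { /* stand-in for the real goal handle */ };

std::mutex goal_handles_mutex;  // hypothetical dedicated lock
std::unordered_map<std::string, std::weak_ptr<GoalHandle>> goal_handles;

void on_goal_accepted(const std::string & uuid,
                      const std::shared_ptr<GoalHandle> & handle) {
  std::lock_guard<std::mutex> lock(goal_handles_mutex);
  goal_handles[uuid] = handle;  // insert may rehash: racy without a lock
}

void on_goal_expired(const std::string & uuid) {
  std::lock_guard<std::mutex> lock(goal_handles_mutex);
  goal_handles.erase(uuid);  // concurrent erase would corrupt the map
}
```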
I'm sorry to be a bit pedantic here, but the name of the lock is currently `action_server_reentrant_mutex`. As a reader of the code, then, I expect that that lock will protect the `action_server`, and not necessarily protect anything else.

But I do agree that the `goal_handles` could be changed by two separate threads. There are 2 ways we can go with this:

1. Rename the lock to `pimpl_lock` or something like that. At that point, it makes more sense that the lock protects the entire PIMPL structure. But we would also need to audit the rest of the code to make sure that the lock is being held anytime anything in the structure is being modified.
2. Introduce another lock specifically to protect the `goal_handles`, and lock that as appropriate.

I'm honestly not sure which is better.
I missed this message and have already made a change, but I added another mutex for the unordered maps, including `goal_handles_`, because the pimpl structure has three groups of variables: 1) `action_server` and its attributes (size_t, bool) captured from `action_server`, 2) the three unordered maps, and 3) references (clock and logger). I think we don't need to lock the entire pimpl structure.

One line I'm not sure about is updating `goal_request_ready_` without a lock. All the size_t and bool variables are protected because they are updated together with `action_server`.

rclcpp/rclcpp_action/src/server.cpp, Line 292 in 5ae0e9a

```cpp
pimpl_->goal_request_ready_ = false;
```
> One line I'm not sure about is updating goal_request_ready_ without a lock.

Good question. It should be atomic since it is just updating a boolean, but that is not guaranteed. One thing we could do there is make it a `std::atomic<bool>`, which would guarantee it.
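A hedged illustration of that suggestion: a plain bool written in one thread and read in another is formally a data race in C++, even when it happens to work on common hardware, while `std::atomic<bool>` makes it well-defined. The class and function names below are illustrative, not the real pimpl:

```cpp
#include <atomic>

struct ServerBaseImplSketch {
  std::atomic<bool> goal_request_ready_{false};
};

// Writer side: e.g. the wait set signals a new goal request.
void on_goal_request_ready(ServerBaseImplSketch & impl) {
  impl.goal_request_ready_.store(true);  // no mutex needed
}

// Reader side: exchange() also closes the check-then-clear race that a
// separate load() followed by store(false) would leave open.
bool take_goal_request(ServerBaseImplSketch & impl) {
  return impl.goal_request_ready_.exchange(false);
}
```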
Thanks for the suggestion! I changed it to `std::atomic<bool>`.
They already have a lock that also protects the unordered_map. Do you want me to make their scope smaller too?
rclcpp/rclcpp_action/src/server.cpp, Lines 453 to 459 in 6feedbe
rclcpp/rclcpp_action/src/server.cpp, Lines 505 to 514 in 6feedbe
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
Oops, yeah, that's right. Let's make the scope smaller there as well.
add unordered_map_mutex_ to protect some unordered_maps Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
I'm going to go ahead and approve this, with or without the change of `goal_request_ready_` to `std::atomic<bool>`. I'm also going to fire up CI on it and see what happens.
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@daisukes Thanks for the update. One more CI run, then I'm happy to merge:
Looking good to me, so I'm going to merge. Thanks @daisukes for the contribution and for sticking with it.
@clalancette and @SteveMacenski, thank you for your kind support! I wanted to include this fix in foxy initially and made a backport for my project. Do you want me to make another PR for foxy as well?

While working on that, I found that these lines were deleted by this commit 8c8c268:
rclcpp/rclcpp_action/src/server.cpp, Line 308 in 579e9d0
rclcpp/rclcpp_action/src/server.cpp, Line 392 in 579e9d0
I'm not sure whether their removal can cause problems.
I'm honestly not sure. With the tests passing, I'm inclined to say things are working. But if you find during analysis that you think those should be set, please feel free to open another PR to restore them.

If you are interested in having the fix in Foxy, please do. The good news with this PR is that since everything is PIMPL-style, it should be API and ABI compatible.
You did the work, I just checked in on occasion. Thanks for sticking with it 🥳!
* Add missing locking to the rclcpp_action::ServerBase. (#1421)
This patch actually does 4 related things:
1. Renames the recursive mutex in the ServerBaseImpl class to action_server_reentrant_mutex_, which makes it a lot clearer what it is meant to lock.
2. Adds some additional error checking where checks were missed.
3. Adds a lock to publish_status so that the action_server structure is protected.
Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
* [backport] Fix action server deadlock (#1285, #1313)
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
* revert comment
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
* unlock action_server_reentrant_mutex_ before calling user callback functions
Add an additional lock to keep the previous behavior broken by the deadlock fix.
Also add a test case to reproduce the deadlock situation in rclcpp_action.
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
To fix issue #1285.

I think the problem is that the server can accept the next goal while processing the `on_terminal_state` callback. It can be solved by putting them in a single `reentrant_mutex_` context. This does not produce any deadlock with my test code so far.
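For context on why a single `reentrant_mutex_` context works at all: `std::recursive_mutex` allows the thread that already owns it to lock it again. A rough sketch (the function names echo the diff above, but the bodies are illustrative):

```cpp
#include <mutex>

std::recursive_mutex reentrant_mutex;  // stands in for reentrant_mutex_

void notify_goal_terminal_state() {
  // Safe even when the caller already holds reentrant_mutex: a
  // recursive_mutex may be re-acquired by its owning thread.
  std::lock_guard<std::recursive_mutex> lock(reentrant_mutex);
  // ... recalculate the expired-goal timer ...
}

void publish_status_and_notify() {
  std::lock_guard<std::recursive_mutex> lock(reentrant_mutex);
  // ... publish a status message for the goal handle state change ...
  notify_goal_terminal_state();  // re-entrant call under the same mutex
}
```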