
Fix action server deadlock (#1285) #1313

Merged
8 commits merged into ros2:master on Jan 14, 2021

Conversation

@daisukes (Contributor):

To fix issue #1285:

I think the problem is that the server can accept the next goal while the on_terminal_state callback is still being processed. It can be solved by putting both inside a single reentrant_mutex_ context.

This has not caused any deadlock with my test code so far.

Comment on lines 540 to 544

// Publish a status message any time a goal handle changes state
publish_status();
// notify base so it can recalculate the expired goal timer
notify_goal_terminal_state();
Collaborator:

I think the problem is the lock order, not the locking time window, so I believe this fix can still deadlock (for example, right after rclcpp_action::ServerGoalHandle::succeed is issued, another thread fires rclcpp_action::ServerBase::execute_goal_request_received, which then calls the user callback handle_goal).

I think reentrant_mutex_ must be used only internally, not around user callbacks.

@daisukes (Contributor, Author) commented Sep 14, 2020:

Yeah, I agree with this; my change only solves my particular test case.

I think reentrant_mutex_ must be used only internally, not around user callbacks.

I changed the fix; is this what you had in mind?

Collaborator:

I think this works, but I'd like to hear from others.

@brawner (Contributor) commented Sep 14, 2020:

Does this target the correct branch? This bug probably exists on the primary branch, master, right? We can backport the bug fix to foxy after merging it on master.

Also, are you able to create a unit test out of your test code and add that to rclcpp_action's tests? I'm about to add a bunch of tests with #1290 and I haven't encountered this issue. It would be awesome to add a unit test that would prevent this bug from recurring once this PR is merged.

@daisukes (Contributor, Author):

My project is based on Foxy, so I made the PR against foxy, but yes, I think this can also target master.

Also, are you able to create a unit test out of your test code and add that to rclcpp_action's tests?

Okay, I will try. My test code is based on Navigation2, but I think I can reproduce its mutex behavior.

@daisukes (Contributor, Author):

@brawner I think I could reproduce the error with a unit test. Where would you like me to commit it?

  1. on this branch, or
  2. as a PR against #1290 (Increase coverage rclcpp_action to 95%), targeting the jacob/actions_refactor or brawner/rclcpp_action-test-coverage branch?

@brawner (Contributor) commented Sep 15, 2020:

If you need the infrastructure in rclcpp_action-test-coverage, then you'll have to target that branch. If, however, you can include the test in this PR, that's ideal, because it would make backporting to foxy a lot easier.

@daisukes (Contributor, Author):

I put the test here to prove that my patch fixes the issue. I changed the order of the commits to make it easier to test before and after the fix.

Comment on lines 235 to 305
lock.unlock();

// Call user's callback, getting the user's response and a ros message to send back
auto response_pair = call_handle_goal_callback(uuid, message);

lock.lock();

Contributor:

These types of lock/unlock sequences are often problematic. They often lead to race conditions where two threads come in here simultaneously, then race to the lock.lock(). Only one will win (and the other will have to wait), but what if T1 comes in first, calls the callback, but then T2 is the one that reacquires the lock first? Is that situation OK?

Can you explain more about the original deadlock situation? I'm wondering if there are other solutions we can use here.

@daisukes (Contributor, Author):

@clalancette

if T1 comes in first, calls the callback, but then T2 is the one that reacquires the lock first? Is that situation OK?

There are three call_*_callback functions, and they do not call any rcl functions that need to be protected by the lock (pimpl_->reentrant_mutex_):

  • call_handle_goal_callback: this uses only local variables
  • call_goal_accepted_callback:
  • call_handle_cancel_callback:
    • these two access goal_handles_, which is already protected by goal_handles_mutex_

So I believe these three functions are thread-safe as long as the user callbacks are thread-safe, which is not something we need to worry about.

Can you explain more about the original deadlock situation? I'm wondering if there are other solutions we can use here.

I think explaining with my test code is the best way. The user code has its own mutex, which is used in the callback functions and also in its worker thread, which calls the succeed function.

send_goal_request(node_, uuid1_);
// This locks the wrapper's mutex and intentionally waits 100 ms before calling succeed,
// which then tries to acquire the rclcpp_action mutex.
std::thread t(&TestDeadlockServer::GoalSucceeded, this);
// Sleep until after the wrapper's mutex is locked but before succeed is called.
rclcpp::sleep_for(std::chrono::milliseconds(50));
// Send the next goal request to intentionally reproduce the deadlock:
// it first locks the rclcpp_action mutex and then calls the callback, which locks the wrapper's mutex.
send_goal_request(node_, uuid2_);
t.join();

  • user code: locks the user mutex, then calls succeed, which eventually locks the rclcpp_action mutex (L997)
  • rclcpp_action: locks the rclcpp_action mutex, then calls the user callbacks, which lock the user mutex (L1002)
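To make the inversion concrete, here is a minimal, self-contained sketch of the same AB/BA pattern; server_mutex and wrapper_mutex are illustrative stand-ins for rclcpp_action's reentrant_mutex_ and the user wrapper's mutex, and none of this is actual rclcpp code. Built and run as-is it will normally hang, which is the point.

#include <chrono>
#include <mutex>
#include <thread>

std::recursive_mutex server_mutex;  // stand-in for rclcpp_action's reentrant_mutex_
std::mutex wrapper_mutex;           // stand-in for the user wrapper's mutex

// Stand-in for execute_goal_request_received: takes the server mutex,
// then invokes the user callback, which takes the wrapper mutex.
void server_handles_goal_request() {
  std::lock_guard<std::recursive_mutex> server_lock(server_mutex);
  std::lock_guard<std::mutex> user_lock(wrapper_mutex);
}

// Stand-in for the user's worker thread: takes the wrapper mutex,
// then calls succeed(), which internally needs the server mutex.
void user_calls_succeed() {
  std::lock_guard<std::mutex> user_lock(wrapper_mutex);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::lock_guard<std::recursive_mutex> server_lock(server_mutex);
}

int main() {
  std::thread worker(user_calls_succeed);
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  server_handles_goal_request();  // deadlocks: the two locks are taken in opposite order
  worker.join();
}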

Collaborator:

@clalancette

I think the most complicated case is the following,

As a precondition, assume that goals are always accepted and that there are only two threads.

  1. (T-1) goal_request_ready_ becomes ready, then execute_goal_request_received is called with reentrant_mutex_ locked
  2. (T-1) reentrant_mutex_ is unlocked, then call_handle_goal_callback is called (accepted)
  3. (T-2) goal_request_ready_ becomes ready, then execute_goal_request_received is called with reentrant_mutex_ locked
  4. (T-2) reentrant_mutex_ is unlocked, then call_handle_goal_callback is called (accepted)
  5. (T-2) reentrant_mutex_ is locked, rcl_action_send_goal_response and the rest are called
  6. (T-2) reentrant_mutex_ is unlocked
  7. (T-1) reentrant_mutex_ is locked, rcl_action_send_goal_response and the rest are called
  8. (T-1) reentrant_mutex_ is unlocked

I believe this is a new problem introduced by this fix, because the goal order between acceptance and execution is no longer consistent. (It is really rare, but possible.)

@daisukes (Contributor, Author):

If that order should be guaranteed, we can add another mutex that is used only at the beginning of execute_goal_request_received.

@daisukes (Contributor, Author):

@fujitatomoya
I don't think this potential issue is critical, because it only happens when the action server is run by a multithreaded executor.

Collaborator:

Yes, obviously that is one of the cases. I think the reentrant_mutex_ is there to provide exclusive access for execute_goal_request_received, since that was considered an issue. IMO, it would be better not to change the behavior: for example, a user application may expect first-come, first-accepted-and-executed ordering, which I guess is really common, and that cannot be guaranteed by this fix. I'd like to hear from others.

@daisukes (Contributor, Author):

I added another lock to keep the original behavior.

@justinIRBT:

I have an application that is frequently changing controller goals and I was running into this deadlock in nearly every single mission. I applied the diff in this PR to my server.cpp file locally and haven't been able to reproduce the deadlock.

@daisukes (Contributor, Author):

@clalancette @fujitatomoya
Any updates?

@fujitatomoya (Collaborator) left a comment:

I cannot think of any other way to preserve the current behavior at this moment. But before moving forward, it might be better to discuss and reach a consensus, since this adds an additional mutex.

@@ -45,6 +45,7 @@ class ServerBaseImpl

// Lock everything except user callbacks
std::recursive_mutex reentrant_mutex_;
std::recursive_mutex goal_request_mutex_;
Collaborator:

I think this can be a private member of ServerBase.

@daisukes (Contributor, Author):

Do you mean both mutexes or only goal_request_mutex_?

@daisukes (Contributor, Author):

it might be better to discuss and reach a consensus, since this adds an additional mutex

  1. Without an additional mutex
  • pros: no additional locking cost
  • cons: behavior change (the order of call_handle_goal_callback and call_goal_accepted_callback is not guaranteed with a multithreaded executor)
  2. With an additional mutex
  • pros: keeps the current behavior
  • cons: additional locking cost

Which do you prefer?
I would say 2. It is a safer choice at a minimal cost.
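For reference, option 2 corresponds roughly to the sketch below (illustrative only, not the code in this PR; the user callback and the rcl_action_* calls are left as placeholder comments):

#include <mutex>

// Sketch of option 2: goal_request_mutex_ serializes the whole accept path so
// goals are accepted and answered in arrival order, while
// action_server_reentrant_mutex_ is held only around the rcl_action_* access.
struct ImplSketch {
  std::recursive_mutex goal_request_mutex_;
  std::recursive_mutex action_server_reentrant_mutex_;
};

void execute_goal_request_received_sketch(ImplSketch & pimpl)
{
  // Held for the whole function: keeps accept/execute order consistent across threads.
  std::lock_guard<std::recursive_mutex> order_lock(pimpl.goal_request_mutex_);

  // The user callback runs without the action-server mutex, avoiding the deadlock:
  // auto response_pair = call_handle_goal_callback(uuid, message);

  {
    // The action-server mutex protects only the actual rcl_action_* calls:
    std::lock_guard<std::recursive_mutex> lock(pimpl.action_server_reentrant_mutex_);
    // rcl_action_send_goal_response(...);
  }
}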

@clalancette (Contributor):

@daisukes I just ended up merging #1421, which fixed one missing lock here. I believe this PR also fixed that bug, but in the interest of keeping things small, I went ahead and merged that one first. Can you please rebase this PR on top of that? Thanks.

@daisukes (Contributor, Author):

Hi @clalancette, okay, I will rebase this on top of your change.
But before working on that, we need to decide how to handle @fujitatomoya's concern, which requires introducing another lock to guarantee the original behavior. What do you think?

#1313 (comment)

@clalancette (Contributor):

So I want to back way up here and look at the original problem.

I think the most instructive thing to look at is the gist at https://gist.github.com/daisukes/712316a97832f5d9ab851ad47c77ad98 . In there, I see the deadlock that caused @daisukes the problem in the first place. Exactly as he says, there is a lock inversion problem:

  • thread 9 takes the rclcpp_action::action_server_reentrant_mutex, then calls SimpleActionServer::handle_goal (callback), which attempts to take the internal update_mutex.
  • thread 14 is running SimpleActionServer::handle_accepted(), which first takes the internal update_mutex, then calls into Server::call_goal_accepted_callback(), which takes the rclcpp_action::action_server_reentrant_mutex. This causes the deadlock.

However, looking more closely at that code shows me something else that seems weird. https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L128 seems to show that a new std::async should be executing the work() function. If I'm reading https://en.cppreference.com/w/cpp/thread/async properly, that should mean that a separate thread is used to execute work(). However, looking back at the stack trace for Thread 14, I see that handle_accepted is in the call stack. Now, that may just be a detail of how C++ implements async. But it also implies that the async thread is sharing state with the original thread, including the locked lock.

In other words, I think the implementation in SimpleActionServer may be incorrect. @daisukes, if you change the SimpleActionServer to use a std::thread to kick off work, instead of using std::async, does that solve the problem?

Relatedly, can you give detailed instructions on how to reproduce? I tried your reproducer program at https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d , but so far I haven't been able to reproduce it.

@daisukes (Contributor, Author):

@clalancette
Thank you for digging into the problem!

Here is the test code to check if the deadlock can happen, which I committed within this PR.
938f973

Maybe my explanation was unclear, but basically it happens with two consecutive goal requests.

@daisukes (Contributor, Author) commented Nov 5, 2020:

Hello @clalancette, I have reread your comment

However, looking back at the stack trace for Thread 14, I see that handle_accepted is in the call stack. Now, that may just be a detail of how C++ implements async. But it also implies that the async thread is sharing state with the original thread, including the locked lock.

The async is called with the std::launch::async policy, so work() should be executed in a different thread according to the documentation; indeed it runs in Thread 14, which is a different thread from the main node's Thread 9.
Both threads access the same rclcpp_action server instance, and sharing lock state is simply how a lock works, right?

Relatedly, can you give detailed instructions on how to reproduce? I tried your reproducer program at https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d , but so far I haven't been able to reproduce it.

I remember that I could reproduce the deadlock with this snippet only with my configuration, which uses a larger map than the default nav2 turtle map. That is why I wrote the test code mentioned in the previous comment.

@clalancette (Contributor):

The async is called with the std::launch::async policy, so work() should be executed in a different thread according to the documentation; indeed it runs in Thread 14, which is a different thread from the main node's Thread 9.
Both threads access the same rclcpp_action server instance, and sharing lock state is simply how a lock works, right?

Yes, but my point is a little different. If this was truly an independent thread, then it would kick off the new thread, then return from the callback back to the rclcpp_action code, at which point it would drop the reentrant_mutex in the server. At that point, the deadlock would no longer exist, and the thread could take the lock (indeed, it may already have been waiting for it).

So my question is: why is that not happening?

@daisukes (Contributor, Author) commented Nov 6, 2020:

Yes, but my point is a little different. If this was truly an independent thread, then it would kick off the new thread, then return from the callback back to the rclcpp_action code, at which point it would drop the reentrant_mutex in the server. At that point, the deadlock would no longer exist, and the thread could take the lock (indeed, it may already have been waiting for it).

This deadlock problem is very tricky, as I wrote in issue #1285.

Problem

  • user callbacks are called while the reentrant_mutex_ lock is held
    • inside the callbacks, the wrapper locks its own mutex
  • in another thread, the wrapper first locks its own mutex
    • then it calls an rclcpp_action method, which tries to lock reentrant_mutex_

The deadlock happens in a situation like this:

  • server thread: accepts goal 1, kicks off the async thread
  • async thread: does the work, locks the wrapper mutex, and is about to call succeed
  • server thread: accepts goal 2 (locking the rclcpp mutex) and calls the callback (trying to lock the wrapper mutex)
  • async thread: calls succeed (trying to lock the rclcpp mutex)

So my question is: why is that not happening?

I'm not sure, but in my experience, higher CPU load causes more deadlocks.

@QQting commented Nov 11, 2020:

This PR solved my issue. Thanks.

@QQting commented Nov 13, 2020:

Hi @daisukes and @justinIRBT,

I'm developing an application that uses the new Navigation2 Dynamic Object Following feature.
The GIF below is a screencast of my application:
[GIF: pedsim-with-nb2]

The publish rate of the goal update is 20 Hz, and the RateController in my BT file is also 20 Hz. Thanks to this PR, my application no longer deadlocks.

However, I've encountered a new issue: the bt_action_node of navigate_to_pose reports send_goal failed because it returns rclcpp::executor::FutureReturnCode::TIMEOUT after the program has been running for about 2 minutes.

I've tried increasing the server_timeout from the default 10 ms to 100 ms. Ten minutes later everything seems fine and no more server_timeout happens, but the controller server begins yelling Control loop missed its desired rate of 20.0000Hz.

Do you encounter the same issues, like the server_timeout or the missed control loop rate?

@SteveMacenski (Collaborator):

@QQting can you email me a copy of that video (stevenmacenski@gmail.com)? I'd love to use it in our documentation; it's hilarious.

@fujitatomoya (Collaborator):

@QQting

However, I've encountered a new issue: the bt_action_node of navigate_to_pose reports send_goal failed because it returns rclcpp::executor::FutureReturnCode::TIMEOUT after the program has been running for about 2 minutes.

I think it would be better to keep this issue focused on the deadlock problem instead of mixing them up. How about creating a new issue?

@QQting commented Nov 15, 2020:

I think it would be better to keep this issue focused on the deadlock problem instead of mixing them up. How about creating a new issue?

I'm not sure about creating a new issue, because my situation depends on this PR, which hasn't been merged yet. I've also tested the latest rclcpp_action on the master branch, and the deadlock still happens in my application.

@fujitatomoya (Collaborator):

Okay, so maybe after this PR is merged then 👍

@iuhilnehc-ynos (Collaborator):

  • Lock everything except user callbacks

// Lock everything except user callbacks
std::recursive_mutex action_server_reentrant_mutex_;

  • Why do the user callbacks still need to be locked by action_server_reentrant_mutex_?

// Call user's callback, getting the user's response and a ros message to send back
auto response_pair = call_handle_goal_callback(uuid, message);

auto response_code = call_handle_cancel_callback(uuid);

call_goal_accepted_callback(handle, uuid, message);

Maybe we could just unlock the action_server_reentrant_mutex_ before calling the above user callbacks and lock it again after the callbacks return?

@fujitatomoya (Collaborator):

Maybe we could just unlock the action_server_reentrant_mutex_ before calling the above user callbacks and lock it again after the callbacks return?

Yes, that is what @daisukes suggests.

@clalancette self-assigned this Dec 17, 2020
@SteveMacenski (Collaborator):

@clalancette what's blocking this now?

@clalancette (Contributor):

@clalancette what's blocking this now?

I'm still not totally convinced that this is the problem. I'll have to go back to look at it again, but when I last looked at it, https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L128 should not have been holding the lock, and thus the deadlock should not be happening. I don't think we have an adequate explanation for why that is the case.

Additionally, doing the lock/unlock thing is a code smell in my opinion. Usually that leads to other races/deadlocks later on.

Overall, I don't think we understand the problem enough yet to put in a solution.

@SteveMacenski (Collaborator):

@clalancette would it be worth getting on a call? Ticket #1285 clearly explains the issue. Sure, it appears in Nav2's simple action server, but that could be anyone's arbitrary wrapper around rclcpp_action, and many applications need to lock resources other than just the action object itself, since the information fed into it or taken out of it needs thread control.

https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d is a snippet provided by @daisukes showing that this fixes a real problem with frequently running goals and user locks.

Justin also reported this, and I don't think it was in a nav2 context. I don't disagree that the lock/unlock isn't the most elegant thing, but it fixes a real issue that is understood by others. I don't think @daisukes would have gone down this route if the simple action server had something within it to fix (since that's where we started).

@clalancette (Contributor):

All right, I've spent some time and taken another look at this problem. Thanks for bearing with me while I found the time to do this.

First of all, I spent the time to create a small reproducer. If you clone https://github.com/clalancette/deadlock, you should be able to build it in a workspace (along with nav2_msgs and nav2_common), then run:

Terminal 1:

$ . install/setup.bash
$ ./install/deadlock/bin/dummy_action_server3

Terminal 2:

$ . install/setup.bash
$ ./install/deadlock/bin/deadlock

This will usually deadlock within a few minutes (at least, on my machine). Having this minimal reproducer is helpful for looking at the stack traces.

And restating what we've already said above, the deadlock is of the AB/BA type. Inside of the server, in thread 1 we end up calling SimpleActionServer::handle_accepted(), which takes the update_mutex in that class. Later on in that same thread we end up calling ServerBase::notify_goal_terminal_state, which takes the action_server_reentrant_mutex in that class.

Thread 1:
SimpleActionServer::handle_accepted() (takes the update_mutex lock)
SimpleActionServer::work() (this seems like it should be on a separate thread, but that's not what the backtrace shows)
PlannerServer::computePlan()
SimpleActionServer::succeeded_current()
ServerGoalHandle::succeed()
ServerBase::notify_goal_terminal_state() (takes the action_server_reentrant_mutex)

Thread 2:
NodeThread::spin()
ServerBase::execute()
ServerBase::execute_goal_request_received() (takes the action_server_reentrant_mutex)
ServerBase::call_handle_goal_callback()
SimpleActionServer::handle_goal() (takes the update_mutex lock)

So that's our deadlock. I'm actually going to ignore the part I don't quite understand about std::async; it turns out that it doesn't matter since this deadlock can happen in several other places as well.

Now that I've looked at the code in https://github.com/ros2/rclcpp/blob/master/rclcpp_action/src/server.cpp , I actually agree with the direction that this PR is heading in. The main reason I agree with it is that the action_server_reentrant_mutex is actually meant to prevent concurrent access to the action_server_ itself. Thus, that mutex should only be held while accessing action_server_, and not at other times.

That being said, here are the issues I have with how the PR is implemented:

  1. I don't think we should add the extra goal_request_mutex, or at least we shouldn't combine it with this PR. If that is something we really think we need, let's do it in a separate PR to keep this one focused on the current problem.
  2. Let's not lock and unlock the action_server_reentrant_mutex by hand. Instead, let's use scoping to lock it (a sketch of this style follows this comment); that just fits in better with how we do locking pretty much everywhere else.
  3. Related to the last point, let's make the scope of the locks even smaller. With the one exception of publish_status (where we have to hold the lock across the entire method), we should only need to hold that lock while we are accessing pimpl_->action_server_.

@daisukes If you can make those changes, and make sure all of the tests pass, I'm happy to then run CI on it and look at merging it. Thanks again for your patience here.
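To illustrate the scoped-locking style requested in points 2 and 3 above, here is a minimal sketch; it is not the actual server.cpp code, and FakeActionServer plus the trivial callback are placeholders for the real rcl action server handle and user callback.

#include <mutex>
#include <utility>

struct FakeActionServer {
  void send_goal_response() {}  // placeholder for the rcl_action_* call
};

struct ServerSketch {
  std::recursive_mutex action_server_reentrant_mutex_;
  FakeActionServer action_server_;

  // Stand-in for the user's handle_goal callback.
  std::pair<bool, int> call_handle_goal_callback() { return {true, 0}; }

  void execute_goal_request_received() {
    // The user callback runs with no internal lock held, so a wrapper that
    // takes its own mutex inside the callback cannot invert lock order with us.
    auto response_pair = call_handle_goal_callback();
    (void)response_pair;

    // Scope-based locking: the mutex is taken only while action_server_ is
    // touched, and released automatically at the closing brace.
    {
      std::lock_guard<std::recursive_mutex> lock(action_server_reentrant_mutex_);
      action_server_.send_goal_response();
    }
  }
};

int main() {
  ServerSketch s;
  s.execute_goal_request_received();
}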

…lback functions"

This reverts commit d2bd783.

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@daisukes (Contributor, Author):

@clalancette Thank you for taking the time to look into the problem.
I pushed changes to address your points.

@clalancette (Contributor) left a comment:

A couple of changes inline.

Also, I believe we need to take the mutex in execute_result_request_received and in execute_check_expired_goals to be totally thread-safe.

Comment on lines 340 to 343
{
  std::lock_guard<std::recursive_mutex> lock(pimpl_->action_server_reentrant_mutex_);
  pimpl_->goal_handles_[uuid] = handle;
}
Contributor:

I don't think we need this particular lock here. At least, this lock is meant to protect the action_server, and that is not being accessed here.

@daisukes (Contributor, Author):

pimpl_->goal_handles_.erase(uuid);

goal_handles_ can be modified concurrently, so I put a lock guard here too.

Contributor:

I'm sorry to be a bit pedantic here, but the name of the lock is currently action_server_reentrant_mutex. As a reader of the code, then, I expect that that lock will protect the action_server, and not necessarily protect anything else.

But I do agree that the goal_handles could be changed by two separate threads. There are 2 ways we can go with this:

  1. Rename the lock to pimpl_lock or something like that. At that point, it makes more sense that the lock protects the entire PIMPL structure. But we would also need to audit the rest of the code to make sure that the lock is being held anytime anything in the structure is being modified.
  2. Introduce another lock specifically to protect the goal_handles, and lock that as appropriate.

I'm honestly not sure which is better.

@daisukes (Contributor, Author):

I missed this message and had already made a change, but I added another mutex for the unordered maps, including goal_handles_, because the pimpl structure has three groups of variables: 1) the action_server and its attributes (size_t, bool) captured from the action_server, 2) three unordered maps, and 3) references (clock and logger).
I don't think we need to lock the entire pimpl structure.

One line I'm not sure about is updating goal_request_ready_ without a lock.
All the size_t and bool variables are protected because they are updated together with the action_server.

pimpl_->goal_request_ready_ = false;
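For orientation, the grouping described above could look roughly like the sketch below; types are simplified, and member names other than the two mutexes, goal_handles_, and goal_request_ready_ are illustrative rather than the real ones.

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

class ServerBaseImplSketch
{
public:
  // Group 1: the rcl action server handle and the ready flags captured from it.
  // Protected by action_server_reentrant_mutex_; plain flags such as
  // goal_request_ready_ can instead become std::atomic (discussed below).
  std::recursive_mutex action_server_reentrant_mutex_;
  // rcl_action_server_t action_server_;  // real handle lives in rcl_action
  bool goal_request_ready_ = false;

  // Group 2: bookkeeping maps keyed by goal UUID, protected by
  // unordered_map_mutex_ so touching them never requires the server mutex.
  std::recursive_mutex unordered_map_mutex_;
  std::unordered_map<std::string, std::shared_ptr<void>> goal_handles_;  // key/value types simplified

  // Group 3: references such as the clock and logger, which need no mutex.
};

int main() { ServerBaseImplSketch impl; (void)impl; }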

Contributor:

One line I'm not sure about is updating goal_request_ready_ without a lock.

Good question. It should be atomic since it is just updating a boolean, but that is not guaranteed. One thing we could do there is make it a std::atomic<bool>, which would guarantee it.
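As a sketch, the suggestion amounts to changing the member declaration so that unlocked assignments such as pimpl_->goal_request_ready_ = false; are well-defined:

#include <atomic>

// Sketch only: the flag in ServerBaseImpl would change from
//   bool goal_request_ready_;
// to an atomic, making unlocked loads and stores well-defined.
std::atomic<bool> goal_request_ready_{false};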

@daisukes (Contributor, Author):

Thanks for the suggestion! I changed it to std::atomic<bool>.

@daisukes (Contributor, Author):

Also, I believe we need to take the mutex in execute_result_request_received and in execute_check_expired_goals to be totally thread-safe.

They already have a lock that also protects unordered_map. Do you want me to make their scope smaller too?

void
ServerBase::execute_result_request_received(std::shared_ptr<void> & data)
{
  auto shared_ptr = std::static_pointer_cast
    <std::tuple<rcl_ret_t, std::shared_ptr<void>, rmw_request_id_t>>(data);
  auto ret = std::get<0>(*shared_ptr);
  std::lock_guard<std::recursive_mutex> lock(pimpl_->action_server_reentrant_mutex_);

void
ServerBase::execute_check_expired_goals()
{
  // Allocate expecting only one goal to expire at a time
  rcl_action_goal_info_t expired_goals[1];
  size_t num_expired = 1;
  // Loop in case more than 1 goal expired
  while (num_expired > 0u) {
    std::lock_guard<std::recursive_mutex> lock(pimpl_->action_server_reentrant_mutex_);

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@clalancette (Contributor):

They already have a lock that also protects unordered_map. Do you want me to make their scope smaller too?

Oops, yeah, that's right. Let's make the scope smaller there as well.
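A narrowed scope in the expiry loop could look roughly like the sketch below (illustrative only; the rcl_action call is replaced with a placeholder so the sketch stays self-contained):

#include <cstddef>
#include <mutex>

struct ExpiryImplSketch {
  std::recursive_mutex action_server_reentrant_mutex_;
};

void execute_check_expired_goals_sketch(ExpiryImplSketch & pimpl)
{
  std::size_t num_expired = 1;
  // Loop in case more than one goal expired.
  while (num_expired > 0u) {
    {
      // Hold the action-server mutex only around the rcl_action_* call itself
      // (the call that fills expired_goals and updates num_expired).
      std::lock_guard<std::recursive_mutex> lock(pimpl.action_server_reentrant_mutex_);
      num_expired = 0;  // placeholder for the real call so the sketch terminates
    }
    // Per-expired-goal bookkeeping (erasing from the maps, etc.) happens
    // outside this lock, under the map mutex as needed.
  }
}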

add unordered_map_mutex_ to protect some unordered_maps

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@clalancette (Contributor) left a comment:

I'm going to go ahead and approve this, with or without the change of goal_request_ready to std::atomic<bool>. I'm also going to fire up CI on it and see what happens.

@clalancette (Contributor):

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@clalancette (Contributor):

@daisukes Thanks for the update. One more CI run, then I'm happy to merge:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@clalancette (Contributor):

Looking good to me, so I'm going to merge. Thanks @daisukes for the contribution and for sticking with it.

@clalancette clalancette merged commit ca3ad7d into ros2:master Jan 14, 2021
daisukes pushed a commit to daisukes/rclcpp that referenced this pull request Jan 15, 2021
Signed-off-by: Daisuke Sato <daisueks@cmu.edu>
@daisukes (Contributor, Author):

@clalancette and @SteveMacenski, thank you for your kind support!

I wanted to include this fix in foxy initially and made a backport for my project. Do you want me to make another PR for foxy as well?
https://github.com/daisukes/rclcpp/tree/foxy-deadlock-fix-backport

While working on that, I found that these lines were deleted by commit 8c8c268:

pimpl_->cancel_request_ready_ = false;

pimpl_->result_request_ready_ = false;

I'm not sure these can cause problems.

@clalancette (Contributor):

I'm not sure these can cause problems.

I'm honestly not sure. With the tests passing, I'm inclined to say things are working. But if you find during your analysis that you think those should be set, please feel free to open another PR to restore them.

I wanted to include this fix in foxy initially and made a backport for my project. Do you want me to make another PR for foxy as well?

If you are interested in having the fix in Foxy, please do. The good news with this PR is that since everything is PIMPL-style, it should be API and ABI compatible.

@SteveMacenski (Collaborator):

@clalancette and @SteveMacenski, thank you for your kind support!

You did the work, I just checked in on occasion. Thanks for sticking with it 🥳!

daisukes added a commit to daisukes/rclcpp that referenced this pull request Jan 16, 2021
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
jacobperron pushed a commit that referenced this pull request Jan 20, 2021
* Add missing locking to the rclcpp_action::ServerBase. (#1421)

This patch actually does 4 related things:

1.  Renames the recursive mutex in the ServerBaseImpl class
to action_server_reentrant_mutex_, which makes it a lot
clearer what it is meant to lock.
2.  Adds some additional error checking where checks were missed.
3.  Adds a lock to publish_status so that the action_server
structure is protected.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>

* [backport] Fix action server deadlock (#1285, #1313)

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>

* revert comment

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>

Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
Karsten1987 pushed a commit that referenced this pull request Mar 2, 2021
* unlock action_server_reentrant_mutex_ before calling user callback functions
add an additional lock to keep previous behavior broken by deadlock fix

Also add a test case to reproduce deadlock situation in rclcpp_action

Signed-off-by: Daisuke Sato <daisukes@cmu.edu>