Fix action server deadlock (#1285) #1313
Conversation
rclcpp_action/src/server.cpp (Outdated)

```cpp
// Publish a status message any time a goal handle changes state
publish_status();
// notify base so it can recalculate the expired goal timer
notify_goal_terminal_state();
```
I think the problem is the lock order, not the locking time window, so I believe the deadlock can still occur with this fix (in case, right after `rclcpp_action::ServerGoalHandle::succeed` is issued, `rclcpp_action::ServerBase::execute_goal_request_received` is fired in another thread, which then calls the user callback `handle_goal`).
I think `reentrant_mutex_` must be used internally everywhere except around user callbacks.
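For readers following the discussion, here is a minimal, self-contained sketch of the AB/BA lock inversion being described; the mutex and function names are illustrative stand-ins, not the actual rclcpp_action internals:

```cpp
#include <mutex>
#include <thread>

std::recursive_mutex action_mutex;  // stands in for reentrant_mutex_
std::mutex user_mutex;              // a lock owned by user code

// User worker thread: holds the user lock, then succeed() needs the
// action server lock (order: user -> action).
void user_worker() {
  std::lock_guard<std::mutex> user_lock(user_mutex);
  std::lock_guard<std::recursive_mutex> action_lock(action_mutex);
}

// Executor thread: holds the action server lock, then the handle_goal
// callback needs the user lock (order: action -> user).
void server_executor() {
  std::lock_guard<std::recursive_mutex> action_lock(action_mutex);
  std::lock_guard<std::mutex> user_lock(user_mutex);
}

int main() {
  std::thread t1(user_worker);
  std::thread t2(server_executor);
  t1.join();  // with unlucky timing this never returns: classic AB/BA deadlock
  t2.join();
}
```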
Yeah, I agree with this. This only solved my test case.

> I think reentrant_mutex_ must be used internally except user callback.

I changed the fix; is this what you had in mind?
I think that this works, but I'd like to hear from others.
Does this target the correct branch? This bug probably exists on the primary branch. Also, are you able to create a unit test out of your test code and add that to rclcpp_action's tests? I'm about to add a bunch of tests with #1290 and I haven't encountered this issue. It would be awesome to add a unit test that would prevent this bug from reoccurring once this PR is merged.
My project is based on foxy, so I made a PR for foxy, but yes, I think this can target the primary branch as well.
Okay, I will try it. My test code is based on Navigation2, but I think I can reproduce its mutex behavior.
@brawner I think I could reproduce the error with a unit test. How do you want me to commit the unit test?
If you need the infrastructure in rclcpp_action-test-coverage then you'll have to target that branch. If, however, you can just include the test with this PR, then that's ideal because it would make backporting to foxy a lot easier.
I put the test here to prove my patch can fix the issue. I changed the order of the commits to make testing before/after the fix easier.
rclcpp_action/src/server.cpp (Outdated)

```cpp
lock.unlock();

// Call user's callback, getting the user's response and a ros message to send back
auto response_pair = call_handle_goal_callback(uuid, message);

lock.lock();
```
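As a self-contained sketch, the pattern this diff introduces looks roughly like the following; the surrounding function body is simplified and `handle_goal` stands in for the user callback, so treat it as an outline rather than the actual implementation:

```cpp
#include <functional>
#include <mutex>

std::recursive_mutex reentrant_mutex;

void execute_goal_request_received(const std::function<bool()> & handle_goal) {
  std::unique_lock<std::recursive_mutex> lock(reentrant_mutex);
  // ... take the goal request via rcl while holding the lock ...

  lock.unlock();
  // Run the user's callback without holding the server lock, so a user
  // mutex held in another thread cannot invert the lock order.
  const bool accepted = handle_goal();
  lock.lock();

  // ... send the goal response via rcl while holding the lock again ...
  (void)accepted;
}
```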
These types of lock/unlock sequences are often problematic. They often lead to race conditions where two threads come in here simultaneously, then race to the `lock.lock()`. Only one will win (and the other will have to wait), but what if T1 comes in first, calls the callback, but then T2 is the one that reacquires the lock first? Is that situation OK?
Can you explain more about the original deadlock situation? I'm wondering if there are other solutions we can use here.
> if T1 comes in first, calls the callback, but then T2 is the one that reacquires the lock first? Is that situation OK?
There are three `call_*_callback` functions, and they do not call any `rcl` functions that need to be protected by the lock (`pimpl_->reentrant_mutex_`):

- `call_handle_goal_callback`: uses only local variables
- `call_goal_accepted_callback` and `call_handle_cancel_callback`: these access `goal_handles_`, but that is also protected by `goal_handles_mutex_`

So I believe these three functions are thread-safe as long as the user callbacks are thread-safe, which we don't need to worry about.
> Can you explain more about the original deadlock situation? I'm wondering if there are other solutions we can use here.
I think explaining with my test code is the best way. The user code has its own mutex, which is used in the callback functions and also in its working thread, which calls the succeed function.
rclcpp/rclcpp_action/test/test_server.cpp, Lines 994 to 1003 in 938f973

```cpp
send_goal_request(node_, uuid1_);
// this will lock wrapper's mutex and intentionally wait 100ms for calling succeed
// to try to acquire the lock of rclcpp_action mutex
std::thread t(&TestDeadlockServer::GoalSucceeded, this);
// after the wrapper's mutex is locked and before succeed is called
rclcpp::sleep_for(std::chrono::milliseconds(50));
// call next goal request to intentionally reproduce deadlock
// this first locks rclcpp_action mutex and then call callback to lock wrapper's mutex
send_goal_request(node_, uuid2_);
t.join();
```
- user code: locks the user mutex and calls succeed, which eventually locks the rclcpp_action mutex (L997)
- rclcpp_action: locks the rclcpp_action mutex and calls the callbacks, which lock the user mutex (L1002)
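A condensed sketch of the user-side wrapper pattern exercised by this test; `GoalHandle`, `UserWrapper`, and the method names are illustrative placeholders rather than the real nav2 or rclcpp_action types:

```cpp
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>

// Stand-in for rclcpp_action's goal handle; succeed() internally takes
// the action server's reentrant mutex.
struct GoalHandle {
  void succeed() { /* locks the rclcpp_action mutex */ }
};

class UserWrapper {
  std::mutex user_mutex_;  // the wrapper's own lock
  std::shared_ptr<GoalHandle> current_ = std::make_shared<GoalHandle>();

public:
  // Invoked from the executor while rclcpp_action already holds its mutex.
  void handle_goal() {
    std::lock_guard<std::mutex> lock(user_mutex_);  // action -> user order
  }

  // Invoked from the wrapper's worker thread.
  void goal_succeeded() {
    std::lock_guard<std::mutex> lock(user_mutex_);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    current_->succeed();  // user -> action order: the inversion
  }
};
```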
I think the most complicated case is the following. As a precondition, assume the goals are always accepted and there are only two threads.

1. (T-1) `goal_request_ready_` is ready, then `execute_goal_request_received` is called with `reentrant_mutex_` locked
2. (T-1) unlocks `reentrant_mutex_`, then calls `call_handle_goal_callback` (accepted)
3. (T-2) `goal_request_ready_` is ready, then `execute_goal_request_received` is called with `reentrant_mutex_` locked
4. (T-2) unlocks `reentrant_mutex_`, then calls `call_handle_goal_callback` (accepted)
5. (T-2) locks `reentrant_mutex_`, calls `rcl_action_send_goal_response` and the rest
6. (T-2) unlocks `reentrant_mutex_`
7. (T-1) locks `reentrant_mutex_`, calls `rcl_action_send_goal_response` and the rest
8. (T-1) unlocks `reentrant_mutex_`

I believe this is a new problem introduced by this fix, because the goal order between acceptance and execution is not consistent. (It is really rare, but possible.)
If that order should be guaranteed, we can put another mutex which is only used at the beginning of `execute_goal_request_received`.
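A sketch of that suggestion, assuming a second mutex held for the whole of `execute_goal_request_received` so goal N's response is sent before goal N+1 starts; the names are illustrative and this is not the final patch:

```cpp
#include <mutex>

std::recursive_mutex reentrant_mutex;     // protects the rcl calls
std::recursive_mutex goal_request_mutex;  // serializes whole goal requests

void execute_goal_request_received() {
  // Taken first and held to the end: a second thread cannot begin handling
  // the next goal request until this one has sent its response, preserving
  // "first come, first accepted & executed".
  std::lock_guard<std::recursive_mutex> order(goal_request_mutex);

  std::unique_lock<std::recursive_mutex> lock(reentrant_mutex);
  // ... rcl_action_take_goal_request ...
  lock.unlock();
  // ... user handle_goal callback runs with no server lock held ...
  lock.lock();
  // ... rcl_action_send_goal_response ...
}
```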
@fujitatomoya I don't think the potential issue is critical, because it only happens when the action server is executed by a multithreaded executor.
Yes, obviously that is one of the cases. I think the `reentrant_mutex_` is there to provide exclusive access to `execute_goal_request_received`, since that was considered an issue. IMO, it would be better not to change the behavior. For example, a user application may expect first come, first accepted & executed, which I guess is really common; this cannot be guaranteed by the fix. I'd like to hear from others.
I added another lock to keep the previous behavior.
I have an application that is frequently changing controller goals and I was running into this deadlock in nearly every single mission. I applied the diff in this PR to my server.cpp file locally and haven't been able to reproduce the deadlock.
@clalancette @fujitatomoya
I cannot think of any other way to support the current behavior at this moment, but before moving forward, it may be better to discuss this and reach a consensus, since it introduces an additional mutex.
rclcpp_action/src/server.cpp (Outdated)

```diff
@@ -45,6 +45,7 @@ class ServerBaseImpl

   // Lock everything except user callbacks
   std::recursive_mutex reentrant_mutex_;
+  std::recursive_mutex goal_request_mutex_;
```
I think this can be a private member of `ServerBase`.
Do you mean both mutexes or only `goal_request_mutex_`? Which do you prefer?
Hi @clalancette, okay, I will rebase them on your change.
So I want to back way up here and look at the original problem. I think the most instructive thing to look at is the gist at https://gist.github.com/daisukes/712316a97832f5d9ab851ad47c77ad98 . In there, I see the deadlock that caused @daisukes the problem in the first place. Exactly as he says, there is a lock inversion problem.
However, looking more closely at that code shows me something else that seems weird. https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L128 seems to show that a new thread is spawned to do the actual work, so the callback should return (and release the user's lock) without blocking. In other words, I think that implementation in nav2_util's simple_action_server shouldn't be able to deadlock in the way described. Relatedly, can you give detailed instructions on how to reproduce? I tried your reproducer program at https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d , but so far I haven't been able to reproduce it.
@clalancette Here is the test code to check that the deadlock can happen, which I committed within this PR. Maybe my explanation was bad, but basically, it happens with two consecutive goal requests.
Hello @clalancette, I have reread your comment.
I remember that I could reproduce the lock with this snippet only with my configuration, which has a larger map compared to the nav2 default turtle map. So I wrote the test code which I mentioned in the previous comment.
Yes, but my point is a little different. If this was truly an independent thread, then it would kick off the new thread, then return from the callback back to the rclcpp_action code, releasing the user's lock along the way. So my question is: why is that not happening?
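A sketch of the expectation being voiced here, assuming a nav2-style wrapper that kicks off a worker thread from the accepted-goal callback; the class and names are hypothetical:

```cpp
#include <memory>
#include <mutex>
#include <thread>

struct ServerGoalHandle { /* stand-in for rclcpp_action's handle */ };

class Wrapper {
  std::mutex user_mutex_;
  std::thread work_thread_;

public:
  // Runs inside the rclcpp_action callback path.
  void handle_accepted(std::shared_ptr<ServerGoalHandle> handle) {
    std::lock_guard<std::mutex> lock(user_mutex_);
    if (work_thread_.joinable()) {
      work_thread_.detach();  // let the previous worker finish on its own
    }
    work_thread_ = std::thread([handle]() {
      // ... long-running execution; succeed()/abort() at the end ...
    });
    // Returning releases user_mutex_ before control goes back to
    // rclcpp_action, so the inversion window should, in theory, be gone.
  }
};
```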
This deadlock problem is very tricky, as I wrote in issue #1285; the deadlock happens in situations like the one described there. I'm not sure, but from my experience, higher CPU load causes more deadlocks.
This PR solved my issue. Thanks.
Hi @daisukes and @justinIRBT, I'm developing an application using the new Navigation2 Dynamic Object Following feature. The publish rate of the goal update is 20Hz, and the RateController in my BT file is also 20Hz. Thanks to this PR, my application no longer gets a deadlock. However, I encounter a new issue: the bt_action_node of navigate_to_pose reports a server_timeout. I've tried increasing the server_timeout from the default 10ms to 100ms. 10 minutes later, everything seems to be fine, no more server_timeout happens, but the controller server begins yelling about missing its control loop rate. Do you guys encounter the same issues, like the server_timeout or the missed control loop rate?
@QQting can you email me a copy of that video (stevenmacenski@gmail.com)? I'd love to use that in our documentation, it's hilarious.
I think that it would be nice to keep track of this in a new issue.
I'm not sure about creating a new issue, because my situation is based on this PR, which hasn't been merged yet. And I've also tested the latest rclcpp_action on the master branch; the deadlock still happens in my application.
okay, so maybe after this PR is merged then 👍
rclcpp/rclcpp_action/src/server.cpp Lines 46 to 47 in dd0f97f
rclcpp/rclcpp_action/src/server.cpp Lines 236 to 237 in dd0f97f
rclcpp/rclcpp_action/src/server.cpp Line 353 in dd0f97f
rclcpp/rclcpp_action/src/server.cpp Line 289 in dd0f97f
Maybe we might just unlock the mutex at these points before calling the user callbacks.
Yes, that's what @daisukes suggests.
@clalancette what's blocking this now?
I'm still not totally convinced that this is the problem. I'll have to go back to look at it again, but when I last looked at it, https://github.com/ros-planning/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L128 should not have been holding the lock, and thus the deadlock should not be happening. I don't think we have an adequate explanation for why that is the case. Additionally, doing the unlock/relock around the user callbacks has its own problems, as pointed out earlier. Overall, I don't think we understand the problem enough yet to put in a solution.
@clalancette would it be worth getting on a call? Ticket #1285 clearly explains the issue. Sure, it appears in Nav2's simple action server, but that could be anyone's arbitrary wrapper around an action server. At https://gist.github.com/daisukes/46298b083a48e5db7016f3c0efba7e2d there's a snippet provided by @daisukes showing that this fixes a real problem with frequently running goals and user locks. Justin also reported this, and I don't think it was in a nav2 context. I don't disagree with you that the lock/unlock pattern isn't ideal.
All right, I've spent some time and taken another look at this problem. Thanks for bearing with me while I found the time to do this. First of all, I spent the time to create a small reproducer. If you clone https://github.com/clalancette/deadlock, you should be able to build it in a workspace (along with this PR's branch).
Terminal 1:
Terminal 2:
This will usually deadlock within a few minutes (at least, on my machine). Having this minimal reproducer is helpful for looking at the stack traces.

And restating what we've already said above, the deadlock is of the AB/BA type. Inside of the server, in thread 1 we end up calling the user's callback while holding the action server's mutex, and that callback then waits on the user's own lock; in thread 2, the user's code holds its own lock and calls back into the action server, which waits on the action server's mutex.

Thread 1:
Thread 2:

So that's our deadlock. I'm actually going to ignore the part I don't quite understand about why the nav2-style wrapper is holding its lock at that point; the lock inversion is enough to explain the hang.

Now that I've looked at the code in https://github.com/ros2/rclcpp/blob/master/rclcpp_action/src/server.cpp , I actually agree with the direction that this PR is heading in. The main reason I agree with it is that the server should not be holding its internal mutex while calling into user callbacks.

That being said, here are the issues I have with how the PR is implemented:
@daisukes If you can make those changes, and make sure all of the tests pass, I'm happy to then run CI on it and look at merging it. Thanks again for your patience here.
…lback functions" This reverts commit d2bd783. Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@clalancette Thank you for taking the time to look into the problem.
A couple of changes inline.
Also, I believe we need to take the mutex in `execute_result_request_received` and in `execute_check_expired_goals` to be totally thread-safe.
rclcpp_action/src/server.cpp (Outdated)

```cpp
{
  std::lock_guard<std::recursive_mutex> lock(pimpl_->action_server_reentrant_mutex_);
  pimpl_->goal_handles_[uuid] = handle;
}
```
I don't think we need this particular lock here. At least, this lock is meant to protect the `action_server`, and that is not being accessed here.
rclcpp/rclcpp_action/src/server.cpp, Line 526 in 6feedbe

```cpp
pimpl_->goal_handles_.erase(uuid);
```

`goal_handles_` can be changed simultaneously, so I put a lock guard here too.
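To make the hazard concrete, here is a minimal sketch of why the concurrent insert and erase need a mutex; the names mirror the discussion but the types are simplified stand-ins (the real key is a GoalUUID with a custom hash):

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct GoalHandle { /* stand-in for the real goal handle */ };

std::mutex goal_handles_mutex;  // hypothetical dedicated lock
std::unordered_map<std::string, std::weak_ptr<GoalHandle>> goal_handles;

void on_goal_accepted(const std::string & uuid,
                      const std::shared_ptr<GoalHandle> & handle) {
  std::lock_guard<std::mutex> lock(goal_handles_mutex);
  goal_handles[uuid] = handle;  // insert may rehash: racy without a lock
}

void on_goal_expired(const std::string & uuid) {
  std::lock_guard<std::mutex> lock(goal_handles_mutex);
  goal_handles.erase(uuid);  // concurrent erase would corrupt the map
}
```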
I'm sorry to be a bit pedantic here, but the name of the lock is currently `action_server_reentrant_mutex`. As a reader of the code, then, I expect that that lock will protect the `action_server`, and not necessarily protect anything else.

But I do agree that the `goal_handles` could be changed by two separate threads. There are 2 ways we can go with this:

1. Rename the lock to `pimpl_lock` or something like that. At that point, it makes more sense that the lock protects the entire PIMPL structure. But we would also need to audit the rest of the code to make sure that the lock is being held anytime anything in the structure is being modified.
2. Introduce another lock specifically to protect the `goal_handles`, and lock that as appropriate.

I'm honestly not sure which is better.
I missed this message and have already made a change, but I added another mutex for the unordered maps, including `goal_handles_`, because the pimpl structure has three groups of variables: 1) `action_server` and its attributes (size_t, bool) captured from `action_server`, 2) the three unordered maps, and 3) references (clock and logger). I think we don't need to lock the entire pimpl structure.

One line I'm not sure about is updating `goal_request_ready_` without a lock. All the size_t and bool variables are protected because they are updated together with `action_server`.

rclcpp/rclcpp_action/src/server.cpp, Line 292 in 5ae0e9a

```cpp
pimpl_->goal_request_ready_ = false;
```
> One line I'm not sure about is updating goal_request_ready_ without a lock.

Good question. It should be atomic since it is just updating a boolean, but that is not guaranteed. One thing we could do there is make it a `std::atomic<bool>`, which would guarantee it.
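A hedged illustration of that suggestion: a plain bool written in one thread and read in another is formally a data race in C++, even when it happens to work on common hardware, while `std::atomic<bool>` makes it well-defined. The class and function names below are illustrative, not the real pimpl:

```cpp
#include <atomic>

struct ServerBaseImplSketch {
  std::atomic<bool> goal_request_ready_{false};
};

// Writer side: e.g. the wait set signals a new goal request.
void on_goal_request_ready(ServerBaseImplSketch & impl) {
  impl.goal_request_ready_.store(true);  // no mutex needed
}

// Reader side: exchange() also closes the check-then-clear race that a
// separate load() followed by store(false) would leave open.
bool take_goal_request(ServerBaseImplSketch & impl) {
  return impl.goal_request_ready_.exchange(false);
}
```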
Thanks for the suggestion! I changed it to `std::atomic<bool>`.
They already have a lock that also protects the unordered_map. Do you want me to make their scope smaller too?
rclcpp/rclcpp_action/src/server.cpp, Lines 453 to 459 in 6feedbe
rclcpp/rclcpp_action/src/server.cpp, Lines 505 to 514 in 6feedbe
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
Oops, yeah, that's right. Let's make the scope smaller there as well.
add unordered_map_mutex_ to protect some unordered_maps Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
I'm going to go ahead and approve this, with or without the change of `goal_request_ready_` to `std::atomic<bool>`. I'm also going to fire up CI on it and see what happens.
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
@daisukes Thanks for the update. One more CI run, then I'm happy to merge:
Looking good to me, so I'm going to merge. Thanks @daisukes for the contribution and for sticking with it.
@clalancette and @SteveMacenski, thank you for your kind support! I wanted to include this fix in foxy initially and made a backport for my project. Do you want me to make another PR for foxy as well?

While working on that, I found that these lines were deleted by this commit 8c8c268:
rclcpp/rclcpp_action/src/server.cpp, Line 308 in 579e9d0
rclcpp/rclcpp_action/src/server.cpp, Line 392 in 579e9d0
I'm not sure whether their removal can cause problems.
I'm honestly not sure. With the tests passing, I'm inclined to say things are working. But if you find during analysis that you think those should be set, please feel free to open another PR to restore them.

If you are interested in having the fix in Foxy, please do. The good news with this PR is that since everything is PIMPL-style, it should be API and ABI compatible.
You did the work, I just checked in on occasion. Thanks for sticking with it 🥳!
* Add missing locking to the rclcpp_action::ServerBase. (#1421)
This patch actually does 4 related things:
1. Renames the recursive mutex in the ServerBaseImpl class to action_server_reentrant_mutex_, which makes it a lot clearer what it is meant to lock.
2. Adds some additional error checking where checks were missed.
3. Adds a lock to publish_status so that the action_server structure is protected.
Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
* [backport] Fix action server deadlock (#1285, #1313)
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
* revert comment
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
* unlock action_server_reentrant_mutex_ before calling user callback functions
Add an additional lock to keep the previous behavior broken by the deadlock fix.
Also add a test case to reproduce the deadlock situation in rclcpp_action.
Signed-off-by: Daisuke Sato <daisukes@cmu.edu>
To fix issue #1285.

I think the problem is that the server can accept the next goal while processing the `on_terminal_state` callback. It can be solved by putting them in a single `reentrant_mutex_` context. This does not produce any deadlock with my test code so far.
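For context on why a single `reentrant_mutex_` context works at all: `std::recursive_mutex` allows the thread that already owns it to lock it again. A rough sketch (the function names echo the diff above, but the bodies are illustrative):

```cpp
#include <mutex>

std::recursive_mutex reentrant_mutex;  // stands in for reentrant_mutex_

void notify_goal_terminal_state() {
  // Safe even when the caller already holds reentrant_mutex: a
  // recursive_mutex may be re-acquired by its owning thread.
  std::lock_guard<std::recursive_mutex> lock(reentrant_mutex);
  // ... recalculate the expired-goal timer ...
}

void publish_status_and_notify() {
  std::lock_guard<std::recursive_mutex> lock(reentrant_mutex);
  // ... publish a status message for the goal handle state change ...
  notify_goal_terminal_state();  // re-entrant call under the same mutex
}
```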