Exception thrown while waiting for action result: RCLError "failed to add guard condition to wait set" #2163

Open
schornakj opened this issue Apr 10, 2023 · 17 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 LTS
  • Installation type:
    • Binaries
  • Version or commit hash:
    • Humble, from the 2023-01-27 sync snapshot
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Calling rclcpp_action::Client::async_get_result() after rclcpp_action::Client::async_send_goal() sometimes throws rclcpp::exceptions::RCLError with a "failed to add guard condition to wait set: guard condition implementation is invalid" message.

While I don't have a fully self-contained code sample that reproduces this issue independently of my project, the code in question is similar to this:

  const auto goal_handle_future = client_->async_send_goal(goal);
  if (goal_handle_future.wait_for(goal_response_timeout_) == std::future_status::timeout)
  {
    // handle failure to get response to goal request
    return false;
  }

  const auto goal_handle = goal_handle_future.get();
  if (!goal_handle)
  {
    // handle rejection of the goal request
    return false;
  }

  auto wrapped_result_future = client_->async_get_result(goal_handle);
  if (wrapped_result_future.wait_for(goal_result_timeout_) == std::future_status::timeout)
  {
    // handle timeout before action result is received
    return false;
  }

  // handle success
  return true;

Expected behavior

If the goal_handle future returned from async_send_goal() is set (indicating that the goal request was accepted), then I should always be able to use the goal_handle to wait for the action result.

Actual behavior

For some small proportion of action goals (on the order of 1 in 100), an exception is thrown:

    [test_node_main-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
    [test_node_main-1]   what():  failed to add guard condition to wait set: guard condition implementation is invalid, at ./src/rcl/guard_condition.c:172, at ./src/rcl/wait.c:460
    [test_node_main-1] Stack trace (most recent call last) in thread 18342:
    [test_node_main-1] #16   Object "/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2", at 0xffffffffffffffff, in 
    [test_node_main-1] #15   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe4282bb3, in __clone
    [test_node_main-1] #14   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41f1b42, in 
    [test_node_main-1] #13   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe44812b2, in 
    [test_node_main-1] #12   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b418e1, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long)
    [test_node_main-1] #11   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b3aab2, in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    [test_node_main-1] #10   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b37d94, in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    [test_node_main-1] #9    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b4ac86, in 
    [test_node_main-1] #8    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b385ab, in rclcpp::detail::add_guard_condition_to_rcl_wait_set(rcl_wait_set_s&, rclcpp::GuardCondition const&)
    [test_node_main-1] #7    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b34838, in rclcpp::exceptions::throw_from_rcl_error(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rcutils_error_state_s const*, void (*)())
    [test_node_main-1] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe445323d, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
    [test_node_main-1] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe44532b6, in std::terminate()
    [test_node_main-1] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe445324b, in 
    [test_node_main-1] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe4447bbd, in 
    [test_node_main-1] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41857f2, in abort
    [test_node_main-1] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe419f475, in raise
    [test_node_main-1] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41f3a7c, in pthread_kill
    [test_node_main-1] Aborted (Signal sent by tkill() 18291 0)

Additional information

Based on the stack trace, this looks like a problem at the intersection of the rclcpp action client and the MultiThreadedExecutor, an area where I've hit roughly similar issues in the past.

@schornakj (Author) commented Apr 10, 2023

There's a gap in the stack trace between rclcpp::Executor::wait_for_work and rclcpp::detail::add_guard_condition_to_rcl_wait_set, so I searched through the rclcpp code and I think I found the specific place in rclcpp::Executor::wait_for_work where this originates:

There are a few places where add_guard_condition_to_rcl_wait_set is used, but the one that seems relevant to the executor is this usage in AllocatorMemoryStrategy::add_handles_to_wait_set:

    for (auto guard_condition : guard_conditions_) {
      detail::add_guard_condition_to_rcl_wait_set(*wait_set, *guard_condition);
    }

This would end up being called here in Executor::wait_for_work:

    if (!memory_strategy_->add_handles_to_wait_set(&wait_set_)) {
      throw std::runtime_error("Couldn't fill wait set");
    }

In this case, if add_guard_condition_to_rcl_wait_set throws an exception, it isn't caught in any of the functions I linked, so it propagates back up through the executor and produces the stack trace I originally posted.
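
As an aside, the stack trace suggests the throw happens on one of the MultiThreadedExecutor's worker threads, which would explain why it surfaces as terminate/abort rather than as an exception that user code can catch: an exception that escapes the top-level function of a std::thread always calls std::terminate. A minimal plain-C++ sketch of that behavior (no rclcpp involved):

    #include <stdexcept>
    #include <thread>

    int main()
    {
      // An exception that propagates out of a std::thread's top-level function
      // cannot be caught by the thread that calls join(); the C++ runtime calls
      // std::terminate(), which aborts the process -- matching the
      // "terminate called after throwing an instance of ..." output above.
      std::thread worker([]() {
        throw std::runtime_error("failed to add guard condition to wait set");
      });

      try {
        worker.join();  // too late: the exception never reaches this scope
      } catch (const std::exception &) {
        // never reached
      }
      return 0;
    }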

@Barry-Xu-2018 (Collaborator) commented Apr 12, 2023

VectorRebind<const rclcpp::GuardCondition *> guard_conditions_;

A shared_ptr wasn't used here, so this problem occurs.
I don't know why it isn't used; maybe there is a reason.

@mjcarroll (Member)

A shared_ptr wasn't used here, so this problem occurs.

I have an update in flight to the executors that should make the handling of the guard conditions more thread-safe. I would be interested to know if you can reproduce with this PR, @schornakj: #2142

@fujitatomoya (Collaborator)

VectorRebind<const rclcpp::GuardCondition *> guard_conditions_;

A shared_ptr wasn't used here, so this problem occurs. I don't know why it isn't used; maybe there is a reason.

@Barry-Xu-2018 can you elaborate a bit here? I think that get_next_executable is protected by wait_mutex_ in MultiThreadedExecutor. Why (or how) would having guard_conditions_ as shared pointers solve this problem?

@Barry-Xu-2018 (Collaborator)

@fujitatomoya

The problem is related to using a raw rclcpp::GuardCondition * for guard_conditions_.

The guard condition for a node is defined as

rclcpp::GuardCondition notify_guard_condition_;

While executing the code below,

    ret = rcl_wait_set_resize(
      &wait_set_, memory_strategy_->number_of_ready_subscriptions(),
      memory_strategy_->number_of_guard_conditions(), memory_strategy_->number_of_ready_timers(),
      memory_strategy_->number_of_ready_clients(), memory_strategy_->number_of_ready_services(),
      memory_strategy_->number_of_ready_events());
    if (RCL_RET_OK != ret) {
      throw_from_rcl_error(ret, "Couldn't resize the wait set");
    }
    if (!memory_strategy_->add_handles_to_wait_set(&wait_set_)) {
      throw std::runtime_error("Couldn't fill wait set");
    }

the node may be destroyed before add_handles_to_wait_set() is called, so the rclcpp::GuardCondition * stored in guard_conditions_ becomes invalid. This leads to an error when this function is called:

    for (auto guard_condition : guard_conditions_) {
      detail::add_guard_condition_to_rcl_wait_set(*wait_set, *guard_condition);
    }
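
Restated as a minimal plain-C++ sketch (hypothetical stand-in types, only to illustrate the lifetime race; this is not rclcpp code): the memory strategy keeps only a raw pointer to a guard condition whose lifetime is tied to the node, so destroying the node on another thread between collecting the handles and filling the wait set leaves a dangling pointer.

    #include <memory>
    #include <vector>

    struct GuardCondition {};  // stand-in for rclcpp::GuardCondition

    struct Node  // stand-in for rclcpp::Node: it owns its notify guard condition
    {
      GuardCondition notify_guard_condition_;
    };

    int main()
    {
      auto node = std::make_shared<Node>();

      // Stand-in for guard_conditions_: raw pointers, no ownership or liveness check.
      std::vector<const GuardCondition *> guard_conditions;
      guard_conditions.push_back(&node->notify_guard_condition_);

      // Another thread destroys the node here (e.g. a component is unloaded),
      // after the handles were collected but before the wait set is filled.
      node.reset();

      for (const auto * gc : guard_conditions) {
        // Dereferencing gc at this point would be a use-after-free; in rclcpp it
        // surfaces as "guard condition implementation is invalid" (or worse).
        (void)gc;
      }
      return 0;
    }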

@mjcarroll (Member)

In the new executor structures (https://github.com/ros2/rclcpp/pull/2143/files), I have worked around this by making the node return a rclcpp::GuardCondition::SharedPtr, which makes it more consistent with the CallbackGroup API.

These guard conditions are all added to a single waitable and held as weak pointers, which are locked right before adding to the rcl_wait_set so that we can ensure they are still valid: https://github.com/ros2/rclcpp/blob/mjcarroll/executor_structures/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp#L46-L64
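
In other words, the pattern is roughly the following (a sketch of the idea only, with an invented class and member names rather than the actual ExecutorNotifyWaitable code; header paths are from memory):

    #include <memory>
    #include <utility>
    #include <vector>

    #include "rcl/wait.h"
    #include "rclcpp/detail/add_guard_condition_to_rcl_wait_set.hpp"
    #include "rclcpp/guard_condition.hpp"

    // Hypothetical holder, only to illustrate the weak_ptr + lock pattern.
    class NotifyWaitableSketch
    {
    public:
      void add_guard_condition(std::weak_ptr<rclcpp::GuardCondition> weak_gc)
      {
        notify_guard_conditions_.push_back(std::move(weak_gc));
      }

      void add_to_wait_set(rcl_wait_set_t * wait_set)
      {
        for (auto & weak_gc : notify_guard_conditions_) {
          // Lock right before touching the rcl wait set: if the owning node has
          // already been destroyed, lock() returns nullptr and the entry is
          // skipped instead of dereferencing a dangling pointer.
          if (auto gc = weak_gc.lock()) {
            rclcpp::detail::add_guard_condition_to_rcl_wait_set(*wait_set, *gc);
          }
        }
      }

    private:
      // Weak, not raw or shared: the waitable must not keep nodes alive, but it
      // must be able to detect when they are gone.
      std::vector<std::weak_ptr<rclcpp::GuardCondition>> notify_guard_conditions_;
    };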

@fujitatomoya (Collaborator)

@mjcarroll @Barry-Xu-2018 thank you very much for the explanation 👍

@schornakj (Author) commented Apr 13, 2023

A shared_ptr wasn't used here, so this problem occurs.

I have an update in flight to the executors that should make the handling of the guard conditions more thread-safe. I would be interested to know if you can reproduce with this PR, @schornakj: #2142

Cool! That looks like a good set of changes to a very important part of the core ROS 2 functionality.

This will be somewhat tricky for me to replicate on my robots, though, since I'm currently using a specific version of Humble and I'd need to work out some sort of ad-hoc Rolling-based deployment.

@schornakj (Author)

I've been working on isolating this issue, and I think it's related to creating or destroying a Subscriber in one thread while waiting on an action result future in a different thread.

I'll try to create a reproducible example now that I have a clearer idea of what's going on.
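
The shape of the reproduction I have in mind is roughly this (an untested sketch, not my actual project code; it assumes a Fibonacci action server named "fibonacci" from the action tutorials is running elsewhere, and the topic and node names are made up):

    #include <atomic>
    #include <chrono>
    #include <memory>
    #include <thread>

    #include "action_tutorials_interfaces/action/fibonacci.hpp"
    #include "rclcpp/rclcpp.hpp"
    #include "rclcpp_action/rclcpp_action.hpp"
    #include "std_msgs/msg/string.hpp"

    using Fibonacci = action_tutorials_interfaces::action::Fibonacci;
    using namespace std::chrono_literals;

    int main(int argc, char ** argv)
    {
      rclcpp::init(argc, argv);
      auto node = std::make_shared<rclcpp::Node>("repro_node");
      auto client = rclcpp_action::create_client<Fibonacci>(node, "fibonacci");

      rclcpp::executors::MultiThreadedExecutor executor;
      executor.add_node(node);
      std::thread spin_thread([&executor]() {executor.spin();});

      // Thread 1: repeatedly create and destroy a subscription on the same node,
      // which adds and removes entities while the executor is waiting for work.
      std::atomic<bool> churn{true};
      std::thread churn_thread([&]() {
          while (churn) {
            auto sub = node->create_subscription<std_msgs::msg::String>(
              "chatter", 10, [](std_msgs::msg::String::ConstSharedPtr) {});
            std::this_thread::sleep_for(5ms);
          }  // subscription destroyed at the end of every iteration
        });

      // Thread 2 (main): send goals and wait on the result futures, following
      // the same pattern as the snippet in the original report.
      client->wait_for_action_server();
      for (int i = 0; i < 1000 && rclcpp::ok(); ++i) {
        Fibonacci::Goal goal;
        goal.order = 5;
        auto goal_handle_future = client->async_send_goal(goal);
        if (goal_handle_future.wait_for(1s) != std::future_status::ready) {
          continue;
        }
        auto goal_handle = goal_handle_future.get();
        if (!goal_handle) {
          continue;
        }
        auto result_future = client->async_get_result(goal_handle);
        result_future.wait_for(5s);
      }

      churn = false;
      churn_thread.join();
      executor.cancel();
      spin_thread.join();
      rclcpp::shutdown();
      return 0;
    }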

@clalancette (Contributor)

I'll try to create a reproducible example now that I have a clearer idea of what's going on.

Yeah, that would be fantastic. It would be good to have a test that shows this problem, and thus shows that @mjcarroll's fixes actually fix it (hopefully).

@fmrico (Contributor) commented Jun 8, 2023

Hi,

In case it helps: I was working on a tutorial and ran into this problem. The error is reliably reproducible, which may help solve it by providing a testbed for this issue.

https://github.com/fmrico/vector_transmission

In different terminals:

ros2 run rclcpp_components component_container
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorProducer
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorConsumer

The error always happens when unloading the Consumer:

ros2 component unload /ComponentManager 2

I hope this helps!!

@schornakj (Author)

@fmrico thank you for sharing that! I've been struggling to come up with a minimal reproducible example that's independent of my project, and your test helps narrow down the problem a lot.

I modified the VectorConsumer node in your example to remove everything except for a single subscriber and the crash still happens. I think that shows that the specific cause of this crash is destroying the subscriber.
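
Roughly, the stripped-down consumer looks like this (a sketch from memory; the message type and names are approximations of the vector_transmission example, not a verbatim copy):

    #include "rclcpp/rclcpp.hpp"
    #include "rclcpp_components/register_node_macro.hpp"
    #include "std_msgs/msg/float64_multi_array.hpp"

    namespace vector_transmission
    {

    class VectorConsumer : public rclcpp::Node
    {
    public:
      explicit VectorConsumer(const rclcpp::NodeOptions & options)
      : rclcpp::Node("vector_consumer", options)
      {
        // A single subscription and nothing else: unloading this component
        // (destroying the node, its subscription, and the node's notify guard
        // condition) is enough to trigger the crash.
        sub_ = create_subscription<std_msgs::msg::Float64MultiArray>(
          "vector", 10, [](std_msgs::msg::Float64MultiArray::ConstSharedPtr) {});
      }

    private:
      rclcpp::Subscription<std_msgs::msg::Float64MultiArray>::SharedPtr sub_;
    };

    }  // namespace vector_transmission

    RCLCPP_COMPONENTS_REGISTER_NODE(vector_transmission::VectorConsumer)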

@sea-bass

Hey folks -- I just tried @fmrico's repo and reproduction steps on the latest Rolling binaries and could still reproduce the exception.

@moriarty commented Aug 3, 2023

@sea-bass the latest Rolling binaries aren't expected to fix this.

I tried @fmrico's steps (#2163 (comment)) to reproduce with the open PR #2142 from @mjcarroll, and this seems to work.

~/ros/r/rclcpp$ vcs status
..
=== ./src/rclcpp (git) ===
On branch mjcarroll/rclcpp_waitset_executor
Your branch is up to date with 'origin/mjcarroll/rclcpp_waitset_executor'.

nothing to commit, working tree clean
=== ./src/vector_transmission (git) ===
On branch main
Your branch is up to date with 'origin/main'.

@sea-bass

I'm also running into a similar crash with rclpy on Humble, by the way, when using the rosbridge_suite stack to advertise/unadvertise an action server. The stack trace is:

[rosbridge_websocket-2] ERROR:tornado.application:Exception in callback <function main.<locals>.<lambda> at 0x7f83e0230b80>
[rosbridge_websocket-2] Traceback (most recent call last):
[rosbridge_websocket-2]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 905, in _run
[rosbridge_websocket-2]     return self.callback()
[rosbridge_websocket-2]   File "/opt/underlay_ws/install/rosbridge_server/lib/rosbridge_server/rosbridge_websocket", line 340, in <lambda>
[rosbridge_websocket-2]     spin_callback = PeriodicCallback(lambda: executor.spin_once(timeout_sec=0.01), 1)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 705, in spin_once
[rosbridge_websocket-2]     handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 691, in wait_for_ready_callbacks
[rosbridge_websocket-2]     return next(self._cb_iter)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 585, in _wait_for_ready_callbacks
[rosbridge_websocket-2]     waitable.add_to_wait_set(wait_set)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 499, in add_to_wait_set
[rosbridge_websocket-2]     self._handle.add_to_waitset(wait_set)
[rosbridge_websocket-2] rclpy._rclpy_pybind11.RCLError: Failed to add 'rcl_action_server_t' to wait set: action server pointer is invalid, at ./src/rcl_action/action_server.c:929

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/manipulation-wg-moveit-maintainer-meeting-november-30-rescheduled/34686/2
