Exception thrown while waiting for action result: RCLError "failed to add guard condition to wait set" #2163

Open
schornakj opened this issue Apr 10, 2023 · 17 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 LTS
  • Installation type:
    • Binaries
  • Version or commit hash:
    • Humble, from the 2023-01-27 sync snapshot
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Calling rclcpp_action::Client::async_get_result() after rclcpp_action::Client::async_send_goal() sometimes throws rclcpp::exceptions::RCLError with a "failed to add guard condition to wait set: guard condition implementation is invalid" message.

While I don't have a fully self-contained code sample that reproduces this issue independently of my project, the code in question is similar to this:

  const auto goal_handle_future = client_->async_send_goal(goal);
  if (goal_handle_future.wait_for(goal_response_timeout_) == std::future_status::timeout)
  {
    // handle failure to get response to goal request
    return false;
  }

  const auto goal_handle = goal_handle_future.get();
  if (!goal_handle)
  {
    // handle rejection of the goal request
    return false;
  }

  auto wrapped_result_future = client_->async_get_result(goal_handle);
  if (wrapped_result_future.wait_for(goal_result_timeout_) == std::future_status::timeout)
  {
    // handle timeout before action result is received
    return false;
  }

  // handle success
  return true;

Expected behavior

If the goal_handle future returned from async_send_goal() is set (indicating that the goal request was accepted), then I should always be able to use the goal_handle to wait for the action result.

Actual behavior

For some small proportion of action goals (on the order of 1 in 100), an exception is thrown:

    [test_node_main-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
    [test_node_main-1]   what():  failed to add guard condition to wait set: guard condition implementation is invalid, at ./src/rcl/guard_condition.c:172, at ./src/rcl/wait.c:460
    [test_node_main-1] Stack trace (most recent call last) in thread 18342:
    [test_node_main-1] #16   Object "/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2", at 0xffffffffffffffff, in 
    [test_node_main-1] #15   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe4282bb3, in __clone
    [test_node_main-1] #14   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41f1b42, in 
    [test_node_main-1] #13   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe44812b2, in 
    [test_node_main-1] #12   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b418e1, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long)
    [test_node_main-1] #11   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b3aab2, in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    [test_node_main-1] #10   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b37d94, in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    [test_node_main-1] #9    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b4ac86, in 
    [test_node_main-1] #8    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b385ab, in rclcpp::detail::add_guard_condition_to_rcl_wait_set(rcl_wait_set_s&, rclcpp::GuardCondition const&)
    [test_node_main-1] #7    Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f9fe4b34838, in rclcpp::exceptions::throw_from_rcl_error(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rcutils_error_state_s const*, void (*)())
    [test_node_main-1] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe445323d, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
    [test_node_main-1] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe44532b6, in std::terminate()
    [test_node_main-1] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe445324b, in 
    [test_node_main-1] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f9fe4447bbd, in 
    [test_node_main-1] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41857f2, in abort
    [test_node_main-1] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe419f475, in raise
    [test_node_main-1] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f9fe41f3a7c, in pthread_kill
    [test_node_main-1] Aborted (Signal sent by tkill() 18291 0)

Additional information

Based on the stack trace, this looks like a problem at the intersection of the rclcpp action client and the MultiThreadedExecutor, an area where I've hit roughly similar issues in the past.

@schornakj (Author) commented Apr 10, 2023

There's a gap in the stack trace between rclcpp::Executor::wait_for_work and rclcpp::detail::add_guard_condition_to_rcl_wait_set, so I searched through the rclcpp code and I think I found the specific place in rclcpp::Executor::wait_for_work where this originates:

There are a few places where add_guard_condition_to_rcl_wait_set is used, but the one that seems relevant to the executor is this usage in AllocatorMemoryStrategy::add_handles_to_wait_set:

    for (auto guard_condition : guard_conditions_) {
      detail::add_guard_condition_to_rcl_wait_set(*wait_set, *guard_condition);
    }

This would end up being called here in Executor::wait_for_work:

    if (!memory_strategy_->add_handles_to_wait_set(&wait_set_)) {
      throw std::runtime_error("Couldn't fill wait set");
    }

In this case, if add_guard_condition_to_rcl_wait_set throws an exception, it isn't caught in any of the functions I linked, so it propagates back up through the executor and produces the stack trace I originally posted.
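
As an aside, the stack trace suggests the throw happens on one of the MultiThreadedExecutor's worker threads, which would explain why it surfaces as terminate/abort rather than as an exception that user code can catch: an exception that escapes the top-level function of a std::thread always calls std::terminate. A minimal plain-C++ sketch of that behavior (no rclcpp involved):

    #include <stdexcept>
    #include <thread>

    int main()
    {
      // An exception that propagates out of a std::thread's top-level function
      // cannot be caught by the thread that calls join(); the C++ runtime calls
      // std::terminate(), which aborts the process -- matching the
      // "terminate called after throwing an instance of ..." output above.
      std::thread worker([]() {
        throw std::runtime_error("failed to add guard condition to wait set");
      });

      try {
        worker.join();  // too late: the exception never reaches this scope
      } catch (const std::exception &) {
        // never reached
      }
      return 0;
    }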

@Barry-Xu-2018 (Collaborator) commented Apr 12, 2023

VectorRebind<const rclcpp::GuardCondition *> guard_conditions_;

A shared_ptr wasn't used here, so this problem occurs.
I don't know why it isn't used; maybe there is a reason.

@mjcarroll (Member)

A shared_ptr wasn't used here, so this problem occurs.

I have an update in flight to the executors that should make the handling of the guard conditions more thread-safe. I would be interested to know if you can reproduce with this PR, @schornakj: #2142

@fujitatomoya (Collaborator)

VectorRebind<const rclcpp::GuardCondition *> guard_conditions_;

A shared_ptr wasn't used here, so this problem occurs. I don't know why it isn't used; maybe there is a reason.

@Barry-Xu-2018 can you elaborate a bit here? I think that get_next_executable is protected by wait_mutex_ in MultiThreadedExecutor. Why (or how) would having guard_conditions_ as shared pointers solve this problem?

@Barry-Xu-2018 (Collaborator)

@fujitatomoya

The problem is related to using a raw rclcpp::GuardCondition * for guard_conditions_.

The guard condition for a node is defined as

rclcpp::GuardCondition notify_guard_condition_;

While executing the code below,

    ret = rcl_wait_set_resize(
      &wait_set_, memory_strategy_->number_of_ready_subscriptions(),
      memory_strategy_->number_of_guard_conditions(), memory_strategy_->number_of_ready_timers(),
      memory_strategy_->number_of_ready_clients(), memory_strategy_->number_of_ready_services(),
      memory_strategy_->number_of_ready_events());
    if (RCL_RET_OK != ret) {
      throw_from_rcl_error(ret, "Couldn't resize the wait set");
    }
    if (!memory_strategy_->add_handles_to_wait_set(&wait_set_)) {
      throw std::runtime_error("Couldn't fill wait set");
    }

the node may be destroyed before add_handles_to_wait_set() is called, so the rclcpp::GuardCondition * stored in guard_conditions_ becomes invalid. This leads to an error when this function is called:

    for (auto guard_condition : guard_conditions_) {
      detail::add_guard_condition_to_rcl_wait_set(*wait_set, *guard_condition);
    }
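
Restated as a minimal plain-C++ sketch (hypothetical stand-in types, only to illustrate the lifetime race; this is not rclcpp code): the memory strategy keeps only a raw pointer to a guard condition whose lifetime is tied to the node, so destroying the node on another thread between collecting the handles and filling the wait set leaves a dangling pointer.

    #include <memory>
    #include <vector>

    struct GuardCondition {};  // stand-in for rclcpp::GuardCondition

    struct Node  // stand-in for rclcpp::Node: it owns its notify guard condition
    {
      GuardCondition notify_guard_condition_;
    };

    int main()
    {
      auto node = std::make_shared<Node>();

      // Stand-in for guard_conditions_: raw pointers, no ownership or liveness check.
      std::vector<const GuardCondition *> guard_conditions;
      guard_conditions.push_back(&node->notify_guard_condition_);

      // Another thread destroys the node here (e.g. a component is unloaded),
      // after the handles were collected but before the wait set is filled.
      node.reset();

      for (const auto * gc : guard_conditions) {
        // Dereferencing gc at this point would be a use-after-free; in rclcpp it
        // surfaces as "guard condition implementation is invalid" (or worse).
        (void)gc;
      }
      return 0;
    }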

@mjcarroll (Member)

In the new executor structures (https://github.com/ros2/rclcpp/pull/2143/files), I have worked around this by making the node return a rclcpp::GuardCondition::SharedPtr, which makes it more consistent with the CallbackGroup API.

These guard conditions are all added to a single waitable and held as weak pointers, which are locked right before adding to the rcl_wait_set so that we can ensure they are still valid: https://github.com/ros2/rclcpp/blob/mjcarroll/executor_structures/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp#L46-L64
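
In other words, the pattern is roughly the following (a sketch of the idea only, with an invented class and member names rather than the actual ExecutorNotifyWaitable code; header paths are from memory):

    #include <memory>
    #include <utility>
    #include <vector>

    #include "rcl/wait.h"
    #include "rclcpp/detail/add_guard_condition_to_rcl_wait_set.hpp"
    #include "rclcpp/guard_condition.hpp"

    // Hypothetical holder, only to illustrate the weak_ptr + lock pattern.
    class NotifyWaitableSketch
    {
    public:
      void add_guard_condition(std::weak_ptr<rclcpp::GuardCondition> weak_gc)
      {
        notify_guard_conditions_.push_back(std::move(weak_gc));
      }

      void add_to_wait_set(rcl_wait_set_t * wait_set)
      {
        for (auto & weak_gc : notify_guard_conditions_) {
          // Lock right before touching the rcl wait set: if the owning node has
          // already been destroyed, lock() returns nullptr and the entry is
          // skipped instead of dereferencing a dangling pointer.
          if (auto gc = weak_gc.lock()) {
            rclcpp::detail::add_guard_condition_to_rcl_wait_set(*wait_set, *gc);
          }
        }
      }

    private:
      // Weak, not raw or shared: the waitable must not keep nodes alive, but it
      // must be able to detect when they are gone.
      std::vector<std::weak_ptr<rclcpp::GuardCondition>> notify_guard_conditions_;
    };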

@fujitatomoya (Collaborator)

@mjcarroll @Barry-Xu-2018 thank you very much for the explanation 👍

@schornakj (Author) commented Apr 13, 2023

A shared_ptr wasn't used here, so this problem occurs.

I have an update in flight to the executors that should make the handling of the guard conditions more thread-safe. I would be interested to know if you can reproduce with this PR, @schornakj: #2142

Cool! That looks like a good set of changes to a very important part of the core ROS 2 functionality.

This will be somewhat tricky for me to replicate on my robots, though, since I'm currently using a specific version of Humble and I'd need to work out some sort of ad-hoc Rolling-based deployment.

@schornakj (Author)

I've been working on isolating this issue, and I think it's related to creating or destroying a Subscriber in one thread while waiting on an action result future in a different thread.

I'll try to create a reproducible example now that I have a clearer idea of what's going on.
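
The shape of the reproduction I have in mind is roughly this (an untested sketch, not my actual project code; it assumes a Fibonacci action server named "fibonacci" from the action tutorials is running elsewhere, and the topic and node names are made up):

    #include <atomic>
    #include <chrono>
    #include <memory>
    #include <thread>

    #include "action_tutorials_interfaces/action/fibonacci.hpp"
    #include "rclcpp/rclcpp.hpp"
    #include "rclcpp_action/rclcpp_action.hpp"
    #include "std_msgs/msg/string.hpp"

    using Fibonacci = action_tutorials_interfaces::action::Fibonacci;
    using namespace std::chrono_literals;

    int main(int argc, char ** argv)
    {
      rclcpp::init(argc, argv);
      auto node = std::make_shared<rclcpp::Node>("repro_node");
      auto client = rclcpp_action::create_client<Fibonacci>(node, "fibonacci");

      rclcpp::executors::MultiThreadedExecutor executor;
      executor.add_node(node);
      std::thread spin_thread([&executor]() {executor.spin();});

      // Thread 1: repeatedly create and destroy a subscription on the same node,
      // which adds and removes entities while the executor is waiting for work.
      std::atomic<bool> churn{true};
      std::thread churn_thread([&]() {
          while (churn) {
            auto sub = node->create_subscription<std_msgs::msg::String>(
              "chatter", 10, [](std_msgs::msg::String::ConstSharedPtr) {});
            std::this_thread::sleep_for(5ms);
          }  // subscription destroyed at the end of every iteration
        });

      // Thread 2 (main): send goals and wait on the result futures, following
      // the same pattern as the snippet in the original report.
      client->wait_for_action_server();
      for (int i = 0; i < 1000 && rclcpp::ok(); ++i) {
        Fibonacci::Goal goal;
        goal.order = 5;
        auto goal_handle_future = client->async_send_goal(goal);
        if (goal_handle_future.wait_for(1s) != std::future_status::ready) {
          continue;
        }
        auto goal_handle = goal_handle_future.get();
        if (!goal_handle) {
          continue;
        }
        auto result_future = client->async_get_result(goal_handle);
        result_future.wait_for(5s);
      }

      churn = false;
      churn_thread.join();
      executor.cancel();
      spin_thread.join();
      rclcpp::shutdown();
      return 0;
    }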

@clalancette (Contributor)

I'll try to create a reproducible example now that I have a clearer idea of what's going on.

Yeah, that would be fantastic. It would be good to have a test that shows this problem, and thus shows that @mjcarroll's fixes actually fix it (hopefully).

@fmrico (Contributor) commented Jun 8, 2023

Hi,

In case it helps: I was working on a tutorial and ran into this problem. The error is reliably reproducible, which may help solve it by providing a testbed for this issue.

https://github.com/fmrico/vector_transmission

In different terminals:

ros2 run rclcpp_components component_container
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorProducer
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorConsumer

The error always happens when unloading the Consumer:

ros2 component unload /ComponentManager 2

I hope this helps!!

@schornakj (Author)

@fmrico thank you for sharing that! I've been struggling to come up with a minimal reproducible example that's independent of my project, and your test helps narrow down the problem a lot.

I modified the VectorConsumer node in your example to remove everything except for a single subscriber and the crash still happens. I think that shows that the specific cause of this crash is destroying the subscriber.
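
Roughly, the stripped-down consumer looks like this (a sketch from memory; the message type and names are approximations of the vector_transmission example, not a verbatim copy):

    #include "rclcpp/rclcpp.hpp"
    #include "rclcpp_components/register_node_macro.hpp"
    #include "std_msgs/msg/float64_multi_array.hpp"

    namespace vector_transmission
    {

    class VectorConsumer : public rclcpp::Node
    {
    public:
      explicit VectorConsumer(const rclcpp::NodeOptions & options)
      : rclcpp::Node("vector_consumer", options)
      {
        // A single subscription and nothing else: unloading this component
        // (destroying the node, its subscription, and the node's notify guard
        // condition) is enough to trigger the crash.
        sub_ = create_subscription<std_msgs::msg::Float64MultiArray>(
          "vector", 10, [](std_msgs::msg::Float64MultiArray::ConstSharedPtr) {});
      }

    private:
      rclcpp::Subscription<std_msgs::msg::Float64MultiArray>::SharedPtr sub_;
    };

    }  // namespace vector_transmission

    RCLCPP_COMPONENTS_REGISTER_NODE(vector_transmission::VectorConsumer)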

@sea-bass

Hey folks -- I just tried @fmrico's repo and reproduction steps on the latest Rolling binaries and could still reproduce the exception.

@moriarty commented Aug 3, 2023

@sea-bass the latest Rolling binaries aren't expected to fix this.

I tried @fmrico's steps (#2163 (comment)) to reproduce with the open PR #2142 from @mjcarroll, and this seems to work.

~/ros/r/rclcpp$ vcs status
..
=== ./src/rclcpp (git) ===
On branch mjcarroll/rclcpp_waitset_executor
Your branch is up to date with 'origin/mjcarroll/rclcpp_waitset_executor'.

nothing to commit, working tree clean
=== ./src/vector_transmission (git) ===
On branch main
Your branch is up to date with 'origin/main'.

@sea-bass

I'm also running into a similar crash with rclpy on Humble, by the way, when using the rosbridge_suite stack to advertise/unadvertise an action server. The stack trace is:

[rosbridge_websocket-2] ERROR:tornado.application:Exception in callback <function main.<locals>.<lambda> at 0x7f83e0230b80>
[rosbridge_websocket-2] Traceback (most recent call last):
[rosbridge_websocket-2]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 905, in _run
[rosbridge_websocket-2]     return self.callback()
[rosbridge_websocket-2]   File "/opt/underlay_ws/install/rosbridge_server/lib/rosbridge_server/rosbridge_websocket", line 340, in <lambda>
[rosbridge_websocket-2]     spin_callback = PeriodicCallback(lambda: executor.spin_once(timeout_sec=0.01), 1)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 705, in spin_once
[rosbridge_websocket-2]     handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 691, in wait_for_ready_callbacks
[rosbridge_websocket-2]     return next(self._cb_iter)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 585, in _wait_for_ready_callbacks
[rosbridge_websocket-2]     waitable.add_to_wait_set(wait_set)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 499, in add_to_wait_set
[rosbridge_websocket-2]     self._handle.add_to_waitset(wait_set)
[rosbridge_websocket-2] rclpy._rclpy_pybind11.RCLError: Failed to add 'rcl_action_server_t' to wait set: action server pointer is invalid, at ./src/rcl_action/action_server.c:929

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/manipulation-wg-moveit-maintainer-meeting-november-30-rescheduled/34686/2
