Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spin_some easily deadlocks action clients #2451

Open
asasine opened this issue Mar 14, 2024 · 3 comments
Open

spin_some easily deadlocks action clients #2451

asasine opened this issue Mar 14, 2024 · 3 comments

Comments

@asasine
Copy link

asasine commented Mar 14, 2024

Bug report

When running an rclcpp_action::Client<T> in using spin_some() and a rclcpp::Rate, if the server you connect to publishes feedback at a faster rate than your rclcpp::Rate, your client waits for the goal to be accepted forever, effectively becoming deadlocked.

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • Binaries
  • Version or commit hash:
    • Galactic
  • DDS implementation:
    • Cyclone DDS
  • Client library (if applicable):
    • rclcpp

Note: based on my understanding of the root cause (see Additional information), I believe this bug exists in rolling too, but I've only tested on Galactic as that's what I have access to at the moment.

Steps to reproduce issue

The below server and client reproduce 100% of the time on my machine when running over loopback. Compilation settings shouldn't matter. Once built, they can be run with ros2 run ... or directly with ./path/to/binary. I'm using action_tutorials_interfaces/action/Fibonacci, but this should repro with any action interface.

Server
#include <action_tutorials_interfaces/action/fibonacci.hpp>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>

int main(int argc, char *argv[])
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("fast_feedback_server");
  using Action = action_tutorials_interfaces::action::Fibonacci;
  std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>> goal_handle = nullptr;

  auto server = rclcpp_action::create_server<Action>(
      node, "fast_feedback",
      [&](const rclcpp_action::GoalUUID &, std::shared_ptr<const Action::Goal>)
      { return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE; },
      [&](const std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>>)
      {
        RCLCPP_INFO(node->get_logger(), "Cancellation requested, accepting.");
        goal_handle = nullptr;
        return rclcpp_action::CancelResponse::ACCEPT;
      },
      [&](const std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>> gh)
      {
        RCLCPP_INFO(node->get_logger(), "Goal was accepted");
        goal_handle = gh;
      });

  rclcpp::Rate rate(100.0);  // 100 Hz
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  RCLCPP_INFO(node->get_logger(), "Server spinning, ready to accept goals.");
  while (rclcpp::ok())
  {
    executor.spin_some();

    // NOTE: this server intentionally does not send a result
    if (goal_handle)
    {
      goal_handle->publish_feedback(std::make_unique<Action::Feedback>());
    }

    rate.sleep();
  }

  return 0;
}

Notes:

  1. The server spins at 100 Hz.
  2. Feedback is published at the same rate.
  3. The goal is never completed as it's not relevant to this bug.
Client
#include <action_tutorials_interfaces/action/fibonacci.hpp>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>

int main(int argc, char *argv[])
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("fast_feedback_client");
  using Action = action_tutorials_interfaces::action::Fibonacci;
  using Client = rclcpp_action::Client<Action>;
  auto client = rclcpp_action::create_client<Action>(node, "fast_feedback");
  if (!client->wait_for_action_server())
  {
    RCLCPP_ERROR(node->get_logger(), "Action server not available after waiting");
    return 1;
  }

  // NOTE: no feedback_callback is set
  Client::SendGoalOptions send_goal_options;
  bool goal_response = false;
  send_goal_options.goal_response_callback = [&](std::shared_future<Client::GoalHandle::SharedPtr>)
  {
    RCLCPP_INFO(node->get_logger(), "Goal response.");
    goal_response = true;
  };

  RCLCPP_INFO(node->get_logger(), "Sending goal.");
  auto future = client->async_send_goal(Action::Goal(), send_goal_options);
  rclcpp::Rate rate(10.0);  // 10 Hz, 10% of the server's rate
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  while (rclcpp::ok() && !goal_response)
  {
    RCLCPP_INFO_THROTTLE(node->get_logger(), *node->get_clock(), 10000, "Waiting for goal response.");
    executor.spin_some();
    rate.sleep();
  }

  RCLCPP_INFO_STREAM(node->get_logger(), "Goal response received: " << std::boolalpha << goal_response);
  return 0;
}

Notes

  1. The client's rate in this example is 10% of the server's rate to make the repro foolproof for different machines/network conditions, but I've seen the bug reproduce reliably when the client is >90% of the server's rate, and even repro most of the time when they have the same rate.
  2. No feedback callback is set on SendGoalOptions as the deadlock is not dependent on one.
  3. The client exits once the goal is accepted since that's all that's relevant to this bug.

Expected behavior

The action client will accept the goal, perhaps after working through a short backlog of feedback up to the queue depth in the feedback subscription's QoS.

Actual behavior

The server has accepted and started working on the goal but the client never sees that. The client continuously logs Waiting for goal response. in the repro and debug logs have Received feedback for unknown goal. Ignoring.... Debug logs show Client in wait set is ready, which I presume to be the goal response, but it's never taken, so the client deadlocks.

Additional information

I did some digging and think I have tracked down the root cause, some other contributing factors, and some related impacts.

rclcpp_action::Client<T> derives from rclcpp_action::ClientBase derives from rclcpp::Waitable, and these are scheduled for work in an rclcpp::Executor as a single executable, not multiple executables of their constituent parts. When using rclcpp::Executor::spin_some, which only collects work from ready entities once per call, the rclcpp::Waitable is scheduled once, regardless of the amount of work it has ready. This should be fine, but the rclcpp::ClientBase::take_data implementation only yields a single executable per call and therefore the subsequent execute invoked by rclcpp::Executor only performs one thing on the client rather than everything that's ready.

std::shared_ptr<void>
ClientBase::take_data()
{
if (pimpl_->is_feedback_ready) {
std::shared_ptr<void> feedback_message = this->create_feedback_message();
rcl_ret_t ret = rcl_action_take_feedback(
pimpl_->client_handle.get(), feedback_message.get());
return std::static_pointer_cast<void>(
std::make_shared<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(
ret, feedback_message));
} else if (pimpl_->is_status_ready) {
std::shared_ptr<void> status_message = this->create_status_message();
rcl_ret_t ret = rcl_action_take_status(
pimpl_->client_handle.get(), status_message.get());
return std::static_pointer_cast<void>(
std::make_shared<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(
ret, status_message));
} else if (pimpl_->is_goal_response_ready) {
rmw_request_id_t response_header;
std::shared_ptr<void> goal_response = this->create_goal_response();
rcl_ret_t ret = rcl_action_take_goal_response(
pimpl_->client_handle.get(), &response_header, goal_response.get());
return std::static_pointer_cast<void>(
std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(
ret, response_header, goal_response));
} else if (pimpl_->is_result_response_ready) {
rmw_request_id_t response_header;
std::shared_ptr<void> result_response = this->create_result_response();
rcl_ret_t ret = rcl_action_take_result_response(
pimpl_->client_handle.get(), &response_header, result_response.get());
return std::static_pointer_cast<void>(
std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(
ret, response_header, result_response));
} else if (pimpl_->is_cancel_response_ready) {
rmw_request_id_t response_header;
std::shared_ptr<void> cancel_response = this->create_cancel_response();
rcl_ret_t ret = rcl_action_take_cancel_response(
pimpl_->client_handle.get(), &response_header, cancel_response.get());
return std::static_pointer_cast<void>(
std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(
ret, response_header, cancel_response));
} else {
throw std::runtime_error("Taking data from action client but nothing is ready");
}
}
std::shared_ptr<void>
ClientBase::take_data_by_entity_id(size_t id)
{
// Mark as ready the entity from which we want to take data
switch (static_cast<EntityType>(id)) {
case EntityType::GoalClient:
pimpl_->is_goal_response_ready = true;
break;
case EntityType::ResultClient:
pimpl_->is_result_response_ready = true;
break;
case EntityType::CancelClient:
pimpl_->is_cancel_response_ready = true;
break;
case EntityType::FeedbackSubscription:
pimpl_->is_feedback_ready = true;
break;
case EntityType::StatusSubscription:
pimpl_->is_status_ready = true;
break;
}
return take_data();
}
void
ClientBase::execute(std::shared_ptr<void> & data)
{
if (!data) {
throw std::runtime_error("'data' is empty");
}
if (pimpl_->is_feedback_ready) {
auto shared_ptr = std::static_pointer_cast<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(data);
auto ret = std::get<0>(*shared_ptr);
pimpl_->is_feedback_ready = false;
if (RCL_RET_OK == ret) {
auto feedback_message = std::get<1>(*shared_ptr);
this->handle_feedback_message(feedback_message);
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) {
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking feedback");
}
} else if (pimpl_->is_status_ready) {
auto shared_ptr = std::static_pointer_cast<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(data);
auto ret = std::get<0>(*shared_ptr);
pimpl_->is_status_ready = false;
if (RCL_RET_OK == ret) {
auto status_message = std::get<1>(*shared_ptr);
this->handle_status_message(status_message);
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) {
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking status");
}
} else if (pimpl_->is_goal_response_ready) {
auto shared_ptr = std::static_pointer_cast<
std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data);
auto ret = std::get<0>(*shared_ptr);
pimpl_->is_goal_response_ready = false;
if (RCL_RET_OK == ret) {
auto response_header = std::get<1>(*shared_ptr);
auto goal_response = std::get<2>(*shared_ptr);
this->handle_goal_response(response_header, goal_response);
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) {
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking goal response");
}
} else if (pimpl_->is_result_response_ready) {
auto shared_ptr = std::static_pointer_cast<
std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data);
auto ret = std::get<0>(*shared_ptr);
pimpl_->is_result_response_ready = false;
if (RCL_RET_OK == ret) {
auto response_header = std::get<1>(*shared_ptr);
auto result_response = std::get<2>(*shared_ptr);
this->handle_result_response(response_header, result_response);
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) {
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking result response");
}
} else if (pimpl_->is_cancel_response_ready) {
auto shared_ptr = std::static_pointer_cast<
std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data);
auto ret = std::get<0>(*shared_ptr);
pimpl_->is_cancel_response_ready = false;
if (RCL_RET_OK == ret) {
auto response_header = std::get<1>(*shared_ptr);
auto cancel_response = std::get<2>(*shared_ptr);
this->handle_cancel_response(response_header, cancel_response);
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) {
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking cancel response");
}
} else {
throw std::runtime_error("Executing action client but nothing is ready");
}
}

In most scenarios, this is fine, but it deadlocks in this specific scenario: a server publishing feedback faster than the client is spinning (specifically, faster than it calls rclcpp::Executor::wait_for_work). This can happen with any of the spin* implementations but happens most readily with spin_some. Why? Because rclcpp::ClientBase::take_data only returns the first ready executable and it prioritizes feedback over responses. The implementation of rclcpp::ClientBase::take_data and execute take and execute in this order: feedback, status, goal response, result response, cancel response.

When you combine the implementation of rclcpp::Waitable that only returns execution data representing a single unit of work, the implicit prioritization mechanism built into ClientBase, and an action server whose feedback topic publishes faster than the spin rate of the client, you get a client that always has ready feedback for an unknown goal, ironically never executing the goal response that feedback is for.

Some ancillary effects of this: the goal response is never taken, the client never creates the goal handle, and therefore cancellation isn't easy. Cancellation can be done through rclcpp_action::Client<T>::async_cancel_all_goals but that may have adverse side effects depending on your scenario. Additionally, the client never becomes "result aware" for the goal since that occurs when the goal response is processed, so result responses are permanently lost if they occur during the deadlock. Feedback is also always dropped because it's for an unknown goal.

@mauropasse
Copy link
Collaborator

The issue can be fixed by re-ordering ClientBase::execute and ClientBase::take_data() such as:

if (pimpl_->is_feedback_ready) { ... }
else if (pimpl_->is_status_ready) {...}

go as the last conditions. So it'd be for both take_data and execute:

if (pimpl_->is_goal_response_ready) { }
else if (pimpl_->is_result_response_ready) {}
else if (pimpl_->is_cancel_response_ready) {}
else if (pimpl_->is_feedback_ready) {}
else if (pimpl_->is_status_ready) {}

Otherwise feedback & status will always have priority over server responses.
Another alternative is using the EventsExecutor instead of the SingleThreadedExecutor, since the former always executes events in the order in which they were produced (no feedback over result priority).

@asasine
Copy link
Author

asasine commented Mar 15, 2024

I think reordering would work.

What of the fact that rclcpp::Waitable has to be taken multiple times to execute, unlike other executables like subscription/clients/etc? Is the design of Waitable intended to have implementations execute all of their work on execute(...) or only some of it?

EventsExecutor might work if event.num_events corresponds to the sum of the return values of the various get_number_of_ready_*() methods on rclcpp::Waitable. I haven't dug deep enough to know if it does yet.

for (size_t i = 0; i < event.num_events; i++) {
auto data = waitable->take_data_by_entity_id(event.waitable_data);
waitable->execute(data);
}

@mauropasse
Copy link
Collaborator

mauropasse commented Mar 15, 2024

Is the design of Waitable intended to have implementations execute all of their work on execute(...) or only some of it?

The idea is that one single execute() of the executor, executes a single piece of work. If a rclcpp::Waitable has many events, it has to execute them separately.
When calling executor->spin_once(), we expect to execute a single event and not all from a waitable.

PD: I've already tested with the EventsExecutor and it works.
Nevertheless I think this has to be fixed for the SingleThreadedExecutor in rolling.

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Mar 20, 2024
mauropasse added a commit to irobot-ros/rclcpp that referenced this issue Apr 9, 2024
* Fixes for intra-process Actions

* Fixes for Clang builds

* Fix deadlock

* Server to store results until client requests them

* Fix feedback/result data race

See ros2#2451

* Add missing mutex

* Check return value of intra_process_action_send

---------

Co-authored-by: Mauro Passerino <mpasserino@irobot.com>
apojomovsky pushed a commit to apojomovsky/rclcpp that referenced this issue Jun 20, 2024
* Fixes for intra-process Actions

* Fixes for Clang builds

* Fix deadlock

* Server to store results until client requests them

* Fix feedback/result data race

See ros2#2451

* Add missing mutex

* Check return value of intra_process_action_send

---------

Co-authored-by: Mauro Passerino <mpasserino@irobot.com>
apojomovsky pushed a commit to apojomovsky/rclcpp that referenced this issue Jun 21, 2024
* Fixes for intra-process Actions

* Fixes for Clang builds

* Fix deadlock

* Server to store results until client requests them

* Fix feedback/result data race

See ros2#2451

* Add missing mutex

* Check return value of intra_process_action_send

---------

Co-authored-by: Mauro Passerino <mpasserino@irobot.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants