
spin_some never exits #280

Closed
mikaelarguedas opened this issue Nov 29, 2016 · 13 comments · Fixed by ros2/rmw_fastrtps#178
Labels: bug, hitlist
mikaelarguedas commented Nov 29, 2016

If we let a few messages (~10-20) queue up before processing them, spin_some never exits.
Discourse thread: http://discourse.ros.org/t/fastrtps-throughput-issues/625/8
Reproducible example: ros2/examples@1278fb5
I didn't have time to investigate, but I'm ticketing it so that we don't forget.

mikaelarguedas added the bug label on Nov 29, 2016
geoffviola commented

There's a quick fix: add "rclcpp::utilities::ok() &&" to the condition in https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L191. The spinning atomic bool is not changing state. Should it be updated in https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L344?


wjwwood commented Jan 3, 2017

I think @mikaelarguedas means that it never exits because the queue is never empty, not when you ctrl-c, but that might also be an issue.

I think the right thing to do to address the original issue is to somehow extract all work that can be done first, then burn that list down, never checking for work again.
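
A minimal sketch of that idea (hypothetical Executable/ready_queue stand-ins, not the actual executor internals): drain the ready list once up front, then execute only from that snapshot.

    #include <functional>
    #include <utility>
    #include <vector>

    // Snapshot the work that is ready right now, execute only that snapshot,
    // and leave anything that arrives mid-spin for the next call.
    using Executable = std::function<void()>;

    void spin_some_snapshot(std::vector<Executable> & ready_queue)
    {
      std::vector<Executable> snapshot = std::move(ready_queue);
      ready_queue.clear();
      for (auto & work : snapshot) {
        work();  // never re-checks the live queue for new arrivals
      }
    }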

geoffviola commented

Will this bug be fixed for beta 2? Beta 1 doesn't have the OpenSplice middleware, and this bug makes FastRTPS unusable, so we're still using alpha 8.


wjwwood commented Mar 15, 2017

Not sure. It might be addressable when we make some changes to support more complicated custom executors, which is something @Karsten1987 and his colleagues are working on.

BrannonKing commented

I think a "maximum time" parameter on spin_some would be useful. You could default it to something large. I think, though, if I had a 30fps render loop and I wanted to "spin some" each loop, I would be able to figure out how much time I could allot to the spin. Obviously, I could do that trivially by calling spin_once directly in my own loop.


wjwwood commented Mar 28, 2017

@BrannonKing "maximum time" might be misleading, since we cannot actually make a guarantee about when a single item will be finished. It could, however, have a timeout after which no new items would be started. The logic would be something like:

  • enter spin_some
  • get a list of all work to be done right now
  • while timeout not exceeded and more items to execute from list, execute next item

If the timeout is exceeded while executing something, it will not execute anything else, but it will also not necessarily return immediately when the timeout expires.
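
In code, that logic might look like this (hypothetical stand-ins again, not the real executor; in a real implementation, leftover work would stay queued for the next call rather than being dropped):

    #include <chrono>
    #include <functional>
    #include <utility>
    #include <vector>

    using Executable = std::function<void()>;

    // Gather the ready work once, then start new items only while the
    // deadline has not passed. An item that overruns the deadline still
    // runs to completion, so the call may return after the timeout expires.
    void spin_some_with_timeout(std::vector<Executable> & ready_queue,
                                std::chrono::milliseconds max_duration)
    {
      const auto deadline = std::chrono::steady_clock::now() + max_duration;
      std::vector<Executable> snapshot = std::move(ready_queue);
      ready_queue.clear();
      for (auto & work : snapshot) {
        if (std::chrono::steady_clock::now() >= deadline) {
          break;  // timeout exceeded: start nothing else
        }
        work();  // never interrupted, even if it overruns the deadline
      }
    }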

clalancette commented

I believe I am also running into this issue on the Turtlebot2 with FastRTPS. I have the kobuki_node (which subscribes to the cmd_vel topic) and a keyboard teleop program (which publishes to the cmd_vel topic). If I set the QoS on both sides to have a depth of 10, everything is fine. If I set the QoS to have a depth of 1, then I spin forever in "spin_once", never finishing. I'm not exactly sure of the circumstances that lead to that, but it seems like this same problem.


wjwwood commented May 17, 2017

spin_once will block until there is something to do (like new data received) unless you set a timeout. This is expected behavior, so unless it never returns even with a timeout set, I'd say this is a different issue @clalancette.

Even if that is the case, I think there are two separate issues being discussed here, as I pointed out in #280 (comment).

clalancette commented

I'm not sure why I said "spin_once"; the actual problem I'm having is with spin_some(). I did some debugging on it, and I think it is similar to what @mikaelarguedas said: sometimes the queue never seems to empty out.

More specifically, what happens when we get into this situation is that get_next_subscription() always sets any_exec->subscription, which makes get_next_ready_executable() return a non-NULL pointer. At that point get_next_executable() returns non-NULL, which makes spin_some() call execute_any_executable(). execute_any_executable() calls down into execute_subscription(), which then calls rcl_take(). At this point something unexpected happens: rcl_take() returns RCL_RET_SUBSCRIPTION_TAKE_FAILED, which ends up causing an infinite loop in spin_some().

I've found that by increasing the depth size of my nodes, I can work around the issue. That's what I'm doing for now, but we should probably figure out exactly why we end up in this death spiral.
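
For illustration, a self-contained model of that control flow (hypothetical stand-ins, not the real rclcpp source):

    #include <memory>

    // The wait set keeps reporting the subscription as ready, the take
    // inside execution fails and is ignored, and the ready flag is never
    // cleared -- so the loop's only exit is unreachable.
    struct AnyExecutable {};

    static bool subscription_reports_data = true;  // never reset on a failed take

    std::unique_ptr<AnyExecutable> get_next_executable()
    {
      if (!subscription_reports_data) {
        return nullptr;  // the only way spin_some() can stop
      }
      return std::make_unique<AnyExecutable>();
    }

    void execute_any_executable(const AnyExecutable &)
    {
      // Models execute_subscription(): rcl_take() returns
      // RCL_RET_SUBSCRIPTION_TAKE_FAILED, the result is discarded, and
      // subscription_reports_data stays true.
    }

    void spin_some_model()  // would loop forever if called
    {
      while (auto exec = get_next_executable()) {
        execute_any_executable(*exec);
      }
    }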


deng02 commented Nov 27, 2017

So we are running into this issue with the latest release of ROS 2. It can easily be reproduced by creating a subscriber that is slower than the publisher, i.e. if you create a talker that pushes out N messages per second and a corresponding listener that spins once per second with a QoS depth of N-1, you'll see the listener process climb to almost 100% CPU and never exit the spin_some() method.

    // listener loop
    rclcpp::WallRate loop_rate(1);
    while (rclcpp::utilities::ok()) {
      rclcpp::spin_some(node);
      loop_rate.sleep();
    }

We've managed to track the issue down to the CustomSubscriberInfo class and what is going on with the underlying eProsima SubscriberHistory. The bottom line is that there is a mismatch between the amount of data CustomSubscriberInfo reports as available vs. what eProsima actually has available to give back, due to the assumption that on every onNewDataMessage() call a new piece of data has been added and not overwritten.

Below is a more detailed explanation of what we think is going on.

Whenever onNewDataMessage() is called in the SubListener for ROS 2, it increases the data_ member, assuming that every message means one can be taken. However, this isn't always the case. In the situation where the QoS policy is KEEP_LAST and the depth has been reached, an older sample will end up getting removed (see https://github.com/eProsima/Fast-RTPS/blob/master/src/cpp/subscriber/SubscriberHistory.cpp#L81).

So for example, say the depth of the queue is 5 and there are already 5 unread samples in the Subscriber history; when the 6th one comes in, eProsima will actually overwrite one, so there are still only 5 unread samples, but the CustomSubscriberInfo will report a data count of 6. By the time we reach the 5th take, the Subscriber history will be of size 0 with an unread count of 0, but the CustomSubscriberInfo will still have a data_ value of 1. As a result, when get_next_executable() runs, the subscription will be returned since its hasData() method will return true.

When rcl_take is run, the Subscriber takeNextData() returns false and so the taken flag remains false, but because no actual error occurred, RMW_RET_OK is returned. This causes rcl_take to return RCL_RET_SUBSCRIPTION_TAKE_FAILED (see https://github.com/ros2/rcl/blob/master/rcl/src/rcl/subscription.c#L236) as noted above, which is basically ignored by execute_subscription(). Because data_taken() was never called on the subscription, data_ remains 1 and hasData() stays true. Consequently we never exit, spinning on a subscriber that will never drain.

It's a bit tricky to decide how to deal with this situation. We thought of having the hasData() method report what the SubscriberHistory unread count is, but unfortunately that member is not public. We also thought of setting the data count to 0 when hitting a RCL_RET_SUBSCRIPTION_TAKE_FAILED, but that can cause messages that show up afterwards to be incorrectly ignored.
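
A small, self-contained model of the bookkeeping mismatch described above (hypothetical names; the real counters live in CustomSubscriberInfo and Fast-RTPS's SubscriberHistory):

    #include <cstdio>

    static int reported_ = 0;  // data_: ++ on every onNewDataMessage()
    static int unread_ = 0;    // samples the history actually holds (depth 5)

    void on_new_data_message()
    {
      ++reported_;                      // assumes every message is takeable...
      if (unread_ < 5) { ++unread_; }   // ...but KEEP_LAST overwrites at depth
    }

    bool take()  // models rcl_take(): false == RCL_RET_SUBSCRIPTION_TAKE_FAILED
    {
      if (unread_ == 0) { return false; }
      --unread_;
      --reported_;  // only decremented on a successful take
      return true;
    }

    int main()
    {
      for (int i = 0; i < 6; ++i) { on_new_data_message(); }  // 6 arrivals, depth 5
      while (reported_ > 0) {  // spin_some(): loop while data is reported
        if (!take()) {
          std::printf("stuck: reported_=%d but nothing left to take\n", reported_);
          break;  // the real loop has no such escape and spins forever
        }
      }
      return 0;
    }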

dirk-thomas commented

@deng02 The detailed information was really helpful. Based on it I was able to implement a patch, but that requires exposing more information in the FastRTPS API. I have created a PR against upstream (also considering other ways to address this): eProsima/Fast-DDS#166. Once we get a response on that I will hopefully be able to move forward and fix this issue.

dirk-thomas commented

@geoffviola @BrannonKing @deng02 It would be great if you could try the proposed patch (eProsima/Fast-DDS#166 and ros2/rmw_fastrtps#178) and report back whether it addresses your problem.

dirk-thomas commented

I ran the reproducible example (ros2/examples@1278fb5) mentioned in the description: without the patch the listener ran forever and could not be Ctrl-C-ed. With the patch it exits on its own after printing a few received messages.

dirk-thomas self-assigned this on Dec 5, 2017