
spin_some never exits #280

Closed
mikaelarguedas opened this issue Nov 29, 2016 · 13 comments · Fixed by ros2/rmw_fastrtps#178
Labels: bug, hitlist
mikaelarguedas commented Nov 29, 2016

If we let a few messages (~10-20) queue up before processing them, spin_some never exits.
Discourse thread: http://discourse.ros.org/t/fastrtps-throughput-issues/625/8
Reproducible example: ros2/examples@1278fb5
I didn't have time to investigate, but I'm ticketing it so that we don't forget.

mikaelarguedas added the bug label on Nov 29, 2016
geoffviola commented

There's a quick fix: add "rclcpp::utilities::ok() &&" to the condition in https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L191. The spinning atomic bool is not changing state. Should it be updated in https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L344?


wjwwood commented Jan 3, 2017

I think @mikaelarguedas means that it never exits because the queue is never empty, not when you ctrl-c, but that might also be an issue.

I think the right thing to do to address the original issue is to somehow extract all work that can be done first, then burn that list down, never checking for work again.
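
A minimal sketch of that idea (hypothetical Executable/ready_queue stand-ins, not the actual executor internals): drain the ready list once up front, then execute only from that snapshot.

    #include <functional>
    #include <utility>
    #include <vector>

    // Snapshot the work that is ready right now, execute only that snapshot,
    // and leave anything that arrives mid-spin for the next call.
    using Executable = std::function<void()>;

    void spin_some_snapshot(std::vector<Executable> & ready_queue)
    {
      std::vector<Executable> snapshot = std::move(ready_queue);
      ready_queue.clear();
      for (auto & work : snapshot) {
        work();  // never re-checks the live queue for new arrivals
      }
    }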

geoffviola commented

Will this bug be fixed for beta 2? Beta 1 doesn't have the OpenSplice middleware, and this bug makes FastRTPS unusable, so we're still using alpha 8.


wjwwood commented Mar 15, 2017

Not sure. It might be addressable when we make some changes to support more complicated custom executors, which is something @Karsten1987 and his colleagues are working on.

BrannonKing commented

I think a "maximum time" parameter on spin_some would be useful. You could default it to something large. I think, though, if I had a 30fps render loop and I wanted to "spin some" each loop, I would be able to figure out how much time I could allot to the spin. Obviously, I could do that trivially by calling spin_once directly in my own loop.


wjwwood commented Mar 28, 2017

@BrannonKing "maximum time" might be misleading, since we cannot actually make a guarantee about when a single item will be finished. It could, however, have a timeout after which no new items would be started. The logic would be something like:

  • enter spin_some
  • get a list of all work to be done right now
  • while timeout not exceeded and more items to execute from list, execute next item

If the timeout is exceeded while executing something, it will not execute anything else, but it will also not necessarily return immediately when the timeout expires.
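
In code, that logic might look like this (hypothetical stand-ins again, not the real executor; in a real implementation, leftover work would stay queued for the next call rather than being dropped):

    #include <chrono>
    #include <functional>
    #include <utility>
    #include <vector>

    using Executable = std::function<void()>;

    // Gather the ready work once, then start new items only while the
    // deadline has not passed. An item that overruns the deadline still
    // runs to completion, so the call may return after the timeout expires.
    void spin_some_with_timeout(std::vector<Executable> & ready_queue,
                                std::chrono::milliseconds max_duration)
    {
      const auto deadline = std::chrono::steady_clock::now() + max_duration;
      std::vector<Executable> snapshot = std::move(ready_queue);
      ready_queue.clear();
      for (auto & work : snapshot) {
        if (std::chrono::steady_clock::now() >= deadline) {
          break;  // timeout exceeded: start nothing else
        }
        work();  // never interrupted, even if it overruns the deadline
      }
    }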

clalancette commented

I believe I am also running into this issue on the Turtlebot2 with FastRTPS. I have the kobuki_node (which subscribes to the cmd_vel topic) and a keyboard teleop program (which publishes to the cmd_vel topic). If I set the QoS on both sides to have a depth of 10, everything is fine. If I set the QoS to have a depth of 1, then I spin forever in "spin_once", never finishing. I'm not exactly sure of the circumstances that lead to that, but it seems like this same problem.


wjwwood commented May 17, 2017

spin_once will block until there is something to do (like new data received) unless you set a timeout. This is expected behavior, so unless it never returns even with a timeout set, I'd say this is a different issue @clalancette.

Even if that is the case, I think there are two separate issues being discussed here, as I pointed out in #280 (comment).

clalancette commented

I'm not sure why I said "spin_once"; the actual problem I'm having is with spin_some(). I did some debugging on it, and I think it is similar to what @mikaelarguedas said: sometimes the queue never seems to empty out.

More specifically, what happens when we get into this situation is that get_next_subscription() always sets any_exec->subscription, which makes get_next_ready_executable() return a non-NULL pointer. At that point get_next_executable() returns non-NULL, which makes spin_some() call execute_any_executable(). execute_any_executable() calls down into execute_subscription(), which then calls rcl_take(). At this point something unexpected happens: rcl_take() returns RCL_RET_SUBSCRIPTION_TAKE_FAILED, which ends up causing an infinite loop in spin_some().

I've found that by increasing the depth size of my nodes, I can work around the issue. That's what I'm doing for now, but we should probably figure out exactly why we end up in this death spiral.
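
For illustration, a self-contained model of that control flow (hypothetical stand-ins, not the real rclcpp source):

    #include <memory>

    // The wait set keeps reporting the subscription as ready, the take
    // inside execution fails and is ignored, and the ready flag is never
    // cleared -- so the loop's only exit is unreachable.
    struct AnyExecutable {};

    static bool subscription_reports_data = true;  // never reset on a failed take

    std::unique_ptr<AnyExecutable> get_next_executable()
    {
      if (!subscription_reports_data) {
        return nullptr;  // the only way spin_some() can stop
      }
      return std::make_unique<AnyExecutable>();
    }

    void execute_any_executable(const AnyExecutable &)
    {
      // Models execute_subscription(): rcl_take() returns
      // RCL_RET_SUBSCRIPTION_TAKE_FAILED, the result is discarded, and
      // subscription_reports_data stays true.
    }

    void spin_some_model()  // would loop forever if called
    {
      while (auto exec = get_next_executable()) {
        execute_any_executable(*exec);
      }
    }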


deng02 commented Nov 27, 2017

So we are running into this issue with the latest release of ROS 2. It can easily be reproduced by creating a subscriber that is slower than the publisher, i.e. if you create a talker that pushes out N messages per second and a corresponding listener that spins once per second with a QoS depth of N-1, you'll see the listener process climb to almost 100% CPU and never exit the spin_some() method.

    // listener loop
    rclcpp::WallRate loop_rate(1);
    while (rclcpp::utilities::ok()) {
      rclcpp::spin_some(node);
      loop_rate.sleep();
    }

We've managed to track the issue down to the CustomSubscriberInfo class and what is going on with the underlying eProsima SubscriberHistory. The bottom line is that there is a mismatch between the amount of data CustomSubscriberInfo reports as available vs. what eProsima actually has available to give back, due to the assumption that on every onNewDataMessage() call a new piece of data has been added and not overwritten.

Below is a more detailed explanation of what we think is going on.

Whenever onNewDataMessage() is called in the SubListener for ROS 2, it increases the data_ member, assuming that every message means one can be taken. However, this isn't always the case. In the situation where the QoS policy is KEEP_LAST and the depth has been reached, an older sample will end up getting removed (see https://github.com/eProsima/Fast-RTPS/blob/master/src/cpp/subscriber/SubscriberHistory.cpp#L81).

So for example, say the depth of the queue is 5 and there are already 5 unread samples in the Subscriber history; when the 6th one comes in, eProsima will actually overwrite one, so there are still only 5 unread samples, but the CustomSubscriberInfo will report a data count of 6. By the time we reach the 5th take, the Subscriber history will be of size 0 with an unread count of 0, but the CustomSubscriberInfo will still have a data_ value of 1. As a result, when get_next_executable() runs, the subscription will be returned since its hasData() method will return true.

When rcl_take is run, the Subscriber takeNextData() returns false and so the taken flag remains false, but because no actual error occurred, RMW_RET_OK is returned. This causes rcl_take to return RCL_RET_SUBSCRIPTION_TAKE_FAILED (see https://github.com/ros2/rcl/blob/master/rcl/src/rcl/subscription.c#L236) as noted above, which is basically ignored by execute_subscription(). Because data_taken() was never called on the subscription, data_ remains 1 and hasData() stays true. Consequently we never exit, spinning on a subscriber that will never drain.

It's a bit tricky to decide how to deal with this situation. We thought of having the hasData() method report what the SubscriberHistory unread count is, but unfortunately that member is not public. We also thought of setting the data count to 0 when hitting a RCL_RET_SUBSCRIPTION_TAKE_FAILED, but that can cause messages that show up afterwards to be incorrectly ignored.
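
A small, self-contained model of the bookkeeping mismatch described above (hypothetical names; the real counters live in CustomSubscriberInfo and Fast-RTPS's SubscriberHistory):

    #include <cstdio>

    static int reported_ = 0;  // data_: ++ on every onNewDataMessage()
    static int unread_ = 0;    // samples the history actually holds (depth 5)

    void on_new_data_message()
    {
      ++reported_;                      // assumes every message is takeable...
      if (unread_ < 5) { ++unread_; }   // ...but KEEP_LAST overwrites at depth
    }

    bool take()  // models rcl_take(): false == RCL_RET_SUBSCRIPTION_TAKE_FAILED
    {
      if (unread_ == 0) { return false; }
      --unread_;
      --reported_;  // only decremented on a successful take
      return true;
    }

    int main()
    {
      for (int i = 0; i < 6; ++i) { on_new_data_message(); }  // 6 arrivals, depth 5
      while (reported_ > 0) {  // spin_some(): loop while data is reported
        if (!take()) {
          std::printf("stuck: reported_=%d but nothing left to take\n", reported_);
          break;  // the real loop has no such escape and spins forever
        }
      }
      return 0;
    }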

dirk-thomas commented

@deng02 The detailed information was really helpful. Based on it I was able to implement a patch, but that requires exposing more information in the FastRTPS API. I have created a PR against upstream (also considering other ways to address this): eProsima/Fast-DDS#166. Once we get a response on that I will hopefully be able to move forward and fix this issue.

dirk-thomas commented

@geoffviola @BrannonKing @deng02 It would be great if you could try the proposed patch (eProsima/Fast-DDS#166 and ros2/rmw_fastrtps#178) and report back whether it addresses your problem.

dirk-thomas commented

I ran the reproducible example (ros2/examples@1278fb5) mentioned in the description: without the patch the listener ran forever and could not be Ctrl-C-ed. With the patch it exits on its own after printing a few received messages.

dirk-thomas self-assigned this on Dec 5, 2017