This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

race condition in graph changes and service is available #201

Closed
wjwwood opened this issue Oct 27, 2016 · 5 comments
Labels
bug Something isn't working

Comments


wjwwood commented Oct 27, 2016

I noticed this when debugging the flaky test in rcl called test_rcl_service_server_is_available, which is in the rcl/test/rcl/test_graph.cpp file:

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L477

The race seems to be between the graph guard condition being triggered (and waiting wait sets being woken up):

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L523

And the rcl_service_server_is_available function reporting that a service that was previously available is no longer available:

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L542

Normally the test only checks this when a change occurs in the graph, but that caused the test to fail periodically with Connext. So I added a Connext-specific condition that checks on every loop iteration, regardless of whether a graph change was detected:

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L525-L538

The rcl_service_server_is_available function normally reported the correct state on the next loop iteration. This Connext special case should be removed once this is fixed.
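For reference, a simplified sketch of the polling pattern the workaround produces (this is not the actual test code; setup and teardown are omitted, and wait_for_graph_change() is a hypothetical placeholder for the wait-set plumbing the real test uses on the node's graph guard condition):

```cpp
// Simplified sketch of the test's polling pattern, not the actual test code.
// Assumes `node` and `client` are already initialized and valid.
#include <chrono>

#include "rcl/graph.h"

// Hypothetical helper: returns true if a graph change woke us before `timeout`.
bool wait_for_graph_change(rcl_node_t * node, std::chrono::milliseconds timeout);

bool wait_for_server_availability(
  rcl_node_t * node, rcl_client_t * client, bool expected, bool is_connext)
{
  for (int attempt = 0; attempt < 10; ++attempt) {
    bool graph_changed = wait_for_graph_change(node, std::chrono::milliseconds(200));
    // Normally availability is only re-checked when a graph change was
    // reported; the Connext workaround re-checks on every iteration because
    // rcl_service_server_is_available() sometimes only reports the new state
    // on a later loop.
    if (graph_changed || is_connext) {
      bool is_available = false;
      rcl_ret_t ret = rcl_service_server_is_available(node, client, &is_available);
      if (ret == RCL_RET_OK && is_available == expected) {
        return true;
      }
    }
  }
  return false;
}
```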

This could be caused by graph changes being combined through some sort of event coalescing, or it could be a delay introduced by Connext; I'm not sure yet. I've decided to work around and document the issue rather than solve it now.


wjwwood commented Oct 27, 2016

I also had to make this PR to rclcpp to work around this problem: ros2/rclcpp#262

I believe it could be reverted once we resolve this. I did not make this conditional on Connext only; it doesn't hurt to do this with other rmw implementations, which can pass the test unmodified.


wjwwood commented Oct 28, 2016

I also changed tests in test_rclcpp to allow the first message sent in some pub/sub tests to get lost, retrying by publishing again: ros2/system_tests@06bd4f7

This only happens with Connext, and the tests could perhaps be made strict again (i.e. enforce one publish and one message received) once this is fixed. Also see: #168 (comment)


dhood commented May 22, 2018

I have been looking at the implementation of wait_for_service_nanoseconds while trying to resolve #280. One idea for what might be causing the delay between a graph event being triggered and the server being reported as ready:

  • Services are made up of a request subscriber, and a response publisher.
  • rmw_service_server_is_available will only return true if both are matched.
  • I imagine (don't have evidence, might be incorrect) that the graph guard condition will be triggered when either the request subscriber OR the response publisher comes up.

Maybe what causes the issue in #280 is that the graph guard condition is triggered for one of the two (request subscriber / response publisher), and then we miss the other one's "event" because we are not waiting on a graph change at the time it occurs; we are busy responding to the first event. That might explain why we then end up waiting indefinitely for another graph event to occur: it already happened while we were responding to the first one. (I don't have evidence for this, it's just an idea; see the sketch at the end of this comment.)

However, even if that is what is happening for the race condition in services becoming available, I don't think it explains the race condition described in this thread about services being reported as no longer available, because if either the request subscriber or the response publisher goes away (triggering the guard condition), rmw_service_server_is_available should correctly return false.

So, I'm not sure that this is the issue here, but wanted to document it as a potential lead.
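To make the hypothesized window concrete, here is a toy model in plain C++ of a wait-for-service loop (this is not the actual rclcpp wait_for_service_nanoseconds code). The std::function parameters stand in for rmw_service_server_is_available() and for waiting on the graph guard condition, and the comment marks where the hypothesized missed event would land:

```cpp
// Toy model of the hypothesized race; names are illustrative only.
#include <chrono>
#include <functional>

bool wait_for_service_model(
  std::chrono::nanoseconds timeout,
  const std::function<bool()> & server_is_available,
  const std::function<void(std::chrono::steady_clock::time_point)> & wait_for_graph_event)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (server_is_available()) {
      return true;
    }
    // <-- Hypothesized window: suppose only the request subscriber has matched,
    //     so the check above returned false. If the response publisher's
    //     "matched" graph event fired right here, and that event were somehow
    //     lost rather than latched, the wait below would block until the
    //     deadline even though the server is actually ready.
    wait_for_graph_event(deadline);
  }
  return server_is_available();
}
```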


wjwwood commented May 22, 2018

@dhood That's a reasonable idea, though my understanding is that if you trigger a guard condition while not waiting on it, then the next time you go into wait you will still wake up immediately, i.e. guard conditions triggered when not waiting are not "lost". I'm pretty sure that's how they work, but it would be worth checking.
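For reference, that "not lost" behavior can be modeled as a latched flag rather than a fire-and-forget signal. A minimal illustrative model in plain C++, not the rmw implementation:

```cpp
// A guard condition modeled as a latched flag: a trigger that happens while
// nobody is waiting still wakes the next wait.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

class GuardConditionModel
{
public:
  void trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    triggered_ = true;           // latch the trigger
    cv_.notify_all();
  }

  // Returns true immediately if a trigger happened before wait() was called.
  template<typename Duration>
  bool wait(Duration timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    bool woke = cv_.wait_for(lock, timeout, [this] {return triggered_;});
    triggered_ = false;          // consume the trigger
    return woke;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool triggered_ = false;
};

int main()
{
  GuardConditionModel gc;
  gc.trigger();  // triggered while nobody is waiting
  // The later wait still wakes up immediately because the trigger is latched.
  std::cout << std::boolalpha
            << gc.wait(std::chrono::milliseconds(100)) << std::endl;  // true
}
```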

My suspicion for why this is flaky is the way graph events for local topics are implemented. When the user calls create_publisher(), the graph guard condition is triggered immediately (i.e. before that function returns):

node_info->publisher_listener->trigger_graph_guard_condition();

However, discovery is still in progress for a period of time after that, so you may be woken up for a graph change while the entity that caused it is not yet registered as "matched". See the places where we manually notify the graph when creating a service client (it's similar for a service server).

This doesn't come up with pub/sub because there we almost never use "matched" status as a measure of "ready": checking matched status would require a publisher or subscription, whereas count_publishers() needs no subscription and instead just accumulates graph events and stores the result. For services we need a client in order to wait anyway, so we use the matched status of the underlying datareader/datawriter rather than the accumulated graph state, because we can.
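To illustrate the distinction (sketch only; the topic name is arbitrary and the rcl signatures shown are the current ones, which may have differed slightly at the time):

```cpp
// Contrast between the two readiness measures.
#include "rcl/graph.h"

void illustrate_readiness_checks(rcl_node_t * node, rcl_client_t * client)
{
  // Pub/sub "readiness" checks work off accumulated graph state: no
  // subscription object is needed to count publishers on a topic.
  size_t publisher_count = 0;
  rcl_count_publishers(node, "/chatter", &publisher_count);

  // Service readiness instead asks whether the client's underlying DDS
  // reader/writer are matched, and that matched status can lag behind the
  // local graph guard condition trigger.
  bool is_available = false;
  rcl_service_server_is_available(node, client, &is_available);
}
```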

One potential solution (if my guess is correct) is to notify the graph for services from the "on matched" callback of a DDS listener, rather than from within the local function that creates the pub/sub/service.
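If that guess is right, the change would look roughly like the sketch below: trigger the graph notification from the DDS matched callback rather than from the local create function. This targets the classic Connext C++ listener API and is not actual rmw_connext_cpp code; notify_graph is a placeholder for however the rmw layer triggers its graph guard condition.

```cpp
// Rough sketch: notify the graph only once the writer is actually matched.
#include <functional>
#include <utility>

#include <ndds/ndds_cpp.h>  // assumption: classic Connext C++ headers

class GraphNotifyingWriterListener : public DDSDataWriterListener
{
public:
  explicit GraphNotifyingWriterListener(std::function<void()> notify_graph)
  : notify_graph_(std::move(notify_graph)) {}

  void on_publication_matched(
    DDSDataWriter * writer, const DDS_PublicationMatchedStatus & status) override
  {
    (void)writer;
    (void)status;
    // Only now is the writer matched with a remote reader, so waking graph
    // waiters here cannot produce a "woken but not yet matched" window for
    // rmw_service_server_is_available().
    notify_graph_();
  }

private:
  std::function<void()> notify_graph_;
};
```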

@clalancette

Closing, since with Foxy now being End-of-Life, this repository is no longer used.

clalancette closed this as not planned on Jun 21, 2023