This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

race condition in graph changes and service is available #201

Closed
wjwwood opened this issue Oct 27, 2016 · 5 comments
Labels
bug Something isn't working

Comments


wjwwood commented Oct 27, 2016

I noticed this when debugging the flaky test in rcl called test_rcl_service_server_is_available, which is in the rcl/test/rcl/test_graph.cpp file:

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L477

The race seems to be between the graph guard condition being triggered (and waiting wait sets being woken up):

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L523

And the rcl_service_server_is_available function reporting that a service that was previously available is no longer available:

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L542

Normally the test only checks this when a change occurs in the graph, but that caused the test to fail periodically with Connext. So I added a Connext-specific condition that checks on every loop iteration, regardless of whether a graph change was detected:

https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L525-L538

The rcl_service_server_is_available function normally reported the correct state on the next loop iteration. This Connext special case should be removed once this is fixed.
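For reference, a simplified sketch of the polling pattern the workaround produces (this is not the actual test code; setup and teardown are omitted, and wait_for_graph_change() is a hypothetical placeholder for the wait-set plumbing the real test uses on the node's graph guard condition):

```cpp
// Simplified sketch of the test's polling pattern, not the actual test code.
// Assumes `node` and `client` are already initialized and valid.
#include <chrono>

#include "rcl/graph.h"

// Hypothetical helper: returns true if a graph change woke us before `timeout`.
bool wait_for_graph_change(rcl_node_t * node, std::chrono::milliseconds timeout);

bool wait_for_server_availability(
  rcl_node_t * node, rcl_client_t * client, bool expected, bool is_connext)
{
  for (int attempt = 0; attempt < 10; ++attempt) {
    bool graph_changed = wait_for_graph_change(node, std::chrono::milliseconds(200));
    // Normally availability is only re-checked when a graph change was
    // reported; the Connext workaround re-checks on every iteration because
    // rcl_service_server_is_available() sometimes only reports the new state
    // on a later loop.
    if (graph_changed || is_connext) {
      bool is_available = false;
      rcl_ret_t ret = rcl_service_server_is_available(node, client, &is_available);
      if (ret == RCL_RET_OK && is_available == expected) {
        return true;
      }
    }
  }
  return false;
}
```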

This could be caused by graph changes being combined through some sort of event coalescing, or it could be a delay introduced by Connext; I'm not sure yet. I've decided to work around and document the issue rather than solve it now.


wjwwood commented Oct 27, 2016

I also had to make this PR to rclcpp to work around this problem: ros2/rclcpp#262

I believe it could be reverted once we resolve this. I did not make this conditional on Connext only; it doesn't hurt to do this with other rmw implementations, which can pass the test unmodified.


wjwwood commented Oct 28, 2016

I also changed tests in test_rclcpp to allow the first message sent in some pub/sub tests to get lost, retrying by publishing again: ros2/system_tests@06bd4f7

This only happens with Connext, and the tests could perhaps be made strict again (i.e. enforce one publish and one message received) once this is fixed. Also see: #168 (comment)


dhood commented May 22, 2018

I have been looking at the implementation of wait_for_service_nanoseconds while trying to resolve #280. One idea for what might be causing the delay between a graph event being triggered and the server being reported as ready:

  • Services are made up of a request subscriber, and a response publisher.
  • rmw_service_server_is_available will only return true if both are matched.
  • I imagine (don't have evidence, might be incorrect) that the graph guard condition will be triggered when either the request subscriber OR the response publisher comes up.

Maybe what causes the issue in #280 is that the graph guard condition is triggered for one of the two (request subscriber / response publisher), and then we miss the other one's "event" because we are not waiting on a graph change at the time it occurs; we are busy responding to the first event. That might explain why we then end up waiting indefinitely for another graph event to occur: it already happened while we were responding to the first one. (I don't have evidence for this, it's just an idea; see the sketch at the end of this comment.)

However, even if that is what is happening for the race condition in services becoming available, I don't think it explains the race condition described in this thread about services being reported as no longer available, because if either the request subscriber or the response publisher goes away (triggering the guard condition), rmw_service_server_is_available should correctly return false.

So, I'm not sure that this is the issue here, but wanted to document it as a potential lead.
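To make the hypothesized window concrete, here is a toy model in plain C++ of a wait-for-service loop (this is not the actual rclcpp wait_for_service_nanoseconds code). The std::function parameters stand in for rmw_service_server_is_available() and for waiting on the graph guard condition, and the comment marks where the hypothesized missed event would land:

```cpp
// Toy model of the hypothesized race; names are illustrative only.
#include <chrono>
#include <functional>

bool wait_for_service_model(
  std::chrono::nanoseconds timeout,
  const std::function<bool()> & server_is_available,
  const std::function<void(std::chrono::steady_clock::time_point)> & wait_for_graph_event)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (server_is_available()) {
      return true;
    }
    // <-- Hypothesized window: suppose only the request subscriber has matched,
    //     so the check above returned false. If the response publisher's
    //     "matched" graph event fired right here, and that event were somehow
    //     lost rather than latched, the wait below would block until the
    //     deadline even though the server is actually ready.
    wait_for_graph_event(deadline);
  }
  return server_is_available();
}
```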


wjwwood commented May 22, 2018

@dhood That's a reasonable idea, though my understanding is that if you trigger a guard condition while not waiting on it, then the next time you go into wait you will still wake up immediately, i.e. guard conditions triggered when not waiting are not "lost". I'm pretty sure that's how they work, but it would be worth checking.
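For reference, that "not lost" behavior can be modeled as a latched flag rather than a fire-and-forget signal. A minimal illustrative model in plain C++, not the rmw implementation:

```cpp
// A guard condition modeled as a latched flag: a trigger that happens while
// nobody is waiting still wakes the next wait.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

class GuardConditionModel
{
public:
  void trigger()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    triggered_ = true;           // latch the trigger
    cv_.notify_all();
  }

  // Returns true immediately if a trigger happened before wait() was called.
  template<typename Duration>
  bool wait(Duration timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    bool woke = cv_.wait_for(lock, timeout, [this] {return triggered_;});
    triggered_ = false;          // consume the trigger
    return woke;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool triggered_ = false;
};

int main()
{
  GuardConditionModel gc;
  gc.trigger();  // triggered while nobody is waiting
  // The later wait still wakes up immediately because the trigger is latched.
  std::cout << std::boolalpha
            << gc.wait(std::chrono::milliseconds(100)) << std::endl;  // true
}
```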

My suspicion for why this is flaky is the way graph events for local topics are implemented. When the user calls create_publisher(), the graph guard condition is triggered immediately (i.e. before that function returns):

node_info->publisher_listener->trigger_graph_guard_condition();

However, discovery is still in progress for a period of time after that, so you may be woken up for a graph change while the entity that caused it is not yet registered as "matched". See the places where we manually notify the graph when creating a service client (it's similar for a service server).

This doesn't come up with pub/sub because there we almost never use "matched" status as a measure of "ready": checking matched status would require a publisher or subscription, whereas count_publishers() needs no subscription and instead just accumulates graph events and stores the result. For services we need a client in order to wait anyway, so we use the matched status of the underlying datareader/datawriter rather than the accumulated graph state, because we can.
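To illustrate the distinction (sketch only; the topic name is arbitrary and the rcl signatures shown are the current ones, which may have differed slightly at the time):

```cpp
// Contrast between the two readiness measures.
#include "rcl/graph.h"

void illustrate_readiness_checks(rcl_node_t * node, rcl_client_t * client)
{
  // Pub/sub "readiness" checks work off accumulated graph state: no
  // subscription object is needed to count publishers on a topic.
  size_t publisher_count = 0;
  rcl_count_publishers(node, "/chatter", &publisher_count);

  // Service readiness instead asks whether the client's underlying DDS
  // reader/writer are matched, and that matched status can lag behind the
  // local graph guard condition trigger.
  bool is_available = false;
  rcl_service_server_is_available(node, client, &is_available);
}
```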

One potential solution (if my guess is correct) is to notify the graph for services from the "on matched" callback of a DDS listener, rather than from within the local function that creates the pub/sub/service.
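If that guess is right, the change would look roughly like the sketch below: trigger the graph notification from the DDS matched callback rather than from the local create function. This targets the classic Connext C++ listener API and is not actual rmw_connext_cpp code; notify_graph is a placeholder for however the rmw layer triggers its graph guard condition.

```cpp
// Rough sketch: notify the graph only once the writer is actually matched.
#include <functional>
#include <utility>

#include <ndds/ndds_cpp.h>  // assumption: classic Connext C++ headers

class GraphNotifyingWriterListener : public DDSDataWriterListener
{
public:
  explicit GraphNotifyingWriterListener(std::function<void()> notify_graph)
  : notify_graph_(std::move(notify_graph)) {}

  void on_publication_matched(
    DDSDataWriter * writer, const DDS_PublicationMatchedStatus & status) override
  {
    (void)writer;
    (void)status;
    // Only now is the writer matched with a remote reader, so waking graph
    // waiters here cannot produce a "woken but not yet matched" window for
    // rmw_service_server_is_available().
    notify_graph_();
  }

private:
  std::function<void()> notify_graph_;
};
```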

@clalancette

Closing, since with Foxy now being End-of-Life, this repository is no longer used.

clalancette closed this as not planned on Jun 21, 2023