-
Notifications
You must be signed in to change notification settings - Fork 35
race condition in graph changes and service is available #201
Comments
I also had to make this pr to rclcpp to work around this problem: ros2/rclcpp#262 I believe it could be reversed once we resolve this. I did not condition this case on Connext only. It doesn't hurt to do this with other rmw implementations that can pass the test unmodified. |
Also changed tests in This only happens on Connext, and perhaps could be made more strict again (i.e. enforce one publish and one message received) if this is fixed. Also see: #168 (comment) |
I have been looking at the implementation of
Maybe what causes the issue in #280 is that the graph guard condition is triggered for one of the two (request subscriber/response publisher), and then we miss the other one's "event" because we are not waiting on a graph change at the time it occurs, we are busy responding to the first event. That might explain why we then end up waiting indefinitely for another graph event to occur: it already happened while we were responding to the first. (Don't have evidence for this, just an idea). However, even if that is what is happening for the race condition in services becoming available, I don't think it could explain the race condition presented in this thread about services being reported as no longer available, because if either the request subscriber or the response publisher goes away (triggering the guard condition), the So, I'm not sure that this is the issue here, but wanted to document it as a potential lead. |
@dhood That's a reasonable idea, though my understanding is that if you trigger a guard condition while not waiting on it, then the next time you go into wait you will still wake up immediately, i.e. guard conditions triggered when not waiting are not "lost". I'm pretty sure that's how they work, but it would be worth checking. My suspicion for why this is flakey is due to the way graph events for local topics are implemented. Basically when the user calls
This also doesn't occur with pub/sub since in those cases we almost never use "matched" status as a measure for "ready" as we would need a publisher or subscription to check matched status, whereas One potential solution (if my guess is correct) is to notify the graph for services with the "on matched" callback from a DDS listener and not within the local function to create a pub/sub/service. |
Closing, since with Foxy now being End-of-Life, this repository is no longer used. |
I noticed this when debugging the flaky test in
rcl
calledtest_rcl_service_server_is_available
which is in thercl/test/rcl/test_graph.cpp
file:https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L477
The race seems to be between the graph guard condition being triggered (and waiting wait sets being woken up):
https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L523
And the
rcl_service_server_is_available
function reporting that a service that was previously available is no longer available:https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L542
Normally the test only checks this when a change occurs in the graph, but this caused this test to fail with connext periodically. So I added a condition for connext where it will check on each loop regardless of whether or not a graph change was detected:
https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L525-L538
The
rcl_service_server_is_available
function normally reported the right state on the next loop. This special case for connext should be removed after this is fixed.This could be caused by graph changes getting combined through some sort of coalescing of events or it could be a delay introduced by connext, I'm not sure yet. I've decided to work around and document the issue rather than solve it now.
The text was updated successfully, but these errors were encountered: