-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A flaky test about QosRclcppTestFixture.test_best_available_policies_publisher #440
Comments
diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index acfc33d..6e801d0 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -903,6 +903,10 @@ static void handle_builtintopic_endpoint(
ppgid,
qos_profile,
is_reader);
+ // printf("handle_builtintopic_endpoint %d %s\n", is_reader, s->topic_name);
+ if (strcmp(s->topic_name, "rt/test_best_available_publisher") == 0) {
+ dds_sleepfor(DDS_MSECS(10));
+ }
}
dds_return_loan(reader, &raw, 1);
} The above test patch makes the test case fail easily on my machine. |
If you want that test to work reliably, maybe calling It's either that, or using the introspection functions provided by DDS without relying on ROS' graph cache, like it used to do before the graph cache got introduced. That'd lead to a lot of extra code and, no doubt, new problems in the future. So that'd not be my first choice. |
besides, I think adding some sleep would be also reasonable as implementation-agnostic test case? (there are many places to do that.) |
I kinda hate being dependent on short sleeps in test cases. That said, we run Cyclone's CI on Azure pipelines with several tests running in parallel, and we seem to get much nastier variations in scheduling than what the ROS 2 CI seems to get. If a short sleep is sufficient, then that is good enough for a test case, I'd say. |
yeah, CI takes long enough time already. If it is stable w/o sleep, that is much better. |
@eboasson, thanks for your suggestion, but I am afraid that it's no guarantee that the
@fujitatomoya, thank you, this can fix the flaky issue(actually, I like doing that), but this kind of code snippet seems no good, users might not like it. If |
The way I was looking at is that the built-in topics (DCPSPublication, DCPSSubscription, DCPSParticipant) are published synchronously by Cyclone, and so the DCPSPublication sample describing a newly data writer in the same thread is visible in a reader for DCPSPublication by the time the Following that line of reasoning, the race must be between the I suspect that's not worth it for a test, but it would be if application code is expected to make the same assumptions as the test. Or, alternatively, you could document there is no guarantee that |
I'm just taking another look at this test, and I'm actually wondering if the bug is something slightly different. In this test, we are instantiating a So at line 105, we call However, what we do next is suspicious. In line 120, we call So with all of that said, I think the fix here is to have the test attempt to What does everybody think of that? |
That works, and I be completely fine with that if it is considered acceptable that discovery can lag for events that occurred prior to the query in a causal ordering. (I'm hope I am formulating it sufficiently incomprehensibly to get away with the statement 😂) It is just that I have found time and again that such asynchronous behaviour even when the creation and the check are on the same thread trips people up and results in flaky application behaviour. This holds even when it is properly documented, because nobody really reads the documentation anyway until they have figured out the problem. On the other hand, I suspect hardly anyone will ever run into this outside of test cases. And so while it is possible to modify the discovery thread and I'm cool with it either way, all I want to be sure of is that pros and cons are weighed. I kinda like the option of checking a few times in this test (like you propose) and making sure there is a note in the documentation of That way, the documentation tries to warn anyone of the trap while leaving open the possibility of giving stronger guarantees in the future, should that turn out to be worth the effort. |
Thanks for the response @eboasson .
I understand what you are saying (and I think this is a reiteration of what you said before, I just didn't understand it then :).
Yeah, that worries me too. But I would say that this particular issue has always potentially been there.
We'll have to see. It might be worthwhile to do that, but there is probably a trade-off involved.
Good call! I've gone ahead and added in a note to the RMW documentation in ros2/rmw#352 |
cannot agree more.
@clalancette are you going to address this fix as well as doc update? |
I think @iuhilnehc-ynos already did this in ros2/system_tests#513 . Along with my documentation update in ros2/rmw#352 , I think that should cover everything we talked about here. |
https://ci.ros2.org/job/ci_linux/18164/testReport/junit/test_quality_of_service/QosRclcppTestFixture__rmw_cyclonedds_cpp/test_best_available_policies_publisher/
Try to analyze the reason below,
https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L110
which just makes sure there is a reader in the
data_readers_
of graphrcl_wait_for_subscribers ... common_context->graph_cache.get_reader_count __get_count(data_readers_, ...)
and it can't promise that the
data_writers_
will be updated immediately.https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L126
publisher->get_publishers_info_by_topic Node::get_publishers_info_by_topic node_graph_->get_publishers_info_by_topic get_info_by_topic(..., rcl_get_publishers_info_by_topic) rcl_get_publishers_info_by_topic(..., rmw_get_publishers_info_by_topic) rmw_get_publishers_info_by_topic in the DDS common_context->graph_cache.get_writers_info_by_topic __get_entities_info_by_topic(data_writers_, ...)
As the
data_readers_
anddata_writers_
are updated separately,rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp
Lines 899 to 905 in 4bec4c8
If the
data_writers_
is slowly updated, the flaky test happens.Any suggestion or idea is welcomed.
The text was updated successfully, but these errors were encountered: