
A flaky test about QosRclcppTestFixture.test_best_available_policies_publisher #440

Closed
iuhilnehc-ynos opened this issue Mar 6, 2023 · 12 comments · Fixed by ros2/system_tests#513

@iuhilnehc-ynos

https://ci.ros2.org/job/ci_linux/18164/testReport/junit/test_quality_of_service/QosRclcppTestFixture__rmw_cyclonedds_cpp/test_best_available_policies_publisher/

Stacktrace
/home/jenkins-agent/workspace/ci_linux/ws/src/ros2/system_tests/test_quality_of_service/test/test_best_available.cpp:128
Expected equality of these values:
  publishers_info.size()
    Which is: 0
  1u
    Which is: 1

Let me try to analyze the reason below.

https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L110
This only makes sure there is a reader in the graph cache's data_readers_:

    rcl_wait_for_subscribers
        ...
        common_context->graph_cache.get_reader_count
            __get_count(data_readers_, ...)

so it cannot guarantee that data_writers_ has been updated yet by the time the next check runs:
https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L126

    publisher->get_publishers_info_by_topic
        Node::get_publishers_info_by_topic
            node_graph_->get_publishers_info_by_topic
                get_info_by_topic(..., rcl_get_publishers_info_by_topic)
                    rcl_get_publishers_info_by_topic(..., rmw_get_publishers_info_by_topic)
                        rmw_get_publishers_info_by_topic in the DDS
                            common_context->graph_cache.get_writers_info_by_topic
                                __get_entities_info_by_topic(data_writers_, ...)

As the data_readers_ and data_writers_ are updated separately,

    impl->common.graph_cache.add_entity(
      gid,
      std::string(s->topic_name),
      std::string(s->type_name),
      ppgid,
      qos_profile,
      is_reader);

If data_writers_ is updated slowly, the test fails intermittently.

Any suggestions or ideas are welcome.

@iuhilnehc-ynos
Author

diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index acfc33d..6e801d0 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -903,6 +903,10 @@ static void handle_builtintopic_endpoint(
         ppgid,
         qos_profile,
         is_reader);
+      // printf("handle_builtintopic_endpoint %d %s\n", is_reader, s->topic_name);
+      if (strcmp(s->topic_name, "rt/test_best_available_publisher") == 0) {
+        dds_sleepfor(DDS_MSECS(10));
+      }
     }
     dds_return_loan(reader, &raw, 1);
   }

The above test patch makes the test case fail easily on my machine.

@eboasson
Collaborator

eboasson commented Mar 6, 2023

If you want that test to work reliably, maybe calling handle_builtintopic_endpoint in rmw_get_publishers_info_by_topic before checking the graph cache would be an option? If there is no data available (the common case), it'd take very little time. (Obviously there are a few analogous cases.)
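As a rough illustration of that suggestion (all names below are invented for the sketch; this is not the actual rmw_cyclonedds_cpp code), a graph query could first drain whatever discovery samples are already available before reading the cache, so the answer reflects everything delivered so far:

```cpp
#include <deque>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Toy stand-in for the graph cache. "pending" plays the role of undelivered
// built-in-topic samples; draining it is analogous to calling
// handle_builtintopic_endpoint before answering the query.
struct GraphCacheSketch
{
  std::mutex mutex;
  std::deque<std::pair<std::string, int>> pending;  // (topic, writer delta)
  std::map<std::string, int> writer_counts;

  void drain_pending()
  {
    std::lock_guard<std::mutex> lock(mutex);
    while (!pending.empty()) {
      const auto & entry = pending.front();
      writer_counts[entry.first] += entry.second;
      pending.pop_front();
    }
  }

  int get_writer_count(const std::string & topic)
  {
    drain_pending();  // cheap when nothing is pending (the common case)
    std::lock_guard<std::mutex> lock(mutex);
    return writer_counts[topic];
  }
};
```

The point of the sketch is only the ordering: drain first, then read, so a sample that was already delivered cannot be missed by the query.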

It's either that, or using the introspection functions provided by DDS without relying on ROS' graph cache, like it used to do before the graph cache was introduced. That'd lead to a lot of extra code and, no doubt, new problems in the future. So that wouldn't be my first choice.

@fujitatomoya
Contributor

Besides, I think adding some sleep would also be reasonable as an implementation-agnostic test fix? (There are many places to do that.)

@eboasson
Collaborator

eboasson commented Mar 6, 2023

I kinda hate being dependent on short sleeps in test cases. That said, we run Cyclone's CI on Azure pipelines with several tests running in parallel, and we seem to get much nastier variations in scheduling than what the ROS 2 CI seems to get.

If a short sleep is sufficient, then that is good enough for a test case, I'd say.

@fujitatomoya
Contributor

Yeah, CI already takes long enough. If it is stable without a sleep, that is much better.

@iuhilnehc-ynos
Author

calling handle_builtintopic_endpoint in rmw_get_publishers_info_by_topic before checking the graph cache would be an option?

@eboasson, thanks for your suggestion, but I am afraid there is no guarantee that data_writers_ will have been updated as well.

adding some sleep would also be reasonable

@fujitatomoya, thank you, this can fix the flakiness (actually, I like doing that), but this kind of code snippet seems undesirable; users might not like it.

If Logic B can't depend on Logic A, can we just remove Logic A and use another way to make sure Logic B succeeds? What I mean is to remove https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L107-L118, and instead use a new utility function with a timeout that calls publisher->get_publishers_info_by_topic until we get what we expect.
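A hedged sketch of such a utility (a hypothetical helper, not an existing rclcpp or system_tests function): poll a caller-supplied predicate, e.g. a lambda wrapping publisher->get_publishers_info_by_topic() and checking the returned size, until it holds or the timeout expires.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical test helper: repeatedly evaluate `predicate` until it
// returns true or `timeout` elapses. The test would pass a lambda such as
//   [&] { return publisher->get_publishers_info_by_topic().size() == 1u; }
bool wait_for_condition(
  const std::function<bool()> & predicate,
  std::chrono::milliseconds timeout,
  std::chrono::milliseconds poll_period = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (predicate()) {
      return true;  // condition observed before the deadline
    }
    std::this_thread::sleep_for(poll_period);
  }
  return predicate();  // one final check at the deadline
}
```

This keeps the test free of a fixed sleep: it returns as soon as discovery has propagated, and only waits the full timeout in the failure case.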

@eboasson
Collaborator

eboasson commented Mar 7, 2023

calling handle_builtintopic_endpoint in rmw_get_publishers_info_by_topic before checking the graph cache would be an option?
@eboasson, thanks for your suggestion, but I am afraid there is no guarantee that data_writers_ will have been updated as well.

The way I was looking at it is that the built-in topics (DCPSPublication, DCPSSubscription, DCPSParticipant) are published synchronously by Cyclone, and so the DCPSPublication sample describing a newly created data writer is visible, in the same thread, in a reader for DCPSPublication by the time the dds_create_writer call returns. The graph cache update functions are also all synchronous.

Following that line of reasoning, the race must be between the discovery_thread thread and the main thread. You're right that simply processing whatever data is currently available in those built-in topic readers doesn't guarantee anything, because it fails to account for whatever that discovery thread is currently doing. But that is, in itself, a manageable problem: it just means synchronising with that thread.
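That "synchronising with that thread" idea could look roughly like the following minimal sketch (all names are invented for illustration; none of this is Cyclone DDS or rmw API): the querying thread records a batch-generation counter, then blocks until the discovery-like worker has completed at least one more batch, after which everything delivered before the query is guaranteed to be in the cache.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Invented synchronisation point between a main thread and a worker that
// processes built-in-topic samples in batches (standing in for the
// discovery thread).
class DiscoverySync
{
public:
  // Called by the main thread to snapshot the current batch generation.
  uint64_t generation() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return generation_;
  }

  // Called by the worker after it finishes processing a batch of samples.
  void mark_batch_done()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    ++generation_;
    cv_.notify_all();
  }

  // Called by the main thread: block until a batch completes *after* the
  // observed snapshot, so any sample delivered before the snapshot has
  // definitely been folded into the cache.
  void wait_past(uint64_t observed)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] {return generation_ > observed;});
  }

private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  uint64_t generation_ = 0;
};
```

Snapshotting the generation before waiting avoids the lost-wakeup race: even if the worker finishes its batch between the snapshot and the wait, the predicate is already true and wait_past returns immediately.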

I suspect that's not worth it for a test, but it would be if application code is expected to make the same assumptions as the test. Or, alternatively, you could document there is no guarantee that rmw_get_publishers_info_by_topic (&c.) are always consistent with events that causally preceded it. Or something like that.

@clalancette clalancette added this to To do in Iron Irwini via automation Mar 28, 2023
@clalancette
Contributor

I'm just taking another look at this test, and I'm actually wondering if the bug is something slightly different.

In this test, we are instantiating a QosTestPublisher at line 100, and a QosTestSubscriber at line 103. Each of those contain an rclcpp::Publisher and rclcpp::Subscription, respectively, but crucially they do not instantiate them until toggle is called.

So at line 105, we call toggle on the subscriber, which actually instantiates the rclcpp::Subscription and causes the RMW/DDS machinery to run and do discovery. In line 110, we wait for the subscriber with a 5 second timeout, which is enough time in almost all cases for that discovery to have happened.

However, what we do next is suspicious. In line 120, we call publisher->toggle(), which is the first time that we are instantiating the underlying rclcpp::Publisher. And then we are immediately calling publisher->get_publishers_info_by_topic(), which doesn't wait around at all; it just returns what it has right now. So we have not given discovery any time to happen, and update the graph cache (which is what we are looking at here).

So with all of that said, I think the fix here is to have the test attempt to get_publishers_info_by_topic a few times before giving up. That will give discovery time to happen in almost all cases.

What does everybody think of that?

@eboasson
Collaborator

@clalancette

So with all of that said, I think the fix here is to have the test attempt to get_publishers_info_by_topic a few times before giving up. That will give discovery time to happen in almost all cases.

That works, and I'd be completely fine with that if it is considered acceptable that discovery can lag for events that occurred prior to the query in a causal ordering. (I hope I am formulating it sufficiently incomprehensibly to get away with the statement 😂)

It is just that I have found time and again that such asynchronous behaviour even when the creation and the check are on the same thread trips people up and results in flaky application behaviour. This holds even when it is properly documented, because nobody really reads the documentation anyway until they have figured out the problem.

On the other hand, I suspect hardly anyone will ever run into this outside of test cases. And so while it is possible to modify the discovery thread and get_publishers_info_by_topic (& friends) to guarantee you will see the publisher (in this example), it may not be worth the effort.

I'm cool with it either way, all I want to be sure of is that pros and cons are weighed. I kinda like the option of checking a few times in this test (like you propose) and making sure there is a note in the documentation of get_publishers_info_by_topic (& friends) that discovery may be asynchronous and the result need not reflect events immediately, including those caused by the calling thread.

That way, the documentation tries to warn anyone of the trap while leaving open the possibility of giving stronger guarantees in the future, should that turn out to be worth the effort.

@clalancette
Contributor

Thanks for the response @eboasson .

That works, and I'd be completely fine with that if it is considered acceptable that discovery can lag for events that occurred prior to the query in a causal ordering. (I hope I am formulating it sufficiently incomprehensibly to get away with the statement 😂)

I understand what you are saying (and I think this is a reiteration of what you said before, I just didn't understand it then :).

It is just that I have found time and again that such asynchronous behaviour even when the creation and the check are on the same thread trips people up and results in flaky application behaviour. This holds even when it is properly documented, because nobody really reads the documentation anyway until they have figured out the problem.

Yeah, that worries me too. But I would say that this particular issue has always potentially been there.

On the other hand, I suspect hardly anyone will ever run into this outside of test cases. And so while it is possible to modify the discovery thread and get_publishers_info_by_topic (& friends) to guarantee you will see the publisher (in this example), it may not be worth the effort.

We'll have to see. It might be worthwhile to do that, but there is probably a trade-off involved.

I'm cool with it either way, all I want to be sure of is that pros and cons are weighed. I kinda like the option of checking a few times in this test (like you propose) and making sure there is a note in the documentation of get_publishers_info_by_topic (& friends) that discovery may be asynchronous and the result need not reflect events immediately, including those caused by the calling thread.

Good call! I've gone ahead and added a note to the RMW documentation in ros2/rmw#352

@fujitatomoya
Contributor

@clalancette @eboasson

cannot agree more.

fix here is to have the test attempt to get_publishers_info_by_topic a few times before giving up. That will give discovery time to happen in almost all cases.

@clalancette are you going to address this fix as well as doc update?

@clalancette
Contributor

@clalancette are you going to address this fix as well as doc update?

I think @iuhilnehc-ynos already did this in ros2/system_tests#513 . Along with my documentation update in ros2/rmw#352 , I think that should cover everything we talked about here.

Iron Irwini automation moved this from In progress to Done Apr 4, 2023