Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the reliability of test_get_type_description_service. #1107

Merged

Conversation

clalancette
Copy link
Contributor

We were seeing occasional failures of test_get_type_description_service on all platforms.

It turns out that the removal of a service from the graph cache is an asynchronous operation. In particular, all of our current RMW implementations publish a graph update to the graph topic, and only remove things from the graph once that data has been delivered. This can potentially happen in another thread.

That means that immediately after service_fini(), a call to get_service_names_and_types() may actually still have the old service name in it. Since our get_type_description tests were relying on this to go away, this was causing it to be flakey.

We fix this here by adding in a new helper function, service_not_exists(). This helper is not just the inverse of service_exists() (which returns immediately when it finds the service in the graph cache). Instead, service_not_exists() waits until the service has left the cache before returning (or times out).

In my testing, this fixes the flakiness locally.

We were seeing occasional failures of test_get_type_description_service
on all platforms.

It turns out that the removal of a service from the graph
cache is an asynchronous operation.  In particular, all of
our current RMW implementations publish a graph update to
the graph topic, and only remove things from the graph once
that data has been delivered.  This can potentially happen
in another thread.

That means that immediately after service_fini(), a call to
get_service_names_and_types() may actually still have the
old service name in it.  Since our get_type_description
tests were relying on this to go away, this was causing it
to be flakey.

We fix this here by adding in a new helper function,
service_not_exists().  This helper is not just the inverse
of service_exists() (which returns immediately when it finds
the service in the graph cache).  Instead, service_not_exists()
waits until the service has *left* the cache before returning
(or times out).

In my testing, this fixes the flakiness locally.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
@clalancette
Copy link
Contributor Author

clalancette commented Oct 4, 2023

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status
  • Linux Build Status

@claraberendsen
Copy link

claraberendsen commented Oct 4, 2023

This would close #1108

std::this_thread::sleep_for(std::chrono::milliseconds(100));
std::this_thread::sleep_for(std::chrono::milliseconds(10));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nice to have) adding argument std::chrono::milliseconds period would be more flexible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nice to have) adding argument std::chrono::milliseconds period would be more flexible.

Yeah, that's true. Given that this is only used in this test, it would be easy enough to add later if we need the flexibility. So I'm going to not do that for now, but thanks for the idea.

@clalancette clalancette merged commit 0a537d2 into rolling Oct 4, 2023
2 checks passed
@clalancette clalancette deleted the clalancette/improve-get-type-desc-test-reliability branch October 4, 2023 21:51
clalancette added a commit that referenced this pull request Jun 25, 2024
In particular, make sure to match all service_description_init
with the appropriate finis.  While we are in here, also add in
a "service_not_exists" function that will quit much earlier,
speeding up the test.

This is essentially a backport of #1107
and parts of #1112, but adapted
for Iron.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
clalancette added a commit that referenced this pull request Jun 25, 2024
In particular, make sure to match all service_description_init
with the appropriate finis.  While we are in here, also add in
a "service_not_exists" function that will quit much earlier,
speeding up the test.

This is essentially a backport of #1107
and parts of #1112, but adapted
for Iron.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants