Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Action CI tests to pass reliably #376

Merged
merged 3 commits into from
Apr 15, 2020

Conversation

emersonknapp
Copy link
Collaborator

@emersonknapp emersonknapp commented Apr 14, 2020

Note: this description is updated over the original PR

In our regular CI runs, since the upgrade to Fast-RTPS 1.10.x (), we have been seeing consistent failures in the following tests on our Action CI:

2020-04-14T20:21:18.5513377Z The following tests FAILED:
2020-04-14T20:21:18.5513644Z 	 11 - test_record_all__rmw_fastrtps_cpp (Failed)
2020-04-14T20:21:18.5513914Z 	 18 - test_play__rmw_fastrtps_cpp (Failed)
2020-04-14T20:21:18.5514188Z 	 20 - test_record_all__rmw_fastrtps_dynamic_cpp (Failed)
2020-04-14T20:21:18.5514448Z 	 27 - test_play__rmw_fastrtps_dynamic_cpp (Failed)

The test_play failure is receiving one fewer message than it testing that a specific number of messages arrive, when we only really care that any messages arrive at all for that test. The upgrade to Fast-RTPS 1.10.x introduced a some nondeterministic behavior on shutdown and made it uncertain whether the final message in a playback would be delivered. This particular test doesn't care about that behavior so it has been modified to expect fewer messages.

The test_record_all failure is receiving more messages than it expects. The extra message is a warning passed on /rosout (introduced here https://app.chime.aws/meetings/7596499512) - the warning is irrelevant to the test, I change it to expect at least the number of messages we are looking for, the specific by-type checks are sufficient.

Signed-off-by: Emerson Knapp emerson.b.knapp@gmail.com

Copy link
Collaborator

@Karsten1987 Karsten1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with ci

Copy link
Contributor

@piraka9011 piraka9011 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just re-ordering and changing the QoS to use durability instead of reliability no?

@emersonknapp
Copy link
Collaborator Author

OK - so my goal here was to make these tests pass reliably, the incorrect logic was just something I saw.

What we saw on this Action CI run was the following

2020-04-14T20:21:18.5513377Z The following tests FAILED:
2020-04-14T20:21:18.5513644Z 	 11 - test_record_all__rmw_fastrtps_cpp (Failed)
2020-04-14T20:21:18.5513914Z 	 18 - test_play__rmw_fastrtps_cpp (Failed)
2020-04-14T20:21:18.5514188Z 	 20 - test_record_all__rmw_fastrtps_dynamic_cpp (Failed)
2020-04-14T20:21:18.5514448Z 	 27 - test_play__rmw_fastrtps_dynamic_cpp (Failed)

So we are effectively seeing two failure cases, both on FastRTPS. CycloneDDS performs these tests very reliably (hundreds of repetitions locally without problem), whereas I am able to reproduce the FastRTPS failures on repeat, usually on just the second iteration.

Specifically, the problem tests are

  1. RosBag2PlayQosOverrideTestFixture.topic_qos_profiles_overridden
  2. RecordIntegrationTestFixture.published_messages_from_multiple_topics_are_recorded

Case: topic_qos_profiles_overridden

When this test fails, it always has received 2 out of the 3 expected messages. This is with reliable transport, so delivery retries are expected. This occurs even when adding a large initial padding time to aid discovery.

When checking the order of messages - the first 2 published messages are received, and the third message is never received, then the test times out.

I am now investigating further why this would happen.

Case: published_messages_from_multiple_topics_are_recorded

This test is failing right now after receiving 5 messages, when only 4 are expected!

I am also reproducing this locally, investigating further why this would happen.

@emersonknapp emersonknapp changed the title Make topic_qos_profiles_overridden actually test something Fix Action CI tests to pass reliably Apr 14, 2020
@emersonknapp
Copy link
Collaborator Author

Update on play test: If I publish one more message (4) than I want to receive (3), I always receive that third message, but never the last one because the test gets shut down.

What this is starting to look like is that the publisher is being torn down before it is actually able to deliver the last message in the bag (but publish is definitely called). If I add a std::this_thread::sleep_for(std::chrono::milliseconds(1)); before the player is torn down here - the test starts working reliably for me.

This makes it seem that there is a very slight timing condition in some thread (in Fast-RTPS?) that is able to complete its work to deliver the final message if we give it a little bit of time before destroying the Player.

@emersonknapp
Copy link
Collaborator Author

emersonknapp commented Apr 15, 2020

Something in the upgrade to Fast-RTPS 1.10.x has caused this change in behavior (ros2/ros2#888)

If I check out Fast-RTPS 1.9.x then the test works reliably again.

Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
@emersonknapp emersonknapp force-pushed the emersonknapp/correct-play-test branch from da358e4 to 5b96833 Compare April 15, 2020 17:27
@emersonknapp
Copy link
Collaborator Author

Latest update to this fixes the test_play issue. Still investigating test_record_all

@piraka9011
Copy link
Contributor

Maybe this has to do with the default QoS for the publisher manager here

Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
@emersonknapp
Copy link
Collaborator Author

None of the failures ended up being QoS-related. See my latest version of the PR description.

@Karsten1987 @thomas-moulard this is read for re-review, I believe the Action CI will pass now

Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
@emersonknapp
Copy link
Collaborator Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Contributor

@thomas-moulard thomas-moulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@emersonknapp
Copy link
Collaborator Author

Action CI now green again (finally) - and above ci.ros2.org instability is the ros-tracing CMake warning across the board, plus the known compression windows tests failing on Windows.

Merging

@emersonknapp emersonknapp merged commit 9909f8c into master Apr 15, 2020
@delete-merged-branch delete-merged-branch bot deleted the emersonknapp/correct-play-test branch April 15, 2020 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants