Fix Action CI tests to pass reliably #376
lgtm with ci
This is just re-ordering and changing the QoS to use durability instead of reliability, no?
OK, so my goal here was to make these tests pass reliably; the incorrect logic was just something I noticed along the way. What we saw on this Action CI run was the following
So we are effectively seeing two failure cases, both on FastRTPS. CycloneDDS performs these tests very reliably (hundreds of repetitions locally without problem), whereas I am able to reproduce the FastRTPS failures on repeat, usually on just the second iteration. Specifically, the problem tests are
Update on What this is starting to look like is that the publisher is being torn down before it is actually able to deliver the last message in the bag (but This makes it seem that there is a very slight timing condition in some thread (in Fast-RTPS?) that is able to complete its work to deliver the final message if we give it a little bit of time before destroying the Player.
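To make the suspected race concrete, here is a toy model of it. This is not Fast-RTPS or rosbag2 code: the `AsyncPublisher` class, its `destroy` method, and the `drain` flag are all hypothetical names made up for illustration. It only assumes that delivery happens on a background thread, so tearing the publisher down immediately can lose whatever is still queued, while waiting for the queue to empty first cannot.

```python
import queue
import threading


class AsyncPublisher:
    """Hypothetical publisher that delivers messages on a background thread."""

    def __init__(self, sink):
        self._sink = sink                      # list standing in for the subscriber
        self._pending = queue.Queue()          # messages not yet delivered
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def publish(self, msg):
        # Returns immediately; actual delivery happens later on the worker.
        self._pending.put(msg)

    def _run(self):
        while not self._stop.is_set():
            try:
                msg = self._pending.get(timeout=0.01)
            except queue.Empty:
                continue
            self._sink.append(msg)
            self._pending.task_done()

    def destroy(self, drain):
        # drain=True waits until every queued message has been delivered.
        # drain=False stops right away, so trailing messages may be dropped.
        if drain:
            self._pending.join()
        self._stop.set()
        self._worker.join()


received = []
pub = AsyncPublisher(received)
for i in range(5):
    pub.publish(i)
pub.destroy(drain=True)   # all five messages are delivered before teardown
```

With `drain=False` the number of delivered messages depends on thread timing, which matches the "one fewer message than expected" symptom seen in the test.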
Something in the upgrade to Fast-RTPS 1.10.x has caused this change in behavior (ros2/ros2#888). If I check out Fast-RTPS 1.9.x then the test works reliably again.
Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
da358e4 to 5b96833
Latest update to this fixes the
Maybe this has to do with the default QoS for the publisher manager here
Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
None of the failures ended up being QoS-related. See my latest version of the PR description. @Karsten1987 @thomas-moulard this is ready for re-review. I believe the Action CI will pass now.
Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
LGTM
Action CI is now green again (finally). The instability in the ci.ros2.org runs above is the ros-tracing CMake warning across the board, plus the known compression tests failing on Windows. Merging.
Note: this description has been updated from the original PR.
In our regular CI runs, since the upgrade to Fast-RTPS 1.10.x (), we have been seeing consistent failures in the following tests on our Action CI:
The test_play failure is receiving one fewer message than expected. The test checks that a specific number of messages arrive, when we only really care that any messages arrive at all. The upgrade to Fast-RTPS 1.10.x introduced some nondeterministic behavior on shutdown and made it uncertain whether the final message in a playback would be delivered. This particular test doesn't care about that behavior, so it has been modified to expect fewer messages.

The test_record_all failure is receiving more messages than it expects. The extra message is a warning passed on /rosout (introduced here https://app.chime.aws/meetings/7596499512). The warning is irrelevant to the test, so I changed it to expect at least the number of messages we are looking for; the specific by-type checks are sufficient.

Signed-off-by: Emerson Knapp <emerson.b.knapp@gmail.com>
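The two relaxed checks described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the actual rosbag2 test code: the helper names, the topic names, and the message type strings are assumptions chosen for the example.

```python
from collections import Counter


def count_by_type(messages):
    """Count (topic, type) message records per message type name."""
    return Counter(msg_type for _topic, msg_type in messages)


def record_check_passes(recorded, expected_by_type):
    """Relaxed test_record_all-style check (hypothetical helper).

    Requires at least the expected total, so unrelated extras such as a
    /rosout warning are tolerated, while each expected type must still
    match its exact count.
    """
    expected_total = sum(expected_by_type.values())
    by_type = count_by_type(recorded)
    return len(recorded) >= expected_total and all(
        by_type[msg_type] == n for msg_type, n in expected_by_type.items()
    )


def play_check_passes(received, minimum=1):
    """Relaxed test_play-style check: some messages arrived, not an exact count."""
    return len(received) >= minimum


# Illustrative data: two expected types plus one unrelated /rosout warning.
recorded = [
    ("/string_topic", "test_msgs/msg/Strings"),
    ("/string_topic", "test_msgs/msg/Strings"),
    ("/array_topic", "test_msgs/msg/Arrays"),
    ("/rosout", "rcl_interfaces/msg/Log"),
]
expected_by_type = {"test_msgs/msg/Strings": 2, "test_msgs/msg/Arrays": 1}

assert len(recorded) != sum(expected_by_type.values())   # an exact-count check would fail here
assert record_check_passes(recorded, expected_by_type)   # the relaxed check passes
```

The by-type comparison is what keeps the relaxed total from hiding a real regression: a missing expected message still fails the check even when an unrelated extra message pads the overall count.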