Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address flakiness in rosbag2_play_end_to_end tests #1297

Merged
merged 12 commits into from
May 12, 2023

Conversation

MichaelOrlov
Copy link
Contributor

@MichaelOrlov MichaelOrlov commented Apr 15, 2023

  • Closes test_rosbag2_play_end_to_end flaky #408
    Make test_rosbag2_play_end_to_end more deterministic
  • Change QoS depth in test databases to correspond number of messages
  • Change QoS durability to transient local in test DB and mcap file
  • Explicitly specify QoS depth = 10 for subscribers
  • Explicitly specify QoS reliability to reliable for subscribers
  • Explicitly specify QoS durability to transient local for subscribers
  • Update metadata in test DB and mcap files to the latest version(7)
  • Start player in pause mode and wait on subscribers for matched
    publishers from player then send resume service call to unpause.
  • Add spin_and_wait_for_matched(topic_names) for SubscriptionManager
  • Remove xfail test_rosbag2_play_end_to_end

Add wait_for_matched for record_end_to_end_exits_gracefully_on_sigterm

  • Remove xfail for test_rosbag2_info_end_to_end

Fix for play_filters_by_topic test

  • Uncomment play_filters_by_topic test
  • Use proper qos settings for subscribers in play_filters_by_topic
    and fix expectations about number of published messages.
  • Log warning if SubscriptionManager::continue_spinning(..) finished by
    timeout.
  • Enable play_end_to_end_test for windows.

Note: Didn't remove xfail for test_rosbag2_record_end_to_end because tests fails on Windows build due to absence of the console control handling for CTRL_C_EVENT and CTRL_SHUTDOWN_EVENT events.
Will address test_rosbag2_record_end_to_end tests in follow up PRs.

@MichaelOrlov
Copy link
Contributor Author

Gist: https://gist.githubusercontent.com/MichaelOrlov/001cdcfa299a0cd40badfb30fae6aa8c/raw/c737f0f4f0068f00fccb96bf909caf35cb4fb722/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_tests
TEST args: --packages-above rosbag2_tests
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/11915

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

- Change QoS depth in test databases to correspond number of messages
- Change QoS durability to transient local in test DB and mcap file
- Explicitly specify QoS depth = 10 for subscribers
- Explicitly specify QoS reliability to reliable for subscribers
- Explicitly specify QoS durability to transient local for subscribers
- Update metadata in test DB and mcap files to the latest version(7)
- Remove xfail test_rosbag2_play_end_to_end

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
- Remove xfail for test_rosbag2_info_end_to_end

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov MichaelOrlov force-pushed the morlov/address_flakiness_in_end_to_end_tests branch from 3cebc12 to a41482e Compare May 4, 2023 23:52
@MichaelOrlov MichaelOrlov changed the title Address flakiness in rosbag2 end to end tests Address flakiness in rosbag2_play_end_to_end tests May 5, 2023
@MichaelOrlov
Copy link
Contributor Author

Re-run CI with --retest-until-fail 5 after returning xfail for test_rosbag2_record_end_to_end and enforcing QoS settings for subscribers.
Gist: https://gist.githubusercontent.com/MichaelOrlov/001cdcfa299a0cd40badfb30fae6aa8c/raw/c737f0f4f0068f00fccb96bf909caf35cb4fb722/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_tests
TEST args: --packages-above rosbag2_tests
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12008

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov
Copy link
Contributor Author

@ros-pull-request-builder retest this please

@MichaelOrlov
Copy link
Contributor Author

Re-run CI with --retest-until-fail 5 after fixing and enabling play_filters_by_topic test.
Gist: https://gist.githubusercontent.com/MichaelOrlov/001cdcfa299a0cd40badfb30fae6aa8c/raw/c737f0f4f0068f00fccb96bf909caf35cb4fb722/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests
TEST args: --packages-above rosbag2_test_common rosbag2_tests
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12009

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov MichaelOrlov force-pushed the morlov/address_flakiness_in_end_to_end_tests branch from 93b546b to 31c07f0 Compare May 5, 2023 07:34
- Uncomment play_filters_by_topic test
- Use proper qos settings for subscribers in `play_filters_by_topic`
and fix expectations about number of published messages.
- Log warning if SubscriptionManager::continue_spinning(..) finished by
timeout.
- Enable `play_end_to_end_test` for windows.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
- Start player in pause mode and wait on subscribers for matched
publishers from player then send resume service call to unpause.
- Add spin_and_wait_for_matched(topic_names) for SubscriptionManager

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov MichaelOrlov force-pushed the morlov/address_flakiness_in_end_to_end_tests branch from 31c07f0 to bc48555 Compare May 6, 2023 08:51
@MichaelOrlov MichaelOrlov marked this pull request as ready for review May 6, 2023 16:23
@MichaelOrlov MichaelOrlov requested a review from a team as a code owner May 6, 2023 16:23
@MichaelOrlov MichaelOrlov requested review from emersonknapp, jhdcs, clalancette and james-rms and removed request for a team May 6, 2023 16:23
Copy link
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few pieces of minor feedback.

MichaelOrlov and others added 3 commits May 8, 2023 14:28
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
…anager.hpp


Co-authored-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

Co-authored-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov
Copy link
Contributor Author

Re-run CI with --retest-until-fail 5 after review.

BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests
TEST args: --packages-above rosbag2_test_common rosbag2_tests
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12033

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov
Copy link
Contributor Author

@clalancette Fair warning about or CI.
I know that there are race condition in process destruction in this PR and play_filters_by_topic test fails every other time on my local machine. But our CI didn't catch it for Linux versions even with --retest-until-fail 5. And as you can see rpr job also passes green.
This is not a first time I see that our CI not catching pretty obvious failures to witch it was pretty sensitive before (priore to the code freeze for Iron).
It looks like now we have pretty performant machines in CI and they are not very loaded with anything else.
IMO it should be some balanced load on CI infrastructure to catch corner cases and race conditions.

- Added wait_until_completion(process_id, timeout) helper function

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov
Copy link
Contributor Author

Re-run CI with --retest-until-fail 5 after addressing race condition in process destruction.

BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests
TEST args: --packages-above rosbag2_test_common rosbag2_tests
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12034

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov
Copy link
Contributor Author

@clalancette FYI I have fixed race conditions in process termination routines. Which was causing intermittent test failures due to sending SIGINT signal when the process was already in the destruction phase.
Now we will wait to finish rosbag2 play process instead of trying forcefully stop it.
See my latest commit 3af5b41 for details.

@clalancette
Copy link
Contributor

It looks like this probably needs some PUBLIC annotations for Windows CI to pass.

…rror

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov
Copy link
Contributor Author

@clalancette Nop. It's a bit more straightforward. The Windows build was failing because I start using wait_until_completion inside stop_execution and the former one was declared after the latter. I made a fix in the new commit.

Re-running CI for Windows.

  • Windows Build Status

@MichaelOrlov
Copy link
Contributor Author

@clalancette This PR has ABI and API changes in the test framework, although I want to backport it to the Iron branch if you wouldn't mind. It will help us to have a more stable CI for Iron.

@clalancette
Copy link
Contributor

@clalancette This PR has ABI and API changes in the test framework, although I want to backport it to the Iron branch if you wouldn't mind. It will help us to have a more stable CI for Iron.

Yeah, that should be fine here.

@MichaelOrlov
Copy link
Contributor Author

Re-run Windows CI after fixing build errors. It was copy-past from Unix version and process_id shall be renamed to the handle

Re-running CI for Windows.

  • Windows Build Status

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov MichaelOrlov force-pushed the morlov/address_flakiness_in_end_to_end_tests branch from c7beaef to 3966c81 Compare May 12, 2023 07:03
@MichaelOrlov
Copy link
Contributor Author

Re-running CI for Windows after fixing the compilation error from the previous run.

  • Windows Build Status

@MichaelOrlov
Copy link
Contributor Author

CI failed on Windows in the test_rosbag2_play_end_to_end with error message

C:\ci\ws\src\ros2\rosbag2\rosbag2_tests\test\rosbag2_tests\test_rosbag2_play_end_to_end.cpp(80): error: Expected equality of these values:
  future.wait_for(service_call_timeout_)
    Which is: 4-byte object <01-00 00-00>
  std::future_status::ready
    Which is: 4-byte object <00-00 00-00>

Line test_rosbag2_play_end_to_end.cpp(80) corresponds to the

EXPECT_EQ(future.wait_for(service_call_timeout_), std::future_status::ready);

Which means timeout for getting a response from a service call. The timeout is defined there as 4 seconds. It might be not enough for Windows.

@MichaelOrlov
Copy link
Contributor Author

Re-running CI for Windows after increasing timeout for service call wait.

  • Windows Build Status

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov MichaelOrlov force-pushed the morlov/address_flakiness_in_end_to_end_tests branch from 7a697a0 to 0e5ab52 Compare May 12, 2023 20:27
@MichaelOrlov MichaelOrlov merged commit af4ca0c into rolling May 12, 2023
@delete-merged-branch delete-merged-branch bot deleted the morlov/address_flakiness_in_end_to_end_tests branch May 12, 2023 20:29
@MichaelOrlov
Copy link
Contributor Author

https://github.com/Mergifyio backport iron

@mergify
Copy link

mergify bot commented May 12, 2023

backport iron

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request May 12, 2023
* Make test_rosbag2_play_end_to_end more deterministic

- Change QoS depth in test databases to correspond number of messages
- Change QoS durability to transient local in test DB and mcap file
- Explicitly specify QoS depth = 10 for subscribers
- Explicitly specify QoS reliability to reliable for subscribers
- Explicitly specify QoS durability to transient local for subscribers
- Update metadata in test DB and mcap files to the latest version(7)
- Remove xfail test_rosbag2_play_end_to_end

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Add wait_for_matched for record_end_to_end_exits_gracefully_on_sigterm

- Remove xfail for test_rosbag2_info_end_to_end

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Fix for play_filters_by_topic test

- Uncomment play_filters_by_topic test
- Use proper qos settings for subscribers in `play_filters_by_topic`
and fix expectations about number of published messages.
- Log warning if SubscriptionManager::continue_spinning(..) finished by
timeout.
- Enable `play_end_to_end_test` for windows.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Make test_rosbag2_play_end_to_end deterministic

- Start player in pause mode and wait on subscribers for matched
publishers from player then send resume service call to unpause.
- Add spin_and_wait_for_matched(topic_names) for SubscriptionManager

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Remove redundant includes from test_rosbag2_play_end_to_end.cpp

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Sleep for a few milliseconds in SubscriptionManager to avoid busy loop

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Update rosbag2_test_common/include/rosbag2_test_common/subscription_manager.hpp

Co-authored-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

Co-authored-by: Chris Lalancette <clalancette@gmail.com>

* Add missing include<thread> in process_execution_helpers_unix.hpp

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Address race condition in process termination routines

- Added wait_until_completion(process_id, timeout) helper function

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Move wait_until_completion before stop_execution to fix compilation error

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Fix for Windows build error. Rename process_id to handle.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Increase timeout for service call and wait_until_completion up to 10 sec

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

---------

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
Co-authored-by: Chris Lalancette <clalancette@gmail.com>
(cherry picked from commit af4ca0c)
MichaelOrlov added a commit that referenced this pull request May 12, 2023
* Make test_rosbag2_play_end_to_end more deterministic

- Change QoS depth in test databases to correspond number of messages
- Change QoS durability to transient local in test DB and mcap file
- Explicitly specify QoS depth = 10 for subscribers
- Explicitly specify QoS reliability to reliable for subscribers
- Explicitly specify QoS durability to transient local for subscribers
- Update metadata in test DB and mcap files to the latest version(7)
- Remove xfail test_rosbag2_play_end_to_end

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Add wait_for_matched for record_end_to_end_exits_gracefully_on_sigterm

- Remove xfail for test_rosbag2_info_end_to_end

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Fix for play_filters_by_topic test

- Uncomment play_filters_by_topic test
- Use proper qos settings for subscribers in `play_filters_by_topic`
and fix expectations about number of published messages.
- Log warning if SubscriptionManager::continue_spinning(..) finished by
timeout.
- Enable `play_end_to_end_test` for windows.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Make test_rosbag2_play_end_to_end deterministic

- Start player in pause mode and wait on subscribers for matched
publishers from player then send resume service call to unpause.
- Add spin_and_wait_for_matched(topic_names) for SubscriptionManager

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Remove redundant includes from test_rosbag2_play_end_to_end.cpp

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Sleep for a few milliseconds in SubscriptionManager to avoid busy loop

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Update rosbag2_test_common/include/rosbag2_test_common/subscription_manager.hpp

Co-authored-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

Co-authored-by: Chris Lalancette <clalancette@gmail.com>

* Add missing include<thread> in process_execution_helpers_unix.hpp

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Address race condition in process termination routines

- Added wait_until_completion(process_id, timeout) helper function

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Move wait_until_completion before stop_execution to fix compilation error

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Fix for Windows build error. Rename process_id to handle.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

* Increase timeout for service call and wait_until_completion up to 10 sec

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

---------

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
Co-authored-by: Chris Lalancette <clalancette@gmail.com>
(cherry picked from commit af4ca0c)

Co-authored-by: Michael Orlov <michael.orlov@apex.ai>
@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2023-05-18/31587/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_rosbag2_play_end_to_end flaky
4 participants