Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing test in CI: test_rclcpp gtest_subscription__rmw_connextdds #136

Open
clalancette opened this issue Nov 13, 2023 · 3 comments
Open
Assignees

Comments

@clalancette
Copy link
Contributor

clalancette commented Nov 13, 2023

System Info

  • OS
    • Ubuntu 22.04
  • ROS version and installation type
    • rolling/source
  • RTI Connext DDS version and installation type
    • Connext Pro 6.0.1
  • RMW version or commit hash
    • rolling branch (a6053be as of this writing)

Bug Description

If you run the gtest_subscription__rmw_connextdds test from the test_rclcpp package with no load, it "always" passes. It also "always" seems to pass in CI. However, if you put load on the machine, it fails very often, maybe like 75% of the time.

Expected Behavior

The test always passes, even with a lot of load on the machine.

How to Reproduce

In terminal 1, put a lot of stress on the machine. In my case:

$ stress --cpu 16

In terminal 2, run the test:

$ colcon test --event-handlers console_direct+ --packages-select test_rclcpp --ctest-args -R gtest_subscription__rmw_connextdds

You may have to adjust the amount of stress on the machine, and you may have to run the test a few times, but it should fail fairly quickly.

Workarounds

Mark this test as xfail.

Additional context

This same test works fine on Fast-DDS (gtest_subscription__rmw_fastrtps_cpp) and Cyclone DDS (gtest_subscription__rmw_cyclonedds_cpp), even with load on the machine.

This has come up now because we are about to merge in ros2/rclcpp#2142, which seems to exacerbate this problem. However, I can produce this completely with rolling packages as of today, so it is not the fault of that PR.

I did some additional debugging to try to track this down. When it fails, the executor is waiting for new data to come in via rmw_wait via a condition variable:

timedout = !this->condition.wait_for(lock, n, on_condition_active);
. The failure occurs because the condition variable never wakes up and times out. In turn, this suggests that the condition variable is never triggered.

That condition variable should be triggered when new data comes in via the on_data_available callback in the subscriber:

RMW_Connext_DataReaderListener_on_data_available(
. I've verified that it is installed to the listener at .

Finally, I've verified that the publisher side is indeed writing the data out via DDS_DataWriter_write_untypedI in message (

DDS_DataWriter_write_untypedI(
) .

So as far as I can tell, those are all of the pieces necessary to get this working, and it does work sometimes. But under load, it seems to fail. I could use some advice on how to debug this further.

In the meantime, I'm going to propose a PR to mark that particular test as xfail so we can make progress on the other PRs.

@clalancette
Copy link
Contributor Author

I realized that due to a recent refactoring, we were no longer marking the test as xfail anymore. ros2/system_tests#544 should fix that.

@fujitatomoya
Copy link
Collaborator

i did not get to the root cause, https://github.com/ros2/rosbag2/blob/4882d30fc2d1e5b9305e5d46b8460466f9280d27/rosbag2_transport/test/rosbag2_transport/test_burst.cpp#L404-L408 looks like the same issue. when it comes to transient workload, it sometimes miss the subscription data, that leads to the test failure of that test here.

@Crola1702
Copy link

I think this issue is happening consistently on Windows repeated, and it's also flaky on windows debug and rhel repeated.

It's a timeout of this test.

Reference builds:

Test regression:

Log output:

  Start 12: gtest_multithreaded__rmw_connextdds

12: Test command: C:\Python38\python.exe "-u" "C:/ci/ws/install/share/ament_cmake_test/cmake/run_test.py" "C:/ci/ws/build/test_rclcpp/test_results/test_rclcpp/gtest_multithreaded__rmw_connextdds.gtest.xml" "--package-name" "test_rclcpp" "--output-file" "C:/ci/ws/build/test_rclcpp/ament_cmake_gtest/gtest_multithreaded__rmw_connextdds.txt" "--env" "RMW_IMPLEMENTATION=rmw_connextdds" "--append-env" "PATH=C:/ci/ws/build/test_rclcpp/Release" "--command" "C:/ci/ws/build/test_rclcpp/Release/gtest_multithreaded.exe" "--gtest_output=xml:C:/ci/ws/build/test_rclcpp/test_results/test_rclcpp/gtest_multithreaded__rmw_connextdds.gtest.xml"
12: Working Directory: C:/ci/ws/build/test_rclcpp
12: Test timeout computed to be: 90
12: -- run_test.py: extra environment variables:
12:  - RMW_IMPLEMENTATION=rmw_connextdds
12: -- run_test.py: extra environment variables to append:
12:  - PATH+=C:/ci/ws/build/test_rclcpp/Release
12: -- run_test.py: invoking following command in 'C:\ci\ws\build\test_rclcpp':
12:  - C:/ci/ws/build/test_rclcpp/Release/gtest_multithreaded.exe --gtest_output=xml:C:/ci/ws/build/test_rclcpp/test_results/test_rclcpp/gtest_multithreaded__rmw_connextdds.gtest.xml
12: Running main() from C:\ci\ws\install\src\gtest_vendor\src\gtest_main.cc
12: [==========] Running 5 tests from 1 test suite.
12: [----------] Global test environment set-up.
12: [----------] 5 tests from test_multithreaded
12: [ RUN      ] test_multithreaded.multi_consumer_single_producer
    Test #12: gtest_multithreaded__rmw_connextdds .............................***Timeout  90.01 sec

Flaky ratio:

job_name last_fail first_fail build_count failure_count failure_percentage
nightly_win_rep 2024-08-21 2024-08-06 13 13 100.0
nightly_win_deb 2024-08-20 2024-08-20 14 1 7.14
nightly_linux-rhel_repeated 2024-08-18 2024-08-14 16 2 12.5
nightly_linux_coverage 2024-08-07 2024-08-07 15 1 6.67

WDYT @clalancette?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants