Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #381 unstable play_end_to_end test #396

Merged
merged 14 commits into from
May 12, 2020
Merged

Fix #381 unstable play_end_to_end test #396

merged 14 commits into from
May 12, 2020

Conversation

mabelzhang
Copy link
Contributor

I think this should fix #381. The failing test was play_filters_by_topic in rosbag2_tests.

I took the same kind of fix from rosbag2_transport https://github.com/ros2/rosbag2/blob/master/rosbag2_transport/test/rosbag2_transport/test_play.cpp#L140 , so I think it should make the tests more stable.

I ran GitHub Actions on this branch in my fork (temporarily the default branch to run the scheduled Actions) https://github.com/mabelzhang/rosbag2/actions , and the ros-tooling/action-ros-ci passed most of the time. The occasional times it fails are at ament_cmake_core, which I think is a glitch in Actions. The next Actions check, actions/upload-artifact, always fails in my fork. I think that's not relevant to the test failures. The tests are passing.

For example, these are passing runs scheduled ~5 minutes apart
348 https://github.com/mabelzhang/rosbag2/actions/runs/91812093
349 https://github.com/mabelzhang/rosbag2/actions/runs/91814897
350 https://github.com/mabelzhang/rosbag2/actions/runs/91818893
351 https://github.com/mabelzhang/rosbag2/actions/runs/91821981
352 https://github.com/mabelzhang/rosbag2/actions/runs/91825552
354 https://github.com/mabelzhang/rosbag2/actions/runs/91839933
355 https://github.com/mabelzhang/rosbag2/actions/runs/91843979
356 https://github.com/mabelzhang/rosbag2/actions/runs/91847046

As @Karsten1987 mentioned, the problem is probably due to publishers and subscribers not starting at the same time. A real fix is to revive and finish the start_paused branch, to add a feature that starts bag playback paused.

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
@Karsten1987
Copy link
Collaborator

@mabelzhang can you run CI on this?

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Copy link
Collaborator

@emersonknapp emersonknapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - on green CI

@mabelzhang
Copy link
Contributor Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@Karsten1987
Copy link
Collaborator

I am a bit concerned about what I see in the GitHub actions:

  18: [ RUN      ] RecordIntegrationTestFixture.mixed_qos_subscribes
  18: [INFO] [1588287042.885843194] [rosbag2_transport]: Listening for topics...
  18: [WARN] [1588287042.886893342] [rosbag2_transport]: Some, but not all, publishers on topic "/string_topic" are offering RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL. Falling back to RMW_QOS_POLICY_DURABILITY_VOLATILE as it will connect to all publishers. Previously-published latched messages will not be retrieved.
  18: [INFO] [1588287042.888077997] [rosbag2_transport]: Subscribed to topic '/string_topic'
  18: [INFO] [1588287042.888764328] [rosbag2_transport]: All requested topics are subscribed. Stopping discovery...
  18: -- run_test.py: return code -11
  18: -- run_test.py: generate result file '/home/runner/work/rosbag2/rosbag2/ros_ws/build/rosbag2_transport/test_results/rosbag2_transport/test_record__rmw_fastrtps_cpp.gtest.xml' with failed test
  18: -- run_test.py: verify result file '/home/runner/work/rosbag2/rosbag2/ros_ws/build/rosbag2_transport/test_results/rosbag2_transport/test_record__rmw_fastrtps_cpp.gtest.xml'
  18/37 Test #18: test_record__rmw_fastrtps_cpp ............................***Failed    0.76 sec

That return code -11 usually indicates a segmentation fault somewhere. That's a pretty uncool situation where you have two CI runs and it happens only on one of them. Who's to be believe?!

Can anybody reproduce that return code?

@emersonknapp
Copy link
Collaborator

My #1 concern with the Action CI right now is that it's running on Bionic, which is not a supported platform for this version of the code. I'm working on trying to upgrade it to run on Focal, see this WIP PR - I have a feeling that this will solve the issues #390

@Karsten1987
Copy link
Collaborator

I can't reproduce this locally on my Ubuntu18.04.

@emersonknapp
Copy link
Collaborator

Fair - I'm trying to scrape the logs to find out how often this occurs

@emersonknapp
Copy link
Collaborator

Added some more context here #381 (comment) - i saw the -11 return code in about a quarter of the runs over the last 24 hours.

@mabelzhang
Copy link
Contributor Author

CI with --retest-until-fail 10 to test stability

  • Linux Build Status

I need to do this on the other 3 platforms.

@mabelzhang
Copy link
Contributor Author

--retest-until-fail 10 on the other 3 OSes:

  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

So looks like on both Linux CIs, it is passing 10/10.
On Windows, play_end_to_end timed out in the first run.
On macOS, tests pass. I have seen the test time out yesterday on a separate CI run. So I ran it with --retest-until-fail 10 again. It still seems to pass:

  • macOS Build Status

It seems the version here fails less often than master branch. Any objections to adding a #if to skip this test on Windows, and keep the others?

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
@mabelzhang
Copy link
Contributor Author

Disabled play_end_to_end and play_filters_by_topic on Windows.
Tests are passing on CI with --retest-until-fail 10

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Collaborator

@Karsten1987 Karsten1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's okay for now to disable the tests (again) on windows.
@emersonknapp what do you think?

@mabelzhang
Copy link
Contributor Author

@emersonknapp If I get an okay about disabling the Windows test, I can merge this.

@emersonknapp
Copy link
Collaborator

yeah, I have no problem with it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unstable tests on Action CI
3 participants