Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unstable tests on Action CI #381

Closed
emersonknapp opened this issue Apr 16, 2020 · 13 comments · Fixed by #396
Closed

Unstable tests on Action CI #381

emersonknapp opened this issue Apr 16, 2020 · 13 comments · Fixed by #396
Assignees
Labels
bug Something isn't working
Milestone

Comments

@emersonknapp
Copy link
Collaborator

Description

The regularly scheduled CI builds are seeing periodic failure of the test

See the following overnight builds for example:

It does not fail every time, but has failed in approximately half of today's hourly runs.

The test times out when it fails, this suggests it is waiting for messages that are never delivered, or were delivered before the Subscriptions came online.

System (please complete the following information)

  • OS: Ubuntu Bionic
  • ROS 2 Distro: Foxy
  • Version: master, reporting at cb44cdd
@emersonknapp emersonknapp added the bug Something isn't working label Apr 16, 2020
@emersonknapp emersonknapp changed the title Periodic test failure on test_rosbag2_play_end_to_end Unstable tests on Action CI May 1, 2020
@emersonknapp
Copy link
Collaborator Author

From the last day or so these are the test results, it's not always the same tests failing. The list of unstable tests seems to include:

  • test_play_topic_remap__rmw_fastrtps_cpp (Timeout)

  • test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)

  • test_record__rmw_fastrtps_cpp (Failed)

  • test_record__rmw_fastrtps_dynamic_cpp (Failed)

  • test_record__rmw_cyclonedds_cpp (Failed)

  • test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Failed)

  • test_rosbag2_play_end_to_end (Timeout)

The timeouts probably have something to do with discovery, or shutdown behavior not receiving the final message published by a publisher.

The failures, I'm guessing are segmentation faults of some kind. It seems common with fastrtps, and occurred once in cyclone in this list.

Here's my output from the script I used to scrape, for context:

TEST RUN 93323352 - 2020-05-01T19:03:53Z
b'2020-05-01T19:38:31.9162579Z The following tests FAILED:'
b'2020-05-01T19:38:31.9162907Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T19:38:31.9163376Z \t 30 - test_record__rmw_fastrtps_dynamic_cpp (Failed)'
b'2020-05-01T19:38:31.9164216Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T19:38:31.9164349Z Errors while running CTest'

b'2020-05-01T19:40:03.0866840Z The following tests FAILED:'
b'2020-05-01T19:40:03.0867133Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T19:40:03.0867235Z Errors while running CTest'

TEST RUN 93280328 - 2020-05-01T18:04:10Z
b'2020-05-01T18:35:20.8619814Z The following tests FAILED:'
b'2020-05-01T18:35:20.8620177Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T18:35:20.8620287Z Errors while running CTest'

b'2020-05-01T18:36:51.5043900Z The following tests FAILED:'
b'2020-05-01T18:36:51.5044307Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T18:36:51.5044399Z Errors while running CTest'

TEST RUN 93238438 - 2020-05-01T17:03:56Z
b'2020-05-01T17:32:03.5810154Z The following tests FAILED:'
b'2020-05-01T17:32:03.5810416Z \t 19 - test_record__rmw_fastrtps_cpp (Failed)'
b'2020-05-01T17:32:03.5810677Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T17:32:03.5810956Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T17:32:03.5811165Z Errors while running CTest'

b'2020-05-01T17:33:33.1755227Z The following tests FAILED:'
b'2020-05-01T17:33:33.1755510Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T17:33:33.1755601Z Errors while running CTest'

TEST RUN 93196491 - 2020-05-01T16:04:10Z
b'2020-05-01T16:33:03.8155439Z The following tests FAILED:'
b'2020-05-01T16:33:03.8155720Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T16:33:03.8156111Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T16:33:03.8156235Z Errors while running CTest'

b'2020-05-01T16:34:32.4372036Z The following tests FAILED:'
b'2020-05-01T16:34:32.4372243Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T16:34:32.4372335Z Errors while running CTest'

TEST RUN 93154036 - 2020-05-01T15:03:52Z
b'2020-05-01T15:37:13.5761261Z The following tests FAILED:'
b'2020-05-01T15:37:13.5761525Z \t 19 - test_record__rmw_fastrtps_cpp (Failed)'
b'2020-05-01T15:37:13.5761775Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T15:37:13.5762050Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T15:37:13.5762165Z Errors while running CTest'

b'2020-05-01T15:38:43.8400797Z The following tests FAILED:'
b'2020-05-01T15:38:43.8401031Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T15:38:43.8401136Z Errors while running CTest'

TEST RUN 93112954 - 2020-05-01T14:03:41Z
b'2020-05-01T14:32:13.9392698Z The following tests FAILED:'
b'2020-05-01T14:32:13.9392945Z \t  8 - test_record__rmw_cyclonedds_cpp (Failed)'
b'2020-05-01T14:32:13.9393219Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T14:32:13.9393336Z Errors while running CTest'

b'2020-05-01T14:33:43.3575792Z The following tests FAILED:'
b'2020-05-01T14:33:43.3576040Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T14:33:43.3576139Z Errors while running CTest'

TEST RUN 93075480 - 2020-05-01T13:03:52Z
b'2020-05-01T13:34:00.5559294Z The following tests FAILED:'
b'2020-05-01T13:34:00.5559583Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T13:34:00.5559839Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T13:34:00.5559952Z Errors while running CTest'

b'2020-05-01T13:35:30.2741230Z The following tests FAILED:'
b'2020-05-01T13:35:30.2741470Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T13:35:30.2741566Z Errors while running CTest'

TEST RUN 93039869 - 2020-05-01T12:04:39Z
b'2020-05-01T12:35:31.9366081Z The following tests FAILED:'
b'2020-05-01T12:35:31.9366523Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T12:35:31.9366817Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T12:35:31.9366923Z Errors while running CTest'

b'2020-05-01T12:37:02.3758277Z The following tests FAILED:'
b'2020-05-01T12:37:02.3758864Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T12:37:02.3759070Z Errors while running CTest'

TEST RUN 93004273 - 2020-05-01T11:03:33Z
b'2020-05-01T11:34:21.8739779Z The following tests FAILED:'
b'2020-05-01T11:34:21.8739991Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T11:34:21.8740177Z \t 31 - test_play__rmw_fastrtps_dynamic_cpp (Failed)'
b'2020-05-01T11:34:21.8740386Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T11:34:21.8740482Z Errors while running CTest'

b'2020-05-01T11:35:51.8787516Z The following tests FAILED:'
b'2020-05-01T11:35:51.8787718Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T11:35:51.8787806Z Errors while running CTest'

TEST RUN 92970747 - 2020-05-01T10:03:47Z
b'2020-05-01T10:29:07.8768310Z The following tests FAILED:'
b'2020-05-01T10:29:07.8768506Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T10:29:07.8768592Z Errors while running CTest'

TEST RUN 92936606 - 2020-05-01T09:04:01Z
b'2020-05-01T09:36:13.5265852Z The following tests FAILED:'
b'2020-05-01T09:36:13.5266121Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T09:36:13.5266403Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T09:36:13.5266524Z Errors while running CTest'

b'2020-05-01T09:37:41.7927209Z The following tests FAILED:'
b'2020-05-01T09:37:41.7927443Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T09:37:41.7927547Z Errors while running CTest'

TEST RUN 92901768 - 2020-05-01T08:04:17Z
b'2020-05-01T08:34:05.0969907Z The following tests FAILED:'
b'2020-05-01T08:34:05.0970184Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T08:34:05.0970454Z \t 30 - test_record__rmw_fastrtps_dynamic_cpp (Failed)'
b'2020-05-01T08:34:05.0970730Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T08:34:05.0970844Z Errors while running CTest'

b'2020-05-01T08:35:33.7374257Z The following tests FAILED:'
b'2020-05-01T08:35:33.7374508Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T08:35:33.7374597Z Errors while running CTest'

TEST RUN 92869475 - 2020-05-01T07:03:59Z
b'2020-05-01T07:32:49.0618630Z The following tests FAILED:'
b'2020-05-01T07:32:49.0618897Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T07:32:49.0619183Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T07:32:49.0619303Z Errors while running CTest'

TEST RUN 92839121 - 2020-05-01T06:04:15Z
b'2020-05-01T06:34:24.5639592Z The following tests FAILED:'
b'2020-05-01T06:34:24.5639859Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T06:34:24.5640032Z Errors while running CTest'

b'2020-05-01T06:35:53.3269251Z The following tests FAILED:'
b'2020-05-01T06:35:53.3269532Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T06:35:53.3269638Z Errors while running CTest'

TEST RUN 92809599 - 2020-05-01T05:03:58Z
b'2020-05-01T05:32:46.6473384Z The following tests FAILED:'
b'2020-05-01T05:32:46.6473617Z \t 19 - test_record__rmw_fastrtps_cpp (Failed)'
b'2020-05-01T05:32:46.6473867Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T05:32:46.6474126Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T05:32:46.6474232Z Errors while running CTest'

b'2020-05-01T05:34:15.6431719Z The following tests FAILED:'
b'2020-05-01T05:34:15.6431966Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T05:34:15.6432051Z Errors while running CTest'

TEST RUN 92779846 - 2020-05-01T04:04:15Z
b'2020-05-01T04:34:43.2583194Z The following tests FAILED:'
b'2020-05-01T04:34:43.2583485Z \t 19 - test_record__rmw_fastrtps_cpp (Failed)'
b'2020-05-01T04:34:43.2583745Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T04:34:43.2584039Z \t 25 - test_record_all_no_discovery__rmw_fastrtps_dynamic_cpp (Failed)'
b'2020-05-01T04:34:43.2584159Z Errors while running CTest'

TEST RUN 92751423 - 2020-05-01T03:04:01Z
b'2020-05-01T03:36:00.7390039Z The following tests FAILED:'
b'2020-05-01T03:36:00.7390295Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T03:36:00.7390553Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T03:36:00.7390664Z Errors while running CTest'

b'2020-05-01T03:37:30.5714729Z The following tests FAILED:'
b'2020-05-01T03:37:30.5714968Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T03:37:30.5715072Z Errors while running CTest'

TEST RUN 92722447 - 2020-05-01T02:04:15Z
b'2020-05-01T02:37:32.1472052Z The following tests FAILED:'
b'2020-05-01T02:37:32.1472340Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T02:37:32.1472617Z Errors while running CTest'

b'2020-05-01T02:39:04.3343220Z The following tests FAILED:'
b'2020-05-01T02:39:04.3343460Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T02:39:04.3343568Z Errors while running CTest'

TEST RUN 92691030 - 2020-05-01T01:04:05Z
b'2020-05-01T01:34:46.3406834Z The following tests FAILED:'
b'2020-05-01T01:34:46.3407183Z \t 20 - test_play__rmw_fastrtps_cpp (Failed)'
b'2020-05-01T01:34:46.3407465Z \t 22 - test_play_topic_remap__rmw_fastrtps_cpp (Timeout)'
b'2020-05-01T01:34:46.3407723Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-05-01T01:34:46.3407830Z Errors while running CTest'

b'2020-05-01T01:36:16.3433446Z The following tests FAILED:'
b'2020-05-01T01:36:16.3433669Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T01:36:16.3433769Z Errors while running CTest'

TEST RUN 92656135 - 2020-05-01T00:09:17Z
b'2020-05-01T00:38:39.7218986Z The following tests FAILED:'
b'2020-05-01T00:38:39.7219224Z \t  2 - test_rosbag2_play_end_to_end (Timeout)'
b'2020-05-01T00:38:39.7219330Z Errors while running CTest'

TEST RUN 92613255 - 2020-04-30T23:03:53Z
b'2020-04-30T23:33:43.3972138Z The following tests FAILED:'
b'2020-04-30T23:33:43.3972601Z \t 33 - test_play_topic_remap__rmw_fastrtps_dynamic_cpp (Timeout)'
b'2020-04-30T23:33:43.3980234Z Errors while running CTest'

@mabelzhang
Copy link
Contributor

Is this on Bionic? I saw similar fastrtps test failures in rosbag2 when I was on Bionic. They disappeared when I started compiling in Focal instead.

@emersonknapp
Copy link
Collaborator Author

Yeah it's running on Bionic currently - which it should not. I still feel that the first most important change is to get these running on Focal. However it's interesting that Karsten can't repro on his Bionic machine. But, it's not a supported platform so it's not worth tooo much effort tracking it down if it's working on Focal. I have a feeling that a few of these issues will linger on Focal, so the faster we can get that running the better so we can track the remainder down.

This is a prerequisite to getting us there ros-tooling/setup-ros#140

mabelzhang added a commit that referenced this issue May 12, 2020
* decrease subscription expected msgs

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* change workflow to test this branch

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* change expected number too

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* linter

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* change workflow to autotest feature branch

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* revert workflow

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* run workflow more frequently to test stability

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* revert github workflow

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* cleanup

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* define constants

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* disable test on windows

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>

* disable play_filters_by_topic on windows

Signed-off-by: Mabel Zhang <mabel@openrobotics.org>
@emersonknapp
Copy link
Collaborator Author

Still seeing Action CI fail consistently e.g.

so I'm reopening this to track getting us to green

@emersonknapp emersonknapp reopened this May 18, 2020
@emersonknapp emersonknapp added this to the Sprint #19 milestone May 18, 2020
@mm318 mm318 self-assigned this May 20, 2020
@mm318
Copy link
Member

mm318 commented May 21, 2020

I have been able to reproduce these issues locally in a Ubuntu Focal container.

The segfaults (for example in test_rosbag2_record_end_to_end, test_record__*, test_record_all_no_discovery__*) occur during rclcpp::Node construction and rclcpp::shutdown(). Since these tests seem to involve quite a bit of asynchronous code execution, I suspect the issues might be related to ros2/rclcpp#1042.

The timeouts happened in test_rosbag2_play_end_to_end and test_play_topic_remap__rmw_fastrtps_cpp, which may mean it is specific to when using rmw_fastrtps_cpp/rmw_fastrtps_dynamic_cpp as the RMW implementation.

@emersonknapp
Copy link
Collaborator Author

I assume these timeouts are related to expecting a specific number of messages, and not receiving either the messages at the beginning, or one at the end - it's worth putting a print statement on messages received, to see if it's just that we're getting one fewer than expected. So, either a discovery or a shutdown edge case behavior - if the tests are testing something else (e.g. not about specific numbers of messages, but about whether messages are received at all, etc), then what we should do is make those tests less fragile to these boundary behaviors, perhaps publish N*2 messages and only expect to receive N - or something along those lines.

@mm318
Copy link
Member

mm318 commented May 21, 2020

I assume these timeouts are related to expecting a specific number of messages, and not receiving either the messages at the beginning, or one at the end

In both tests that failed due to timeout that I mentioned, there were indeed an expectation for a specific number of messages, but exactly 0 messages were received (I used the debugger instead of print statements). I think we're missing waiting for get_publisher_count() of the SubscriptionManager and the get_subscription_count() of the Rosbag2Transport/ros2 bag play to reach non-zero. It could also be that get_publisher_count()/get_subscription_count() will never be non-zero due to some discovery edge case behavior as you mentioned.

@zmichaels11 zmichaels11 modified the milestones: Sprint #20, Sprint #21 Jun 5, 2020
@mm318
Copy link
Member

mm318 commented Jun 10, 2020

I wrote a small case that reproduces the timeout (messages never actually received by subscription) problem at unpublished_messages.zip.

I also wrote a small case that tries to reproduce the segfault problem at
shutdown_crash.zip. Although I never experienced a segfault with the small case, sometimes I did experience an unhandled exception, which means there's at least a race condition somewhere.

@emersonknapp
Copy link
Collaborator Author

That's awesome! With this we can potentially report it to the rmw implementation - does it show this behavior only on Fast-RTPS?

@mm318
Copy link
Member

mm318 commented Jun 10, 2020

does it show this behavior only on Fast-RTPS?

I have seen this behavior with both FastRTPS and RTI Connext. I haven't tried with CycloneDDS.

@zmichaels11
Copy link
Contributor

Moving this to the top of input queue so it can be reassigned

@emersonknapp
Copy link
Collaborator Author

Bump - watch this. The Action CI has been reenabled, see how it acts over the next 24 hours before closing this issue

@emersonknapp emersonknapp self-assigned this Dec 7, 2020
@emersonknapp
Copy link
Collaborator Author

only one of the last 24 hourly runs failed - and that test failure is also seen on the buildfarm. followup work to de-flake will be tracked by #408

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants