
Extend CI Workflow to test multiple RMW #990

Merged · 77 commits · Aug 21, 2019

Conversation

@ruffsl (Member) commented Jul 30, 2019

This PR extends the CircleCI workflow to test multiple RMW implementations. To compensate for the added build time in generating messages, a combination of ccache and CircleCI caches is used to accelerate fresh PR jobs, including jobs across successive workflows. To compensate for the added test time, RMW implementations are tested across parallel jobs using the quicker release build.


Motivation

In preparation for the security workshop at ROSCon, @mikaelarguedas and I have been working to demo SROS2 running on the TurtleBot3 to secure the entire ROS stack, including navigation2. However, when using other Tier 1 supported RMW implementations to verify security functionality, we've been encountering a number of issues, including:

#915
#916

To help the navigation team detect and debug these issues, I'd like to extend the current CI workflow to support testing multiple RMW implementations, along with a number of other optimizations that may help keep the CI tractable.

Changes

  • DockerHub and Dockerfile

    • Clarify underlay and overlay terminology and inconsistencies
    • Parameterize the FROM image using a build ARG
    • Add hooks for reusing Dockerfile for multiple tags
    • Install and use ccache for CI optimization
  • CircleCI config.yaml

    • Clarify underlay and overlay terminology and inconsistencies
    • Split RMW testing for release builds into parallel jobs
    • Add nonce to cache keys to avoid collisions between parallel jobs and workflows
    • Use CI caching to preserve ccache across jobs for both underlay and overlay
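As a rough sketch of the Dockerfile changes above (the image name, package manager, and paths here are illustrative assumptions, not the exact contents of this PR):

```dockerfile
# Parameterize the base image so one Dockerfile can serve multiple tags
ARG FROM_IMAGE=osrf/ros2:nightly
FROM ${FROM_IMAGE}

# Install ccache and put its compiler shims first on the PATH,
# so colcon/CMake builds transparently reuse cached object files
RUN apt-get update && \
    apt-get install -y --no-install-recommends ccache && \
    rm -rf /var/lib/apt/lists/*
ENV PATH="/usr/lib/ccache:${PATH}"
```

On Debian/Ubuntu bases, `/usr/lib/ccache` holds the compiler symlinks, so prepending it to PATH is enough to route `gcc`/`g++` calls through ccache.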

After

[screenshot]

Action Items

  • Update DockerHub automated build config
  • Change over image repo from ruffsl to rosplanning

[screenshot]

| Source Type | Source | Docker Tag | Dockerfile location | Build Context | Autobuild |
| --- | --- | --- | --- | --- | --- |
| Branch | master | master | /.dockerhub/nightly/Dockerfile | <leave_empty> | True |
| Branch | master | master-rmw | /.dockerhub/nightly-rmw/Dockerfile | <leave_empty> | True |

@mkhansenbot (Collaborator)

Can you help do the same with the Travis CI also? I'd like that to also run the tests with different RMW implementations.

@SteveMacenski (Member) commented Jul 30, 2019

@mkhansen-intel I forget: why do we still use Travis if we have Circle + the ROS PR builder going? I'm not sure what it adds that the other two don't. Less to maintain here might be nice.

@ruffsl (Member, Author) commented Jul 30, 2019

Can you help do the same with the Travis CI also?

Could we tackle this in a future/separate PR?

@crdelsey (Contributor)

Just FYI: opensplice middleware is broken too. #974.

@crdelsey (Contributor)

why do we still use Travis if we have Circle + ROS PR builder going

Sometimes Circle will just mysteriously fail. Travis kind of acts as the backup when that happens.
A couple of times now, Circle will always fail when the gazebo_ros plugin is loaded. It fails with a symbol-not-found error, as if the gazebo_ros plugin is linked against a different version of ROS than is currently present on the system.

After a week or two, it suddenly starts to work again. This has happened twice now, at least that I've noticed.

Also, due to the way the Circle build works, if master gets broken by an upstream change, a PR that fixes the problem can't make the Circle build pass. But it can make the Travis build pass.

I'm OK with not making Travis do all the builds Circle does, but so long as it isn't a huge burden, it is nice to have it as a backup check.

@ruffsl (Member, Author) commented Jul 31, 2019

Circle will always fail when the gazebo_ros plugin is loaded. It fails with a symbol-not-found error, as if the gazebo_ros plugin is linked against a different version of ROS than is currently present on the system.

That sounds like an occurrence of a cache restore loading build artifacts from an improper job. I've been using timestamps from the nightly Docker image to keep the cache keys in lock step.

https://github.com/ros-planning/navigation2/blob/2350778c3eb11196de68ce006d5ec246e848f27b/.circleci/config.yml#L7-L9

In reworking the config for this PR, I realized that the cache keys between release and debug jobs (which run in parallel) could potentially collide, so in 54565dd I added the CIRCLE_CACHE_NONCE env var to help seed unique checksum.txt digests between jobs in the same workflow, or jobs with different compiler flags.
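A minimal sketch of how a nonce can seed a cache key (the step and key names here are assumptions for illustration, not the exact config in this PR):

```yaml
steps:
  - run:
      name: Seed cache checksum
      command: |
        # Combine the per-workflow nonce with the nightly image timestamp,
        # so parallel jobs (e.g. release vs. debug) never share a key
        echo "${CIRCLE_CACHE_NONCE}" > checksum.txt
        cat /opt/ros_ws/timestamp >> checksum.txt
  - restore_cache:
      keys:
        - underlay-v1-{{ checksum "checksum.txt" }}
```

Since `{{ checksum "checksum.txt" }}` is computed per job, any difference in nonce or timestamp yields a distinct key.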

@crdelsey, if you could remember an issue where this occurred, I could dig into the respective CircleCI reports to check whether my hypothesis holds, and whether this PR resolves it.

@ruffsl (Member, Author) commented Jul 31, 2019

Also, due to the way the Circle build works, if master gets broken by an upstream change, a PR that fixes the problem can't make the CIrcle build pass. But it can make the Travis build pass.

IIRC, the master branch was broken for so long that rosplanning/navigation2:master had grown too stale and fallen out of sync with the upstream changes that the respective fixing PRs were expecting.

I was thinking we could have a CI Dockerfile that only installs underlay and overlay dependencies and skips the colcon build layers, but then we'd miss out on the benefit of having DockerHub initialize our ccache for each new PR.

Instead, I've added 664ec03 to resolve this stalemate (haha) by having the DockerHub build hooks unset FAIL_ON_BUILD_FAILURE, thus allowing the images to continue to build even when master breaks. However, the DockerHub CI badge will be less indicative of colcon build success, so I've enabled the nightly cron jobs in b40cdaa to keep a close eye on it. It should be less resource intensive now that we are using ccache.

@ruffsl (Member, Author) commented Jul 31, 2019

Circle will always fail when the gazebo_ros plugin is loaded. It fails with a symbol-not-found error, as if the gazebo_ros plugin is linked against a different version of ROS than is currently present on the system.

I just also noticed that CircleCI branch-name substitution using {{ .Branch }} does not prefix names with the repo name, so PRs from separate forks using the same branch name could collide.

To avoid cache collisions for PRs with the same branch name, I've included the unique PR number in the cache key. See b99b98a.

@ruffsl (Member, Author) commented Jul 31, 2019

I've also removed sharing of underlay caches across branches/PRs in 823173f.
That was probably a bug.

@crdelsey (Contributor)

@ruffsl Fortunately the latest Circle symbol failure problem was recent. Here is a build that shows the failure mode.
https://circleci.com/gh/ros-planning/navigation2/1618?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

@ruffsl (Member, Author) commented Aug 1, 2019

Here is a build that shows the failure mode.

Ok, looks like build 1618 restored its underlay cache from build 1435:

```
Found a cache from build 1435 at upstream-cache-v1-arch1-linux-amd64-6_85-7fIRLjO_6tpk1hegvoBhjHu+EAoQ4oDNc+JQ_VwWOlQ=
Size: 35 MB
Cached paths:
  * /opt/ros_ws
```

The cache and its originating build 1435 are comparatively old: June 19 vs. June 28.

```
Skipping cache generation, cache already exists for key: upstream-cache-v1-arch1-linux-amd64-6_85-7fIRLjO_6tpk1hegvoBhjHu+EAoQ4oDNc+JQ_VwWOlQ=
Found one created at 2019-06-19 20:58:00 +0000 UTC
```

Re-reading the latest docs, save_cache is a bit unintuitive in that it can't update existing caches.
Looks like the docs have been updated to better reflect this. Grrr...

> Note: If the cache for the given key already exists it won't be modified, and job execution will proceed to the next step.

https://circleci.com/docs/2.0/configuration-reference/#save_cache

> CircleCI restores caches in the order of keys listed in the restore_cache step. Each cache key is namespaced to the project, and retrieval is prefix-matched. The cache will be restored from the first matching key. If there are multiple matches, the most recently generated cache will be used.

https://circleci.com/docs/2.0/caching/#restoring-cache

However, given that cache-key matching is prefix based, I've reworked this in 479aca7.
Some working example jobs below, with 261 restored from 255, and 265 restored from 261:

We shouldn't have another unintended cache-key collision for about 267.484 years. 📆 🤞
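Given prefix-based matching, the rework can be sketched as a key that embeds the epoch on save, with a prefix fallback on restore (key names and paths here are illustrative, not the exact config):

```yaml
- restore_cache:
    keys:
      # Prefix fallback: since save_cache can't update an existing key,
      # embed {{ epoch }} on save so every job writes a fresh cache, and
      # restore the newest one sharing the same checksum prefix
      - overlay-v1-{{ checksum "checksum.txt" }}-
- save_cache:
    key: overlay-v1-{{ checksum "checksum.txt" }}-{{ epoch }}
    paths:
      - /opt/overlay_ws
```

This is how job 261 can restore from 255, and 265 from 261: each restore prefix-matches the most recent save.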

@ruffsl (Member, Author) commented Aug 10, 2019

Ok, so the osrf/ros2:nightly-rmw tags should be operational. I've merged origin master and my rmw branch on my fork to test out the CI against the latest state of this repo. Could I get an opinion on the test failures here? None of the jobs seem to be passing with master:

@ruffsl ruffsl marked this pull request as ready for review August 12, 2019 18:31
@ruffsl (Member, Author) commented Aug 12, 2019

Since my last comment, changes I've added include:

  • Add ccache stats and logs for debugging
  • Split all build and test steps into separate jobs
  • Use more YAML magic for keeping things DRY
  • Switch to the official lcov orb
  • Enable test results
  • Enable parallel testing and load balancing

Example: https://circleci.com/gh/ruffsl/navigation2/418

[screenshots]

I think this is ready for a review. FYI, it's probably easiest to read the longer config from bottom to top, given the hierarchy of YAML anchors and CircleCI abstractions. I've tried to order the references accordingly.

```yaml
# name: Test Overlay
# command: |
#   . install/setup.sh
#   src/navigation2/tools/run_test_suite.bash
```
ruffsl (Member, Author):

Perhaps we could push any system integration tests into proper ctests that include timing info in the XML test results, so that they can be load balanced accordingly. We could also split them out into their own job, but I'd like to make sure they still report their own timing.

crdelsey (Contributor):

Perhaps we could push any system integration tests into proper ctests

They are ctests already. Do you mean launch them from colcon test?

ruffsl (Member, Author):

Do you mean launch them from colcon test

Yeah, as then they could be load balanced by the CircleCI timing split. However, splitting is currently at the granularity of a package, so if one package takes the lion's share of test time, say more than the sum of the other tests, one unlucky worker will still end up testing that package by its lonesome.
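The package-granularity split described here corresponds roughly to a step like the following (`circleci tests split --split-by=timings` and `colcon list --names-only` are real CLI options, but the exact invocation is a hedged sketch, not this PR's config):

```shell
# Divide package names across parallel containers, weighted by the
# timing data CircleCI collects from previously stored test results
PACKAGES=$(colcon list --names-only | circleci tests split --split-by=timings)
colcon test --packages-select ${PACKAGES}
colcon test-result --verbose
```

Because the unit being split is a whole package, no balancing scheme can push the makespan below the runtime of the single slowest package.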

crdelsey (Contributor):

Ok. I don't want to go back to colcon running the system tests directly without figuring out some changes. I was unable to get the full console output when colcon ran the test, and without that, it was impossible to tell why the test failed.

ruffsl (Member, Author):

I was unable to get the full console output when colcon ran the test

Could you be more specific? Do you have an example I could debug, or links to the two CI test runs I could compare?

crdelsey (Contributor):

I just ran the system test locally using colcon. Here is the output:

```
> colcon test --merge-install --packages-select nav2_system_tests
Starting >>> nav2_system_tests
[Processing: nav2_system_tests]
--- stderr: nav2_system_tests
Errors while running CTest
---
Finished <<< nav2_system_tests [2min 30s]	[ with test failures ]

Summary: 1 package finished [2min 30s]
  1 package had stderr output: nav2_system_tests
  1 package had test failures: nav2_system_tests
```

```
> colcon test-result --verbose --test-result-base build/nav2_system_tests/
build/nav2_system_tests/Testing/20190814-2300/Test.xml: 11 tests, 0 errors, 1 failure, 0 skipped
build/nav2_system_tests/test_results/nav2_system_tests/test_localization.xml: 1 test, 0 errors, 1 failure, 0 skipped
- nav2_system_tests test_localization.missing_result
  <<< failure message
    The test did not generate a result file.
  >>>

Summary: 77 tests, 0 errors, 2 failures, 0 skipped
```

That's all I get: "test did not generate a result file".

Compare with any current build, where I get all the console output, showing all the messages each node prints out.
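One possible workaround for the missing console output, assuming a recent colcon (the `console_direct+` event handler is a real colcon option that streams each test's stdout/stderr straight to the terminal):

```shell
# Stream test output directly to the console as it runs,
# instead of only summarizing pass/fail at the end
colcon test --merge-install --packages-select nav2_system_tests \
    --event-handlers console_direct+
```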

ruffsl (Member, Author):

That summary is being generated by colcon by examining what ctest wrote to test_results/nav2_system_tests/test_localization.xml.

I'm not sure why ctest isn't reporting the stdout/stderr. My guess is that it wasn't configured to, but perhaps Dirk would know. Still, I don't think it would be useful to duplicate the stdout dump into the XML when it's already captured in the log artifacts. I think the message field in the XML should be used for higher-level feedback reporting.

But looking at the colcon logs for the test, all the console output, showing all the messages each node prints out, seems to be there: logs/test/nav2_system_tests/... Please make sure that you can navigate the Artifacts tab for test jobs, where all of this is stored for you.

https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/stdout.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/stdout_stderr.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/streams.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/command.log

[screenshot]

@crdelsey (Contributor)

Could I get an opinion on the test failures here

Those failures look expected except for this in the RTI connext test.

```
nav2_util test_actions.gtest.missing_result
  <<< failure message
    The test did not generate a result file:

    Running main() from /opt/ros/dashing/src/gtest_vendor/src/gtest_main.cc
    [==========] Running 2 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 2 tests from ActionTest
    [ RUN      ] ActionTest.test_simple_action

    >>> [rcutils|error_handling.c:107] rcutils_set_error_state()
    This error state is being overwritten:

      'array_list is already initialized, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcutils/src/array_list.c:61, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/logging_rosout.c:179'

    with this new error message:

      'Failed to add publisher to map., at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/logging_rosout.c:181'

    rcutils_reset_error() should be called after error handling to avoid this.
    <<<
    RTI Data Distribution Service Non-commercial license is for academic, research, evaluation and personal use only. USE FOR COMMERCIAL PURPOSES IS PROHIBITED. See RTI_LICENSE.TXT for terms. Download free tools at rti.com/ncl. License issued to Non-Commercial User license@rti.com For non-production use only.
    Expires on 00-jan-00 See www.rti.com for more information.
    unknown file: Failure
    C++ exception with description "failed to initialize rcl node: Failed to add publisher to map., at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/logging_rosout.c:181" thrown in SetUp().
  >>>
```

@ruffsl (Member, Author) commented Aug 12, 2019

I simplified the build config a bit using colcon mixins and repurposed the CI images into debug and release variants for the different job executors.
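The colcon mixin usage might look roughly like this (mixin names taken from the public colcon-mixin-repository; treat it as a sketch rather than this PR's exact config):

```shell
# Register the default mixin repository, then build with shorthand
# mixins instead of spelling out the CMake flags by hand
colcon mixin add default \
    https://raw.githubusercontent.com/colcon/colcon-mixin-repository/master/index.yaml
colcon mixin update default
colcon build --symlink-install --mixin release ccache
```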

@crdelsey (Contributor)

@ruffsl Thanks for doing all this. I'll review this as soon as I can

@mkhansenbot (Collaborator)

@ruffsl - can you rebase and re-run the CI to get it passing? I'm assuming there's no point in reviewing this until the CI is passing

@ruffsl (Member, Author) commented Aug 15, 2019

by only running the connext test on the nightly build you set up.

I can settle for that. I've moved the extra release tests to the nightly workflow so we can keep an eye on them and remind ourselves what's still failing, while keeping more container resources free for parallel PR jobs.

@codecov bot commented Aug 15, 2019

Codecov Report

Merging #990 into master will increase coverage by 1.42%.
The diff coverage is n/a.

Impacted file tree graph

```
@@            Coverage Diff             @@
##           master     #990      +/-   ##
==========================================
+ Coverage   33.59%   35.01%   +1.42%
==========================================
  Files         197      197
  Lines       10241    10241
  Branches     3821     3821
==========================================
+ Hits         3440     3586     +146
+ Misses       4485     4268     -217
- Partials     2316     2387      +71
```
Impacted Files Coverage Δ
...map_2d/test/integration/test_collision_checker.cpp
...2/nav2_lifecycle_manager/src/lifecycle_manager.cpp
...ns/include/dwb_plugins/one_d_velocity_iterator.hpp
src/navigation2/nav2_costmap_2d/src/footprint.cpp
...critics/include/dwb_critics/obstacle_footprint.hpp
...ontroller/dwb_core/include/dwb_core/exceptions.hpp
...gation2/nav2_costmap_2d/src/observation_buffer.cpp
src/navigation2/nav2_world_model/src/main.cpp
src/navigation2/nav2_util/test/test_actions.cpp
...roller/costmap_queue/src/limited_costmap_queue.cpp
... and 384 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a587327...618e642. Read the comment docs.

@ruffsl (Member, Author) commented Aug 16, 2019

Before merging, I'll have to change the executors over to use images from the rosplanning/navigation2 DockerHub repo. I'm also still trying to get the DockerHub build triggers stable, so I'll post again when that's sorted.

@crdelsey (Contributor)

@ruffsl I want to add a doc build step to CI. It would do a Sphinx build of our doc folder (once the docs are checked in). It would then publish the output back to the gh-pages branch in this repo.

In a perfect world, we'd do the build part on each PR to ensure the docs build hasn't been broken. Then the build and publish would happen when master is updated.

Any recommendations on the best way to do this? As in, where should I fit this into the CircleCI config?

@ruffsl (Member, Author) commented Aug 17, 2019

Any recommendations on the best way to do this? As in, where should I fit this into the CircleCI config.

Could you open a new issue describing what you had in mind, perhaps including any examples from other projects you'd like to replicate? Fitting this into the current CircleCI config doesn't seem hard.

I have a few questions: what is the convention for documentation in ROS 2? Should it be hosted on GitHub Pages or index.ros.org? How should deployment of docs be versioned? We can expand a bit more under the ticket. I like the idea of offloading some of the hosting for releases to readthedocs.org if we can. Here is a related article on these topics:
https://circleci.com/blog/automate-releases-from-pipelines-using-infrastructure-as-code

@ruffsl (Member, Author) commented Aug 21, 2019

Ok, I've updated the automated build rules on docker hub and changed over the images in the config.
I'll merge and verify the link to osrf/ros2:nightly is triggering tomorrow.

@ruffsl ruffsl merged commit 5c0dfbe into ros-navigation:master Aug 21, 2019
@ruffsl (Member, Author) commented Aug 21, 2019

@crdelsey , could you go into the CircleCI security settings and enable 3rd party orbs:
https://circleci.com/gh/organizations/ros-planning/settings#security

[screenshot]

Otherwise we'll continue to receive CI errors like this when we try to load the partnered codecov orb:
https://circleci.com/orbs/registry/orb/codecov/codecov

```
Orb codecov/codecov@1.0.5 not loaded. To use this orb, an organization admin must opt-in to using third party orbs in Organization Security settings.
```

@crdelsey (Contributor)

@ruffsl I don't have access to that either. Before I ask OSRF to enable this for us, does using the ORB buy us anything over just using the upload script we used before?

@ruffsl (Member, Author) commented Aug 21, 2019

does using the ORB buy us anything over just using the upload script we used before?

I'd eventually like to push more of the common commands into orbs of their own for working with ROS, so that we can share more CI boilerplate across ROS projects and the community, and simplify config complexity.
So even to use our own self-published orb, we'd still need the same setting.

@crdelsey (Contributor)

@tfoote Would you enable support for 3rd party orbs in the CircleCI settings for the ROS-Planning organization as @ruffsl describes 3 notes above?

We're using CircleCI as the CI system for navigation2 and using CodeCov.io to provide a code coverage report. CodeCov.io provides a plugin (orb) to make it easier to integrate with CircleCI, but, to use it, we need to grant it permissions on the CircleCI website. This has to be done by someone with permissions at the GitHub ros-planning org level, so I can't enable it myself.

If you've never used CircleCI before, there's an option to "login with github" which will associate the CircleCI account with your Github account and grant you appropriate permissions.

Thanks

@tfoote commented Aug 23, 2019

@nuclearsandwich Is going to enable it for you.

@nuclearsandwich (Contributor)

[screenshot]

In theory you're ✔️

@crdelsey (Contributor)

@nuclearsandwich Thanks. Looks good.

@ruffsl ruffsl mentioned this pull request Nov 18, 2019
@ruffsl ruffsl deleted the rmw branch August 15, 2020 07:31