
Extend CI Workflow to test multiple RMW #990

Merged · 77 commits · Aug 21, 2019

Conversation

@ruffsl (Member) commented Jul 30, 2019

This PR extends the CircleCI workflow to test multiple RMW implementations. To compensate for the added build time in generating messages, a combination of ccache and CircleCI caches is used to accelerate fresh PR jobs, including jobs across successive workflows. To compensate for the added test time, RMW implementations are tested across parallel jobs using the quicker release build.


Motivation

In preparation for the security workshop at ROSCon, @mikaelarguedas and I have been working to demo SROS2 running on the TurtleBot3 to secure the entire ROS stack, including navigation2. However, when using other Tier 1 supported RMW implementations to verify security functionality, we've been encountering a number of issues, including:

#915
#916

To help the navigation team detect and debug these issues, I'd like to extend the current CI workflow to support testing multiple RMW implementations, along with a number of other optimizations that may help keep the CI tractable.

Changes

  • DockerHub and Dockerfile

    • Clarify underlay and overlay terminology and inconsistencies
    • Parameterize the FROM image using a build ARG
    • Add hooks for reusing Dockerfile for multiple tags
    • Install and use ccache for CI optimization
  • CircleCI config.yaml

    • Clarify underlay and overlay terminology and inconsistencies
    • Split RMW testing for release builds into parallel jobs
    • Add nonce to cache keys to avoid collisions between parallel jobs and workflows
    • Use CI caching to preserve ccache across jobs for both underlay and overlay
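As a rough sketch of the Dockerfile changes above (the image name, package manager, and paths here are illustrative assumptions, not the exact contents of this PR):

```dockerfile
# Parameterize the base image so one Dockerfile can serve multiple tags
ARG FROM_IMAGE=osrf/ros2:nightly
FROM ${FROM_IMAGE}

# Install ccache and put its compiler shims first on the PATH,
# so colcon/CMake builds transparently reuse cached object files
RUN apt-get update && \
    apt-get install -y --no-install-recommends ccache && \
    rm -rf /var/lib/apt/lists/*
ENV PATH="/usr/lib/ccache:${PATH}"
```

On Debian/Ubuntu bases, `/usr/lib/ccache` holds the compiler symlinks, so prepending it to PATH is enough to route `gcc`/`g++` calls through ccache.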

After

[screenshot]

Action Items

  • Update DockerHub automated build config
  • Change over image repo from ruffsl to rosplanning

[screenshot]

| Source Type | Source | Docker Tag | Dockerfile location | Build Context | Autobuild |
| --- | --- | --- | --- | --- | --- |
| Branch | master | master | /.dockerhub/nightly/Dockerfile | <leave_empty> | True |
| Branch | master | master-rmw | /.dockerhub/nightly-rmw/Dockerfile | <leave_empty> | True |

@mkhansenbot (Collaborator)

Can you help do the same with the Travis CI also? I'd like that to also run the tests with different RMW implementations.

@SteveMacenski (Member) commented Jul 30, 2019

@mkhansen-intel I forget: why do we still use Travis if we have Circle + the ROS PR builder going? I'm not sure what it adds that the other two don't. Less to maintain here might be nice.

@ruffsl (Member, Author) commented Jul 30, 2019

Can you help do the same with the Travis CI also?

Could we tackle this in a future/separate PR?

@crdelsey (Contributor)

Just FYI: opensplice middleware is broken too. #974.

@crdelsey (Contributor)

why do we still use Travis if we have Circle + ROS PR builder going

Sometimes Circle will just mysteriously fail. Travis kind of acts as the backup when that happens.
A couple of times now, Circle will always fail when the gazebo_ros plugin is loaded. It fails with a symbol-not-found error, as if the gazebo_ros plugin is linked against a different version of ROS than is currently present on the system.

After a week or two, it suddenly starts to work again. This has happened twice now, at least that I've noticed.

Also, due to the way the Circle build works, if master gets broken by an upstream change, a PR that fixes the problem can't make the Circle build pass. But it can make the Travis build pass.

I'm OK with not making Travis do all the builds Circle does, but so long as it isn't a huge burden, it is nice to have it as a backup check.

@ruffsl (Member, Author) commented Jul 31, 2019

Circle will always fail when the gazebo_ros plugin is loaded. It fails with a symbol-not-found error, as if the gazebo_ros plugin is linked against a different version of ROS than is currently present on the system.

That sounds like an occurrence of a cache restore loading build artifacts from an improper job. I've been using timestamps from the nightly Docker image to keep the cache keys in lock step.

https://github.com/ros-planning/navigation2/blob/2350778c3eb11196de68ce006d5ec246e848f27b/.circleci/config.yml#L7-L9

In reworking the config for this PR, I realized that the cache keys between release and debug jobs (which run in parallel) could potentially collide, so in 54565dd I added the CIRCLE_CACHE_NONCE env var to help seed unique checksum.txt digests between jobs in the same workflow, or jobs with different compiler flags.
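A minimal sketch of how a nonce can seed a cache key (the step and key names here are assumptions for illustration, not the exact config in this PR):

```yaml
steps:
  - run:
      name: Seed cache checksum
      command: |
        # Combine the per-workflow nonce with the nightly image timestamp,
        # so parallel jobs (e.g. release vs. debug) never share a key
        echo "${CIRCLE_CACHE_NONCE}" > checksum.txt
        cat /opt/ros_ws/timestamp >> checksum.txt
  - restore_cache:
      keys:
        - underlay-v1-{{ checksum "checksum.txt" }}
```

Since `{{ checksum "checksum.txt" }}` is computed per job, any difference in nonce or timestamp yields a distinct key.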

@crdelsey, if you could remember an issue where this occurred, I could dig into the respective CircleCI reports to check whether my hypothesis holds, and whether this PR resolves it.

@ruffsl (Member, Author) commented Jul 31, 2019

Also, due to the way the Circle build works, if master gets broken by an upstream change, a PR that fixes the problem can't make the CIrcle build pass. But it can make the Travis build pass.

IIRC, the master branch was broken for so long that rosplanning/navigation2:master had grown too stale and fallen out of sync with the upstream changes that the respective fixing PRs were expecting.

I was thinking we could have a CI Dockerfile that only installs underlay and overlay dependencies and skips the colcon build layers, but then we'd miss out on the benefit of having DockerHub initialize our ccache for each new PR.

Instead, I've added 664ec03 to resolve this stalemate (haha) by having the DockerHub build hooks unset FAIL_ON_BUILD_FAILURE, thus allowing the images to continue to build even when master breaks. However, the DockerHub CI badge will be less indicative of colcon build success, so I've enabled the nightly cron jobs in b40cdaa to keep a close eye on it. It should be less resource intensive now that we are using ccache.

@ruffsl (Member, Author) commented Jul 31, 2019

Circle will always fail when the gazebo_ros plugin is loaded. It fails with a symbol-not-found error, as if the gazebo_ros plugin is linked against a different version of ROS than is currently present on the system.

I just also noticed that CircleCI branch-name substitution using {{ .Branch }} does not prefix names with the repo name, so PRs from separate forks using the same branch name could collide.

To avoid cache collisions for PRs with the same branch name, I've included the unique PR number in the cache key. See b99b98a.

@ruffsl (Member, Author) commented Jul 31, 2019

I've also removed sharing of underlay caches across branches/PRs in 823173f.
That was probably a bug.

@crdelsey (Contributor)

@ruffsl Fortunately the latest Circle symbol failure problem was recent. Here is a build that shows the failure mode.
https://circleci.com/gh/ros-planning/navigation2/1618?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

@ruffsl (Member, Author) commented Aug 1, 2019

Here is a build that shows the failure mode.

Ok, looks like build 1618 restored its underlay cache from build 1435:

```
Found a cache from build 1435 at upstream-cache-v1-arch1-linux-amd64-6_85-7fIRLjO_6tpk1hegvoBhjHu+EAoQ4oDNc+JQ_VwWOlQ=
Size: 35 MB
Cached paths:
  * /opt/ros_ws
```

The cache and its originating build 1435 are comparatively old: June 19 vs. June 28.

```
Skipping cache generation, cache already exists for key: upstream-cache-v1-arch1-linux-amd64-6_85-7fIRLjO_6tpk1hegvoBhjHu+EAoQ4oDNc+JQ_VwWOlQ=
Found one created at 2019-06-19 20:58:00 +0000 UTC
```

Re-reading the latest docs, save_cache is a bit unintuitive in that it can't update existing caches.
Looks like the docs have been updated to better reflect this. Grrr...

> Note: If the cache for the given key already exists it won't be modified, and job execution will proceed to the next step.

https://circleci.com/docs/2.0/configuration-reference/#save_cache

> CircleCI restores caches in the order of keys listed in the restore_cache step. Each cache key is namespaced to the project, and retrieval is prefix-matched. The cache will be restored from the first matching key. If there are multiple matches, the most recently generated cache will be used.

https://circleci.com/docs/2.0/caching/#restoring-cache

However, given that cache-key matching is prefix based, I've reworked this in 479aca7.
Some working example jobs below, with 261 restored from 255, and 265 restored from 261:

We shouldn't have another unintended cache-key collision for about 267.484 years. 📆 🤞
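Given prefix-based matching, the rework can be sketched as a key that embeds the epoch on save, with a prefix fallback on restore (key names and paths here are illustrative, not the exact config):

```yaml
- restore_cache:
    keys:
      # Prefix fallback: since save_cache can't update an existing key,
      # embed {{ epoch }} on save so every job writes a fresh cache, and
      # restore the newest one sharing the same checksum prefix
      - overlay-v1-{{ checksum "checksum.txt" }}-
- save_cache:
    key: overlay-v1-{{ checksum "checksum.txt" }}-{{ epoch }}
    paths:
      - /opt/overlay_ws
```

This is how job 261 can restore from 255, and 265 from 261: each restore prefix-matches the most recent save.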

@ruffsl (Member, Author) commented Aug 10, 2019

Ok, so the osrf/ros2:nightly-rmw tags should be operational. I've merged origin master and my rmw branch on my fork to test out the CI against the latest state of this repo. Could I get an opinion on the test failures here? None of the jobs seem to be passing with master:

@ruffsl ruffsl marked this pull request as ready for review August 12, 2019 18:31
@ruffsl (Member, Author) commented Aug 12, 2019

Since my last comment, changes I've added include:

  • Add ccache stats and logs for debugging
  • Split all build and test steps into separate jobs
  • Use more YAML magic for keeping things DRY
  • Switch to the official lcov orb
  • Enable test results
  • Enable parallel testing and load balancing

Example: https://circleci.com/gh/ruffsl/navigation2/418

[screenshots]

I think this is ready for a review. FYI, it's probably easiest to read the longer config from bottom to top, given the hierarchy of YAML anchors and CircleCI abstractions. I've tried to order the references accordingly.

```yaml
# name: Test Overlay
# command: |
#   . install/setup.sh
#   src/navigation2/tools/run_test_suite.bash
```
ruffsl (Member, Author):

Perhaps we could push any system integration tests into proper ctests that include timing info in the XML test results, so that they can be load balanced accordingly. We could also split them out into their own job, but I'd like to make sure they still report their own timing.

crdelsey (Contributor):

Perhaps we could push any system integration tests into proper ctests

They are ctests already. Do you mean launch them from colcon test?

ruffsl (Member, Author):

Do you mean launch them from colcon test

Yeah, as then they could be load balanced by the CircleCI timing split. However, splitting is currently at the granularity of a package, so if one package takes the lion's share of test time, say more than the sum of the other tests, one unlucky worker will still end up testing that package by its lonesome.
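The package-granularity split described here corresponds roughly to a step like the following (`circleci tests split --split-by=timings` and `colcon list --names-only` are real CLI options, but the exact invocation is a hedged sketch, not this PR's config):

```shell
# Divide package names across parallel containers, weighted by the
# timing data CircleCI collects from previously stored test results
PACKAGES=$(colcon list --names-only | circleci tests split --split-by=timings)
colcon test --packages-select ${PACKAGES}
colcon test-result --verbose
```

Because the unit being split is a whole package, no balancing scheme can push the makespan below the runtime of the single slowest package.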

crdelsey (Contributor):

Ok. I don't want to go back to colcon running the system tests directly without figuring out some changes. I was unable to get the full console output when colcon ran the test, and without that, it was impossible to tell why the test failed.

ruffsl (Member, Author):

I was unable to get the full console output when colcon ran the test

Could you be more specific? Do you have an example I could debug, or links to the two CI test runs I could compare?

crdelsey (Contributor):

I just ran the system test locally using colcon. Here is the output:

```
> colcon test --merge-install --packages-select nav2_system_tests
Starting >>> nav2_system_tests
[Processing: nav2_system_tests]
--- stderr: nav2_system_tests
Errors while running CTest
---
Finished <<< nav2_system_tests [2min 30s]	[ with test failures ]

Summary: 1 package finished [2min 30s]
  1 package had stderr output: nav2_system_tests
  1 package had test failures: nav2_system_tests
```

```
> colcon test-result --verbose --test-result-base build/nav2_system_tests/
build/nav2_system_tests/Testing/20190814-2300/Test.xml: 11 tests, 0 errors, 1 failure, 0 skipped
build/nav2_system_tests/test_results/nav2_system_tests/test_localization.xml: 1 test, 0 errors, 1 failure, 0 skipped
- nav2_system_tests test_localization.missing_result
  <<< failure message
    The test did not generate a result file.
  >>>

Summary: 77 tests, 0 errors, 2 failures, 0 skipped
```

That's all I get: "test did not generate a result file".

Compare with any current build, where I get all the console output, showing all the messages each node prints out.
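One possible workaround for the missing console output, assuming a recent colcon (the `console_direct+` event handler is a real colcon option that streams each test's stdout/stderr straight to the terminal):

```shell
# Stream test output directly to the console as it runs,
# instead of only summarizing pass/fail at the end
colcon test --merge-install --packages-select nav2_system_tests \
    --event-handlers console_direct+
```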

ruffsl (Member, Author):

That summary is being generated by colcon by examining what ctest wrote to test_results/nav2_system_tests/test_localization.xml.

I'm not sure why ctest isn't reporting the stdout/stderr. My guess is that it wasn't configured to, but perhaps Dirk would know. Still, I don't think it would be useful to duplicate the stdout dump into the XML when it's already captured in the log artifacts. I think the message field in the XML should be used for higher-level feedback reporting.

But looking at the colcon logs for the test, all the console output, showing all the messages each node prints out, seems to be there: logs/test/nav2_system_tests/... Please make sure that you can navigate the Artifacts tab for test jobs, where all of this is stored for you.

https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/stdout.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/stdout_stderr.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/streams.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/command.log

[screenshot]

@crdelsey (Contributor)

Could I get an opinion on the test failures here

Those failures look expected except for this in the RTI connext test.

```
nav2_util test_actions.gtest.missing_result
  <<< failure message
    The test did not generate a result file:

    Running main() from /opt/ros/dashing/src/gtest_vendor/src/gtest_main.cc
    [==========] Running 2 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 2 tests from ActionTest
    [ RUN      ] ActionTest.test_simple_action

    >>> [rcutils|error_handling.c:107] rcutils_set_error_state()
    This error state is being overwritten:

      'array_list is already initialized, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcutils/src/array_list.c:61, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/logging_rosout.c:179'

    with this new error message:

      'Failed to add publisher to map., at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/logging_rosout.c:181'

    rcutils_reset_error() should be called after error handling to avoid this.
    <<<
    RTI Data Distribution Service Non-commercial license is for academic, research, evaluation and personal use only. USE FOR COMMERCIAL PURPOSES IS PROHIBITED. See RTI_LICENSE.TXT for terms. Download free tools at rti.com/ncl. License issued to Non-Commercial User license@rti.com For non-production use only.
    Expires on 00-jan-00 See www.rti.com for more information.
    unknown file: Failure
    C++ exception with description "failed to initialize rcl node: Failed to add publisher to map., at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/logging_rosout.c:181" thrown in SetUp().
  >>>
```

@ruffsl (Member, Author) commented Aug 12, 2019

I simplified the build config a bit using colcon mixins and repurposed the CI images into debug and release variants for the different job executors.
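The colcon mixin usage might look roughly like this (mixin names taken from the public colcon-mixin-repository; treat it as a sketch rather than this PR's exact config):

```shell
# Register the default mixin repository, then build with shorthand
# mixins instead of spelling out the CMake flags by hand
colcon mixin add default \
    https://raw.githubusercontent.com/colcon/colcon-mixin-repository/master/index.yaml
colcon mixin update default
colcon build --symlink-install --mixin release ccache
```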

@crdelsey (Contributor)

@ruffsl Thanks for doing all this. I'll review this as soon as I can

@mkhansenbot (Collaborator)

@ruffsl - can you rebase and re-run the CI to get it passing? I'm assuming there's no point in reviewing this until the CI is passing

@ruffsl (Member, Author) commented Aug 15, 2019

by only running the connext test on the nightly build you set up.

I can settle for that. I've moved the extra release tests to the nightly workflow so we can keep an eye on them and remind ourselves what's still failing, while keeping more container resources free for parallel PR jobs.

@codecov bot commented Aug 15, 2019

Codecov Report

Merging #990 into master will increase coverage by 1.42%.
The diff coverage is n/a.

Impacted file tree graph

```
@@            Coverage Diff             @@
##           master     #990      +/-   ##
==========================================
+ Coverage   33.59%   35.01%   +1.42%
==========================================
  Files         197      197
  Lines       10241    10241
  Branches     3821     3821
==========================================
+ Hits         3440     3586     +146
+ Misses       4485     4268     -217
- Partials     2316     2387      +71
```
Impacted Files Coverage Δ
...map_2d/test/integration/test_collision_checker.cpp
...2/nav2_lifecycle_manager/src/lifecycle_manager.cpp
...ns/include/dwb_plugins/one_d_velocity_iterator.hpp
src/navigation2/nav2_costmap_2d/src/footprint.cpp
...critics/include/dwb_critics/obstacle_footprint.hpp
...ontroller/dwb_core/include/dwb_core/exceptions.hpp
...gation2/nav2_costmap_2d/src/observation_buffer.cpp
src/navigation2/nav2_world_model/src/main.cpp
src/navigation2/nav2_util/test/test_actions.cpp
...roller/costmap_queue/src/limited_costmap_queue.cpp
... and 384 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a587327...618e642. Read the comment docs.

@ruffsl (Member, Author) commented Aug 16, 2019

Before merging, I'll have to change the executors over to use images from the rosplanning/navigation2 DockerHub repo. I'm also still trying to get the DockerHub build triggers stable, so I'll post again when that's sorted.

@crdelsey (Contributor)

@ruffsl I want to add a doc build step to CI. It would do a Sphinx build of our doc folder (once the docs are checked in). It would then publish the output back to the gh-pages branch in this repo.

In a perfect world, we'd do the build part on each PR to ensure the docs build hasn't been broken. Then the build and publish would happen when master is updated.

Any recommendations on the best way to do this? As in, where should I fit this into the CircleCI config?

@ruffsl (Member, Author) commented Aug 17, 2019

Any recommendations on the best way to do this? As in, where should I fit this into the CircleCI config.

Could you open a new issue describing what you had in mind, perhaps including any examples from other projects you'd like to replicate? Fitting this into the current CircleCI config doesn't seem hard.

I have a few questions: what is the convention for documentation in ROS 2? Should it be hosted on GitHub Pages or index.ros.org? How should deployment of docs be versioned? We can expand a bit more under the ticket. I like the idea of offloading some of the hosting for releases to readthedocs.org if we can. Here is a related article on these topics:
https://circleci.com/blog/automate-releases-from-pipelines-using-infrastructure-as-code

@ruffsl (Member, Author) commented Aug 21, 2019

Ok, I've updated the automated build rules on docker hub and changed over the images in the config.
I'll merge and verify the link to osrf/ros2:nightly is triggering tomorrow.

@ruffsl ruffsl merged commit 5c0dfbe into ros-navigation:master Aug 21, 2019
@ruffsl (Member, Author) commented Aug 21, 2019

@crdelsey , could you go into the CircleCI security settings and enable 3rd party orbs:
https://circleci.com/gh/organizations/ros-planning/settings#security

[screenshot]

Otherwise we'll continue to receive CI errors like this when we try to load the partnered codecov orb:
https://circleci.com/orbs/registry/orb/codecov/codecov

```
Orb codecov/codecov@1.0.5 not loaded. To use this orb, an organization admin must opt-in to using third party orbs in Organization Security settings.
```

@crdelsey (Contributor)

@ruffsl I don't have access to that either. Before I ask OSRF to enable this for us, does using the ORB buy us anything over just using the upload script we used before?

@ruffsl (Member, Author) commented Aug 21, 2019

does using the ORB buy us anything over just using the upload script we used before?

I'd eventually like to push more of the common commands into orbs of their own for working with ROS, so that we can share more CI boilerplate across ROS projects and the community, and simplify config complexity.
So even to use our own self-published orb, we'd still need the same setting.

@crdelsey (Contributor)

@tfoote Would you enable support for 3rd party orbs in the CircleCI settings for the ROS-Planning organization as @ruffsl describes 3 notes above?

We're using CircleCI as the CI system for navigation2 and using CodeCov.io to provide a code coverage report. CodeCov.io provides a plugin (orb) to make it easier to integrate with CircleCI, but, to use it, we need to grant it permissions on the CircleCI website. This has to be done by someone with permissions at the GitHub ros-planning org level, so I can't enable it myself.

If you've never used CircleCI before, there's an option to "login with github" which will associate the CircleCI account with your Github account and grant you appropriate permissions.

Thanks

@tfoote commented Aug 23, 2019

@nuclearsandwich Is going to enable it for you.

@nuclearsandwich (Contributor)

[screenshot]

In theory you're ✔️

@crdelsey (Contributor)

@nuclearsandwich Thanks. Looks good.

@ruffsl ruffsl mentioned this pull request Nov 18, 2019
@ruffsl ruffsl deleted the rmw branch August 15, 2020 07:31