Extend CI Workflow to test multiple RMW #990
Conversation
Can you help do the same with the Travis CI also? I'd like that to also run the tests with different RMW implementations.
@mkhansen-intel I forget: why do we still use Travis if we have Circle + the ROS PR builder going? I'm not sure what it adds that the other two don't do. Less to maintain here might be nice.
Could we tackle this in a future/separate PR?
Just FYI: opensplice middleware is broken too. #974.
Sometimes Circle will just mysteriously fail. Travis kind of acts like the backup when that happens. After a week or two, it suddenly starts to work again. This has happened twice now, at least that I've noticed. Also, due to the way the Circle build works, if master gets broken by an upstream change, a PR that fixes the problem can't make the Circle build pass. But it can make the Travis build pass. I'm OK with not making Travis do all the builds Circle does, but, so long as it isn't a huge burden, it is nice to have it as a backup check.
That sounds like an occurrence of a cache restore loading build artifacts from an improper job. I've been using timestamps from the nightly Docker image to keep the cache keys in lock step. In reworking the config for this PR, I realized that the cache keys between the release and debug jobs (which run in parallel) could potentially collide, so in 54565dd I added the
@crdelsey, if you could by chance remember an issue where this occurred, then I could dig into the respective CircleCI reports to check whether my hypothesis was the case, and whether this PR resolves it.
IIRC, the master branch was broken for so long that I was thinking we could have a CI Dockerfile that only installs underlay and overlay dependencies and skips the colcon build layers, but then we'd miss out on the benefit of having Docker Hub initialize our ccache for each new PR. Instead, I've added 664ec03 to resolve this stalemate (haha) by having the Docker Hub build hooks unset the
I just also noticed that CircleCI branch name substitution using {{ .Branch }} does not prefix names with the repo name, so PRs from separate forks using the same branch name could collide. To avoid cache collisions for PRs with the same branch name, I've included the unique PR number in the cache key. See b99b98a.
I've also removed the sharing of underlay caches across branches/PRs in 823173f.
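As an editorial sketch of the keying scheme described above (the key layout and values are illustrative, not the actual config from this PR): embedding the PR number, branch, and build type in the cache key addresses both the fork/branch collision and the parallel release/debug collision.

```shell
#!/bin/sh
# Illustrative values only: in CircleCI these would come from the
# CIRCLE_PR_NUMBER and CIRCLE_BRANCH environment variables.
pr_number=990
branch="rmw-ci"
build_type="release"

# The PR number keeps forks that reuse a branch name apart; the build
# type keeps the parallel release and debug jobs from restoring each
# other's caches.
cache_key="underlay-${build_type}-pr${pr_number}-${branch}"
echo "$cache_key"
```

The same pieces map directly onto a `key:` template in a `save_cache` step.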
@ruffsl Fortunately the latest Circle symbol failure problem was recent. Here is a build that shows the failure mode.
Ok, looks like build 1618 restored its underlay cache from build 1435:
The cache and its originating build 1435 are comparatively old:
Re-reading the latest docs, save_cache is a bit unintuitive in that it can't update existing caches.
However, given that cache key matching is prefix based, I've reworked this in 479aca7.
We shouldn't have any unintended cache-key crosstalk for about 267.484 years. 📆 🤞
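A hedged reading of that figure: since CircleCI cache lookups are prefix matched, a key ending in a fixed-width Unix timestamp stays unambiguous until the epoch grows an eleventh digit, i.e. 10^10 seconds after 1970 (the year 2286), roughly 267 years from this 2019 thread. A quick sanity check of the arithmetic:

```shell
#!/bin/sh
# Sanity-check the "~267 years" figure: current epochs have 10 digits,
# and the rollover to 11 digits happens at 10^10 seconds after 1970.
epoch=$(date +%s)
digits=${#epoch}
years_left=$(( (10000000000 - epoch) / 31556952 ))  # ~seconds per year
echo "epoch digits: $digits, years until rollover: $years_left"
```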
Ok, so the
Since my last comment, changes I've added include:
Example: https://circleci.com/gh/ruffsl/navigation2/418 I think this is ready for a review. FYI, it's probably easiest to read the longer config from bottom to top, given the hierarchy of YAML anchors and CircleCI abstractions; I've tried to order the references accordingly.
.circleci/config.yml
# name: Test Overlay
# command: |
#   . install/setup.sh
#   src/navigation2/tools/run_test_suite.bash
Perhaps we could push any system integration tests into proper ctests that include timing info in the xml test results, so that they can be load balanced accordingly. We could also split them out into their own job, but I'd like to make sure they are still reporting their own timing.
Perhaps we could push any system integration tests into proper ctests
They are ctests already. Do you mean launch them from colcon test?
Do you mean launch them from colcon test
Yea, as then they could be load balanced by the CircleCI timing split. However, splitting is currently at the granularity of a package, so if one package takes the lion's share of the test time, say more than the sum of the other tests, one unlucky worker will still end up testing that package by its lonesome.
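For reference, a sketch of the timing-based split being discussed, assuming the standard `circleci` CLI that is available inside CircleCI jobs (the exact command shape here is illustrative, not taken from this PR's config):

```shell
# Hand each parallel worker a timing-balanced subset of packages.
# --split-by=timings uses the historical test timings CircleCI records;
# the split granularity is whatever the input lines represent, which is
# why a single slow package can still dominate one worker.
PACKAGES=$(colcon list --names-only | circleci tests split --split-by=timings)
colcon test --merge-install --packages-select $PACKAGES
```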
Ok. I don't want to go back to colcon running the system tests directly without figuring out some changes. I was unable to get the full console output when colcon ran the test, and without that, it was impossible to tell why the test failed.
I was unable to get the full console output when colcon ran the test
Could you be more specific? Do you have an example I could debug, or links to the two CI test runs I could compare?
I just ran the system test locally using colcon. Here is the output:
> colcon test --merge-install --packages-select nav2_system_tests
Starting >>> nav2_system_tests
[Processing: nav2_system_tests]
[Processing: nav2_system_tests]
[Processing: nav2_system_tests]
[Processing: nav2_system_tests]
--- stderr: nav2_system_tests
Errors while running CTest
---
Finished <<< nav2_system_tests [2min 30s] [ with test failures ]
Summary: 1 package finished [2min 30s]
1 package had stderr output: nav2_system_tests
1 package had test failures: nav2_system_tests
>
> colcon test-result --verbose --test-result-base build/nav2_system_tests/
build/nav2_system_tests/Testing/20190814-2300/Test.xml: 11 tests, 0 errors, 1 failure, 0 skipped
build/nav2_system_tests/test_results/nav2_system_tests/test_localization.xml: 1 test, 0 errors, 1 failure, 0 skipped
- nav2_system_tests test_localization.missing_result
<<< failure message
The test did not generate a result file.
>>>
Summary: 77 tests, 0 errors, 2 failures, 0 skipped
That's all I get - "test did not generate a result file"
Compare with any current build where I get all the console output - showing all the messages each node prints out.
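(An editorial aside, not part of the original exchange: colcon ships a stock event handler that streams each test's stdout/stderr straight to the console, which may address the missing-output complaint; flags otherwise as in the session above.)

```shell
# console_direct+ passes the tests' stdout/stderr through to the console
# as they run, instead of only writing them to log/ and the XML results.
colcon test --merge-install --packages-select nav2_system_tests \
  --event-handlers console_direct+
```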
That summary is generated by colcon by examining what ctest wrote to test_results/nav2_system_tests/test_localization.xml.
I'm not sure why ctest isn't reporting the stdout/stderr. My guess is that it wasn't configured to, but perhaps Dirk would know. Still, I don't think it would be useful to duplicate the stdout dump into the XML when it's already captured in the log artifacts. I think the message field in the XML should be used for higher-level feedback reporting.
But looking at the colcon logs for the test, all the console output, showing all the messages each node prints out, seems to be there: logs/test/nav2_system_tests/... Please make sure that you can navigate the artifacts tab for test jobs, where this is all stored for you.
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/stdout.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/stdout_stderr.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/streams.log
https://528-166157381-gh.circle-artifacts.com/0/opt/overlay_ws/log/test/nav2_system_tests/command.log
Those failures look expected except for this in the RTI connext test.
I simplified the build config a bit using colcon mixin and repurposed the CI images into debug and release variants for the different job executors.
@ruffsl Thanks for doing all this. I'll review this as soon as I can.
@ruffsl - can you rebase and re-run the CI to get it passing? I'm assuming there's no point in reviewing this until the CI is passing.
I can settle for that. I've moved the extra release tests to the nightly workflow so we can keep an eye on them, and remind ourselves what's still failing, while keeping more container resources free for parallel PR jobs.
prefix matching fails when key includes slashes
Before merging, I'll have to change the executors over to use images from the
@ruffsl I want to add a doc build step to CI. It would do a Sphinx build of our doc folder (once the docs are checked in) and then publish the output back to the gh-pages branch in this repo. In a perfect world, we'd do the build part on each PR to ensure the docs build hasn't been broken; the build and publish would then happen when master is updated. Any recommendations on the best way to do this? As in, where should I fit this into the CircleCI config?
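(A hedged sketch of the PR-side check being asked for here, assuming a Sphinx project lives under doc/ with a conf.py; the paths and options are illustrative, not from this repo.)

```shell
# Build the docs as a CI check; -W promotes Sphinx warnings to errors so
# a broken docs build fails the job. Publishing to gh-pages would be a
# separate, master-only step.
pip install --user sphinx
sphinx-build -W -b html doc/ _build/html
```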
Could you open a new issue describing what you had in mind, perhaps including any examples from other projects you'd like to replicate? Fitting this into the current CircleCI config doesn't seem hard. I have a few questions: what is the convention for documentation in ROS 2; should this be hosted on GitHub Pages or index.ros.org; how should deployment of docs be versioned; etc. We can expand a bit more under the ticket. I like the idea of offloading some of the hosting for releases to readthedocs.org if we can. Here is a related article on these topics:
Ok, I've updated the automated build rules on Docker Hub and changed over the images in the config.
@crdelsey, could you go into the CircleCI security settings and enable third-party orbs? Otherwise we'll continue to receive CI errors like the following when we try to load the partnered Codecov orb:
@ruffsl I don't have access to that either. Before I ask OSRF to enable this for us, does using the orb buy us anything over just using the upload script we used before?
I'd eventually like to push more of the common commands into orbs of their own for working with ROS, so that we can share more CI boilerplate across ROS projects and the community, and simplify config complexity.
@tfoote Would you enable support for third-party orbs in the CircleCI settings for the ROS-Planning organization, as @ruffsl describes three notes above? We're using CircleCI as the CI system for navigation2, and CodeCov.io to provide a code coverage report. CodeCov.io provides a plugin (orb) to make it easier to integrate with CircleCI, but, to use it, we need to grant it permissions on the CircleCI website. This has to be done by someone with permissions at the GitHub ros-planning org level, so I can't enable it myself. If you've never used CircleCI before, there's an option to "login with github" which will associate the CircleCI account with your GitHub account and grant you the appropriate permissions. Thanks.
@nuclearsandwich Is going to enable it for you. |
@nuclearsandwich Thanks. Looks good. |
This PR extends the CircleCI workflow to test multiple RMW implementations. To compensate for the added build time spent generating messages, a combination of ccache and CircleCI caches is used to accelerate fresh PR jobs, including jobs across successive workflows. To compensate for the added test time, the RMW implementations are tested across parallel jobs using the quicker release build.
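For readers reproducing this locally, the per-RMW testing described above can be approximated by switching the RMW_IMPLEMENTATION environment variable; the three names below were the Tier 1 alternatives at the time, so adjust to whatever is installed.

```shell
# Run the test suite once per RMW implementation. In CI these run as
# parallel jobs; locally they simply run one after another.
for rmw in rmw_fastrtps_cpp rmw_connext_cpp rmw_opensplice_cpp; do
  RMW_IMPLEMENTATION=$rmw colcon test --merge-install \
    --packages-select nav2_system_tests
done
```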
Motivation
In preparation for the security workshop at ROSCon, @mikaelarguedas and I have been working to demo SROS2 running on the Turtlebot3 to secure the entire ROS stack, including navigation2. However, when using other Tier 1 supported RMW implementations to verify security functionality, we've been encountering a number of issues, including:
#915
#916
To help the navigation team detect and debug these issues, I'd like to extend the current CI workflow to include support for testing multiple RMW implementations, as well as a number of other optimizations that may be helpful in keeping the CI tractable.
Changes
DockerHub and Dockerfile
FROM image using a build ARG
CircleCI config.yaml
After
Action Items