New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

invoke colcon instead of ament_tools #132

Merged
merged 15 commits into from Apr 21, 2018

Conversation

Projects
None yet
3 participants
@dirk-thomas
Member

dirk-thomas commented Mar 21, 2018

I don't intended to merge this patch - it will stay "in progress" until the direction is decided. I will keep the branch up-to-date though to allow anyone to opt-in to use this on CI.


With this branch you can set CI_SCRIPTS_BRANCH to colcon in the job parameters to use colcon for the build. When using this branch you need to change the default value for CI_AMENT_BUILD_ARGS from:

--parallel --cmake-args -DSECURITY=ON --

to

--cmake-args " -DSECURITY=ON"

  • Parallel builds are the default (--executor parallel).
  • Since colcon uses argparse without custom parsing of greedy arguments the -D needs to be prefixed with a space so that argparse doesn't consider it an argument. The --cmake-args argument itself will then take care of removing the leading space. The terminating -- is not needed anymore.

I also often use the option --event-handler console_cohesion+ to print the output of each package "on block" once the package has finished. That makes sure the output is not interleaves but the build log still contains "all" the information.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status
  • built without SECURITY=ON since that require changing the handling of the job parameters the PR has been updated to change the default parameters to keep security on the same way as before
  • only using FastRTPS since the RTI code generator keeps failing when building "many" packages in parallel
  • Linux coverage Build Status

Comparison of two Linux CI builds using different build tools:

@dirk-thomas dirk-thomas self-assigned this Mar 21, 2018

@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Mar 21, 2018

Not stating an opinion here but I think it would be more fair to compare build time instead of the time including repeating flaky tests:

ament build time: 00:21:30
colcon build time: 00:20:04 (slightly faster)

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Mar 21, 2018

Not stating an opinion here but I think it would be more fair to compare build time instead of the time including repeating flaky tests:

With "build" time I was referring to the time the Jenkins build takes. I would expect both builds to have a similar rate of flakiness and would be surprised if they preferred happen in the ament_tools one. The overall result is still a significant speed up - I haven't looked into where the time difference originates from...

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 11, 2018

Ready for review with updated CI builds:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

To iterate quickly the current state uses the default branches of the colcon repos. At some point we can / should (?) switch to the latest release.

@dirk-thomas dirk-thomas added in review and removed in progress labels Apr 11, 2018

@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Apr 11, 2018

Is this able to build connext without crashing on the code generator ? I see that the windows job is not using connext

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 11, 2018

Is this able to build connext without crashing on the code generator ?

At least the builds I ran on Jenkins didn't had that problem. But since the problem happened only very rarely in the first place I don't think I can answer that question after just a few builds.

I see that the windows job is not using connext

I triggered another Windows build with Connext: Build Status

@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Apr 11, 2018

I think I'm missing information to review this so forgive if the following questions have been answered elsewhere and are planned to be addressed (pointers appreciated)

If it's not desired to have the console output displayed by default in colcon. Is there a plan to add --event-handler console_cohesion+ or at least the error output to the jobs by default?

What is the plan for DSECURITY ? Do you plan on having the jobs fetch the meta information from somewhere to add the flag for fastrtps and test_security ? Or should the default values of the jobs also be modified accordingly?

Why is catkin_pkg required? Does this mean that we don;t need ament_package anymore?

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 11, 2018

If it's not desired to have the console output displayed by default in colcon. Is there a plan to add --event-handler console_cohesion+ or at least the error output to the jobs by default?

Yes, we could make that the default. I can add more commits to this PR - I was mostly worried when also altering the job generation (to use different defaults and change the naming of some of the parameters) it might get too much for a review.

What is the plan for DSECURITY ? Do you plan on having the jobs fetch the meta information from somewhere to add the flag for fastrtps and test_security ? Or should the default values of the jobs also be modified accordingly?

I would keep that separate (to not make a single PR address too many things at the same time) and keep the current default value passing it to all packages.

Why is catkin_pkg required? Does this mean that we don;t need ament_package anymore?

colcon uses the API of catkin_pkg to parse the manifests and doesn't use ament_package. Either of the two would be sufficient. It just happens that the first is already available from Debian in a ROS distro independent way which isn't the case for the later.

@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Apr 11, 2018

I would keep that separate (to not make a single PR address too many things at the same time) and keep the current default value passing it to all packages.

So if we merge this we don't build or test security anymore, is that correct ?

I can add more commits to this PR - I was mostly worried when also altering the job generation (to use different defaults and change the naming of some of the parameters) it might get too much for a review.

I understand the concern for the review, nicely crafted commits should allow to reduce the review burden. If we want to convince ourselves about the job generation, what about a set of test jobs generated with the new config and manually triggered ?

I think that the behavior of the jobs should be the same after merging this PR, so not having console output or not building/testing part of our stack is a blocker for merging this in my opinion.

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 11, 2018

So if we merge this we don't build or test security anymore, is that correct ?

No, that is not the case. Currently the default job parameters contain -DSECURITY=ON. The same will be the case after this change.

I think that the behavior of the jobs should be the same after merging this PR, so not having console output or not building/testing part of our stack is a blocker for merging this in my opinion.

I updated the PR to change the default parameters to match the same behavior for e.g. output handling (see https://github.com/ros2/ci/pull/132/files#diff-42a7cd652d15099c3f6b6c27e05dce31R82).

If we want to convince ourselves about the job generation, what about a set of test jobs generated with the new config and manually triggered ?

I intentionally didn't include any other job configuration changes in this PR. So I don't think we need to generate any temporary jobs. You can simply run trigger the current job using this branch and using the default values from this patch:

--cmake-args " -DSECURITY=ON" --event-handler console_cohesion+

@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Apr 11, 2018

Currently the default job parameters contain -DSECURITY=ON. The same will be the case after this change.

I must have misunderstood the description of this PR then "built without SECURITY=ON since that require changing the handling of the job parameters"

I updated the PR to change the default parameters to match the same behavior for e.g. output handling

👍 , without that change our jobs would have been broken: https://ci.ros2.org/job/ci_linux/4197/

@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Apr 11, 2018

I intentionally didn't include any other job configuration changes in this PR. So I don't think we need to generate any temporary jobs. You can simply run trigger the current job using this branch and using the default values from this patch:
--cmake-args " -DSECURITY=ON" --event-handler console_cohesion+

I tried to run a job with these arguments and the job crashed on argument parsing: https://ci.ros2.org/job/ci_windows/4301/

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 11, 2018

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 17, 2018

For the latest builds see https://ci.ros2.org/view/colcon/

args.cmake_build_type + ' --']
args.colcon_script, 'build',
'--base-paths', '"%s"' % args.sourcespace,
# '--build-tests',

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 18, 2018

Contributor

This can be removed now that we pass BUILD_TESTING to cmake

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 20, 2018

Default CI builds:

  • Linux Build Status
  • Linux aarch Build Status
  • macOS Build Status
  • Windows Build Status
  • Linux coverage: Build Status

Default packaging builds (FastRTPS):

  • Linux Build Status
  • Linux aarch Build Status
  • macOS Build Status
  • Windows Build Status

Packaging builds (OpenSplice):

  • Linux Build Status
  • Linux aarch Build Status
  • macOS Build Status
  • Windows Build Status

With the recent changes and the latest "green" results (except the hanging Linux builds) I am putting this back in review.

@dirk-thomas dirk-thomas added in review and removed in progress labels Apr 20, 2018

@@ -188,7 +188,7 @@ def create_job(os_name, job_name, template_file, additional_dict):
'cmake_build_type': 'None',
'time_trigger_spec': PERIODIC_JOB_SPEC,
'mailer_recipients': DEFAULT_MAIL_RECIPIENTS,
'ament_test_args_default': '--retest-until-fail 20 --ctest-args -LE linter --',
'test_args_default': '--retest-until-fail 20 --ctest-args -LE linter --',

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 20, 2018

Contributor

My understanding was that any arg not known by argparse should be escaped (in this case --ctest-args ' -LE' linter)
Did you test this set of argument on one of the jobs?

Same thing for the ending -- delimiter, I think it can be removed

This comment has been minimized.

@dirk-thomas
# Collect the test results
ret_test_results = job.run(
['"%s"' % job.python, '-u', '"%s"' % ament_py, 'test_results', '"%s"' % args.buildspace],
[args.colcon_script, 'test-result', '--build-base', '"%s"' % args.buildspace, '--all'],

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 20, 2018

Contributor

It is pretty inconvenient to have to open the full console log just to see the summary of the test results. Could we either not pass --all or print the summary after the list of test result files?
Ideally I'd like to have both (not passing all and displaying summary after rather than before) as this provides us the same behavior as ament that I find more convenient

This comment has been minimized.

This comment has been minimized.

@dirk-thomas

dirk-thomas Apr 20, 2018

Member

Moving the summary below the individual results sounds good to me.

Imo the list of all results is still valuable. There is no other easy way to see which tests ran otherwise. Therefore I would prefer keeping --all.

If you think moving the summary below isn't sufficient for readability we could consider to print the test results once with --all (for the complete list) and once without (for a brief list of results with errors/failures).

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 20, 2018

Contributor

I guess it's pretty subjective. I find it more valuable to have a quick summary of what failed directly at the end of the job and to go to the Jenkins test interface if I want the full detail of tests (passing or failing).
But I guess your point is that the jenkins interface already provides an easy way to see failed tests separated from the ran tests? that makes the segregation in the console output less useful/redundant ?

I would be in favor of running the command twice then as it wouldn't impact CI time much and provide easily redable information as (as a developer) I care more often about what failed than what has been ran.

This comment has been minimized.

@dirk-thomas
@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 20, 2018

Rebased to fix conflicts with the recently merged PR.

if sys.platform != 'win32':
colcon_script = os.path.join(venv_path, 'bin', 'colcon')
else:
colcon_script = 'c:\\python36\\Scripts\\colcon.exe'

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 21, 2018

Contributor

Does this mean that we will now need to update the ci scripts on new python minor version?
Could we infer that from the environment ?

This comment has been minimized.

@dirk-thomas

dirk-thomas Apr 21, 2018

Member

If that can be done automatically even better.

This comment has been minimized.

@dirk-thomas

dirk-thomas Apr 21, 2018

Member

Fixed in 4fe9cfd.

rmdir /S /Q ws workspace
echo "# BEGIN SECTION: Determine arguments"
set "PATH=%PATH:"=%"
set "PATH=!PATH:"=!"

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 21, 2018

Contributor

Question: Can you clarify why we need to use ! everywhere now ? and how this plays with the delayed expansion?

Thanks!

This comment has been minimized.

@dirk-thomas

dirk-thomas Apr 21, 2018

Member

Using % didn't work for me. I didn't spend too much effort trying to figure out why. Using ! made it work.

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 21, 2018

Contributor

ok, I was wondering if this didnt work before you enabled the delayed expansion but would have worked after enabling it.

That's fine, I'm not planning on looking into it to find out

@@ -24,33 +24,37 @@
def build_and_test_and_package(args, job):
print('# BEGIN SUBSECTION: ament build')
print('# BEGIN SUBSECTION: build')

This comment has been minimized.

@mikaelarguedas

mikaelarguedas Apr 21, 2018

Contributor

this is "BEGIN SUBSECTION: colcon build" in the other file. I suggest homogenizing by just using "build", "test" and "test-result" everywhere

This comment has been minimized.

@dirk-thomas
@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Apr 21, 2018

This looks good to me now. 👍

Not sure if we prefer to roll this out before the weekend to have the time to see all jobs types running with it before Monday or if on the opposite we want to be around to monitor the jobs after deploying this

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 21, 2018

I am planning on merging this in an hour so that the nightlies can run and there is time over the weekend to fix minor issues. Imo that is preferred over potentially disturbing everyone during the week. If the problems are too big (or my time too short) I will just revert it before the nightlies Sunday evening and will try again some time later... Let me know if you have any objections.

@dirk-thomas dirk-thomas merged commit 72fdf47 into master Apr 21, 2018

@dirk-thomas dirk-thomas deleted the colcon branch Apr 21, 2018

@dirk-thomas dirk-thomas removed the in review label Apr 21, 2018

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 21, 2018

The first round of manually triggered builds (after reconfiguring the jobs) look promising:

CI builds:

  • Linux Build Status
  • Linux aarch Build Status
  • macOS Build Status
  • Windows Build Status

Packaging builds:

  • Linux Build Status
  • Linux aarch Build Status
  • macOS Build Status
  • Windows Build Status

@dirk-thomas dirk-thomas referenced this pull request Apr 23, 2018

Merged

remove ament_tools #486

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 25, 2018

To summarize the current state:

  • all CI builds are green
  • all packaging builds are green
  • the nightly builds:
    • all release builds are green
    • the debug builds:
      • Linux, Linux aarch are green
      • macOS: the 26 test failures are all related to rviz rendering and plugins
      • Windows: the 4 test failures are from the test_showimage timing out
    • the repeated builds:
      • Linux: the 20 test failiures are composition tests, test_parameter_server_cpp, test_tutorial_parameter_events
      • Linux aarch: the 8 test failures are composition tests
      • macOS: the 10 test failures are composition, test_services_cpp, rclcpp.TestNode.now
      • Windows: the 2 test failures are from the test_parameter_server_cpp

Neither of the failing tests strikes me as likely being related to this patch.

@dhood

This comment has been minimized.

Member

dhood commented Apr 25, 2018

If I may add my perspective on the current state:

The composition tests have always been a bit flaky with fastrtps, but since April 21 they started failing regularly again, including with connext. Perhaps because of the same cause as ros2/build_cop#114 (whatever it is). It may have been caused by something unrelated to colcon and the timing was just a coincidence, but AFAICS colcon hasn't been ruled out yet.

The packaging jobs are green but the output from them is not yet usable (fix ready in #158 but overlays not tested yet). Users of nightlies also haven't been informed of the switch that has occurred.

@mikaelarguedas

This comment has been minimized.

Contributor

mikaelarguedas commented Apr 25, 2018

Few additions that may help us segregate the failures:

It indeed looks like the various fixes in testing and packaging yesterday reduced the number of failures

macOS: the 26 test failures are all related to rviz rendering and plugins

👍 from today's meeting, @nuclearsandwich is about to disable them

Windows: the 4 test failures are from the test_showimage timing out

👍 not related to this patch, these failures are due to OpenCV libraries not being installed properly on the build machines:
see #146 (comment) and ros2/build_cop#113 for details

all packaging builds are green

Packaging jobs are green, but the packaging artifacts are still unusable ATM. @dhood has been investigating and @mjcarroll @sloretz are testing the ones from #158 to see if the resulting workspaces can be source and used as underlays

@dirk-thomas

This comment has been minimized.

Member

dirk-thomas commented Apr 25, 2018

Users of nightlies also haven't been informed of the switch that has occurred.

I couldn't find anywhere in the docs where we point users to the nightly packaging jobs. Do you have a pointer for me? I am more than happy to add a note for the changed script name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment