Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pace build optional CI #1460

Merged
merged 8 commits into from
Dec 12, 2023
Merged

Conversation

FlorianDeconinck
Copy link
Contributor

As climate models developed at NOAA and NASA are leveraging DaCe more and more for their performance backend, we have seen multiple occurrence of major/minor version breaking downstream.

This optional GitHub action is an attempt to reduce those breakage by allowing the DaCe ecosystem to pull on a vetted version of the Pace climate model and run a subset of the regression tests that should exercise enough DaCe to catch a good amount of errors.

NASA takes responsibility to keep this CI clean and working along. All non-DaCe issues should be pinged on @FlorianDeconinck.

All data and model are under open source licenses.

@FlorianDeconinck
Copy link
Contributor Author

@phschaad @BenWeber42

As discussed, here's an attempt at bringing a CI that exercise a fraction (Remapping) of the model.

This shouldn't be merged up until we all agree on the general principle and execution time.

@phschaad phschaad enabled auto-merge (squash) December 4, 2023 09:18
@phschaad
Copy link
Collaborator

phschaad commented Dec 4, 2023

Thank you for this nice CI bundle @FlorianDeconinck !
We have discussed this and have concluded that depending on the time it takes to run this CI workflow, it may be desirable to even make the workflow non-optional.

@FlorianDeconinck
Copy link
Contributor Author

If we think it'll be good to have it in the regular suite we can tailor what we extract from the model.

Basically you have 3 big one-rank test (in descending order of complexity):

  • Remapping (<- this is the one in the PR)
  • Tracer advection
  • D_SW

And then we have 2 bigger and more complex test that need 6 ranks (because of halo exchanges):

  • FVDynamics (basically the entire dynamical core)
  • Acoustics

FVDynamics is a bit of beast. The inner Acoustics loop would cover about 75% of the code, so it might be the best ratio code coverage/time.

On the other side, for a lighting fast test D_SW is a the cornerstone of the acoustics loop and often the buggier code.

My only concern is that already installing the entire py stack is not fast. I'll come back with some numbers.

@phschaad
Copy link
Collaborator

phschaad commented Dec 4, 2023

In that case I'll be interested to see what the numbers say for the Remapping test, which should help us decide if we keep it optional or make it a required test. In any case having it as part of the available actions will make a lot of sense.

As for the multi-rank tests we will have to see what the best strategy is to avoid using GH runners here. From what I understand it would already be a big help if we get the 3 single rank tests up and running asap and that would buy us some time to strategize about the dynamical core and acoustics tests?

@FlorianDeconinck
Copy link
Contributor Author

Agreed. You can force a mpirun --oversubscibe but depending on the quality of the hardware this might not be a real option...

I'd say we can get a decent code coverage with the single rank tests and as we implement more model code under DaCe we can also refine those to cover more patterns.

The multi-rank might not even be a good idea to do it DaCe side. After all, on our side for every major version of DaCe we need to run full coverage, including science testing. At that stage, something with a webhook from this repo to ours triggering a job on our HPC makes more sense. Let's call it future endeavor ;)

Copy link
Contributor

@BenWeber42 BenWeber42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice PR!

This shouldn't be merged up until we all agree on the general principle and execution time.

Agreed, we should clarify some minor aspects w.r.t. to this CI.

.github/workflows/pace-build-ci.yml Outdated Show resolved Hide resolved
.github/workflows/pace-build-ci.yml Outdated Show resolved Hide resolved
.github/workflows/pace-build-ci.yml Outdated Show resolved Hide resolved
@FlorianDeconinck
Copy link
Contributor Author

Okay here's a proposal based on timing on my box which is not that great.

Timing for dace:gpu backend runtime:

  • Remapping: 3mn37s
  • D_SW: 4mn38s
  • Riem_Solver_C: 0mn45s

Overall 9mn on top of the install of the stack, so potentially around 15mn for a pretty sturdy test.

What do you think?

@FlorianDeconinck
Copy link
Contributor Author

Alright @BenWeber42 / @phschaad

I've updated the script to run in order from faster to slower: RiemSolverC, D_SW and then Remapping.

The workflow is on workflow_dispatch for now. I propose if code is fine enough to merge this, so we can trigger it in GitHub and if timings are decent we can then PR a master/ci-fix trigger

@phschaad phschaad enabled auto-merge (squash) December 6, 2023 13:45
@phschaad
Copy link
Collaborator

phschaad commented Dec 6, 2023

This seems like a good strategy, yes - Thank you!

@BenWeber42 BenWeber42 removed their request for review December 11, 2023 14:09
Copy link
Contributor

@BenWeber42 BenWeber42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional CI!

@phschaad phschaad merged commit b6e1c9d into spcl:master Dec 12, 2023
11 checks passed
@FlorianDeconinck FlorianDeconinck deleted the pace_optional_ci branch December 12, 2023 14:56
github-merge-queue bot pushed a commit that referenced this pull request May 9, 2024
Follow up of #1460 

- [x] Fixed the `ci` script (including `git checkout issues` around
selecting the correct `dace`)
- [x] Move `D_SW` to execute only on rank 0 to avoid rebuild
- [x] Swapped Rieman Solver on C-grid for D-grid for better coverage

~~WARNING: this PR is blocked by #1477~~
~~WARNING: this PR is blocked by #1568~~

---------

Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants