
Explore ways to maximize coverage while minimizing cost of the CI test matrix #5

Open
vyasr opened this issue Dec 13, 2023 · 7 comments


@vyasr
Contributor

vyasr commented Dec 13, 2023

There are various changes we could make to our CI matrix to improve test coverage of critical components while reducing the overall load. Some possible improvements that have been suggested at various times include:

  • Reduce the frequency of testing multiple Python versions to nightly-only: We very rarely encounter cases where our CI tests uncover a bug that is only present in one Python version. While it is possible, we don't generally use cutting-edge language or standard library features that require a very new version of Python. We could choose to test only the oldest supported Python version in PR CI and run the remaining versions only in nightlies.
  • Don't test arm builds all the time: Architecture-specific bugs are similarly rare, and we probably don't need to test both x86 and arm on every PR as well as in every nightly run.
  • Better testing of CUDA drivers: Currently we only test the oldest and newest supported drivers between both PR and nightly CI. However, for use cases like JIT-compiled kernels we may, at minimum, want to test the oldest and newest supported drivers for each major version of CUDA that we support.
  • Reduce duplication between wheel and conda testing: We currently run a matrix of both wheel and conda tests. We may be able to use a subset of those as long as we get coverage of both wheels and conda (and coverage of all the other pieces -- Python version, CUDA version, etc. -- between the wheel and conda tests).
@jameslamb
Member

Capturing some miscellaneous thoughts from interacting with the matrices on rapidsai/shared-workflows#166.

In general, I think whatever decisions we make about the support matrix should be:

  • encouraged via shared variables in configuration
  • enforced in shared-workflows CI

idea 1: matrices in workflow configs should make more use of shared variables

For example, it seems that we only do conda builds with one Python version, want to cover one minor version per CUDA major version, and want to cover the previous and current CUDA major versions (code link).

Today that looks like:

export MATRIX="
- { CUDA_VER: '11.8.0', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
- { CUDA_VER: '11.8.0', LINUX_VER: 'ubuntu22.04', ARCH: 'arm64', PY_VER: '3.10' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'arm64', PY_VER: '3.10' }
"

In my opinion, the intention would be clearer like this:

BUILD_OS="ubuntu22.04"
CUDA_PREVIOUS="11.8.0"
CUDA_CURRENT="12.0.1"
PYTHON_VERSION="3.10"

export MATRIX="
- { CUDA_VER: '${CUDA_PREVIOUS}', LINUX_VER: '${BUILD_OS}', ARCH: 'amd64', PY_VER: '${PYTHON_VERSION}' }
- { CUDA_VER: '${CUDA_PREVIOUS}', LINUX_VER: '${BUILD_OS}', ARCH: 'arm64', PY_VER: '${PYTHON_VERSION}' }
- { CUDA_VER: '${CUDA_CURRENT}',  LINUX_VER: '${BUILD_OS}', ARCH: 'amd64', PY_VER: '${PYTHON_VERSION}' }
- { CUDA_VER: '${CUDA_CURRENT}',  LINUX_VER: '${BUILD_OS}', ARCH: 'arm64', PY_VER: '${PYTHON_VERSION}' }
"

That'd make the patterns more obvious and reduce the diffs for changes like updating CUDA or Python version.

If the post-processing of those variables that's already done with yq / jq (code link) also resolved duplicates, then for other matrices you also wouldn't have to think about the difference between supporting only one minor version of CUDA (like 12.0.1) and supporting two (like 12.0.1 and 12.2.2).
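
As a rough illustration of that idea (this is not the actual shared-workflows post-processing; the yq/jq invocation and version values below are assumptions), rendering the matrix to JSON and piping it through jq's built-in unique would collapse duplicate rows:

# Hypothetical sketch: if CUDA_PREVIOUS and CUDA_CURRENT ever resolve to the
# same version, the rendered matrix contains identical rows; `unique` drops them.
export MATRIX="
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
"

# yq (mikefarah v4) converts the YAML to JSON; jq deduplicates the array.
echo "${MATRIX}" | yq -o=json '.' | jq -c 'unique'
# -> [{"CUDA_VER":"12.0.1","LINUX_VER":"ubuntu22.04","ARCH":"amd64","PY_VER":"3.10"}]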


idea 2: desirable properties of matrices should be enforced in CI

Here are some constraints I can imagine being desirable:

  • "no identical matrix configurations"
  • "no more than n total arm64 jobs"
  • "at least 1 job for each of the operating systems we support"
  • "at least 1 test job on PRs per CUDA {major}.{minor} that we support

Instead of relying on code comments or convention, I think it's worth considering whether those constraints could be enforced in shared-workflows CI.

I'm imagining here a little script that reads in the matrix configuration, renders the full matrices, and then asserts all the conditions we want to be true and raises a big loud error if any are violated.
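
A minimal sketch of what such a check could look like, assuming the matrix has already been rendered to a JSON array (the thresholds and OS list below are placeholder assumptions, not an existing shared-workflows convention):

# Hypothetical validator: RENDERED holds the fully rendered matrix as JSON.
RENDERED="$(echo "${MATRIX}" | yq -o=json '.')"

# "no identical matrix configurations"
if [ "$(echo "${RENDERED}" | jq 'length')" != "$(echo "${RENDERED}" | jq 'unique | length')" ]; then
  echo "ERROR: duplicate matrix entries found" >&2
  exit 1
fi

# "no more than n total arm64 jobs" (n = 2 chosen arbitrarily for this sketch)
if [ "$(echo "${RENDERED}" | jq '[.[] | select(.ARCH == "arm64")] | length')" -gt 2 ]; then
  echo "ERROR: arm64 job count exceeds budget" >&2
  exit 1
fi

# "at least 1 job for each of the operating systems we support" (example OS list)
for os in ubuntu22.04 rockylinux8; do
  if [ "$(echo "${RENDERED}" | jq --arg os "${os}" '[.[] | select(.LINUX_VER == $os)] | length')" -eq 0 ]; then
    echo "ERROR: no jobs for ${os}" >&2
    exit 1
  fi
done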

@jameslamb
Member

One other passing thought: right now all wheels are tested against only the latest driver supported on the NVIDIA-hosted runners.

https://github.com/rapidsai/shared-workflows/blob/91799a905608f4c57d7fe65b92ce9a261249e635/.github/workflows/wheels-test.yaml#L69-L80

We should consider how much, if any, coverage of older drivers we want when testing wheels.
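
If we do want some older-driver coverage, one way it could be expressed (purely a sketch; the DRIVER key, values, and matrix name here are assumptions rather than the current wheels-test.yaml schema) is an explicit driver axis on a small subset of wheel test jobs:

# Hypothetical wheel test matrix with a driver dimension; only one job is
# pinned to the earliest driver to keep pressure off the limited runner pool.
export WHEEL_TEST_MATRIX="
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10', DRIVER: 'latest' }
- { CUDA_VER: '11.8.0', LINUX_VER: 'ubuntu20.04', ARCH: 'amd64', PY_VER: '3.10', DRIVER: 'earliest' }
"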

@vyasr
Contributor Author

vyasr commented Jan 11, 2024

Instead of relying on code comments or convention, I think it's worth considering whether those constraints could be enforced in shared-workflows CI.
I'm imagining here a little script that reads in the matrix configuration, renders the full matrices, and then asserts all the conditions we want to be true and raises a big loud error if any are violated.

I really like this idea in theory. I don't know how well it'll work in practice, but I definitely think it's worth experimenting with.

@jameslamb
Member

jameslamb commented Feb 23, 2024

From discussion in rapidsai/shared-workflows#176 (comment), adding another example of a constraint that could be enforced automatically:

"the nightly matrix should be a strict superset of the PR matrix"

@bdice
Copy link

bdice commented Mar 1, 2024

I discussed this with @vyasr and @ajschmidt8 in the context of rapidsai/shared-workflows#184 and rapidsai/cudf#15201. We are planning to implement this:

(Assume latest driver unless otherwise noted)

24.04 PR jobs

conda

amd64, CUDA 12 (newest Ubuntu, newest Python)
amd64, CUDA 11 (CentOS 7, oldest Python, earliest driver)
arm64, CUDA 12 (Rocky 8, middle Python)

wheel

amd64, CUDA 12 (Rocky 8, oldest Python)
arm64, CUDA 11 (oldest Ubuntu, newest Python)

24.06 PR jobs

Note that we drop CentOS 7, and "oldest Ubuntu" will change from 18.04 to 20.04 here. See #23.
(Differences from 24.04: the amd64 CUDA 11 conda job moves from CentOS 7 to Rocky 8, and the arm64 CUDA 12 conda job moves from Rocky 8 to the oldest Ubuntu.)

conda

amd64, CUDA 12 (newest Ubuntu, newest Python)
amd64, CUDA 11 (Rocky 8, oldest Python, earliest driver)
arm64, CUDA 12 (oldest Ubuntu, middle Python)

wheel

amd64, CUDA 12 (Rocky 8, oldest Python)
arm64, CUDA 11 (oldest Ubuntu, newest Python)

Reasoning

We came to this set of entries from the following steps:

  1. Our most important coverage is across architecture and CUDA major version. First, spread those across conda/wheel builds.
  • We're mixing amd64/arm64 and CUDA 11/12. We can "swap" these across conda and wheel builds to get more coverage while keeping the number of jobs constant. That yielded four jobs: conda amd64 + CUDA 11, conda arm64 + CUDA 12, wheel amd64 + CUDA 12, wheel arm64 + CUDA 11.
  2. We added a fifth job conda amd64 + CUDA 12 because this is what most developers are using.
  • Having more arch/CUDA coverage in conda rather than wheel is helpful because we run into packaging issues more frequently with conda.
  • As a side benefit, we can then use conda amd64 + CUDA 11 to cover the earliest driver (all other jobs use latest drivers due to limited runners with the earliest supported driver).
  3. Then, we scatter OS versions, Python versions, and CUDA minor versions across this matrix of five jobs.
  • We have two conda amd64 jobs. One has "newest everything" (CUDA, OS, Python, driver) and the other has "oldest everything". This tests the "endpoints" of our matrix (which gives us more certainty about coverage).
  • We have one conda arm64 job, and we use values not covered by the conda amd64 jobs. This means Rocky 8, some "middle" version of Python (currently 3.10, as we support 3.9-3.11), and another minor version of CUDA (like 12.0 instead of 12.2).
  • For wheels, we try to test other "corners" of the support matrix.
  • wheel amd64 + CUDA 12 uses Rocky 8, which is not covered by the other conda amd64 jobs, and will use the oldest Python with the newest CUDA.
  • wheel arm64 + CUDA 11 uses the oldest supported Ubuntu with the newest Python.
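
For concreteness, the 24.04 conda PR jobs above might look roughly like this in the MATRIX notation used earlier in the thread (the CONDA_PR_MATRIX name and the specific version strings are illustrative guesses, and the earliest-driver requirement on the CUDA 11 job isn't expressed here):

export CONDA_PR_MATRIX="
- { CUDA_VER: '12.2.2', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.11' }
- { CUDA_VER: '11.8.0', LINUX_VER: 'centos7',     ARCH: 'amd64', PY_VER: '3.9' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'rockylinux8', ARCH: 'arm64', PY_VER: '3.10' }
"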

@bdice

bdice commented Mar 25, 2024

@vyasr There are still some potential areas for exploration here but nothing I think is important in the short term. It seems like our recent changes to the CI matrix have improved things a lot. Would you want to identify any action items here, or should we consider closing this?

@vyasr
Contributor Author

vyasr commented Mar 27, 2024

Since we're about to add back full wheel arm runs, let's see how things look after that. I suspect that we might be partially back in a state where we will benefit from some of the changes in rapidsai/cudf#15201.
