
Explore ways to maximize coverage while minimizing cost of the CI test matrix #5

Open
vyasr opened this issue Dec 13, 2023 · 7 comments


@vyasr
Contributor

vyasr commented Dec 13, 2023

There are various changes we could make to our CI matrix to improve test coverage of critical components while reducing the overall load. Some possible improvements that have been suggested at various times include:

  • Reduce the frequency of testing multiple Python versions to nightly-only: We very rarely encounter cases where our CI tests uncover a bug that is only present in one Python version. While it is possible, we don't generally use cutting-edge language or standard library features that require a very new version of Python. We could choose to test only the oldest supported Python version in PR CI and run the remaining versions only in nightlies.
  • Don't test arm builds all the time: Architecture-specific bugs are similarly rare, and we probably don't need to test both x86 and arm on every PR as well as in every nightly run.
  • Better testing of CUDA drivers: Currently we only test the oldest and newest supported drivers between both PR and nightly CI. However, for use cases like JIT-compiled kernels we may, at minimum, want to test the oldest and newest supported drivers for each major version of CUDA that we support.
  • Reduce duplication between wheel and conda testing: We currently run a matrix of both wheel and conda tests. We may be able to use a subset of those as long as we get coverage of both wheels and conda (and coverage of all the other pieces -- Python version, CUDA version, etc. -- between the wheel and conda tests).
@jameslamb
Member

Capturing some miscellaneous thoughts from interacting with the matrices on rapidsai/shared-workflows#166.

In general, I think whatever decisions we make about the support matrix should be:

  • encouraged via shared variables in configuration
  • enforced in shared-workflows CI

idea 1: matrices in workflow configs should make more use of shared variables

For example, it seems that we only do conda builds with one Python version, want to cover one minor version per CUDA major version, and want to cover the previous and current CUDA major versions (code link).

Today that looks like:

export MATRIX="
- { CUDA_VER: '11.8.0', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
- { CUDA_VER: '11.8.0', LINUX_VER: 'ubuntu22.04', ARCH: 'arm64', PY_VER: '3.10' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'arm64', PY_VER: '3.10' }
"

In my opinion, the intention would be clearer like this:

BUILD_OS="ubuntu22.04"
CUDA_PREVIOUS="11.8.0"
CUDA_CURRENT="12.0.1"
PYTHON_VERSION="3.10"

export MATRIX="
- { CUDA_VER: '${CUDA_PREVIOUS}', LINUX_VER: '${BUILD_OS}', ARCH: 'amd64', PY_VER: '${PYTHON_VERSION}' }
- { CUDA_VER: '${CUDA_PREVIOUS}', LINUX_VER: '${BUILD_OS}', ARCH: 'arm64', PY_VER: '${PYTHON_VERSION}' }
- { CUDA_VER: '${CUDA_CURRENT}',  LINUX_VER: '${BUILD_OS}', ARCH: 'amd64', PY_VER: '${PYTHON_VERSION}' }
- { CUDA_VER: '${CUDA_CURRENT}',  LINUX_VER: '${BUILD_OS}', ARCH: 'arm64', PY_VER: '${PYTHON_VERSION}' }
"

That'd make the patterns more obvious and reduce the diffs for changes like updating CUDA or Python version.

If the post-processing of those variables that's already done with yq / jq (code link) also resolved duplicates, then for other matrices you also wouldn't have to think about the difference between supporting only one minor version of CUDA (like 12.0.1) and supporting two (like 12.0.1 and 12.2.2).
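
As a rough illustration of that idea (this is not the actual shared-workflows post-processing; the yq/jq invocation and version values below are assumptions), rendering the matrix to JSON and piping it through jq's built-in unique would collapse duplicate rows:

# Hypothetical sketch: if CUDA_PREVIOUS and CUDA_CURRENT ever resolve to the
# same version, the rendered matrix contains identical rows; `unique` drops them.
export MATRIX="
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10' }
"

# yq (mikefarah v4) converts the YAML to JSON; jq deduplicates the array.
echo "${MATRIX}" | yq -o=json '.' | jq -c 'unique'
# -> [{"CUDA_VER":"12.0.1","LINUX_VER":"ubuntu22.04","ARCH":"amd64","PY_VER":"3.10"}]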


idea 2: desirable properties of matrices should be enforced in CI

Here are some constraints I can imagine being desirable:

  • "no identical matrix configurations"
  • "no more than n total arm64 jobs"
  • "at least 1 job for each of the operating systems we support"
  • "at least 1 test job on PRs per CUDA {major}.{minor} that we support

Instead of relying on code comments or convention, I think it's worth considering whether those constraints could be enforced in shared-workflows CI.

I'm imagining here a little script that reads in the matrix configuration, renders the full matrices, and then asserts all the conditions we want to be true and raises a big loud error if any are violated.
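
A minimal sketch of what such a check could look like, assuming the matrix has already been rendered to a JSON array (the thresholds and OS list below are placeholder assumptions, not an existing shared-workflows convention):

# Hypothetical validator: RENDERED holds the fully rendered matrix as JSON.
RENDERED="$(echo "${MATRIX}" | yq -o=json '.')"

# "no identical matrix configurations"
if [ "$(echo "${RENDERED}" | jq 'length')" != "$(echo "${RENDERED}" | jq 'unique | length')" ]; then
  echo "ERROR: duplicate matrix entries found" >&2
  exit 1
fi

# "no more than n total arm64 jobs" (n = 2 chosen arbitrarily for this sketch)
if [ "$(echo "${RENDERED}" | jq '[.[] | select(.ARCH == "arm64")] | length')" -gt 2 ]; then
  echo "ERROR: arm64 job count exceeds budget" >&2
  exit 1
fi

# "at least 1 job for each of the operating systems we support" (example OS list)
for os in ubuntu22.04 rockylinux8; do
  if [ "$(echo "${RENDERED}" | jq --arg os "${os}" '[.[] | select(.LINUX_VER == $os)] | length')" -eq 0 ]; then
    echo "ERROR: no jobs for ${os}" >&2
    exit 1
  fi
done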

@jameslamb
Member

One other passing thought: right now all wheels are tested against only the latest driver supported on the NVIDIA-hosted runners.

https://github.com/rapidsai/shared-workflows/blob/91799a905608f4c57d7fe65b92ce9a261249e635/.github/workflows/wheels-test.yaml#L69-L80

We should consider how much, if any, coverage of older drivers we want when testing wheels.
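
If we do want some older-driver coverage, one way it could be expressed (purely a sketch; the DRIVER key, values, and matrix name here are assumptions rather than the current wheels-test.yaml schema) is an explicit driver axis on a small subset of wheel test jobs:

# Hypothetical wheel test matrix with a driver dimension; only one job is
# pinned to the earliest driver to keep pressure off the limited runner pool.
export WHEEL_TEST_MATRIX="
- { CUDA_VER: '12.0.1', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.10', DRIVER: 'latest' }
- { CUDA_VER: '11.8.0', LINUX_VER: 'ubuntu20.04', ARCH: 'amd64', PY_VER: '3.10', DRIVER: 'earliest' }
"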

@vyasr
Contributor Author

vyasr commented Jan 11, 2024

Instead of relying on code comments or convention, I think it's worth considering whether those constraints could be enforced in shared-workflows CI.
I'm imagining here a little script that reads in the matrix configuration, renders the full matrices, and then asserts all the conditions we want to be true and raises a big loud error if any are violated.

I really like this idea in theory. I don't know how well it'll work in practice, but I definitely think it's worth experimenting with.

@jameslamb
Member

jameslamb commented Feb 23, 2024

From discussion in rapidsai/shared-workflows#176 (comment), adding another example of a constraint that could be enforced automatically:

"the nightly matrix should be a strict superset of the PR matrix"

@bdice
Copy link

bdice commented Mar 1, 2024

I discussed this with @vyasr and @ajschmidt8 in the context of rapidsai/shared-workflows#184 and rapidsai/cudf#15201. We are planning to implement this:

(Assume latest driver unless otherwise noted)

24.04 PR jobs

conda

amd64, CUDA 12 (newest Ubuntu, newest Python)
amd64, CUDA 11 (CentOS 7, oldest Python, earliest driver)
arm64, CUDA 12 (Rocky 8, middle Python)

wheel

amd64, CUDA 12 (Rocky 8, oldest Python)
arm64, CUDA 11 (oldest Ubuntu, newest Python)

24.06 PR jobs

Note that we drop CentOS 7, and "oldest Ubuntu" will change from 18.04 to 20.04 here. See #23.
(Differences from 24.04: the amd64 CUDA 11 conda job moves from CentOS 7 to Rocky 8, and the arm64 CUDA 12 conda job moves from Rocky 8 to the oldest Ubuntu.)

conda

amd64, CUDA 12 (newest Ubuntu, newest Python)
amd64, CUDA 11 (Rocky 8, oldest Python, earliest driver)
arm64, CUDA 12 (oldest Ubuntu, middle Python)

wheel

amd64, CUDA 12 (Rocky 8, oldest Python)
arm64, CUDA 11 (oldest Ubuntu, newest Python)

Reasoning

We came to this set of entries from the following steps:

  1. Our most important coverage is across architecture and CUDA major version. First, spread those across conda/wheel builds.
  • We're mixing amd64/arm64 and CUDA 11/12. We can "swap" these across conda and wheel builds to get more coverage while keeping the number of jobs constant. That yielded four jobs: conda amd64 + CUDA 11, conda arm64 + CUDA 12, wheel amd64 + CUDA 12, wheel arm64 + CUDA 11.
  2. We added a fifth job conda amd64 + CUDA 12 because this is what most developers are using.
  • Having more arch/CUDA coverage in conda rather than wheel is helpful because we run into packaging issues more frequently with conda.
  • As a side benefit, we can then use conda amd64 + CUDA 11 to cover the earliest driver (all other jobs use latest drivers due to limited runners with the earliest supported driver).
  3. Then, we scatter OS versions, Python versions, and CUDA minor versions across this matrix of five jobs.
  • We have two conda amd64 jobs. One has "newest everything" (CUDA, OS, Python, driver) and the other has "oldest everything". This tests the "endpoints" of our matrix (which gives us more certainty about coverage).
  • We have one conda arm64 job, and we use values not covered by the conda amd64 jobs. This means Rocky 8, some "middle" version of Python (currently 3.10, as we support 3.9-3.11), and another minor version of CUDA (like 12.0 instead of 12.2).
  • For wheels, we try to test other "corners" of the support matrix.
  • wheel amd64 + CUDA 12 uses Rocky 8, which is not covered by the other conda amd64 jobs, and will use the oldest Python with the newest CUDA.
  • wheel arm64 + CUDA 11 uses the oldest supported Ubuntu with the newest Python.
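
For concreteness, the 24.04 conda PR jobs above might look roughly like this in the MATRIX notation used earlier in the thread (the CONDA_PR_MATRIX name and the specific version strings are illustrative guesses, and the earliest-driver requirement on the CUDA 11 job isn't expressed here):

export CONDA_PR_MATRIX="
- { CUDA_VER: '12.2.2', LINUX_VER: 'ubuntu22.04', ARCH: 'amd64', PY_VER: '3.11' }
- { CUDA_VER: '11.8.0', LINUX_VER: 'centos7',     ARCH: 'amd64', PY_VER: '3.9' }
- { CUDA_VER: '12.0.1', LINUX_VER: 'rockylinux8', ARCH: 'arm64', PY_VER: '3.10' }
"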

@bdice

bdice commented Mar 25, 2024

@vyasr There are still some potential areas for exploration here but nothing I think is important in the short term. It seems like our recent changes to the CI matrix have improved things a lot. Would you want to identify any action items here, or should we consider closing this?

@vyasr
Contributor Author

vyasr commented Mar 27, 2024

Since we're about to add back full wheel arm runs, let's see how things look after that. I suspect that we might be partially back in a state where we will benefit from some of the changes in rapidsai/cudf#15201.
