
[ci] convert ml.rayci.yml matrix to array syntax #62802

Open
andrew-anyscale wants to merge 1 commit into andrew/revup/master/convert-array-core-tests from andrew/revup/master/convert-array-ml

andrew-anyscale (Contributor) commented Apr 20, 2026

Convert the three matrix build steps (`minbuild-ml`, `mlbuild-multipy`, `mlgpubuild-multipy`) and the three matrix-based test steps ("train v2 {{python}} tests", "train v2 gpu {{python}} tests", "tune {{python}} tests") from `matrix` to `array` syntax.

- The three newly array-based test steps depend on `mlbuild-multipy` and `mlgpubuild-multipy` with `($)`, since both sides share the `python` array key.
- For the same reason, the build steps refine their `oss-ci-base_*` dependencies from `(*)` / `(python=3.10)` to `($)`.
- Every remaining plain `depends_on` on `minbuild-ml`, `mlbuild-multipy`, or `mlgpubuild-multipy` becomes `(python=3.10)`, since every consuming step pins Python 3.10.
- The non-array `mllightning1gpubuild` step is unaffected.

Relative: convert-array-core-tests
Topic: convert-array-ml
Signed-off-by: andrew <andrew@anyscale.com>
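To make the dependency rules above concrete, here is a minimal Python model of how array steps and their `depends_on` selectors could resolve. It assumes that each array step expands into one job per combination of its array values, that `($)` matches upstream jobs on every array key both sides share, that `(python=3.10)` selects only upstream jobs with that value, and that `(*)` fans in every upstream job. The helpers `expand` and `resolve` and the Python version list are hypothetical illustrations, not rayci's implementation.

```python
from itertools import product

Job = dict[str, str]  # one expanded job: array key -> chosen value


def expand(array: dict[str, list[str]]) -> list[Job]:
    """Expand an array spec into one job per combination of values (Cartesian product)."""
    keys = sorted(array)
    return [dict(zip(keys, combo)) for combo in product(*(array[k] for k in keys))]


def resolve(selector: str, consumer: Job, upstream: list[Job]) -> list[Job]:
    """Return the upstream jobs that one consumer job waits on (assumed semantics)."""
    if selector == "*":
        # Fan in: wait on every upstream job.
        return list(upstream)
    if selector == "$":
        # Match on every array key both sides define, e.g. "python".
        return [
            u for u in upstream
            if all(u[k] == consumer[k] for k in u.keys() & consumer.keys())
        ]
    # Pinned subset, e.g. "python=3.10".
    key, _, value = selector.partition("=")
    return [u for u in upstream if u.get(key) == value]


# Hypothetical Python versions; the real list lives in ml.rayci.yml.
builds = expand({"python": ["3.10", "3.11", "3.12"]})
tests = expand({"python": ["3.10", "3.11", "3.12"], "worker_id": ["0", "1"]})

# A test job with python=3.11 waits only on the matching build under ($).
print(resolve("$", {"python": "3.11", "worker_id": "1"}, builds))  # [{'python': '3.11'}]
# A non-array consumer that pins py3.10 waits on a single build job.
print(resolve("python=3.10", {}, builds))  # [{'python': '3.10'}]
```

Under this model, `($)` keeps each test job tied to the build for its own Python version, while `(python=3.10)` lets a non-array consumer wait on one build job instead of all of them, as `(*)` would.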

andrew-anyscale (Contributor, Author) commented Apr 20, 2026

Reviews in this chain:
#62799 [ci] convert core.rayci.yml test steps to array and narrow subsets
 ├#62801 [ci] convert data.rayci.yml matrix to array syntax
 ├#62802 [ci] convert ml.rayci.yml matrix to array syntax
 ├#62803 [ci] convert serve.rayci.yml matrix to array syntax
 └#62804 [ci] narrow pre-existing (*) fan-ins to (python=3.10) subsets

andrew-anyscale (Contributor, Author) commented Apr 20, 2026

| # | head | base | diff | date | summary |
|---|------|------|------|------|---------|
| 0 | 77cb2907 | 0f2e8944 | diff | Apr 20 16:20 PM | 1 file changed, 55 insertions(+), 55 deletions(-) |
| 1 | 82c6e3b6 | 59e0122e | rebase | Apr 22 6:28 AM | 0 files changed |
| 2 | 0889fb47 | af9fd22e | rebase | Apr 22 6:28 AM | 0 files changed |
| 3 | 21e46f47 | 238bf958 | diff | Apr 22 10:18 AM | 1 file changed, 3 insertions(+) |
| 4 | 682590de | 7cdf7184 | rebase | Apr 23 7:03 AM | 0 files changed |
| 5 | bae65421 | f724ae4f | rebase | Apr 27 10:10 AM | 0 files changed |
| 6 | bc1d74e7 | 8cf2eef8 | rebase | Apr 28 11:32 AM | 0 files changed |
| 7 | f292ea35 | 7f9373b2 | diff | May 1 7:29 AM | 0 files changed |

gemini-code-assist (Bot) left a comment

Code Review

This pull request migrates the Buildkite configuration in `.buildkite/ml.rayci.yml` from `matrix` to `array` for several build and test steps, including updates to dependency references and template variables. A critical issue was identified in the sharding logic for the train v2 tests: the current implementation still uses Buildkite environment variables for worker identification instead of the newly defined `array.worker_id`, which will cause redundant test execution across the array jobs.

Comment on `.buildkite/ml.rayci.yml`:

```diff
 - bazel run //ci/ray_ci:test_in_docker -- //python/ray/train/... ml
   --workers "$${BUILDKITE_PARALLEL_JOB_COUNT}" --worker-id "$${BUILDKITE_PARALLEL_JOB}" --parallelism-per-worker 3
-  --python-version {{matrix.python}} --build-name mlbuild-py{{matrix.python}}
+  --python-version {{array.python}} --build-name mlbuild-py{{array.python}}
```
Priority: high

The command for the `train v2 {{array.python}} tests` step still shards with Buildkite environment variables (`$${BUILDKITE_PARALLEL_JOB_COUNT}` and `$${BUILDKITE_PARALLEL_JOB}`), even though the step has been converted to use an array for `worker_id`. Because `parallelism` is not set for this step, these variables will default to 1 and 0 respectively, so both jobs in the array run the same tests (worker 0).

To shard the tests correctly across the array, use the `{{array.worker_id}}` variable and a hard-coded total worker count (2), matching the implementation in the GPU test step at line 148.

        --workers 2 --worker-id {{array.worker_id}} --parallelism-per-worker 3
        --python-version {{array.python}} --build-name mlbuild-py{{array.python}}
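The reviewer's reasoning can be checked with a small Python model of test sharding. It assumes a round-robin split in which test *i* goes to worker `i % workers`; how `ci/ray_ci` actually partitions targets, and the effect of `--parallelism-per-worker`, are outside this sketch, and `shard` and the test names are hypothetical.

```python
def shard(tests: list[str], workers: int, worker_id: int) -> list[str]:
    """Return the tests one worker runs under a round-robin split (assumed scheme)."""
    if not 0 <= worker_id < workers:
        raise ValueError(f"worker_id {worker_id} is out of range for {workers} workers")
    return [t for i, t in enumerate(tests) if i % workers == worker_id]


all_tests = [f"test_{n}" for n in range(7)]  # hypothetical test targets

# Current command, per the review: without `parallelism`, both array jobs
# end up with --workers 1 --worker-id 0 and run the full, identical set.
current = [shard(all_tests, workers=1, worker_id=0) for _ in range(2)]

# Suggested command: --workers 2 --worker-id {{array.worker_id}}, i.e. 0 and 1.
suggested = [shard(all_tests, workers=2, worker_id=w) for w in (0, 1)]

print(current[0] == current[1])                # True: duplicated work
print(set(suggested[0]) & set(suggested[1]))   # set(): no overlap
```

In this model, the hard-coded `--workers 2` stays correct only while the array defines exactly the `worker_id` values 0 and 1: fewer values leave a shard unrun, and extra values fall outside the valid range.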

andrew-anyscale force-pushed the andrew/revup/master/convert-array-ml branch from 77cb290 to 82c6e3b on April 22, 2026 13:28
andrew-anyscale force-pushed the andrew/revup/master/convert-array-core-tests branch from 0f2e894 to 59e0122 on April 22, 2026 13:28
andrew-anyscale force-pushed the andrew/revup/master/convert-array-ml branch from 82c6e3b to 0889fb4 on April 22, 2026 13:28
andrew-anyscale force-pushed the andrew/revup/master/convert-array-core-tests branch from 59e0122 to af9fd22 on April 22, 2026 13:28
andrew-anyscale force-pushed the andrew/revup/master/convert-array-ml branch from 0889fb4 to 21e46f4 on April 22, 2026 17:18
andrew-anyscale force-pushed the andrew/revup/master/convert-array-core-tests branch 2 times, most recently from 238bf95 to 7cdf718 on April 23, 2026 14:03
andrew-anyscale force-pushed the andrew/revup/master/convert-array-ml branch 2 times, most recently from 682590d to bae6542 on April 27, 2026 17:10
andrew-anyscale force-pushed the andrew/revup/master/convert-array-core-tests branch from 7cdf718 to f724ae4 on April 27, 2026 17:10
andrew-anyscale force-pushed the andrew/revup/master/convert-array-ml branch from bae6542 to bc1d74e on April 28, 2026 18:32
andrew-anyscale force-pushed the andrew/revup/master/convert-array-ml branch from bc1d74e to f292ea3 on May 1, 2026 14:29
andrew-anyscale force-pushed the andrew/revup/master/convert-array-core-tests branch from 8cf2eef to 7f9373b on May 1, 2026 14:29
andrew-anyscale marked this pull request as ready for review May 1, 2026 14:29
ray-gardener (Bot) added the devprod label May 1, 2026