
[rayci] Upload pipeline in batches when job count exceeds Buildkite limit #483

Merged
andrew-anyscale merged 1 commit into main from andrew/revup/main/upload-batches
Apr 8, 2026

Conversation

@andrew-anyscale
Contributor

Buildkite rejects pipeline uploads with more than 500 jobs:

> buildkite-agent: fatal: Failed to upload and process pipeline: Pipeline upload rejected: The number of jobs in this upload exceeds your organization limit of 500. Please break the upload into batches below this limit, or contact support to discuss an increase

When the pipeline exceeds this limit (counting parallelism-expanded jobs), split it into batches by group and upload each batch separately. If any single group exceeds 500 jobs, error with a message to split the group.

Topic: upload-batches
Signed-off-by: andrew <andrew@anyscale.com>
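
The batching rule in the description can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the `group` and `step` types, `jobCount`, `splitIntoBatches`, and the `buildkiteJobLimit` constant are hypothetical names, and the sketch assumes a step with no `parallelism` setting counts as one job.

```go
package main

import "fmt"

// buildkiteJobLimit mirrors the organization limit quoted in the error above.
const buildkiteJobLimit = 500

// step is a hypothetical stand-in for a command step.
type step struct {
	Label       string
	Parallelism int // 0 or 1 means the step expands to a single job
}

// group is a hypothetical stand-in for a group of steps; it is never split.
type group struct {
	Name  string
	Steps []step
}

// jobCount returns the number of jobs in a group after parallelism expansion.
func jobCount(g group) int {
	n := 0
	for _, s := range g.Steps {
		if s.Parallelism > 1 {
			n += s.Parallelism
		} else {
			n++
		}
	}
	return n
}

// splitIntoBatches greedily packs groups, in order, into batches whose
// expanded job count stays at or below limit. A single group that exceeds
// the limit is an error, because groups are kept whole.
func splitIntoBatches(groups []group, limit int) ([][]group, error) {
	var batches [][]group
	var cur []group
	curJobs := 0
	for _, g := range groups {
		n := jobCount(g)
		if n > limit {
			return nil, fmt.Errorf(
				"group %q has %d jobs, above the limit of %d; split the group",
				g.Name, n, limit,
			)
		}
		// Close the current batch if adding this group would overflow it.
		if len(cur) > 0 && curJobs+n > limit {
			batches = append(batches, cur)
			cur = nil
			curJobs = 0
		}
		cur = append(cur, g)
		curJobs += n
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches, nil
}
```

Packing greedily in the original order keeps earlier groups in earlier uploads; whether the merged code uses the same packing strategy is not stated in this conversation.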

@andrew-anyscale
Contributor Author

Reviews in this chain:
#483 [rayci] Upload pipeline in batches when job count exceeds Buildkite limit

@andrew-anyscale
Contributor Author

andrew-anyscale commented Apr 7, 2026

| # | head | base | diff | date | summary |
|---|------|------|------|------|---------|
| 0 | db2a0e7f | 18233682 | diff | Apr 7 10:55 AM | 3 files changed, 233 insertions(+), 3 deletions(-) |
| 1 | b5817af4 | 18233682 | diff | Apr 7 12:34 PM | 1 file changed, 3 insertions(+), 9 deletions(-) |
| 2 | 89784b75 | 18233682 | diff | Apr 7 12:38 PM | 1 file changed, 15 insertions(+), 25 deletions(-) |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements pipeline batching for Buildkite uploads to handle job limits. It adds functionality to calculate total jobs (including parallelism) and split pipelines into batches, ensuring the 'Notify' configuration is preserved in the first batch. Comprehensive tests were added to verify batching behavior. The reviewer suggested simplifying the implementation in main.go by always using the batching logic to avoid code duplication between the single-batch and multi-batch cases.
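
As a rough illustration of the two points in this summary, the sketch below builds one upload per batch, attaches the notify configuration only to the first, and treats a pipeline that fits in one upload as the one-batch case of the same loop. The `pipeline` type and `batchPipelines` function are hypothetical and do not reflect the actual types in raycicmd/main.go.

```go
package main

// pipeline is a hypothetical, simplified view of one uploadable pipeline.
type pipeline struct {
	Notify []string // notification targets; kept only on the first batch
	Groups []string // group names included in this upload
}

// batchPipelines turns batches of groups into one pipeline per batch.
// Only the first pipeline carries the notify configuration, so it is
// registered once no matter how many uploads follow. A single batch goes
// through the same path, so no separate non-batched branch is needed.
func batchPipelines(notify []string, batches [][]string) []pipeline {
	out := make([]pipeline, 0, len(batches))
	for i, b := range batches {
		p := pipeline{Groups: b}
		if i == 0 {
			p.Notify = notify
		}
		out = append(out, p)
	}
	return out
}
```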

Comment thread raycicmd/main.go Outdated
andrew-anyscale force-pushed the andrew/revup/main/upload-batches branch from db2a0e7 to b5817af on April 7, 2026 19:34
andrew-anyscale force-pushed the andrew/revup/main/upload-batches branch from b5817af to 89784b7 on April 7, 2026 19:38
andrew-anyscale requested a review from aslonnie on April 7, 2026 20:01
andrew-anyscale merged commit e6bb80d into main on Apr 8, 2026
2 checks passed
andrew-anyscale deleted the andrew/revup/main/upload-batches branch on April 8, 2026 15:15
aslonnie pushed a commit to ray-project/ray that referenced this pull request Apr 17, 2026
…ob limit (#62736)

Buildkite rejects pipeline uploads above an organization-level job limit
(500 at time of writing) with "Pipeline upload rejected: The number of
jobs in this upload exceeds your organization limit of 500." The release
pipeline's release_tests.json has grown past that; the previous "step
dependencies not found" failure had been masking it.

custom_image_build_and_test_init now splits the computed steps into
batches of at most --max-jobs-per-upload jobs (default 450 for headroom)
and writes each batch to .buildkite/release/release_tests_<i>.json.
Groups are atomic — a single group that exceeds the limit raises,
matching the approach taken in rayci (ray-project/rayci#483, #484).
custom-image-build-and-test-init.sh iterates the chunks and uploads each
in order so dependencies between steps in different chunks still
resolve.

Signed-off-by: andrew <andrew@anyscale.com>
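
The chunk file naming and upload order described in this commit message can be sketched as below. The referenced change lives in ray-project/ray and is not shown in this conversation, so this Go sketch only illustrates the idea: `chunkPaths` and `defaultMaxJobsPerUpload` are hypothetical names, and zero-based numbering of `release_tests_<i>.json` is an assumption.

```go
package main

import (
	"fmt"
	"path"
)

// defaultMaxJobsPerUpload leaves headroom below the 500-job limit,
// matching the default of 450 mentioned in the commit message.
const defaultMaxJobsPerUpload = 450

// chunkPaths returns the file names for n batches, in upload order.
// Indexing from 0 is an assumption made for this sketch.
func chunkPaths(dir string, n int) []string {
	paths := make([]string, 0, n)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("release_tests_%d.json", i)
		paths = append(paths, path.Join(dir, name))
	}
	return paths
}
```

Uploading the files in index order means a step in a later chunk can depend on a step that an earlier chunk has already uploaded, which is the ordering property the commit message relies on.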
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…ob limit (ray-project#62736)
