[data] Adding in batch inference mock pipeline release test #52616

omatthew98 · 2025-04-25T20:51:02Z

Why are these changes needed?

We want to add a release test that models the performance of a batch inference image pipeline. It is meant to model a realistic user pipeline more accurately. We will use it to track our improvement on this workload and, more broadly, on batch inference with images, both with and without spot nodes.

Related issue number

Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

bveeramani · 2025-04-25T20:53:40Z

release/release_data_tests.yaml

(Optional, for future reference): we can use the variations syntax to minimize duplication

ray/release/release_data_tests.yaml

Lines 211 to 232 in 251a131

- name: distributed_training
  working_dir: nightly_tests

  cluster:
    byod:
      post_build_script: byod_install_mosaicml.sh
    cluster_compute: dataset/multi_node_train_16_workers.yaml

  run:
    timeout: 3600
    script: >
      python dataset/multi_node_train_benchmark.py --num-workers 16 --file-type parquet
      --target-worker-gb 50 --use-gpu

  variations:
    - __suffix__: regular
    - __suffix__: chaos
      run:
        prepare: >
          python setup_chaos.py --kill-interval 200 --max-to-kill 1 --task-names
          "_RayTrainWorker__execute.get_next"
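
For readers unfamiliar with the syntax, here is a minimal Python sketch of how a definition with variations could expand into one concrete test per suffix. It is an illustration under stated assumptions, not the release tooling's actual code: `deep_update` and `expand_variations` are hypothetical helpers written for this example, and the `<name>.<suffix>` naming scheme is assumed. The test definition itself is copied from the snippet above.

```python
import copy


def deep_update(base, override):
    """Recursively merge `override` into `base` in place and return `base`."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def expand_variations(test):
    """Return one concrete test definition per entry in `variations`.

    Hypothetical helper: each variation is overlaid on a copy of the shared
    definition, and its `__suffix__` is appended to the test name
    (the `name.suffix` format is an assumption).
    """
    variations = test.get("variations")
    if not variations:
        return [copy.deepcopy(test)]

    expanded = []
    for variation in variations:
        variation = copy.deepcopy(variation)  # leave the input untouched
        suffix = variation.pop("__suffix__")
        concrete = copy.deepcopy(test)
        del concrete["variations"]
        deep_update(concrete, variation)
        concrete["name"] = f"{test['name']}.{suffix}"
        expanded.append(concrete)
    return expanded


# The definition quoted in the review comment, as a plain dict.
base_test = {
    "name": "distributed_training",
    "working_dir": "nightly_tests",
    "run": {
        "timeout": 3600,
        "script": (
            "python dataset/multi_node_train_benchmark.py --num-workers 16 "
            "--file-type parquet --target-worker-gb 50 --use-gpu"
        ),
    },
    "variations": [
        {"__suffix__": "regular"},
        {
            "__suffix__": "chaos",
            "run": {
                "prepare": (
                    "python setup_chaos.py --kill-interval 200 --max-to-kill 1 "
                    '--task-names "_RayTrainWorker__execute.get_next"'
                )
            },
        },
    ],
}

for concrete_test in expand_variations(base_test):
    print(concrete_test["name"], sorted(concrete_test["run"]))
```

The point of the pattern is that shared settings (cluster, timeout, script) are written once, and each variation states only what differs, here the extra `prepare` step for the chaos run.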

omatthew98 · 2025-04-29T23:00:30Z

python/requirements/ml/rllib-test-requirements.txt

These changes are from @aslonnie's troubleshooting in #52687.

aslonnie

:) thank you

you might want to double check again if the release test works

Signed-off-by: Matthew Owen <mowen@anyscale.com>

## Why are these changes needed?

We want to add a release test that will be used to model performance on a batch inference image pipeline. This is meant to more accurately model a realistic user pipeline. This will be used to track our improvement on this workload and more broadly our improvement on batch inference with images with and without spot nodes.

## Related issue number

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Co-authored-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: jhsu <jhsu@anyscale.com>

omatthew98 requested a review from bveeramani April 25, 2025 20:51

bveeramani approved these changes Apr 25, 2025

omatthew98 requested a review from a team as a code owner April 29, 2025 20:28

omatthew98 commented Apr 29, 2025

aslonnie approved these changes Apr 30, 2025

omatthew98 added the "go" label (add ONLY when ready to merge, run all tests) Apr 30, 2025

omatthew98 added 6 commits April 30, 2025 16:24

adding in batch inference mock pipeline release test

ab1268f

Signed-off-by: Matthew Owen <mowen@anyscale.com>

adding in extra dep

b6a03bd

Signed-off-by: Matthew Owen <mowen@anyscale.com>

fix scale_factor

5294b8a

Signed-off-by: Matthew Owen <mowen@anyscale.com>

adding in byod install

fa7a314

Signed-off-by: Matthew Owen <mowen@anyscale.com>

adding in variations to clean up test

85999f2

Signed-off-by: Matthew Owen <mowen@anyscale.com>

chmod +x script

5cecb7f

Signed-off-by: Matthew Owen <mowen@anyscale.com>

omatthew98 force-pushed the mowen/add-batch-inference-mock-image-pipeline-test branch from 503d678 to 0cb639a April 30, 2025 23:24

omatthew98 and others added 4 commits April 30, 2025 16:33

nudging batch a bit smaller

95d6010

Signed-off-by: Matthew Owen <mowen@anyscale.com>

[deps] add albumentations

f7f3b16

Signed-off-by: Matthew Owen <mowen@anyscale.com>

clean up other approach in favor of lonnies

af7d1aa

Signed-off-by: Matthew Owen <mowen@anyscale.com>

adding in manual tag for now

09f8a7a

Signed-off-by: Matthew Owen <mowen@anyscale.com>

omatthew98 force-pushed the mowen/add-batch-inference-mock-image-pipeline-test branch from 0cb639a to 09f8a7a April 30, 2025 23:33

bveeramani enabled auto-merge (squash) April 30, 2025 23:49

bveeramani merged commit 56b7d73 into ray-project:master May 1, 2025
6 checks passed

hainesmichaelc added the community-backlog label May 22, 2025

4 participants
