Skip to content

Conversation

@omatthew98
Copy link
Contributor

Why are these changes needed?

We want to add a release test that will be used to model performance on a batch inference image pipeline. This is meant to more accurately model a realistic user pipeline. This will be used to track our improvement on this workload and more broadly our improvement on batch inference with images with and without spot nodes.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@omatthew98 omatthew98 requested a review from bveeramani April 25, 2025 20:51
Comment on lines 480 to 498
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Optionally for future reference): we can use the variants syntax to minimize duplication

- name: distributed_training
working_dir: nightly_tests
cluster:
byod:
post_build_script: byod_install_mosaicml.sh
cluster_compute: dataset/multi_node_train_16_workers.yaml
run:
timeout: 3600
script: >
python dataset/multi_node_train_benchmark.py --num-workers 16 --file-type parquet
--target-worker-gb 50 --use-gpu
variations:
- __suffix__: regular
- __suffix__: chaos
run:
prepare: >
python setup_chaos.py --kill-interval 200 --max-to-kill 1 --task-names
"_RayTrainWorker__execute.get_next"

@omatthew98 omatthew98 requested a review from a team as a code owner April 29, 2025 20:28
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are from @aslonnie's trouble shooting here #52687.

Copy link
Collaborator

@aslonnie aslonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) thank you

you might want to double check again if the release test works

@omatthew98 omatthew98 added the go add ONLY when ready to merge, run all tests label Apr 30, 2025
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@omatthew98 omatthew98 force-pushed the mowen/add-batch-inference-mock-image-pipeline-test branch from 503d678 to 0cb639a Compare April 30, 2025 23:24
omatthew98 and others added 4 commits April 30, 2025 16:33
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@omatthew98 omatthew98 force-pushed the mowen/add-batch-inference-mock-image-pipeline-test branch from 0cb639a to 09f8a7a Compare April 30, 2025 23:33
@bveeramani bveeramani enabled auto-merge (squash) April 30, 2025 23:49
@bveeramani bveeramani merged commit 56b7d73 into ray-project:master May 1, 2025
6 checks passed
iamjustinhsu pushed a commit that referenced this pull request May 3, 2025
## Why are these changes needed?
We want to add a release test that will be used to model performance on
a batch inference image pipeline. This is meant to more accurately model
a realistic user pipeline. This will be used to track our improvement on
this workload and more broadly our improvement on batch inference with
images with and without spot nodes.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Co-authored-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: jhsu <jhsu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-backlog go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants