
[ml-release][no-ci] fix torch tune serve test. #30095

Merged 1 commit into ray-project:master on Nov 9, 2022

Conversation

xwjiang2010 (Contributor)

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Why are these changes needed?

The test is currently missing a wait_for_nodes call, resulting in a timeout.
https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_Q4mHYcB1xS93WatGP7i9jSsy
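
For context, the fix gates the test on the cluster reaching its expected size before the workload starts, rather than letting the workload race ahead of node startup and run into the timeout. Below is a minimal sketch of what a wait_for_nodes-style helper can look like; the function name, polling interval, and expected node count are illustrative assumptions and may differ from the helper actually used by the release test.

```python
# Illustrative sketch only; the release test's actual wait_for_nodes helper
# may differ. This version polls ray.nodes() until the expected number of
# alive nodes has joined the cluster, or raises if the deadline passes.
import time

import ray


def wait_for_nodes(expected: int, timeout_s: float = 600.0) -> None:
    """Block until `expected` nodes report as alive in the Ray cluster."""
    deadline = time.monotonic() + timeout_s
    alive = []
    while time.monotonic() < deadline:
        alive = [node for node in ray.nodes() if node["Alive"]]
        if len(alive) >= expected:
            return
        time.sleep(5)  # poll interval chosen arbitrarily for this sketch
    raise TimeoutError(
        f"Expected {expected} nodes, but only {len(alive)} joined within "
        f"{timeout_s} seconds."
    )


if __name__ == "__main__":
    ray.init(address="auto")
    # Head node plus the two g3.8xlarge workers pinned in the cluster compute.
    wait_for_nodes(expected=3)
    # ...run the torch/tune/serve workload once the cluster is fully up...
```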

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

The test is currently missing a wait_for_nodes call, resulting in a timeout.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
@krfricke (Contributor) left a comment

I think the idea behind the golden notebook test is explicitly to test autoscaling. If this doesn't work, we should investigate why this is the case - is it node availability, a problem with the autoscaler, the product, or do we just have to increase the timeout?

cc @matthewdeng for context on golden notebook tests

@krfricke (Contributor) commented on Nov 8, 2022

Hm, though it does seem that the cluster compute config specifies min_workers:

worker_node_types:
    - name: worker_node
      instance_type: g3.8xlarge
      min_workers: 2
      max_workers: 2
      use_spot: true

@matthewdeng (Contributor)

I don't recall this explicitly intending to test autoscaling.

  1. This was originally added with a timeout of 1800 in [release] add golden notebook release test for torch/tune/serve #16619.
    • The wait_for_nodes utility may not have existed at that point in time.
  2. The timeout was decreased to 600 in [CI] Reduce cost of golden notebook release test #29609.

@krfricke (Contributor) left a comment

Thanks @matthewdeng for the context.

@krfricke krfricke merged commit c7d696e into ray-project:master Nov 9, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
The test is currently missing a wait_for_nodes call, resulting in a timeout.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@xwjiang2010 deleted the fix_torch_tune_serve_test branch on July 26, 2023 at 19:51
Labels: none yet
Projects: none yet
Linked issues that this pull request may close: none yet

3 participants