[ci] Migrate serve and runtime_env compute configs to new schema by sai-miduthuri · Pull Request #62876 · ray-project/ray

sai-miduthuri · 2026-04-23T06:48:26Z

Summary

Migrates 12 Anyscale compute config files from the legacy schema to the new SDK 2026 schema, and adds anyscale_sdk_2026: true to the 10 corresponding tests in release_tests.yaml.

Compute configs migrated (12 files)

Serve tests (release/serve_tests/):

compute_tpl_32_cpu.yaml / compute_tpl_32_cpu_gce.yaml
compute_tpl_8_cpu_autoscaling.yaml
compute_tpl_gpu_node.yaml / compute_tpl_gpu_node_gce.yaml
compute_tpl_single_node_16_cpu.yaml
compute_tpl_single_node_32_cpu.yaml
compute_tpl_single_node_gce.yaml

Runtime env tests (release/runtime_env_tests/):

rte_small.yaml / rte_gce_small.yaml
rte_minimal.yaml / rte_gce_minimal.yaml

Tests updated in release_tests.yaml (10 tests)

pytest_serve_scale_replicas → compute_tpl_single_node_32_cpu.yaml
pytest_serve_multi_deployment_1k_noop_replica → compute_tpl_32_cpu.yaml (+ GCE variation on compute_tpl_32_cpu_gce.yaml)
pytest_serve_autoscaling_load_test → compute_tpl_single_node_32_cpu.yaml
serve_controller_benchmark_haproxy → compute_tpl_8_cpu_autoscaling.yaml (+ GCE on compute_tpl_single_node_gce.yaml)
pytest_serve_microbenchmarks → compute_tpl_single_node_16_cpu.yaml (+ GCE)
pytest_serve_throughput_optimized_microbenchmarks → compute_tpl_single_node_16_cpu.yaml (+ GCE)
pytest_serve_router_benchmark → compute_tpl_8_cpu_autoscaling.yaml (shares the config with serve_controller_benchmark_haproxy — must be flipped together)
pytest_serve_resnet_benchmark → compute_tpl_gpu_node.yaml (+ GCE on compute_tpl_gpu_node_gce.yaml)
runtime_env_rte_many_tasks_actors → rte_small.yaml (+ GCE on rte_gce_small.yaml)
runtime_env_wheel_urls → rte_minimal.yaml (+ GCE on rte_gce_minimal.yaml)

All 10 tests have their anyscale_sdk_2026: true on the base cluster: block; GCE variations inherit via deep_update. None of these tests have a kuberay variation.
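
To illustrate how a GCE variation can pick up the flag from the base `cluster:` block, here is a minimal sketch of a recursive dictionary merge. The `deep_update` function below is a hypothetical stand-in written for this example, not the release tooling's actual implementation, and the `cluster_compute` key layout is assumed for illustration.

```python
import copy


def deep_update(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` merged in recursively.

    Hypothetical stand-in for the release tooling's helper of the same name.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Base test entry: the flag lives on the base `cluster:` block.
base_test = {
    "name": "runtime_env_wheel_urls",
    "cluster": {
        "anyscale_sdk_2026": True,
        "cluster_compute": "rte_minimal.yaml",  # key name assumed
    },
}

# GCE variation: overrides only the compute config file.
gce_variation = {"cluster": {"cluster_compute": "rte_gce_minimal.yaml"}}

gce_test = deep_update(base_test, gce_variation)
print(gce_test["cluster"])
```

Because the variation only names the keys it changes, the flag does not need to be repeated on each GCE entry.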

Schema changes applied

cloud_id → cloud, ANYSCALE_CLOUD_ID → ANYSCALE_CLOUD_NAME
Removed region: us-west-2 / GCE region + allowed_azs → zones
Removed max_workers: N (top-level)
head_node_type → head_node, worker_node_types → worker_nodes (preserved empty lists as worker_nodes: [] on single-node configs)
min_workers → min_nodes, max_workers → max_nodes
use_spot: false → market_type: ON_DEMAND
advanced_configurations_json → advanced_instance_config
Dropped head/worker name: fields (single worker group per config)
compute_tpl_gpu_node.yaml / compute_tpl_gpu_node_gce.yaml: head has GPU — set explicit resources: {CPU: 16, GPU: 1} per the plan's head-schedulability rule
compute_tpl_8_cpu_autoscaling.yaml: flattened head resources: {custom_resources: {proxy: 1, CPU: 0}} to {CPU: 0, proxy: 1}; workers' {custom_resources: {proxy: 1}} flattened to {CPU: 8, proxy: 1} — explicit CPU count because the new SDK treats worker_nodes[].resources as a full override rather than a merge with the instance's natural resources
compute_tpl_32_cpu.yaml: workers' {custom_resources: {worker: 1}} flattened to {CPU: 32, worker: 1} (same full-override rationale)
rte_small.yaml / rte_gce_small.yaml: set explicit head resources: {CPU: 4} because the corresponding test has wait_for_nodes.num_nodes: 4 (= head + 3 workers) — head must be schedulable
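
The mechanical renames listed above can be sketched as a small dictionary transform. This is an illustrative sketch only: the helper names (`migrate_config`, `migrate_node`, `flatten_resources`) are hypothetical, the sample legacy config uses made-up values, and treating `allowed_azs` as a plain rename to `zones` is one reading of the list. The explicit CPU counts (`CPU: 8`, `CPU: 32`, `CPU: 4`, `CPU: 16`) were chosen per config in this PR and are not inferred here.

```python
TOP_LEVEL_RENAMES = {
    "cloud_id": "cloud",
    "allowed_azs": "zones",  # assumed to be a plain rename
    "head_node_type": "head_node",
    "worker_node_types": "worker_nodes",
    "advanced_configurations_json": "advanced_instance_config",
}
TOP_LEVEL_DROPPED = {"region", "max_workers"}
NODE_RENAMES = {
    "min_workers": "min_nodes",
    "max_workers": "max_nodes",
    "advanced_configurations_json": "advanced_instance_config",
}


def flatten_resources(resources: dict) -> dict:
    """Lift `custom_resources` entries to the top level of `resources`."""
    flat = {k: v for k, v in resources.items() if k != "custom_resources"}
    flat.update(resources.get("custom_resources") or {})
    return flat


def migrate_node(node: dict) -> dict:
    """Translate one legacy head/worker entry to the new field names."""
    new = {}
    for key, value in node.items():
        if key == "name":
            continue  # head/worker name: fields were dropped
        if key == "use_spot":
            if value is not False:
                raise ValueError("only use_spot: false is covered by this sketch")
            new["market_type"] = "ON_DEMAND"
        elif key == "resources":
            new["resources"] = flatten_resources(value)
        else:
            new[NODE_RENAMES.get(key, key)] = value
    return new


def migrate_config(legacy: dict) -> dict:
    """Apply the top-level renames and removals, then migrate each node."""
    new = {}
    for key, value in legacy.items():
        if key in TOP_LEVEL_DROPPED:
            continue
        if key == "cloud_id":
            value = value.replace("ANYSCALE_CLOUD_ID", "ANYSCALE_CLOUD_NAME")
        elif key == "head_node_type":
            value = migrate_node(value)
        elif key == "worker_node_types":
            value = [migrate_node(worker) for worker in value]
        new[TOP_LEVEL_RENAMES.get(key, key)] = value
    return new


# Illustrative legacy config (values are made up for the example).
legacy = {
    "cloud_id": '{{env["ANYSCALE_CLOUD_ID"]}}',
    "region": "us-west-2",
    "max_workers": 32,
    "head_node_type": {
        "name": "head_node",
        "instance_type": "m5.2xlarge",
        "resources": {"custom_resources": {"proxy": 1, "CPU": 0}},
    },
    "worker_node_types": [
        {
            "name": "worker_node",
            "instance_type": "m5.2xlarge",
            "min_workers": 0,
            "max_workers": 32,
            "use_spot": False,
            "resources": {"custom_resources": {"proxy": 1}},
        }
    ],
}

migrated = migrate_config(legacy)
print(migrated)
```

Note that the migrated worker still carries only `{proxy: 1}`; under the full-override rule the CPU count has to be added by hand, as was done in this PR.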

Test plan

All 12 config files validated against ComputeConfig.from_yaml()
CI passes with anyscale_sdk_2026: true flag on all 10 test entries

🤖 Generated with Claude Code

Step 14 of the compute config upgrade plan. Migrates 12 serve and runtime_env compute configs to the new Anyscale SDK schema and flips anyscale_sdk_2026: true on the 9 affected tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9865da2.

gemini-code-assist

Code Review

This pull request migrates cluster compute templates to the new SDK 2026 schema across several YAML configuration files. Key changes include renaming fields such as cloud_id to cloud, head_node_type to head_node, and worker_node_types to worker_nodes, as well as updating worker node parameters and adding the anyscale_sdk_2026: true flag to test definitions. Review feedback identifies a missing flag for the pytest_serve_router_benchmark test that will cause parsing errors and suggests consistent placement of the new flag within the YAML structure for better readability.

In the new SDK schema, `worker_nodes[].resources` is a full override rather than a merge with the instance type's natural resources (confirmed via anyscale/compute_config/models.py: "Defaults to match the physical resources of the instance type"). Partial overrides like `resources: {worker: 1}` silently drop the instance's vCPU count, preventing Serve replicas from being scheduled.

Fixes:
- compute_tpl_32_cpu.yaml: worker adds `CPU: 32` alongside `worker: 1` (m5.8xlarge = 32 vCPUs).
- compute_tpl_8_cpu_autoscaling.yaml: worker adds `CPU: 8` alongside `proxy: 1` (m5.2xlarge = 8 vCPUs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

Addresses review feedback:
- pytest_serve_router_benchmark also loads compute_tpl_8_cpu_autoscaling.yaml (migrated in this PR) but was missing anyscale_sdk_2026: true. Without the flag the legacy SDK parser hits the new-schema config and fails. Added the flag.
- Moved anyscale_sdk_2026: true to the top of the cluster: block in serve_controller_benchmark_haproxy and pytest_serve_throughput_optimized_microbenchmarks for placement consistency with the other 7 tests in this PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

sai-miduthuri · 2026-05-08T20:36:37Z

All tests pass on baae41b except for 2 tests that have been set to manual due to test errors unrelated to these compute-config changes.

These two tests were already failing before this PR, as seen in #63231, so I am leaving them as-is rather than making further changes to get them passing.


cursor Bot reviewed Apr 23, 2026

Comment thread release/release_tests.yaml Outdated

gemini-code-assist Bot reviewed Apr 23, 2026

Comment thread release/serve_tests/compute_tpl_8_cpu_autoscaling.yaml

Comment thread release/release_tests.yaml Outdated

ray-gardener Bot added the serve (Ray Serve Related Issue) and release-test (release test) labels Apr 23, 2026

sai-miduthuri and others added 3 commits April 23, 2026 09:36

Force gce variations to use gce cloud and project

baae41b

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

sai-miduthuri mentioned this pull request May 8, 2026

[DRAFT] Run serve gce tests without changes #63231

Closed

Merge branch 'master' into sai-miduthuri/upgrade-serve-rte-compute-configs

cf36f66

sai-miduthuri added the go (add ONLY when ready to merge, run all tests) label May 8, 2026

sai-miduthuri requested review from elliot-barn and kamil-kaczmarek May 11, 2026 20:47

elliot-barn approved these changes May 18, 2026

elliot-barn merged commit 957664a into master May 18, 2026
8 checks passed

elliot-barn deleted the sai-miduthuri/upgrade-serve-rte-compute-configs branch May 18, 2026 17:24

elliot-barn merged 5 commits into master from sai-miduthuri/upgrade-serve-rte-compute-configs