[ci] Migrate serve and runtime_env compute configs to new schema #62876
Step 14 of the compute config upgrade plan. Migrates 12 serve and runtime_env compute configs to the new Anyscale SDK schema and flips anyscale_sdk_2026: true on the 9 affected tests.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9865da2.
Code Review
This pull request migrates cluster compute templates to the new SDK 2026 schema across several YAML configuration files. Key changes include renaming fields such as cloud_id to cloud, head_node_type to head_node, and worker_node_types to worker_nodes, as well as updating worker node parameters and adding the anyscale_sdk_2026: true flag to test definitions. Review feedback identifies a missing flag for the pytest_serve_router_benchmark test that will cause parsing errors and suggests consistent placement of the new flag within the YAML structure for better readability.
In the new SDK schema, `worker_nodes[].resources` is a full override
rather than a merge with the instance type's natural resources
(confirmed via anyscale/compute_config/models.py: "Defaults to match
the physical resources of the instance type"). Partial overrides like
`resources: {worker: 1}` silently drop the instance's vCPU count,
preventing Serve replicas from being scheduled.
Fixes:
- compute_tpl_32_cpu.yaml: worker adds `CPU: 32` alongside `worker: 1`
(m5.8xlarge = 32 vCPUs).
- compute_tpl_8_cpu_autoscaling.yaml: worker adds `CPU: 8` alongside
`proxy: 1` (m5.2xlarge = 8 vCPUs).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
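The difference between merge and full-override semantics described in the commit message above can be sketched in a few lines of Python. The two helper functions are hypothetical and written only for illustration; they are not the Anyscale SDK's code, and the fallback to instance defaults when nothing is declared is an assumption based on the quoted docstring.

```python
# Hypothetical helpers for illustration; not part of the Anyscale SDK.

def merged_resources(instance_defaults, declared):
    """Merge semantics: declared keys are layered on top of the instance defaults."""
    return {**instance_defaults, **(declared or {})}


def overridden_resources(instance_defaults, declared):
    """Full-override semantics: any declaration replaces the defaults entirely.

    Assumption: when nothing is declared, the instance defaults apply.
    """
    return dict(instance_defaults) if declared is None else dict(declared)


# m5.8xlarge has 32 vCPUs, per the commit message above.
m5_8xlarge = {"CPU": 32}

partial = {"worker": 1}           # what the first migration attempt declared
fixed = {"CPU": 32, "worker": 1}  # what this commit declares

print(merged_resources(m5_8xlarge, partial))      # {'CPU': 32, 'worker': 1}
print(overridden_resources(m5_8xlarge, partial))  # {'worker': 1}
print(overridden_resources(m5_8xlarge, fixed))    # {'CPU': 32, 'worker': 1}
```

Under full-override semantics the partial declaration leaves the node with no `CPU` entry, which is the scheduling failure the fix addresses.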
Addresses review feedback:
- pytest_serve_router_benchmark also loads compute_tpl_8_cpu_autoscaling.yaml (migrated in this PR) but was missing anyscale_sdk_2026: true. Without the flag the legacy SDK parser hits the new-schema config and fails. Added the flag.
- Moved anyscale_sdk_2026: true to the top of the cluster: block in serve_controller_benchmark_haproxy and pytest_serve_throughput_optimized_microbenchmarks for placement consistency with the other 7 tests in this PR.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
…-project#62876)

## Summary

Migrates 12 Anyscale compute config files from the legacy schema to the new SDK 2026 schema, and adds `anyscale_sdk_2026: true` to the 10 corresponding tests in `release_tests.yaml`.

### Compute configs migrated (12 files)

**Serve tests** (`release/serve_tests/`):
- `compute_tpl_32_cpu.yaml` / `compute_tpl_32_cpu_gce.yaml`
- `compute_tpl_8_cpu_autoscaling.yaml`
- `compute_tpl_gpu_node.yaml` / `compute_tpl_gpu_node_gce.yaml`
- `compute_tpl_single_node_16_cpu.yaml`
- `compute_tpl_single_node_32_cpu.yaml`
- `compute_tpl_single_node_gce.yaml`

**Runtime env tests** (`release/runtime_env_tests/`):
- `rte_small.yaml` / `rte_gce_small.yaml`
- `rte_minimal.yaml` / `rte_gce_minimal.yaml`

### Tests updated in release_tests.yaml (10 tests)

1. `pytest_serve_scale_replicas` → `compute_tpl_single_node_32_cpu.yaml`
2. `pytest_serve_multi_deployment_1k_noop_replica` → `compute_tpl_32_cpu.yaml` (+ GCE variation on `compute_tpl_32_cpu_gce.yaml`)
3. `pytest_serve_autoscaling_load_test` → `compute_tpl_single_node_32_cpu.yaml`
4. `serve_controller_benchmark_haproxy` → `compute_tpl_8_cpu_autoscaling.yaml` (+ GCE on `compute_tpl_single_node_gce.yaml`)
5. `pytest_serve_microbenchmarks` → `compute_tpl_single_node_16_cpu.yaml` (+ GCE)
6. `pytest_serve_throughput_optimized_microbenchmarks` → `compute_tpl_single_node_16_cpu.yaml` (+ GCE)
7. `pytest_serve_router_benchmark` → `compute_tpl_8_cpu_autoscaling.yaml` (shares the config with `serve_controller_benchmark_haproxy`, so the two must be flipped together)
8. `pytest_serve_resnet_benchmark` → `compute_tpl_gpu_node.yaml` (+ GCE on `compute_tpl_gpu_node_gce.yaml`)
9. `runtime_env_rte_many_tasks_actors` → `rte_small.yaml` (+ GCE on `rte_gce_small.yaml`)
10. `runtime_env_wheel_urls` → `rte_minimal.yaml` (+ GCE on `rte_gce_minimal.yaml`)

All 10 tests have their `anyscale_sdk_2026: true` on the base `cluster:` block; GCE variations inherit via `deep_update`. None of these tests have a kuberay variation.
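
How a GCE variation can inherit the flag from the base `cluster:` block is easiest to see with a recursive dictionary update. This is a minimal sketch of the general technique, not the release tooling's own `deep_update`; the `cluster_compute` key name and the shape of the variation entry are assumptions.

```python
import copy


def deep_update(base, overrides):
    """Recursively layer `overrides` onto a copy of `base`.

    Nested dicts are merged key by key; any other value replaces the base value.
    """
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_update(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


base_test = {
    "name": "runtime_env_wheel_urls",
    "cluster": {"anyscale_sdk_2026": True, "cluster_compute": "rte_minimal.yaml"},
}
# The variation overrides only the compute config file.
gce_variation = {"cluster": {"cluster_compute": "rte_gce_minimal.yaml"}}

gce_test = deep_update(base_test, gce_variation)
print(gce_test["cluster"])
# {'anyscale_sdk_2026': True, 'cluster_compute': 'rte_gce_minimal.yaml'}
```

Because the merge descends into `cluster`, the variation keeps `anyscale_sdk_2026: True` without repeating it.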
### Schema changes applied

- `cloud_id` → `cloud`, `ANYSCALE_CLOUD_ID` → `ANYSCALE_CLOUD_NAME`
- Removed `region: us-west-2` / GCE `region` + `allowed_azs` → `zones`
- Removed `max_workers: N` (top-level)
- `head_node_type` → `head_node`, `worker_node_types` → `worker_nodes` (preserved empty lists as `worker_nodes: []` on single-node configs)
- `min_workers` → `min_nodes`, `max_workers` → `max_nodes`
- `use_spot: false` → `market_type: ON_DEMAND`
- `advanced_configurations_json` → `advanced_instance_config`
- Dropped head/worker `name:` fields (single worker group per config)
- `compute_tpl_gpu_node.yaml` / `compute_tpl_gpu_node_gce.yaml`: head has GPU, so set explicit `resources: {CPU: 16, GPU: 1}` per the plan's head-schedulability rule
- `compute_tpl_8_cpu_autoscaling.yaml`: flattened head `resources: {custom_resources: {proxy: 1, CPU: 0}}` to `{CPU: 0, proxy: 1}`; workers' `{custom_resources: {proxy: 1}}` flattened to `{CPU: 8, proxy: 1}`, with an explicit CPU count because the new SDK treats `worker_nodes[].resources` as a full override rather than a merge with the instance's natural resources
- `compute_tpl_32_cpu.yaml`: workers' `{custom_resources: {worker: 1}}` flattened to `{CPU: 32, worker: 1}` (same full-override rationale)
- `rte_small.yaml` / `rte_gce_small.yaml`: set explicit head `resources: {CPU: 4}` because the corresponding test has `wait_for_nodes.num_nodes: 4` (= head + 3 workers), so the head must be schedulable

## Test plan

- [x] All 12 config files validated against `ComputeConfig.from_yaml()`
- [x] CI passes with `anyscale_sdk_2026: true` flag on all 10 test entries

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
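
The mechanical part of the schema changes listed above (renames, removals, and flattening of `custom_resources`) can be sketched as a dictionary transform. This is an illustration only and not the tooling used in this PR: the helper names, the placement of `advanced_configurations_json` at the top level, the `instance_type` field, and the worker counts are all assumptions, and the cloud ID to cloud name value change and GCE `zones` handling are left out.

```python
# Illustrative sketch only; helper names and field placement are assumptions.
TOP_LEVEL_RENAMES = {
    "cloud_id": "cloud",
    "advanced_configurations_json": "advanced_instance_config",
}
NODE_RENAMES = {"min_workers": "min_nodes", "max_workers": "max_nodes"}
REMOVED_TOP_LEVEL = {"region", "max_workers"}


def flatten_resources(resources):
    """Lift `custom_resources` entries to the top level of `resources`."""
    flat = {k: v for k, v in resources.items() if k != "custom_resources"}
    flat.update(resources.get("custom_resources", {}))
    return flat


def migrate_node(node):
    """Apply the per-node renames; drop `name`; map `use_spot: false`."""
    out = {}
    for key, value in node.items():
        if key == "name":
            continue
        if key == "use_spot":
            if value:
                raise ValueError("only use_spot: false is covered by this sketch")
            out["market_type"] = "ON_DEMAND"
        elif key == "resources":
            out["resources"] = flatten_resources(value)
        else:
            out[NODE_RENAMES.get(key, key)] = value
    return out


def migrate_config(legacy):
    """Apply the top-level renames and removals, then migrate each node."""
    new = {}
    for key, value in legacy.items():
        if key in REMOVED_TOP_LEVEL:
            continue
        if key == "head_node_type":
            new["head_node"] = migrate_node(value)
        elif key == "worker_node_types":
            new["worker_nodes"] = [migrate_node(n) for n in value]
        else:
            new[TOP_LEVEL_RENAMES.get(key, key)] = value
    return new


# Loosely modelled on compute_tpl_8_cpu_autoscaling.yaml; counts are made up.
legacy = {
    "cloud_id": "example-cloud",
    "region": "us-west-2",
    "max_workers": 2,
    "head_node_type": {
        "name": "head_node",
        "resources": {"custom_resources": {"proxy": 1, "CPU": 0}},
    },
    "worker_node_types": [
        {
            "name": "worker_node",
            "instance_type": "m5.2xlarge",
            "min_workers": 0,
            "max_workers": 2,
            "use_spot": False,
            "resources": {"custom_resources": {"proxy": 1}},
        }
    ],
}

new = migrate_config(legacy)
print(new["head_node"])
print(new["worker_nodes"][0])
```

Flattening alone leaves the worker with `{proxy: 1}`; the explicit `CPU: 8` still has to be added by hand, since the instance's vCPU count is not present in the legacy file.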
