
[ci] Migrate serve and runtime_env compute configs to new schema#62876

Merged
elliot-barn merged 5 commits into master from sai-miduthuri/upgrade-serve-rte-compute-configs
May 18, 2026

@sai-miduthuri commented Apr 23, 2026

Summary

Migrates 12 Anyscale compute config files from the legacy schema to the new SDK 2026 schema, and adds anyscale_sdk_2026: true to the 10 corresponding tests in release_tests.yaml.

Compute configs migrated (12 files)

Serve tests (release/serve_tests/):

  • compute_tpl_32_cpu.yaml / compute_tpl_32_cpu_gce.yaml
  • compute_tpl_8_cpu_autoscaling.yaml
  • compute_tpl_gpu_node.yaml / compute_tpl_gpu_node_gce.yaml
  • compute_tpl_single_node_16_cpu.yaml
  • compute_tpl_single_node_32_cpu.yaml
  • compute_tpl_single_node_gce.yaml

Runtime env tests (release/runtime_env_tests/):

  • rte_small.yaml / rte_gce_small.yaml
  • rte_minimal.yaml / rte_gce_minimal.yaml

Tests updated in release_tests.yaml (10 tests)

  1. pytest_serve_scale_replicas → compute_tpl_single_node_32_cpu.yaml
  2. pytest_serve_multi_deployment_1k_noop_replica → compute_tpl_32_cpu.yaml (+ GCE variation on compute_tpl_32_cpu_gce.yaml)
  3. pytest_serve_autoscaling_load_test → compute_tpl_single_node_32_cpu.yaml
  4. serve_controller_benchmark_haproxy → compute_tpl_8_cpu_autoscaling.yaml (+ GCE on compute_tpl_single_node_gce.yaml)
  5. pytest_serve_microbenchmarks → compute_tpl_single_node_16_cpu.yaml (+ GCE)
  6. pytest_serve_throughput_optimized_microbenchmarks → compute_tpl_single_node_16_cpu.yaml (+ GCE)
  7. pytest_serve_router_benchmark → compute_tpl_8_cpu_autoscaling.yaml (shares the config with serve_controller_benchmark_haproxy; the two must be flipped together)
  8. pytest_serve_resnet_benchmark → compute_tpl_gpu_node.yaml (+ GCE on compute_tpl_gpu_node_gce.yaml)
  9. runtime_env_rte_many_tasks_actors → rte_small.yaml (+ GCE on rte_gce_small.yaml)
  10. runtime_env_wheel_urls → rte_minimal.yaml (+ GCE on rte_gce_minimal.yaml)

All 10 tests have their anyscale_sdk_2026: true on the base cluster: block; GCE variations inherit via deep_update. None of these tests have a kuberay variation.
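
The inheritance described above can be sketched as a recursive dictionary merge. This is a minimal illustration, not the release tooling's actual `deep_update`: the function body, the `cluster_compute` key, and the test dictionaries below are assumptions made for the example.

```python
import copy


def deep_update(base: dict, override: dict) -> dict:
    """Hypothetical stand-in: recursively merge `override` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts are merged key by key, so sibling keys survive.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Base test entry carries the flag on its cluster block (shape is assumed).
base_test = {
    "name": "runtime_env_wheel_urls",
    "cluster": {
        "anyscale_sdk_2026": True,
        "cluster_compute": "rte_minimal.yaml",
    },
}

# The GCE variation only swaps the compute config file.
gce_variation = {"cluster": {"cluster_compute": "rte_gce_minimal.yaml"}}

gce_test = deep_update(base_test, gce_variation)
print(gce_test["cluster"])
```

Because the variation never mentions `anyscale_sdk_2026`, a merge of this kind leaves the base value in place, which is why the flag only needs to be set once per test.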

Schema changes applied

  • cloud_id → cloud, ANYSCALE_CLOUD_ID → ANYSCALE_CLOUD_NAME
  • Removed region: us-west-2 / GCE region + allowed_azs → zones
  • Removed max_workers: N (top-level)
  • head_node_type → head_node, worker_node_types → worker_nodes (preserved empty lists as worker_nodes: [] on single-node configs)
  • min_workers → min_nodes, max_workers → max_nodes
  • use_spot: false → market_type: ON_DEMAND
  • advanced_configurations_json → advanced_instance_config
  • Dropped head/worker name: fields (single worker group per config)
  • compute_tpl_gpu_node.yaml / compute_tpl_gpu_node_gce.yaml: head has GPU — set explicit resources: {CPU: 16, GPU: 1} per the plan's head-schedulability rule
  • compute_tpl_8_cpu_autoscaling.yaml: flattened head resources: {custom_resources: {proxy: 1, CPU: 0}} to {CPU: 0, proxy: 1}; workers' {custom_resources: {proxy: 1}} flattened to {CPU: 8, proxy: 1} — explicit CPU count because the new SDK treats worker_nodes[].resources as a full override rather than a merge with the instance's natural resources
  • compute_tpl_32_cpu.yaml: workers' {custom_resources: {worker: 1}} flattened to {CPU: 32, worker: 1} (same full-override rationale)
  • rte_small.yaml / rte_gce_small.yaml: set explicit head resources: {CPU: 4} because the corresponding test has wait_for_nodes.num_nodes: 4 (= head + 3 workers) — head must be schedulable
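
As a rough illustration of the renames listed above, the mapping could be expressed as a small dictionary transform. This is a sketch only: `migrate_config`, `migrate_node`, and `flatten_resources` are hypothetical helpers, not part of the Anyscale SDK or this PR, and the sample values (instance type, worker counts, the templated cloud string) are made up. It reads the `region` bullet as "drop `region`, rename `allowed_azs` to `zones`", which is an interpretation.

```python
# Per-node key renames taken from the list above.
NODE_RENAMES = {
    "min_workers": "min_nodes",
    "max_workers": "max_nodes",
    "advanced_configurations_json": "advanced_instance_config",
}


def flatten_resources(resources: dict) -> dict:
    """Lift entries under `custom_resources` to the top level of `resources`."""
    flat = {k: v for k, v in resources.items() if k != "custom_resources"}
    flat.update(resources.get("custom_resources", {}))
    return flat


def migrate_node(node: dict) -> dict:
    out = {}
    for key, value in node.items():
        if key == "name":
            continue  # name: fields were dropped
        if key == "use_spot":
            if value:
                # Only the `use_spot: false` case appears in the PR description.
                raise ValueError("use_spot: true is not covered by this sketch")
            out["market_type"] = "ON_DEMAND"
        elif key == "resources":
            out["resources"] = flatten_resources(value)
        else:
            out[NODE_RENAMES.get(key, key)] = value
    return out


def migrate_config(legacy: dict) -> dict:
    out = {}
    for key, value in legacy.items():
        if key in ("region", "max_workers"):
            continue  # removed at the top level
        if key == "cloud_id":
            out["cloud"] = value.replace("ANYSCALE_CLOUD_ID", "ANYSCALE_CLOUD_NAME")
        elif key == "allowed_azs":
            out["zones"] = value
        elif key == "head_node_type":
            out["head_node"] = migrate_node(value)
        elif key == "worker_node_types":
            out["worker_nodes"] = [migrate_node(n) for n in value]
        else:
            out[key] = value
    return out


# Illustrative legacy config (values are assumptions, not the real file).
legacy = {
    "cloud_id": '{{env["ANYSCALE_CLOUD_ID"]}}',
    "region": "us-west-2",
    "max_workers": 32,
    "head_node_type": {
        "name": "head_node",
        "instance_type": "m5.2xlarge",
        "resources": {"custom_resources": {"proxy": 1, "CPU": 0}},
    },
    "worker_node_types": [
        {
            "name": "worker_node",
            "instance_type": "m5.2xlarge",
            "min_workers": 0,
            "max_workers": 32,
            "use_spot": False,
            "resources": {"custom_resources": {"proxy": 1}},
        }
    ],
}

migrated = migrate_config(legacy)
# The renames alone leave workers with {"proxy": 1}; the explicit CPU count
# was a separate, manual step in this PR (m5.2xlarge = 8 vCPUs).
migrated["worker_nodes"][0]["resources"]["CPU"] = 8
print(migrated)
```

The mechanical renames and the per-file resource decisions are kept apart here on purpose: the former can be automated, while the latter depend on the instance type and on whether the head must be schedulable.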

Test plan

  • All 12 config files validated against ComputeConfig.from_yaml()
  • CI passes with anyscale_sdk_2026: true flag on all 10 test entries

🤖 Generated with Claude Code

Step 14 of the compute config upgrade plan.

Migrates 12 serve and runtime_env compute configs to the new Anyscale
SDK schema and flips anyscale_sdk_2026: true on the 9 affected tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9865da2.

Comment thread release/release_tests.yaml Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request migrates cluster compute templates to the new SDK 2026 schema across several YAML configuration files. Key changes include renaming fields such as cloud_id to cloud, head_node_type to head_node, and worker_node_types to worker_nodes, as well as updating worker node parameters and adding the anyscale_sdk_2026: true flag to test definitions. Review feedback identifies a missing flag for the pytest_serve_router_benchmark test that will cause parsing errors and suggests consistent placement of the new flag within the YAML structure for better readability.

Comment thread release/serve_tests/compute_tpl_8_cpu_autoscaling.yaml
Comment thread release/release_tests.yaml Outdated
@ray-gardener (bot) added the labels serve (Ray Serve Related Issue) and release-test (release test) on Apr 23, 2026
sai-miduthuri and others added 3 commits April 23, 2026 09:36
In the new SDK schema, `worker_nodes[].resources` is a full override
rather than a merge with the instance type's natural resources
(confirmed via anyscale/compute_config/models.py: "Defaults to match
the physical resources of the instance type"). Partial overrides like
`resources: {worker: 1}` silently drop the instance's vCPU count,
preventing Serve replicas from being scheduled.

Fixes:
- compute_tpl_32_cpu.yaml: worker adds `CPU: 32` alongside `worker: 1`
  (m5.8xlarge = 32 vCPUs).
- compute_tpl_8_cpu_autoscaling.yaml: worker adds `CPU: 8` alongside
  `proxy: 1` (m5.2xlarge = 8 vCPUs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
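
The full-override behaviour described in the commit message above can be illustrated with a toy model. This is not the SDK's implementation: `effective_resources` is a hypothetical function that only mirrors the stated rule (resources default to the instance type's physical resources when omitted, and replace them entirely when given).

```python
from typing import Optional


def effective_resources(natural: dict, declared: Optional[dict]) -> dict:
    """Toy model of the rule: declared resources replace, never merge."""
    if declared is None:
        # Omitted: fall back to the instance type's physical resources.
        return dict(natural)
    # Provided: used as-is, with nothing inherited from the instance type.
    return dict(declared)


# m5.8xlarge has 32 vCPUs, per the commit message.
M5_8XLARGE = {"CPU": 32}

default = effective_resources(M5_8XLARGE, None)
partial = effective_resources(M5_8XLARGE, {"worker": 1})
fixed = effective_resources(M5_8XLARGE, {"CPU": 32, "worker": 1})

print("CPU" in partial)  # False: the partial override dropped the vCPU count
print(fixed)
```

Under this model the partial override leaves the worker with no CPU resource at all, which matches the reported symptom of Serve replicas not being scheduled until `CPU` is stated explicitly.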
Addresses review feedback:

- pytest_serve_router_benchmark also loads compute_tpl_8_cpu_autoscaling.yaml
  (migrated in this PR) but was missing anyscale_sdk_2026: true. Without
  the flag the legacy SDK parser hits the new-schema config and fails.
  Added the flag.

- Moved anyscale_sdk_2026: true to the top of the cluster: block in
  serve_controller_benchmark_haproxy and pytest_serve_throughput_optimized_microbenchmarks
  for placement consistency with the other 7 tests in this PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
@sai-miduthuri

All tests pass on baae41b except for 2 tests, which have been set to manual because of test errors unrelated to these compute-config changes.

Both tests were already failing before this PR (see #63231), so I am leaving them as they are rather than making further changes to get them passing.

@sai-miduthuri added the label go (add ONLY when ready to merge, run all tests) on May 8, 2026
@elliot-barn elliot-barn merged commit 957664a into master May 18, 2026
8 checks passed
@elliot-barn elliot-barn deleted the sai-miduthuri/upgrade-serve-rte-compute-configs branch May 18, 2026 17:24
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…-project#62876)
