[ci] Stop billing tags from overriding advanced_instance_config fields#63296
Conversation
In the new-schema branch of _annotate_cluster_compute, also tagging head_node.advanced_instance_config and worker_nodes[*].advanced_instance_config caused Anyscale to use the per-group spec (TagSpecifications-only) at instance launch and silently drop the cluster-level IamInstanceProfile and other fields, since the per-group spec replaces (not merges with) the base spec. Matches the legacy branch's behavior of only annotating the top-level spec. Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request modifies the cluster manager to only annotate the top-level advanced_instance_config with billing tags, preventing Anyscale from overriding cluster-level configurations when per-group specifications are present. Corresponding tests were updated to verify that node groups are no longer automatically annotated. A review comment suggests that the code should still annotate advanced_instance_config if it is explicitly provided by the user for specific node groups to ensure billing tags are not lost in those instances.
When the user supplies head_node.advanced_instance_config or worker_nodes[*].advanced_instance_config, Anyscale replaces (does not merge with) the cluster-level base spec, so the per-group spec must carry the billing tags itself. Auto-creation of per-group specs is still avoided. Addresses #63296 (comment) Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Refines the new-schema annotation rules so billing tags land exactly
where the effective advanced_instance_config is used:
- No advanced_instance_config anywhere -> tag base only.
- Base only -> tag base only.
- Per-group only (head and/or workers, no base) -> tag per-group
only; do NOT auto-create a base spec.
- Base AND per-group -> tag everywhere.
Adds test coverage for head-only per-group, worker-only per-group,
base+per-group, and the non-AWS early-return.
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
- Coerce missing/null head_node to {} and worker_nodes to [] in the
same way, so the downstream membership/iteration code doesn't need
to special-case either. The per-worker isinstance(w, dict) guard is
retained to defend against malformed individual entries.
- Add a dedicated testClusterComputeNewSchemaNoAic so the "no
advanced_instance_config anywhere" row of the spec is visible
alongside the other per-row test methods.
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
| RELEASE_AWS_RESOURCE_TYPES_TO_TRACK_FOR_BILLING, | ||
| ) | ||
| for worker in workers: | ||
| if isinstance(worker, dict) and "advanced_instance_config" in worker: |
There was a problem hiding this comment.
should we throw an error here if the worker isn't a dict? not sure if we have cluster validation before this
There was a problem hiding this comment.
We do have cluster validation as a separate step when the compute-configs are loaded into memory for each each test through the test-collection:
ray/release/ray_release/scripts/run_release_test.py
Lines 92 to 95 in 2d9c453
ray/release/ray_release/config.py
Lines 108 to 112 in 2d9c453
ray/release/ray_release/config.py
Line 269 in 2d9c453
ray/release/ray_release/config.py
Lines 328 to 330 in 2d9c453
ray/release/ray_release/config.py
Line 333 in 2d9c453
We implicitly ensure that the cluster_compute["head_node"] and cluster_compute["worker_nodes"] are of dict type if they are present in the compute-config.
Though I can add additional checks and throw an error here if it isn't a dict, I think we can remove the isinstance(..., dict) check instead and just assume that it is a dict.
What is your recommendation?
There was a problem hiding this comment.
Will leave as is. Approved!
ray-project#63296) ## Description Refines `_annotate_cluster_compute` (new-schema branch) so the billing-tag `TagSpecifications` land at exactly the `advanced_instance_config` levels that the cluster will actually use at launch. Previously every new-schema compute config got tags added at three locations unconditionally: top-level, `head_node`, and every `worker_nodes[*]`. ### Final rules | Input state | Base tagged? | Per-group tagged? | |-----------------------------------------------------|----------------------|--------------------------------| | No `advanced_instance_config` anywhere | yes (auto-created) | — | | `advanced_instance_config` at base only | yes | — | | `advanced_instance_config` only on head and/or workers | no (not auto-created) | yes (only the groups that already have one) | | `advanced_instance_config` at base AND on head/workers | yes | yes (only the groups that already have one) | In code: `should_tag_base = has_base_aic or not (has_head_aic or has_worker_aic)`; per-group specs are tagged iff the user already supplied them. ### Why Anyscale resolves the effective advanced instance spec per node group using an either-or pick: when a per-group spec is present, it **replaces** (does not merge with) the cluster-level base spec. The previous behavior unconditionally created `TagSpecifications`-only entries on `head_node` and `worker_nodes[*]`, which made those per-group specs non-empty and caused them to win — silently dropping any `IamInstanceProfile`/`NetworkInterfaces`/etc. set only at the cluster level. This was observed on the `iceberg_benchmark_*` tests after their compute config was migrated to the new schema: workers no longer picked up the `ray-autoscaler-v1` IAM instance profile that was declared at the top level of `iceberg_benchmark_compute.yaml`, so the test couldn't access the AWS resources it needed. ### Test `release/ray_release/tests/test_cluster_manager.py`: - `testClusterComputeExtraTagsNewSchema` (updated) covers the no-AIC and base+per-group rows, including verifying that pre-existing `IamInstanceProfile` values on each level are preserved alongside the added `TagSpecifications`. - `testClusterComputeNewSchemaNoAic`, `testClusterComputeNewSchemaPerGroupHeadOnly`, `testClusterComputeNewSchemaPerGroupWorkerOnly` (new) — one method per row of the table for clear regression coverage. - `testClusterComputeNewSchemaNonAws` (new) — verifies the non-AWS early return. ## Related issues Related to ray-project#62863 (where this bug was first observed during the dataset compute-config migration). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ray-project#63296) ## Description Refines `_annotate_cluster_compute` (new-schema branch) so the billing-tag `TagSpecifications` land at exactly the `advanced_instance_config` levels that the cluster will actually use at launch. Previously every new-schema compute config got tags added at three locations unconditionally: top-level, `head_node`, and every `worker_nodes[*]`. ### Final rules | Input state | Base tagged? | Per-group tagged? | |-----------------------------------------------------|----------------------|--------------------------------| | No `advanced_instance_config` anywhere | yes (auto-created) | — | | `advanced_instance_config` at base only | yes | — | | `advanced_instance_config` only on head and/or workers | no (not auto-created) | yes (only the groups that already have one) | | `advanced_instance_config` at base AND on head/workers | yes | yes (only the groups that already have one) | In code: `should_tag_base = has_base_aic or not (has_head_aic or has_worker_aic)`; per-group specs are tagged iff the user already supplied them. ### Why Anyscale resolves the effective advanced instance spec per node group using an either-or pick: when a per-group spec is present, it **replaces** (does not merge with) the cluster-level base spec. The previous behavior unconditionally created `TagSpecifications`-only entries on `head_node` and `worker_nodes[*]`, which made those per-group specs non-empty and caused them to win — silently dropping any `IamInstanceProfile`/`NetworkInterfaces`/etc. set only at the cluster level. This was observed on the `iceberg_benchmark_*` tests after their compute config was migrated to the new schema: workers no longer picked up the `ray-autoscaler-v1` IAM instance profile that was declared at the top level of `iceberg_benchmark_compute.yaml`, so the test couldn't access the AWS resources it needed. ### Test `release/ray_release/tests/test_cluster_manager.py`: - `testClusterComputeExtraTagsNewSchema` (updated) covers the no-AIC and base+per-group rows, including verifying that pre-existing `IamInstanceProfile` values on each level are preserved alongside the added `TagSpecifications`. - `testClusterComputeNewSchemaNoAic`, `testClusterComputeNewSchemaPerGroupHeadOnly`, `testClusterComputeNewSchemaPerGroupWorkerOnly` (new) — one method per row of the table for clear regression coverage. - `testClusterComputeNewSchemaNonAws` (new) — verifies the non-AWS early return. ## Related issues Related to ray-project#62863 (where this bug was first observed during the dataset compute-config migration). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
ray-project#63296) ## Description Refines `_annotate_cluster_compute` (new-schema branch) so the billing-tag `TagSpecifications` land at exactly the `advanced_instance_config` levels that the cluster will actually use at launch. Previously every new-schema compute config got tags added at three locations unconditionally: top-level, `head_node`, and every `worker_nodes[*]`. ### Final rules | Input state | Base tagged? | Per-group tagged? | |-----------------------------------------------------|----------------------|--------------------------------| | No `advanced_instance_config` anywhere | yes (auto-created) | — | | `advanced_instance_config` at base only | yes | — | | `advanced_instance_config` only on head and/or workers | no (not auto-created) | yes (only the groups that already have one) | | `advanced_instance_config` at base AND on head/workers | yes | yes (only the groups that already have one) | In code: `should_tag_base = has_base_aic or not (has_head_aic or has_worker_aic)`; per-group specs are tagged iff the user already supplied them. ### Why Anyscale resolves the effective advanced instance spec per node group using an either-or pick: when a per-group spec is present, it **replaces** (does not merge with) the cluster-level base spec. The previous behavior unconditionally created `TagSpecifications`-only entries on `head_node` and `worker_nodes[*]`, which made those per-group specs non-empty and caused them to win — silently dropping any `IamInstanceProfile`/`NetworkInterfaces`/etc. set only at the cluster level. This was observed on the `iceberg_benchmark_*` tests after their compute config was migrated to the new schema: workers no longer picked up the `ray-autoscaler-v1` IAM instance profile that was declared at the top level of `iceberg_benchmark_compute.yaml`, so the test couldn't access the AWS resources it needed. ### Test `release/ray_release/tests/test_cluster_manager.py`: - `testClusterComputeExtraTagsNewSchema` (updated) covers the no-AIC and base+per-group rows, including verifying that pre-existing `IamInstanceProfile` values on each level are preserved alongside the added `TagSpecifications`. - `testClusterComputeNewSchemaNoAic`, `testClusterComputeNewSchemaPerGroupHeadOnly`, `testClusterComputeNewSchemaPerGroupWorkerOnly` (new) — one method per row of the table for clear regression coverage. - `testClusterComputeNewSchemaNonAws` (new) — verifies the non-AWS early return. ## Related issues Related to ray-project#62863 (where this bug was first observed during the dataset compute-config migration). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Description
Refines
_annotate_cluster_compute(new-schema branch) so the billing-tagTagSpecificationsland at exactly theadvanced_instance_configlevels that the cluster will actually use at launch. Previously every new-schema compute config got tags added at three locations unconditionally: top-level,head_node, and everyworker_nodes[*].Final rules
advanced_instance_configanywhereadvanced_instance_configat base onlyadvanced_instance_configonly on head and/or workersadvanced_instance_configat base AND on head/workersIn code:
should_tag_base = has_base_aic or not (has_head_aic or has_worker_aic); per-group specs are tagged iff the user already supplied them.Why
Anyscale resolves the effective advanced instance spec per node group using an either-or pick: when a per-group spec is present, it replaces (does not merge with) the cluster-level base spec. The previous behavior unconditionally created
TagSpecifications-only entries onhead_nodeandworker_nodes[*], which made those per-group specs non-empty and caused them to win — silently dropping anyIamInstanceProfile/NetworkInterfaces/etc. set only at the cluster level.This was observed on the
iceberg_benchmark_*tests after their compute config was migrated to the new schema: workers no longer picked up theray-autoscaler-v1IAM instance profile that was declared at the top level oficeberg_benchmark_compute.yaml, so the test couldn't access the AWS resources it needed.Test
release/ray_release/tests/test_cluster_manager.py:testClusterComputeExtraTagsNewSchema(updated) covers the no-AIC and base+per-group rows, including verifying that pre-existingIamInstanceProfilevalues on each level are preserved alongside the addedTagSpecifications.testClusterComputeNewSchemaNoAic,testClusterComputeNewSchemaPerGroupHeadOnly,testClusterComputeNewSchemaPerGroupWorkerOnly(new) — one method per row of the table for clear regression coverage.testClusterComputeNewSchemaNonAws(new) — verifies the non-AWS early return.Related issues
Related to #62863 (where this bug was first observed during the dataset compute-config migration).
🤖 Generated with Claude Code