
[ci] Stop billing tags from overriding advanced_instance_config fields #63296

Merged
elliot-barn merged 4 commits into master from sai-miduthuri/fix-new-schema-aic-annotation
May 12, 2026

Conversation

Contributor

@sai-miduthuri sai-miduthuri commented May 12, 2026

Description

Refines _annotate_cluster_compute (new-schema branch) so the billing-tag TagSpecifications land at exactly the advanced_instance_config levels that the cluster will actually use at launch. Previously every new-schema compute config got tags added at three locations unconditionally: top-level, head_node, and every worker_nodes[*].

Final rules

| Input state | Base tagged? | Per-group tagged? |
|---|---|---|
| No advanced_instance_config anywhere | yes (auto-created) | n/a |
| advanced_instance_config at base only | yes | n/a |
| advanced_instance_config only on head and/or workers | no (not auto-created) | yes (only the groups that already have one) |
| advanced_instance_config at base AND on head/workers | yes | yes (only the groups that already have one) |

In code: should_tag_base = has_base_aic or not (has_head_aic or has_worker_aic); per-group specs are tagged iff the user already supplied them.
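The rule table can be sketched as a standalone function. This is a minimal illustration, not the code in cluster_manager.py: `annotate_new_schema`, `add_billing_tags`, and the `RESOURCE_TYPES` list are hypothetical names and values chosen for the example, and the non-AWS early return is left out.

```python
import copy

# Assumed stand-in for RELEASE_AWS_RESOURCE_TYPES_TO_TRACK_FOR_BILLING;
# the real list is defined in the release tooling and not shown in this PR.
RESOURCE_TYPES = ["instance", "volume"]


def add_billing_tags(aic, extra_tags):
    """Return a copy of one advanced_instance_config with tag specs appended."""
    aic = copy.deepcopy(aic)
    specs = aic.setdefault("TagSpecifications", [])
    for resource_type in RESOURCE_TYPES:
        specs.append(
            {
                "ResourceType": resource_type,
                "Tags": [{"Key": k, "Value": v} for k, v in extra_tags.items()],
            }
        )
    return aic


def annotate_new_schema(cluster_compute, extra_tags):
    """Apply the four-row rule table to a new-schema compute config (sketch)."""
    cc = copy.deepcopy(cluster_compute)
    head = cc.get("head_node") or {}
    workers = cc.get("worker_nodes") or []

    has_base_aic = "advanced_instance_config" in cc
    has_head_aic = "advanced_instance_config" in head
    has_worker_aic = any(
        isinstance(w, dict) and "advanced_instance_config" in w for w in workers
    )

    # Base is tagged if the user set it, or if nothing is set anywhere.
    should_tag_base = has_base_aic or not (has_head_aic or has_worker_aic)
    if should_tag_base:
        cc["advanced_instance_config"] = add_billing_tags(
            cc.get("advanced_instance_config") or {}, extra_tags
        )

    # Per-group specs are tagged only where the user already supplied one.
    if has_head_aic:
        head["advanced_instance_config"] = add_billing_tags(
            head["advanced_instance_config"] or {}, extra_tags
        )
    for w in workers:
        if isinstance(w, dict) and "advanced_instance_config" in w:
            w["advanced_instance_config"] = add_billing_tags(
                w["advanced_instance_config"] or {}, extra_tags
            )
    return cc
```

In this sketch a config with only `head_node.advanced_instance_config` set comes back with tags on the head spec and no top-level `advanced_instance_config`, which matches the third row of the table.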

Why

Anyscale resolves the effective advanced instance spec per node group using an either-or pick: when a per-group spec is present, it replaces (does not merge with) the cluster-level base spec. The previous behavior unconditionally created TagSpecifications-only entries on head_node and worker_nodes[*], which made those per-group specs non-empty and caused them to win — silently dropping any IamInstanceProfile/NetworkInterfaces/etc. set only at the cluster level.
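To make the failure mode concrete, here is a toy model of the either-or pick. Anyscale's actual resolver is not shown in this PR, so `resolve_effective_aic` is a hypothetical function that only mirrors the behavior described above: a non-empty per-group spec wins outright and nothing from the base is merged in.

```python
def resolve_effective_aic(base_aic, group_aic):
    """Toy either-or pick: a non-empty per-group spec replaces the base spec."""
    return group_aic if group_aic else base_aic


# Cluster-level spec carrying the IAM profile the workers need.
base = {"IamInstanceProfile": {"Name": "ray-autoscaler-v1"}}

# What the old annotation auto-created on every node group.
tags_only = {"TagSpecifications": [{"ResourceType": "instance", "Tags": []}]}

effective = resolve_effective_aic(base, tags_only)
lost_profile = "IamInstanceProfile" not in effective  # True under this model
```

Under this model, leaving the per-group spec absent (or empty) lets the base spec, including its `IamInstanceProfile`, apply to the group.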

This was observed on the iceberg_benchmark_* tests after their compute config was migrated to the new schema: workers no longer picked up the ray-autoscaler-v1 IAM instance profile that was declared at the top level of iceberg_benchmark_compute.yaml, so the test couldn't access the AWS resources it needed.

Test

release/ray_release/tests/test_cluster_manager.py:

  • testClusterComputeExtraTagsNewSchema (updated) covers the no-AIC and base+per-group rows, including verifying that pre-existing IamInstanceProfile values on each level are preserved alongside the added TagSpecifications.
  • testClusterComputeNewSchemaNoAic, testClusterComputeNewSchemaPerGroupHeadOnly, testClusterComputeNewSchemaPerGroupWorkerOnly (new) — one method per row of the table for clear regression coverage.
  • testClusterComputeNewSchemaNonAws (new) — verifies the non-AWS early return.

Related issues

Related to #62863 (where this bug was first observed during the dataset compute-config migration).

🤖 Generated with Claude Code

In the new-schema branch of _annotate_cluster_compute, also tagging
head_node.advanced_instance_config and worker_nodes[*].advanced_instance_config
caused Anyscale to use the per-group spec (TagSpecifications-only) at
instance launch and silently drop the cluster-level IamInstanceProfile and
other fields, since the per-group spec replaces (not merges with) the base
spec. Matches the legacy branch's behavior of only annotating the top-level
spec.

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
@sai-miduthuri sai-miduthuri requested a review from a team as a code owner May 12, 2026 05:58
@sai-miduthuri sai-miduthuri added the go add ONLY when ready to merge, run all tests label May 12, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies the cluster manager to only annotate the top-level advanced_instance_config with billing tags, preventing Anyscale from overriding cluster-level configurations when per-group specifications are present. Corresponding tests were updated to verify that node groups are no longer automatically annotated. A review comment suggests that the code should still annotate advanced_instance_config if it is explicitly provided by the user for specific node groups to ensure billing tags are not lost in those instances.

Comment thread release/ray_release/cluster_manager/cluster_manager.py Outdated
sai-miduthuri and others added 3 commits May 11, 2026 23:10
When the user supplies head_node.advanced_instance_config or
worker_nodes[*].advanced_instance_config, Anyscale replaces (does not
merge with) the cluster-level base spec, so the per-group spec must
carry the billing tags itself. Auto-creation of per-group specs is
still avoided.

Addresses #63296 (comment)

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Refines the new-schema annotation rules so billing tags land exactly
where the effective advanced_instance_config is used:
  - No advanced_instance_config anywhere -> tag base only.
  - Base only -> tag base only.
  - Per-group only (head and/or workers, no base) -> tag per-group
    only; do NOT auto-create a base spec.
  - Base AND per-group -> tag everywhere.

Adds test coverage for head-only per-group, worker-only per-group,
base+per-group, and the non-AWS early-return.

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
- Coerce missing/null head_node to {} and worker_nodes to [] in the
  same way, so the downstream membership/iteration code doesn't need
  to special-case either. The per-worker isinstance(w, dict) guard is
  retained to defend against malformed individual entries.
- Add a dedicated testClusterComputeNewSchemaNoAic so the "no
  advanced_instance_config anywhere" row of the spec is visible
  alongside the other per-row test methods.

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
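The null coercion described in the last commit could look roughly like the following. This is a sketch under assumptions: `normalize_node_groups` and `workers_with_aic` are hypothetical helper names, not functions from the PR.

```python
def normalize_node_groups(cluster_compute):
    """Coerce a missing or null head_node to {} and worker_nodes to []."""
    head = cluster_compute.get("head_node") or {}
    workers = cluster_compute.get("worker_nodes") or []
    return head, workers


def workers_with_aic(workers):
    """Keep only well-formed worker entries that carry their own spec."""
    return [
        w
        for w in workers
        if isinstance(w, dict) and "advanced_instance_config" in w
    ]
```

With both keys normalized up front, later membership checks and loops can treat `head` as a dict and `workers` as a list without extra `None` handling, while the `isinstance` guard still skips malformed individual entries.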
@sai-miduthuri sai-miduthuri requested a review from elliot-barn May 12, 2026 06:53
@sai-miduthuri sai-miduthuri changed the title from "[ci] Only annotate top-level advanced_instance_config with billing tags" to "[ci] Stop billing tags from overriding advanced_instance_config fields" May 12, 2026
@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core devprod labels May 12, 2026
        RELEASE_AWS_RESOURCE_TYPES_TO_TRACK_FOR_BILLING,
    )
    for worker in workers:
        if isinstance(worker, dict) and "advanced_instance_config" in worker:
Collaborator

should we throw an error here if the worker isn't a dict? not sure if we have cluster validation before this

Contributor Author

@sai-miduthuri sai-miduthuri May 12, 2026

We do have cluster validation as a separate step, which runs when the compute configs are loaded into memory for each test through the test collection:

test_collection = read_and_validate_release_test_collection(
    test_collection_file or RELEASE_TEST_CONFIG_FILES,
    test_definition_root,
)

validate_release_test_collection(
    tests,
    schema_file=schema_file,
    test_definition_root=test_definition_root,
)

error = validate_test_cluster_compute(test, test_definition_root)

is_new_schema = test.uses_anyscale_sdk_2026()
cluster_compute = load_test_cluster_compute(test, test_definition_root)
return validate_cluster_compute(cluster_compute, is_new_schema=is_new_schema)

def validate_cluster_compute(

We implicitly ensure that the cluster_compute["head_node"] and cluster_compute["worker_nodes"] are of dict type if they are present in the compute-config.

Though I can add additional checks and throw an error here if it isn't a dict, I think we can remove the isinstance(..., dict) check instead and just assume that it is a dict.

What is your recommendation?

Collaborator

Will leave as is. Approved!

@elliot-barn elliot-barn merged commit 5e2970e into master May 12, 2026
8 checks passed
@elliot-barn elliot-barn deleted the sai-miduthuri/fix-new-schema-aic-annotation branch May 12, 2026 22:05
dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
ray-project#63296)

am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
ray-project#63296)

Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
ray-project#63296)
