[Autoscaler][v2] Fix stopped node metric double counting #62026

Merged
edoakes merged 2 commits into ray-project:master from weimingdiit:fix/v2-metrics-stopped-nodes-transition-count on Apr 24, 2026

@weimingdiit
Contributor

Description

AutoscalerMetricsReporter.report_instances() computes terminated across the full instance snapshot, but it was incrementing stopped_nodes inside the per-node-type reporting loop.

That caused the same terminated transition count to be added once per configured node type instead of once per reporting pass. In a mixed node-type snapshot, autoscaler_stopped_nodes_total could therefore be over-counted.

Move the stopped_nodes increment out of the per-node-type loop so the counter is updated exactly once for each batch of newly terminated instances.
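The over-count described above can be reproduced with a minimal sketch. This is not the actual `AutoscalerMetricsReporter` code: `FakeInstance`, `FakeCounter`, `report_buggy`, and `report_fixed` are hypothetical stand-ins, and the status strings are assumed labels used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeInstance:
    """Hypothetical stand-in for an autoscaler instance record."""
    instance_type: str
    status: str


class FakeCounter:
    """Hypothetical stand-in for a monotonically increasing metric."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int) -> None:
        self.value += amount


def report_buggy(instances: List[FakeInstance], node_types: List[str],
                 stopped_nodes: FakeCounter) -> None:
    # The terminated count covers the full snapshot, not one node type.
    terminated = sum(1 for i in instances if i.status == "TERMINATED")
    for _node_type in node_types:
        # Per-node-type gauges would be reported here.
        stopped_nodes.inc(terminated)  # bug: repeated once per node type


def report_fixed(instances: List[FakeInstance], node_types: List[str],
                 stopped_nodes: FakeCounter) -> None:
    terminated = sum(1 for i in instances if i.status == "TERMINATED")
    for _node_type in node_types:
        pass  # per-node-type gauges only
    stopped_nodes.inc(terminated)  # once per reporting pass


instances = [
    FakeInstance("type_1", "TERMINATED"),
    FakeInstance("type_2", "TERMINATED"),
    FakeInstance("type_1", "RAY_RUNNING"),
]
node_types = ["type_1", "type_2"]

buggy, fixed = FakeCounter(), FakeCounter()
report_buggy(instances, node_types, buggy)
report_fixed(instances, node_types, fixed)
print(buggy.value, fixed.value)  # 4 2
```

With two terminated instances and two configured node types, the in-loop increment adds the snapshot-wide count twice, while the hoisted increment adds it once.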

Related issues

#62025

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request correctly fixes a bug where the autoscaler_stopped_nodes_total metric was being incremented on every reporting cycle for already terminated nodes. The new logic tracks the previous state of instances and only increments the counter when an instance newly transitions to a terminated state. The accompanying test changes effectively validate this new behavior. My review includes one suggestion to improve the readability of the test code.
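The "track previous state, count only new transitions" idea mentioned in this review can be sketched as snapshot diffing. The `TransitionTracker` class and the `TERMINATED` string below are hypothetical names for illustration, not the reporter's real API.

```python
from typing import Dict, Set

TERMINATED = "TERMINATED"  # assumed status label, for illustration only


class TransitionTracker:
    """Hypothetical helper that counts instances newly seen as terminated."""

    def __init__(self) -> None:
        self._seen_terminated: Set[str] = set()
        self.stopped_total = 0

    def observe(self, snapshot: Dict[str, str]) -> int:
        """Take {instance_id: status} and return the number of new terminations."""
        now_terminated = {
            iid for iid, status in snapshot.items() if status == TERMINATED
        }
        newly = now_terminated - self._seen_terminated
        # Keep only ids still present in the snapshot so the set stays bounded.
        self._seen_terminated = now_terminated
        self.stopped_total += len(newly)
        return len(newly)


tracker = TransitionTracker()
first = tracker.observe({"a": "RAY_RUNNING", "b": TERMINATED})
second = tracker.observe({"a": TERMINATED, "b": TERMINATED})
third = tracker.observe({"a": TERMINATED, "b": TERMINATED})
print(first, second, third, tracker.stopped_total)  # 1 1 0 2
```

Reporting the same snapshot repeatedly leaves the total unchanged, which is the property the review describes.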

Comment thread python/ray/autoscaler/v2/tests/test_metrics_reporter.py Outdated
@rueian rueian self-requested a review March 24, 2026 21:51
@weimingdiit weimingdiit marked this pull request as ready for review March 26, 2026 01:46
@weimingdiit weimingdiit requested a review from a team as a code owner March 26, 2026 01:46

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/autoscaler/v2/metrics_reporter.py Outdated
@ray-gardener ray-gardener Bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Mar 26, 2026
@weimingdiit
Contributor Author

Based on the log, the microcheck failure does not seem related to this PR.

@weimingdiit weimingdiit force-pushed the fix/v2-metrics-stopped-nodes-transition-count branch from 06acf13 to bd9b184 Compare March 28, 2026 01:57
@weimingdiit
Contributor Author

Hi @Kunchd @rueian, the PR is ready now. Could you please take another look when you have a chance? I really appreciate it.

Comment thread python/ray/autoscaler/v2/metrics_reporter.py
Comment thread python/ray/autoscaler/v2/metrics_reporter.py Outdated
@weimingdiit weimingdiit force-pushed the fix/v2-metrics-stopped-nodes-transition-count branch 4 times, most recently from f4750d2 to caac07e Compare April 4, 2026 08:12
    ids=cloud_instance_ids, request_id=str(uuid.uuid4())
)
if self._metrics_reporter:
    self._metrics_reporter.inc_stopped_nodes(len(cloud_instance_ids))
Contributor

This code path handles TERMINATING. Can we call inc_stopped_nodes when instances transition to TERMINATED instead?

Contributor Author

@weimingdiit weimingdiit Apr 16, 2026

@rueian Good catch. TERMINATING only means we've issued the async terminate request, so counting stopped_nodes there was premature. I moved the accounting to TERMINATED transitions instead, and only count terminated events with a cloud_instance_id so that canceled queued requests are not included.
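A minimal sketch of the accounting rule described in this reply, under stated assumptions: `StatusUpdate` and `count_stopped` are hypothetical names, and the real change lives in the autoscaler's own update handling rather than in a standalone function like this.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StatusUpdate:
    """Hypothetical status-transition event for one instance."""
    instance_id: str
    new_status: str
    cloud_instance_id: Optional[str] = None


def count_stopped(updates: Iterable[StatusUpdate]) -> int:
    """Count TERMINATED transitions that are backed by a cloud instance."""
    return sum(
        1
        for u in updates
        if u.new_status == "TERMINATED" and u.cloud_instance_id
    )


updates = [
    # Terminate request issued, but the node has not stopped yet: not counted.
    StatusUpdate("i-1", "TERMINATING", "cloud-1"),
    # The node actually stopped: counted.
    StatusUpdate("i-1", "TERMINATED", "cloud-1"),
    # Canceled queued request that never got a cloud instance: not counted.
    StatusUpdate("i-2", "TERMINATED", None),
    StatusUpdate("i-3", "TERMINATED", "cloud-3"),
]
stopped = count_stopped(updates)
print(stopped)  # 2
```

Filtering on `cloud_instance_id` keeps requests that were canceled before launch out of the stopped-node total, and ignoring TERMINATING avoids counting a node before the asynchronous terminate request completes.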

Comment on lines +69 to +79
terminated_instances = instances[:-2] + [
    create_instance(
        terminating_type_1.instance_id,
        status=Instance.TERMINATED,
        instance_type="type_1",
    ),
    create_instance(
        terminating_type_2.instance_id,
        status=Instance.TERMINATED,
        instance_type="type_2",
    ),
Contributor

We are removing the terminated counting from report_instances(). Can we just delete these?

Contributor Author

@weimingdiit weimingdiit Apr 16, 2026

@rueian Thanks for your comment. Yes, those terminated test instances are redundant now. report_instances() no longer exposes any terminated counting, and report_resources() doesn't use terminated instances either. I removed that setup and simplified the test to use the original instances list directly.

@weimingdiit weimingdiit force-pushed the fix/v2-metrics-stopped-nodes-transition-count branch from caac07e to dabc9c5 Compare April 16, 2026 16:12
Signed-off-by: weimingdiit <weimingdiit@gmail.com>
@weimingdiit weimingdiit force-pushed the fix/v2-metrics-stopped-nodes-transition-count branch from dabc9c5 to 559e12c Compare April 16, 2026 16:53
Contributor

@rueian rueian left a comment

Hi @edoakes, please help merge this 🙏

@rueian rueian added the go (add ONLY when ready to merge, run all tests) label Apr 16, 2026
@rueian
Contributor

rueian commented Apr 24, 2026

Hi @edoakes, please help merge this PR 🙏

@edoakes edoakes merged commit 4b32a0e into ray-project:master Apr 24, 2026
6 checks passed
@weimingdiit
Contributor Author

@rueian @edoakes Thanks for your review and merge.

@weimingdiit weimingdiit deleted the fix/v2-metrics-stopped-nodes-transition-count branch April 25, 2026 03:31
Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

3 participants