[Autoscaler][v2] Fix stopped node metric double counting#62026
Conversation
There was a problem hiding this comment.
Code Review
This pull request correctly fixes a bug where the autoscaler_stopped_nodes_total metric was being incremented on every reporting cycle for already terminated nodes. The new logic tracks the previous state of instances and only increments the counter when an instance newly transitions to a terminated state. The accompanying test changes effectively validate this new behavior. My review includes one suggestion to improve the readability of the test code.
|
Based on the log, the microcheck failure does not seem related to this PR. |
06acf13 to
bd9b184
Compare
f4750d2 to
caac07e
Compare
| ids=cloud_instance_ids, request_id=str(uuid.uuid4()) | ||
| ) | ||
| if self._metrics_reporter: | ||
| self._metrics_reporter.inc_stopped_nodes(len(cloud_instance_ids)) |
There was a problem hiding this comment.
Here is for terminating. Can we do inc_stopped_nodes when we turn instances to TERMINATED?
There was a problem hiding this comment.
@rueian Good catch. TERMINATING only means we've issued the async terminate request, so counting stopped_nodes there was premature. I moved the accounting to TERMINATED transitions instead, and only count terminated events with a cloud_instance_id so that canceled queued requests are not included.
| terminated_instances = instances[:-2] + [ | ||
| create_instance( | ||
| terminating_type_1.instance_id, | ||
| status=Instance.TERMINATED, | ||
| instance_type="type_1", | ||
| ), | ||
| create_instance( | ||
| terminating_type_2.instance_id, | ||
| status=Instance.TERMINATED, | ||
| instance_type="type_2", | ||
| ), |
There was a problem hiding this comment.
We are removing the terminated counting from the report_instances. Can we just delete these?
There was a problem hiding this comment.
@rueian Thanks for your comment. Yes, those terminated test instances are redundant now. report_instances() no longer exposes any terminated counting, and report_resources() doesn't use terminated instances either. I removed that setup and simplified the test to use the original instances list directly.
caac07e to
dabc9c5
Compare
Signed-off-by: weimingdiit <weimingdiit@gmail.com>
dabc9c5 to
559e12c
Compare
|
Hi @edoakes, please help merge this PR 🙏 |

Description
AutoscalerMetricsReporter.report_instances()computesterminatedacross the full instance snapshot, but it was incrementingstopped_nodesinside the per-node-type reporting loop.That caused the same terminated transition count to be added once per configured node type instead of once per reporting pass. In a mixed node-type snapshot,
autoscaler_stopped_nodes_totalcould therefore be over-counted.Move the
stopped_nodesincrement out of the per-node-type loop so the counter is updated exactly once for each batch of newly terminated instances.Related issues
#62025