[autoscaler] Additional Autoscaler Metrics #16198
two quick comments!
LGTM!
@ijrsvt ah sorry, I still have the problem where I don't have auth to merge my own PRs. If you're good with this, you can go ahead and merge when you get a chance; otherwise we can wait for more reviewers.
@ckw017 Let's wait and see if @AmeerHajAli wants to say anything!
@@ -332,6 +338,7 @@ def _update(self):
for node_id in nodes:
Should we also track recovering nodes here?
LGTM overall. Left a few minor comments.
Added a member to NodeUpdaterThread that's true if the thread was started to recover an unhealthy node, and metrics:
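A minimal sketch of the idea in the comment above, not the code from this PR: a thread object carries a boolean set at construction time, and the number of recovering nodes is derived from the tracked updaters. `UpdaterThread`, `for_recovery`, and `count_recovering` are hypothetical names used only for illustration.

```python
import threading


class UpdaterThread(threading.Thread):
    """Stand-in for NodeUpdaterThread; `for_recovery` is a hypothetical name."""

    def __init__(self, node_id, for_recovery=False):
        super().__init__(daemon=True)
        self.node_id = node_id
        # True only when the thread was started to recover an unhealthy node.
        self.for_recovery = for_recovery

    def run(self):
        pass  # the real updater would set up or repair the node here


def count_recovering(updaters):
    """Number of tracked updaters that were started for recovery."""
    return sum(1 for u in updaters.values() if u.for_recovery)


# Assumed shape: a dict of node id -> updater thread.
updaters = {
    "node-a": UpdaterThread("node-a"),
    "node-b": UpdaterThread("node-b", for_recovery=True),
}
recovering_nodes = count_recovering(updaters)
```

A value like `recovering_nodes` could then be reported through a gauge alongside the other autoscaler metrics.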
Why are these changes needed?
Followup to #16066. Adds the following metrics:
- worker_launch_time (histogram), measured as time for provider.create_node to execute
- worker_update_time (histogram), measured as time for NodeLauncherThread.run() to successfully execute
- updating_nodes (gauge), length of StandardAutoscaler.updaters
- failed_create_nodes (counter), incremented by the number of nodes in the batch whenever provider.create_node throws an exception
- failed_updates (counter), tracks StandardAutoscaler.num_failed_updates
- successful_updates (counter), tracks StandardAutoscaler.num_successful_updates
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.
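To illustrate how the metrics in the description might be recorded, here is a rough sketch rather than the PR's actual implementation. `Counter`, `Gauge`, and `Histogram` are tiny stand-ins for a real metrics library, `launch_batch` and `FlakyProvider` are hypothetical, and the `create_node(node_config, tags, count)` call shape is an assumption.

```python
import time


class Counter:
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class Gauge:
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    def __init__(self):
        self.observations = []

    def observe(self, value):
        self.observations.append(value)


class Metrics:
    def __init__(self):
        self.worker_launch_time = Histogram()
        self.updating_nodes = Gauge()
        self.failed_create_nodes = Counter()


def launch_batch(provider, node_config, count, metrics):
    """Time provider.create_node; on failure count every node in the batch."""
    start = time.monotonic()
    try:
        provider.create_node(node_config, {}, count)
    except Exception:
        # The whole batch failed, so increment by the batch size.
        metrics.failed_create_nodes.inc(count)
        return False
    metrics.worker_launch_time.observe(time.monotonic() - start)
    return True


class FlakyProvider:
    """Hypothetical provider whose create_node fails for batches over 2."""

    def create_node(self, node_config, tags, count):
        if count > 2:
            raise RuntimeError("quota exceeded")


metrics = Metrics()
provider = FlakyProvider()
ok = launch_batch(provider, {}, 2, metrics)
failed = launch_batch(provider, {}, 3, metrics)
# updating_nodes mirrors the number of active updaters (one here).
metrics.updating_nodes.set(len({"node-a": object()}))
```

In this sketch the launch time is observed only for successful calls, and a failed batch of three nodes adds three to `failed_create_nodes`.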