[autoscaler] Additional Autoscaler Metrics #16198

Merged
merged 8 commits into from
Jun 5, 2021


ckw017
Member

@ckw017 ckw017 commented Jun 2, 2021

Why are these changes needed?

Follow-up to #16066. Adds the following metrics:

  • worker_launch_time (histogram), measured as time for provider.create_node to execute
  • worker_update_time (histogram), measured as time for NodeLauncherThread.run() to successfully execute
  • updating_nodes (gauge), length of StandardAutoscaler.updaters
  • failed_create_nodes (counter), incremented by the number of nodes in the batch whenever provider.create_node throws an exception
  • failed_updates (counter), tracks StandardAutoscaler.num_failed_updates
  • successful_updates (counter), tracks StandardAutoscaler.num_successful_updates
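
To illustrate how worker_launch_time and failed_create_nodes relate, here is a minimal sketch using only the standard library. The `Counter` and `Histogram` classes, `launch_batch`, and `FlakyProvider` are hypothetical stand-ins written for this example, not the code in this PR. The sketch also assumes that a launch time is recorded only when `create_node` succeeds.

```python
import time


class Counter:
    """Hypothetical stand-in for a counter metric (only ever increases)."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class Histogram:
    """Hypothetical stand-in for a histogram metric (stores raw samples)."""

    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)


worker_launch_time = Histogram()
failed_create_nodes = Counter()


def launch_batch(provider, node_config, tags, count):
    """Time provider.create_node; on failure, count every node in the batch."""
    start = time.monotonic()
    try:
        provider.create_node(node_config, tags, count)
    except Exception:
        # The whole batch failed, so the counter grows by the batch size.
        failed_create_nodes.inc(count)
        return False
    # Assumption: only successful calls contribute a launch-time sample.
    worker_launch_time.observe(time.monotonic() - start)
    return True


class FlakyProvider:
    """Toy provider whose create_node fails for batches larger than 2."""

    def create_node(self, node_config, tags, count):
        if count > 2:
            raise RuntimeError("quota exceeded")


provider = FlakyProvider()
launch_batch(provider, {}, {}, 2)  # succeeds: one launch-time sample
launch_batch(provider, {}, {}, 3)  # fails: counter goes up by 3
```

Counting by batch size rather than by failed call keeps the counter comparable to the number of nodes that were requested but never created.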

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@ijrsvt ijrsvt left a comment

two quick comments!

Contributor

@ijrsvt ijrsvt left a comment

LGTM!

Member Author

ckw017 commented Jun 3, 2021

@ijrsvt ah sorry, I still have the problem where I don't have auth to merge my own PRs. If you're good with this, you can go ahead and merge; otherwise we can wait for more reviewers.

Contributor

ijrsvt commented Jun 3, 2021

@ckw017 Let's wait and see if @AmeerHajAli wants to say anything!

@@ -332,6 +338,7 @@ def _update(self):
for node_id in nodes:
Contributor

Should we add tracking of recovering nodes here?

Contributor

@AmeerHajAli AmeerHajAli left a comment

LGTM overall. Left a few minor comments.

@AmeerHajAli AmeerHajAli added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 3, 2021
@ckw017 ckw017 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 3, 2021
Member Author

ckw017 commented Jun 3, 2021

Added a member to NodeUpdaterThread that's true if the thread was started to recover an unhealthy node, and added these metrics:

  • recovering_nodes (gauge): set to number of updaters with the flag set at the end of the update loop
  • successful_recoveries (counter): incremented whenever an updater with flag set exits normally
  • failed_recoveries (counter): incremented whenever an updater with flag set exits with a problem
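
The recovery metrics above can be sketched as follows. This is a simplified illustration, not the PR's implementation: `UpdaterSketch`, its `for_recovery` flag and `exitcode` attribute, and the `record` helper are hypothetical names chosen for this example, and plain dictionary entries stand in for the real gauge and counters.

```python
import threading


class UpdaterSketch(threading.Thread):
    """Hypothetical updater thread carrying a recovery flag."""

    def __init__(self, work, for_recovery=False):
        super().__init__()
        self.work = work
        self.for_recovery = for_recovery  # True if started to recover a node
        self.exitcode = -1  # -1 while running, 0 on success, 1 on failure

    def run(self):
        try:
            self.work()
        except Exception:
            self.exitcode = 1
            return
        self.exitcode = 0


def record(updaters, metrics):
    """Tally finished recovery updaters, then set the gauge from live ones."""
    still_running = []
    for updater in updaters:
        if updater.is_alive():
            still_running.append(updater)
        elif updater.for_recovery:
            if updater.exitcode == 0:
                metrics["successful_recoveries"] += 1
            else:
                metrics["failed_recoveries"] += 1
    # Gauge: set (not incremented) at the end of each pass.
    metrics["recovering_nodes"] = sum(u.for_recovery for u in still_running)
    return still_running


def ok():
    pass


def boom():
    raise RuntimeError("setup failed")


release = threading.Event()
updaters = [
    UpdaterSketch(ok, for_recovery=True),            # recovery that succeeds
    UpdaterSketch(boom, for_recovery=True),          # recovery that fails
    UpdaterSketch(release.wait, for_recovery=True),  # recovery still running
    UpdaterSketch(ok),                               # ordinary update, ignored
]
for updater in updaters:
    updater.start()
for updater in (updaters[0], updaters[1], updaters[3]):
    updater.join()

metrics = {"recovering_nodes": 0, "successful_recoveries": 0, "failed_recoveries": 0}
pending = record(updaters, metrics)

release.set()  # let the blocked updater finish so the program exits cleanly
for updater in pending:
    updater.join()
```

The gauge is overwritten on every pass because it describes current state, while the two counters only ever increase.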

@ijrsvt ijrsvt changed the title [autoscaler] additional autoscaler metrics [autoscaler] Additional Autoscaler Metrics Jun 5, 2021
@ijrsvt ijrsvt merged commit 2e11ac6 into ray-project:master Jun 5, 2021
mwtian pushed a commit that referenced this pull request Jun 5, 2021
3 participants