
[autoscaler] Fewer non terminated nodes calls #20359

Conversation

@DmitriGekhtman (Contributor) commented Nov 15, 2021

Why are these changes needed?

non_terminated_nodes calls are expensive for some node provider implementations.

This PR refactors autoscaler._update() such that it results in at most one non_terminated_nodes call.
Conceptually, the change is that the autoscaler only needs a consistent view of the world once per update interval.

The structure of an autoscaler update is now (a code sketch of this flow follows the list):

  • call non_terminated_nodes to update internal state
  • update autoscaler status strings
  • terminate nodes we don't need, removing them from internal state as we go
  • run node updaters if needed
  • get nodes to launch based on internal state
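A minimal sketch of this flow, with hypothetical helper names; the steps mirror the list above, but none of these method names are the actual Ray autoscaler internals:

    def _update(self):
        # 1. Exactly one provider listing per update: snapshot the
        #    provider's view of the world for this interval.
        self.non_terminated_nodes = self.provider.non_terminated_nodes(
            tag_filters={})

        # 2. Update autoscaler status strings from the snapshot.
        self._update_status_strings()

        # 3. Terminate nodes we don't need, pruning them from the
        #    snapshot as we go so later steps see a consistent view.
        self._terminate_unneeded_nodes()

        # 4. Run node updaters (e.g. SSH-based setup) if needed.
        self._run_node_updaters()

        # 5. Decide which nodes to launch based on internal state.
        self._launch_required_nodes()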

There's a small operational difference introduced:
Previously -- After a node is created, its NodeUpdater thread is initiated immediately.
Now -- After a node is created, its NodeUpdater thread is initiated in the next autoscaler update.

This typically will not introduce latency, since the time to get SSH access (a few minutes) is much longer than the autoscaler update interval (5 seconds by default).

Along the way, I've removed the local_ip initialization parameter for LoadMetrics because it was confusing and not useful (and caused some tests to fail).

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests: updated existing autoscaler tests to verify that non_terminated_nodes is called only once per autoscaler._update (a sketch of such an assertion follows this checklist).
    • Release tests
    • This PR is not tested :(
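A minimal sketch of how a test could assert the call count, assuming a mock provider; make_autoscaler is an illustrative helper, not part of Ray's test suite:

    from unittest.mock import MagicMock

    def test_update_calls_non_terminated_nodes_once():
        provider = MagicMock()
        provider.non_terminated_nodes.return_value = []
        autoscaler = make_autoscaler(provider)  # illustrative setup helper

        autoscaler.update()

        # The core invariant of this PR: at most one provider listing
        # per autoscaler update.
        assert provider.non_terminated_nodes.call_count == 1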

@DmitriGekhtman (Contributor Author)

Please review; it would be good to have this ready soon.

@wuisawesome (Contributor) left a comment

nice! this logic feels a lot simpler.

@ericl ericl removed their assignment Nov 16, 2021
@DmitriGekhtman (Contributor Author)

@wuisawesome thanks for reviewing.

Now I just need to make sure CI builds the wheels...

Comment on lines 257 to 259:

    # Prune load-metrics state for IPs that no longer belong to a live
    # node (head or worker) reported by the node provider.
    self.load_metrics.prune_active_ips(
        [self.provider.internal_ip(node_id) for node_id in self.all_nodes])

Contributor:

why self.all_nodes instead of self.all_workers?

@DmitriGekhtman (Contributor Author):

because the head node is not a worker node

@DmitriGekhtman (Contributor Author):

Oh, I see.

@DmitriGekhtman (Contributor Author) commented Nov 16, 2021:

Misinterpreted the question.

Because previously the load-metrics method secretly hacked in the head node IP, leading to weird bugs when writing new node providers.
I've removed the head-IP instance variable from LoadMetrics; the head's IP is now passed in here, along with the rest of the node IPs.

Contributor:

But won't it always be active?

@DmitriGekhtman (Contributor Author) commented Nov 17, 2021:

Yes. Setting the head IP active inside LoadMetrics saves us the redundancy of passing it in each time, but it makes the code brittle: there have been bugs from setting the head IP differently when initializing LoadMetrics than when reading it from the node provider. A sketch of the before/after follows.
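A minimal sketch of the change being discussed; this simplified LoadMetrics is illustrative, not the actual Ray class:

    # Before (sketch): LoadMetrics remembers the head IP at construction
    # time, so pruning silently keeps it active even if the provider's
    # view of the head IP drifts from the value captured here.
    class LoadMetricsBefore:
        def __init__(self, local_ip):
            self.head_ip = local_ip
            self.active_ips = set()

        def prune_active_ips(self, worker_ips):
            self.active_ips = set(worker_ips) | {self.head_ip}

    # After (sketch): no head-IP state; the caller passes every live IP,
    # head included, on each update, so there is one source of truth.
    class LoadMetricsAfter:
        def __init__(self):
            self.active_ips = set()

        def prune_active_ips(self, all_node_ips):
            self.active_ips = set(all_node_ips)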

@DmitriGekhtman DmitriGekhtman added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. do-not-merge Do not merge this PR! labels Nov 18, 2021
@DmitriGekhtman (Contributor Author)

Requires e2e testing and deliberation before merge.

@AmeerHajAli (Contributor) left a comment

LGTM. Let's run the necessary testing and merge.

@DmitriGekhtman (Contributor Author)

> LGTM. Let's run the necessary testing and merge.

We need to come to a consensus on what the necessary testing is.
Let me solicit feedback.

@DmitriGekhtman (Contributor Author)

> on what the necessary testing is

...Should at least pass CI :)
Fixed and tested a serialization bug coming from changing named tuples to dataclasses.

@DmitriGekhtman DmitriGekhtman force-pushed the autoscaler-fewer-non-terminated-nodes-calls branch from e2d47e4 to 2c661da Compare November 19, 2021 04:45
@DmitriGekhtman (Contributor Author)

manual testing in progress

@DmitriGekhtman (Contributor Author) commented Nov 19, 2021

Passes:

  • CI (which reflects the history of all bugs we've seen)
  • the basic tests we have of the OSS Ray operator
  • sanity checks on the AWS Node Provider and GCP Node Provider
  • the test stack of another Ray operator

I'd say that's good enough (the best we can do).

(Need to re-run CI for the Mac build, though.)

@AmeerHajAli (Contributor)

This is great @DmitriGekhtman , thanks for pushing this very close to the finish line!

@DmitriGekhtman (Contributor Author) commented Nov 19, 2021

Oh, actually the Mac build looks like it's running; will let that do its thing.

@DmitriGekhtman DmitriGekhtman removed the do-not-merge Do not merge this PR! label Nov 19, 2021
@waleedkadous (Contributor)

Can you shed more light on the sanity checks you have in mind? Can we do something more intensive (e.g. run nightly autoscaler tests against AWS and GCP)? Ideally we would do this for Azure as well.

@DmitriGekhtman (Contributor Author)

Following up offline.

@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 20, 2021
@DmitriGekhtman (Contributor Author) commented Nov 20, 2021

> Can we do something more intensive (e.g. run nightly autoscaler tests against AWS and GCP)?

While there aren't currently nightly autoscaler tests, what we can do is schedule some Tune runs to trigger brief upscaling to, say, 20 GPU workers and 100 CPU workers; a sketch of such a run follows. That can be repeated for AWS and GCP.
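A minimal sketch of the kind of Tune run that would drive this upscaling; the numbers and trial body are illustrative assumptions, not the actual test script:

    import time

    from ray import tune

    def trial(config):
        # Hold the requested resources long enough for the
        # autoscaler to scale the cluster up.
        time.sleep(300)

    # Each trial requests one GPU, so 20 concurrent samples should
    # push the cluster toward ~20 GPU workers (on 1-GPU nodes).
    tune.run(
        trial,
        num_samples=20,
        resources_per_trial={"cpu": 1, "gpu": 1},
    )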

@DmitriGekhtman (Contributor Author) commented Nov 21, 2021

I ran an experiment along the lines suggested in the last comment, as well as the nightly decision tree example with various tree depths.
Merging, as I see no degradation of performance against master on either AWS or GCP.

@DmitriGekhtman DmitriGekhtman merged commit 0f70f40 into ray-project:master Nov 21, 2021
@rkooo567 (Contributor)

When I ran a commit after this one, I saw:

(base) ray@ip-10-0-0-71:~/sang-nightly-large-disk-test% ray status
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1989, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1534, in status
    print(debug_status(status, error))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 110, in debug_status
    lm_summary = LoadMetricsSummary(**as_dict["load_metrics_report"])
TypeError: __init__() got an unexpected keyword argument 'head_ip'

Do you think this could be related to this PR?

@DmitriGekhtman (Contributor Author)

Yes, will submit a fix momentarily... Not sure how I missed that.

@DmitriGekhtman (Contributor Author)

I see, it's an old autoscaler <-> new Ray compatibility thing -- fix incoming.

@DmitriGekhtman (Contributor Author)

#20623

DmitriGekhtman added a commit that referenced this pull request Nov 22, 2021

Running ray status with the changes from #20359 while running an autoscaler older than those changes results in an error on input "head_ip" to LoadMetricsSummary. See #20359 (comment).
This PR fixes the bug by restoring head_ip as an optional parameter of LoadMetricsSummary.
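A minimal sketch of what restoring the optional parameter looks like, assuming LoadMetricsSummary is a dataclass; fields other than head_ip are illustrative:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class LoadMetricsSummary:
        # Illustrative field; the real class carries more state.
        usage: Dict[str, float] = field(default_factory=dict)

        # Restored as optional so reports produced by an older
        # autoscaler (which still include "head_ip") deserialize
        # cleanly via LoadMetricsSummary(**report_dict).
        head_ip: Optional[str] = None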