
[autoscaler] Fewer non terminated nodes calls #20359

Conversation

@DmitriGekhtman (Contributor) commented Nov 15, 2021

Why are these changes needed?

non_terminated_nodes calls are expensive for some node provider implementations.

This PR refactors autoscaler._update() such that it results in at most one non_terminated_nodes call.
Conceptually, the change is that the autoscaler only needs a consistent view of the world once per update interval.

The structure of an autoscaler update is now (a code sketch of this flow follows the list):

  • call non_terminated_nodes to update internal state
  • update autoscaler status strings
  • terminate nodes we don't need, removing them from internal state as we go
  • run node updaters if needed
  • get nodes to launch based on internal state
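A minimal sketch of this flow, with hypothetical helper names; the steps mirror the list above, but none of these method names are the actual Ray autoscaler internals:

    def _update(self):
        # 1. Exactly one provider listing per update: snapshot the
        #    provider's view of the world for this interval.
        self.non_terminated_nodes = self.provider.non_terminated_nodes(
            tag_filters={})

        # 2. Update autoscaler status strings from the snapshot.
        self._update_status_strings()

        # 3. Terminate nodes we don't need, pruning them from the
        #    snapshot as we go so later steps see a consistent view.
        self._terminate_unneeded_nodes()

        # 4. Run node updaters (e.g. SSH-based setup) if needed.
        self._run_node_updaters()

        # 5. Decide which nodes to launch based on internal state.
        self._launch_required_nodes()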

There's a small operational difference introduced:
Previously -- After a node is created, its NodeUpdater thread is initiated immediately.
Now -- After a node is created, its NodeUpdater thread is initiated in the next autoscaler update.

This typically will not introduce latency, since the time to get SSH access (a few minutes) is much longer than the autoscaler update interval (5 seconds by default).

Along the way, I've removed the local_ip initialization parameter for LoadMetrics because it was confusing and not useful (and caused some tests to fail).

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests: updated existing autoscaler tests to verify that non_terminated_nodes is called only once per autoscaler._update (a sketch of such an assertion follows this checklist).
    • Release tests
    • This PR is not tested :(
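A minimal sketch of how a test could assert the call count, assuming a mock provider; make_autoscaler is an illustrative helper, not part of Ray's test suite:

    from unittest.mock import MagicMock

    def test_update_calls_non_terminated_nodes_once():
        provider = MagicMock()
        provider.non_terminated_nodes.return_value = []
        autoscaler = make_autoscaler(provider)  # illustrative setup helper

        autoscaler.update()

        # The core invariant of this PR: at most one provider listing
        # per autoscaler update.
        assert provider.non_terminated_nodes.call_count == 1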

@DmitriGekhtman (Contributor Author)

Please review; it would be good to have this ready soon.

@wuisawesome (Contributor) left a comment

nice! this logic feels a lot simpler.

@ericl ericl removed their assignment Nov 16, 2021
@DmitriGekhtman (Contributor Author)

@wuisawesome thanks for reviewing.

Now I just need to make sure CI builds the wheels...

Comment on lines 257 to 259:

    # Prune load-metrics state for IPs that no longer belong to a live
    # node (head or worker) reported by the node provider.
    self.load_metrics.prune_active_ips(
        [self.provider.internal_ip(node_id) for node_id in self.all_nodes])

Contributor:

why self.all_nodes instead of self.all_workers?

@DmitriGekhtman (Contributor Author):

because the head node is not a worker node

@DmitriGekhtman (Contributor Author):

Oh, I see.

@DmitriGekhtman (Contributor Author) commented Nov 16, 2021:

Misinterpreted the question.

Because previously the load-metrics method secretly hacked in the head node IP, leading to weird bugs when writing new node providers.
I've removed the head-IP instance variable from LoadMetrics; the head's IP is now passed in here, along with the rest of the node IPs.

Contributor:

But won't it always be active?

@DmitriGekhtman (Contributor Author) commented Nov 17, 2021:

Yes. Setting the head IP active inside LoadMetrics saves us the redundancy of passing it in each time, but it makes the code brittle: there have been bugs from setting the head IP differently when initializing LoadMetrics than when reading it from the node provider. A sketch of the before/after follows.
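A minimal sketch of the change being discussed; this simplified LoadMetrics is illustrative, not the actual Ray class:

    # Before (sketch): LoadMetrics remembers the head IP at construction
    # time, so pruning silently keeps it active even if the provider's
    # view of the head IP drifts from the value captured here.
    class LoadMetricsBefore:
        def __init__(self, local_ip):
            self.head_ip = local_ip
            self.active_ips = set()

        def prune_active_ips(self, worker_ips):
            self.active_ips = set(worker_ips) | {self.head_ip}

    # After (sketch): no head-IP state; the caller passes every live IP,
    # head included, on each update, so there is one source of truth.
    class LoadMetricsAfter:
        def __init__(self):
            self.active_ips = set()

        def prune_active_ips(self, all_node_ips):
            self.active_ips = set(all_node_ips)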

@DmitriGekhtman DmitriGekhtman added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. do-not-merge Do not merge this PR! labels Nov 18, 2021
@DmitriGekhtman (Contributor Author)

Requires e2e testing and deliberation before merge.

@AmeerHajAli (Contributor) left a comment

LGTM. Let's run the necessary testing and merge.

@DmitriGekhtman (Contributor Author)

> LGTM. Let's run the necessary testing and merge.

We need to come to a consensus on what the necessary testing is.
Let me solicit feedback.

@DmitriGekhtman (Contributor Author)

> on what the necessary testing is

...Should at least pass CI :)
Fixed and tested a serialization bug coming from changing named tuples to dataclasses.

@DmitriGekhtman DmitriGekhtman force-pushed the autoscaler-fewer-non-terminated-nodes-calls branch from e2d47e4 to 2c661da Compare November 19, 2021 04:45
@DmitriGekhtman (Contributor Author)

manual testing in progress

@DmitriGekhtman (Contributor Author) commented Nov 19, 2021

Passes:

  • CI (which reflects the history of all bugs we've seen)
  • the basic tests we have of the OSS Ray operator
  • sanity checks on the AWS Node Provider and GCP Node Provider
  • the test stack of another Ray operator

I'd say that's good enough (the best we can do).

(Need to re-run CI for the Mac build, though.)

@AmeerHajAli (Contributor)

This is great @DmitriGekhtman , thanks for pushing this very close to the finish line!

@DmitriGekhtman (Contributor Author) commented Nov 19, 2021

Oh, actually the Mac build looks like it's running; will let that do its thing.

@DmitriGekhtman DmitriGekhtman removed the do-not-merge Do not merge this PR! label Nov 19, 2021
@waleedkadous (Contributor)

Can you shed more light on the sanity checks you have in mind? Can we do something more intensive (e.g. run nightly autoscaler tests against AWS and GCP)? Ideally we would do this for Azure as well.

@DmitriGekhtman (Contributor Author)

Following up offline.

@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 20, 2021
@DmitriGekhtman (Contributor Author) commented Nov 20, 2021

> Can we do something more intensive (e.g. run nightly autoscaler tests against AWS and GCP)?

While there aren't currently nightly autoscaler tests, what we can do is schedule some Tune runs to trigger brief upscaling to, say, 20 GPU workers and 100 CPU workers; a sketch of such a run follows. That can be repeated for AWS and GCP.
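A minimal sketch of the kind of Tune run that would drive this upscaling; the numbers and trial body are illustrative assumptions, not the actual test script:

    import time

    from ray import tune

    def trial(config):
        # Hold the requested resources long enough for the
        # autoscaler to scale the cluster up.
        time.sleep(300)

    # Each trial requests one GPU, so 20 concurrent samples should
    # push the cluster toward ~20 GPU workers (on 1-GPU nodes).
    tune.run(
        trial,
        num_samples=20,
        resources_per_trial={"cpu": 1, "gpu": 1},
    )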

@DmitriGekhtman (Contributor Author) commented Nov 21, 2021

I ran an experiment along the lines suggested in the last comment, as well as the nightly decision tree example with various tree depths.
Merging, as I see no degradation of performance against master on either AWS or GCP.

@DmitriGekhtman DmitriGekhtman merged commit 0f70f40 into ray-project:master Nov 21, 2021
@rkooo567 (Contributor)

When I ran a commit after this one, I saw:

(base) ray@ip-10-0-0-71:~/sang-nightly-large-disk-test% ray status
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1989, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1534, in status
    print(debug_status(status, error))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 110, in debug_status
    lm_summary = LoadMetricsSummary(**as_dict["load_metrics_report"])
TypeError: __init__() got an unexpected keyword argument 'head_ip'

Do you think this could be related to this PR?

@DmitriGekhtman (Contributor Author)

Yes, will submit a fix momentarily... Not sure how I missed that.

@DmitriGekhtman (Contributor Author)

I see, it's an old autoscaler <-> new Ray compatibility thing -- fix incoming.

@DmitriGekhtman (Contributor Author)

#20623

DmitriGekhtman added a commit that referenced this pull request Nov 22, 2021

Running ray status with the changes from #20359 while running an autoscaler older than those changes results in an error on input "head_ip" to LoadMetricsSummary. See #20359 (comment).
This PR fixes the bug by restoring head_ip as an optional parameter of LoadMetricsSummary.
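A minimal sketch of what restoring the optional parameter looks like, assuming LoadMetricsSummary is a dataclass; fields other than head_ip are illustrative:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class LoadMetricsSummary:
        # Illustrative field; the real class carries more state.
        usage: Dict[str, float] = field(default_factory=dict)

        # Restored as optional so reports produced by an older
        # autoscaler (which still include "head_ip") deserialize
        # cleanly via LoadMetricsSummary(**report_dict).
        head_ip: Optional[str] = None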