Fix edge case in autoscaler with poor bin packing #5702

ericl · 2019-09-13T05:37:57Z

Why are these changes needed?

When the cluster is backlogged, raise the "estimated busy nodes" count to the number of non-idle nodes + 1. This covers the case where the head node cannot accept tasks and so appears idle and, more generally, the case where nodes don't register as fully utilized because of poor bin packing even though tasks are backlogged somewhere in the cluster.

One side effect is that we are slightly more aggressive at scaling up, but this is probably ok.
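
A minimal sketch of the rule described above, not the code from this PR. The function name `estimate_busy_nodes`, its parameters, and the per-node utilization list are assumptions made for illustration; only the "non-idle nodes + 1" rule comes from the description.

```python
from typing import List


def estimate_busy_nodes(utilization: List[float], backlogged: bool) -> float:
    """Hypothetical helper: estimate how many nodes are busy.

    utilization holds one fraction in [0, 1] per node (0 means idle).
    backlogged is True when any node reports queued tasks.
    """
    # Baseline estimate: sum of per-node utilization fractions.
    nodes_used = float(sum(utilization))
    # Count the nodes that are doing any work at all.
    num_nonidle = sum(1 for frac in utilization if frac > 0)
    if backlogged:
        # Treat every non-idle node as fully busy, plus one extra node to
        # cover a head node that cannot accept tasks and so looks idle.
        nodes_used = max(nodes_used, float(num_nonidle + 1))
    return nodes_used


# Idle head node plus two half-used workers.
without_backlog = estimate_busy_nodes([0.0, 0.5, 0.5], backlogged=False)
with_backlog = estimate_busy_nodes([0.0, 0.5, 0.5], backlogged=True)
```

In this example the two half-used workers alone give an estimate of one busy node, while the same cluster with a backlog is counted as three busy nodes, which is what makes scale-up slightly more aggressive.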

Related issue number

Closes #5696

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.

AmplabJenkins · 2019-09-13T08:54:44Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17008/
Test PASSed.

AmplabJenkins · 2019-09-13T11:09:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17009/
Test FAILed.

edoakes

Fixing some of the unnecessary output too, nice :)

LGTM assuming you've validated manually.

edoakes · 2019-09-13T17:07:58Z

python/ray/autoscaler/autoscaler.py

+            if max_frac > 0:
+                num_nonidle += 1
+
+        # If any nodes have a queue buildup, assume all non-idle nodes are 100%


This seems very aggressive. For example, nearly the whole cluster could be idle while a single node has a queue (could happen if there were many short-lived tasks scheduled on other nodes?). Maybe we should make it a bit more of a heuristic instead of this hard rule. Seems like a fine fix for now, though.

ericl

This won't trigger an up-scaling event unless there is at least "some" load on almost every node. So the case you mentioned is probably OK.

But yeah agreed we could be smarter, perhaps by looking at the queue size and extrapolating the load.
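
To make the exchange above concrete, here is a rough sketch of why a mostly idle cluster with one queued node would not scale up. It assumes a scale-up rule that keeps estimated usage at or below a target fraction of capacity; the function `wants_scale_up`, the 0.8 target, and the 10-node cluster are illustrative assumptions, not taken from this PR.

```python
import math


def wants_scale_up(num_nonidle: int, total_nodes: int,
                   target_frac: float = 0.8) -> bool:
    """Hypothetical check: does a backlogged cluster ask for more nodes?"""
    # With a backlog, the estimate is the non-idle node count plus one.
    nodes_used = num_nonidle + 1
    # Assumed rule: keep estimated usage at or below target_frac of capacity.
    ideal_nodes = math.ceil(nodes_used / target_frac)
    return ideal_nodes > total_nodes


# One queued node in an otherwise idle 10-node cluster.
mostly_idle = wants_scale_up(num_nonidle=1, total_nodes=10)
# Nine of ten nodes carrying some load, with a backlog somewhere.
mostly_busy = wants_scale_up(num_nonidle=9, total_nodes=10)
```

Under these assumptions the single queued node yields an estimate of 2 busy nodes, well under the 10 available, so nothing is launched. Only when most nodes carry some load does the estimate exceed what the cluster can hold.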

fix edge case

e9ec828

ericl assigned edoakes Sep 13, 2019

fix for general case

0089a97

ericl changed the title ~~Fix edge case in autoscaler with small head nodes~~ Fix edge case in autoscaler with poor bin packing Sep 13, 2019

edoakes approved these changes Sep 13, 2019

ericl merged commit 3ed18d0 into ray-project:master Sep 13, 2019
