[autoscaler] Remove legacy autoscaler #11802

Merged: 115 commits merged into ray-project:master on Nov 11, 2020

Conversation

AmeerHajAli (Contributor) commented Nov 4, 2020:

This PR removes the legacy autoscaler code.
Along the way, many bugs in the original resource demand autoscaler were detected and fixed: for example, handling a modified config file, handling a legacy YAML that provides "Resources", a bug from #11551 (comment), etc.

Multiple tests were also modified to reflect the new autoscaler.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

if upscaling_speed:
    upscaling_speed = float(upscaling_speed)
elif aggressive:
    upscaling_speed = float(self.config["max_workers"])
ericl (Contributor) commented Nov 11, 2020:

This is somewhat different from the original notion of aggressive, which scales to max regardless of load. Can you add a TODO to maybe support this in the future?

We would need an option like initial_upscaling_num_workers (currently hardcoded to 5). We can decide whether this is needed based on user feedback.

AmeerHajAli (Contributor, Author):

Just wanted to clarify that setting upscaling_speed to max_workers is equivalent to setting it to infinity. Does that make sense?

AmeerHajAli (Contributor, Author):

I am actually confused by what you mean by the "original notion of aggressiveness". In the legacy autoscaler it just means that in aggressive mode we upscale to initial_workers when there is load (not to max regardless of load).

What you propose as an option seems orthogonal to scaling to max regardless of load. I added a TODO for initial_upscaling_num_workers.

AmeerHajAli (Contributor, Author):

Btw if you set upscaling_speed to initial_upscaling_num_workers, at the first scale-up (when only the head node is there) you actually get initial_upscaling_num_workers. But I guess someone might want them to have different values. I added a TODO for now.

AmeerHajAli (Contributor, Author):

Now that I think about it, why do we actually need initial_upscaling_num_workers? What new behavior would it introduce that is not already covered by upscaling_speed? At the end of the day both are constrained by demand, and the cluster will eventually scale to the same size regardless of upscaling_speed and initial_upscaling_num_workers.
It feels redundant to me.

Contributor:

Aggressive used to immediately scale to max regardless of demand. We don't do that any more since we always exactly match demand.

ericl (Contributor) commented Nov 11, 2020:

Re: the second question, cluster size will be y = (1 + b) * speed^t; the initial upscaling constant b affects how long it takes to reach a given size. It could be annoying to wait for slow autoscaling if you have huge clusters.

However, you can arguably just set min_workers higher in that case.
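To make the trade-off concrete, here is a rough sketch of the growth behavior being discussed (illustrative only, not the actual autoscaler code; it assumes each round may add up to upscaling_speed times the current worker count, with a floor of 5 new nodes, the hardcoded initial mentioned above, and is always bounded by outstanding demand):

# Rough sketch (not the actual autoscaler code) of how upscaling_speed and the
# hardcoded initial floor of 5 interact with demand. Assumes each round the
# scheduler may add up to max(5, upscaling_speed * current_workers) new nodes,
# bounded by the remaining demand.
def rounds_to_reach(demand_workers, upscaling_speed, initial_floor=5):
    workers, rounds = 0, 0
    while workers < demand_workers:
        cap = max(initial_floor, int(upscaling_speed * workers))
        workers += min(cap, demand_workers - workers)
        rounds += 1
    return rounds

# upscaling_speed set to max_workers behaves like "infinite" speed:
print(rounds_to_reach(100, upscaling_speed=100))  # 2 rounds: 5, then 95
# upscaling_speed = 1.0 roughly doubles the cluster each round:
print(rounds_to_reach(100, upscaling_speed=1.0))  # 6 rounds: 5, 10, 20, 40, 80, 100

Under this model, an initial boost (whether via a higher upscaling_speed or an initial_upscaling_num_workers option) only changes how quickly the cluster ramps up, not the final size it converges to, which is the point being debated above.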

config["available_node_types"] = {
NODE_TYPE_LEGACY_HEAD: {
"node_config": config["head_node"],
"resources": {},
"resources": config["head_node"].get("Resources") or {},
Contributor:

Why is Resources capitalized here?

AmeerHajAli (Contributor, Author):

It seems like this is the use case previously used in the tests:

config["worker_nodes"] = {"Resources": {"CPU": cores_per_node}}

and also in the legacy autoscaler:

cores_per_worker = self.config["worker_nodes"]["Resources"][

I was also confused about why it is capitalized.
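For reference, the two config shapes in question look roughly like this (a sketch based on the snippets above; the NODE_TYPE_LEGACY_HEAD value and the "InstanceType" node_config are placeholders for illustration):

# Sketch of the two config shapes being discussed (based on the snippets above).
NODE_TYPE_LEGACY_HEAD = "legacy-head-node-type"  # placeholder value for illustration

# Legacy-style config, as used in the old tests and legacy autoscaler,
# with a capitalized "Resources" key:
legacy_config = {
    "head_node": {"InstanceType": "m4.large"},   # hypothetical node_config
    "worker_nodes": {"Resources": {"CPU": 4}},
}

# New-style config for the resource demand scheduler, where each entry in
# available_node_types carries a lowercase "resources" dict:
new_config = {
    "available_node_types": {
        NODE_TYPE_LEGACY_HEAD: {
            "node_config": legacy_config["head_node"],
            "resources": {},   # lowercase; left empty and inferred/filled later
        },
    },
}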

Contributor:

Removed this. The capitalized "Resources" is clearly wrong.

ericl (Contributor) left a comment:

LGTM, last round of comments. Looking forward to trying this out!

ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Nov 11, 2020
    upscaling_speed = 1 / max(target_utilization_fraction, 0.001)
else:
    upscaling_speed = 1.0
if self.resource_demand_scheduler:
Contributor:

We shouldn't do this, since you can change the instance type of any node type.

AmeerHajAli (Contributor, Author):

I am confused; we call reset_config, which handles it. Can you please double-check?

Contributor:

But you are doing that to "make sure inferred resources are not lost."

What I'm saying is that you want to lose the inferred resources, to make sure you don't end up with wrong resources.

AmeerHajAli (Contributor, Author):

But I already handle the case you mention: I overwrite the resources if the node config changes. Check out the reset_config function.
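For what it's worth, the intent described here could be sketched like this (a hedged illustration of the behavior being discussed, not the actual reset_config implementation; the function and dict shapes are assumptions for illustration):

# Hedged sketch of the intent described above (not the actual reset_config code):
# keep previously inferred resources for a node type across a config reload,
# but overwrite them when that type's node_config has changed.
def merge_inferred_resources(old_types: dict, new_types: dict) -> dict:
    merged = {}
    for name, new_spec in new_types.items():
        old_spec = old_types.get(name)
        spec = dict(new_spec)
        if old_spec and old_spec.get("node_config") == spec.get("node_config"):
            # Same underlying instance config: the inferred resources are not lost.
            spec["resources"] = old_spec.get("resources") or spec.get("resources", {})
        # Otherwise the node config changed, so possibly-stale inferred
        # resources are dropped and re-inferred for the new instance type.
        merged[name] = spec
    return merged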

wuisawesome (Contributor) left a comment:

I think there is more dead code that can be removed.

Also, I'm not sure the changes to inferring resources for legacy node types are correct.

Other than that, LGTM.

for resource, count in resources.items():
    self.resource_requests[resource] = max(
        self.resource_requests[resource], count)
try:
Contributor:

Can you add a comment explaining what this is doing and why it needed to be changed?

AmeerHajAli (Contributor, Author):

self.resource_requests is no longer used, so this translates the previous single dict to resource_demand_vector. I added a comment to clarify.
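A small illustration of that translation (a sketch consistent with the snippets in this thread, not the exact code; the function name and per-worker lookup are assumptions): an aggregate request such as request_resources({"CPU": cores_per_node * 10}) gets expanded into a list of per-worker resource bundles for the resource demand scheduler.

# Sketch of the translation described above (illustrative only): the previous
# single dict of aggregate resource requests becomes a list of per-worker
# bundles (the resource_demand_vector) that the resource demand scheduler uses.
def to_resource_demand_vector(resource_requests, resource_per_worker):
    demand_vector = []
    for resource, count in resource_requests.items():
        per_worker = resource_per_worker.get(resource, 1)
        workers_to_add = -(-count // per_worker)  # ceiling division
        demand_vector.extend([{resource: per_worker}] * workers_to_add)
    return demand_vector

# e.g. a request for 40 CPUs on a cluster with 4-CPU workers:
print(to_resource_demand_vector({"CPU": 40}, {"CPU": 4}))
# -> [{'CPU': 4}, {'CPU': 4}, ..., {'CPU': 4}]  (10 bundles)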

Contributor:

I removed the else branch, it's not used.

AmeerHajAli (Contributor, Author):

I added it because some tests used "Resources" and this dict.

AmeerHajAli removed the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Nov 11, 2020
AmeerHajAli (Contributor, Author) commented Nov 11, 2020:

> I think there is more dead code that can be removed.
>
> Also, I'm not sure the changes to inferring resources for legacy node types are correct.
>
> Other than that, LGTM.

@wuisawesome I removed the dead code you mentioned. What is wrong with the inferring? It is tested in test_autoscaler.py in many places, e.g. testAggressiveAutoscaling.

self.resource_demand_vector.extend([{
    resource: resource_per_worker
}] * workers_to_add)
assert isinstance(resources, list), resources
AmeerHajAli (Contributor, Author):

Can we log this rather than assert it, which might break things?

Contributor:

It should always be a list; we never call it with anything that's not that.

AmeerHajAli (Contributor, Author):

@ericl, so this test was basically not a legacy use case (or any use case)?

autoscaler.request_resources({"CPU": cores_per_node * 10})

ericl merged commit 85197de into ray-project:master on Nov 11, 2020