[autoscaler] Remove legacy autoscaler #11802

Merged: 115 commits merged into ray-project:master on Nov 11, 2020

Conversation

AmeerHajAli (Contributor) commented Nov 4, 2020:

This PR removes the legacy autoscaler code.
Along the way, many bugs in the original resource demand autoscaler were detected and fixed: for example, handling a modified config file, handling a legacy YAML that provides "Resources", a bug from #11551 (comment), etc.

Multiple tests were also modified to reflect the new autoscaler.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

if upscaling_speed:
    upscaling_speed = float(upscaling_speed)
elif aggressive:
    upscaling_speed = float(self.config["max_workers"])
ericl (Contributor) commented Nov 11, 2020:

This is somewhat different from the original notion of aggressive, which scales to max regardless of load. Can you add a TODO to maybe support this in the future?

We would need an option like initial_upscaling_num_workers (currently hardcoded to 5). We can decide whether this is needed based on user feedback.

AmeerHajAli (Contributor, Author):

Just wanted to clarify that setting upscaling_speed to max_workers is equivalent to setting it to infinity. Does that make sense?

AmeerHajAli (Contributor, Author):

I am actually confused by what you mean by the "original notion of aggressiveness". In the legacy autoscaler it just means that in aggressive mode we upscale to initial_workers when there is load (not to max regardless of load).

What you propose as an option seems orthogonal to scaling to max regardless of load. I added a TODO for initial_upscaling_num_workers.

AmeerHajAli (Contributor, Author):

Btw if you set upscaling_speed to initial_upscaling_num_workers, at the first scale-up (when only the head node is there) you actually get initial_upscaling_num_workers. But I guess someone might want them to have different values. I added a TODO for now.

AmeerHajAli (Contributor, Author):

Now that I think about it, why do we actually need initial_upscaling_num_workers? What new behavior would it introduce that is not already covered by upscaling_speed? At the end of the day both are constrained by demand, and the cluster will eventually scale to the same size regardless of upscaling_speed and initial_upscaling_num_workers.
It feels redundant to me.

Contributor:

Aggressive used to immediately scale to max regardless of demand. We don't do that any more since we always exactly match demand.

ericl (Contributor) commented Nov 11, 2020:

Re: the second question, cluster size will be y = (1 + b) * speed^t; the initial upscaling constant b affects how long it takes to reach a given size. It could be annoying to wait for slow autoscaling if you have huge clusters.

However, you can arguably just set min_workers higher in that case.
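To make the trade-off concrete, here is a rough sketch of the growth behavior being discussed (illustrative only, not the actual autoscaler code; it assumes each round may add up to upscaling_speed times the current worker count, with a floor of 5 new nodes, the hardcoded initial mentioned above, and is always bounded by outstanding demand):

# Rough sketch (not the actual autoscaler code) of how upscaling_speed and the
# hardcoded initial floor of 5 interact with demand. Assumes each round the
# scheduler may add up to max(5, upscaling_speed * current_workers) new nodes,
# bounded by the remaining demand.
def rounds_to_reach(demand_workers, upscaling_speed, initial_floor=5):
    workers, rounds = 0, 0
    while workers < demand_workers:
        cap = max(initial_floor, int(upscaling_speed * workers))
        workers += min(cap, demand_workers - workers)
        rounds += 1
    return rounds

# upscaling_speed set to max_workers behaves like "infinite" speed:
print(rounds_to_reach(100, upscaling_speed=100))  # 2 rounds: 5, then 95
# upscaling_speed = 1.0 roughly doubles the cluster each round:
print(rounds_to_reach(100, upscaling_speed=1.0))  # 6 rounds: 5, 10, 20, 40, 80, 100

Under this model, an initial boost (whether via a higher upscaling_speed or an initial_upscaling_num_workers option) only changes how quickly the cluster ramps up, not the final size it converges to, which is the point being debated above.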

config["available_node_types"] = {
NODE_TYPE_LEGACY_HEAD: {
"node_config": config["head_node"],
"resources": {},
"resources": config["head_node"].get("Resources") or {},
Contributor:

Why is Resources capitalized here?

AmeerHajAli (Contributor, Author):

It seems like this is the use case previously used in the tests:

config["worker_nodes"] = {"Resources": {"CPU": cores_per_node}}

and also in the legacy autoscaler:

cores_per_worker = self.config["worker_nodes"]["Resources"][

I was also confused about why it is capitalized.
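For reference, the two config shapes in question look roughly like this (a sketch based on the snippets above; the NODE_TYPE_LEGACY_HEAD value and the "InstanceType" node_config are placeholders for illustration):

# Sketch of the two config shapes being discussed (based on the snippets above).
NODE_TYPE_LEGACY_HEAD = "legacy-head-node-type"  # placeholder value for illustration

# Legacy-style config, as used in the old tests and legacy autoscaler,
# with a capitalized "Resources" key:
legacy_config = {
    "head_node": {"InstanceType": "m4.large"},   # hypothetical node_config
    "worker_nodes": {"Resources": {"CPU": 4}},
}

# New-style config for the resource demand scheduler, where each entry in
# available_node_types carries a lowercase "resources" dict:
new_config = {
    "available_node_types": {
        NODE_TYPE_LEGACY_HEAD: {
            "node_config": legacy_config["head_node"],
            "resources": {},   # lowercase; left empty and inferred/filled later
        },
    },
}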

Contributor:

Removed this. The capitalized "Resources" is clearly wrong.

ericl (Contributor) left a comment:

LGTM, last round of comments. Looking forward to trying this out!

ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Nov 11, 2020
    upscaling_speed = 1 / max(target_utilization_fraction, 0.001)
else:
    upscaling_speed = 1.0
if self.resource_demand_scheduler:
Contributor:

We shouldn't do this, since you can change the instance type of any node type.

AmeerHajAli (Contributor, Author):

I am confused; we call reset_config, which handles it. Can you please double-check?

Contributor:

But you are doing that to "make sure inferred resources are not lost."

What I'm saying is that you want to lose the inferred resources, to make sure you don't end up with wrong resources.

AmeerHajAli (Contributor, Author):

But I already handle the case you mention: I overwrite the resources if the node config changes. Check out the reset_config function.
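For what it's worth, the intent described here could be sketched like this (a hedged illustration of the behavior being discussed, not the actual reset_config implementation; the function and dict shapes are assumptions for illustration):

# Hedged sketch of the intent described above (not the actual reset_config code):
# keep previously inferred resources for a node type across a config reload,
# but overwrite them when that type's node_config has changed.
def merge_inferred_resources(old_types: dict, new_types: dict) -> dict:
    merged = {}
    for name, new_spec in new_types.items():
        old_spec = old_types.get(name)
        spec = dict(new_spec)
        if old_spec and old_spec.get("node_config") == spec.get("node_config"):
            # Same underlying instance config: the inferred resources are not lost.
            spec["resources"] = old_spec.get("resources") or spec.get("resources", {})
        # Otherwise the node config changed, so possibly-stale inferred
        # resources are dropped and re-inferred for the new instance type.
        merged[name] = spec
    return merged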

wuisawesome (Contributor) left a comment:

I think there is more dead code that can be removed.

Also, I'm not sure the changes to inferring resources for legacy node types are correct.

Other than that, LGTM.

for resource, count in resources.items():
    self.resource_requests[resource] = max(
        self.resource_requests[resource], count)
try:
Contributor:

Can you add a comment explaining what this is doing and why it needed to be changed?

AmeerHajAli (Contributor, Author):

self.resource_requests is no longer used, so this translates the previous single dict to resource_demand_vector. I added a comment to clarify.
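A small illustration of that translation (a sketch consistent with the snippets in this thread, not the exact code; the function name and per-worker lookup are assumptions): an aggregate request such as request_resources({"CPU": cores_per_node * 10}) gets expanded into a list of per-worker resource bundles for the resource demand scheduler.

# Sketch of the translation described above (illustrative only): the previous
# single dict of aggregate resource requests becomes a list of per-worker
# bundles (the resource_demand_vector) that the resource demand scheduler uses.
def to_resource_demand_vector(resource_requests, resource_per_worker):
    demand_vector = []
    for resource, count in resource_requests.items():
        per_worker = resource_per_worker.get(resource, 1)
        workers_to_add = -(-count // per_worker)  # ceiling division
        demand_vector.extend([{resource: per_worker}] * workers_to_add)
    return demand_vector

# e.g. a request for 40 CPUs on a cluster with 4-CPU workers:
print(to_resource_demand_vector({"CPU": 40}, {"CPU": 4}))
# -> [{'CPU': 4}, {'CPU': 4}, ..., {'CPU': 4}]  (10 bundles)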

Contributor:

I removed the else branch, it's not used.

AmeerHajAli (Contributor, Author):

I added it because some tests used "Resources" and this dict.

AmeerHajAli removed the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Nov 11, 2020
AmeerHajAli (Contributor, Author) commented Nov 11, 2020:

> I think there is more dead code that can be removed.
>
> Also, I'm not sure the changes to inferring resources for legacy node types are correct.
>
> Other than that, LGTM.

@wuisawesome I removed the dead code you mentioned. What is wrong with the inferring? It is tested in test_autoscaler.py in many places, e.g. testAggressiveAutoscaling.

self.resource_demand_vector.extend([{
    resource: resource_per_worker
}] * workers_to_add)
assert isinstance(resources, list), resources
AmeerHajAli (Contributor, Author):

Can we log this rather than assert it, which might break things?

Contributor:

It should always be a list; we never call it with anything that's not that.

AmeerHajAli (Contributor, Author):

@ericl, so this test was basically not a legacy use case (or any use case)?

autoscaler.request_resources({"CPU": cores_per_node * 10})

ericl merged commit 85197de into ray-project:master on Nov 11, 2020