[autoscaler] Remove legacy autoscaler #11802
Conversation
This reverts commit 818a63a.
```python
if upscaling_speed:
    upscaling_speed = float(upscaling_speed)
elif aggressive:
    upscaling_speed = float(self.config["max_workers"])
```
This is somewhat different from the original notion of aggressive, which scales to max regardless of load. Can you add a TODO to maybe support this in the future?
We would need an option like initial_upscaling_num_workers (currently hardcoded to 5). We can decide whether this is needed based on user feedback.
Just wanted to clarify that setting upscaling_speed to max_workers is equivalent to setting it to infinity. Does it make sense?
I am actually confused by what you mean by "original notion of aggressiveness". In the legacy autoscaler it just means that in the aggressive mode we upscale to initial_workers in case of load (not max regardless of load).
What you propose as an option seems orthogonal to scaling to max regardless of load. I added a TODO now for initial_upscaling_num_workers.
Btw, if you set upscaling_speed to initial_upscaling_num_workers, at the first scale-up (when only the head node is there) you actually get initial_upscaling_num_workers workers. But I guess someone might want them to have different values. I added a TODO for now.
Now that I think about it, why do we actually need initial_upscaling_num_workers? What new behavior will it introduce that is not already covered by upscaling_speed? At the end of the day, both are constrained by demand, and the cluster will eventually scale to the same size regardless of upscaling_speed and initial_upscaling_num_workers.
It feels redundant to me.
Aggressive used to immediately scale to max regardless of demand. We don't do that any more since we always exactly match demand.
Re: the second question, cluster size grows as y = (1 + b) * speed^t, where the initial upscaling constant b affects how long it takes to reach a given size. It could be annoying to wait for slow autoscaling if you have huge clusters.
However, you can arguably just set min_workers higher in that case.
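The growth formula in the comment above can be sketched as follows. This is an illustration of the discussion only, not actual autoscaler code; the function name and exact parameter meanings are assumptions:

```python
# Illustrative sketch of the growth model discussed above,
# y = (1 + b) * speed**t, where `b` plays the role of the hypothetical
# initial upscaling constant and `speed` is the per-round growth factor.
# Real growth would additionally be capped by demand and max_workers.
def cluster_size(t: int, speed: float, b: float) -> float:
    """Cluster size after t autoscaler rounds, demand permitting."""
    return (1 + b) * speed ** t

# With 1 + b = 5 initial workers (cf. the hardcoded 5 mentioned above)
# and speed = 2, the cluster doubles each round: 5, 10, 20, 40, ...
sizes = [cluster_size(t, speed=2.0, b=4.0) for t in range(4)]
```

A larger speed shortens the wait for big clusters, which is the annoyance described above; raising min_workers sidesteps it entirely.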
```diff
 config["available_node_types"] = {
     NODE_TYPE_LEGACY_HEAD: {
         "node_config": config["head_node"],
-        "resources": {},
+        "resources": config["head_node"].get("Resources") or {},
```
Why is Resources capitalized here?
It seems like this is the use case previously used in tests (ray/python/ray/tests/test_autoscaler.py, line 614 in d1182b8):
```python
config["worker_nodes"] = {"Resources": {"CPU": cores_per_node}}
```
and also in the legacy autoscaler:
```python
cores_per_worker = self.config["worker_nodes"]["Resources"][
```
I was also confused with why it is capitalized.
Removed this. Capital resources is clearly wrong.
LGTM, last round of comments. Looking forward to trying this out!
```python
    upscaling_speed = 1 / max(target_utilization_fraction, 0.001)
else:
    upscaling_speed = 1.0
if self.resource_demand_scheduler:
```
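The fallback above maps the legacy target_utilization_fraction onto upscaling_speed. A minimal sketch of that mapping (the function name is assumed for illustration):

```python
def legacy_upscaling_speed(target_utilization_fraction=None):
    """Sketch of the legacy fallback shown above: invert the utilization
    target, flooring it at 0.001 to avoid division by zero; otherwise
    default to an upscaling speed of 1.0."""
    if target_utilization_fraction is not None:
        return 1 / max(target_utilization_fraction, 0.001)
    return 1.0

# A lower utilization target leaves more headroom, i.e. faster upscaling.
assert legacy_upscaling_speed(0.5) == 2.0
assert legacy_upscaling_speed() == 1.0
```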
We shouldn't do this since you can change the instance type of any node type
I am confused; we call reset_config, which handles it. Can you please double check?
But you are doing that to "make sure inferred resources are not lost."
What I'm saying is that you want to lose inferred resources to make sure you don't end up with wrong resources.
But I already do that in the case you mention: I overwrite if the node config changes. Check out the reset_config function.
I think there is more dead code that can be removed.
Also, I'm not sure the changes to inferring resources for legacy node types are correct.
Other than that, LGTM.
```python
for resource, count in resources.items():
    self.resource_requests[resource] = max(
        self.resource_requests[resource], count)
try:
```
Can you comment/explain what this is doing/why it needed to be changed?
self.resource_requests is no longer used, so this translates the previous single dict to resource_demand_vector. I added a comment to clarify.
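That translation can be sketched roughly like this; the helper name and the ceiling rounding are assumptions for illustration, not the exact PR code:

```python
import math

def to_demand_vector(resource, count, resource_per_worker):
    """Translate a legacy aggregate request (e.g. {"CPU": 10}) into a
    resource demand vector of per-worker bundles, mirroring the
    resource_demand_vector.extend(...) pattern in this PR."""
    workers_to_add = math.ceil(count / resource_per_worker)
    return [{resource: resource_per_worker}] * workers_to_add

# Requesting 10 CPUs with 4-CPU workers yields three 4-CPU bundles.
assert to_demand_vector("CPU", 10, 4) == [{"CPU": 4}] * 3
```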
I removed the else branch, it's not used.
I added it because some tests used "Resources" and this dict.
@wuisawesome I removed the dead code you mentioned. What is wrong with the inference? It is tested in test_autoscaler.py in many places, e.g. testAggressiveAutoscaling.
```python
self.resource_demand_vector.extend([{
    resource: resource_per_worker
}] * workers_to_add)
assert isinstance(resources, list), resources
```
Can we log this rather than assert it, since the assert might break things?
It should always be a list; we never call it with anything that's not that.
@ericl, this test was basically not a legacy use case or any use case? (ray/python/ray/tests/test_autoscaler.py, line 629 in d1182b8):
```python
autoscaler.request_resources({"CPU": cores_per_node * 10})
```
This PR basically removes the legacy autoscaler code.
Many bugs in the original resource demand autoscaler were detected and fixed: for example, handling a modified config file, handling a legacy YAML that provides "Resources", a bug from #11551 (comment), etc.
Multiple tests were modified to reflect the new autoscaler too.
Checks
- Ran scripts/format.sh to lint the changes in this PR.