[autoscaler] Support legacy cluster configs with the new resource demand scheduler #11751
Conversation
This reverts commit 818a63a.
@@ -1053,8 +1053,6 @@ def testConfiguresNewNodes(self):

    def testReportsConfigFailures(self):
        config = copy.deepcopy(SMALL_CLUSTER)
        config["provider"]["type"] = "external"
        config = prepare_config(config)
Removed because, after we rewrite in prepare_config, the autofilling of instance resources fails since no module is provided. This prepare_config call is redundant here anyway, and we cannot have type "external" without the appropriate module.
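A minimal runnable sketch of the failure mode being described (the helper and the module table below are hypothetical stand-ins for illustration, not Ray's actual prepare_config logic):

```python
# Hypothetical stand-in for provider-module lookup; Ray's real
# implementation differs. This only illustrates why autofilling
# resources fails for an "external" provider with no module.
PROVIDER_MODULES = {"aws": "aws_impl", "gcp": "gcp_impl"}

def autofill_instance_resources(config):
    """Autofill resources by asking the provider module; an 'external'
    type with no registered module cannot be autofilled."""
    provider_type = config["provider"]["type"]
    if provider_type not in PROVIDER_MODULES:
        raise ValueError(
            f"No module for provider type {provider_type!r}; "
            "cannot autofill instance resources.")
    return {"CPU": 4}  # placeholder for module-derived resources

config = {"provider": {"type": "external"}}
try:
    autofill_instance_resources(config)
except ValueError as e:
    print(e)  # autofill fails, as described in the comment above
```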
        # node is connected and its resources show up in LoadMetrics.
        return_immediately = self._handle_legacy_yaml(
            nodes, pending_nodes, static_node_resources)
        if return_immediately:
Can you give this a better name than return_immediately?
Actually, how about we rename _handle_legacy_yaml to _infer_legacy_node_resources_if_needed(), remove the early return, and change the logic in line 110 to return an empty dict if a worker is already pending?
(That way we never need to return early here)
@ericl better?
Looks good, just one suggested refactor
It looks great. What are the follow-up PRs?
- Removing the legacy code.
Seems like there are many failing tests.
@ericl, yeah, my tests passed, but there were changes in master that made it break.
This PR adds code to the new resource demand scheduler to support legacy cluster configs. We rewrite the cluster configs from the legacy format to available node types. Since the head node is already launched, we can get its resources. If there are remaining unfulfilled resources, we launch max(1, min_workers) workers to get the resources of the workers, so we can later launch based on demand.

We do not check whether the head node and worker nodes are the same. I am not sure how we could know they are the same for an arbitrary cloud, as they have different config names: e.g., InstanceType for AWS vs machineType for GCP.

Relevant tests were added to test the support of legacy node types.
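The rewrite described above can be sketched roughly like this (a simplified illustration: the node-type names and exact fields are assumptions for the sketch, not the PR's actual code):

```python
def rewrite_legacy_config(config, head_resources=None):
    """Rewrite a legacy cluster config into available_node_types.

    The head node is already launched, so its resources can be filled
    in directly; worker resources stay empty until max(1, min_workers)
    workers launch and report what they actually have.
    """
    min_workers = config.get("min_workers", 0)
    rewritten = dict(config)
    rewritten["available_node_types"] = {
        "head-node-type": {  # assumed name, for illustration only
            "node_config": config.get("head_node", {}),
            "resources": head_resources or {},  # known once head is up
            "max_workers": 0,
        },
        "worker-node-type": {  # assumed name, for illustration only
            "node_config": config.get("worker_nodes", {}),
            "resources": {},  # learned after workers connect
            # Launch at least one worker so its resources become known.
            "min_workers": max(1, min_workers),
            "max_workers": config.get("max_workers", 1),
        },
    }
    rewritten["head_node_type"] = "head-node-type"
    return rewritten
```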
I've run scripts/format.sh to lint the changes in this PR.