[spark] ray on spark autoscaling #38215
Conversation
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
class SparkNodeProvider(NodeProvider):
    """A node provider implementation for nodes of Ray on Spark."""
Let's write down the high level design to help people understand the code.
    f"Spark node provider creates node {node_id}."
)

def update_node_status(_node_id):
I checked other node providers (e.g. AWS, GCP); there, the node status is updated inside non_terminated_nodes. I think we can do the same?
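To illustrate the pattern the reviewer is pointing at, here is a minimal, hypothetical sketch (not the actual Ray-on-Spark code) of refreshing cached node status inside non_terminated_nodes, as the AWS and GCP node providers do. The class name, cache layout, and _query_node_status helper are all stand-ins for illustration.

```python
# Hypothetical sketch: refresh cached node status as a side effect of
# non_terminated_nodes(), mirroring the AWS/GCP node-provider pattern.
# _query_node_status is a made-up stand-in for querying the Spark job
# that backs each node; it is not a real Ray API.

class SparkNodeProviderSketch:
    def __init__(self):
        # node_id -> {"status": ...} cache maintained by the provider.
        self._nodes = {}

    def create_node(self, node_id):
        self._nodes[node_id] = {"status": "pending"}

    def _query_node_status(self, node_id):
        # Stand-in: a real provider would ask Spark about the backing job.
        return "running"

    def non_terminated_nodes(self, tag_filters=None):
        """Return live node ids, refreshing each cached status on the way."""
        live = []
        for node_id, info in list(self._nodes.items()):
            info["status"] = self._query_node_status(node_id)
            if info["status"] != "terminated":
                live.append(node_id)
        return live
```

Folding the status refresh into non_terminated_nodes avoids a separate update path, since the autoscaler already polls non_terminated_nodes on every loop iteration.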
Updated.
    collect_log_to_path,
)
ray_head_node_cmd = autoscaling_cluster.ray_head_node_cmd
else:
Similar to AutoscalingCluster, can we also create a StaticCluster class to encapsulate the logic of starting the head and worker nodes of a static cluster?
This is code-refactoring work; I suggest we do it in a follow-up PR.
OK, let's refactor in a follow-up PR. Also, this file is very big now; we should split it into multiple smaller files during the refactoring.
Generally it'd be good to write more comments, especially on the high-level design and how things work together. This can help future users understand and maintain the code.
Lint failure
As a follow-up, I think we can consolidate the static cluster and autoscaling cluster code by always using the autoscaling code path? A static cluster is just an autoscaling cluster with the same min_workers and max_workers.
This makes sense.
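The consolidation idea above can be sketched in a few lines: treat a "static" cluster as an autoscaling cluster whose min_workers equals max_workers, so one autoscaling code path serves both modes. The function name and config keys below are illustrative assumptions, not the real Ray-on-Spark API.

```python
# Hypothetical sketch: derive the autoscaler's worker bounds so that a
# static cluster is just the degenerate case min_workers == max_workers.

def worker_scaling_config(num_worker_nodes, autoscale=False, max_worker_nodes=None):
    """Return min/max worker bounds for the autoscaler."""
    if autoscale and max_worker_nodes is not None:
        return {"min_workers": num_worker_nodes, "max_workers": max_worker_nodes}
    # Static mode: pinning min == max means the autoscaler never
    # adds or removes workers, yet the same code path runs throughout.
    return {"min_workers": num_worker_nodes, "max_workers": num_worker_nodes}
```

With this shape, the StaticCluster/AutoscalingCluster split discussed earlier collapses into a single cluster class parameterized by these bounds.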
@@ -198,6 +198,7 @@ def get_packages(self):
     "ray/autoscaler/aws/cloudwatch/prometheus.yml",
     "ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh",
     "ray/autoscaler/azure/defaults.yaml",
+    "ray/autoscaler/spark/defaults.yaml",
@ericl This file needs your approval.
Approved for setup.py changes.
Implement ray on spark autoscaling. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Why are these changes needed?
Implement ray on spark autoscaling.
See REP: ray-project/enhancements#43
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.