
[spark] ray on spark autoscaling #38215

Merged (73 commits) Oct 5, 2023

Conversation

WeichenXu123 (Contributor) commented Aug 8, 2023

Why are these changes needed?

Implement Ray on Spark autoscaling.
See REP: ray-project/enhancements#43

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 marked this pull request as draft August 8, 2023 08:45
@WeichenXu123 WeichenXu123 changed the title [WIP] [spark] ray on spark autoscaling [WIP] [spark] ray on spark autoscaling prototyping Aug 8, 2023
@WeichenXu123 WeichenXu123 changed the title [WIP] [spark] ray on spark autoscaling prototyping [WIP] [spark] ray on spark autoscaling Aug 21, 2023


class SparkNodeProvider(NodeProvider):
"""A node provider that implements provider for nodes of Ray on spark."""
Collaborator:
Let's write down the high level design to help people understand the code.
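To make the reviewer's request concrete, here is an illustrative, self-contained sketch of the high-level design: the node provider translates Ray autoscaler node operations into calls against a Spark job server, which launches one Spark job per Ray worker node. All names here (`SparkJobServerClient`, `SparkNodeProviderSketch`, the method signatures) are hypothetical stand-ins, not the actual implementation in `python/ray/autoscaler/_private/spark/node_provider.py`.

```python
# Hedged sketch only. The real provider subclasses Ray's NodeProvider and
# talks to a Spark job server over RPC; this stand-in keeps everything
# in memory so the design is easy to follow.

class SparkJobServerClient:
    """Hypothetical client for the Spark job server that launches Spark
    jobs, each of which hosts one Ray worker node."""

    def __init__(self):
        self._next_id = 0
        self._jobs = {}  # node_id -> node tags

    def launch_worker(self, tags):
        node_id = str(self._next_id)
        self._next_id += 1
        self._jobs[node_id] = dict(tags)
        return node_id

    def cancel_worker(self, node_id):
        self._jobs.pop(node_id, None)

    def list_workers(self):
        return dict(self._jobs)


class SparkNodeProviderSketch:
    """Maps autoscaler node operations onto Spark job server calls."""

    def __init__(self, client):
        self.client = client

    def create_node(self, node_config, tags, count):
        # One Spark job per requested Ray worker node.
        for _ in range(count):
            self.client.launch_worker(tags)

    def terminate_node(self, node_id):
        self.client.cancel_worker(node_id)

    def non_terminated_nodes(self, tag_filters):
        # Return the ids of live nodes whose tags match every filter.
        return [
            node_id
            for node_id, tags in self.client.list_workers().items()
            if all(tags.get(k) == v for k, v in tag_filters.items())
        ]
```

The key design point is that the provider itself stays stateless about node lifetimes: the Spark job server is the source of truth, and the autoscaler polls it through `non_terminated_nodes`.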

python/ray/autoscaler/_private/spark/node_provider.py (outdated review thread, resolved)
f"Spark node provider creates node {node_id}."
)

def update_node_status(_node_id):
Collaborator:

I checked other node providers (e.g. aws, gcp), the node status is updated inside non_terminated_nodes. I think we can do the same?

Contributor Author:

Updated.
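The pattern the reviewer points to (used by the AWS and GCP providers) can be sketched as follows: instead of a background updater thread, node status is refreshed lazily each time the autoscaler calls `non_terminated_nodes()`, and the other queries serve from that cache. `query_status_from_job_server` is a hypothetical stand-in for the real Spark job server query, not the actual API.

```python
# Hedged sketch, assuming a job-server query function. Status is refreshed
# inside non_terminated_nodes() on every autoscaler poll; is_running() reads
# the cache populated by the last poll.

def query_status_from_job_server():
    # Hypothetical: would ask the Spark job server which Spark jobs
    # (i.e. Ray worker nodes) are still alive.
    return {"0": "running", "1": "terminated"}


class LazyStatusProvider:
    def __init__(self, query_fn):
        self._query_fn = query_fn
        self._status = {}  # node_id -> last observed status

    def non_terminated_nodes(self, tag_filters=None):
        # Refresh the cached status on every poll, then filter.
        self._status = self._query_fn()
        return [n for n, s in self._status.items() if s != "terminated"]

    def is_running(self, node_id):
        # Served from the cache; no extra round trip to the job server.
        return self._status.get(node_id) == "running"
```

This keeps the provider single-threaded and avoids stale-status races between an updater thread and the autoscaler's poll loop.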

python/ray/autoscaler/_private/spark/spark_job_server.py (outdated review thread, resolved)
python/ray/tests/spark/test_basic.py (outdated review thread, resolved)
python/ray/tests/spark/test_basic.py (review thread, resolved)
python/ray/util/spark/databricks_hook.py (outdated review thread, resolved)
collect_log_to_path,
)
ray_head_node_cmd = autoscaling_cluster.ray_head_node_cmd
else:
Collaborator:

Similar to AutoscalingCluster, can we also create a StaticCluster class to encapsulate the logic of starting head and worker nodes of a static cluster?

Contributor Author:

This is code refactoring work; I suggest we do it in a follow-up PR.

Collaborator:

OK, let's refactor in a follow-up PR. Also, this file is very big now; we should split it into multiple smaller files during the refactoring.

python/ray/util/spark/cluster_init.py (review thread, resolved)
jjyao (Collaborator) commented Oct 1, 2023

Generally it'd be good to write more comments, especially on the high-level design and how things work together. This can help future contributors understand and maintain the code.

jjyao (Collaborator) commented Oct 2, 2023

Lint failure



Mon Oct  2 03:10:47 UTC 2023 Flake8....
python/ray/autoscaler/_private/spark/node_provider.py:4:1: F401 'threading' imported but unused
python/ray/autoscaler/_private/spark/node_provider.py:5:1: F401 'time' imported but unused
python/ray/autoscaler/_private/spark/node_provider.py:211:89: E501 line too long (91 > 88 characters)
python/ray/util/spark/cluster_init.py:2:1: F401 'tempfile' imported but unused

python/ray/autoscaler/_private/spark/spark_job_server.py (outdated review thread, resolved)
python/ray/util/spark/cluster_init.py (review thread, resolved)
python/ray/util/spark/cluster_init.py (outdated review thread, resolved)
jjyao (Collaborator) left a comment:
As a follow-up, I think we can consolidate the static cluster and autoscaling cluster code by always using the autoscaling code path? A static cluster is just an autoscaling cluster with the same min_workers and max_workers.

WeichenXu123 (Contributor Author) replied:
> As a follow-up, I think we can consolidate the static cluster and autoscaling cluster code by always using the autoscaling code path? A static cluster is just an autoscaling cluster with the same min_workers and max_workers.

This makes sense.
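The consolidation idea can be sketched in a few lines: a static cluster is expressed as an autoscaling cluster whose bounds coincide, i.e. `min_workers == max_workers`. The helper name below is hypothetical; the config keys mirror the Ray autoscaler's worker-count options.

```python
# Hedged sketch of the proposed consolidation, not actual Ray code.

def make_worker_group_config(num_workers, autoscale=False, max_workers=None):
    """Build a worker-group config; a static cluster pins both bounds."""
    if autoscale:
        if max_workers is None or max_workers < num_workers:
            raise ValueError("max_workers must be >= min_workers")
        return {"min_workers": num_workers, "max_workers": max_workers}
    # Static cluster: same min and max, so the autoscaler never resizes it,
    # and both cluster kinds share one code path.
    return {"min_workers": num_workers, "max_workers": num_workers}
```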

@@ -198,6 +198,7 @@ def get_packages(self):
     "ray/autoscaler/aws/cloudwatch/prometheus.yml",
     "ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh",
     "ray/autoscaler/azure/defaults.yaml",
+    "ray/autoscaler/spark/defaults.yaml",
Collaborator:
@ericl This file needs your approval.

ericl (Contributor) left a comment:

Approved for setup.py changes.

@jjyao jjyao merged commit 19a58d2 into ray-project:master Oct 5, 2023
87 of 93 checks passed
Zandew pushed a commit to Zandew/ray that referenced this pull request Oct 10, 2023
Implement ray on spark autoscaling.

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Implement ray on spark autoscaling.

Signed-off-by: Victor <vctr.y.m@example.com>
4 participants