
[spark] ray on spark autoscaling #38215

Merged (73 commits) Oct 5, 2023

Conversation

WeichenXu123 (Contributor) commented Aug 8, 2023

Why are these changes needed?

Implement Ray on Spark autoscaling.
See REP: ray-project/enhancements#43

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 marked this pull request as draft August 8, 2023 08:45
@WeichenXu123 WeichenXu123 changed the title [WIP] [spark] ray on spark autoscaling [WIP] [spark] ray on spark autoscaling prototyping Aug 8, 2023
@WeichenXu123 WeichenXu123 changed the title [WIP] [spark] ray on spark autoscaling prototyping [WIP] [spark] ray on spark autoscaling Aug 21, 2023


class SparkNodeProvider(NodeProvider):
"""A node provider that implements provider for nodes of Ray on spark."""
Collaborator:
Let's write down the high level design to help people understand the code.
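To make the reviewer's request concrete, here is an illustrative, self-contained sketch of the high-level design: the node provider translates Ray autoscaler node operations into calls against a Spark job server, which launches one Spark job per Ray worker node. All names here (`SparkJobServerClient`, `SparkNodeProviderSketch`, the method signatures) are hypothetical stand-ins, not the actual implementation in `python/ray/autoscaler/_private/spark/node_provider.py`.

```python
# Hedged sketch only. The real provider subclasses Ray's NodeProvider and
# talks to a Spark job server over RPC; this stand-in keeps everything
# in memory so the design is easy to follow.

class SparkJobServerClient:
    """Hypothetical client for the Spark job server that launches Spark
    jobs, each of which hosts one Ray worker node."""

    def __init__(self):
        self._next_id = 0
        self._jobs = {}  # node_id -> node tags

    def launch_worker(self, tags):
        node_id = str(self._next_id)
        self._next_id += 1
        self._jobs[node_id] = dict(tags)
        return node_id

    def cancel_worker(self, node_id):
        self._jobs.pop(node_id, None)

    def list_workers(self):
        return dict(self._jobs)


class SparkNodeProviderSketch:
    """Maps autoscaler node operations onto Spark job server calls."""

    def __init__(self, client):
        self.client = client

    def create_node(self, node_config, tags, count):
        # One Spark job per requested Ray worker node.
        for _ in range(count):
            self.client.launch_worker(tags)

    def terminate_node(self, node_id):
        self.client.cancel_worker(node_id)

    def non_terminated_nodes(self, tag_filters):
        # Return the ids of live nodes whose tags match every filter.
        return [
            node_id
            for node_id, tags in self.client.list_workers().items()
            if all(tags.get(k) == v for k, v in tag_filters.items())
        ]
```

The key design point is that the provider itself stays stateless about node lifetimes: the Spark job server is the source of truth, and the autoscaler polls it through `non_terminated_nodes`.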

python/ray/autoscaler/_private/spark/node_provider.py (outdated review thread, resolved)
f"Spark node provider creates node {node_id}."
)

def update_node_status(_node_id):
Collaborator:

I checked other node providers (e.g. aws, gcp), the node status is updated inside non_terminated_nodes. I think we can do the same?

Contributor Author:

Updated.
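The pattern the reviewer points to (used by the AWS and GCP providers) can be sketched as follows: instead of a background updater thread, node status is refreshed lazily each time the autoscaler calls `non_terminated_nodes()`, and the other queries serve from that cache. `query_status_from_job_server` is a hypothetical stand-in for the real Spark job server query, not the actual API.

```python
# Hedged sketch, assuming a job-server query function. Status is refreshed
# inside non_terminated_nodes() on every autoscaler poll; is_running() reads
# the cache populated by the last poll.

def query_status_from_job_server():
    # Hypothetical: would ask the Spark job server which Spark jobs
    # (i.e. Ray worker nodes) are still alive.
    return {"0": "running", "1": "terminated"}


class LazyStatusProvider:
    def __init__(self, query_fn):
        self._query_fn = query_fn
        self._status = {}  # node_id -> last observed status

    def non_terminated_nodes(self, tag_filters=None):
        # Refresh the cached status on every poll, then filter.
        self._status = self._query_fn()
        return [n for n, s in self._status.items() if s != "terminated"]

    def is_running(self, node_id):
        # Served from the cache; no extra round trip to the job server.
        return self._status.get(node_id) == "running"
```

This keeps the provider single-threaded and avoids stale-status races between an updater thread and the autoscaler's poll loop.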

python/ray/autoscaler/_private/spark/spark_job_server.py (outdated review thread, resolved)
python/ray/tests/spark/test_basic.py (outdated review thread, resolved)
python/ray/tests/spark/test_basic.py (review thread, resolved)
python/ray/util/spark/databricks_hook.py (outdated review thread, resolved)
collect_log_to_path,
)
ray_head_node_cmd = autoscaling_cluster.ray_head_node_cmd
else:
Collaborator:

Similar to AutoscalingCluster, can we also create a StaticCluster class to encapsulate the logic of starting head and worker nodes of a static cluster?

Contributor Author:

This is code refactoring work; I suggest we do it in a follow-up PR.

Collaborator:

OK, let's refactor in a follow-up PR. Also, this file is very big now; we should split it into multiple smaller files during the refactoring.

python/ray/util/spark/cluster_init.py (review thread, resolved)
jjyao (Collaborator) commented Oct 1, 2023

Generally it'd be good to write more comments, especially on the high-level design and how things work together. This can help future contributors understand and maintain the code.

jjyao (Collaborator) commented Oct 2, 2023

Lint failure



Mon Oct  2 03:10:47 UTC 2023 Flake8....
python/ray/autoscaler/_private/spark/node_provider.py:4:1: F401 'threading' imported but unused
python/ray/autoscaler/_private/spark/node_provider.py:5:1: F401 'time' imported but unused
python/ray/autoscaler/_private/spark/node_provider.py:211:89: E501 line too long (91 > 88 characters)
python/ray/util/spark/cluster_init.py:2:1: F401 'tempfile' imported but unused

python/ray/autoscaler/_private/spark/spark_job_server.py (outdated review thread, resolved)
python/ray/util/spark/cluster_init.py (review thread, resolved)
python/ray/util/spark/cluster_init.py (outdated review thread, resolved)
jjyao (Collaborator) left a comment:
As a follow-up, I think we can consolidate the static cluster and autoscaling cluster code by always using the autoscaling code path? A static cluster is just an autoscaling cluster with the same min_workers and max_workers.

WeichenXu123 (Contributor Author) replied:
> As a follow-up, I think we can consolidate the static cluster and autoscaling cluster code by always using the autoscaling code path? A static cluster is just an autoscaling cluster with the same min_workers and max_workers.

This makes sense.
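The consolidation idea can be sketched in a few lines: a static cluster is expressed as an autoscaling cluster whose bounds coincide, i.e. `min_workers == max_workers`. The helper name below is hypothetical; the config keys mirror the Ray autoscaler's worker-count options.

```python
# Hedged sketch of the proposed consolidation, not actual Ray code.

def make_worker_group_config(num_workers, autoscale=False, max_workers=None):
    """Build a worker-group config; a static cluster pins both bounds."""
    if autoscale:
        if max_workers is None or max_workers < num_workers:
            raise ValueError("max_workers must be >= min_workers")
        return {"min_workers": num_workers, "max_workers": max_workers}
    # Static cluster: same min and max, so the autoscaler never resizes it,
    # and both cluster kinds share one code path.
    return {"min_workers": num_workers, "max_workers": num_workers}
```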

@@ -198,6 +198,7 @@ def get_packages(self):
     "ray/autoscaler/aws/cloudwatch/prometheus.yml",
     "ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh",
     "ray/autoscaler/azure/defaults.yaml",
+    "ray/autoscaler/spark/defaults.yaml",
Collaborator:
@ericl This file needs your approval.

ericl (Contributor) left a comment:

Approved for setup.py changes.

@jjyao jjyao merged commit 19a58d2 into ray-project:master Oct 5, 2023
87 of 93 checks passed
Zandew pushed a commit to Zandew/ray that referenced this pull request Oct 10, 2023
Implement ray on spark autoscaling.

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Implement ray on spark autoscaling.

Signed-off-by: Victor <vctr.y.m@example.com>
4 participants