
[spark] Add heap_memory param for setup_ray_cluster API, and change default value of per ray worker node config, and change default value of ray head node config for global Ray cluster #42604

Merged — 24 commits merged into ray-project:master on Feb 26, 2024

Conversation

@WeichenXu123 (Contributor) commented on Jan 23, 2024

  • Add heap_memory param for setup_ray_cluster API:
    This lets advanced users tune memory settings more precisely when Ray node processes co-exist with Spark executor processes on the same machine. (See the usage sketch after this list.)

  • Change default values of the per-Ray-worker-node config:
    num_cpus_worker_node now defaults to the number of CPU cores of the Spark worker node.
    num_gpus_worker_node now defaults to the number of GPUs of the Spark worker node.
    In most cases this is a better setting than the previous defaults.

  • Change default values of the Ray head node config for a global-mode Ray cluster:
    num_cpus_head_node now defaults to the number of CPU cores of the Spark driver node.
    num_gpus_head_node now defaults to the number of GPUs of the Spark driver node.
    In most cases this is a better setting than the previous defaults.
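
For illustration, a minimal sketch of how the new surface might be called. It assumes the parameter landed as heap_memory_worker_node (the naming question is raised later in this thread) and takes a size in bytes; the merged code is the authority on the exact signature.

import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# With the new defaults, setup_ray_cluster(num_worker_nodes=2) alone claims
# all CPU cores / GPUs of each Spark worker node. Advanced users sharing
# machines with Spark executor processes can instead pin the shape and cap
# the Ray heap explicitly (parameter name and byte unit are assumptions):
setup_ray_cluster(
    num_worker_nodes=2,
    num_cpus_worker_node=4,
    num_gpus_worker_node=2,
    heap_memory_worker_node=8 * 1024 * 1024 * 1024,  # 8 GiB
)

ray.init()
# ... run Ray workloads, then tear the cluster down:
shutdown_ray_cluster()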

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@WeichenXu123 (Contributor, Author) commented:

CC @jjyao

@WeichenXu123 (Contributor, Author) commented:

CC @jjyao

btw, shall I name the new param "heap_memory_worker_node" or "memory_worker_node"?

@jjyao (Contributor) left a review comment:

lg

Comment on lines +1097 to +1098
"'num_cpus_worker_node' and 'num_gpus_worker_node' arguments must be"
"set together or unset together."
@jjyao (Contributor) commented:

Why is that?

@WeichenXu123 (Contributor, Author) replied:

The reason is:

In this PR I change the num_cpus_worker_node and num_gpus_worker_node default values to the full CPU / GPU counts of the Spark worker node.

Assume the Spark worker node instance has 4 CPUs and 2 GPUs. If the user sets only one param, like:

setup_ray_cluster(num_cpus_worker_node=1)

then num_gpus_worker_node still defaults to 2, which wastes 3 CPU cores on the node: once a Ray node of shape (1 CPU / 2 GPUs) launches on that Spark worker, no other Ray worker node of the same shape can be launched on the same Spark worker.

So I only allow 2 cases:

  1. num_cpus_worker_node / num_gpus_worker_node are both left unset and default to all CPU / GPU cores of the Spark worker.
  2. The user sets both num_cpus_worker_node and num_gpus_worker_node, choosing values, e.g. (num_cpus_worker_node=2, num_gpus_worker_node=1), that tile a 4-CPU / 2-GPU Spark worker shape well. (See the sketch after this list.)
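
A minimal sketch of the "set together or unset together" check described above. All names here (resolve_worker_shape, _detect_worker_cpus, _detect_worker_gpus) are illustrative, not the actual functions in cluster_init.py:

def _detect_worker_cpus() -> int:
    # Illustrative stub; the real code queries the Spark worker's resources.
    return 4

def _detect_worker_gpus() -> int:
    # Illustrative stub.
    return 2

def resolve_worker_shape(num_cpus_worker_node=None, num_gpus_worker_node=None):
    # Reject the mixed case where exactly one of the two params is set.
    if (num_cpus_worker_node is None) != (num_gpus_worker_node is None):
        raise ValueError(
            "'num_cpus_worker_node' and 'num_gpus_worker_node' arguments must be "
            "set together or unset together."
        )
    if num_cpus_worker_node is None:
        # Case 1: default both to the full Spark worker node shape.
        num_cpus_worker_node = _detect_worker_cpus()
        num_gpus_worker_node = _detect_worker_gpus()
    # Case 2: the user set both and owns picking a shape that tiles the
    # Spark worker, e.g. (2 CPUs, 1 GPU) on a 4-CPU / 2-GPU node.
    return num_cpus_worker_node, num_gpus_worker_node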

@jjyao (Contributor) replied:

I feel it's better to allow users to set them separately, since they are two independent resources. If you want to avoid wasted resources, we can print a warning message. (What if the waste is on purpose, i.e., the user wants to leave some resources for Spark jobs?)
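
For illustration, a hedged sketch of the warning-based alternative suggested here (not what the PR implements; warn_on_unused_cpus is a hypothetical helper):

import logging

logger = logging.getLogger(__name__)

def warn_on_unused_cpus(num_cpus_worker_node, num_gpus_worker_node,
                        node_cpus, node_gpus):
    # How many Ray worker nodes of the requested shape fit on one Spark worker?
    fit_by_cpu = node_cpus // num_cpus_worker_node
    fit_by_gpu = (node_gpus // num_gpus_worker_node
                  if num_gpus_worker_node else fit_by_cpu)
    fitting = min(fit_by_cpu, fit_by_gpu)
    unused = node_cpus - fitting * num_cpus_worker_node
    if unused > 0:
        logger.warning(
            "The requested Ray worker shape leaves %d CPU core(s) per Spark "
            "worker unused by Ray; this may be intentional if the user wants "
            "to leave resources for Spark jobs.", unused,
        )

# E.g. warn_on_unused_cpus(1, 2, node_cpus=4, node_gpus=2) warns that the
# shape from the example above wastes 3 CPU cores per Spark worker.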

Two further review threads on python/ray/util/spark/cluster_init.py (outdated, resolved).
@WeichenXu123 WeichenXu123 changed the title [spark] Add heap_memory param for setup_ray_cluster API, and change default value of per ray worker node config [spark] Add heap_memory param for setup_ray_cluster API, and change default value of per ray worker node config, and change default value of ray head node config for global Ray cluster Feb 5, 2024
@WeichenXu123 (Contributor, Author) commented:

CC @jjyao

Another review thread on python/ray/util/spark/cluster_init.py (outdated, resolved).
@jjyao jjyao merged commit c1535c0 into ray-project:master Feb 26, 2024
8 of 9 checks passed