[spark] Add heap_memory param for setup_ray_cluster API, change default values of per-Ray-worker-node config, and change default values of Ray head node config for the global Ray cluster #42604
Conversation
CC @jjyao
CC @jjyao btw, shall I name the new param "heap_memory_worker_node" or "memory_worker_node"?
lg
"'num_cpus_worker_node' and 'num_gpus_worker_node' arguments must be" | ||
"set together or unset together." |
Why is that?
The reason is: in this PR I change the default values of num_cpus_worker_node and num_gpus_worker_node to the "maximum", i.e. all CPU / GPU cores of the Spark worker node.

Suppose the Spark worker node instance has 4 CPUs and 2 GPUs. If the user sets only one param, e.g. setup_ray_cluster(num_cpus_worker_node=1), then num_gpus_worker_node still defaults to 2, which wastes 3 CPU cores on that node: once a Ray node of shape (1 CPU / 2 GPUs) is launched on the Spark worker, no other Ray worker node of the same shape can be launched on the same Spark worker.

So I only allow 2 cases:
- num_cpus_worker_node / num_gpus_worker_node are both generated by default, using all CPU / GPU cores of the Spark worker.
- The user sets both num_cpus_worker_node and num_gpus_worker_node, in which case they can choose values that fit the Spark worker shape well, e.g. (num_cpus_worker_node=2, num_gpus_worker_node=1) for a 4-CPU / 2-GPU Spark worker.
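For reference, a minimal sketch of the both-or-neither check behind the error message quoted above (the helper name and the None-as-"unset" convention are assumptions, not code copied from this PR):

```python
def _check_worker_node_resources(num_cpus_worker_node, num_gpus_worker_node):
    # Treat None as "not set by the user": either both params are user-set,
    # or both fall back to the generated defaults described above.
    if (num_cpus_worker_node is None) != (num_gpus_worker_node is None):
        raise ValueError(
            "'num_cpus_worker_node' and 'num_gpus_worker_node' arguments must be "
            "set together or unset together."
        )
```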
I feel it's better to allow users to set them separately since they are two independent resources. If you want to avoid wasted resources, we can print a warning message. (What if the waste is on purpose, because the user wants to leave some resources for Spark jobs?)
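For illustration, a warning-based alternative along these lines might look like the sketch below (the helper name, signature, and message wording are assumptions, not part of this PR):

```python
import warnings

def _warn_if_resources_stranded(num_cpus_worker_node, num_gpus_worker_node,
                                spark_worker_cpus, spark_worker_gpus):
    # Hypothetical helper: warn (rather than error) when the requested Ray worker
    # node shape cannot tile the Spark worker node without leaving cores idle.
    cpu_limit = spark_worker_cpus // max(num_cpus_worker_node, 1)
    gpu_limit = (spark_worker_gpus // num_gpus_worker_node
                 if num_gpus_worker_node > 0 else cpu_limit)
    nodes_per_worker = min(cpu_limit, gpu_limit)
    unused_cpus = spark_worker_cpus - nodes_per_worker * num_cpus_worker_node
    unused_gpus = spark_worker_gpus - nodes_per_worker * num_gpus_worker_node
    if unused_cpus > 0 or unused_gpus > 0:
        warnings.warn(
            f"The configured Ray worker node shape leaves {unused_cpus} CPU core(s) "
            f"and {unused_gpus} GPU(s) per Spark worker unused; set "
            "num_cpus_worker_node and num_gpus_worker_node together if this is "
            "not intentional."
        )
```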
CC @jjyao
Add heap_memory param for setup_ray_cluster API:
This lets advanced users better tune memory settings when Ray nodes co-exist with Spark executor processes; see the sketch below.
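A hedged sketch of how the new parameter might be used. The exact parameter name is still under discussion earlier in this thread, so "heap_memory_worker_node" below is illustrative only:

```python
from ray.util.spark import setup_ray_cluster

# Illustrative only: "heap_memory_worker_node" is one of the candidate names
# discussed in this PR and may differ in the merged API.
setup_ray_cluster(
    num_worker_nodes=2,
    heap_memory_worker_node=8 * 1024 * 1024 * 1024,  # 8 GiB of heap per Ray worker node
)
```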
Change default values of per-Ray-worker-node config:
- Change the default value of num_cpus_worker_node to the number of CPU cores of the Spark worker node.
- Change the default value of num_gpus_worker_node to the number of GPUs of the Spark worker node.
This is a more optimal setting in most cases compared with the previous defaults.
Change default values of Ray head node config for the global-mode Ray cluster:
- Change the default value of num_cpus_head_node to the number of CPU cores of the Spark driver node.
- Change the default value of num_gpus_head_node to the number of GPUs of the Spark driver node.
This is a more optimal setting in most cases compared with the previous defaults; a sketch of how such defaults could be derived follows.
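To make the new defaults concrete, here is a hedged sketch of how they could be derived from standard Spark config keys. This is illustrative only; the actual implementation may query the cluster shape differently:

```python
from pyspark.sql import SparkSession

def _default_node_resources(spark: SparkSession) -> dict:
    # Illustrative derivation of the new defaults from standard Spark configs.
    conf = spark.sparkContext.getConf()
    return {
        # Per-Ray-worker-node defaults follow the Spark worker (executor) shape.
        "num_cpus_worker_node": int(conf.get("spark.executor.cores", "1")),
        "num_gpus_worker_node": int(conf.get("spark.executor.resource.gpu.amount", "0")),
        # Global-mode head node defaults follow the Spark driver node shape.
        "num_cpus_head_node": int(conf.get("spark.driver.cores", "1")),
        "num_gpus_head_node": int(conf.get("spark.driver.resource.gpu.amount", "0")),
    }
```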
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.