
[spark] Add heap_memory param for setup_ray_cluster API, and change default value of per ray worker node config, and change default value of ray head node config for global Ray cluster #42604

Merged: 24 commits, Feb 26, 2024

Conversation

@WeichenXu123 (Contributor) commented on Jan 23, 2024:

  • Add a heap_memory param for the setup_ray_cluster API:
    This lets advanced users tune memory settings more precisely when Ray nodes coexist with Spark executor processes. (See the usage sketch after this list.)

  • Change the default values of the per-Ray-worker-node config:
    num_cpus_worker_node now defaults to the number of CPU cores of the Spark worker node.
    num_gpus_worker_node now defaults to the number of GPUs of the Spark worker node.
    In most cases this is a better setting than the previous defaults.

  • Change the default values of the Ray head node config for Ray global mode clusters:
    num_cpus_head_node now defaults to the number of CPU cores of the Spark driver node.
    num_gpus_head_node now defaults to the number of GPUs of the Spark driver node.
    In most cases this is a better setting than the previous defaults.
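
A usage sketch of the behavior described above. This is an assumption-laden illustration, not the merged code: the exact name of the new heap memory param (heap_memory_worker_node here) is still under discussion below, and max_worker_nodes is assumed from the existing setup_ray_cluster signature.

from ray.util.spark import setup_ray_cluster

# With the new defaults, each Ray worker node claims all CPU cores and all
# GPUs of its Spark worker node; no per-resource tuning is needed:
setup_ray_cluster(max_worker_nodes=2)

# Advanced tuning: set CPU, GPU, and heap memory explicitly so Ray nodes
# can coexist with Spark executor processes on the same machines:
setup_ray_cluster(
    max_worker_nodes=2,
    num_cpus_worker_node=2,
    num_gpus_worker_node=1,
    heap_memory_worker_node=8 * 1024**3,  # assumed param name; 8 GiB heap
)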

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@WeichenXu123 (Contributor, Author) commented:

CC @jjyao

@WeichenXu123 (Contributor, Author) commented:

CC @jjyao

btw, shall I name the new param "heap_memory_worker_node" or "memory_worker_node"?

@jjyao (Collaborator) left a comment:

lg

Comment on lines +1097 to +1098:

"'num_cpus_worker_node' and 'num_gpus_worker_node' arguments must be "
"set together or unset together."
@jjyao (Collaborator) commented:

Why is that?

@WeichenXu123 (Contributor, Author) commented on Feb 2, 2024:

The reason is:

In this PR I change the num_cpus_worker_node and num_gpus_worker_node default values to the maximum, i.e., all CPU / GPU cores of the Spark worker node.

Assume the Spark worker node instance has 4 CPUs and 2 GPUs. If the user sets only one param, like:

setup_ray_cluster(num_cpus_worker_node=1)

then num_gpus_worker_node still defaults to 2, which wastes 3 CPU cores on this node: once a Ray node of shape (1 CPU / 2 GPUs) is launched on this Spark worker, no other Ray worker node of the same shape can be launched on the same Spark worker, because no GPUs remain.

So I only allow 2 cases:

  1. num_cpus_worker_node and num_gpus_worker_node are both generated by default, so the Ray worker node uses all CPU / GPU cores of the Spark worker.
  2. The user sets both num_cpus_worker_node and num_gpus_worker_node, choosing values, e.g. (num_cpus_worker_node=2, num_gpus_worker_node=1), that fit a 4-CPU / 2-GPU Spark worker shape evenly.

A minimal sketch of this check follows the list.
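
A minimal sketch of the set-together-or-unset-together check described above, with a hypothetical helper name; the merged code in cluster_init.py may differ:

def _validate_worker_node_resources(num_cpus_worker_node, num_gpus_worker_node):
    # Reject the mixed case where exactly one of the two params was set.
    if (num_cpus_worker_node is None) != (num_gpus_worker_node is None):
        raise ValueError(
            "'num_cpus_worker_node' and 'num_gpus_worker_node' arguments must be "
            "set together or unset together."
        )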

@jjyao (Collaborator) commented:

I feel it's better to allow users to set them separately since they are two independent resources. If we want to avoid wasted resources, we can print a warning message. (What if the waste is on purpose, because the user wants to leave some resources for Spark jobs?) A sketch of that alternative follows.
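
A sketch of this suggested alternative, using a hypothetical helper that warns instead of raising; this is the reviewer's proposal, not what the PR implements:

import warnings

def _check_worker_node_resources(num_cpus_worker_node, num_gpus_worker_node):
    # Warn, rather than raise, when only one of the two params is set.
    if (num_cpus_worker_node is None) != (num_gpus_worker_node is None):
        warnings.warn(
            "Setting only one of 'num_cpus_worker_node' / 'num_gpus_worker_node' "
            "may leave CPU or GPU cores on the Spark worker unused by Ray. "
            "Ignore this warning if you intend to reserve those resources "
            "for Spark jobs."
        )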

On Feb 5, 2024, @WeichenXu123 changed the title from "[spark] Add heap_memory param for setup_ray_cluster API, and change default value of per ray worker node config" to "[spark] Add heap_memory param for setup_ray_cluster API, and change default value of per ray worker node config, and change default value of ray head node config for global Ray cluster".
@WeichenXu123 (Contributor, Author) commented:

CC @jjyao

@jjyao merged commit c1535c0 into ray-project:master on Feb 26, 2024. 8 of 9 checks passed.