[core] Support spill objects to node_id sub-directory #44487

Merged
merged 30 commits into from
Apr 5, 2024

@ruisearch42 ruisearch42 commented Apr 5, 2024

Why are these changes needed?

When system-config is used to specify the spilling location, only one directory can be specified, and it is the same for all workers. When that directory is on a shared NFS, conflicts might happen.

https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html#cluster-mode

ray start --head --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'
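The value passed to --system-config above is JSON nested inside JSON: object_spilling_config is itself a JSON-encoded string. As a minimal sketch (the helper name build_system_config is hypothetical and not part of Ray), the string can be generated with two rounds of json.dumps rather than escaped by hand:

```python
import json
import shlex


def build_system_config(directory_path: str) -> str:
    """Return the JSON string for --system-config (hypothetical helper)."""
    spilling = {"type": "filesystem", "params": {"directory_path": directory_path}}
    # The inner config is serialized first, then embedded as a plain string
    # in the outer object, which is why the quotes end up escaped.
    inner = json.dumps(spilling, separators=(",", ":"))
    return json.dumps({"object_spilling_config": inner}, separators=(",", ":"))


system_config = build_system_config("/tmp/spill")

# shlex.quote wraps the JSON in single quotes so a POSIX shell passes it through unchanged.
command = "ray start --head --system-config=" + shlex.quote(system_config)
print(command)
```

Decoding the result twice with json.loads recovers the original directory_path, which is a quick way to check hand-written configs as well.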

This PR adds support for using <NODE_ID> as part of the spilling directory name to avoid these conflicts. Specifically, with the config above, the spilling directory is set to /tmp/spill/ray_spilled_objects_<NODE_ID>. Note that this change is not backwards compatible, since it changes the spill directory name. We preferred this over introducing an additional config argument (an API change) that controls whether <NODE_ID> is used as part of the directory name.
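The naming scheme can be illustrated with a short sketch. This is not the code from the PR; the function name node_spill_directory and the example node IDs are made up for illustration, and only the ray_spilled_objects_<NODE_ID> pattern comes from the description above:

```python
import posixpath

# Directory name prefix taken from the PR description.
SPILL_DIR_PREFIX = "ray_spilled_objects_"


def node_spill_directory(directory_path: str, node_id: str) -> str:
    """Return the per-node spill directory under the configured base path (illustrative)."""
    return posixpath.join(directory_path, SPILL_DIR_PREFIX + node_id)


# Two nodes sharing the same configured directory_path (e.g. on NFS)
# end up writing to different sub-directories.
dir_a = node_spill_directory("/tmp/spill", "node_a")
dir_b = node_spill_directory("/tmp/spill", "node_b")
print(dir_a)
print(dir_b)
```

Because the node ID is part of the path, nodes that share one base directory no longer write into the same location.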

Note that after this change, Ray doesn't clean up the spill directory at start-up (as it has no information about the old node ID) and leaves this to an external system, such as VM clean-up. As future work, functionality should be added to clean up the spill directory at node shutdown time.
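Until such functionality exists, an external clean-up job could look roughly like the sketch below. It is an assumption about how an operator might handle this, not something this PR provides: remove_stale_spill_dirs is a hypothetical helper, and the set of live node IDs is assumed to come from whatever inventory the operator already has.

```python
import shutil
import tempfile
from pathlib import Path

# Directory name prefix taken from the PR description.
SPILL_DIR_PREFIX = "ray_spilled_objects_"


def remove_stale_spill_dirs(base_dir, live_node_ids):
    """Delete per-node spill directories whose node ID is not in live_node_ids.

    Hypothetical external clean-up; returns the names of the removed directories.
    """
    removed = []
    base = Path(base_dir)
    if not base.is_dir():
        return removed
    for entry in sorted(base.iterdir()):
        # Only touch directories that follow the per-node naming scheme.
        if not entry.is_dir() or not entry.name.startswith(SPILL_DIR_PREFIX):
            continue
        node_id = entry.name[len(SPILL_DIR_PREFIX):]
        if node_id not in live_node_ids:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed


# Demonstration in a throwaway directory: node "bbb" is alive, node "aaa" is gone.
with tempfile.TemporaryDirectory() as base:
    for name in ("ray_spilled_objects_aaa", "ray_spilled_objects_bbb", "unrelated"):
        (Path(base) / name).mkdir()
    removed = remove_stale_spill_dirs(base, live_node_ids={"bbb"})
    remaining = sorted(p.name for p in Path(base).iterdir())

print(removed)
print(remaining)
```

The sketch only deletes directories that match the prefix, so unrelated content under the shared base path is left alone.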

Related issue number

Closes #44206
Fixes postmerge errors in #44341

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ruisearch42 and others added 29 commits March 28, 2024 10:10
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@jjyao jjyao merged commit b572cd7 into ray-project:master Apr 5, 2024
5 checks passed
jjyao pushed a commit that referenced this pull request Apr 12, 2024
In #44487, node_id is obtained through an RPC call, which adds overhead. This change passes node_id down from the raylet when default workers are created to avoid that overhead.

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
harborn pushed a commit to harborn/ray that referenced this pull request Apr 18, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
Development

Successfully merging this pull request may close these issues.

[core] Object store spilling to the directory with the node id to avoid conflict.