[WIP][core][gpu-objects] GC #53911


Draft · wants to merge 6 commits into base: master
Conversation

kevin85421 (Member):

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
```diff
@@ -43,12 +43,12 @@ def __init__(self):
         #
         # Note: Currently, `gpu_object_store` is only supported for Ray Actors.
         self.gpu_object_store: Dict[str, List["torch.Tensor"]] = {}
-        # A dictionary that maps from owned object ref to a metadata tuple: (actor handle, object ref).
+        # A dictionary that maps from owned object ID to a metadata tuple: (actor handle, object ref).
```
Contributor:
Can we key on ObjectID instead of hex str?

Member Author:

Would you mind sharing why you prefer using an ObjectID rather than a hex string?
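One practical argument for hex-string keys is that they are plain, hashable, and trivially loggable/serializable. A minimal sketch of the keying scheme, using a hypothetical `FakeObjectRef` stand-in (not Ray's actual `ObjectRef` class; only the `hex()` method mirrors Ray's real API):

```python
class FakeObjectRef:
    """Hypothetical stand-in for ray.ObjectRef, used for illustration only."""

    def __init__(self, binary: bytes):
        self._binary = binary

    def hex(self) -> str:
        # ray.ObjectRef exposes a hex() method that returns the ID as a hex string.
        return self._binary.hex()


ref = FakeObjectRef(b"\x01\x02\x03")

# Maps owned object ID (hex str) -> (actor handle, object ref), as in the diff above.
metadata = {}
metadata[ref.hex()] = ("actor_handle_placeholder", ref)

print("010203" in metadata)  # True
```

Keying on the ID object itself would also work as long as it implements `__hash__` and `__eq__`; the choice mostly affects readability of logs and cross-language boundaries.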

```diff
@@ -2264,6 +2264,14 @@ cdef execute_task_with_cancellation_handler(
             f"Exited because worker reached max_calls={execution_info.max_calls}"
             " for this method.")

+cdef void clean_up_gpu_object_callback(const CObjectID &c_object_id) nogil:
```
Contributor:

Which thread does this callback run on? Can it get blocked by task execution on the main thread?

Member Author:

TL;DR: The callback is executed on the IO thread.

  1. The RPC server runs on `io_service_`:

```cpp
// Start RPC server after all the task receivers are properly initialized and we have
// our assigned port from the raylet.
core_worker_server_ =
    std::make_unique<rpc::GrpcServer>(WorkerTypeString(options_.worker_type),
                                      assigned_port,
                                      options_.node_ip_address == "127.0.0.1");
core_worker_server_->RegisterService(
    std::make_unique<rpc::CoreWorkerGrpcService>(io_service_, *this),
    false /* token_auth */);
core_worker_server_->Run();
```

  2. `io_service_` runs on the IO thread:

```cpp
io_thread_ = boost::thread(io_thread_attrs, [this]() { RunIOService(); });
```
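The consequence is that callbacks posted to the IO service run on the IO thread, independently of whatever the main thread is executing. A hedged sketch of that pattern (not Ray's implementation; a dedicated thread draining a queue of posted callbacks, analogous to `RunIOService()`):

```python
import queue
import threading


def run_io_service(tasks: queue.Queue) -> None:
    """Drain posted callbacks on a dedicated thread until a None sentinel arrives."""
    while True:
        cb = tasks.get()
        if cb is None:
            break
        cb()


tasks: queue.Queue = queue.Queue()
io_thread = threading.Thread(target=run_io_service, args=(tasks,), name="io-thread")
io_thread.start()

ran_on = []
# Post a callback, analogous to the gRPC server invoking clean_up_gpu_object_callback.
tasks.put(lambda: ran_on.append(threading.current_thread().name))
tasks.put(None)  # sentinel: stop the IO loop
io_thread.join()

print(ran_on)  # ['io-thread'] -- the callback ran on the IO thread
```

This also illustrates the reviewer's concern in reverse: the main thread cannot block the IO thread, but a long-running callback on the IO thread would delay other IO work queued behind it.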

kevin85421 (Member Author):

TODO: check all remove_gpu_object calls

…-tmux7-ray4

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
kevin85421 (Member Author):

Investigate:

  1. How does object store GC work?
  2. Why is the ref count 0? (Ray's ref count = Python ref count + references held by objects in the cluster)
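The second point can be modeled as a simple two-part counter. A hypothetical sketch (not Ray's actual `ReferenceCounter`): the object is eligible for GC only when both the local Python references and the cluster-held references reach zero.

```python
class RefCount:
    """Hypothetical model of a combined reference count, for illustration only."""

    def __init__(self):
        self.local_refs = 0    # Python-side references in this process
        self.cluster_refs = 0  # references held by tasks/objects in the cluster

    def total(self) -> int:
        return self.local_refs + self.cluster_refs

    def out_of_scope(self) -> bool:
        # GC should only reclaim the object once the total count hits zero.
        return self.total() == 0


rc = RefCount()
rc.local_refs += 1    # a Python variable holds the ObjectRef
rc.cluster_refs += 1  # an in-flight task also references it
rc.local_refs -= 1    # the Python variable goes out of scope

print(rc.out_of_scope())  # False: a cluster reference still pins the object
```

Under this model, a ref count of 0 would mean neither side holds a reference, which is the condition the investigation above needs to confirm or rule out.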
