[WIP][core][gpu-objects] GC #53911
base: master
@@ -43,12 +43,12 @@ def __init__(self):
 #
 # Note: Currently, `gpu_object_store` is only supported for Ray Actors.
 self.gpu_object_store: Dict[str, List["torch.Tensor"]] = {}
-# A dictionary that maps from owned object ref to a metadata tuple: (actor handle, object ref).
+# A dictionary that maps from owned object ID to a metadata tuple: (actor handle, object ref).
Can we key on ObjectID instead of hex str?
Would you mind sharing why you prefer using an ObjectID rather than a hex string?
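For readers following the ObjectID-versus-hex-string question, here is a minimal sketch of the two keying options. It does not use Ray's actual `ObjectID` type: `FakeObjectID` is a hypothetical stand-in (an immutable wrapper around raw bytes), and the string values stand in for tensor lists. It only illustrates that both keys resolve to the same entry, with the ID-keyed dict skipping the string conversion on each lookup.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FakeObjectID:
    """Hypothetical stand-in for an object ID: an immutable wrapper around raw bytes."""

    binary: bytes

    def hex(self) -> str:
        # Render the raw bytes as a lowercase hex string.
        return self.binary.hex()


raw = bytes(range(4))
oid = FakeObjectID(raw)

# Option A: key the store on the hex string (requires a conversion per lookup).
store_by_hex: Dict[str, List[str]] = {oid.hex(): ["tensor-0"]}

# Option B: key the store on the ID object itself.
# A frozen dataclass is hashable and compares by value, so it works as a dict key.
store_by_id: Dict[FakeObjectID, List[str]] = {oid: ["tensor-0"]}

# A second wrapper built from the same bytes finds the same entry either way.
same = FakeObjectID(raw)
found_by_hex = store_by_hex[same.hex()]
found_by_id = store_by_id[same]
```

Which option fits better depends on what the callers already hold: code that receives IDs from the C++ layer would have the object form, while Python-side code that logs or serializes keys may prefer the string.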
python/ray/_raylet.pyx
Outdated
@@ -2264,6 +2264,14 @@ cdef execute_task_with_cancellation_handler(
     f"Exited because worker reached max_calls={execution_info.max_calls}"
     " for this method.")

+cdef void clean_up_gpu_object_callback(const CObjectID &c_object_id) nogil:
Which thread does this callback run on? Can it get blocked by task execution on the main thread?
TLDR: The callback will be executed on the IO thread.
- The RPC server is running on io_service_.
ray/src/ray/core_worker/core_worker.cc
Lines 524 to 534 in 1bc0087
// Start RPC server after all the task receivers are properly initialized and we have
// our assigned port from the raylet.
core_worker_server_ =
    std::make_unique<rpc::GrpcServer>(WorkerTypeString(options_.worker_type),
                                      assigned_port,
                                      options_.node_ip_address == "127.0.0.1");
core_worker_server_->RegisterService(
    std::make_unique<rpc::CoreWorkerGrpcService>(io_service_, *this),
    false /* token_auth */);
core_worker_server_->Run();
- io_service_ is running on the IO thread.
ray/src/ray/core_worker/core_worker.cc
Line 470 in 1bc0087
io_thread_ = boost::thread(io_thread_attrs, [this]() { RunIOService(); });
TODO: check all
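The threading claim above (a callback posted to a service loop runs on that loop's dedicated thread, not on the main thread) can be sketched in plain Python. This is an illustrative analogy only: it uses `threading` and `queue` rather than Ray's io_service_, all names (`run_io_service`, `clean_up_callback`, `io_queue`) are hypothetical, and it does not model the GIL or whether the real callback needs to acquire it.

```python
import queue
import threading

# Work queue standing in for the IO service's pending-handler queue.
io_queue = queue.Queue()
ran_on = []
done = threading.Event()


def run_io_service():
    # Drain posted callbacks until a None sentinel arrives.
    while True:
        callback = io_queue.get()
        if callback is None:
            break
        callback()


def clean_up_callback():
    # Record which thread actually executed the callback.
    ran_on.append(threading.current_thread().name)
    done.set()


io_thread = threading.Thread(target=run_io_service, name="io-thread")
io_thread.start()

# The main thread posts the callback and carries on; it does not run it itself.
io_queue.put(clean_up_callback)
finished = done.wait(timeout=5.0)

# Shut the loop down cleanly.
io_queue.put(None)
io_thread.join()
```

In this sketch the callback runs as soon as the service thread dequeues it, regardless of what the main thread is doing. Whether the same holds in the worker depends on what the callback touches, which is the point of the reviewer's question about blocking.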
…-tmux7-ray4 Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Investigate: