[WIP][core][gpu-objects] GC #53911


Draft · wants to merge 6 commits into base: master
Conversation

kevin85421 (Member):

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
```diff
@@ -43,12 +43,12 @@ def __init__(self):
         #
         # Note: Currently, `gpu_object_store` is only supported for Ray Actors.
         self.gpu_object_store: Dict[str, List["torch.Tensor"]] = {}
-        # A dictionary that maps from owned object ref to a metadata tuple: (actor handle, object ref).
+        # A dictionary that maps from owned object ID to a metadata tuple: (actor handle, object ref).
```
Contributor:
Can we key on ObjectID instead of hex str?

Member Author:

Would you mind sharing why you prefer using an ObjectID rather than a hex string?
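One practical argument for hex-string keys is that they are plain, hashable, and trivially loggable/serializable. A minimal sketch of the keying scheme, using a hypothetical `FakeObjectRef` stand-in (not Ray's actual `ObjectRef` class; only the `hex()` method mirrors Ray's real API):

```python
class FakeObjectRef:
    """Hypothetical stand-in for ray.ObjectRef, used for illustration only."""

    def __init__(self, binary: bytes):
        self._binary = binary

    def hex(self) -> str:
        # ray.ObjectRef exposes a hex() method that returns the ID as a hex string.
        return self._binary.hex()


ref = FakeObjectRef(b"\x01\x02\x03")

# Maps owned object ID (hex str) -> (actor handle, object ref), as in the diff above.
metadata = {}
metadata[ref.hex()] = ("actor_handle_placeholder", ref)

print("010203" in metadata)  # True
```

Keying on the ID object itself would also work as long as it implements `__hash__` and `__eq__`; the choice mostly affects readability of logs and cross-language boundaries.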

```diff
@@ -2264,6 +2264,14 @@ cdef execute_task_with_cancellation_handler(
             f"Exited because worker reached max_calls={execution_info.max_calls}"
             " for this method.")

+cdef void clean_up_gpu_object_callback(const CObjectID &c_object_id) nogil:
```
Contributor:

Which thread does this callback run on? Can it get blocked by task execution on the main thread?

Member Author:

TL;DR: The callback is executed on the IO thread.

  1. The RPC server runs on `io_service_`:

```cpp
// Start RPC server after all the task receivers are properly initialized and we have
// our assigned port from the raylet.
core_worker_server_ =
    std::make_unique<rpc::GrpcServer>(WorkerTypeString(options_.worker_type),
                                      assigned_port,
                                      options_.node_ip_address == "127.0.0.1");
core_worker_server_->RegisterService(
    std::make_unique<rpc::CoreWorkerGrpcService>(io_service_, *this),
    false /* token_auth */);
core_worker_server_->Run();
```

  2. `io_service_` runs on the IO thread:

```cpp
io_thread_ = boost::thread(io_thread_attrs, [this]() { RunIOService(); });
```
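The consequence is that callbacks posted to the IO service run on the IO thread, independently of whatever the main thread is executing. A hedged sketch of that pattern (not Ray's implementation; a dedicated thread draining a queue of posted callbacks, analogous to `RunIOService()`):

```python
import queue
import threading


def run_io_service(tasks: queue.Queue) -> None:
    """Drain posted callbacks on a dedicated thread until a None sentinel arrives."""
    while True:
        cb = tasks.get()
        if cb is None:
            break
        cb()


tasks: queue.Queue = queue.Queue()
io_thread = threading.Thread(target=run_io_service, args=(tasks,), name="io-thread")
io_thread.start()

ran_on = []
# Post a callback, analogous to the gRPC server invoking clean_up_gpu_object_callback.
tasks.put(lambda: ran_on.append(threading.current_thread().name))
tasks.put(None)  # sentinel: stop the IO loop
io_thread.join()

print(ran_on)  # ['io-thread'] -- the callback ran on the IO thread
```

This also illustrates the reviewer's concern in reverse: the main thread cannot block the IO thread, but a long-running callback on the IO thread would delay other IO work queued behind it.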

kevin85421 (Member Author):

TODO: check all remove_gpu_object calls

…-tmux7-ray4

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
kevin85421 (Member Author):

Investigate:

  1. How does object store GC work?
  2. Why is the ref count 0? (Ray's ref count = Python ref count + references held by objects in the cluster)
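The second point can be modeled as a simple two-part counter. A hypothetical sketch (not Ray's actual `ReferenceCounter`): the object is eligible for GC only when both the local Python references and the cluster-held references reach zero.

```python
class RefCount:
    """Hypothetical model of a combined reference count, for illustration only."""

    def __init__(self):
        self.local_refs = 0    # Python-side references in this process
        self.cluster_refs = 0  # references held by tasks/objects in the cluster

    def total(self) -> int:
        return self.local_refs + self.cluster_refs

    def out_of_scope(self) -> bool:
        # GC should only reclaim the object once the total count hits zero.
        return self.total() == 0


rc = RefCount()
rc.local_refs += 1    # a Python variable holds the ObjectRef
rc.cluster_refs += 1  # an in-flight task also references it
rc.local_refs -= 1    # the Python variable goes out of scope

print(rc.out_of_scope())  # False: a cluster reference still pins the object
```

Under this model, a ref count of 0 would mean neither side holds a reference, which is the condition the investigation above needs to confirm or rule out.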
