[Core][DAG] Deprecate DAGNode.execute() #63716

Merged
edoakes merged 7 commits into ray-project:master from Phucvt123:fix/dag-function-node-cache-remote
Jun 4, 2026

Conversation

@Phucvt123
Contributor

@Phucvt123 Phucvt123 commented May 29, 2026

Why are these changes needed?

In the non-compiled DAG path, calling DAGNode.execute() runs FunctionNode._execute_impl(), which defines a new remote function via ray.remote(self._body) on every execution. Each run exports new function metadata to the GCS KV store, so the store grows without bound for the lifetime of the job (#63666).

Since the non-compiled DAG execution path is not recommended for production and is a primary source of this leak, we are deprecating DAGNode.execute() in favor of the compiled DAG API.

This PR adds a DeprecationWarning to DAGNode.execute() to warn users that it is deprecated and will be removed in a future release.
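
A minimal sketch of the deprecation pattern described above, assuming a simplified node class. `SketchNode` and `SketchDeprecationWarning` are hypothetical names used only for illustration; they are not Ray's `DAGNode` or the warning category used in the merged change.

```python
import warnings


class SketchDeprecationWarning(DeprecationWarning):
    """Hypothetical stand-in for the warning category used by the PR."""


class SketchNode:
    """Hypothetical, simplified node. This is not Ray's DAGNode."""

    def __init__(self, body, *args):
        self._body = body
        self._args = args

    def execute(self):
        # Warn on every call, then keep the old behavior so existing
        # callers continue to work until the method is removed.
        warnings.warn(
            "execute() is deprecated and will be removed in a future release.",
            SketchDeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self._body(*self._args)


def run_and_capture(node):
    """Run node.execute() and return (result, categories of warnings raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = node.execute()
    return result, [w.category for w in caught]
```

The method still returns its result, so the warning only signals the upcoming removal and does not change behavior.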

Related issue number

Closes #63666

@Phucvt123 Phucvt123 requested a review from a team as a code owner May 29, 2026 02:52
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a class-level cache (_remote_func_cache) in FunctionNode to prevent re-registering the same function in GCS KV on every execution, along with corresponding unit tests. The reviewer pointed out that using a class-level dictionary to store RemoteFunction objects creates a permanent memory leak because it holds strong references to the function bodies, preventing garbage collection. The reviewer suggested storing the cached RemoteFunction directly as an attribute on the function body itself (with a fallback to the class-level cache for objects that do not support setting attributes) and provided corresponding updates for the unit tests.
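
The difference between re-wrapping a function on every execution and reusing one wrapper can be sketched with a toy model. Everything here is hypothetical: `fake_kv_store` stands in for the GCS KV store and `FakeRemote` for a remote-function wrapper, and the assumption that a wrapper exports its metadata once, on first invocation, is a simplification rather than a description of Ray's internals.

```python
import itertools

fake_kv_store = {}  # hypothetical stand-in for the GCS KV store
_export_ids = itertools.count()


class FakeRemote:
    """Hypothetical wrapper that exports metadata once, on first call."""

    def __init__(self, body):
        self._body = body
        self._exported = False

    def remote(self, *args):
        if not self._exported:
            # Each new wrapper writes a new key, even for the same body.
            fake_kv_store[f"fn:{next(_export_ids)}"] = self._body.__name__
            self._exported = True
        return self._body(*args)


def double(x):
    return 2 * x


def execute_uncached(n):
    # A fresh wrapper per execution: one new store entry per call.
    return [FakeRemote(double).remote(i) for i in range(n)]


def execute_cached(n):
    # One wrapper reused across executions: a single store entry.
    wrapper = FakeRemote(double)
    return [wrapper.remote(i) for i in range(n)]
```

In this model, 50 uncached executions leave 50 entries in the store, while 50 cached executions leave one.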

Comment thread python/ray/dag/function_node.py Outdated
Comment on lines 51 to 58
+ func_id = id(self._body)
+ if func_id not in FunctionNode._remote_func_cache:
+     FunctionNode._remote_func_cache[func_id] = ray.remote(self._body)
  return (
-     ray.remote(self._body)
+     FunctionNode._remote_func_cache[func_id]
      .options(**self._bound_options)
      .remote(*self._bound_args, **self._bound_kwargs)
  )

high

Using a class-level dictionary (_remote_func_cache) to store RemoteFunction objects creates a permanent memory leak.

Because RemoteFunction holds a strong reference to the underlying function body (self._body), and the class-level cache holds a strong reference to the RemoteFunction, any function executed via a DAG will never be garbage collected. In long-running applications that dynamically define functions or use closures, this will lead to unbounded memory growth.

To fix this, we can attempt to store the cached RemoteFunction directly as an attribute on the function body itself. This ties the lifetime of the cached RemoteFunction to the function body. For objects that do not support setting attributes (like built-ins), we can fall back to the class-level cache.

Suggested change
- func_id = id(self._body)
- if func_id not in FunctionNode._remote_func_cache:
-     FunctionNode._remote_func_cache[func_id] = ray.remote(self._body)
- return (
-     FunctionNode._remote_func_cache[func_id]
-     .options(**self._bound_options)
-     .remote(*self._bound_args, **self._bound_kwargs)
- )
+ func = self._body
+ try:
+     if not hasattr(func, "_ray_remote_func"):
+         func._ray_remote_func = ray.remote(func)
+     remote_func = func._ray_remote_func
+ except AttributeError:
+     func_id = id(func)
+     if func_id not in FunctionNode._remote_func_cache:
+         FunctionNode._remote_func_cache[func_id] = ray.remote(func)
+     remote_func = FunctionNode._remote_func_cache[func_id]
+ return (
+     remote_func
+     .options(**self._bound_options)
+     .remote(*self._bound_args, **self._bound_kwargs)
+ )
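
The lifetime argument in this comment can be checked without Ray. The sketch below is an illustration under stated assumptions: `FakeRemoteFunction` is a hypothetical wrapper that holds a strong reference to its body, and `_class_level_cache` mimics a class-level dict keyed by `id(body)`. It compares whether the body can be garbage collected under each caching strategy.

```python
import gc
import weakref


class FakeRemoteFunction:
    """Hypothetical wrapper that holds a strong reference to the body."""

    def __init__(self, body):
        self.body = body


_class_level_cache = {}  # mimics a class-level dict keyed by id(body)


def cache_in_dict(body):
    key = id(body)
    if key not in _class_level_cache:
        _class_level_cache[key] = FakeRemoteFunction(body)
    return _class_level_cache[key]


def cache_on_body(body):
    # The wrapper lives on the function object itself, so the two form
    # a reference cycle that the cyclic garbage collector can reclaim.
    if not hasattr(body, "_cached_wrapper"):
        body._cached_wrapper = FakeRemoteFunction(body)
    return body._cached_wrapper


def make_body():
    def body(x):
        return 2 * x

    return body


def is_collected_after_drop(cache_fn):
    """Cache a fresh function, drop our reference, and report whether it died."""
    body = make_body()
    ref = weakref.ref(body)
    cache_fn(body)
    del body
    gc.collect()
    return ref() is None
```

With the dict-based cache the function stays alive after the caller drops it; with the attribute-based cache it is collected once the cycle is broken by the garbage collector.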

Comment on lines +32 to +40
assert len(FunctionNode._remote_func_cache) == 0

ray.get(node.execute())
assert len(FunctionNode._remote_func_cache) == 1
assert id(double) in FunctionNode._remote_func_cache

# Execute again - cache size should NOT increase
ray.get(node.execute())
assert len(FunctionNode._remote_func_cache) == 1

medium

Update the test assertions to verify the attribute-based caching on the function body instead of checking the class-level cache size.

Suggested change
- assert len(FunctionNode._remote_func_cache) == 0
- ray.get(node.execute())
- assert len(FunctionNode._remote_func_cache) == 1
- assert id(double) in FunctionNode._remote_func_cache
- # Execute again - cache size should NOT increase
- ray.get(node.execute())
- assert len(FunctionNode._remote_func_cache) == 1
+ assert not hasattr(double, "_ray_remote_func")
+ ray.get(node.execute())
+ assert hasattr(double, "_ray_remote_func")
+ # Execute again - cache should still be active
+ ray.get(node.execute())
+ assert hasattr(double, "_ray_remote_func")

Comment on lines +54 to +55
# 50 executions but only 1 cache entry
assert len(FunctionNode._remote_func_cache) == 1

medium

Update the test assertion to verify the attribute-based caching on the function body.

Suggested change
- # 50 executions but only 1 cache entry
- assert len(FunctionNode._remote_func_cache) == 1
+ # 50 executions but only 1 cache entry on the function itself
+ assert hasattr(add_one, "_ray_remote_func")

@Phucvt123 Phucvt123 force-pushed the fix/dag-function-node-cache-remote branch 2 times, most recently from 41bf907 to e970ee2 Compare May 29, 2026 03:02
Comment thread python/ray/dag/function_node.py Outdated
@Phucvt123 Phucvt123 force-pushed the fix/dag-function-node-cache-remote branch 3 times, most recently from a2cc37b to 1860443 Compare May 29, 2026 04:29
@ray-gardener ray-gardener Bot added the core and community-contribution labels May 29, 2026
@edoakes
Collaborator

edoakes commented May 29, 2026

I don't think this execute() functionality is actually used anywhere, probably best to deprecate/remove it

@Phucvt123
Contributor Author

> I don't think this execute() functionality is actually used anywhere, probably best to deprecate/remove it

Thanks for the feedback! That makes sense — if execute() on the non-compiled DAG path is no longer used anywhere, removing it would be a cleaner fix than caching.

Would you prefer:

  1. A deprecation warning on FunctionNode._execute_impl() / DAGNode.execute() in this PR, or
  2. A full removal of the execute() path in a separate PR?

@edoakes
Collaborator

edoakes commented Jun 2, 2026

@Phucvt123 let's add a deprecation warning and leave it around for a couple of releases before we remove

Signed-off-by: Vũ Trần Phúc <Vuphuccc@gmail.com>
@Phucvt123 Phucvt123 force-pushed the fix/dag-function-node-cache-remote branch 2 times, most recently from 498aa23 to 412656d Compare June 2, 2026 04:19
Comment thread python/ray/dag/function_node.py Outdated
@Phucvt123 Phucvt123 force-pushed the fix/dag-function-node-cache-remote branch 3 times, most recently from c447594 to da0fa9e Compare June 2, 2026 06:59
Signed-off-by: Vũ Trần Phúc <Vuphuccc@gmail.com>
@Phucvt123 Phucvt123 force-pushed the fix/dag-function-node-cache-remote branch from da0fa9e to a5e0d96 Compare June 2, 2026 07:18
Phucvt123 and others added 2 commits June 2, 2026 14:19
@edoakes edoakes added the go label (add ONLY when ready to merge, run all tests) Jun 3, 2026
@edoakes
Collaborator

edoakes commented Jun 3, 2026

Pushed a change to adjust the warning message. Please also update the PR title and description

@Phucvt123 Phucvt123 changed the title from "[Core][DAG] Cache RemoteFunction in FunctionNode to prevent GCS KV leak" to "[Core][DAG] Deprecate DAGNode.execute()" Jun 3, 2026
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f0bf17a.

Comment thread python/ray/dag/dag_node.py
Signed-off-by: Vũ Trần Phúc <Vuphuccc@gmail.com>
@Phucvt123
Contributor Author

@edoakes Thanks! I have updated the PR title and description to reflect the deprecation of DAGNode.execute().

I also fixed a minor syntax error in the warning message and switched it to RayDeprecationWarning so it is visible to users. Please let me know if this looks good to you!
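
For context on the visibility point: Python's default warning filters ignore `DeprecationWarning` unless it is triggered by code in `__main__`, so a library that wants its deprecations shown typically uses a subclass with its own filter. The sketch below illustrates that mechanism only. `VisibleDeprecationWarning` is a hypothetical class, and the filters are set up by hand to mimic the default behavior; this is not a description of how Ray configures its own warning category.

```python
import warnings


class VisibleDeprecationWarning(DeprecationWarning):
    """Hypothetical subclass that gets its own, more permissive filter."""


def count_shown(category):
    """Return how many warnings of `category` pass the filters below."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.resetwarnings()
        # Mimic the default behavior for library code: hide DeprecationWarning.
        warnings.filterwarnings("ignore", category=DeprecationWarning)
        # Filters added later are checked first, so the subclass is shown.
        warnings.filterwarnings("default", category=VisibleDeprecationWarning)
        warnings.warn("old API", category)
    return len(caught)
```

Under these filters the plain `DeprecationWarning` is dropped, while the subclass is recorded once.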

@edoakes edoakes enabled auto-merge (squash) June 4, 2026 00:51
@edoakes edoakes merged commit ab3d514 into ray-project:master Jun 4, 2026
7 checks passed
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
## Why are these changes needed?

In the non-compiled DAG path, calling `DAGNode.execute()` executes
`FunctionNode.execute()`, which dynamically defines a new remote
function via `ray.remote(self._body)` on every execution. This exports
new metadata to the GCS KV store on every run, leading to an unbounded
memory leak (GCS KV leak) during the job lifetime (ray-project#63666).

Since the non-compiled DAG execution path is not recommended for
production and is a primary source of this leak, we are deprecating
`DAGNode.execute()` in favor of the compiled DAG API.

This PR adds a `DeprecationWarning` to `DAGNode.execute()` to warn users
that it is deprecated and will be removed in a future release.

## Related issue number
Closes ray-project#63666

---------

Signed-off-by: Vũ Trần Phúc <Vuphuccc@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] GCS kv store has no cleanup or eviction strategy for pickled function during job lifetime

2 participants