Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Object-store based function manager #18096

Open
ericl opened this issue Aug 25, 2021 · 14 comments
Open

Object-store based function manager #18096

ericl opened this issue Aug 25, 2021 · 14 comments
Labels
core Issues that should be addressed in Ray Core P2 Important issue, but not time-critical reliability size:large
Milestone

Comments

@ericl
Copy link
Contributor

ericl commented Aug 25, 2021

Currently, function and actor definitions are broadcast to all workers. This has a number of issues:

  1. CPU overhead on the GCS is O(n) with the size of the cluster when defining a function.
  2. Large functions are not supported.
  3. Function lifetime is not easily tracked (not ref-counted).

This has led to a number of user issues:

There is a short term mitigation to reduce the worst of these issues: #18095

However, longer term it makes sense to migrate the function manager to the object store. This involves a couple things:

  1. Function objects should be owned by the GCS, to support detached actors using remote functions.
  2. We should treat function objects as an object dependency of the task prior to dispatch.
  3. Modification of the function execution code to get functions from the object store instead of via redis.
@ericl ericl added enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks size:large labels Aug 25, 2021
@ericl ericl added this to the Core Backlog milestone Aug 25, 2021
@DonYum
Copy link

DonYum commented Sep 8, 2021

I'm confused by #8822 , so, when will this bug be fixed?

@ericl
Copy link
Contributor Author

ericl commented Sep 8, 2021

Let me re-open 8822, since it's looking like this solution may be too far out and we could do a shorter-term fix.

@ericl ericl removed the usability label Sep 12, 2021
@raulchen
Copy link
Contributor

As we are implementing RuntimeEnv, I think we should default to using RuntimeEnv to distribute code. The only exception is dynamically-defined functions. Does this make sense?
Also, by switching to RuntimeEnv for all static functions. I guess these issues will be greatly mitigated?
cc @edoakes

@ericl
Copy link
Contributor Author

ericl commented Nov 15, 2021

These functions are dynamically defined during the execution of the program though, and typically there are many for a single runtime_env. I don't think unifying these mechanisms makes sense.

Btw in Python there is no such thing as a static function really as relevant to Ray, everything is dynamically defined.

@raulchen
Copy link
Contributor

I know that everything in Python is dynamic. By "static", I was referring to functions that don't capture any dynamic variables. If we distribute the python code files with runtime env, we don't have to store the functions in GCS or object store. This will be more efficient and easier to maintain.

@ericl
Copy link
Contributor Author

ericl commented Nov 16, 2021

That's definitely not the case for Python though. Remote functions commonly closure-capture both global variables and local variables, and hence have to be defined at runtime.

@ericl ericl added P2 Important issue, but not time-critical reliability and removed P1 Issue that should be fixed within a few weeks enhancement Request for new feature and/or capability labels Nov 17, 2021
@raulchen
Copy link
Contributor

But __globals__ and __locals__ are not used usually. Take the following function for example, it doesn't matter whether the function to execute in worker is pickled from the driver, or the worker loads the function from the local file.

@ray.remote
def foo():
    return 1
    
ray.get(foo.remote())

@edoakes
Copy link
Contributor

edoakes commented Nov 18, 2021

@raulchen this is a very trivial example. Users capture globals/locals all the time, especially if dynamically defining things:

refs = []
for i in range(10):
    @ray.remote
    def process_chunk():
        return my_lib.process(i)
    refs.append(process_chunk.remote())

It may be considered "bad practice," but this is a very common usage pattern and it would hamper usability if we disallow it..

@kfstorm
Copy link
Member

kfstorm commented Nov 19, 2021

Do the objects survive GCS failover if we store functions in the object store and make GCS the owner?

@ericl
Copy link
Contributor Author

ericl commented Nov 19, 2021 via email

@raulchen
Copy link
Contributor

@edoakes right. I was thinking that maybe we can detect whether a function uses a captured variable, and choose different ways to distribute the function.

After thinking more about this, global variable support is still problematic in some cases, which will confuse users. For example:

global_var = 0
@ray.remote
def foo():
    return global_var
if __module__ == "__main__":
    print(ray.get(foo.remote())  # Prints 0, as "global_var = 0" is captured in the remote function.
    global_var = 1
    print(ray.get(foo.remote()) # This still prints 1, unless we re-export the function.

Also, changing a global variable in a remote function doesn't take effect either.

I don't have a good solution at this moment. But I think, as long as the original code use global variables, the user has to be careful when migrating the code to Ray.

@marsupialtail
Copy link

Has there been updates on this thing? I am being hit with memory leaks on ray head when I kill stuff, and I can't call ray stop manually.

@smacpher
Copy link

Hi, I've also been running into my Ray head node's memory slowly growing over time as I run the same driver script (using the Ray Client) with different arguments over time. Are there any workarounds to make clean up resources at the end of my Ray Client script? Thanks in advance for your time.

@rkooo567 rkooo567 added the core Issues that should be addressed in Ray Core label Dec 9, 2022
@mohitjain2504
Copy link

@marsupialtail @smacpher Were you able to identify any work around?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues that should be addressed in Ray Core P2 Important issue, but not time-critical reliability size:large
Projects
Status: In discussion
Development

No branches or pull requests

9 participants