[Tune] Passing a handle to grid search causes trials to get stuck in running and pending mode #29545
Comments
I was able to reproduce this. I looked at a stack dump of one of the hanging Trial processes:
Thread 0x700011ABA000 (idle): "Thread-5"
get_objects (ray/_private/worker.py:669)
get (ray/_private/worker.py:2274)
wrapper (ray/_private/client_mode_hook.py:105)
__init__ (ray/serve/handle.py:141)
_deserialize (ray/serve/handle.py:251)
_deserialize_pickle5_data (ray/_private/serialization.py:186)
_deserialize_msgpack_data (ray/_private/serialization.py:196)
_deserialize_object (ray/_private/serialization.py:241)
deserialize_objects (ray/_private/serialization.py:352)
deserialize_objects (ray/_private/worker.py:641)
get_objects (ray/_private/worker.py:683)
get (ray/_private/worker.py:2274)
wrapper (ray/_private/client_mode_hook.py:105)
get (ray/tune/registry.py:234)
inner (ray/tune/trainable/util.py:358)
_inner (ray/tune/trainable/util.py:368)
_trainable_func (ray/tune/trainable/function_trainable.py:684)
_resume_span (ray/util/tracing/tracing_helper.py:466)
entrypoint (ray/tune/trainable/function_trainable.py:362)
run (ray/tune/trainable/function_trainable.py:289)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)

I then simplified the script to only try to put/get the Serve handle, and was able to reproduce the hang:
import ray
from ray import serve
@serve.deployment
class Model:
    def __init__(self):
        pass

    async def __call__(self, request):
        pass
def main():
    ray.init()
    handle = serve.run(Model.bind())
    ref = ray.put(handle)
    ray.get(ref)
    print("Never gets here.")
if __name__ == "__main__":
    main()

The main thread hangs with this stack:
Thread 0x112B77600 (idle): "MainThread"
get_objects (ray/_private/worker.py:669)
get (ray/_private/worker.py:2274)
wrapper (ray/_private/client_mode_hook.py:105)
__init__ (ray/serve/handle.py:141)
_deserialize (ray/serve/handle.py:251)
_deserialize_pickle5_data (ray/_private/serialization.py:186)
_deserialize_msgpack_data (ray/_private/serialization.py:196)
_deserialize_object (ray/_private/serialization.py:241)
deserialize_objects (ray/_private/serialization.py:352)
deserialize_objects (ray/_private/worker.py:641)
get_objects (ray/_private/worker.py:683)
get (ray/_private/worker.py:2274)
wrapper (ray/_private/client_mode_hook.py:105)
main (serve_handle.py:16)
<module> (serve_handle.py:20)

I was able to run the same script with Ray 1.13 successfully. I'm not sure why the first few trials ran successfully, but based on this I wouldn't expect any of the trials to run. @sihanwang41 are Serve handles expected to be (de)serializable?
Currently in Ray Serve, we have https://github.com/ray-project/ray/blob/master/python/ray/serve/tests/test_handle.py#L50 to make sure the handle is serializable. I can reproduce the issue locally with @matthewdeng's script, but I'm a bit out of clues and would like to borrow some help from the core side. The script gets stuck at this step: https://github.com/ray-project/ray/blob/master/python/ray/serve/handle.py#L140. I double-checked that the controller received the request and returned the response, but ray.get is stuck during the handle initialization; if line 140 is commented out, the script runs successfully.
At a high level, I think the problem is we are calling …
I tried ray.get on an actor handle, and it seems to work (though I am not sure if it is supposed to be supported; otherwise we should raise an exception when this happens. cc @stephanie-wang). So I assume this is Serve-specific. Since it used to work in 1.13, why don't you try bisecting it between 1.13 and 2.0?
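For reference, a minimal sketch of the experiment described above, with a plain Ray actor handle instead of a Serve handle. As the comment notes, it is unclear whether this pattern is officially supported, so treat it as illustrative only:

```python
import ray

ray.init()


@ray.remote
class Counter:
    def ping(self):
        return "pong"


actor = Counter.remote()

# Round-trip the actor handle through the object store, similar to what
# Tune's registry does with the Serve handle in the stack trace above.
ref = ray.put(actor)
restored = ray.get(ref)
print(ray.get(restored.ping.remote()))  # prints "pong" if this pattern is supported
```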
What are alternative approaches to using a handle (not considering HTTP requests) to call a Ray Serve deployment from remote tasks / Ray actors? The simulation code in the remote task / Ray actor will need to get the result of the Serve prediction for further processing.
@sihanwang41 we need to move https://github.com/ray-project/ray/blame/master/python/ray/serve/handle.py#L140 out of the deserialization code path.
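A rough sketch of the kind of change being suggested, with made-up names and not the actual ServeHandle code: keep deserialization cheap and defer any blocking setup (such as a ray.get on the controller) until the handle is first used.

```python
import pickle


class _Router:
    """Toy stand-in for whatever the real handle would build lazily."""

    def __init__(self, deployment_name):
        self.deployment_name = deployment_name

    def send(self, *args, **kwargs):
        return f"routed {args} to {self.deployment_name}"


class LazyHandle:
    """Illustrative handle whose blocking setup is deferred to first use."""

    def __init__(self, deployment_name):
        self.deployment_name = deployment_name
        self._router = None  # nothing expensive or blocking happens here

    def _ensure_initialized(self):
        # In the real handle, this is where a blocking call (e.g. asking the
        # Serve controller for routing info) would go: lazily, on first use,
        # rather than inside __init__ during deserialization.
        if self._router is None:
            self._router = _Router(self.deployment_name)

    def remote(self, *args, **kwargs):
        self._ensure_initialized()
        return self._router.send(*args, **kwargs)

    def __reduce__(self):
        # Only cheap, picklable state is shipped, so ray.get(ray.put(handle))
        # never blocks inside the deserialization path.
        return LazyHandle, (self.deployment_name,)


h = pickle.loads(pickle.dumps(LazyHandle("model1")))  # cheap, non-blocking
print(h.remote("x"))
```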
Hi @jharaldson, can you try https://discuss.ray.io/t/deploy-delete-and-use-deployments-in-ray-serve-2-0-0/7557/6?u=sihan_wang? You can directly grab a handle based on the deployment name. Let me know if it works for you.

Hi @scv119 and @rkooo567, thank you for taking a look! The Serve handle is passed around between different actors as an argument, which works well; it only fails when using ray.put and ray.get in this case. If this is not supported, I will try to make changes to work around it.
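For completeness, a minimal sketch of the suggested workaround, assuming the Ray 2.0-era deployment API (deploy() / serve.get_deployment()) and using made-up names: the worker looks the handle up by deployment name instead of receiving a serialized handle.

```python
import ray
from ray import serve


@serve.deployment(name="model1")
class Model:
    def __call__(self, request):
        return "pred"


@ray.remote
class Simulator:
    def run(self):
        # Look the handle up by deployment name inside the worker instead of
        # shipping a pickled handle through ray.put / tune.with_parameters.
        handle = serve.get_deployment("model1").get_handle()
        return ray.get(handle.remote("some input"))


if __name__ == "__main__":
    ray.init()
    serve.start(detached=True)
    Model.deploy()
    print(ray.get(Simulator.remote().run.remote()))
```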
Thanks @sihanwang41, your proposed solution works. |
@sihanwang41 do you mind assigning a priority for this (or closing it if it's already addressed)? |
The root cause was …
on_completed does not seem to work in this scenario.
@sihanwang41 will do further investigation. On the OSS side, I think the issue is …
Hi, it seems we have a similar issue. It is not related to Tune, but to the core and Serve components, when passing a Serve handle in tasks. The first time a Serve handle is passed, it works correctly, but it gets stuck the second time the handle is passed. I made a basic reproduction script to isolate the problem:
Reproduction script
import time
import ray
from ray import serve
@ray.remote(num_cpus=0.1)
class A:
    def __init__(self):
        self.b = B.remote()
        deployment = ModelDeployment.options(name="model1")
        deployment.deploy()
        self.handle = deployment.get_handle()

    def f(self):
        ray.get(self.b.g.remote(self.handle))
        print("A.f()")
@ray.remote(num_cpus=0.1)
class B:
    def g(self, handle):
        print("B.g()")
@serve.deployment(route_prefix=None, ray_actor_options={"num_cpus": 0.1})
class ModelDeployment:
    def __call__(self, inputs):
        return "pred"
if __name__ == "__main__":
    ray.init()
    serve.start(detached=True)
    a = A.remote()
    ray.get(a.f.remote())
    ray.get(a.f.remote())
    while True:
        time.sleep(1)

It only shows …
What happened + What you expected to happen
The bug
Tune trials get stuck in RUNNING and PENDING mode after a number of trials initially run and terminate (the number seems to be related to the number of CPUs available).
Context
We use Ray Tune to trigger parallel execution of simulations that take in a Ray Serve handle to allow ML model predictions inside the simulations. A ray.get(handle.remote()) call is used inside the simulator code to run predictions, and tune.with_parameters(simulation, handle=handle) is used to pass the handle.
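The actual script is not included in the report, so the following is only a hedged reconstruction of the setup described above, using the Ray 2.0 Tuner API and made-up names (simulate, Model, the "x" search space, etc.):

```python
import ray
from ray import serve, tune
from ray.air import session


@serve.deployment
class Model:
    def __call__(self, x):
        return x * 2  # stand-in for a real ML prediction


def simulate(config, handle=None):
    # Inside the trial, the simulation asks the Serve deployment for a
    # prediction and keeps processing its result.
    pred = ray.get(handle.remote(config["x"]))
    session.report({"result": pred})


if __name__ == "__main__":
    ray.init()
    handle = serve.run(Model.bind())

    tuner = tune.Tuner(
        tune.with_parameters(simulate, handle=handle),
        param_space={"x": tune.grid_search(list(range(8)))},
    )
    tuner.fit()  # in Ray 2.0.0 this is where trials reportedly get stuck
```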
Expected behaviour
In Ray 1.13.0 the same code works, but there we call tune.run() instead of creating a Tuner object and calling its fit() method. For clarity, it is observed that the tune.run() approach also fails in Ray 2.0.0. The expectation is that Ray 2.0.0 would show the same behaviour as Ray 1.13.0.
Logs
Versions / Dependencies
Python=3.8.10
Ray=2.0.0
Ray installed in a virtual environment according to the instructions: https://docs.ray.io/en/latest/ray-overview/installation.html
Reproduction script
Issue Severity
High: It blocks me from completing my task.