
[Tune] Passing a handle to grid search causes trials to get stuck in RUNNING and PENDING mode #29545

Open
jharaldson opened this issue Oct 21, 2022 · 12 comments
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical), serve (Ray Serve Related Issue), tune (Tune-related issues)

Comments

@jharaldson

What happened + What you expected to happen

The bug
Tune trials get stuck in RUNNING and PENDING mode after initially running and TERMINATING a number of trials (the number seems to be related to the number of CPUs available).

Context
We use Ray Tune to trigger parallel execution of simulations that take in a Ray Serve handle to allow ML model predictions inside the simulations. A ray.get(handle.remote()) call is used inside the simulator code to run predictions, and tune.with_parameters(simulation, handle=handle) is used to pass the handle. A sketch of the trainable is shown below.
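For context, a minimal sketch of what the simulation trainable looks like (the payload passed to the handle and the loop are illustrative placeholders, not our actual simulator code):

import ray

def simulation(config, handle=None):
    # The Serve handle is used for ML predictions inside the simulation.
    seed = config["seed"]
    for step in range(100):  # placeholder for the real simulation loop
        prediction = ray.get(handle.remote({"seed": seed, "step": step}))
        # ...use `prediction` to advance the simulation state...
    # metrics would be reported back to Tune here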

Expected behaviour
In Ray 1.13.0 the same code works, except that we call tune.run() instead of creating a Tuner object and calling Tuner.fit() (see the sketch below). For clarity: the tune.run() approach also fails in Ray 2.0.0. The expectation is that Ray 2.0.0 would show the same behaviour as Ray 1.13.0.
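Roughly, the Ray 1.13.0 invocation looks like this (a sketch, not the exact code; simulation and handle are defined as in the reproduction script further down):

from ray import tune

# Ray 1.13-style call that works; the Tuner-based equivalent in 2.0.0 hangs.
tune.run(
    tune.with_parameters(simulation, handle=handle),
    config={"seed": tune.grid_search(list(range(50)))},
)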

Logs

== Status ==
Current time: 2022-10-21 11:19:46 (running for 00:04:17.23)
Memory usage on this node: 9.0/45.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 11.0/12 CPUs, 0/0 GPUs, 0.0/13.47 GiB heap, 0.0/6.73 GiB objects
Result logdir: /home/ejohara/ray_results/simulation_2022-10-21_11-15-28
Number of trials: 39/50 (16 PENDING, 11 RUNNING, 12 TERMINATED)
+------------------------+------------+----------------------+--------+
| Trial name             | status     | loc                  |   seed |
|------------------------+------------+----------------------+--------|
| simulation_ab491_00011 | RUNNING    | 172.31.205.43:529570 |     11 |
| simulation_ab491_00012 | RUNNING    | 172.31.205.43:529672 |     12 |
| simulation_ab491_00013 | RUNNING    | 172.31.205.43:529677 |     13 |
| simulation_ab491_00014 | RUNNING    | 172.31.205.43:529686 |     14 |
| simulation_ab491_00015 | RUNNING    | 172.31.205.43:529683 |     15 |
| simulation_ab491_00016 | RUNNING    | 172.31.205.43:529674 |     16 |
| simulation_ab491_00017 | RUNNING    | 172.31.205.43:529700 |     17 |
| simulation_ab491_00023 | PENDING    |                      |     23 |
| simulation_ab491_00024 | PENDING    |                      |     24 |
| simulation_ab491_00025 | PENDING    |                      |     25 |
| simulation_ab491_00026 | PENDING    |                      |     26 |
| simulation_ab491_00027 | PENDING    |                      |     27 |
| simulation_ab491_00028 | PENDING    |                      |     28 |
| simulation_ab491_00029 | PENDING    |                      |     29 |
| simulation_ab491_00000 | TERMINATED | 172.31.205.43:529570 |      0 |
| simulation_ab491_00001 | TERMINATED | 172.31.205.43:529672 |      1 |
| simulation_ab491_00002 | TERMINATED | 172.31.205.43:529674 |      2 |
| simulation_ab491_00003 | TERMINATED | 172.31.205.43:529677 |      3 |
| simulation_ab491_00004 | TERMINATED | 172.31.205.43:529680 |      4 |
| simulation_ab491_00005 | TERMINATED | 172.31.205.43:529683 |      5 |
| simulation_ab491_00006 | TERMINATED | 172.31.205.43:529686 |      6 |
+------------------------+------------+----------------------+--------+
... 19 more trials not shown (4 RUNNING, 9 PENDING, 5 TERMINATED)

Versions / Dependencies

Python=3.8.10
Ray=2.0.0
Ray installed in a virtual environment according to the instructions at https://docs.ray.io/en/latest/ray-overview/installation.html

Reproduction script

import ray
from ray import serve, tune
from ray.tune import Tuner

def simulation(config, handle=None):
    # Stub trainable; the real simulator calls ray.get(handle.remote(...)).
    pass

@serve.deployment
class Model:
    def __init__(self):
        pass

    async def __call__(self, request):
        pass

def main():
    ray.init()
    handle = serve.run(Model.bind())
    # Passing the Serve handle through tune.with_parameters triggers the hang.
    tuner = Tuner(
        tune.with_parameters(simulation, handle=handle),
        param_space={"seed": tune.grid_search(list(range(50)))},
    )
    tuner.fit()

if __name__ == "__main__":
    main()

Issue Severity

High: It blocks me from completing my task.

@jharaldson jharaldson added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 21, 2022
@hora-anyscale hora-anyscale added tune Tune-related issues serve Ray Serve Related Issue air labels Oct 21, 2022
@matthewdeng (Contributor) commented Oct 22, 2022

I was able to reproduce this.

I looked at one of the hanging Trial processes (using py-spy dump --pid <pid>) and saw that it was hanging during deserialization:

Thread 0x700011ABA000 (idle): "Thread-5"
    get_objects (ray/_private/worker.py:669)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    __init__ (ray/serve/handle.py:141)
    _deserialize (ray/serve/handle.py:251)
    _deserialize_pickle5_data (ray/_private/serialization.py:186)
    _deserialize_msgpack_data (ray/_private/serialization.py:196)
    _deserialize_object (ray/_private/serialization.py:241)
    deserialize_objects (ray/_private/serialization.py:352)
    deserialize_objects (ray/_private/worker.py:641)
    get_objects (ray/_private/worker.py:683)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    get (ray/tune/registry.py:234)
    inner (ray/tune/trainable/util.py:358)
    _inner (ray/tune/trainable/util.py:368)
    _trainable_func (ray/tune/trainable/function_trainable.py:684)
    _resume_span (ray/util/tracing/tracing_helper.py:466)
    entrypoint (ray/tune/trainable/function_trainable.py:362)
    run (ray/tune/trainable/function_trainable.py:289)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

I then simplified the script to only try to put/get the Serve handle, and was able to reproduce the hang:

import ray
from ray import serve

@serve.deployment
class Model:
    def __init__(self):
        pass

    async def __call__(self, request):
        pass

def main():
    ray.init()
    handle = serve.run(Model.bind())
    # Round-tripping the handle through the object store hangs on the get.
    ref = ray.put(handle)
    ray.get(ref)
    print("Never gets here.")

if __name__ == "__main__":
    main()

The hung process shows the following stack:
Thread 0x112B77600 (idle): "MainThread"
    get_objects (ray/_private/worker.py:669)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    __init__ (ray/serve/handle.py:141)
    _deserialize (ray/serve/handle.py:251)
    _deserialize_pickle5_data (ray/_private/serialization.py:186)
    _deserialize_msgpack_data (ray/_private/serialization.py:196)
    _deserialize_object (ray/_private/serialization.py:241)
    deserialize_objects (ray/_private/serialization.py:352)
    deserialize_objects (ray/_private/worker.py:641)
    get_objects (ray/_private/worker.py:683)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    main (serve_handle.py:16)
    <module> (serve_handle.py:20)

I was able to run the same script with Ray 1.13 successfully.

I'm not sure why the first few trials ran successfully, but based on this I wouldn't expect any of the trials to run.

@sihanwang41 are Serve handles expected to be (de)serializable?

@sihanwang41 (Contributor)

Currently in Ray Serve we have https://github.com/ray-project/ray/blob/master/python/ray/serve/tests/test_handle.py#L50 to make sure the handle is serializable.

I can reproduce the issue locally with @matthewdeng's script, but I am a bit out of clues, so I would like to borrow some help from the Core side.

The script gets stuck at this step: https://github.com/ray-project/ray/blob/master/python/ray/serve/handle.py#L140. I double-checked that the controller receives the request and returns the response, but ray.get is stuck during the handle initialization; if line 140 is commented out, the script runs successfully.

@scv119 (Contributor) commented Oct 26, 2022

At a high level, I think the problem is that we are calling ray.get inside another (Ray) object's deserialization, which itself happens inside a ray.get. This is likely undefined behavior and we should avoid using it; a sketch of the pattern is below. cc @rkooo567 @stephanie-wang in case this is actually well supported.
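To illustrate the pattern (a hypothetical class, not Serve code; whether this exact sketch hangs may depend on the Ray version, it is only meant to show the shape of the nested get):

import ray

class NeedsClusterState:
    """Hypothetical object whose deserialization issues its own ray.get."""

    def __init__(self, state_ref=None):
        if state_ref is not None:
            # This blocking get runs while the enclosing ray.get is still
            # deserializing the object -- the nested-get pattern at issue.
            self.state = ray.get(state_ref)
        else:
            self.state = None

    def __reduce__(self):
        # Serialize by stashing the state in the object store and rebuilding
        # the object from that reference on the receiving side.
        return NeedsClusterState, (ray.put(self.state),)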

@scv119 scv119 assigned scv119 and rkooo567 and unassigned matthewdeng and sihanwang41 Oct 26, 2022
@rkooo567 (Contributor) commented Oct 27, 2022

I tried ray.get on an actor handle (sketch below), and it seems to work (though I am not sure it is supposed to be supported; otherwise we should raise an exception when this happens. cc @stephanie-wang). So I assume this is Serve-specific. Since it used to work in 1.13, why don't you try bisecting between 1.13 and 2.0?
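For reference, a minimal sketch of that actor-handle check (a reconstruction, not the exact script that was run):

import ray

@ray.remote
class Counter:
    def value(self):
        return 42

ray.init()
actor = Counter.remote()

# put/get a plain actor handle: this round trip appears to work fine.
ref = ray.put(actor)
restored = ray.get(ref)
print(ray.get(restored.value.remote()))  # 42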

@jharaldson (Author)

What are alternative approaches (HTTP requests aside) to calling a Ray Serve deployment from remote tasks / Ray actors without passing a handle? The simulation code in the remote task / Ray actor needs the result of the Serve prediction for further processing.

@scv119 (Contributor) commented Oct 27, 2022

@sihanwang41 we need to move https://github.com/ray-project/ray/blame/master/python/ray/serve/handle.py#L140 out of the deserialization code path.

@scv119 scv119 assigned sihanwang41 and unassigned scv119 and rkooo567 Oct 27, 2022
@sihanwang41 (Contributor)

Hi @jharaldson, can you try the approach in https://discuss.ray.io/t/deploy-delete-and-use-deployments-in-ray-serve-2-0-0/7557/6?u=sihan_wang? You can grab a handle directly from the deployment name (see the sketch below). Let me know if it works for you.
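Applied to the reproduction script above, the workaround would look roughly like this (a sketch assuming the Ray 2.0 serve.get_deployment API and that the deployment keeps its default name "Model"; not a verified fix):

import ray
from ray import serve, tune
from ray.tune import Tuner

@serve.deployment
class Model:
    async def __call__(self, request):
        pass

def simulation(config):
    # Look the handle up by deployment name inside the trial instead of
    # serializing it through tune.with_parameters.
    handle = serve.get_deployment("Model").get_handle()
    ray.get(handle.remote({"seed": config["seed"]}))

def main():
    ray.init()
    serve.run(Model.bind())
    tuner = Tuner(
        simulation,
        param_space={"seed": tune.grid_search(list(range(50)))},
    )
    tuner.fit()

if __name__ == "__main__":
    main()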

Hi @scv119 and @rkooo567, thank you for taking a look! The Serve handle is passed around between different actors as an argument, which works well; it only fails when using ray.put and ray.get, as in this case. If this is not supported, I will try to make changes to work around it.

@jharaldson (Author) commented Oct 28, 2022

Thanks @sihanwang41, your proposed solution works.

@hora-anyscale hora-anyscale removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Oct 28, 2022
@architkulkarni (Contributor)

@sihanwang41 do you mind assigning a priority to this (or closing it if it's already addressed)?

@sihanwang41 sihanwang41 added the P2 Important issue, but not time-critical label Oct 28, 2022
@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed P2 Important issue, but not time-critical tune Tune-related issues serve Ray Serve Related Issue labels Nov 1, 2022
@rkooo567 rkooo567 added serve Ray Serve Related Issue air and removed air labels Nov 1, 2022
@rkooo567 (Contributor) commented Nov 2, 2022

The root cause was this code path:

    def _poll_next(self):
        """Poll the update. The callback is expected to scheduler another
        _poll_next call.
        """
        self._current_ref = self.host_actor.listen_for_change.remote(self.snapshot_ids)
        self._current_ref._on_completed(lambda update: self._process_update(update))

_on_completed does not seem to work in this scenario.

@rkooo567 rkooo567 added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks Ray 2.2 labels Nov 2, 2022
@rkooo567 (Contributor) commented Nov 2, 2022

@sihanwang41 will do further investigation. On the OSS side, I think the issue is that _on_completed has a bug when it is deserialized. We may need to take a look at it as a follow-up.

@jdonzallaz

Hi, it seems we have a similar issue. It is not related to Tune, but involves the Core and Serve components when passing a Serve handle to tasks. The first time a Serve handle is passed, it works correctly, but it gets stuck the second time the handle is passed. I made a basic reproduction script to isolate the problem:

Reproduction script
import time
import ray
from ray import serve

@ray.remote(num_cpus=0.1)
class A:
    def __init__(self):
        self.b = B.remote()
        deployment = ModelDeployment.options(name="model1")
        deployment.deploy()
        self.handle = deployment.get_handle()

    def f(self):
        ray.get(self.b.g.remote(self.handle))
        print("A.f()")

@ray.remote(num_cpus=0.1)
class B:
    def g(self, handle):
        print(f"B.g()")

@serve.deployment(route_prefix=None, ray_actor_options={"num_cpus": 0.1})
class ModelDeployment:
    def __call__(self, inputs):
        return "pred"

if __name__ == "__main__":
    ray.init()
    serve.start(detached=True)

    a = A.remote()
    ray.get(a.f.remote())
    ray.get(a.f.remote())

    while True:
        time.sleep(1)

It only shows (A pid=18780) A.f() (B pid=32756) B.g() once, but it should appear twice. When killing the script, the stack trace shows that it is stuck in a ray.get call.
If I remove the handle parameter from the B.g() method, it works correctly.
It also works if we use deployment.get_handle() in B instead of passing in the handle (see the sketch below).
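For completeness, a sketch of that second workaround, with B looking the handle up itself (names follow the reproduction script above; the 1.x-style get_handle API is assumed, matching that script):

import ray
from ray import serve

@ray.remote(num_cpus=0.1)
class B:
    def g(self):
        # Look the handle up by deployment name instead of receiving it
        # as a (de)serialized argument.
        handle = serve.get_deployment("model1").get_handle()
        print("B.g()")

# In A.f(), the call then becomes: ray.get(self.b.g.remote())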

@anyscalesam anyscalesam added tune Tune-related issues and removed air labels Oct 27, 2023