
[Tune] Passing a handle to grid search causes trials to get stuck in RUNNING and PENDING mode #29545

Open
jharaldson opened this issue Oct 21, 2022 · 12 comments
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical), serve (Ray Serve Related Issue), tune (Tune-related issues)

Comments

@jharaldson

What happened + What you expected to happen

The bug
Tune trials get stuck in RUNNING and PENDING mode after initially running and TERMINATING a number of trials (the number seems to be related to the number of CPUs available).

Context
We use Ray Tune to trigger parallel execution of simulations that take in a Ray Serve handle to allow ML model predictions inside the simulations. A ray.get(handle.remote()) call is used inside the simulator code to run predictions, and tune.with_parameters(simulation, handle=handle) is used to pass the handle. A sketch of the trainable is shown below.
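For context, a minimal sketch of what the simulation trainable looks like (the payload passed to the handle and the loop are illustrative placeholders, not our actual simulator code):

import ray

def simulation(config, handle=None):
    # The Serve handle is used for ML predictions inside the simulation.
    seed = config["seed"]
    for step in range(100):  # placeholder for the real simulation loop
        prediction = ray.get(handle.remote({"seed": seed, "step": step}))
        # ...use `prediction` to advance the simulation state...
    # metrics would be reported back to Tune here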

Expected behaviour
In Ray 1.13.0 the same code works, except that we call tune.run() instead of creating a Tuner object and calling Tuner.fit() (see the sketch below). For clarity: the tune.run() approach also fails in Ray 2.0.0. The expectation is that Ray 2.0.0 would show the same behaviour as Ray 1.13.0.
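Roughly, the Ray 1.13.0 invocation looks like this (a sketch, not the exact code; simulation and handle are defined as in the reproduction script further down):

from ray import tune

# Ray 1.13-style call that works; the Tuner-based equivalent in 2.0.0 hangs.
tune.run(
    tune.with_parameters(simulation, handle=handle),
    config={"seed": tune.grid_search(list(range(50)))},
)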

Logs

== Status ==
Current time: 2022-10-21 11:19:46 (running for 00:04:17.23)
Memory usage on this node: 9.0/45.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 11.0/12 CPUs, 0/0 GPUs, 0.0/13.47 GiB heap, 0.0/6.73 GiB objects
Result logdir: /home/ejohara/ray_results/simulation_2022-10-21_11-15-28
Number of trials: 39/50 (16 PENDING, 11 RUNNING, 12 TERMINATED)
+------------------------+------------+----------------------+--------+
| Trial name             | status     | loc                  |   seed |
|------------------------+------------+----------------------+--------|
| simulation_ab491_00011 | RUNNING    | 172.31.205.43:529570 |     11 |
| simulation_ab491_00012 | RUNNING    | 172.31.205.43:529672 |     12 |
| simulation_ab491_00013 | RUNNING    | 172.31.205.43:529677 |     13 |
| simulation_ab491_00014 | RUNNING    | 172.31.205.43:529686 |     14 |
| simulation_ab491_00015 | RUNNING    | 172.31.205.43:529683 |     15 |
| simulation_ab491_00016 | RUNNING    | 172.31.205.43:529674 |     16 |
| simulation_ab491_00017 | RUNNING    | 172.31.205.43:529700 |     17 |
| simulation_ab491_00023 | PENDING    |                      |     23 |
| simulation_ab491_00024 | PENDING    |                      |     24 |
| simulation_ab491_00025 | PENDING    |                      |     25 |
| simulation_ab491_00026 | PENDING    |                      |     26 |
| simulation_ab491_00027 | PENDING    |                      |     27 |
| simulation_ab491_00028 | PENDING    |                      |     28 |
| simulation_ab491_00029 | PENDING    |                      |     29 |
| simulation_ab491_00000 | TERMINATED | 172.31.205.43:529570 |      0 |
| simulation_ab491_00001 | TERMINATED | 172.31.205.43:529672 |      1 |
| simulation_ab491_00002 | TERMINATED | 172.31.205.43:529674 |      2 |
| simulation_ab491_00003 | TERMINATED | 172.31.205.43:529677 |      3 |
| simulation_ab491_00004 | TERMINATED | 172.31.205.43:529680 |      4 |
| simulation_ab491_00005 | TERMINATED | 172.31.205.43:529683 |      5 |
| simulation_ab491_00006 | TERMINATED | 172.31.205.43:529686 |      6 |
+------------------------+------------+----------------------+--------+
... 19 more trials not shown (4 RUNNING, 9 PENDING, 5 TERMINATED)

Versions / Dependencies

Python=3.8.10
Ray=2.0.0
Ray installed in a virtual environment according to the instructions at https://docs.ray.io/en/latest/ray-overview/installation.html

Reproduction script

import ray
from ray import serve, tune
from ray.tune import Tuner

def simulation(config, handle=None):
    # Stub trainable; the real simulator calls ray.get(handle.remote(...)).
    pass

@serve.deployment
class Model:
    def __init__(self):
        pass

    async def __call__(self, request):
        pass

def main():
    ray.init()
    handle = serve.run(Model.bind())
    # Passing the Serve handle through tune.with_parameters triggers the hang.
    tuner = Tuner(
        tune.with_parameters(simulation, handle=handle),
        param_space={"seed": tune.grid_search(list(range(50)))},
    )
    tuner.fit()

if __name__ == "__main__":
    main()

Issue Severity

High: It blocks me from completing my task.

@jharaldson jharaldson added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 21, 2022
@hora-anyscale hora-anyscale added tune Tune-related issues serve Ray Serve Related Issue air labels Oct 21, 2022
@matthewdeng (Contributor) commented Oct 22, 2022

I was able to reproduce this.

I looked at one of the hanging Trial processes (using py-spy dump --pid <pid>) and saw that it was hanging during deserialization:

Thread 0x700011ABA000 (idle): "Thread-5"
    get_objects (ray/_private/worker.py:669)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    __init__ (ray/serve/handle.py:141)
    _deserialize (ray/serve/handle.py:251)
    _deserialize_pickle5_data (ray/_private/serialization.py:186)
    _deserialize_msgpack_data (ray/_private/serialization.py:196)
    _deserialize_object (ray/_private/serialization.py:241)
    deserialize_objects (ray/_private/serialization.py:352)
    deserialize_objects (ray/_private/worker.py:641)
    get_objects (ray/_private/worker.py:683)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    get (ray/tune/registry.py:234)
    inner (ray/tune/trainable/util.py:358)
    _inner (ray/tune/trainable/util.py:368)
    _trainable_func (ray/tune/trainable/function_trainable.py:684)
    _resume_span (ray/util/tracing/tracing_helper.py:466)
    entrypoint (ray/tune/trainable/function_trainable.py:362)
    run (ray/tune/trainable/function_trainable.py:289)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

I then simplified the script to only try to put/get the Serve handle, and was able to reproduce the hang:

import ray
from ray import serve

@serve.deployment
class Model:
    def __init__(self):
        pass

    async def __call__(self, request):
        pass

def main():
    ray.init()
    handle = serve.run(Model.bind())
    # Round-tripping the handle through the object store hangs on the get.
    ref = ray.put(handle)
    ray.get(ref)
    print("Never gets here.")

if __name__ == "__main__":
    main()

The hung process shows the following stack:
Thread 0x112B77600 (idle): "MainThread"
    get_objects (ray/_private/worker.py:669)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    __init__ (ray/serve/handle.py:141)
    _deserialize (ray/serve/handle.py:251)
    _deserialize_pickle5_data (ray/_private/serialization.py:186)
    _deserialize_msgpack_data (ray/_private/serialization.py:196)
    _deserialize_object (ray/_private/serialization.py:241)
    deserialize_objects (ray/_private/serialization.py:352)
    deserialize_objects (ray/_private/worker.py:641)
    get_objects (ray/_private/worker.py:683)
    get (ray/_private/worker.py:2274)
    wrapper (ray/_private/client_mode_hook.py:105)
    main (serve_handle.py:16)
    <module> (serve_handle.py:20)

I was able to run the same script with Ray 1.13 successfully.

I'm not sure why the first few trials ran successfully, but based on this I wouldn't expect any of the trials to run.

@sihanwang41 are Serve handles expected to be (de)serializable?

@sihanwang41 (Contributor)

Currently in Ray Serve we have https://github.com/ray-project/ray/blob/master/python/ray/serve/tests/test_handle.py#L50 to make sure the handle is serializable.

I can reproduce the issue locally with @matthewdeng's script, but I am a bit out of clues, so I would like to borrow some help from the Core side.

The script gets stuck at this step: https://github.com/ray-project/ray/blob/master/python/ray/serve/handle.py#L140. I double-checked that the controller receives the request and returns the response, but ray.get is stuck during the handle initialization; if line 140 is commented out, the script runs successfully.

@scv119 (Contributor) commented Oct 26, 2022

At a high level, I think the problem is that we are calling ray.get inside another (Ray) object's deserialization, which itself happens inside a ray.get. This is likely undefined behavior and we should avoid using it; a sketch of the pattern is below. cc @rkooo567 @stephanie-wang in case this is actually well supported.
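To illustrate the pattern (a hypothetical class, not Serve code; whether this exact sketch hangs may depend on the Ray version, it is only meant to show the shape of the nested get):

import ray

class NeedsClusterState:
    """Hypothetical object whose deserialization issues its own ray.get."""

    def __init__(self, state_ref=None):
        if state_ref is not None:
            # This blocking get runs while the enclosing ray.get is still
            # deserializing the object -- the nested-get pattern at issue.
            self.state = ray.get(state_ref)
        else:
            self.state = None

    def __reduce__(self):
        # Serialize by stashing the state in the object store and rebuilding
        # the object from that reference on the receiving side.
        return NeedsClusterState, (ray.put(self.state),)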

@scv119 scv119 assigned scv119 and rkooo567 and unassigned matthewdeng and sihanwang41 Oct 26, 2022
@rkooo567 (Contributor) commented Oct 27, 2022

I tried ray.get on an actor handle (sketch below), and it seems to work (though I am not sure it is supposed to be supported; otherwise we should raise an exception when this happens. cc @stephanie-wang). So I assume this is Serve-specific. Since it used to work in 1.13, why don't you try bisecting between 1.13 and 2.0?
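For reference, a minimal sketch of that actor-handle check (a reconstruction, not the exact script that was run):

import ray

@ray.remote
class Counter:
    def value(self):
        return 42

ray.init()
actor = Counter.remote()

# put/get a plain actor handle: this round trip appears to work fine.
ref = ray.put(actor)
restored = ray.get(ref)
print(ray.get(restored.value.remote()))  # 42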

@jharaldson (Author)

What are alternative approaches (HTTP requests aside) to calling a Ray Serve deployment from remote tasks / Ray actors without passing a handle? The simulation code in the remote task / Ray actor needs the result of the Serve prediction for further processing.

@scv119 (Contributor) commented Oct 27, 2022

@sihanwang41 we need to move https://github.com/ray-project/ray/blame/master/python/ray/serve/handle.py#L140 out of the deserialization code path.

@scv119 scv119 assigned sihanwang41 and unassigned scv119 and rkooo567 Oct 27, 2022
@sihanwang41 (Contributor)

Hi @jharaldson, can you try the approach in https://discuss.ray.io/t/deploy-delete-and-use-deployments-in-ray-serve-2-0-0/7557/6?u=sihan_wang? You can grab a handle directly from the deployment name (see the sketch below). Let me know if it works for you.
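Applied to the reproduction script above, the workaround would look roughly like this (a sketch assuming the Ray 2.0 serve.get_deployment API and that the deployment keeps its default name "Model"; not a verified fix):

import ray
from ray import serve, tune
from ray.tune import Tuner

@serve.deployment
class Model:
    async def __call__(self, request):
        pass

def simulation(config):
    # Look the handle up by deployment name inside the trial instead of
    # serializing it through tune.with_parameters.
    handle = serve.get_deployment("Model").get_handle()
    ray.get(handle.remote({"seed": config["seed"]}))

def main():
    ray.init()
    serve.run(Model.bind())
    tuner = Tuner(
        simulation,
        param_space={"seed": tune.grid_search(list(range(50)))},
    )
    tuner.fit()

if __name__ == "__main__":
    main()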

Hi @scv119 and @rkooo567, thank you for taking a look! The Serve handle is passed around between different actors as an argument, which works well; it only fails when using ray.put and ray.get, as in this case. If this is not supported, I will try to make changes to work around it.

@jharaldson (Author) commented Oct 28, 2022

Thanks @sihanwang41, your proposed solution works.

@hora-anyscale hora-anyscale removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Oct 28, 2022
@architkulkarni (Contributor)

@sihanwang41 do you mind assigning a priority to this (or closing it if it's already addressed)?

@sihanwang41 sihanwang41 added the P2 Important issue, but not time-critical label Oct 28, 2022
@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed P2 Important issue, but not time-critical tune Tune-related issues serve Ray Serve Related Issue labels Nov 1, 2022
@rkooo567 rkooo567 added serve Ray Serve Related Issue air and removed air labels Nov 1, 2022
@rkooo567 (Contributor) commented Nov 2, 2022

The root cause was this code path:

    def _poll_next(self):
        """Poll the update. The callback is expected to scheduler another
        _poll_next call.
        """
        self._current_ref = self.host_actor.listen_for_change.remote(self.snapshot_ids)
        self._current_ref._on_completed(lambda update: self._process_update(update))

_on_completed does not seem to work in this scenario.

@rkooo567 rkooo567 added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks Ray 2.2 labels Nov 2, 2022
@rkooo567 (Contributor) commented Nov 2, 2022

@sihanwang41 will do further investigation. On the OSS side, I think the issue is that _on_completed has a bug when it is deserialized. We may need to take a look at it as a follow-up.

@jdonzallaz

Hi, it seems we have a similar issue. It is not related to Tune, but involves the Core and Serve components when passing a Serve handle to tasks. The first time a Serve handle is passed, it works correctly, but it gets stuck the second time the handle is passed. I made a basic reproduction script to isolate the problem:

Reproduction script
import time
import ray
from ray import serve

@ray.remote(num_cpus=0.1)
class A:
    def __init__(self):
        self.b = B.remote()
        deployment = ModelDeployment.options(name="model1")
        deployment.deploy()
        self.handle = deployment.get_handle()

    def f(self):
        ray.get(self.b.g.remote(self.handle))
        print("A.f()")

@ray.remote(num_cpus=0.1)
class B:
    def g(self, handle):
        print(f"B.g()")

@serve.deployment(route_prefix=None, ray_actor_options={"num_cpus": 0.1})
class ModelDeployment:
    def __call__(self, inputs):
        return "pred"

if __name__ == "__main__":
    ray.init()
    serve.start(detached=True)

    a = A.remote()
    ray.get(a.f.remote())
    ray.get(a.f.remote())

    while True:
        time.sleep(1)

It only shows (A pid=18780) A.f() (B pid=32756) B.g() once, but it should appear twice. When killing the script, the stack trace shows that it is stuck in a ray.get call.
If I remove the handle parameter from the B.g() method, it works correctly.
It also works if we use deployment.get_handle() in B instead of passing in the handle (see the sketch below).
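For completeness, a sketch of that second workaround, with B looking the handle up itself (names follow the reproduction script above; the 1.x-style get_handle API is assumed, matching that script):

import ray
from ray import serve

@ray.remote(num_cpus=0.1)
class B:
    def g(self):
        # Look the handle up by deployment name instead of receiving it
        # as a (de)serialized argument.
        handle = serve.get_deployment("model1").get_handle()
        print("B.g()")

# In A.f(), the call then becomes: ray.get(self.b.g.remote())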

@anyscalesam anyscalesam added tune Tune-related issues and removed air labels Oct 27, 2023