[Core] race condition between job exits and actor creation. #24890

Open
scv119 opened this issue May 17, 2022 · 3 comments
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core core-worker P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared size:medium

Comments

scv119 (Contributor) commented May 17, 2022

What happened + What you expected to happen

Original discussion:
https://discuss.ray.io/t/detached-actor-correct-definition-and-declaration-cant-reproduce-consistently/6166

I’m trying to create a detached actor so that I can use it in another driver script. This is still local testing. The code works inconsistently (it fails sometimes), and I don’t know how to reproduce a consistent success or failure.

I expected the following code to work:

# tf1/main.py
import tensorflow as tf
import time
DEPLOY_TIME = time.time()
class Predictor:
    def __init__(self):
        pass

    def work(self):
        return tf.__version__ + f"|{DEPLOY_TIME}"



# ray_url = "ray://localhost:10002"

if __name__ == "__main__":
    print("Deploy Time:" + str(DEPLOY_TIME))

    import ray
    with ray.init(namespace='indexing'):
        try:
            old = ray.get_actor("tf1")
            print("Killing TF1")
            ray.kill(old)
        except ValueError:
            print("Not Killing TF1 as it's not present")


        PredictorActor = ray.remote(Predictor)
        PredictorActor.options(name="tf1", lifetime="detached").remote()

If I add the below three lines at the end, it works consistently.

        a = ray.get_actor("tf1")
        print("Named Actor Call")
        print(ray.get(a.work.remote()))

I’m calling the above code from another driver script:

# indexing/main.py
import ray

ray.init(namespace="indexing")
print("Ray Namespace")
print(ray.get_runtime_context().namespace)

print("In Pipeline Indexing Both")
a = ray.get_actor("tf1")
print(ray.get(a.work.remote()))

a = ray.get_actor("tf2")
print(ray.get(a.work.remote()))

My run script

indexing/run.sh

cd /home/rajiv/Documents/dev/bht/wdml/steps/tf1 &&
source ./venv/bin/activate &&
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["tensorflow==1.15"], "excludes": ["venv"]}' -- python main.py &&
cd /home/rajiv/Documents/dev/bht/wdml/pipelines/indexing &&
source /home/rajiv/venvs/indexing/bin/activate &&
ray job submit --runtime-env-json='{"working_dir": "./", "pip": []}' -- python main.py

The error I get is:

Traceback (most recent call last):
  File "main.py", line 10, in <module>
    a = ray.get_actor("tf1")
  File "/home/rajiv/venvs/tf2/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/rajiv/venvs/tf2/lib/python3.7/site-packages/ray/worker.py", line 2031, in get_actor
    return worker.core_worker.get_named_actor_handle(name, namespace or "")
  File "python/ray/_raylet.pyx", line 1875, in ray._raylet.CoreWorker.get_named_actor_handle
  File "python/ray/_raylet.pyx", line 171, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'tf1'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.

Details:

Ray 1.12 Default
All code is submitted via the Ray job API.

Versions / Dependencies

1.12

Reproduction script

N/A

Issue Severity

No response

@scv119 scv119 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 17, 2022
scv119 (Contributor, Author) commented May 17, 2022

I think this is probably a race condition where your first script exits before the actor has been successfully created, because the .remote() call is asynchronous. Calling ray.get() on an actor method forces the script to block until the actor has been created successfully.

The workaround is to always call ray.get() on an actor method to ensure the actor is up before the launch script exits.

I think we should wait a little bit for actors to get registered successfully prior to exiting the job.
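For illustration, a minimal sketch of that workaround at the end of tf1/main.py, inside the with ray.init(namespace='indexing') block from the report above (it simply reuses the existing work() method as the blocking call; any actor method would do):

        # Create the detached actor as before.
        PredictorActor = ray.remote(Predictor)
        actor = PredictorActor.options(name="tf1", lifetime="detached").remote()

        # Block until the detached actor is actually registered and running before
        # the driver exits: .remote() alone returns immediately, so without this
        # the job can finish before actor creation completes.
        ray.get(actor.work.remote())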

jovany-wang (Contributor) commented

It seems the solution should be similar to #13902.

@fishbone fishbone removed their assignment Dec 7, 2022
@rkooo567 rkooo567 added the api-bug Bug in which APIs behavior is wrong label Mar 24, 2023
@rkooo567 rkooo567 added P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared and removed P1 Issue that should be fixed within a few weeks api-bug Bug in which APIs behavior is wrong Ray 2.5 labels Apr 8, 2023
stale bot commented Aug 10, 2023

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Aug 10, 2023
@rkooo567 rkooo567 removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Aug 10, 2023