[Core] ray.wait not actually wait until ready when the task is longer than 12 days #44909

Open
Michaelvll opened this issue Apr 22, 2024 · 3 comments
Labels
core: Issues that should be addressed in Ray Core
core-api
docs: An issue or change related to documentation
P3: Issue moderate in impact or severity

Comments

@Michaelvll
Contributor

What happened + What you expected to happen

For a task that runs longer than 12 days, ray.wait returns an empty list of ready object refs after 10**6 seconds (about 11.5 days) when timeout is not specified.

This is inconsistent with what ray.get will do when timeout is not specified.

Versions / Dependencies

ray==2.9.3 (but I suppose it happens for all the ray versions)
python 3.10
OS Ubuntu 20.04

Reproduction script

timeout = timeout if timeout is not None else 10**6
timeout_milliseconds = int(timeout * 1000)
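To make the figure concrete, here is a minimal sketch of the arithmetic behind the two lines above. The helper names (`effective_timeout_ms`, `seconds_to_days`) and the constant name are hypothetical and are not part of Ray; only the 10**6 default and the millisecond conversion come from the quoted snippet.

```python
from typing import Optional

# Default taken from the snippet quoted above; the constant name is hypothetical.
DEFAULT_WAIT_TIMEOUT_S = 10**6


def effective_timeout_ms(timeout: Optional[float]) -> int:
    """Mirror the quoted lines: substitute the default, then convert to ms."""
    timeout = timeout if timeout is not None else DEFAULT_WAIT_TIMEOUT_S
    return int(timeout * 1000)


def seconds_to_days(seconds: float) -> float:
    """Convert seconds to days (86,400 seconds per day)."""
    return seconds / 86_400


default_ms = effective_timeout_ms(None)
default_days = seconds_to_days(default_ms / 1000)
print(f"{default_ms} ms = {default_days:.2f} days")  # 1000000000 ms = 11.57 days
```

So a call without an explicit timeout gives up after roughly 11.57 days, which is shorter than a 12-day task.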

Issue Severity

None

Michaelvll added the bug and triage labels on Apr 22, 2024
anyscalesam added the core label on Apr 23, 2024
jjyao added the core-api and P1 labels and removed the triage label on Apr 29, 2024
@jjyao
Contributor

jjyao commented Apr 29, 2024

Hi @Michaelvll, what's the cluster setup? Does the task run on the same node where ray.wait is called?

@Michaelvll
Contributor Author

> Hi @Michaelvll, what's the cluster setup? Does the task run on the same node where ray.wait is called?

Yes, the task is run on the same node as the driver, but I believe this happens for multi-node cases as well, due to the code quoted above. ray.get does not have the issue.

hongchaodeng added the docs label and removed the bug and P1 labels on May 1, 2024
@hongchaodeng
Member

If it has always been set to 10**6 seconds, we should probably keep it as is so that we do not break compatibility.

It makes sense to have a default timeout like that so the API call does not hang forever. Nonetheless, we should change the docs to mention this.
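For callers who do need to block past the default, one possible workaround is to call the wait function in a loop with an explicit, shorter timeout. The sketch below is an assumption-laden illustration, not something proposed in this thread: `wait_until_ready` and `poll_timeout_s` are hypothetical names, and the helper takes the wait function as an argument, assuming it behaves like ray.wait (accepts `num_returns` and `timeout` keywords and returns a `(ready, not_ready)` pair).

```python
from typing import Any, Callable, List, Sequence, Tuple

# Assumed shape of a ray.wait-like callable: returns (ready, not_ready).
WaitFn = Callable[..., Tuple[List[Any], List[Any]]]


def wait_until_ready(
    wait_fn: WaitFn,
    refs: Sequence[Any],
    num_returns: int = 1,
    poll_timeout_s: float = 3600.0,
) -> Tuple[List[Any], List[Any]]:
    """Call wait_fn repeatedly until at least num_returns refs are ready.

    Each call uses an explicit timeout, so the total wait is not capped
    by any default inside wait_fn.
    """
    while True:
        ready, not_ready = wait_fn(
            list(refs), num_returns=num_returns, timeout=poll_timeout_s
        )
        if len(ready) >= num_returns:
            return ready, not_ready
        # Timed out without enough ready refs: wait again.
```

With Ray installed, usage would look like `ready, _ = wait_until_ready(ray.wait, [ref])`. Passing the wait function in keeps the helper easy to exercise with a stand-in in tests.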

hongchaodeng added the P3 label on May 1, 2024

4 participants