
[Jobs] Jobs submitted with the same ID in quick succession will both fail, with unfriendly error #31356

Closed · opened Dec 29, 2022 by architkulkarni · 1 comment · Fixed by #33259

Labels: bug (Something that is supposed to be working; but isn't), core-clusters (For launching and managing Ray clusters/jobs/kubernetes), P1 (Issue that should be fixed within a few weeks)

Comments

@architkulkarni (Contributor)

What happened + What you expected to happen

If two jobs are submitted with the same ID in quick succession, the second one fails with an unfriendly internal error (though it hints at the root cause). Even if the first job would have succeeded, its status is overwritten with the failed status from the second job.

❯ ray job submit --submission-id blah3 --no-wait -- echo hi again & ray job submit --submission-id blah3 --no-wait -- echo hi again

❯ ray job status blah3
Job submission server address: http://127.0.0.1:8265

------------------
Job 'blah3' failed
------------------

Status message: Failed to start Job Supervisor actor: The name _ray_internal_job_actor_blah3 (namespace=SUPERVISOR_ACTOR_RAY_NAMESPACE) is already taken. Please use a different name or get the existing actor using ray.get_actor('_ray_internal_job_actor_blah3', namespace='SUPERVISOR_ACTOR_RAY_NAMESPACE').

I would expect (1) the first command to succeed and its status to reflect that, and (2) the second to fail with RuntimeError: Job 'blah3' already exists. This is what currently happens if the first command is given a second or so to run and update its internal JobInfo, but it should also happen when the two commands are issued right after one another.
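
For reference, here is a minimal Python sketch of the same race via the Job Submission SDK. JobSubmissionClient and submit_job are real Ray APIs; the submission_id keyword (mirroring the CLI's --submission-id flag), the threaded setup, and the expectation that the duplicate surfaces as the RuntimeError described above are illustrative assumptions, not verified behavior:

```python
# Hedged sketch: reproduce the duplicate-submission race programmatically.
from concurrent.futures import ThreadPoolExecutor

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")


def submit():
    # Both calls race on the same submission ID; only one should win.
    return client.submit_job(entrypoint="echo hi again", submission_id="blah3")


with ThreadPoolExecutor(max_workers=2) as pool:
    for future in [pool.submit(submit) for _ in range(2)]:
        try:
            print("submitted:", future.result())
        except RuntimeError as exc:
            # Expected for the loser: Job 'blah3' already exists.
            # Observed today: both jobs can end up FAILED with the internal
            # "name ... is already taken" actor error shown above.
            print("rejected:", exc)
```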

Versions / Dependencies

master, macOS, Python 3.8

Reproduction script

See the command above.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

architkulkarni added the bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), and core-clusters (For launching and managing Ray clusters/jobs/kubernetes) labels on Dec 29, 2022
@rkooo567 (Contributor) commented Jan 6, 2023

Why don't we prioritize this and fix it? It seems pretty bad (it impacts correctness) and looks like a simple fix.
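
For context, a hedged sketch of the kind of fix being suggested; this is not Ray's actual job manager code (JobSupervisor and start_supervisor are hypothetical stand-ins, and only the actor-name prefix and namespace are taken from the error message in the report). The idea: if the supervisor's named-actor creation collides, translate the ValueError into the friendly "already exists" error instead of failing both jobs.

```python
import ray

# Namespace and actor-name prefix taken from the error message in the report.
SUPERVISOR_NAMESPACE = "SUPERVISOR_ACTOR_RAY_NAMESPACE"


@ray.remote
class JobSupervisor:  # hypothetical stand-in for the internal job supervisor actor
    def ping(self):
        return "ok"


def start_supervisor(submission_id: str):
    """Create the per-job supervisor actor, surfacing a friendly error on collision."""
    actor_name = f"_ray_internal_job_actor_{submission_id}"
    try:
        # Creating a detached named actor raises ValueError if the name is
        # already taken -- exactly the race hit in this issue.
        return JobSupervisor.options(
            name=actor_name,
            namespace=SUPERVISOR_NAMESPACE,
            lifetime="detached",
        ).remote()
    except ValueError:
        # Re-raise as the friendly error the reporter expects, leaving the
        # first job's status untouched.
        raise RuntimeError(f"Job '{submission_id}' already exists.") from None


if __name__ == "__main__":
    ray.init()
    start_supervisor("blah3")      # first submission wins
    try:
        start_supervisor("blah3")  # duplicate ID
    except RuntimeError as err:
        print(err)                 # Job 'blah3' already exists.
```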
