[Job] Ray Job fails to start because of Job supervisor actor creation race condition. #34172
Comments
@architkulkarni I see this happening in a lot of release tests, and it is probably a recent error too. I'd appreciate it if we could prioritize this and potentially get the fix into the 2.4 release.
@can-anyscale, does it happen on GCE or AWS? If AWS, I'm concerned about why it started happening recently, since the code is old.
This is the only recent change that could be relevant: #33259. I don't yet see how it could cause the new error to happen more frequently. (We're seeing the new error in cases where there's only a single job, and in that case the PR doesn't affect anything.)
@sihanwang41 @akshay-anyscale could you help provide an ETA? Thanks
@sihanwang41: the examples I gave are from GCE because I'm testing GCE specifically. I'm not sure whether it happens on AWS as well; let me look around.
@zhe-thoughts The ETA for the fix to be merged is by the end of the day tomorrow.
Here are two other error instances I could find on AWS:
What happened + What you expected to happen
Job failed with
env: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGN[…]n7s7ta93e95d?command-history-section=head_start_up_log
Root cause:
To reproduce, apply #34190 and run ray job submit -- ls

Inside _recover_running_jobs, we have await self._job_info_client.get_all_jobs(), which yields control and lets submit_job kick in. submit_job puts the new job info into the internal KV and then hits another await call before the supervisor actor is created. At that point the recovery task resumes, picks up the new job ID, and starts monitoring it. Since the supervisor actor does not exist yet, the monitor puts the job into the failed state at https://github.com/ray-project/ray/blob/master/dashboard/modules/job/job_manager.py#L564
and then we see the error raised at https://github.com/ray-project/ray/blob/master/dashboard/modules/job/job_manager.py#L363
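The interleaving described above can be sketched with a small standalone asyncio simulation. This is only an illustration under assumptions, not Ray's code: ToyJobManager, its kv dict, its supervisors set, and the status strings are all hypothetical stand-ins, and asyncio.sleep(0) stands in for whatever awaited calls actually suspend the two coroutines.

```python
import asyncio


class ToyJobManager:
    """Hypothetical stand-in for a job manager; not Ray's actual code."""

    def __init__(self):
        self.kv = {}              # job_id -> status, stands in for the internal KV
        self.supervisors = set()  # job_ids whose supervisor actor exists

    async def get_all_jobs(self):
        await asyncio.sleep(0)    # suspension point: other coroutines may run here
        return dict(self.kv)

    async def recover_running_jobs(self):
        for job_id in await self.get_all_jobs():
            # Recovery assumes every listed job already has a supervisor.
            if job_id not in self.supervisors:
                self.kv[job_id] = "FAILED"

    async def submit_job(self, job_id):
        self.kv[job_id] = "PENDING"   # job info is written to the KV first
        await asyncio.sleep(0)        # another await before the actor exists
        self.supervisors.add(job_id)  # supervisor is created only now


async def racy():
    # Recovery and submission overlap, as in the failing runs.
    mgr = ToyJobManager()
    await asyncio.gather(mgr.recover_running_jobs(), mgr.submit_job("job-1"))
    return mgr.kv["job-1"]


async def serialized():
    # Recovery finishes before any submission is accepted.
    mgr = ToyJobManager()
    await mgr.recover_running_jobs()
    await mgr.submit_job("job-1")
    return mgr.kv["job-1"]


racy_status = asyncio.run(racy())
serialized_status = asyncio.run(serialized())
print(racy_status, serialized_status)
```

In the overlapping run the job ends up marked as failed even though nothing is wrong with it, while the serialized run leaves it pending. Serializing the two paths is just one way to show that the ordering is the problem; it is not necessarily how the actual fix works.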
Versions / Dependencies
N/A
Reproduction script
N/A
Issue Severity
None