
[Job] Ray Job fails to start because of Job supervisor actor creation race condition. #34172

Closed
sihanwang41 opened this issue Apr 7, 2023 · 8 comments · Fixed by #34223
Labels: bug (Something that is supposed to be working; but isn't), job, P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release)

Comments

@sihanwang41 (Contributor) commented Apr 7, 2023

What happened + What you expected to happen

Job failed with

2023-04-05 12:51:02,406	ERROR worker.py:409 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::JobSupervisor.run() (pid=7670, ip=10.138.0.93, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7f44e04f8d90>)
  File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/job/job_manager.py", line 363, in run
    assert curr_status == JobStatus.PENDING, "Run should only be called once."
AssertionError: Run should only be called once.

env: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGN[…]n7s7ta93e95d?command-history-section=head_start_up_log

Root cause:
With PR #34190 applied, run `ray job submit -- ls`.

Inside `_recover_running_jobs`, the call `await self._job_info_client.get_all_jobs()` releases the lock, which lets a concurrent `submit_job` call proceed. `submit_job` writes the new job info into the internal KV store and then hits another `await` before the supervisor actor is created. At that point the recovery task resumes, picks up the new job ID, starts monitoring it, and marks the job as failed:
https://github.com/ray-project/ray/blob/master/dashboard/modules/job/job_manager.py#L564
When the supervisor actor's `run()` finally executes, the job is no longer PENDING, and we hit the assertion at
https://github.com/ray-project/ray/blob/master/dashboard/modules/job/job_manager.py#L363
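The interleaving can be illustrated with a minimal, self-contained asyncio sketch. This is not the actual Ray code; `recover_running_jobs`, `submit_job`, and the in-memory `job_status` dict are hypothetical stand-ins for the `JobManager` logic described above:

```python
# Sketch of the race: an await inside "recovery" yields control, a submit
# interleaves, and recovery then marks the brand-new job as failed.
import asyncio

job_status = {}  # stand-in for the internal KV store


async def recover_running_jobs():
    # Stand-in for `await self._job_info_client.get_all_jobs()`: the await
    # suspends this coroutine, so a concurrent submit_job() can run first.
    await asyncio.sleep(0)
    # Recovery resumes, sees the *new* job (whose supervisor actor does not
    # exist yet), and marks it FAILED prematurely.
    for job_id, status in job_status.items():
        if status == "PENDING":
            job_status[job_id] = "FAILED"


async def submit_job(job_id):
    job_status[job_id] = "PENDING"   # new job info written to the KV store
    await asyncio.sleep(0)           # another await before the actor is created
    # Supervisor actor "starts" here; run() expects the job to still be PENDING.
    assert job_status[job_id] == "PENDING", "Run should only be called once."


async def main():
    # Start recovery first, then submit a job while recovery is suspended.
    await asyncio.gather(recover_running_jobs(), submit_job("job_1"))


# Running this raises AssertionError, mirroring the traceback in the report.
asyncio.run(main())
```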

Versions / Dependencies

N/A

Reproduction script

N/A

Issue Severity

None

@sihanwang41 sihanwang41 added bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 7, 2023
@sihanwang41 sihanwang41 self-assigned this Apr 7, 2023
@sihanwang41 (Contributor, Author)

cc: @architkulkarni @akshay-anyscale

@sihanwang41 sihanwang41 removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Apr 7, 2023
@akshay-anyscale akshay-anyscale added job and removed job labels Apr 7, 2023
@sihanwang41 sihanwang41 added P0 Issues that should be fixed in short order and removed P1 Issue that should be fixed within a few weeks labels Apr 7, 2023
@architkulkarni architkulkarni self-assigned this Apr 10, 2023
@can-anyscale (Collaborator)

@architkulkarni I see this happening in a lot of release tests; it's probably a recent error too. I'd appreciate it if we could prioritize this and potentially get it into the 2.4 release.

@architkulkarni architkulkarni added the release-blocker P0 Issue that blocks the release label Apr 10, 2023
@sihanwang41 (Contributor, Author)

@can-anyscale, does it happen on GCE or AWS? If AWS, it's concerning that it only started happening recently, since that code path is old.

@architkulkarni (Contributor) commented Apr 10, 2023

This is the only recent change that could be relevant: #33259

I don't yet see how it could cause the new error to happen more frequently. (We're seeing the new error in cases where there's only a single job, and in that case the PR doesn't change anything.)

@zhe-thoughts (Collaborator)

@sihanwang41 @akshay-anyscale could you help provide an ETA? Thanks

@can-anyscale (Collaborator)

@sihanwang41: the examples I gave are from GCE because I'm testing GCE specifically. I'm not sure whether it happens on AWS as well; let me look around.

@architkulkarni (Contributor)

> @sihanwang41 @akshay-anyscale could you help provide an ETA? Thanks

@zhe-thoughts The ETA for the fix to be merged is by the end of the day tomorrow.
