[Job] Repro race issue #34190
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
…t for pending jobs (#34223)

@rkooo567 and @sihanwang41 found a race condition when submitting a job that causes the job to fail. The failure happens with the following sequence of events:

1. A job is submitted. Its job_info is put to the internal KV before the JobSupervisor is actually created.
2. In the constructor of JobManager, we call await self._recover_running_jobs(), which finds the job_info in the internal KV and starts to monitor that job.
3. Because the JobSupervisor actor doesn't exist yet, the JobManager job monitoring loop fails to ping it and puts the status of this job as FAILED in the internal KV.
4. The JobSupervisor is created. JobSupervisor.run() checks that the status is PENDING, but it isn't, so it raises the error "run should only be called once", which is not helpful to the user.

If step 2 happens before step 1, there is no issue, but both steps are async, so the order isn't guaranteed.

The fix in this PR is to let the JobManager monitoring loop handle the PENDING case by skipping the ping to the JobSupervisor actor for that iteration of the loop. This PR adds a unit test that fails with #34190 (which forces the race condition). It also adds a timeout, configurable via an environment variable, that fails jobs that have been pending for more than 15 minutes.

Open questions:

- Why did this only start to fail recently? The only recent change is [Jobs] Fix race condition on submitting multiple jobs with the same id #33259, but it's not clear how that would matter in the single-job case.
- What is a reasonable default timeout for pending jobs, and should we even have one? It should be larger than the existing runtime_env setup timeout (10 minutes) in order to distinguish runtime env setup timeouts from other timeouts. There may be other existing timeouts we should consider.
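For illustration, here is a minimal sketch of a monitoring loop that skips the supervisor ping while a job is still PENDING and only fails the job after a configurable pending timeout. This is not Ray's actual JobManager code: JobStatus, monitor_job, get_status, get_supervisor, mark_failed, and the JOB_START_TIMEOUT_S environment variable are simplified, hypothetical stand-ins for the real internals.

```python
import asyncio
import os
import time
from enum import Enum


class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


# Hypothetical name for the configurable pending-job timeout described in the PR;
# the real environment variable may be named differently. Defaults to 15 minutes.
PENDING_TIMEOUT_S = float(os.environ.get("JOB_START_TIMEOUT_S", 15 * 60))


async def monitor_job(job_id, get_status, get_supervisor, mark_failed, submit_time):
    """Simplified monitoring loop: skip the supervisor ping while the job is PENDING."""
    while True:
        status = get_status(job_id)
        if status == JobStatus.PENDING:
            # The JobSupervisor actor may not exist yet, so do not ping it and
            # do not mark the job FAILED; just wait for the next iteration.
            if time.time() - submit_time > PENDING_TIMEOUT_S:
                mark_failed(job_id, f"Job was PENDING for more than {PENDING_TIMEOUT_S}s.")
                return
            await asyncio.sleep(1)
            continue

        supervisor = get_supervisor(job_id)
        if supervisor is None:
            # The job left PENDING but its supervisor is gone: treat as a real failure.
            mark_failed(job_id, "JobSupervisor actor is unreachable.")
            return

        # ... ping the supervisor and update the stored job status as before ...
        await asyncio.sleep(1)
```

With a 15-minute default, the pending timeout stays above the 10-minute runtime_env setup timeout mentioned above, so the two failure modes remain distinguishable.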
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for opening the issue!
Why are these changes needed?
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.