[Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus Complete and a JobStatus SUCCEEDED
#1919
Why are these changes needed?
When ttlSecondsAfterFinished is set to 0, the submitter job may fail and retry, even though the RayJob's JobDeploymentStatus is Complete and its JobStatus is SUCCEEDED. Without this PR, if the Ray job is in a terminal state (e.g. SUCCEEDED), the RayJob transitions from Running to Complete and checks ttlSecondsAfterFinished to determine whether it is ready to delete the RayCluster. However, the submitter Kubernetes Job may fail because it still needs some time to receive the logs from the Ray head, but the head service has already been deleted.

In this PR, RayJob transitions from Running to Complete if and only if the submitter Job is Complete or Failed.
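The difference between the two behaviors can be sketched as a pair of predicates. This is a minimal, self-contained illustration and not the controller's actual code: the type names, the constants, and the functions `oldRule` and `newRule` are hypothetical, and the submitter Job's Kubernetes conditions are reduced to a single string state for brevity.

```go
package main

import "fmt"

// RayJobStatus is a hypothetical stand-in for the Ray job status (JobStatus).
type RayJobStatus string

const (
	RayJobRunning   RayJobStatus = "RUNNING"
	RayJobSucceeded RayJobStatus = "SUCCEEDED"
	RayJobFailed    RayJobStatus = "FAILED"
	RayJobStopped   RayJobStatus = "STOPPED"
)

// SubmitterState is a hypothetical simplification of the submitter
// Kubernetes Job's conditions into a single value.
type SubmitterState string

const (
	SubmitterActive   SubmitterState = "Active"
	SubmitterComplete SubmitterState = "Complete"
	SubmitterFailed   SubmitterState = "Failed"
)

// isTerminal reports whether the Ray job has reached a terminal status.
func isTerminal(s RayJobStatus) bool {
	return s == RayJobSucceeded || s == RayJobFailed || s == RayJobStopped
}

// oldRule models the behavior described as "without this PR": the RayJob
// leaves Running as soon as the Ray job is terminal, regardless of whether
// the submitter Job has finished collecting logs.
func oldRule(ray RayJobStatus, _ SubmitterState) bool {
	return isTerminal(ray)
}

// newRule models the behavior described as "in this PR": the RayJob leaves
// Running if and only if the submitter Job is Complete or Failed.
func newRule(_ RayJobStatus, sub SubmitterState) bool {
	return sub == SubmitterComplete || sub == SubmitterFailed
}

func main() {
	// The racy case: the Ray job has succeeded, but the submitter Job
	// is still running and reading logs from the Ray head.
	ray, sub := RayJobSucceeded, SubmitterActive
	fmt.Println("old rule transitions to Complete:", oldRule(ray, sub))
	fmt.Println("new rule transitions to Complete:", newRule(ray, sub))
}
```

Under the old rule, the racy case already allows the transition, so with ttlSecondsAfterFinished set to 0 the RayCluster can be deleted while the submitter is still running. Under the new rule, the transition waits until the submitter Job has finished.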
Related issue number
Checks
Without this PR
Complete (JobDeploymentStatus) and SUCCEEDED (JobStatus), one submitter Kubernetes Job fails among the 25 RayJob custom resources.
With this PR