[RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL #1762
Conversation
@@ -135,28 +158,10 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
// include STOPPED which is also a terminal status because `suspend` requires to stop the Ray job gracefully before
// delete the RayCluster.
if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) {
	// If the function `updateState` updates the JobStatus to Complete successfully, we can skip the reconciliation.
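The terminal-status check above can be sketched as a minimal standalone helper. This is a hypothetical simplification for illustration: the literal status strings are assumptions, not necessarily KubeRay's exact constants.

```go
package main

import "fmt"

// Hypothetical simplification of the isJobSucceedOrFail helper referenced in the
// diff: a Ray job status is terminal when the job either succeeded or failed.
// The status string values below are illustrative assumptions.
const (
	JobStatusSucceeded = "SUCCEEDED"
	JobStatusFailed    = "FAILED"
	JobStatusRunning   = "RUNNING"
)

func isJobSucceedOrFail(status string) bool {
	return status == JobStatusSucceeded || status == JobStatusFailed
}

func main() {
	fmt.Println(isJobSucceedOrFail(JobStatusSucceeded)) // true
	fmt.Println(isJobSucceedOrFail(JobStatusRunning))   // false
}
```

Because `Succeeded` and `Failed` are terminal, reaching either means no further state transition is possible, which is why the reconciler can skip the remaining work.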
With this PR, we only delete the RayCluster (TTL) if the JobDeploymentStatus is `JobDeploymentStatusComplete`. Hence, we don't need to check whether it is deleted or not if the status is not `JobDeploymentStatusComplete`.
@@ -268,39 +272,7 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
	}
}

// Let's use rayJobInstance.Status.JobStatus to make sure we only delete cluster after the CR is updated.
Move the logic to L125 with this PR.
// Otherwise only reconcile the RayJob upon new events for watched resources
// to avoid infinite reconciliation.
return ctrl.Result{}, nil
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
The only case in which we don't requeue the CR is when it is in `JobDeploymentStatusComplete` and the TTL has expired.
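The requeue policy described here can be sketched in plain Go. This is a simplified stand-in, not KubeRay's actual reconciler: the `Result` struct mimics controller-runtime's `ctrl.Result`, and the status constants and default duration are illustrative assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for KubeRay's JobDeploymentStatus values.
const (
	JobDeploymentStatusRunning  = "Running"
	JobDeploymentStatusComplete = "Complete"
)

// Result mimics controller-runtime's ctrl.Result (simplified for this sketch).
type Result struct {
	RequeueAfter time.Duration
}

// Placeholder for RayJobDefaultRequeueDuration; the real value lives in KubeRay.
const defaultRequeueDuration = 3 * time.Second

// reconcileResult sketches the policy from the review comment: every state
// requeues after the default duration, except a Complete RayJob whose TTL
// has already expired, which is terminal and never requeued.
func reconcileResult(status string, ttlExpired bool) Result {
	if status == JobDeploymentStatusComplete && ttlExpired {
		return Result{} // terminal: no requeue
	}
	return Result{RequeueAfter: defaultRequeueDuration}
}

func main() {
	fmt.Println(reconcileResult(JobDeploymentStatusComplete, true).RequeueAfter)
	fmt.Println(reconcileResult(JobDeploymentStatusRunning, false).RequeueAfter)
}
```

Returning a non-zero `RequeueAfter` everywhere else is what replaces the old event-driven-only reconciliation shown in the diff above.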
@@ -388,6 +360,12 @@ func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *
	},
}

// Without TTLSecondsAfterFinished, the job has a default deletion policy of `orphanDependents` causing
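For context, this is roughly what a submitter Job manifest with a TTL looks like. The manifest below is an illustrative sketch, not generated by the PR: the names, image, command, and the TTL value `600` are placeholders; in the PR the TTL is taken from `RayJob.Spec.TTLSecondsAfterFinished`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: rayjob-sample-submitter   # placeholder name
spec:
  # Once the Job finishes (Succeeded or Failed), the Job TTL controller
  # deletes it after this many seconds, cleaning up its Pods in cascade
  # instead of leaving them behind under the default orphanDependents policy.
  ttlSecondsAfterFinished: 600    # placeholder; the PR mirrors RayJob.Spec.TTLSecondsAfterFinished
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: submitter
          image: rayproject/ray:2.9.0   # placeholder image
          command: ["ray", "job", "submit", "--address", "http://raycluster-head-svc:8265", "--", "echo", "done"]
```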
By the way, I want to give a shout-out to @astefanutti. This is my first time running and updating RayJob e2e tests by myself. The process is quite user-friendly, and writing new tests is pretty straightforward!
@kevin85421 Thanks a lot for the feedback, really appreciated! That's also a great encouragement for us to contribute more to the e2e tests 😃!
Looks good to me!
Why are these changes needed?
Without this PR, the transition between `Running` and `Complete` is pretty confusing. If `ShutdownAfterJobFinishes` is false, the RayJob's `JobDeploymentStatus` will be `JobDeploymentStatusRunning` although the Ray job is `Succeeded` or `Failed`. Both `Succeeded` and `Failed` are terminal states, so the job can't have any further state transition. However, KubeRay still has to execute a lot of unnecessary code, including sending requests to the RayCluster to check the status of the Ray job.

This PR redefines `JobDeploymentStatusComplete`: it now means that the JobStatus is `Succeeded` or `Failed`, and we only check the TTL when the RayJob is `JobDeploymentStatusComplete`.

Currently, the lifecycles of the submitter K8s Job and its Pod are not well defined. For example, the K8s Job is not deleted regardless of the TTL and `suspend`. In addition, the default deletion behavior for a Kubernetes Job is `orphanDependents`, so the Pod is not deleted in cascade after the Kubernetes Job is deleted. In this PR, we set the K8s Job's TTL to the same value as `RayJob.Spec.TTLSecondsAfterFinished`. I will chat with users to finalize the behavior in follow-up PRs.
Related issue number
Closes #1233
Checks