
[RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL #1762

Merged
merged 7 commits into ray-project:master on Dec 26, 2023

Conversation

@kevin85421 (Member) commented Dec 21, 2023

Why are these changes needed?

  • Without this PR, the transition between Running and Complete is pretty confusing. If ShutdownAfterJobFinishes is false, the RayJob's JobDeploymentStatus stays JobDeploymentStatusRunning even though the Ray job is Succeeded or Failed. Both Succeeded and Failed are terminal states, so the job cannot transition any further. However, KubeRay still executes a lot of unnecessary code, including sending requests to the RayCluster to check the status of the Ray job.

  • This PR redefines JobDeploymentStatusComplete: it now means that the JobStatus is Succeeded or Failed, and the TTL is only checked when the RayJob is in JobDeploymentStatusComplete (see the sketch after this list).

  • Currently, the lifecycles of the submitter K8s Job and its Pod are not well defined. For example, the K8s Job is never deleted, regardless of TTL or suspend. In addition, the default deletion policy for a Kubernetes Job is orphanDependents, so the Pod is not deleted in cascade when the Kubernetes Job is deleted. In this PR, we set the TTL for the K8s Job to the same value as RayJob.Spec.TTLSecondsAfterFinished. I will chat with users to finalize the behavior in follow-up PRs.
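Below is a minimal Go sketch of the redefined semantics. The helper name matches isJobSucceedOrFail from the diff further down, but the import path and the exact status constants should be read as assumptions, not the verbatim KubeRay code.

```go
package sketch

// Assumed import path for the KubeRay v1 API types; adjust to the actual module layout.
import rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"

// Sketch only: with this PR, JobDeploymentStatusComplete means the Ray JobStatus has
// reached a terminal state (Succeeded or Failed). Once there, the controller only
// needs to check the TTL instead of asking the RayCluster for the job status again.
func isJobSucceedOrFail(status rayv1.JobStatus) bool {
	// Both statuses are terminal, so no further state transition is possible.
	return status == rayv1.JobStatusSucceeded || status == rayv1.JobStatusFailed
}
```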

Related issue number

Closes #1233

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -135,28 +158,10 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
// include STOPPED which is also a terminal status because `suspend` requires to stop the Ray job gracefully before
// delete the RayCluster.
if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) {
// If the function `updateState` updates the JobStatus to Complete successfully, we can skip the reconciliation.
kevin85421 (Member Author):

With this PR, we only delete the RayCluster (TTL) if the JobDeploymentStatus is JobDeploymentStatusComplete. Hence, we don't need to check whether it has been deleted if the status is not JobDeploymentStatusComplete.
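A rough Go sketch of that gating, reusing the rayv1 import from the sketch above plus the standard time package. Field names such as EndTime and TTLSecondsAfterFinished are assumptions for illustration, not necessarily the exact controller fields.

```go
// Sketch only: the TTL is evaluated exclusively for a RayJob that is already in
// JobDeploymentStatusComplete; otherwise nothing is deleted.
func ttlExpired(rayJob *rayv1.RayJob, now time.Time) bool {
	if rayJob.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusComplete {
		// Not Complete yet: the TTL is not even considered, so nothing is deleted.
		return false
	}
	// EndTime and TTLSecondsAfterFinished are assumed field names for illustration.
	ttl := time.Duration(rayJob.Spec.TTLSecondsAfterFinished) * time.Second
	return now.Sub(rayJob.Status.EndTime.Time) >= ttl
}
```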

@@ -268,39 +272,7 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
}
}

// Let's use rayJobInstance.Status.JobStatus to make sure we only delete cluster after the CR is updated.
kevin85421 (Member Author):

This PR moves this logic to L125.

// Otherwise only reconcile the RayJob upon new events for watched resources
// to avoid infinite reconciliation.
return ctrl.Result{}, nil
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
kevin85421 (Member Author):

The only case in which we don't requeue the CR is when it is in JobDeploymentStatusComplete and the TTL has expired.
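Building on the ttlExpired sketch above, the requeue decision at the end of Reconcile could look roughly like this fragment (illustrative only, not the verbatim diff; rayJobInstance, ctrl, and RayJobDefaultRequeueDuration come from the Reconcile context shown in the hunks):

```go
// Sketch only: by default the RayJob CR is requeued so the controller keeps
// reconciling (for example, to poll the Ray job status). The single case without
// a requeue is a Complete RayJob whose TTL has already expired.
if ttlExpired(rayJobInstance, time.Now()) {
	return ctrl.Result{}, nil
}
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
```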

@@ -388,6 +360,12 @@ func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *
},
}

// Without TTLSecondsAfterFinished, the job has a default deletion policy of `orphanDependents` causing
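The comment above is about giving the submitter Kubernetes Job the same TTL as the RayJob. A hedged sketch of that idea using the standard batch/v1 API follows; the helper name and container details are illustrative, and the real createNewK8sJob builds a fuller spec.

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch only: give the submitter Job the same TTL as RayJob.Spec.TTLSecondsAfterFinished,
// so that the TTL-after-finished controller cleans up the Job (and, cascadingly, its Pod)
// on the same schedule as the RayCluster.
func submitterJobSketch(name, namespace string, ttlSeconds int32) *batchv1.Job {
	ttl := ttlSeconds // copied from RayJob.Spec.TTLSecondsAfterFinished
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "ray-job-submitter", Image: "rayproject/ray:latest"},
					},
				},
			},
		},
	}
}
```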

@kevin85421 changed the title WIP [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL Dec 22, 2023
@kevin85421 marked this pull request as ready for review December 22, 2023 01:10
@kevin85421 (Member Author) commented:

By the way, I want to give a shout-out to @astefanutti. This is my first time running and updating RayJob e2e tests by myself. The process is quite user-friendly, and writing new tests is pretty straightforward!

@astefanutti (Contributor) replied:

@kevin85421 Thanks a lot for the feedback, really appreciated! That's also a great encouragement for us to contribute more to the e2e tests 😃!

@architkulkarni (Contributor) left a comment:

Looks good to me!

@kevin85421 merged commit d49a7af into ray-project:master Dec 26, 2023
25 checks passed

Successfully merging this pull request may close these issues.

[Bug] RayJob does not enter Complete state after job application failure