
[RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL #1762

Merged
merged 7 commits into ray-project:master on Dec 26, 2023

Conversation

@kevin85421 (Member) commented Dec 21, 2023

Why are these changes needed?

  • Without this PR, the transition between Running and Complete is pretty confusing. If ShutdownAfterJobFinishes is false, the RayJob's JobDeploymentStatus stays JobDeploymentStatusRunning even though the Ray job is Succeeded or Failed. Both Succeeded and Failed are terminal states, so the job cannot transition any further. However, KubeRay still executes a lot of unnecessary code, including sending requests to the RayCluster to check the status of the Ray job.

  • This PR redefines JobDeploymentStatusComplete: it now means that the JobStatus is Succeeded or Failed, and the TTL is only checked when the RayJob is in JobDeploymentStatusComplete (see the sketch after this list).

  • Currently, the lifecycles of the submitter K8s Job and its Pod are not well defined. For example, the K8s Job is never deleted, regardless of TTL or suspend. In addition, the default deletion policy for a Kubernetes Job is orphanDependents, so the Pod is not deleted in cascade when the Kubernetes Job is deleted. In this PR, we set the TTL for the K8s Job to the same value as RayJob.Spec.TTLSecondsAfterFinished. I will chat with users to finalize the behavior in follow-up PRs.
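Below is a minimal Go sketch of the redefined semantics. The helper name matches isJobSucceedOrFail from the diff further down, but the import path and the exact status constants should be read as assumptions, not the verbatim KubeRay code.

```go
package sketch

// Assumed import path for the KubeRay v1 API types; adjust to the actual module layout.
import rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"

// Sketch only: with this PR, JobDeploymentStatusComplete means the Ray JobStatus has
// reached a terminal state (Succeeded or Failed). Once there, the controller only
// needs to check the TTL instead of asking the RayCluster for the job status again.
func isJobSucceedOrFail(status rayv1.JobStatus) bool {
	// Both statuses are terminal, so no further state transition is possible.
	return status == rayv1.JobStatusSucceeded || status == rayv1.JobStatusFailed
}
```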

Related issue number

Closes #1233

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -135,28 +158,10 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
// include STOPPED which is also a terminal status because `suspend` requires to stop the Ray job gracefully before
// delete the RayCluster.
if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) {
// If the function `updateState` updates the JobStatus to Complete successfully, we can skip the reconciliation.
kevin85421 (Member Author):

With this PR, we only delete the RayCluster (TTL) if the JobDeploymentStatus is JobDeploymentStatusComplete. Hence, we don't need to check whether it has been deleted if the status is not JobDeploymentStatusComplete.
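A rough Go sketch of that gating, reusing the rayv1 import from the sketch above plus the standard time package. Field names such as EndTime and TTLSecondsAfterFinished are assumptions for illustration, not necessarily the exact controller fields.

```go
// Sketch only: the TTL is evaluated exclusively for a RayJob that is already in
// JobDeploymentStatusComplete; otherwise nothing is deleted.
func ttlExpired(rayJob *rayv1.RayJob, now time.Time) bool {
	if rayJob.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusComplete {
		// Not Complete yet: the TTL is not even considered, so nothing is deleted.
		return false
	}
	// EndTime and TTLSecondsAfterFinished are assumed field names for illustration.
	ttl := time.Duration(rayJob.Spec.TTLSecondsAfterFinished) * time.Second
	return now.Sub(rayJob.Status.EndTime.Time) >= ttl
}
```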

@@ -268,39 +272,7 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
}
}

// Let's use rayJobInstance.Status.JobStatus to make sure we only delete cluster after the CR is updated.
kevin85421 (Member Author):

This PR moves this logic to L125.

// Otherwise only reconcile the RayJob upon new events for watched resources
// to avoid infinite reconciliation.
return ctrl.Result{}, nil
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
kevin85421 (Member Author):

The only case in which we don't requeue the CR is when it is in JobDeploymentStatusComplete and the TTL has expired.
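Building on the ttlExpired sketch above, the requeue decision at the end of Reconcile could look roughly like this fragment (illustrative only, not the verbatim diff; rayJobInstance, ctrl, and RayJobDefaultRequeueDuration come from the Reconcile context shown in the hunks):

```go
// Sketch only: by default the RayJob CR is requeued so the controller keeps
// reconciling (for example, to poll the Ray job status). The single case without
// a requeue is a Complete RayJob whose TTL has already expired.
if ttlExpired(rayJobInstance, time.Now()) {
	return ctrl.Result{}, nil
}
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
```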

@@ -388,6 +360,12 @@ func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *
},
}

// Without TTLSecondsAfterFinished, the job has a default deletion policy of `orphanDependents` causing
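The comment above is about giving the submitter Kubernetes Job the same TTL as the RayJob. A hedged sketch of that idea using the standard batch/v1 API follows; the helper name and container details are illustrative, and the real createNewK8sJob builds a fuller spec.

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch only: give the submitter Job the same TTL as RayJob.Spec.TTLSecondsAfterFinished,
// so that the TTL-after-finished controller cleans up the Job (and, cascadingly, its Pod)
// on the same schedule as the RayCluster.
func submitterJobSketch(name, namespace string, ttlSeconds int32) *batchv1.Job {
	ttl := ttlSeconds // copied from RayJob.Spec.TTLSecondsAfterFinished
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "ray-job-submitter", Image: "rayproject/ray:latest"},
					},
				},
			},
		},
	}
}
```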

@kevin85421 changed the title WIP [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL Dec 22, 2023
@kevin85421 marked this pull request as ready for review December 22, 2023 01:10
@kevin85421 (Member Author) commented:

By the way, I want to give a shout-out to @astefanutti. This is my first time running and updating RayJob e2e tests by myself. The process is quite user-friendly, and writing new tests is pretty straightforward!

@astefanutti (Contributor) replied:

@kevin85421 Thanks a lot for the feedback, really appreciated! That's also a great encouragement for us to contribute more to the e2e tests 😃!

@architkulkarni (Contributor) left a comment:

Looks good to me!

@kevin85421 merged commit d49a7af into ray-project:master Dec 26, 2023
25 checks passed

Successfully merging this pull request may close these issues.

[Bug] RayJob does not enter Complete state after job application failure