RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true #1881
Conversation
@kevin85421 thoughts? This is an alternative approach to #1832 which I think is much cleaner and more practical for most users.
Force-pushed from 16a62cb to 309d93b
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
I prefer to continue deleting the submitter's Job to free up computing resources. For RayJob, I expect users to submit and then not have to worry about it further.
I agree that there will definitely be many cases where you need the entire cluster to stay around.
This part I'm not sure about; it depends on the Ray job, right? My main point is that a Kubernetes Job in the "Completed" state does not actually use any compute resources. The node frees up the submitter job's requested resources once the job completes anyway. If the submitter job is not using any resources when completed, why delete it when it could contain useful logs?
Here's an example using a local Kind cluster with one node:
While the job runs, you can see the requested resources on the node:
With ShutdownAfterJobFinishes=true, the submitter job is left in "Completed" state and the RayCluster is deleted:
And checking the node again, the submitter job is not taking any resources:
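The walkthrough above can be sketched with kubectl. This is an illustrative sequence, not the exact commands from the original example; the node and job names are placeholders and will differ in a real cluster:

```shell
# While the Ray job runs, inspect requested resources on the single Kind node:
kubectl describe node kind-control-plane | grep -A 8 "Allocated resources"

# After the job completes with ShutdownAfterJobFinishes=true:
kubectl get rayclusters   # the RayCluster has been deleted
kubectl get jobs          # the submitter Job remains, COMPLETIONS 1/1
kubectl get pods          # the submitter pod is in "Completed" state

# Repeat the node check: the completed pod no longer counts against
# the node's allocated CPU/memory requests.
kubectl describe node kind-control-plane | grep -A 8 "Allocated resources"

# The useful logs survive because the submitter Job was not deleted:
kubectl logs job/rayjob-sample   # "rayjob-sample" is a placeholder name
```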
Good point. I didn't realize that when a job enters "Completed", it doesn't use any compute resources. In that case, this PR makes sense to me.
@@ -289,6 +292,22 @@ var _ = Context("Inside the default namespace", func() {
	time.Second*15, time.Millisecond*500).Should(Equal(rayv1.JobDeploymentStatusComplete), "jobDeploymentStatus = %v", myRayJob.Status.JobDeploymentStatus)
	Expect(myRayJob.Status.EndTime.After(now)).Should(BeTrue(), "EndTime = %v, Now = %v", myRayJob.Status.EndTime, now)
})

It("job completed with ShutdownAfterJobFinishes=true, RayCluster should be deleted but not the submitter Job", func() {
We need to split the test logic in the future instead of adding tests in a single Context block. Coupling all the tests together makes them hard to maintain.
…ray-project#1881) Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Why are these changes needed?
When a RayJob is configured with ShutdownAfterJobFinishes=true, KubeRay will immediately delete the associated RayCluster and submitter job once the job completes. Users can tune how long the cluster and submitter job stay around with TTLSecondsAfterFinished. Note that the RayJob resource itself is never deleted automatically; removing it always requires external action.
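The two settings discussed here correspond to fields on the RayJob spec. A minimal sketch (the metadata name, entrypoint, and cluster spec contents are placeholders; the field names come from the KubeRay RayJob CRD):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample            # placeholder name
spec:
  entrypoint: python script.py   # placeholder entrypoint
  # Delete the RayCluster once the Ray job finishes:
  shutdownAfterJobFinishes: true
  # Optional: wait this many seconds after the job finishes
  # before cleaning up:
  ttlSecondsAfterFinished: 60
  rayClusterSpec:
    # ... cluster head/worker group spec elided ...
```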
Even without setting TTLSecondsAfterFinished, there is very little reason to delete the submitter job. The submitter job usually contains the most useful logs for the job and is not using any cluster resources once completed. In addition, the submitter job will always be cleaned up when the RayJob is eventually deleted, due to its owner reference back to the RayJob. This PR proposes to never delete the submitter job once a job completes when ShutdownAfterJobFinishes=true. The only exception is a suspended RayJob, where the submitter job needs to be stopped.

Related issue number
Closes #1832
Checks