
[Feature] TTL Delete RayJob CRD After Job Termination #1944

Closed · 2 tasks done · peterghaddad opened this issue Feb 27, 2024 · 12 comments · Fixed by #2225
Labels: enhancement (New feature or request), P1 (Issue that should be fixed within a few weeks), rayjob

Comments

peterghaddad commented Feb 27, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, KubeRay can automatically terminate the RayCluster after the job completes, but there is no mechanism to automatically delete the RayJob custom resource itself (the owner).

This would help keep the Kubernetes cluster clean automatically, similar to the TTL behavior of native Kubernetes Jobs.

Use case

Delete the RayJob custom resource as well, controlled via an additional flag.
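
For reference, this is the kind of TTL behavior native Kubernetes Jobs already provide; a minimal, illustrative manifest (names and values are placeholders):

```yaml
# A native Kubernetes Job whose object (and its Pods) are garbage-collected
# automatically 300 seconds after the Job finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo done"]
```

The request here is for the RayJob custom resource to get an equivalent clean-up path.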

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
peterghaddad added the enhancement (New feature or request) and triage labels on Feb 27, 2024
kevin85421 added the rayjob and P1 (Issue that should be fixed within a few weeks) labels and removed the triage label on Mar 4, 2024
Xiao75896453 pushed commits to Xiao75896453/kuberay that referenced this issue on Apr 21, 2024
jjyao self-assigned this on May 13, 2024
mickvangelderen commented Jun 12, 2024

@kevin85421 In terms of the solution direction, what if we can specify the submitter Job template, rather than the Pod template? Then you can set the BackoffLimit, ttlSecondsAfterFinished and whatever else someone might need in the future. Given that the pod template is already being patched, why couldn't we do the same for the job template instead?
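
A rough sketch of what that proposal could look like; note that the submitterJobTemplate field below is hypothetical (KubeRay currently exposes submitterPodTemplate, i.e. only the Pod template), so this only illustrates the idea:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  # Hypothetical field for illustration only: patch the submitter Job itself
  # rather than just its Pod template, so that native batch/v1 Job fields
  # (backoffLimit, ttlSecondsAfterFinished, ...) become configurable.
  submitterJobTemplate:
    spec:
      backoffLimit: 1
      ttlSecondsAfterFinished: 604800   # clean up the submitter after one week
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ray-job-submitter
              image: rayproject/ray:2.9.0
```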

mickvangelderen commented Jun 25, 2024

@anyscalesam @jjyao how was this completed? I am failing to see how it was fixed.

kevin85421 (Member) commented:

@MortalHappiness will take this issue.


MortalHappiness commented Jul 2, 2024

@kevin85421 Thanks.

MortalHappiness added commits to MortalHappiness/kuberay that referenced this issue between Jul 8 and Jul 12, 2024 (Resolves: ray-project#1944).
mickvangelderen commented Jul 15, 2024

@MortalHappiness @kevin85421 it does not seem like the implementation in #2225 allows automatic deletion of the submitter after, say, one week, like a TTL field would. This issue does mention TTL. Am I missing something?

I'd like for my peers who create the jobs to be able to view them for some time, but to automatically clean up the job after a week. Is that possible with the current implementation?

andrewsykim (Contributor) commented:

@mickvangelderen I believe the TTLSecondsAfterFinished field applies as the deletion TTL if RayJob deletion is enabled
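
As a sketch of the fields being discussed (how the TTL interacts with RayJob deletion depends on the behavior added in #2225, so treat this as illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # Tear resources down automatically once the Ray job terminates.
  shutdownAfterJobFinishes: true
  # Delay, in seconds, between job termination and the cleanup being triggered.
  ttlSecondsAfterFinished: 604800   # one week
```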

mickvangelderen commented:

Thanks for the reply. If that is the case, then is it still possible to delete the cluster head and workers (to free up resources) immediately after the job finishes, and then have the submitter be deleted after a week?

andrewsykim commented Jul 15, 2024

There was a discussion in this thread about whether we want to control this behavior separately: #2097 (comment)

As of now, I think you either delete the whole RayJob or only the cluster. I think we're looking for user feedback on whether we should allow controlling the behavior separately (deleting the cluster and deleting the whole RayJob). Is this something you would find useful?

mickvangelderen commented Jul 15, 2024

I do think it would be useful to control the deletion of the cluster and the submitter separately. However, I might be missing other solutions that would work for me and my team, and so I will give some more detail about how we are using ray.

We have a tool that allows spawning work on a cluster. That work can be performed by a native K8s Pod or, if the user so desires, by Ray workers through a RayJob to leverage the distributed-computing facilities Ray offers. We do not use a persistent Ray cluster because it would sometimes end up in an unreliable state, possibly due to our unreliable hardware. RayJobs have been working great.

We want our users to be able to view the logs of their work for about a week. Any data that must be persistent is stored in an external system. After one week, we want to clean up all jobs to keep things tidy and free up space. For K8s Jobs, we use the ttlSecondsAfterFinished field. For RayJobs, we use ShutdownAfterJobFinishes to stop the cluster and the workers, which frees up any resources (CPU, GPU) they claimed. We like that the submitter is not automatically deleted, because we need it so our users can view the logs. However, we would like to clean up the submitter after a week. If we were able to specify ttlSecondsAfterFinished on the submitter job, we would have an easy solution. Instead, we need to set up a cron job to work around simply not being able to specify a certain field in the submitter pod spec.
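
A sketch of the kind of interim cron-based cleanup described above, assuming an image that bundles kubectl and jq, and a service account with permission to list and delete RayJobs; the status field names are assumptions and may differ across KubeRay versions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rayjob-cleanup
spec:
  schedule: "0 3 * * *"                    # run once a day
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rayjob-cleaner   # needs RBAC for list/delete on rayjobs
          restartPolicy: Never
          containers:
            - name: cleanup
              image: alpine/k8s:1.29.2         # assumption: any image with kubectl + jq works
              command:
                - /bin/sh
                - -c
                - |
                  # Delete RayJobs whose job finished more than a week ago.
                  kubectl get rayjobs -o json \
                    | jq -r '.items[]
                        | select(.status.jobDeploymentStatus == "Complete")
                        | select(.status.endTime != null)
                        | select((now - (.status.endTime | fromdate)) > 604800)
                        | .metadata.name' \
                    | while read -r name; do kubectl delete rayjob "$name"; done
```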

Similarly, we would also like to set the backoffLimit to 1 for the submitter pod, instead of the default 3 that ray sets. Most often the issue is that the entrypoint that our users have specified is somehow incorrect, and it causes the submitter to restart 3 times which is noisy and useless.

Hopefully this clarifies why we are interested in this functionality. If you see a better solution direction to accomplish our objectives, please let me know.

andrewsykim (Contributor) commented:

Do you only care about the submitter being deleted or do you also care about the RayJob resource itself being cleaned up?

> Similarly, we would also like to set the backoffLimit to 1 for the submitter pod, instead of the default 3 that ray sets. Most often the issue is that the entrypoint that our users have specified is somehow incorrect, and it causes the submitter to restart 3 times which is noisy and useless.

This will be possible in KubeRay v1.2: #2091
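
If that lands as described in #2091, the usage might look roughly like this (the exact field path submitterConfig.backoffLimit is an assumption here):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  # Assumed field per kuberay#2091: cap retries of the submitter Job at 1
  # instead of the operator's current default.
  submitterConfig:
    backoffLimit: 1
```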

> Hopefully this clarifies why we are interested in this functionality. If you see a better solution direction to accomplish our objectives, please let me know.

I think fundamentally what you need is better tooling to persist the Ray job logs. Once you have this you don't need to care about how long the cluster stays around when the job is deleted. Although I can see value in being able to read the logs directly with kubectl.

mickvangelderen commented Jul 15, 2024

> Do you only care about the submitter being deleted or do you also care about the RayJob resource itself being cleaned up?

I might be confused with the terminology. Looking at the RayJob quickstart I want the RayCluster to be deleted immediately after the job finishes and I would like the logs to be available for a week. I thought the logs were tied to something referred to as "the submitter". I'm not sure if this submitter is a Pod, a Job, a Ray Job (notice the space) or something else.

MortalHappiness commented Jul 16, 2024

@kevin85421 I remember that you wanted to make the logging structured so that logs can be read by external tools. Is that feature related to the log persistence mentioned in this thread?
