
[Feature] TTL Delete RayJob CRD After Job Termination #1944

Closed · 2 tasks done · peterghaddad opened this issue Feb 27, 2024 · 12 comments · Fixed by #2225
Labels: enhancement (New feature or request), P1 (Issue that should be fixed within a few weeks), rayjob

Comments

peterghaddad commented Feb 27, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, KubeRay can automatically terminate the RayCluster after the job completes, but there is no mechanism to automatically delete the RayJob custom resource itself (the owner).

This would help keep the Kubernetes cluster clean automatically, similar to the TTL behavior of native Kubernetes Jobs.

Use case

Delete the RayJob custom resource as well, controlled via an additional flag.
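
For reference, this is the kind of TTL behavior native Kubernetes Jobs already provide; a minimal, illustrative manifest (names and values are placeholders):

```yaml
# A native Kubernetes Job whose object (and its Pods) are garbage-collected
# automatically 300 seconds after the Job finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo done"]
```

The request here is for the RayJob custom resource to get an equivalent clean-up path.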

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
peterghaddad added the enhancement (New feature or request) and triage labels on Feb 27, 2024
kevin85421 added the rayjob and P1 (Issue that should be fixed within a few weeks) labels and removed the triage label on Mar 4, 2024
Xiao75896453 pushed commits to Xiao75896453/kuberay that referenced this issue on Apr 21, 2024
jjyao self-assigned this on May 13, 2024
mickvangelderen commented Jun 12, 2024

@kevin85421 In terms of the solution direction, what if we can specify the submitter Job template, rather than the Pod template? Then you can set the BackoffLimit, ttlSecondsAfterFinished and whatever else someone might need in the future. Given that the pod template is already being patched, why couldn't we do the same for the job template instead?
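
A rough sketch of what that proposal could look like; note that the submitterJobTemplate field below is hypothetical (KubeRay currently exposes submitterPodTemplate, i.e. only the Pod template), so this only illustrates the idea:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  # Hypothetical field for illustration only: patch the submitter Job itself
  # rather than just its Pod template, so that native batch/v1 Job fields
  # (backoffLimit, ttlSecondsAfterFinished, ...) become configurable.
  submitterJobTemplate:
    spec:
      backoffLimit: 1
      ttlSecondsAfterFinished: 604800   # clean up the submitter after one week
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ray-job-submitter
              image: rayproject/ray:2.9.0
```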

mickvangelderen commented Jun 25, 2024

@anyscalesam @jjyao how was this completed? I am failing to see how it was fixed.

kevin85421 (Member) commented:

@MortalHappiness will take this issue.


MortalHappiness commented Jul 2, 2024

@kevin85421 Thanks.

MortalHappiness added commits to MortalHappiness/kuberay that referenced this issue between Jul 8 and Jul 12, 2024 (Resolves: ray-project#1944).
mickvangelderen commented Jul 15, 2024

@MortalHappiness @kevin85421 it does not seem like the implementation in #2225 allows automatic deletion of the submitter after, say, one week, like a TTL field would. This issue does mention TTL. Am I missing something?

I'd like for my peers who create the jobs to be able to view them for some time, but to automatically clean up the job after a week. Is that possible with the current implementation?

andrewsykim (Contributor) commented:

@mickvangelderen I believe the TTLSecondsAfterFinished field applies as the deletion TTL if RayJob deletion is enabled
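
As a sketch of the fields being discussed (how the TTL interacts with RayJob deletion depends on the behavior added in #2225, so treat this as illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # Tear resources down automatically once the Ray job terminates.
  shutdownAfterJobFinishes: true
  # Delay, in seconds, between job termination and the cleanup being triggered.
  ttlSecondsAfterFinished: 604800   # one week
```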

mickvangelderen commented:

Thanks for the reply. If that is the case, then is it still possible to delete the cluster head and workers (to free up resources) immediately after the job finishes, and then have the submitter be deleted after a week?

andrewsykim commented Jul 15, 2024

There was a discussion in this thread about whether we want to control this behavior separately: #2097 (comment)

As of now, I think you either delete the whole RayJob or only the cluster. I think we're looking for user feedback on whether we should allow controlling the behavior separately (deleting the cluster and deleting the whole RayJob). Is this something you would find useful?

mickvangelderen commented Jul 15, 2024

I do think it would be useful to control the deletion of the cluster and the submitter separately. However, I might be missing other solutions that would work for me and my team, and so I will give some more detail about how we are using ray.

We have a tool that allows spawning work on a cluster. That work can be performed by a native K8s Pod or, if the user so desires, by Ray workers through a RayJob to leverage the distributed-computing facilities Ray offers. We do not use a persistent Ray cluster because it would sometimes end up in an unreliable state, possibly due to our unreliable hardware. RayJobs have been working great.

We want our users to be able to view the logs of their work for about a week. Any data that must be persistent is stored in an external system. After one week, we want to clean up all jobs to keep things tidy and free up space. For K8s Jobs, we use the ttlSecondsAfterFinished field. For RayJobs, we use ShutdownAfterJobFinishes to stop the cluster and the workers, which frees up any resources (CPU, GPU) they claimed. We like that the submitter is not automatically deleted, because we need it so our users can view the logs. However, we would like to clean up the submitter after a week. If we were able to specify ttlSecondsAfterFinished on the submitter job, we would have an easy solution. Instead, we need to set up a cron job to work around simply not being able to specify a certain field in the submitter pod spec.
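
A sketch of the kind of interim cron-based cleanup described above, assuming an image that bundles kubectl and jq, and a service account with permission to list and delete RayJobs; the status field names are assumptions and may differ across KubeRay versions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rayjob-cleanup
spec:
  schedule: "0 3 * * *"                    # run once a day
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rayjob-cleaner   # needs RBAC for list/delete on rayjobs
          restartPolicy: Never
          containers:
            - name: cleanup
              image: alpine/k8s:1.29.2         # assumption: any image with kubectl + jq works
              command:
                - /bin/sh
                - -c
                - |
                  # Delete RayJobs whose job finished more than a week ago.
                  kubectl get rayjobs -o json \
                    | jq -r '.items[]
                        | select(.status.jobDeploymentStatus == "Complete")
                        | select(.status.endTime != null)
                        | select((now - (.status.endTime | fromdate)) > 604800)
                        | .metadata.name' \
                    | while read -r name; do kubectl delete rayjob "$name"; done
```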

Similarly, we would also like to set the backoffLimit to 1 for the submitter pod, instead of the default 3 that ray sets. Most often the issue is that the entrypoint that our users have specified is somehow incorrect, and it causes the submitter to restart 3 times which is noisy and useless.

Hopefully this clarifies why we are interested in this functionality. If you see a better solution direction to accomplish our objectives, please let me know.

andrewsykim (Contributor) commented:

Do you only care about the submitter being deleted or do you also care about the RayJob resource itself being cleaned up?

> Similarly, we would also like to set the backoffLimit to 1 for the submitter pod, instead of the default 3 that ray sets. Most often the issue is that the entrypoint that our users have specified is somehow incorrect, and it causes the submitter to restart 3 times which is noisy and useless.

This will be possible in KubeRay v1.2: #2091
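
If that lands as described in #2091, the usage might look roughly like this (the exact field path submitterConfig.backoffLimit is an assumption here):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  # Assumed field per kuberay#2091: cap retries of the submitter Job at 1
  # instead of the operator's current default.
  submitterConfig:
    backoffLimit: 1
```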

> Hopefully this clarifies why we are interested in this functionality. If you see a better solution direction to accomplish our objectives, please let me know.

I think fundamentally what you need is better tooling to persist the Ray job logs. Once you have this you don't need to care about how long the cluster stays around when the job is deleted. Although I can see value in being able to read the logs directly with kubectl.

mickvangelderen commented Jul 15, 2024

> Do you only care about the submitter being deleted or do you also care about the RayJob resource itself being cleaned up?

I might be confused with the terminology. Looking at the RayJob quickstart I want the RayCluster to be deleted immediately after the job finishes and I would like the logs to be available for a week. I thought the logs were tied to something referred to as "the submitter". I'm not sure if this submitter is a Pod, a Job, a Ray Job (notice the space) or something else.

MortalHappiness commented Jul 16, 2024

@kevin85421 I remember that you wanted to make the logging structured so that logs can be read by external tools. Is that feature related to the log persistence mentioned in this thread?
