[Feature] Finalizer to block deletion of RayCluster with running jobs #1740

Open · 2 tasks done

andrewsykim opened this issue Dec 12, 2023 · 5 comments · May be fixed by #2041
Labels
enhancement New feature or request

Comments

@andrewsykim
Contributor

andrewsykim commented Dec 12, 2023

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

I would like to introduce a finalizer that can be used with RayCluster to block deletion until all jobs in the Ray cluster are completed.

Use case

This feature would allow you to issue a delete for a Ray cluster while jobs are still running. The finalizer would ensure, by querying the Ray head service, that all jobs have completed before resources are cleaned up. This is handy when you want resources to be cleaned up automatically as soon as a long-running training job finishes, and it is even more important for larger jobs where resources need to be released as quickly as possible to save costs.

This can also be used as a safety measure to ensure RayClusters with running jobs can't be accidentally deleted.

While RayJob can be used for similar use-cases, it is not a viable option for longer-lived RayClusters that can accept multiple jobs before being deleted.
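
To make "querying the Ray head service" concrete, a rough sketch of the check is below. This is only an illustration, not existing KubeRay code: it assumes the Ray dashboard's `GET /api/jobs/` endpoint on port 8265, and the struct, helper, and service names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// rayJob captures the fields we care about from the Ray Jobs API response
// (GET http://<head-svc>:8265/api/jobs/). Field names follow Ray's job
// submission REST API, but treat them as assumptions.
type rayJob struct {
	SubmissionID string `json:"submission_id"`
	Status       string `json:"status"` // e.g. PENDING, RUNNING, SUCCEEDED, FAILED, STOPPED
}

// hasActiveJobs returns true if any job on the cluster is still pending or running.
// headSvcURL is the RayCluster's head service, e.g. "http://my-cluster-head-svc:8265".
func hasActiveJobs(headSvcURL string) (bool, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(headSvcURL + "/api/jobs/")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var jobs []rayJob
	if err := json.NewDecoder(resp.Body).Decode(&jobs); err != nil {
		return false, err
	}
	for _, j := range jobs {
		if j.Status == "PENDING" || j.Status == "RUNNING" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	active, err := hasActiveJobs("http://my-cluster-head-svc:8265")
	if err != nil {
		fmt.Println("failed to list jobs:", err)
		return
	}
	fmt.Println("cluster has active jobs:", active)
}
```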

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
andrewsykim added the enhancement (New feature or request) label on Dec 12, 2023
@andrewsykim
Contributor Author

Thoughts @kevin85421 ?

@kevin85421
Member

The suspend feature in RayJob will issue a request to the Ray head Pod to halt the job before the RayCluster is deleted. For RayCluster, I prefer to avoid doing too much on the data plane (i.e. Ray). If users want to suspend a RayCluster, they should make sure all jobs are stopped themselves.

@andrewsykim
Contributor Author

Btw, this pertains to deletion, not suspension.

@andrewsykim
Contributor Author

@kevin85421 here's the use-case I am thinking about:

  • Data Scientist starts their day by creating a RayCluster. The cluster is large and consumes expensive hardware accelerators.
  • Throughout the day they run several jobs on their cluster. They submit multiple jobs interactively, which is why RayJob is not a viable option.
  • Near the end of the day, they want to run one more job that will take several hours to complete and they want to check the results the next day.
  • Because their cluster is very expensive, they want the cluster to be automatically deleted after the final job completes but they don't want to babysit the job until completion.
  • They add the finalizer ray.io/wait-for-job-completion to the RayCluster and then run the delete command kubectl delete raycluster my-cluster.
  • kuberay-operator sees the finalizer and waits for all jobs to complete before cleaning up all resources.
  • The Data Scientist checks the results the next day and spins up a new RayCluster to start development.

Note that the finalizer would be optional; blocking deletion on job completion would not be the default behavior. I agree with your previous comment that we don't need to cover this for suspension.
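
Roughly, the operator-side handling could look like the sketch below. This is not KubeRay's actual API: it assumes controller-runtime, reuses a hasActiveJobs helper like the one sketched in the issue description, and the head service URL format, requeue interval, and reconciler shape are illustrative only.

```go
package controllers

import (
	"context"
	"fmt"
	"time"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1" // import path may differ by KubeRay version
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Proposed, opt-in finalizer name from this issue.
const waitForJobCompletionFinalizer = "ray.io/wait-for-job-completion"

// RayClusterReconciler is a trimmed-down stand-in for the real reconciler.
type RayClusterReconciler struct {
	client.Client
}

// reconcileDeletion blocks cleanup while jobs are still running, then removes
// the finalizer so Kubernetes can finish deleting the RayCluster.
func (r *RayClusterReconciler) reconcileDeletion(ctx context.Context, cluster *rayv1.RayCluster) (ctrl.Result, error) {
	// Only act when the cluster is being deleted and has opted in to the finalizer.
	if cluster.DeletionTimestamp.IsZero() ||
		!controllerutil.ContainsFinalizer(cluster, waitForJobCompletionFinalizer) {
		return ctrl.Result{}, nil
	}

	// Head service naming here is an assumption for illustration.
	headSvcURL := fmt.Sprintf("http://%s-head-svc.%s.svc.cluster.local:8265", cluster.Name, cluster.Namespace)
	active, err := hasActiveJobs(headSvcURL) // helper sketched in the issue description
	if err != nil {
		return ctrl.Result{}, err
	}
	if active {
		// Jobs still pending or running: keep the finalizer and check again later.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	// All jobs finished: drop the finalizer so deletion can proceed.
	controllerutil.RemoveFinalizer(cluster, waitForJobCompletionFinalizer)
	return ctrl.Result{}, r.Update(ctx, cluster)
}
```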

@chenk008
Contributor

We have a similar use-case:

  1. The RayCluster is shared by multiple tenants, who submit jobs to it.
  2. When the RayCluster is being deleted, Pods should not be removed until all jobs have completed.
  3. New jobs should be rejected while deletion is in progress.

YichengWang12 linked a pull request on Mar 24, 2024 that will close this issue