[RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending
#1798
}
}
return nil

isReleaseComplete := isClusterNotFound && isJobNotFound
Without this PR, this function returns nil when the DeletionTimestamp is not zero.
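For readers following the review, here is a minimal sketch of how a release check like the one in this hunk could drive the Suspending to Suspended transition. Only isClusterNotFound, isJobNotFound, isReleaseComplete, and the Suspending and Suspended statuses come from the PR; rayJob, fakeStore, release, reconcile, and the simplified Running state are hypothetical stand-ins, not KubeRay's actual types or functions. It assumes deletion is asynchronous, so a resource deleted in one reconcile pass is only reported as not found on the next pass.

```go
package main

import "fmt"

// JobDeploymentStatus is a simplified, illustrative status type.
type JobDeploymentStatus string

const (
	StatusRunning    JobDeploymentStatus = "Running" // simplified stand-in for the active states
	StatusSuspending JobDeploymentStatus = "Suspending"
	StatusSuspended  JobDeploymentStatus = "Suspended"
)

// rayJob is a hypothetical, stripped-down stand-in for the RayJob resource.
type rayJob struct {
	Suspend bool
	Status  JobDeploymentStatus
}

// fakeStore is a hypothetical in-memory stand-in for the API server.
// A field is true while the corresponding object still exists.
type fakeStore struct {
	cluster bool
	job     bool
}

// release issues a delete for every resource that still exists and reports
// whether both resources were already gone at the start of this pass.
func (s *fakeStore) release() bool {
	isClusterNotFound := !s.cluster
	if s.cluster {
		s.cluster = false // delete issued; observed as gone on the next pass
	}
	isJobNotFound := !s.job
	if s.job {
		s.job = false
	}
	isReleaseComplete := isClusterNotFound && isJobNotFound
	return isReleaseComplete
}

// reconcile performs one pass of the simplified state machine.
func reconcile(j *rayJob, s *fakeStore) {
	switch j.Status {
	case StatusRunning:
		if j.Suspend {
			j.Status = StatusSuspending
		}
	case StatusSuspending:
		// The suspend flag is deliberately not read here: once the job is
		// Suspending, cleanup always runs to completion.
		if s.release() {
			j.Status = StatusSuspended
		}
	case StatusSuspended:
		if !j.Suspend {
			// Simplification: recreate resources and resume in one step.
			s.cluster, s.job = true, true
			j.Status = StatusRunning
		}
	}
}

func main() {
	j := &rayJob{Suspend: true, Status: StatusRunning}
	s := &fakeStore{cluster: true, job: true}

	reconcile(j, s)   // Running -> Suspending
	j.Suspend = false // the user flips the flag back immediately
	for j.Status == StatusSuspending {
		reconcile(j, s) // cleanup continues regardless of the flag
	}
	fmt.Println(j.Status, s.cluster, s.job)
}
```

In this sketch, flipping the flag back while the job is Suspending cannot leave the resources half deleted: the flag is only consulted again after the status reaches Suspended.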

// TODO (kevin85421): Currently, Ray doesn't have a best practice to stop a Ray job gracefully. At this moment,
// KubeRay doesn't stop the Ray job before suspending the RayJob. If users want to stop the Ray job by SIGTERM,
// users need to set the Pod's preStop hook by themselves.
Do we document this anywhere other than here in the code?
Currently not. I will streamline the end-to-end UX of RayJob with some Kubernetes maintainers from Google this quarter. I will document important information when we figure out the best practice.
Why are these changes needed?
The suspend operation should be atomic. In other words, if users set the suspend flag to true and then immediately set it back to false, either all of the RayJob's associated resources should be cleaned up, or no resources should be cleaned up at all. To preserve atomicity, if a RayJob is in the Suspending status, we should delete all of its associated resources and then transition the status to Suspended, regardless of the value of the suspend flag.

There is a breaking change in this PR: KubeRay no longer sends a StopJob request to stop the Ray job, which simplifies the logic. Currently, Ray doesn't have a best practice for stopping a Ray job gracefully, so KubeRay doesn't stop the Ray job before suspending the RayJob. Users who want to stop the Ray job with SIGTERM need to set the Pod's preStop hook themselves. I think this change is acceptable because the suspend operation has not been working for several releases already.

Related issue number
Checks