Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending #1798

Merged
merged 2 commits into from
Jan 3, 2024

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Jan 3, 2024

Why are these changes needed?

The suspend operation should be atomic. In other words, if users set the suspend flag to true and then immediately set it back to false, either all of the RayJob's associated resources should be cleaned up, or no resources should be cleaned up at all. To keep the atomicity, if a RayJob is in the Suspending status, we should delete all of its associated resources and then transition the status to Suspended no matter the value of the suspend flag.

There is a breaking change in this PR:

  • This PR doesn't send a StopJob request to stop the Ray job to simplify logic. Currently, Ray doesn't have a best practice to stop a Ray job gracefully. At this moment, KubeRay doesn't stop the Ray job before suspending the RayJob. If users want to stop the Ray job by SIGTERM, users need to set the Pod's preStop hook by themselves. I think this change is acceptable because the suspend operation has not been working for several releases already.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# ray-job.long-running.yaml: https://gist.github.com/kevin85421/68fadd3ec00fc75d112512954f62cacf
kubectl apply -f ray-job.long-running.yaml

# Wait for the JobDeploymentStatus to be `Running`.
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
# [Expectation]: Running

# Add a finalizer to the RayCluster
kubectl patch rayclusters.ray.io rayjob-sample-raycluster-kp7c4 --type='json' -p='[{"op": "add", "path": "/metadata/finalizers", "value":["ray.io/test"]}]'

# Set `suspend` to true in `ray-job.long-running.yaml` and then apply again.
kubectl apply -f ray-job.long-running.yaml
# [Expectation]: The K8s Job should be removed, but the RayCluster will still be there.

# The status should be `Suspending`.
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
# [Expectation]: Suspending

# Remove the finalizer
kubectl patch rayclusters.ray.io rayjob-sample-raycluster-kp7c4 --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'

# The status should be `Suspended`
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
# [Expectation]: Suspended

# Set `suspend` to false in `ray-job.long-running.yaml` and then apply again.
kubectl apply -f ray-job.long-running.yaml
# [Expectation]: The K8s Job and the RayCluster should be created again.

# The status should finally be `Running` again.
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
# [Expectation]: Running

@kevin85421 kevin85421 changed the title [WIP][RayJob][Status][13/n] Make suspend operation atomic [WIP][RayJob][Status][13/n] Make suspend operation atomic by introducing a new status Suspending Jan 3, 2024
@kevin85421 kevin85421 changed the title [WIP][RayJob][Status][13/n] Make suspend operation atomic by introducing a new status Suspending [WIP][RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending Jan 3, 2024
@kevin85421 kevin85421 changed the title [WIP][RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending Jan 3, 2024
@kevin85421 kevin85421 marked this pull request as ready for review January 3, 2024 07:34
}
}
return nil

isReleaseComplete := isClusterNotFound && isJobNotFound
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this PR, this function returns nil when the DeletionTimeStamp is not zero.


// TODO (kevin85421): Currently, Ray doesn't have a best practice to stop a Ray job gracefully. At this moment,
// KubeRay doesn't stop the Ray job before suspending the RayJob. If users want to stop the Ray job by SIGTERM,
// users need to set the Pod's preStop hook by themselves.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we document this anywhere other than here in code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently not. I will streamline the end-to-end UX of RayJob with some Kubernetes maintainers from Google this quarter. I will document important information when we figure out the best practice.

@kevin85421 kevin85421 merged commit 448e33d into ray-project:master Jan 3, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants