[Docs] Update RayJob documentation to introduce the New DeletionStrategy #58306
Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
Hi @Future-Outlier and @rueian, please help review it when you have a chance. Thanks!
Code Review
This pull request updates the RayJob documentation to introduce the new deletionStrategy field. The changes are clear and well-structured. I've provided a suggestion to further improve the documentation's clarity by adding information about the default behavior and providing more details on the legacy mode.
* `deletionStrategy` (Optional, alpha in v1.5.0): Configures automated cleanup after the RayJob reaches a terminal state. This field requires the `RayJobDeletionPolicy` feature gate to be enabled. Two mutually exclusive styles are supported:
  * **Rules-based** (Recommended): Define `deletionRules` as a list of deletion actions triggered by specific conditions. Each rule specifies:
    * `policy`: The deletion action to perform: `DeleteCluster` (delete the entire RayCluster and its Pods), `DeleteWorkers` (delete only worker Pods), `DeleteSelf` (delete the RayJob and all associated resources), or `DeleteNone` (no deletion).
    * `condition`: When to trigger the deletion, based on `jobStatus` (`SUCCEEDED` or `FAILED`) and an optional `ttlSeconds` delay.
    * This approach enables flexible, multi-stage cleanup strategies (e.g., delete workers immediately on success, then delete the cluster after 300 seconds).
    * Rules-based mode is incompatible with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`. Use per-rule `condition.ttlSeconds` instead.
    * See [ray-job.deletion-rules.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.deletion-rules.yaml) for example configurations.
  * **Legacy** (Deprecated): Define both `onSuccess` and `onFailure` policies. This approach is deprecated and will be removed in v1.6.0. Migration to `deletionRules` is strongly encouraged.
    * Legacy mode can be combined with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`.
  * For detailed API specifications, see the [KubeRay API Reference](https://ray-project.github.io/kuberay/reference/api/).
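
As a minimal sketch of the multi-stage example mentioned in the list (delete workers on success, then delete the cluster after 300 seconds), a RayJob fragment could look roughly like the following. The field nesting follows the names given in the text; the metadata name is hypothetical, the rest of the spec is omitted, and the linked sample file remains the authoritative reference.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-deletion-rules-example  # hypothetical name, for illustration only
spec:
  # entrypoint, rayClusterSpec, and other required fields omitted for brevity.
  # Assumes the RayJobDeletionPolicy feature gate is enabled on the operator.
  deletionStrategy:
    deletionRules:
      # Stage 1: remove worker Pods once the job succeeds (no ttlSeconds delay set).
      - policy: DeleteWorkers
        condition:
          jobStatus: SUCCEEDED
      # Stage 2: remove the whole RayCluster 300 seconds after success.
      - policy: DeleteCluster
        condition:
          jobStatus: SUCCEEDED
          ttlSeconds: 300
```

Because rules-based mode is described as incompatible with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`, this sketch sets neither of them.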
The documentation for the new deletionStrategy is a great addition. I have a couple of suggestions to make it even clearer for users:
- It would be helpful to mention the default behavior when `deletionStrategy` is not provided. Based on the API definition, it seems to fall back to `shutdownAfterJobFinishes`.
- The description for the Legacy mode could be more explicit about how to define `onSuccess` and `onFailure` policies and what their valid values are.
Here is a suggested update incorporating these points:
* `deletionStrategy` (Optional, alpha in v1.5.0): Configures automated cleanup after the RayJob reaches a terminal state. If this field is not set, cleanup behavior is determined by `shutdownAfterJobFinishes`. This field requires the `RayJobDeletionPolicy` feature gate to be enabled. Two mutually exclusive styles are supported:
  * **Rules-based** (Recommended): Define `deletionRules` as a list of deletion actions triggered by specific conditions. Each rule specifies:
    * `policy`: The deletion action to perform: `DeleteCluster` (delete the entire RayCluster and its Pods), `DeleteWorkers` (delete only worker Pods), `DeleteSelf` (delete the RayJob and all associated resources), or `DeleteNone` (no deletion).
    * `condition`: When to trigger the deletion, based on `jobStatus` (`SUCCEEDED` or `FAILED`) and an optional `ttlSeconds` delay.
    * This approach enables flexible, multi-stage cleanup strategies (e.g., delete workers immediately on success, then delete the cluster after 300 seconds).
    * Rules-based mode is incompatible with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`. Use per-rule `condition.ttlSeconds` instead.
    * See [ray-job.deletion-rules.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.deletion-rules.yaml) for example configurations.
  * **Legacy** (Deprecated): Define `onSuccess` and `onFailure` policies to specify the deletion action to take when the job succeeds or fails, respectively. The value for each policy can be `DeleteCluster`, `DeleteWorkers`, `DeleteSelf`, or `DeleteNone`. This approach is deprecated and will be removed in v1.6.0. Migration to `deletionRules` is strongly encouraged.
    * Legacy mode can be combined with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`.
  * For detailed API specifications, see the [KubeRay API Reference](https://ray-project.github.io/kuberay/reference/api/).
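
For comparison, a legacy-style fragment might look like the sketch below. The placement of `policy` under `onSuccess` and `onFailure` is an assumption inferred from the description, and the chosen values and metadata name are illustrative only; the KubeRay API Reference should be checked before relying on this shape.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-legacy-deletion-example  # hypothetical name, for illustration only
spec:
  # entrypoint, rayClusterSpec, and other required fields omitted for brevity.
  deletionStrategy:
    # Deprecated style: one policy per terminal outcome (nesting is assumed).
    onSuccess:
      policy: DeleteCluster   # on success, delete the RayCluster and its Pods
    onFailure:
      policy: DeleteWorkers   # on failure, delete only the worker Pods
```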
…egy (ray-project#58306) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
…egy (ray-project#58306) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
Update RayJob documentation to introduce the New DeletionStrategy, which provides a rules-based, multi-stage cleanup strategy.
Related issues
Closes ray-project/kuberay#4022.
Additional information