@seanlaii
Contributor

@seanlaii seanlaii commented Oct 30, 2025

Description

Update the RayJob documentation to introduce the new `deletionStrategy` field, which supports rules-based, multi-stage cleanup.

Related issues

Closes ray-project/kuberay#4022.

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
@seanlaii seanlaii requested review from a team as code owners October 30, 2025 04:32
@seanlaii
Contributor Author

Hi @Future-Outlier and @rueian, please review this when you have a chance. Thanks!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the RayJob documentation to introduce the new deletionStrategy field. The changes are clear and well-structured. I've provided a suggestion to further improve the documentation's clarity by adding information about the default behavior and providing more details on the legacy mode.

Comment on lines +72 to +81
* `deletionStrategy` (Optional, alpha in v1.5.0): Configures automated cleanup after the RayJob reaches a terminal state. This field requires the `RayJobDeletionPolicy` feature gate to be enabled. Two mutually exclusive styles are supported:
  * **Rules-based** (Recommended): Define `deletionRules` as a list of deletion actions triggered by specific conditions. Each rule specifies:
    * `policy`: The deletion action to perform: `DeleteCluster` (delete the entire RayCluster and its Pods), `DeleteWorkers` (delete only worker Pods), `DeleteSelf` (delete the RayJob and all associated resources), or `DeleteNone` (no deletion).
    * `condition`: When to trigger the deletion, based on `jobStatus` (`SUCCEEDED` or `FAILED`) and an optional `ttlSeconds` delay.
    * This approach enables flexible, multi-stage cleanup strategies (e.g., delete workers immediately on success, then delete the cluster after 300 seconds).
    * Rules-based mode is incompatible with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`. Use per-rule `condition.ttlSeconds` instead.
    * See [ray-job.deletion-rules.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.deletion-rules.yaml) for example configurations.
  * **Legacy** (Deprecated): Define both `onSuccess` and `onFailure` policies. This approach is deprecated and will be removed in v1.6.0. Migration to `deletionRules` is strongly encouraged.
    * Legacy mode can be combined with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`.
  * For detailed API specifications, see the [KubeRay API Reference](https://ray-project.github.io/kuberay/reference/api/).
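
For illustration only, the multi-stage example mentioned in the rules-based description (delete workers immediately on success, then delete the cluster after 300 seconds) might be written as the `spec` fragment sketched below. The field names are taken from the text above. The use of `ttlSeconds: 0` to mean "immediately", the extra rule for failed jobs, and the omission of the rest of the RayJob spec are assumptions made for brevity; the linked sample file is the authoritative reference.

```yaml
# Hedged sketch of a rules-based deletionStrategy; not copied from the sample file.
# Assumes the RayJobDeletionPolicy feature gate is enabled on the operator.
spec:
  # entrypoint, rayClusterSpec, and other required fields are omitted here.
  # shutdownAfterJobFinishes and ttlSecondsAfterFinished are deliberately not set,
  # because rules-based mode is described as incompatible with them.
  deletionStrategy:
    deletionRules:
      # Stage 1: remove worker Pods as soon as the job succeeds.
      - policy: DeleteWorkers
        condition:
          jobStatus: SUCCEEDED
          ttlSeconds: 0   # assumption: 0 expresses "immediately"
      # Stage 2: remove the whole RayCluster 300 seconds after success.
      - policy: DeleteCluster
        condition:
          jobStatus: SUCCEEDED
          ttlSeconds: 300
      # Illustrative rule for failures: keep resources for an hour for
      # debugging, then delete the cluster.
      - policy: DeleteCluster
        condition:
          jobStatus: FAILED
          ttlSeconds: 3600
```

Putting the delay on each rule rather than on the job as a whole is what lets the success and failure paths use different timings.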
Contributor

medium

The documentation for the new deletionStrategy is a great addition. I have a couple of suggestions to make it even clearer for users:

  1. It would be helpful to mention the default behavior when deletionStrategy is not provided. Based on the API definition, it seems to fall back to shutdownAfterJobFinishes.
  2. The description for the Legacy mode could be more explicit about how to define onSuccess and onFailure policies and what their valid values are.

Here is a suggested update incorporating these points:

Suggested change
* `deletionStrategy` (Optional, alpha in v1.5.0): Configures automated cleanup after the RayJob reaches a terminal state. If this field is not set, cleanup behavior is determined by `shutdownAfterJobFinishes`. This field requires the `RayJobDeletionPolicy` feature gate to be enabled. Two mutually exclusive styles are supported:
  * **Rules-based** (Recommended): Define `deletionRules` as a list of deletion actions triggered by specific conditions. Each rule specifies:
    * `policy`: The deletion action to perform: `DeleteCluster` (delete the entire RayCluster and its Pods), `DeleteWorkers` (delete only worker Pods), `DeleteSelf` (delete the RayJob and all associated resources), or `DeleteNone` (no deletion).
    * `condition`: When to trigger the deletion, based on `jobStatus` (`SUCCEEDED` or `FAILED`) and an optional `ttlSeconds` delay.
    * This approach enables flexible, multi-stage cleanup strategies (e.g., delete workers immediately on success, then delete the cluster after 300 seconds).
    * Rules-based mode is incompatible with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`. Use per-rule `condition.ttlSeconds` instead.
    * See [ray-job.deletion-rules.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-job.deletion-rules.yaml) for example configurations.
  * **Legacy** (Deprecated): Define `onSuccess` and `onFailure` policies to specify the deletion action to take when the job succeeds or fails, respectively. The value for each policy can be `DeleteCluster`, `DeleteWorkers`, `DeleteSelf`, or `DeleteNone`. This approach is deprecated and will be removed in v1.6.0. Migration to `deletionRules` is strongly encouraged.
    * Legacy mode can be combined with `shutdownAfterJobFinishes` and the global `ttlSecondsAfterFinished`.
  * For detailed API specifications, see the [KubeRay API Reference](https://ray-project.github.io/kuberay/reference/api/).
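
To make the legacy wording in the suggestion concrete, a minimal sketch of the deprecated style could look like the fragment below. The nesting of a `policy` key under `onSuccess` and `onFailure` is an assumption based on how the rules-based style names its action field; the suggestion text does not spell out the exact shape, so check the KubeRay API Reference before relying on it.

```yaml
# Hedged sketch of the deprecated legacy style; the exact field nesting is an assumption.
# Assumes the RayJobDeletionPolicy feature gate is enabled on the operator.
spec:
  # entrypoint, rayClusterSpec, and other required fields are omitted here.
  deletionStrategy:
    # Both policies are defined, as the description requires.
    onSuccess:
      policy: DeleteCluster   # one of DeleteCluster, DeleteWorkers, DeleteSelf, DeleteNone
    onFailure:
      policy: DeleteSelf
    # deletionRules must not appear here: the two styles are mutually exclusive.
```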

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Oct 30, 2025
@jjyao jjyao enabled auto-merge (squash) October 30, 2025 05:50
@jjyao jjyao merged commit 3fcb2a2 into ray-project:master Oct 30, 2025
7 checks passed
@seanlaii seanlaii deleted the deletion-strategy branch October 30, 2025 06:09
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…egy (ray-project#58306)

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…egy (ray-project#58306)

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…egy (ray-project#58306)

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>