[ray job] support stop job after job cr is deleted in cluster selector mode #629
Overall looks good to me. Please address the linter issue reported by GitHub Actions.
Thanks, @Basasuya, for the contribution!
Logic
In my understanding, WithEventFilter [1] is used to define the resource event handlers, that is, the filters for Add / Update / Delete events. Only events for which all predicates evaluate to true are put into the WorkQueue. Next, the reconciler retrieves the resource event from the WorkQueue and performs the operator logic based on the event.
To summarize,
(1) WithEventFilter defines which events we are interested in.
(2) Reconciler defines how to handle the resource event.
Currently, this PR implements both (1) and (2) in WithEventFilter. Hence, I suggest separating the logic of (1) and (2).
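To make the suggested separation concrete, here is a minimal, stdlib-only sketch. It is a simplified model, not controller-runtime's actual API: `Event`, `Predicate`, `Filter`, and `Reconcile` are hypothetical names invented for illustration. The filter only decides which events reach the queue, and all side effects (such as stopping a job) live in the reconcile step.

```go
package main

import "fmt"

// EventType mirrors the three resource events mentioned above.
type EventType string

const (
	Add    EventType = "Add"
	Update EventType = "Update"
	Delete EventType = "Delete"
)

// Event is a hypothetical, simplified resource event.
type Event struct {
	Type EventType
	Name string
}

// Predicate decides whether an event is interesting.
// In this model it must stay free of side effects.
type Predicate func(Event) bool

// Filter keeps only the events for which every predicate returns true,
// modeling what would be placed on the work queue.
func Filter(events []Event, preds ...Predicate) []Event {
	var queue []Event
	for _, e := range events {
		keep := true
		for _, p := range preds {
			if !p(e) {
				keep = false
				break
			}
		}
		if keep {
			queue = append(queue, e)
		}
	}
	return queue
}

// Reconcile holds the operator logic and describes the action it would take.
func Reconcile(e Event) string {
	if e.Type == Delete {
		return "stop job for " + e.Name
	}
	return "sync " + e.Name
}

func main() {
	events := []Event{{Add, "job-a"}, {Update, "job-a"}, {Delete, "job-a"}}
	// Example predicate (an assumption for this sketch): ignore Update events.
	notUpdate := func(e Event) bool { return e.Type != Update }
	for _, e := range Filter(events, notUpdate) {
		fmt.Println(Reconcile(e))
	}
}
```

With this split, the stop-job logic can be unit tested through `Reconcile` alone, without going through the filter.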
Test
For this PR, an integration test is not required for now because I am currently refactoring the E2E test framework (compatibility-test.py), but we can still do something to ensure its reliability.
(1) Add unit tests for StopJob.
(2) Add more information (e.g. instructions, screenshots, screen recordings) about how you tested this feature in the PR description.
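As a rough illustration of item (1), a stop-job helper can be exercised against a fake HTTP server from Go's `net/http/httptest` package. Everything below is an assumption for the sake of the sketch: `stopJob` is a hypothetical helper, not the StopJob in this PR, and the `/api/jobs/<id>/stop` path should be checked against the Ray Jobs REST API documentation before relying on it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// stopJob is a hypothetical helper (not KubeRay's actual StopJob).
// It asks a dashboard-like HTTP endpoint to stop the given job.
// The URL layout is an assumption made for this sketch.
func stopJob(client *http.Client, baseURL, jobID string) error {
	resp, err := client.Post(baseURL+"/api/jobs/"+jobID+"/stop", "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("stop job %s: unexpected status %d", jobID, resp.StatusCode)
	}
	return nil
}

// newFakeDashboard returns a test server that accepts a stop request
// only for knownJob and answers 404 for anything else.
func newFakeDashboard(knownJob string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPost && r.URL.Path == "/api/jobs/"+knownJob+"/stop" {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.NotFound(w, r)
	}))
}

func main() {
	srv := newFakeDashboard("job-1")
	defer srv.Close()

	fmt.Println(stopJob(srv.Client(), srv.URL, "job-1"))   // known job: stop is accepted
	fmt.Println(stopJob(srv.Client(), srv.URL, "missing")) // unknown job: an error is returned
}
```

The same pattern carries over to a `_test.go` file: start the fake server, call the helper, and assert on the returned error.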
@kevin85421 |
Thanks, @Basasuya, for your reply! It is OK to have no unit test for this PR for now, and we can open a new issue to track the integration test of this feature. By the way, as I mentioned above, why do we run |
@kevin85421 I have added a unit test for StopJob. Because deleting the job CR does not reach the Reconciler, I added the logic to WithEventFilter.
We need to discuss |
I guess this PR makes enough sense in the context of cluster selector mode for RayJobs. However, I think cluster selector mode is a fundamentally awkward workflow for a Kubernetes operator to manage. In cluster selector mode, creating a Job CR is supposed to trigger the one-time event of job submission. However, CRs are supposed to code for resources that can be reconciled by idempotent actions. If you need to submit a job to an existing RayCluster, I think it would make more sense to use the Ray Job submission API directly.
@DmitriGekhtman |
Thank you for the explanation! It really surprised me that operator-sdk hides the details of delete events. (I have implemented other operators with client-go directly.) I found a related discussion at operator-framework/operator-sdk#955. Maybe finalizers are a solution to: (1) Move the reconcile logic from
It is OK to merge this PR if the feature is very urgent for some users, and I will improve it with finalizers before the 0.4.0 release.
Good point, using a finalizer is probably the most idiomatic way to do it. |
In that situation, users can still submit a patch to remove the finalizers manually. Here is an example: https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/ |
Yep, pretty much every new hire at Anyscale is trained in that particular maneuver at one point or another... |
It's a nice solution for guaranteeing that the job is stopped only once. Maybe we can implement it this way in the next PR?
Let's follow up with a more idiomatic way of orchestrating stopping a job in another PR. |
…luster deletion (#735) See #629 to get more context. The behavior of this PR is almost the same as #629. The only difference is that this PR promises that operator will try to stop the job at least once. In #629, if the RayJob is deleted when the operator is down, the operator will not try to stop the job.
…r mode (ray-project#629) In cluster selector mode, deleting a CR should stop the job. This PR provides an initial implementation for this behavior. Co-authored-by: huyuanzhe <huyuanzhe@bytedance.com>
Why are these changes needed?
When a job is submitted in cluster selector mode, it is submitted to an existing cluster.
If the job CR is deleted, that cluster is not deleted.
In that case, we stop the job.
Related issue number
#595
#470
Checks
Manual test