
[ray job] support stop job after job cr is deleted in cluster selector mode #629

Conversation

@Basasuya (Contributor) commented Oct 12, 2022

Why are these changes needed?

When a job is submitted in cluster selector mode, it is submitted to an existing cluster.
If the RayJob CR is then deleted, that cluster is not deleted.
This PR stops the running job in that case.

Related issue number

#595
#470

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • [*] Manual tests
    • This PR is not tested :(

Manual test

  1. Apply the RayCluster (modified from ray-operator/config/samples/ray-cluster.complete.yaml; only the sample-code ConfigMap mount below is added):
......
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            - mountPath: /home/ray/samples
              name: code-sample
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "500m"
              memory: "1G"
        volumes:
          - name: ray-logs
            emptyDir: {}
            # You set volumes at the Pod level, then mount them into containers inside that Pod
          - name: code-sample
            configMap:
              # Provide the name of the ConfigMap you want to mount.
              name: ray-job-code-sample
              # An array of keys from the ConfigMap to create as files
              items:
                - key: sample_code.py
                  path: sample_code.py
.....

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import time

    @ray.remote
    class MyActor:
        def __init__(self):
            pass

        def func(self, v):
            return v

    seconds = 60
    actor = MyActor.remote()

    while seconds > 0:
        print(ray.get(actor.func.remote(seconds)))
        seconds -= 1  # count down so the job finishes after about one minute
        time.sleep(1)


  2. Apply the RayJob:
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample-2
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  clusterSelector:
    ray.io/cluster: "raycluster-complete"
  3. Make sure the job is long running (through describe rayjob rayjob-sample-2, or exec into the head node).
  4. Then delete the job (delete rayjob rayjob-sample-2).
  5. Make sure the job has been stopped and the cluster has not been deleted (exec into the head pod, then ps -ef, or use ray job list). Example commands are sketched below.
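
For reference, the checks above can be run roughly as follows; the head pod name and manifest file names are placeholders based on the samples above:

# 1-2. Apply the cluster and the job:
kubectl apply -f ray-cluster.complete.yaml
kubectl apply -f rayjob-sample-2.yaml
# 3. Confirm the job is running:
kubectl describe rayjob rayjob-sample-2
# 4. Delete the RayJob CR:
kubectl delete rayjob rayjob-sample-2
# 5. Confirm the driver process is gone but the cluster still exists:
kubectl exec -it <head-pod-name> -- ps -ef
kubectl exec -it <head-pod-name> -- ray job list
kubectl get raycluster raycluster-complete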

@Jeffwan (Collaborator) left a comment

Overall looks good to me. Please address the linter issue reported by the GitHub Actions run.

@kevin85421 (Member) left a comment

Thank @Basasuya for the contribution!

Logic

In my understanding, WithEventFilter [1] is used to define predicates, i.e. the filters for Add / Update / Delete events. Only events for which all predicates evaluate to true are put into the WorkQueue. The reconciler then retrieves resource events from the WorkQueue and performs the operator logic based on each event.

To summarize,

(1) WithEventFilter defines which events we are interested in.
(2) Reconciler defines how to handle the resource event.

Currently, this PR implements both (1) and (2) in WithEventFilter. Hence, I suggest separating the logic of (1) and (2); see the sketch after the reference below.

[1] https://sdk.operatorframework.io/docs/building-operators/golang/references/event-filtering/#using-predicates
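
As a rough illustration of that separation, a minimal controller-runtime sketch could look like the following. The type name, import paths, and the always-true delete predicate are assumptions for illustration only, not the PR's actual code.

package controllers

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"

    rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// RayJobReconciler stands in for the PR's reconciler type in this sketch.
type RayJobReconciler struct{}

// (1) WithEventFilter only decides which events reach the WorkQueue;
// it should not perform work such as stopping the Ray job.
func (r *RayJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&rayv1alpha1.RayJob{}).
        WithEventFilter(predicate.Funcs{
            DeleteFunc: func(e event.DeleteEvent) bool {
                // We care about RayJob deletions, so let the event through.
                return true
            },
        }).
        Complete(r)
}

// (2) Reconcile decides how to handle the enqueued resource event, e.g.
// stopping the running Ray job on the selected cluster.
func (r *RayJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // ... fetch the RayJob, detect deletion (typically via a finalizer),
    // and stop the job if needed ...
    return ctrl.Result{}, nil
}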

Test

For this PR, an integration test is not required for now because I am currently refactoring the E2E test framework (compatibility-test.py), but we can still do something to ensure its reliability.

(1) Add unit tests for StopJob.
(2) Add more information (e.g. instructions, screenshots, a screen recording) about how you tested this feature to the PR description.

@Basasuya (Contributor, author)

@kevin85421
This PR only modifies the logic around WithEventFilter.
I have added the test information to the PR description. The test might be a bit complicated.

@Basasuya force-pushed the basasuya_upstream_master_support_stop_job branch from bd39001 to be36c6c on October 18, 2022 14:00
@kevin85421 (Member)

> @kevin85421 This PR only modifies the logic around WithEventFilter; I have added the test information to the PR description. The test might be a bit complicated.

Thank @Basasuya for your reply!

It is OK to have no unit test for this PR now, and we can open a new issue to track the integration test for this feature. By the way, as I mentioned above, why do we run StopJob in WithEventFilter rather than in the Reconciler?

@Basasuya force-pushed the basasuya_upstream_master_support_stop_job branch from be36c6c to 172fad5 on October 28, 2022 06:25
@Basasuya (Contributor, author)

@kevin85421 I have added the unit test for StopJob. Because deleting the RayJob CR does not reach the Reconciler, I added the logic to WithEventFilter.
Here is the new issue to track the integration test:
#664

@kevin85421 (Member)

We need to discuss clusterSelector further. I find it very weird to control the Ray cluster state (stop Ray job) based on Kubernetes resource events (delete RayJob CR). cc @DmitriGekhtman

@DmitriGekhtman (Collaborator)

I guess this PR makes enough sense in the context of cluster selector mode for RayJobs... however, I think cluster selector mode is a fundamentally awkward workflow for a Kubernetes operator to manage.

In cluster selector mode, creating a Job CR is supposed to trigger the one-time event of job submission.
Deleting a Job CR is supposed to trigger the one-time event of stopping the job.

However, CRs are supposed to code for resources that can be reconciled by idempotent actions.
With the current design, there are no delivery semantics for the job stop request -- it's neither at-most-once nor at-least-once.

If you need to submit a job to an existing RayCluster, I think it would make more sense to use the Ray Job submission API directly.
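
For illustration, submitting directly against an existing cluster's dashboard with the Ray Job CLI looks roughly like this (the address, script path, and submission ID are placeholders):

# Submit the job to an existing cluster through its dashboard address:
ray job submit --address http://<head-node-ip>:8265 -- python /home/ray/samples/sample_code.py
# Stop it later using the submission ID printed by the command above:
ray job stop <submission-id> --address http://<head-node-ip>:8265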

@DmitriGekhtman (Collaborator) left a comment

@Jeffwan @Basasuya is it ready to merge?

@Basasuya (Contributor, author)

@DmitriGekhtman
I agree with your opinion; cluster selector mode differs somewhat from the usual Kubernetes operator design.
It is ready to merge.

@kevin85421 (Member)

> @kevin85421 I have added the unit test for StopJob. Because deleting the RayJob CR does not reach the Reconciler, I added the logic to WithEventFilter. Here is the new issue to track the integration test: #664

Thank you for the explanation! It really surprised me that operator-sdk hides the details of delete events. (I implemented other operators with client-go directly.)

I found a related discussion at operator-framework/operator-sdk#955. Maybe finalizers are a solution to:

(1) Move the reconcile logic from WithEventFilter to Reconciler.
(2) Make sure the job is stopped exactly once (the problem mentioned in @DmitriGekhtman's comment).

It is OK to merge this if the feature is very urgent for some users, and I will improve it with finalizers before the 0.4.0 release.
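
As a rough sketch of what a finalizer-based flow could look like inside the reconciler (the finalizer name and the stopJob helper are hypothetical, not the actual KubeRay implementation; this assumes the same types and imports as the earlier sketch plus sigs.k8s.io/controller-runtime/pkg/controller/controllerutil, and that the reconciler embeds client.Client):

// Hypothetical finalizer name for this sketch.
const rayJobStopFinalizer = "ray.io/rayjob-stop-job"

func (r *RayJobReconciler) reconcileDeletion(ctx context.Context, rayJob *rayv1alpha1.RayJob) (ctrl.Result, error) {
    if rayJob.DeletionTimestamp.IsZero() {
        // Not being deleted: ensure the finalizer is present so the operator
        // gets a chance to stop the job before the CR disappears.
        if !controllerutil.ContainsFinalizer(rayJob, rayJobStopFinalizer) {
            controllerutil.AddFinalizer(rayJob, rayJobStopFinalizer)
            if err := r.Update(ctx, rayJob); err != nil {
                return ctrl.Result{}, err
            }
        }
        return ctrl.Result{}, nil
    }

    // Being deleted: stop the Ray job first. An error requeues the request,
    // so the stop attempt is delivered at least once; only then is the
    // finalizer removed so Kubernetes can finish the deletion.
    if controllerutil.ContainsFinalizer(rayJob, rayJobStopFinalizer) {
        if err := r.stopJob(ctx, rayJob); err != nil { // hypothetical helper
            return ctrl.Result{}, err
        }
        controllerutil.RemoveFinalizer(rayJob, rayJobStopFinalizer)
        if err := r.Update(ctx, rayJob); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}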

@DmitriGekhtman (Collaborator)

Good point, using a finalizer is probably the most idiomatic way to do it.
The downside of that approach is that it could result in hard-to-remove dangling resources in situations where the operator isn't working properly.

@kevin85421 (Member)

> Good point, using a finalizer is probably the most idiomatic way to do it. The downside of that approach is that it could result in hard-to-remove dangling resources in situations where the operator isn't working properly.

In that situation, users can still submit a patch to remove the finalizers manually. Here is an example: https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/
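
For instance, a merge patch along these lines (the RayJob name is a placeholder) clears stuck finalizers by hand so the deletion can complete:

# Remove all finalizers from a stuck RayJob so Kubernetes can finish deleting it:
kubectl patch rayjob rayjob-sample-2 --type merge -p '{"metadata":{"finalizers":null}}'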

@DmitriGekhtman (Collaborator)

> Good point, using a finalizer is probably the most idiomatic way to do it. The downside of that approach is that it could result in hard-to-remove dangling resources in situations where the operator isn't working properly.

> In that situation, users can still submit a patch to remove the finalizers manually. Here is an example: https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/

Yep, pretty much every new hire at Anyscale is trained in that particular maneuver at one point or another...

@Basasuya (Contributor, author) commented Nov 2, 2022

It's a nice solution for guaranteeing that the job is stopped only once; maybe we can implement it that way in the next PR?
cc @Jeffwan

@DmitriGekhtman (Collaborator)

Let's follow up with a more idiomatic way of orchestrating stopping a job in another PR.

@DmitriGekhtman merged commit 792af2a into ray-project:master on Nov 3, 2022
kevin85421 added a commit that referenced this pull request Nov 24, 2022
…luster deletion (#735)

See #629 for more context. The behavior of this PR is almost the same as #629. The only difference is that this PR promises that the operator will try to stop the job at least once. In #629, if the RayJob is deleted while the operator is down, the operator will not try to stop the job.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…r mode (ray-project#629)

In cluster selector mode, deleting the RayJob CR should stop the job. This PR provides an initial implementation of this behavior.

Co-authored-by: huyuanzhe <huyuanzhe@bytedance.com>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…luster deletion (ray-project#735)

See ray-project#629 for more context. The behavior of this PR is almost the same as ray-project#629. The only difference is that this PR promises that the operator will try to stop the job at least once. In ray-project#629, if the RayJob is deleted while the operator is down, the operator will not try to stop the job.