[ray job] support stop job after job cr is deleted in cluster selector mode #629

Conversation

@Basasuya (Contributor) commented Oct 12, 2022

Why are these changes needed?

When submitting a job in cluster selector mode, the job is submitted to an existing cluster.
If the RayJob CR is deleted, that cluster is not deleted.
This PR stops the running job in that case.

Related issue number

#595
#470

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • [*] Manual tests
    • This PR is not tested :(

Manual test

  1. Apply the RayCluster (modified from ray-operator/config/samples/ray-cluster.complete.yaml; only the sample-code ConfigMap and its mount are added):
......
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            - mountPath: /home/ray/samples
              name: code-sample
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "500m"
              memory: "1G"
        volumes:
          - name: ray-logs
            emptyDir: {}
            # You set volumes at the Pod level, then mount them into containers inside that Pod
          - name: code-sample
            configMap:
              # Provide the name of the ConfigMap you want to mount.
              name: ray-job-code-sample
              # An array of keys from the ConfigMap to create as files
              items:
                - key: sample_code.py
                  path: sample_code.py
.....

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import time

    @ray.remote
    class MyActor:
        def __init__(self):
            pass

        def func(self, v):
            return v

    seconds = 60
    actor = MyActor.remote()

    # Note: `seconds` is never decremented, so this loop keeps the job running
    # until it is stopped externally.
    while seconds > 0:
        print(ray.get(actor.func.remote(seconds)))
        time.sleep(1)


  2. Apply the RayJob:
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample-2
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  clusterSelector:
    ray.io/cluster: "raycluster-complete"
  3. Make sure the job is long running (through kubectl describe rayjob rayjob-sample-2 or by exec-ing into the head node).
  4. Delete the RayJob (kubectl delete rayjob rayjob-sample-2).
  5. Make sure the job has been stopped while the cluster is not deleted (exec into the head pod, then use ps -ef or ray job list).

@Jeffwan (Collaborator) left a comment

Overall looks good to me; please address the linter issue reported by GitHub Actions.

Review threads on ray-operator/controllers/ray/rayjob_controller.go (outdated, resolved)
@kevin85421 (Member) left a comment

Thanks @Basasuya for the contribution!

Logic

In my understanding, WithEventFilter [1] is used to define the resource event filters (predicates) for Add / Update / Delete events. Only events for which all predicates evaluate to true are put into the WorkQueue. Next, the reconciler retrieves the resource event from the WorkQueue and performs the operator logic based on that event.

To summarize,

(1) WithEventFilter defines which events we are interested in.
(2) Reconciler defines how to handle the resource event.

Currently, this PR implements both (1) and (2) in WithEventFilter. Hence, I suggest separating the logic of (1) and (2).

[1] https://sdk.operatorframework.io/docs/building-operators/golang/references/event-filtering/#using-predicates
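
For readers less familiar with controller-runtime, here is a minimal sketch of that separation (illustrative only; the reconciler type and the watched resource below are placeholders rather than KubeRay's actual code):

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// demoReconciler stands in for a reconciler such as RayJobReconciler.
type demoReconciler struct {
	client.Client
}

// (2) The Reconciler defines HOW to handle a resource event pulled off the WorkQueue.
func (r *demoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the object named by req.NamespacedName and drive it toward the desired state here.
	return ctrl.Result{}, nil
}

// (1) WithEventFilter defines WHICH events are enqueued: an event reaches the
// WorkQueue only if every registered predicate returns true for it.
func (r *demoReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // stand-in for the RayJob CR type
		WithEventFilter(predicate.Funcs{
			CreateFunc:  func(e event.CreateEvent) bool { return true },
			UpdateFunc:  func(e event.UpdateEvent) bool { return true },
			DeleteFunc:  func(e event.DeleteEvent) bool { return true },
			GenericFunc: func(e event.GenericEvent) bool { return false },
		}).
		Complete(r)
}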

Test

For this PR, an integration test is not required right now because I am currently refactoring the E2E test framework (compatibility-test.py), but we can still do something to ensure the feature's reliability.

(1) Add unit tests for StopJob
(2) Add more information (e.g. instructions, screenshot, screen recording) about how you test this feature in the PR description.

@Basasuya (Contributor, Author) commented

@kevin85421
This PR only modifies the logic in WithEventFilter;
I have added the test information to the PR description. The test might be a bit complicated.

@Basasuya force-pushed the basasuya_upstream_master_support_stop_job branch from bd39001 to be36c6c on October 18, 2022 14:00
@kevin85421 (Member) commented

> @kevin85421 This PR only modifies the logic in WithEventFilter; I have added the test information to the PR description. The test might be a bit complicated.

Thanks @Basasuya for your reply!

It is OK to have no unit test for this PR now, and we can open a new issue to track the integration test for this feature. By the way, as I mentioned above, why do we run StopJob in WithEventFilter rather than in the Reconciler?

@Basasuya force-pushed the basasuya_upstream_master_support_stop_job branch from be36c6c to 172fad5 on October 28, 2022 06:25
@Basasuya (Contributor, Author) commented

@kevin85421 I have added a unit test for StopJob. Because deleting the job CR does not trigger the Reconciler, I added the logic to WithEventFilter.
Here is the new issue to track the integration test:
#664
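
(For illustration only: a rough sketch of the general shape of handling the delete event inside the event filter. The stopJobOnDashboard helper, the service URL, and the stop endpoint are assumptions, not the PR's actual StopJob code.)

package controllers

import (
	"fmt"
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// Sketch: react to the RayJob CR's delete event inside the event filter,
// because the deleted object no longer exists by the time Reconcile would run.
var rayJobDeletePredicate = predicate.Funcs{
	DeleteFunc: func(e event.DeleteEvent) bool {
		// The real controller would check that this RayJob used clusterSelector
		// and look up its cluster's dashboard address and job ID from e.Object.
		_ = stopJobOnDashboard("http://raycluster-complete-head-svc:8265", "rayjob-sample-2")
		// The event does not need to reach the WorkQueue; the handling already happened here.
		return false
	},
}

// stopJobOnDashboard is a hypothetical helper that asks the Ray dashboard to
// stop a job (the endpoint below is assumed to be POST /api/jobs/<id>/stop).
func stopJobOnDashboard(dashboardURL, jobID string) error {
	resp, err := http.Post(fmt.Sprintf("%s/api/jobs/%s/stop", dashboardURL, jobID), "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("stop job failed: %s", resp.Status)
	}
	return nil
}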

@kevin85421 (Member) commented

We need to discuss clusterSelector further. I find it very weird to control the Ray cluster state (stop Ray job) based on Kubernetes resource events (delete RayJob CR). cc @DmitriGekhtman

@DmitriGekhtman (Collaborator) commented

I guess this PR makes enough sense in the context of cluster selector mode for RayJobs... however, I think cluster selector mode is a fundamentally awkward workflow for a Kubernetes operator to manage.

In cluster selector mode, creating a Job CR is supposed to trigger the one-time event of job submission.
Deleting a Job CR is supposed to trigger the one-time event of stopping the job.

However, CRs are supposed to code for resources that can be reconciled by idempotent actions.
With the current design, there are no delivery semantics for the job stop request -- it's neither at-most-once nor at-least-once.

If you need to submit a job to an existing RayCluster, I think it would make more sense to use the Ray Job submission API directly.
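
For example, something along these lines could submit directly to an existing cluster's dashboard (a hedged sketch: the head service name, port 8265, and the POST /api/jobs/ payload shape are assumptions about the Ray Jobs REST API, not code from this PR):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Submit an entrypoint straight to the existing cluster's Ray dashboard,
	// bypassing the RayJob CR / clusterSelector path entirely.
	payload, _ := json.Marshal(map[string]string{
		"entrypoint": "python /home/ray/samples/sample_code.py",
	})
	resp, err := http.Post(
		"http://raycluster-complete-head-svc:8265/api/jobs/",
		"application/json",
		bytes.NewReader(payload),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("submission response:", resp.Status)
}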

@DmitriGekhtman (Collaborator) left a comment

@Jeffwan @Basasuya is it ready to merge?

@Basasuya (Contributor, Author) commented

@DmitriGekhtman
I agree with your opinion; cluster selector mode differs somewhat from the usual Kubernetes operator design.
It is ready to merge.

@kevin85421 (Member) commented

> @kevin85421 I have added a unit test for StopJob; because deleting the job CR does not trigger the Reconciler, I added the logic to WithEventFilter; here is the new issue to track the integration test: #664

Thank you for the explanation! It really surprised me that operator-sdk hides the details of delete events. (I implemented other operators with client-go directly.)

I found a related discussion at operator-framework/operator-sdk#955. Maybe finalizers are a solution to:

(1) Move the reconcile logic from WithEventFilter to the Reconciler.
(2) Make sure the job is stopped exactly once (the problem mentioned in @DmitriGekhtman's comment).

It is OK to merge this if the feature is urgent for some users, and I will improve it with finalizers before the 0.4.0 release.
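
A rough sketch of what a finalizer-based flow could look like (illustrative only: the finalizer name, the stopJob stub, and the import path are assumptions, not the eventual implementation):

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // assumed import path
)

const rayJobStopFinalizer = "ray.io/rayjob-stop" // hypothetical finalizer name

type finalizerDemoReconciler struct {
	client.Client
}

func (r *finalizerDemoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var rayJob rayv1alpha1.RayJob
	if err := r.Get(ctx, req.NamespacedName, &rayJob); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if rayJob.DeletionTimestamp.IsZero() {
		// Not being deleted: make sure our finalizer is present so that a later
		// deletion waits for the operator (this replaces the delete-event filter).
		if !controllerutil.ContainsFinalizer(&rayJob, rayJobStopFinalizer) {
			controllerutil.AddFinalizer(&rayJob, rayJobStopFinalizer)
			if err := r.Update(ctx, &rayJob); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	// Being deleted: stop the running Ray job, then remove the finalizer so
	// Kubernetes can finish the deletion. If stopping fails, the finalizer stays
	// and the reconcile is retried, which gives at-least-once semantics.
	if controllerutil.ContainsFinalizer(&rayJob, rayJobStopFinalizer) {
		if err := r.stopJob(ctx, &rayJob); err != nil {
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(&rayJob, rayJobStopFinalizer)
		if err := r.Update(ctx, &rayJob); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}

// stopJob is a stub for whatever call actually stops the submitted Ray job.
func (r *finalizerDemoReconciler) stopJob(ctx context.Context, rayJob *rayv1alpha1.RayJob) error {
	return nil
}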

@DmitriGekhtman (Collaborator) commented

Good point, using a finalizer is probably the most idiomatic way to do it.
The downside of that approach is that it could result in hard-to-remove dangling resources in situations where the operator isn't working properly.

@kevin85421 (Member) commented

> Good point, using a finalizer is probably the most idiomatic way to do it. The downside of that approach is that it could result in hard-to-remove dangling resources in situations where the operator isn't working properly.

In that situation, users can still submit a patch to remove the finalizers manually. Here is an example: https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/

@DmitriGekhtman (Collaborator) commented

> Good point, using a finalizer is probably the most idiomatic way to do it. The downside of that approach is that it could result in hard-to-remove dangling resources in situations where the operator isn't working properly.
>
> In that situation, users can still submit a patch to remove the finalizers manually. Here is an example: https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/

Yep, pretty much every new hire at Anyscale is trained in that particular maneuver at one point or another...

@Basasuya (Contributor, Author) commented Nov 2, 2022

That sounds like a nice solution to guarantee the job is stopped only once; maybe we can implement it that way in a follow-up PR?
cc @Jeffwan

@DmitriGekhtman (Collaborator) commented

Let's follow up with a more idiomatic way of orchestrating stopping a job in another PR.

@DmitriGekhtman merged commit 792af2a into ray-project:master on Nov 3, 2022
kevin85421 added a commit that referenced this pull request Nov 24, 2022
…luster deletion (#735)

See #629 to get more context. The behavior of this PR is almost the same as #629. The only difference is that this PR promises that the operator will try to stop the job at least once. In #629, if the RayJob is deleted while the operator is down, the operator will not try to stop the job.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…r mode (ray-project#629)

In cluster selector mode, deleting a CR should stop the job. This PR provides an initial implementation of this behavior.

Co-authored-by: huyuanzhe <huyuanzhe@bytedance.com>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…luster deletion (ray-project#735)

See ray-project#629 to get more context. The behavior of this PR is almost the same as ray-project#629. The only difference is that this PR promises that the operator will try to stop the job at least once. In ray-project#629, if the RayJob is deleted while the operator is down, the operator will not try to stop the job.