Skip to content

Conversation

@kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Jan 8, 2024

Why are these changes needed?

#1783 (review)

Kueue does not admit RayJob with shutDownAfterFinishes = false, so at KubeRay level, we may want to consider this an invalid configuration for suspended RayJobs.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title WIP [RayJob] Validate RayJob spec Jan 8, 2024
@kevin85421 kevin85421 marked this pull request as ready for review January 8, 2024 20:48
@kevin85421
Copy link
Member Author

cc @astefanutti @andrewsykim

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me, but why do we want this to be invalid?

@kevin85421
Copy link
Member Author

The code looks good to me, but why do we want this to be invalid?

#1813 (comment)?

@architkulkarni
Copy link
Contributor

I read that but I still don't understand, I think I'm missing some context. Why doesn't Kueue accept jobs with shutDownAfterFinishes = false, and secondly, why should this be enforced by KubeRay and not just by Kueue?

@kevin85421
Copy link
Member Author

Got it.

Why doesn't Kueue accept jobs with shutDownAfterFinishes = false

In my understanding, if users set shutDownAfterFinishes to false, it means they do not want to delete the cluster for some reason. Hence, we should not delete the RayCluster when suspend is set to true.

secondly, why should this be enforced by KubeRay and not just by Kueue?

This is a good question. Kueue can prevent the suspension of a RayJob with shutDownAfterFinishes: false. However, users might still be able to suspend the RayJob manually on their own. It might be better to align with Kueue's behavior. Therefore, adding the validation in KubeRay makes sense to me.

cc @andrewsykim @astefanutti for more context about Kueue

@kevin85421 kevin85421 force-pushed the validate-rayjob-cr-spec branch from a5eb5e8 to 1471c46 Compare January 10, 2024 03:10
@andrewsykim
Copy link
Member

Why doesn't Kueue accept jobs with shutDownAfterFinishes = false

This is mainly because we release the quota for the job when the job finishes, but leaving the cluster would mean those resources are still used.

secondly, why should this be enforced by KubeRay and not just by Kueue?

Kueue does have a validating webhook for RayJob and does enforce this: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/rayjob/rayjob_webhook.go#L89-L92

@kevin85421
Copy link
Member Author

@architkulkarni Does the explanation make sense to you? Thanks!

@kevin85421
Copy link
Member Author

test-rayservice-sample-yamls-nightly-operator is flaky due to #1808 and has no relationship with this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants