[Feature] Support suspend in RayJob #926

Merged
8 commits merged into ray-project:master from the kueue-integration branch on May 16, 2023

Conversation

@oginskis (Contributor) commented Feb 23, 2023

Why are these changes needed?

Native Kubernetes Jobs have a suspend flag that allows a Job's execution to be temporarily suspended and resumed later, or allows Jobs to be started in a suspended state so that a custom controller, such as Kueue, can decide later when to start them.

This change adds the flag to the RayJob spec for consistency. Moreover, frameworks such as Kubeflow are adding it as well, so it is becoming standard functionality. An example implementation for MPIJob: kubeflow/mpi-operator#511

Implementation details

  • If a RayJob is created with spec.suspend == true, then the RayCluster instance (with its corresponding Kubernetes resources) is not created and the Ray job is not submitted to the cluster. The JobDeploymentStatus is set to Suspended and the corresponding event is issued. The RayJob remains in this state until the job is unsuspended.

  • If suspend flips from true to false, then the RayJob controller immediately creates a RayCluster instance and submits the job.

  • If suspend flips from false to true while the job is running, then the RayJob controller tries to gracefully stop the job and deletes the RayCluster instance (with its underlying Kubernetes resources). The JobDeploymentStatus is set to Suspended, JobStatus is set to STOPPED, and the corresponding event is issued (see the sketch below).

Edge case: the suspend flag is ignored if a RayJob is submitted against an existing RayCluster instance (matched with ClusterSelector), since we can't delete a RayCluster created by somebody else.
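For illustration, a minimal sketch of the suspend/resume decision. It reuses identifiers that appear in this PR (Spec.Suspend, JobDeploymentStatusSuspended), but the module path and the helper itself are assumptions, not the actual rayjob_controller.go code:

package main

import (
	"fmt"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// suspendAction is a hypothetical helper that maps the spec/status combination
// to the actions described in the bullet points above.
func suspendAction(rayJob *rayv1alpha1.RayJob) string {
	suspended := rayJob.Status.JobDeploymentStatus == rayv1alpha1.JobDeploymentStatusSuspended
	switch {
	case rayJob.Spec.Suspend && !suspended:
		// Stop the running job (if any), delete the RayCluster, and mark the RayJob as Suspended.
		return "suspend"
	case !rayJob.Spec.Suspend && suspended:
		// Re-create the RayCluster, submit the job again, and reset startTime/endTime.
		return "resume"
	default:
		// No suspend-related work to do.
		return "none"
	}
}

func main() {
	job := &rayv1alpha1.RayJob{}
	job.Spec.Suspend = true
	fmt.Println(suspendAction(job)) // a freshly created RayJob with suspend: true -> "suspend"
}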

No Kueue-specific code leaks into the KubeRay implementation.

Contributors from Kueue/Kubernetes cc'ed:

@alculquicondor
@mwielgus

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Jeffwan (Collaborator) commented Feb 23, 2023

Thanks for the feature request. We will have a look and come back to you soon

@oginskis (Contributor, Author) commented Feb 28, 2023

Hello @Jeffwan
Thank you!
Can I do anything to speed up the review?

@alculquicondor commented Feb 28, 2023

I quickly looked through https://github.com/ray-project/community and I didn't see any recurring meeting.

Is there any venue where we can meet if you have questions about why this is important? @Jeffwan

@architkulkarni (Contributor) commented:

> I quickly looked through https://github.com/ray-project/community and I didn't see any recurring meeting.
>
> Is there any venue where we can meet if you have questions about why this is important? @Jeffwan

Thanks for bringing this up -- we do have a biweekly Kuberay sync, @gvspraveen can check on adding this to the linked community page.

In this case, we discussed internally and don't have any questions, no need to present anything at the sync. Because suspend is a standard feature for native Kubernetes jobs, it makes sense for this to be a part of KubeRay. We'll review the PR as soon as we can!

@gvspraveen (Contributor) commented:

> Thanks for bringing this up -- we do have a biweekly Kuberay sync, @gvspraveen can check on adding this to the linked community page.

That community page seems to be for all of ray-project, so I'm not sure adding meeting links is appropriate there. But there is a Slack group linked in the Ray docs; you can reach out to us there and we can add you to the meeting.

@oginskis (Contributor, Author) commented Mar 1, 2023

@architkulkarni Could you re-run the workflows? The goimports issue is fixed.

@@ -61,6 +62,8 @@ type RayJobSpec struct {
RayClusterSpec *RayClusterSpec `json:"rayClusterSpec,omitempty"`
// clusterSelector is used to select running rayclusters by labels
ClusterSelector map[string]string `json:"clusterSelector,omitempty"`
// suspend specifies whether the RayJob controller should create a RayCluster instance
Suspend bool `json:"suspend,omitempty"`


Worth explaining what happens with the RayCluster on transitions from false to true and true to false.
Does it affect .status.startTime?

@mcariatm replied on Mar 6, 2023:

Done, .status.startTime is now updated on each new start.

@oginskis (Contributor, Author) replied:

@mcariatm Could you also reset .status.endTime when a job is resumed (transitions from suspend true -> false)?

> @mcariatm Could you also reset .status.endTime when a job is resumed (transitions from suspend true -> false)?

Done, line 400 in rayjob_controller.go.

@@ -272,6 +326,11 @@ func isJobPendingOrRunning(status rayv1alpha1.JobStatus) bool {
return (status == rayv1alpha1.JobStatusPending) || (status == rayv1alpha1.JobStatusRunning)
}

// isSuspendFlagSet indicates whether the job has a suspended flag set.
func isSuspendFlagSet(job *rayv1alpha1.RayJob) bool {
	return job.Spec.Suspend
}


This one-liner doesn't add value; I would just inline job.Spec.Suspend where necessary.


Done.

@mcariatm commented Mar 6, 2023

@architkulkarni, @Jeffwan I found an issue. When a RayJob is deleted and then applied again after a short time, we get this error:

Failed to start Job Supervisor actor: The name _ray_internal_job_actor_rayjob-sample-fl6rc
    (namespace=SUPERVISOR_ACTOR_RAY_NAMESPACE) is already taken. Please use a different
    name or get the existing actor using ray.get_actor(''_ray_internal_job_actor_rayjob-sample-fl6rc'',
    namespace=''SUPERVISOR_ACTOR_RAY_NAMESPACE'').'
  rayClusterName: rayjob-sample-raycluster-nwxd6

As far as I can tell, this happens when the old RayCluster is still in terminating status, and the job remains in FAILED status with this error. If the jobs are orchestrated by Kueue, this problem may occur more and more often.
What do you think?
Should we fix it in this PR? Should we wait for the cluster to finish terminating?

@oginskis (Contributor, Author) commented Mar 6, 2023

@mcariatm
I believe the easiest way to address this race condition is to reset rayJob.Status.RayClusterName every time we delete the cluster (when suspend goes from false to true), along with resetting rayJob.Status.RayClusterStatus here: https://github.com/epam/kuberay/blob/kueue-integration/ray-operator/controllers/ray/rayjob_controller.go#L256

Currently, we reuse the old cluster ID when the RayCluster instance is re-created, so if the suspend false -> true -> false transitions happen too quickly, the job controller could connect to the old, still-terminating instance and try to submit the job against it, as you explained.

Also, consider resetting rayJob.Status.DashboardURL, rayJob.Status.JobId, and maybe rayJob.Status.Message.

@alculquicondor commented:
What happens with the RayCluster object when you call Delete on it? Does it immediately disappear or is it held by a finalizer until it's actually cleaned up?

If the object persists, I think it's more appropriate to wait for it to finish before creating the new RayCluster. If not, what @oginskis suggests sounds like a good option.

@mcariatm commented Mar 7, 2023

After a lot of testing I realized that the problem does not actually depend on clusters that are in terminating status.
When the RayCluster is created, a request to the dashboard is made with a hard-coded 2-second timeout, and on low-resource clusters the dashboard fails to respond within 2 seconds.
I added the context-based fix to this PR because we depend on it.
I also decided to reset ClusterName, DashboardUrl, JobId, and Message in the status.
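A minimal sketch of the context-based request pattern, assuming a plain net/http client and a placeholder dashboard URL (the real change lives in dashboard_httpclient and differs in its details):

package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchWithContext bounds the request lifetime by the caller's ctx instead of a
// hard-coded http.Client timeout, so the reconciler decides how long to wait.
func fetchWithContext(ctx context.Context, client *http.Client, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// The caller controls the deadline, e.g. 20 seconds instead of the old 2.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	out, err := fetchWithContext(ctx, &http.Client{}, "http://raycluster-head-svc:8265/api/jobs/")
	fmt.Println(out, err)
}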

@@ -160,14 +160,14 @@ func FetchDashboardURL(ctx context.Context, log *logr.Logger, cli client.Client,

 func (r *RayDashboardClient) InitClient(url string) {
 	r.client = http.Client{
-		Timeout: 2 * time.Second,
+		Timeout: 20 * time.Second,
@oginskis (Contributor, Author) commented:

Maybe make this timeout configurable?


maybe not in this PR... 😆

@kevin85421 (Member) commented:

Thanks @oginskis @mcariatm for this contribution!

I will start reviewing this PR this week. By the way, there seem to be some updates in dashboard_httpclient. It is highly possible that KubeRay will redesign RayJob in April. This means we will likely remove the Dashboard HTTP client, because submitting a job to a RayCluster via the HTTP client is not an idempotent operation; it sometimes causes multiple job creations within a single RayJob. Please see #756 for more context.

If this PR heavily relies on that part, it may not be merged in this release (v0.5.0 is planned for the end of March or early April). The KubeRay team will do our best to get this PR merged in v0.6.0.

rayJobInstance.Status.DashboardURL = ""
rayJobInstance.Status.JobId = ""
rayJobInstance.Status.Message = ""
err = r.updateState(ctx, rayJobInstance, jobInfo, rayv1alpha1.JobStatusStopped, rayv1alpha1.JobDeploymentStatusSuspended, nil)


Why is this in a separate API call? Can it be combined with the call on line 246?

Contributor:

Seconding this question


It looks like this was addressed.

@denkensk (Contributor) commented Mar 9, 2023

/cc

@alculquicondor commented:
/retitle [Feature] Support suspend in RayJob

@tenzen-y commented Mar 9, 2023

/cc

@kevin85421 (Member) left a comment:

I took a pass over this PR. Thanks for this contribution and the detailed PR description! As I mentioned above, it is highly possible that KubeRay will redesign RayJob in April (after v0.5.0) and remove the Dashboard HTTP client. We may either use a head Pod command or a separate Kubernetes Job to submit the job to the RayCluster. Again, the KubeRay team will do our best to get this PR merged in v0.6.0.

I'm not familiar with the Kubernetes built-in Job and am curious about the use case for suspend. Could you please explain it to me? I ask because when suspend is set to true while the RayJob is running, the RayCluster is deleted and the job must be rerun entirely rather than resuming from its previous progress. What's the difference compared with submitting a new RayJob? Thanks!

@alculquicondor commented Mar 13, 2023

The documentation for Job is here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job

suspend is arguably a bit of a misnomer. You can think of it as "preemption", but we chose suspend for consistency with existing fields in Deployment and CronJob.

The idea is that there is a higher-level controller that can determine that the cluster is full and keep suspend=true until there is space in the cluster (aka queueing), or even set it for a job that is already running, in order to make space for a higher-priority job (aka preemption).
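For example, a higher-level controller could toggle the field roughly like this (a minimal client-go sketch; the Job name, namespace, and in-cluster config are placeholders):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/pointer"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Fetch a queued Job that was created with suspend: true.
	job, err := clientset.BatchV1().Jobs("default").Get(context.TODO(), "example-job", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Unsuspend it once there is capacity (queueing); setting it back to true
	// while the Job runs stops its pods to make room for higher-priority work (preemption).
	job.Spec.Suspend = pointer.Bool(false)
	if _, err := clientset.BatchV1().Jobs("default").Update(context.TODO(), job, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("unsuspended Job", job.Name)
}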

@alculquicondor commented:
That said, if a particular application or framework implements checkpointing, the job could resume where it was suspended.
Is there an alternative way of implementing suspend in RayJob that would give us these semantics? Keep in mind that the key requirement for suspend is that, when set, all running pods should stop (respecting the grace period).

@oginskis (Contributor, Author) commented:
/retitle [Feature] Support suspend in RayJob

@oginskis changed the title from "[Feature] Support Kueue.sh for RayJob admission" to "[Feature] Support suspend in RayJob" on Mar 13, 2023
@alculquicondor left a comment:

LGTM from my side

return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
if !rayv1alpha1.IsJobTerminal(info.JobStatus) {
err := rayDashboardClient.StopJob(ctx, rayJobInstance.Status.JobId, &r.Log)


it looks like this was addressed

rayJobInstance.Status.DashboardURL = ""
rayJobInstance.Status.JobId = ""
rayJobInstance.Status.Message = ""
err = r.updateState(ctx, rayJobInstance, jobInfo, rayv1alpha1.JobStatusStopped, rayv1alpha1.JobDeploymentStatusSuspended, nil)


It looks like this was addressed.

@mcariatm commented May 3, 2023

@architkulkarni please rerun the linter and the other checks.

@@ -6,7 +6,7 @@
 package v1alpha1

 import (
-	v1 "k8s.io/api/core/v1"
+	"k8s.io/api/core/v1"
Contributor:

Suggested change:
-	"k8s.io/api/core/v1"
+	v1 "k8s.io/api/core/v1"

Linter complains about this change for some reason... https://github.com/ray-project/kuberay/actions/runs/4876658441/jobs/8702043318?pr=926#step:11:19

I think it happens when running make test or possibly other commands. I should try to get to the bottom of it.


please rerun

@mcariatm force-pushed the kueue-integration branch 2 times, most recently from d810ed9 to 3371dee, on May 4, 2023 06:47
@architkulkarni (Contributor) commented:

https://github.com/ray-project/kuberay/actions/runs/4879883408/jobs/8718473999?pr=926#step:5:62
@mcariatm It looks like there were some autogenerated changes to register.go in your PR; do you mind reverting them? Sorry about this.

@architkulkarni (Contributor) left a comment:

A basic question: what's the purpose of passing context to all the HTTP requests? Is that a general best practice, or does it fix something specific, separately from increasing the timeout?

Apart from that question, looks good to me pending the lint failure. Thanks for the contribution!

@trasc commented May 4, 2023 via email

@mcariatm commented May 4, 2023

> A basic question: what's the purpose of passing context to all the HTTP requests? Is that a general best practice, or does it fix something specific, separately from increasing the timeout?
>
> Apart from that question, looks good to me pending the lint failure. Thanks for the contribution!

Can you try again, please?

@mcariatm commented May 5, 2023

@kevin85421 can you have a look?

@kevin85421 kevin85421 self-assigned this May 8, 2023
UpdateDeployments(specs rayv1alpha1.ServeDeploymentGraphSpec) error
GetDeploymentsStatus() (*ServeDeploymentStatuses, error)
GetDeployments(context.Context) (string, error)
UpdateDeployments(ctx context.Context, specs rayv1alpha1.ServeDeploymentGraphSpec) error
Contributor:

Suggested change:
-	UpdateDeployments(ctx context.Context, specs rayv1alpha1.ServeDeploymentGraphSpec) error
+	UpdateDeployments(ctx context.Context, spec rayv1alpha1.ServeDeploymentGraphSpec) error

Just one merge conflict here.

@architkulkarni (Contributor) commented:

I've tested this manually and discussed offline with @kevin85421, and we are ready to merge this. @oginskis, if you can address the merge conflict, I will merge the PR.

@mcariatm commented:

> I've tested this manually and discussed offline with @kevin85421, and we are ready to merge this. @oginskis, if you can address the merge conflict, I will merge the PR.

I rebased it. Please rerun the tests.
If you have any questions, please contact me directly.

@architkulkarni merged commit 9bc5d85 into ray-project:master on May 16, 2023 (18 of 19 checks passed)
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023