[Feature] Support suspend in RayJob #926

Merged
8 commits merged into ray-project:master from the kueue-integration branch on May 16, 2023

Conversation

@oginskis (Contributor) commented Feb 23, 2023

Why are these changes needed?

Native Kubernetes Jobs have a suspend flag that allows a Job's execution to be temporarily suspended and resumed later, or allows Jobs to be started in a suspended state so that a custom controller, such as Kueue, can decide later when to start them.

This change adds the flag to the RayJob spec for consistency. Moreover, frameworks such as Kubeflow are adding it as well, so it is becoming standard functionality. An example implementation for MPIJob: kubeflow/mpi-operator#511

Implementation details

  • If a RayJob is created with spec.suspend == true, then the RayCluster instance (with its corresponding Kubernetes resources) is not created and the Ray job is not submitted to the cluster. The JobDeploymentStatus is set to Suspended and the corresponding event is issued. The RayJob remains in this state until the job is unsuspended.

  • If suspend flips from true to false, then the RayJob controller immediately creates a RayCluster instance and submits the job.

  • If suspend flips from false to true while the job is running, then the RayJob controller tries to gracefully stop the job and deletes the RayCluster instance (with its underlying Kubernetes resources). The JobDeploymentStatus is set to Suspended, JobStatus is set to STOPPED, and the corresponding event is issued (see the sketch below).

Edge case: the suspend flag is ignored if a RayJob is submitted against an existing RayCluster instance (matched with ClusterSelector), since we can't delete a RayCluster created by somebody else.
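For illustration, a minimal sketch of the suspend/resume decision. It reuses identifiers that appear in this PR (Spec.Suspend, JobDeploymentStatusSuspended), but the module path and the helper itself are assumptions, not the actual rayjob_controller.go code:

package main

import (
	"fmt"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// suspendAction is a hypothetical helper that maps the spec/status combination
// to the actions described in the bullet points above.
func suspendAction(rayJob *rayv1alpha1.RayJob) string {
	suspended := rayJob.Status.JobDeploymentStatus == rayv1alpha1.JobDeploymentStatusSuspended
	switch {
	case rayJob.Spec.Suspend && !suspended:
		// Stop the running job (if any), delete the RayCluster, and mark the RayJob as Suspended.
		return "suspend"
	case !rayJob.Spec.Suspend && suspended:
		// Re-create the RayCluster, submit the job again, and reset startTime/endTime.
		return "resume"
	default:
		// No suspend-related work to do.
		return "none"
	}
}

func main() {
	job := &rayv1alpha1.RayJob{}
	job.Spec.Suspend = true
	fmt.Println(suspendAction(job)) // a freshly created RayJob with suspend: true -> "suspend"
}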

No Kueue-specific code leaks into the KubeRay implementation.

Contributors from Kueue/Kubernetes cc'ed:

@alculquicondor
@mwielgus

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Jeffwan (Collaborator) commented Feb 23, 2023

Thanks for the feature request. We will have a look and come back to you soon

@oginskis (Contributor, Author) commented Feb 28, 2023

Hello @Jeffwan
Thank you!
Can I do anything to speed up the review?

@alculquicondor commented Feb 28, 2023

I quickly looked through https://github.com/ray-project/community and I didn't see any recurring meeting.

Is there any venue where we can meet if you have questions about why this is important? @Jeffwan

@architkulkarni (Contributor) commented:

> I quickly looked through https://github.com/ray-project/community and I didn't see any recurring meeting.
>
> Is there any venue where we can meet if you have questions about why this is important? @Jeffwan

Thanks for bringing this up -- we do have a biweekly Kuberay sync, @gvspraveen can check on adding this to the linked community page.

In this case, we discussed internally and don't have any questions, no need to present anything at the sync. Because suspend is a standard feature for native Kubernetes jobs, it makes sense for this to be a part of KubeRay. We'll review the PR as soon as we can!

@gvspraveen (Contributor) commented:

> Thanks for bringing this up -- we do have a biweekly Kuberay sync, @gvspraveen can check on adding this to the linked community page.

That community page seems to be for all of ray-project, so I'm not sure adding meeting links is appropriate there. But there is a Slack group linked in the Ray docs; you can reach out to us there and we can add you to the meeting.

@oginskis (Contributor, Author) commented Mar 1, 2023

@architkulkarni Could you re-run the workflows? The goimports issue is fixed.

@@ -61,6 +62,8 @@ type RayJobSpec struct {
RayClusterSpec *RayClusterSpec `json:"rayClusterSpec,omitempty"`
// clusterSelector is used to select running rayclusters by labels
ClusterSelector map[string]string `json:"clusterSelector,omitempty"`
// suspend specifies whether the RayJob controller should create a RayCluster instance
Suspend bool `json:"suspend,omitempty"`


Worth explaining what happens with the RayCluster on transitions from false to true and true to false.
Does it affect .status.startTime?

@mcariatm replied on Mar 6, 2023:

Done, .status.startTime is now updated on each new start.

@oginskis (Contributor, Author) replied:

@mcariatm Could you also reset .status.endTime when a job is resumed (transitions from suspend true -> false)?

> @mcariatm Could you also reset .status.endTime when a job is resumed (transitions from suspend true -> false)?

Done, line 400 in rayjob_controller.go.

@@ -272,6 +326,11 @@ func isJobPendingOrRunning(status rayv1alpha1.JobStatus) bool {
return (status == rayv1alpha1.JobStatusPending) || (status == rayv1alpha1.JobStatusRunning)
}

// isSuspendFlagSet indicates whether the job has a suspended flag set.
func isSuspendFlagSet(job *rayv1alpha1.RayJob) bool {
	return job.Spec.Suspend
}


This one-liner doesn't add value; I would just inline job.Spec.Suspend where necessary.


Done.

@mcariatm commented Mar 6, 2023

@architkulkarni, @Jeffwan I found an issue. When a RayJob is deleted and then applied again after a short time, we get this error:

Failed to start Job Supervisor actor: The name _ray_internal_job_actor_rayjob-sample-fl6rc
    (namespace=SUPERVISOR_ACTOR_RAY_NAMESPACE) is already taken. Please use a different
    name or get the existing actor using ray.get_actor(''_ray_internal_job_actor_rayjob-sample-fl6rc'',
    namespace=''SUPERVISOR_ACTOR_RAY_NAMESPACE'').'
  rayClusterName: rayjob-sample-raycluster-nwxd6

As far as I can tell, this happens when the old RayCluster is still in terminating status, and the job remains in FAILED status with this error. If the jobs are orchestrated by Kueue, this problem may occur more and more often.
What do you think?
Should we fix it in this PR? Should we wait for the cluster to finish terminating?

@oginskis (Contributor, Author) commented Mar 6, 2023

@mcariatm
I believe the easiest way to address this race condition is to reset rayJob.Status.RayClusterName every time we delete the cluster (when suspend goes from false to true), along with resetting rayJob.Status.RayClusterStatus here: https://github.com/epam/kuberay/blob/kueue-integration/ray-operator/controllers/ray/rayjob_controller.go#L256

Currently, we reuse the old cluster ID when the RayCluster instance is re-created, so if the suspend false -> true -> false transitions happen too quickly, the job controller could connect to the old, still-terminating instance and try to submit the job against it, as you explained.

Also, consider resetting rayJob.Status.DashboardURL, rayJob.Status.JobId, and maybe rayJob.Status.Message.

@alculquicondor commented:
What happens with the RayCluster object when you call Delete on it? Does it immediately disappear or is it held by a finalizer until it's actually cleaned up?

If the object persists, I think it's more appropriate to wait for it to finish before creating the new RayCluster. If not, what @oginskis suggests sounds like a good option.

@mcariatm commented Mar 7, 2023

After a lot of testing I realized that the problem does not actually depend on clusters that are in terminating status.
When the RayCluster is created, a request to the dashboard is made with a hard-coded 2-second timeout, and on low-resource clusters the dashboard fails to respond within 2 seconds.
I added the context-based fix to this PR because we depend on it.
I also decided to reset ClusterName, DashboardUrl, JobId, and Message in the status.
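A minimal sketch of the context-based request pattern, assuming a plain net/http client and a placeholder dashboard URL (the real change lives in dashboard_httpclient and differs in its details):

package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchWithContext bounds the request lifetime by the caller's ctx instead of a
// hard-coded http.Client timeout, so the reconciler decides how long to wait.
func fetchWithContext(ctx context.Context, client *http.Client, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// The caller controls the deadline, e.g. 20 seconds instead of the old 2.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	out, err := fetchWithContext(ctx, &http.Client{}, "http://raycluster-head-svc:8265/api/jobs/")
	fmt.Println(out, err)
}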

@@ -160,14 +160,14 @@ func FetchDashboardURL(ctx context.Context, log *logr.Logger, cli client.Client,

 func (r *RayDashboardClient) InitClient(url string) {
 	r.client = http.Client{
-		Timeout: 2 * time.Second,
+		Timeout: 20 * time.Second,
@oginskis (Contributor, Author) commented:

Maybe make this timeout configurable?


maybe not in this PR... 😆

@kevin85421 (Member) commented:

Thanks @oginskis @mcariatm for this contribution!

I will start reviewing this PR this week. By the way, there seem to be some updates in dashboard_httpclient. It is highly possible that KubeRay will redesign RayJob in April. This means we will likely remove the Dashboard HTTP client, because submitting a job to a RayCluster via the HTTP client is not an idempotent operation; it sometimes causes multiple job creations within a single RayJob. Please see #756 for more context.

If this PR heavily relies on that part, it may not be merged in this release (v0.5.0 is planned for the end of March or early April). The KubeRay team will do our best to get this PR merged in v0.6.0.

rayJobInstance.Status.DashboardURL = ""
rayJobInstance.Status.JobId = ""
rayJobInstance.Status.Message = ""
err = r.updateState(ctx, rayJobInstance, jobInfo, rayv1alpha1.JobStatusStopped, rayv1alpha1.JobDeploymentStatusSuspended, nil)


Why is this in a separate API call? Can it be combined with the call on line 246?

Contributor:

Seconding this question


It looks like this was addressed.

@denkensk (Contributor) commented Mar 9, 2023

/cc

@alculquicondor commented:
/retitle [Feature] Support suspend in RayJob

@tenzen-y commented Mar 9, 2023

/cc

@kevin85421 (Member) left a comment:

I took a pass over this PR. Thanks for this contribution and the detailed PR description! As I mentioned above, it is highly possible that KubeRay will redesign RayJob in April (after v0.5.0) and remove the Dashboard HTTP client. We may either use a head Pod command or a separate Kubernetes Job to submit the job to the RayCluster. Again, the KubeRay team will do our best to get this PR merged in v0.6.0.

I'm not familiar with the Kubernetes built-in Job and am curious about the use case for suspend. Could you please explain it to me? I ask because when suspend is set to true while the RayJob is running, the RayCluster is deleted and the job must be rerun entirely rather than resuming from its previous progress. What's the difference compared with submitting a new RayJob? Thanks!

@alculquicondor commented Mar 13, 2023

The documentation for Job is here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job

suspend is arguably a bit of a misnomer. You can think of it as "preemption", but we chose suspend for consistency with existing fields in Deployment and CronJob.

The idea is that there is a higher-level controller that can determine that the cluster is full and keep suspend=true until there is space in the cluster (aka queueing), or even set it for a job that is already running, in order to make space for a higher-priority job (aka preemption).
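For example, a higher-level controller could toggle the field roughly like this (a minimal client-go sketch; the Job name, namespace, and in-cluster config are placeholders):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/pointer"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Fetch a queued Job that was created with suspend: true.
	job, err := clientset.BatchV1().Jobs("default").Get(context.TODO(), "example-job", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Unsuspend it once there is capacity (queueing); setting it back to true
	// while the Job runs stops its pods to make room for higher-priority work (preemption).
	job.Spec.Suspend = pointer.Bool(false)
	if _, err := clientset.BatchV1().Jobs("default").Update(context.TODO(), job, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("unsuspended Job", job.Name)
}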

@alculquicondor commented:
That said, if a particular application or framework implements checkpointing, the job could resume where it was suspended.
Is there an alternative way of implementing suspend in RayJob that would give us these semantics? Keep in mind that the key requirement for suspend is that, when set, all running pods should stop (respecting the grace period).

@oginskis (Contributor, Author) commented:
/retitle [Feature] Support suspend in RayJob

@oginskis changed the title from "[Feature] Support Kueue.sh for RayJob admission" to "[Feature] Support suspend in RayJob" on Mar 13, 2023
@alculquicondor left a comment:

LGTM from my side

return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
if !rayv1alpha1.IsJobTerminal(info.JobStatus) {
err := rayDashboardClient.StopJob(ctx, rayJobInstance.Status.JobId, &r.Log)


it looks like this was addressed

rayJobInstance.Status.DashboardURL = ""
rayJobInstance.Status.JobId = ""
rayJobInstance.Status.Message = ""
err = r.updateState(ctx, rayJobInstance, jobInfo, rayv1alpha1.JobStatusStopped, rayv1alpha1.JobDeploymentStatusSuspended, nil)


It looks like this was addressed.

@mcariatm commented May 3, 2023

@architkulkarni please rerun the linter and the other checks.

@@ -6,7 +6,7 @@
 package v1alpha1

 import (
-	v1 "k8s.io/api/core/v1"
+	"k8s.io/api/core/v1"
Contributor:

Suggested change:
-	"k8s.io/api/core/v1"
+	v1 "k8s.io/api/core/v1"

Linter complains about this change for some reason... https://github.com/ray-project/kuberay/actions/runs/4876658441/jobs/8702043318?pr=926#step:11:19

I think it happens when running make test or possibly other commands. I should try to get to the bottom of it.


please rerun

@mcariatm force-pushed the kueue-integration branch 2 times, most recently from d810ed9 to 3371dee, on May 4, 2023 06:47
@architkulkarni (Contributor) commented:

https://github.com/ray-project/kuberay/actions/runs/4879883408/jobs/8718473999?pr=926#step:5:62
@mcariatm It looks like there were some autogenerated changes to register.go in your PR; do you mind reverting them? Sorry about this.

@architkulkarni (Contributor) left a comment:

A basic question: what's the purpose of passing context to all the HTTP requests? Is that a general best practice, or does it fix something specific, separately from increasing the timeout?

Apart from that question, looks good to me pending the lint failure. Thanks for the contribution!

@trasc commented May 4, 2023 via email

@mcariatm commented May 4, 2023

> A basic question: what's the purpose of passing context to all the HTTP requests? Is that a general best practice, or does it fix something specific, separately from increasing the timeout?
>
> Apart from that question, looks good to me pending the lint failure. Thanks for the contribution!

Can you try again, please?

@mcariatm commented May 5, 2023

@kevin85421 can you have a look?

@kevin85421 kevin85421 self-assigned this May 8, 2023
UpdateDeployments(specs rayv1alpha1.ServeDeploymentGraphSpec) error
GetDeploymentsStatus() (*ServeDeploymentStatuses, error)
GetDeployments(context.Context) (string, error)
UpdateDeployments(ctx context.Context, specs rayv1alpha1.ServeDeploymentGraphSpec) error
Contributor:

Suggested change:
-	UpdateDeployments(ctx context.Context, specs rayv1alpha1.ServeDeploymentGraphSpec) error
+	UpdateDeployments(ctx context.Context, spec rayv1alpha1.ServeDeploymentGraphSpec) error

Just one merge conflict here.

@architkulkarni (Contributor) commented:

I've tested this manually and discussed offline with @kevin85421, and we are ready to merge this. @oginskis, if you can address the merge conflict, I will merge the PR.

@mcariatm commented:

> I've tested this manually and discussed offline with @kevin85421, and we are ready to merge this. @oginskis, if you can address the merge conflict, I will merge the PR.

I rebased it. Please rerun the tests.
If you have any questions, please contact me directly.

@architkulkarni merged commit 9bc5d85 into ray-project:master on May 16, 2023 (18 of 19 checks passed)
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023