
Support suspension of RayClusters #1711

Merged: 1 commit, Dec 12, 2023

Conversation

@andrewsykim (Contributor) commented Dec 5, 2023

Why are these changes needed?

Support suspending RayClusters. See #1667 for more details on the use-case.

Related issue number

Closes #1667

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim (Contributor Author)

Some manual testing I've done so far:

I have a local kind cluster with a locally built kuberay-operator and a Ray cluster:

$ kubectl get po 
NAME                                           READY   STATUS    RESTARTS   AGE
kuberay-operator-5987588ffc-g7brs              1/1     Running   0          154m
ray-cluster-kuberay-head-8nmdb                 1/1     Running   0          152m
ray-cluster-kuberay-worker-workergroup-rnt9v   1/1     Running   0          152m

If I run kubectl edit raycluster ray-cluster-kuberay and set suspend: true, I can see all the Ray pods terminating:

$ kubectl get po 
NAME                                           READY   STATUS        RESTARTS   AGE
kuberay-operator-5987588ffc-g7brs              1/1     Running       0          155m
ray-cluster-kuberay-head-8nmdb                 1/1     Terminating   0          153m
ray-cluster-kuberay-worker-workergroup-rnt9v   1/1     Terminating   0          153m

If I remove the suspend field, pods are created again:

$ kubectl get po 
NAME                                           READY   STATUS        RESTARTS   AGE
kuberay-operator-5987588ffc-g7brs              1/1     Running       0          157m
ray-cluster-kuberay-head-p4xpn                 1/1     Running       0          9s
ray-cluster-kuberay-worker-workergroup-hn86f   0/1     Init:0/1      0          9s

@andrewsykim (Contributor Author)

Will add unit / e2e tests shortly

@astefanutti (Contributor) left a comment

Thanks, I've left a couple of general comments, otherwise this looks good! I'll get back to you if I have any valuable feedback when I get a chance to test it.


// if RayCluster is suspended, delete all head pods
if instance.Spec.Suspend != nil && *instance.Spec.Suspend {
	for _, headPod := range headPods.Items {
		if err := r.Delete(ctx, &headPod); err != nil {
Contributor:

Is it possible to perform a delete collection operation, e.g. with DeleteAllOf, instead of multiple individual delete calls?

Contributor Author:

Yes, great suggestion :)

			return err
		}

	for _, workerPod := range workerPods.Items {
Contributor:

Ditto: could it be possible to perform a single delete collection operation, e.g. with DeleteAllOf, instead of multiple individual delete calls?
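
For reference, a DeleteCollection-style call with the controller-runtime client could look roughly like the hypothetical helper below; this is a sketch, not the code the PR ends up with, and the ray.io/cluster label is the one KubeRay applies to the Pods it manages:

package helpers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteClusterPods is a hypothetical helper: it removes every Pod
// belonging to one RayCluster with a single DeleteCollection request,
// instead of issuing one Delete call per Pod.
func deleteClusterPods(ctx context.Context, c client.Client, namespace, clusterName string) error {
	return c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName})
}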

r.Log.Info("reconcilePods", "head Pod", headPod.Name, "shouldDelete", shouldDelete, "reason", reason)
if shouldDelete {

// if RayCluster is suspended, delete all head pods
Contributor:

It seems some logic is common to the standard cluster deletion. Maybe there is an opportunity to factorise some of it and avoid duplication / discrepancies?

Contributor Author:

It seems some logic is common to the standard cluster deletion.

I might have missed it, but I didn't see any deletion logic for the cluster; my assumption was that this is handled via OwnerReferences and Kubernetes garbage collection.
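
As background, here is a minimal sketch of that mechanism, following controller-runtime conventions (the pod variable and the r.Scheme field are illustrative assumptions): a Pod created with a controller reference back to the RayCluster is garbage-collected by Kubernetes when the RayCluster itself is deleted.

// Sketch: tie a Pod's lifetime to its RayCluster so that Kubernetes
// garbage collection deletes the Pod when the cluster CR is deleted.
if err := controllerutil.SetControllerReference(instance, &pod, r.Scheme); err != nil {
	return err
}
if err := r.Create(ctx, &pod); err != nil {
	return err
}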

Contributor:

It could have been, but I don't think that's the case. I'm referring to this code block in particular:

r.Log.Info(
	fmt.Sprintf("The RayCluster with GCS enabled, %s, is being deleted. Start to handle the Redis cleanup finalizer %s.",
		instance.Name, common.GCSFaultToleranceRedisCleanupFinalizer),
	"DeletionTimestamp", instance.ObjectMeta.DeletionTimestamp)
// Delete the head Pod if it exists.
headPods := corev1.PodList{}
filterLabels := client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeTypeLabelKey: string(rayv1.HeadNode)}
if err := r.List(ctx, &headPods, client.InNamespace(instance.Namespace), filterLabels); err != nil {
	return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err
}
for _, headPod := range headPods.Items {
	if !headPod.DeletionTimestamp.IsZero() {
		r.Log.Info(fmt.Sprintf("The head Pod %s is already being deleted. Skip deleting this Pod.", headPod.Name))
		continue
	}
	r.Log.Info(fmt.Sprintf(
		"Delete the head Pod %s before the Redis cleanup. "+
			"The storage namespace %s in Redis cannot be fully deleted if the GCS process on the head Pod is still writing to it.",
		headPod.Name, headPod.Annotations[common.RayExternalStorageNSAnnotationKey]))
	if err := r.Delete(ctx, &headPod); err != nil {
		return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err
	}
}
// Delete all worker Pods if they exist.
for _, workerGroup := range instance.Spec.WorkerGroupSpecs {
	workerPods := corev1.PodList{}
	filterLabels = client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeGroupLabelKey: workerGroup.GroupName}
	if err := r.List(ctx, &workerPods, client.InNamespace(instance.Namespace), filterLabels); err != nil {
		return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err
	}
	for _, workerPod := range workerPods.Items {
		if !workerPod.DeletionTimestamp.IsZero() {
			r.Log.Info(fmt.Sprintf("The worker Pod %s is already being deleted. Skip deleting this Pod.", workerPod.Name))
			continue
		}
		r.Log.Info(fmt.Sprintf(
			"Delete the worker Pod %s. This step isn't necessary for initiating the Redis cleanup process.", workerPod.Name))
		if err := r.Delete(ctx, &workerPod); err != nil {
			return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err
		}
	}
}
// If the number of head Pods is not 0, wait for it to be terminated before initiating the Redis cleanup process.
if len(headPods.Items) != 0 {
	r.Log.Info(fmt.Sprintf(
		"Wait for the head Pod %s to be terminated before initiating the Redis cleanup process. "+
			"The storage namespace %s in Redis cannot be fully deleted if the GCS process on the head Pod is still writing to it.",
		headPods.Items[0].Name, headPods.Items[0].Annotations[common.RayExternalStorageNSAnnotationKey]))
	// Requeue after 10 seconds because it takes much longer than DefaultRequeueDuration (2 seconds) for the head Pod to be terminated.
	return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
}
// We can start the Redis cleanup process now because the head Pod has been terminated.

Contributor Author:

I'm going to hold off on updating this code for now, as I don't want to change the logic around the Redis cleanup job. But it might be good to revisit this.

In most delete cases, though, it seems like we defer to garbage collection:

if instance.DeletionTimestamp != nil && !instance.DeletionTimestamp.IsZero() {
	r.Log.Info("RayCluster is being deleted, just ignore", "cluster name", request.Name)
	return ctrl.Result{}, nil
}

Contributor:

Right, sounds good!

@andrewsykim force-pushed the suspend-rayclusters branch 2 times, most recently from 0fa4ace to 69c6061, on December 7, 2023 05:10
@andrewsykim changed the title from "[WIP] Support suspension of RayClusters" to "Support suspension of RayClusters" on Dec 7, 2023
@@ -599,6 +599,12 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
}
}

	// if RayCluster is suspended, delete all pods and skip reconcile
	if instance.Spec.Suspend != nil && *instance.Spec.Suspend {
		filterLabels = client.MatchingLabels{common.RayClusterLabelKey: instance.Name}
Contributor:

Should the cluster state be checked and the deletion skipped when it has already been suspended, to avoid unnecessary requests?

Contributor Author:

done
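
For illustration, such a guard could look roughly like the fragment below, assuming the cluster status tracks a suspended state (the instance.Status.State field and rayv1.Suspended constant are assumptions here, not verbatim from the PR):

// Hypothetical sketch: if the cluster status already reports suspended,
// skip the DeleteAllOf call to avoid redundant API requests on every
// reconcile loop.
if instance.Status.State == rayv1.Suspended {
	return nil
}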

// if RayCluster is suspended, delete all pods and skip reconcile
if instance.Spec.Suspend != nil && *instance.Spec.Suspend {
	filterLabels = client.MatchingLabels{common.RayClusterLabelKey: instance.Name}
	return r.DeleteAllOf(ctx, &corev1.Pod{}, client.InNamespace(instance.Namespace), filterLabels)
Contributor:

Would emitting an event be useful for tracking suspensions? Or maybe a condition would be better suited?
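
If an event were chosen, a sketch using the standard client-go EventRecorder could look like this (the r.Recorder field and the "Suspended" reason string are assumptions for illustration, not part of this PR):

// Hypothetical sketch: record a normal event when the operator finishes
// deleting Pods for a suspended RayCluster. r.Recorder is assumed to be
// a record.EventRecorder wired in at manager setup time.
r.Recorder.Eventf(instance, corev1.EventTypeNormal, "Suspended",
	"Deleted Pods for suspended RayCluster %s/%s", instance.Namespace, instance.Name)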

@astefanutti (Contributor)

Will add unit / e2e tests shortly

@andrewsykim do you think e2e tests would be useful? I can help if needed.

	return nil
}

clusterLabel := client.MatchingLabels{common.RayClusterLabelKey: instance.Name}
Contributor Author:

@kevin85421 is it safe to use this cluster label to find pods to terminate for suspension? Are there pods we shouldn't delete using this label if a RayCluster is suspended?

Alternatively, we can call DeleteCollection specifically for the head node and worker group nodes, like this:

headPodLabels := client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeTypeLabelKey: string(rayv1.HeadNode)}
if err := r.DeleteAllOf(ctx, &corev1.Pod{}, client.InNamespace(instance.Namespace), headPodLabels); err != nil {
	return err
}

for _, worker := range instance.Spec.WorkerGroupSpecs {
	workerGroupLabels := client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeGroupLabelKey: worker.GroupName}
	if err := r.DeleteAllOf(ctx, &corev1.Pod{}, client.InNamespace(instance.Namespace), workerGroupLabels); err != nil {
		return err
	}
}

Member:

is it safe to use this cluster label to find pods to terminate for suspension? Are there pods we shouldn't delete using this label if a RayCluster is suspended?

Yes, I think it is safe.

Alternatively we can specifically call DeleteCollection for head node and worker group nodes like this

SGTM

@andrewsykim (Contributor Author)

@andrewsykim do you think e2e tests would be useful? I can help if needed.

@astefanutti yes, please feel free to open a PR against mine or a follow-up PR after this one.

@andrewsykim (Contributor Author)

@kevin85421 I think this is ready for review, please take a look

@astefanutti (Contributor)

@andrewsykim do you think e2e tests would be useful? I can help if needed.

@astefanutti yes, please feel free to open a PR against mine or a follow-up PR after this one.

@andrewsykim perfect, will do once yours is merged.

@astefanutti (Contributor) left a comment

LGTM!

@kevin85421 (Member) left a comment

Some questions:

  1. Currently, we only terminate Pods. Should we also delete other Kubernetes resources like Services, Ingresses, Roles, etc.?

  2. Do we need to keep reconciling a suspended CR?

@andrewsykim (Contributor Author)

Currently, we only terminate Pods. Should we also delete other Kubernetes resources like Services, Ingresses, Roles, etc.?

Personally, I don't think this is necessary because the runtime cost of these is negligible.

@kevin85421 (Member) left a comment

Personally, I don't think this is necessary because the runtime cost of these is negligible.

  • An edge case that I can come up with is the LoadBalancer service. If we don't delete the service, will the LoadBalancer outside of the Kubernetes cluster be deleted?

Do we need to keep reconciling a suspended CR?

  • Can we only reconcile it when suspend changes back to false?

@kevin85421 (Member)

@andrewsykim Would you mind taking a look at the CI errors? Thanks!

@andrewsykim (Contributor Author)

@kevin85421 the CI errors don't seem related to any of my changes:

Error: Failed to run: Error: working-directory (./experimental) was not a path, Error: working-directory (./experimental) was not a path
    at /home/runner/work/_actions/golangci/golangci-lint-action/v2/dist/run/index.js:6851:23
    at Generator.next (<anonymous>)
    at /home/runner/work/_actions/golangci/golangci-lint-action/v2/dist/run/index.js:6713:71
    at new Promise (<anonymous>)
    at module.exports.__awaiter (/home/runner/work/_actions/golangci/golangci-lint-action/v2/dist/run/index.js:6709:12)
    at runLint (/home/runner/work/_actions/golangci/golangci-lint-action/v2/dist/run/index.js:6816:12)
    at /home/runner/work/_actions/golangci/golangci-lint-action/v2/dist/run/index.js:6885:57
    at Object.<anonymous> (/home/runner/work/_actions/golangci/golangci-lint-action/v2/dist/run/index.js:42546:28)
    at Generator.next (<anonymous>)
    at /home/runner/work/_actions/golangci/golangci-lint-action/v2/dist/run/index.js:42349:71
Error: working-directory (./experimental) was not a path

Any ideas? It seems to fail pretty consistently.

@kevin85421 (Member)

Try rebasing on the master branch.

@andrewsykim (Contributor Author)

An edge case that I can come up with is the LoadBalancer service. If we don't delete the service, will the LoadBalancer outside of the Kubernetes cluster be deleted?

Those LBs would not be deleted if a RayCluster is suspended, and for now I think this is the correct approach. Deleting the LB on suspension would mean the LB is provisioned with a new DNS name or IP when the Ray cluster is resumed. Some users may expect the endpoint of the LB to remain consistent even if they temporarily spin down all the pods.

@andrewsykim (Contributor Author)

Can we only reconcile it when suspend changes back to false?

Originally I didn't add this because of deletion cases, since we still want to allow deletion of suspended clusters, right? But I think we can just add this check after the deletion checks. I updated the PR with a top-level check in rayClusterReconcile after the deletion check.

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@kevin85421 (Member)

Some users may expect the endpoint of the LB to remain consistent even if they temporarily spin down all the pods.

Got it. This makes sense.

Originally I didn't add this because of deletion cases, since we still want to allow deletion of suspended clusters, right?

I don't quite get the point. What's the relationship between skipping reconciliation and the deletion of a suspended RayCluster CR?

@andrewsykim (Contributor Author) commented Dec 12, 2023

I don't quite get the point. What's the relationship between skipping reconciliation and the deletion of a suspended RayCluster CR?

Sorry for the confusion. I just meant that deleting a suspended cluster should still be possible, which is why I didn't skip reconciliation for suspended clusters initially. But I realized that deletion is handled by garbage collection via OwnerReferences, so I added a check to skip reconciliation if the cluster is suspended. See https://github.com/ray-project/kuberay/pull/1711/files#diff-72ecc3ca405f1e828187748d4f1ec8160bccffa2a4f84a364cd7a94a78e1adb9R315-R318
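
For readers without the diff open, the guard described above could look roughly like this (a paraphrase under the assumption that the status is only marked suspended after the Pods are gone; the rayv1.Suspended constant is an assumption, and the linked diff has the exact code):

// Sketch: the deletion check runs first, so a suspended RayCluster can
// still be deleted (its Pods are garbage-collected via OwnerReferences).
// Reconciliation is only skipped once the suspension has completed,
// i.e. the Pods are gone and the status reflects the suspended state;
// otherwise reconcilePods still needs to run to delete the Pods.
if instance.Spec.Suspend != nil && *instance.Spec.Suspend && instance.Status.State == rayv1.Suspended {
	return ctrl.Result{}, nil
}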

@kevin85421 (Member) left a comment

Thanks!

@kevin85421 (Member)

I manually tested it and found a bug. I will open a follow-up PR to fix it.

@kevin85421 merged commit 86abaab into ray-project:master on Dec 12, 2023
25 checks passed
@andrewsykim (Contributor Author)

Thanks @kevin85421
