[Bug][Autoscaler] Operator does not remove workers #1139
Conversation
```go
r.updateLocalWorkersToDelete(&worker, runningPods.Items)
```
This function will update `WorkersToDelete` by filtering out Pods that are not included in `runningPods`.

cc @qizzzh
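For readers following along, here is a minimal, self-contained sketch of the filtering behavior described above; the stand-in types and the map-based body are illustrative, not the operator's exact code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Stand-ins for the relevant parts of the RayCluster CRD types.
type ScaleStrategy struct{ WorkersToDelete []string }
type WorkerGroupSpec struct{ ScaleStrategy ScaleStrategy }

// updateLocalWorkersToDelete (sketch): keep only the entries of
// WorkersToDelete that correspond to a Pod in runningPods. A worker in
// the PodFailed phase is absent from runningPods, so it silently drops
// out of the deletion list and is never deleted.
func updateLocalWorkersToDelete(worker *WorkerGroupSpec, runningPods []corev1.Pod) {
	running := make(map[string]struct{}, len(runningPods))
	for _, pod := range runningPods {
		running[pod.Name] = struct{}{}
	}
	filtered := []string{}
	for _, name := range worker.ScaleStrategy.WorkersToDelete {
		if _, ok := running[name]; ok {
			filtered = append(filtered, name)
		}
	}
	worker.ScaleStrategy.WorkersToDelete = filtered
}

func main() {
	worker := &WorkerGroupSpec{ScaleStrategy: ScaleStrategy{WorkersToDelete: []string{"ok-pod", "failed-pod"}}}
	runningPods := []corev1.Pod{{ObjectMeta: metav1.ObjectMeta{Name: "ok-pod"}}}
	updateLocalWorkersToDelete(worker, runningPods)
	fmt.Println(worker.ScaleStrategy.WorkersToDelete) // [ok-pod]: the failed Pod was dropped
}
```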
Does it mean that before this change is released we would need to manually delete failed pods in order for the autoscaler scale-down to work? Would scale-up be affected? I'm trying to understand how to work around the issue for now.
```go
// For example, the failed Pod (Status: corev1.PodFailed) is not counted in the `runningPods` variable.
// Therefore, we should not update `diff` when we delete a failed Pod.
if isPodRunningOrPendingAndNotDeleting(pod) {
	diff++
}
```
I am wondering whether it would be more straightforward to retrieve `runningPods` and calculate `diff` after `PrioritizeWorkersToDelete` has been processed. That way, we wouldn't need to modify `diff` at this stage.
I cannot understand the point you are trying to make; we still need to calculate the number of Pods in `WorkersToDelete` that overlap with `runningPods`. Would you mind providing more details? We can continue the discussion here or chat offline.
Synced with @Yicheng-Lu-llll offline:

[Yicheng's point]

This loop sends `Delete` requests to the Kubernetes API Server to delete the Pods in `WorkersToDelete`. Afterward, we could list the worker Pods again, so that we would not need to maintain `diff` within the loop:

kuberay/ray-operator/controllers/ray/raycluster_controller.go, lines 472 to 476 in c420135:

```go
workerPods := corev1.PodList{}
filterLabels = client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeGroupLabelKey: worker.GroupName}
if err := r.List(context.TODO(), &workerPods, client.InNamespace(instance.Namespace), filterLabels); err != nil {
	return err
}
```

[Conclusion]

We cannot guarantee that the informer cache will be consistent with the Kubernetes API Server when listing worker Pods. Therefore, it is still necessary to calculate `diff` within the loop.
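In code terms, here is a rough sketch of what calculating `diff` within the loop means; the function shape and names are illustrative, not the PR's exact code:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndTrackDiff issues the Delete calls for the Pods in
// WorkersToDelete and keeps diff (target replicas minus currently
// running/pending Pods) consistent as it goes. Re-listing afterwards is
// not safe: the controller-runtime client reads from an informer cache
// that may not yet have observed the deletions we just issued.
func deleteAndTrackDiff(ctx context.Context, c client.Client, diff int32, toDelete []corev1.Pod) (int32, error) {
	for i := range toDelete {
		pod := &toDelete[i]
		if err := c.Delete(ctx, pod); err != nil {
			return diff, err
		}
		// Only Pods that were counted in runningPods affect diff; a
		// PodFailed Pod was never counted, so deleting it changes nothing.
		if (pod.Status.Phase == corev1.PodRunning || pod.Status.Phase == corev1.PodPending) &&
			pod.ObjectMeta.DeletionTimestamp == nil {
			diff++
		}
	}
	return diff, nil
}
```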
```go
diff := workerReplicas - int32(len(runningPods.Items))

if PrioritizeWorkersToDelete {
```
Just for my education, are there any conditions under which we would need to set `PrioritizeWorkersToDelete` to `false`?

The reason I am asking is that without `PrioritizeWorkersToDelete`, if `diff >= 0`, the KubeRay operator will not respect the spec field set by the autoscaler (it will never delete the Pods in `WorkersToDelete`).
Perhaps I'm missing something, but for now, it seems to me:

- If the autoscaler is enabled, it is necessary to set `PrioritizeWorkersToDelete` to `true` to ensure that the autoscaler functions properly.
- If the autoscaler is disabled, then the user is the only one who can update the `WorkersToDelete` field. In this case, the user effectively acts as the autoscaler.

So, I cannot think of a situation where we would need to set `PrioritizeWorkersToDelete` to `false`, and I somewhat doubt the correctness of running with `PrioritizeWorkersToDelete` set to `false`.
Consider a situation with `PrioritizeWorkersToDelete` set to `false` and with the autoscaler disabled:

- At T0, we have 8 running workers.
- At T1, the user decides to remove worker A, so they set `replicas = replicas - 1` and add worker A to `WorkersToDelete` in the YAML file.
- At the same time T1, workers B and C fail because of OOM (let's assume the restart policy is `Never`).
- At T2, the KubeRay operator collects all running workers and compares the count with the expected replicas. It finds that the number of currently running Pods is 8 - 2 = 6, while the expected replicas is 8 - 1 = 7. So the KubeRay operator will add a new worker and never delete worker A. (A toy model of this follows below.)
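Here is the scenario above as a runnable toy model; `reconcile` is a deliberately simplified caricature of the operator's branch logic, and it assumes every Pod in `WorkersToDelete` is still running:

```go
package main

import "fmt"

// reconcile caricatures the operator's scale logic: without the
// PrioritizeWorkersToDelete gate, WorkersToDelete is never consulted
// when diff >= 0.
func reconcile(prioritizeWorkersToDelete bool, replicas, running int, workersToDelete []string) {
	diff := replicas - running
	if prioritizeWorkersToDelete {
		fmt.Println("deleting first:", workersToDelete)
		diff += len(workersToDelete) // assume all of them were running
	}
	if diff > 0 {
		fmt.Printf("creating %d new worker(s)\n", diff)
	}
}

func main() {
	// T2 in the scenario: expected replicas = 8-1 = 7, running = 8-2 = 6
	// (worker A is still running).
	reconcile(false, 7, 6, []string{"worker-A"})
	// => creates 1 new worker; worker A is never deleted.

	reconcile(true, 7, 6, []string{"worker-A"})
	// => deletes worker A first, then creates 2 workers.
}
```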
> Is there any conditions that we need to set PrioritizeWorkersToDelete to false?

`PrioritizeWorkersToDelete` is a feature gate (#208), and its default value is `true`. You can refer to the lessons we learned from #973: if a feature can be enabled or disabled by users, the cost of addressing an issue after the v0.5.0 release will not be that expensive. When the feature is stable enough, we can remove the gate.
```go
if (aPod.Status.Phase == corev1.PodRunning || aPod.Status.Phase == corev1.PodPending) && aPod.ObjectMeta.DeletionTimestamp == nil {
	runningPods.Items = append(runningPods.Items, aPod)
}

for _, pod := range workerPods.Items {
	// TODO (kevin85421): We also need to have a clear story of all the Pod status phases, especially for PodFailed.
```
Currently, we delete PodFailed pods if the reason is Eviction.
No particular story there -- someone complained that the evicted pods aren't deleted, so we did that.
Maybe we should just delete all failed pods?
> Currently, we delete PodFailed pods if the reason is Eviction.

```go
} else if headPod.Status.Phase == corev1.PodFailed && strings.Contains(headPod.Status.Reason, "Evicted") {
```
Based on the provided code snippet, it appears that only a head Pod with status `PodFailed` and reason `Evicted` will be deleted. The function `updateLocalWorkersToDelete` filters out the Pods with status `PodFailed`, so `PodFailed` workers are completely excluded from the deletion process. Am I missing anything?
> Maybe we should just delete all failed pods?

I am considering doing that. Because the change could have a significant impact on stability, either positive or negative, I will create a separate PR with more careful consideration at a later time. Some points that I need to figure out:

- With GCS fault tolerance enabled, KubeRay will label certain Pods as unhealthy if their probes generate unhealthy events, and it will then delete the Pods with the unhealthy label. This behavior may be refactored soon, and it overlaps a bit with removing `PodFailed` Pods.
- Some users prefer to retain certain unhealthy Pods, or Pods in a `PodFailed` state, for troubleshooting purposes (e.g. [Feature] Ensure the number of healthy workers while keep the abnormal worker for troubleshooting #1022).
Force-pushed from d7aa996 to e4dc603
/rebase
This looks good to me. Could you please add a bit to the PR description to describe the fix that's made in this PR?

(Before reading the code, based on the PR description I guessed that the PR would change the behavior to delete all Pods, including failed Pods, but what the PR actually does is correct the value of `diff` to fix the autoscaling bug.)

Update: discussed offline. Actually, `diff` was correct before, but after deleting the filtering function, the PR needs to account for that in `diff` to make it correct again.
@architkulkarni I have updated the PR description. Is it clear now? Thanks!

@kevin85421 it's clear, thanks for updating!

Created a follow-up issue: #1144
Why are these changes needed?

The function `updateLocalWorkersToDelete` updates `WorkersToDelete` by filtering out Pods that are not included in `runningPods`. Hence, Pods with the status `PodFailed` are never deleted.

In the autoscaler scale-down process, the autoscaler adds a failed Pod to `WorkersToDelete` and waits for all Pods in `WorkersToDelete` to be deleted before making the next decision. You can find the corresponding code here. However, the failed Pods will not be deleted by the operator because of `updateLocalWorkersToDelete`.

`non_terminated_nodes` in the Autoscaler (link)

Case 1: Without this PR + `PodFailed` Pod + Autoscaling
When the autoscaler adds a `PodFailed` Pod to `WorkersToDelete`, the Pod will remain stuck there until it is manually deleted by the user. This is because the function `updateLocalWorkersToDelete` ignores `PodFailed` Pods.

Case 2: Without this PR + `PodFailed` Pod + No autoscaling
The KubeRay operator maintains the number of worker Pods (specified by `Spec.WorkerGroupSpecs[*].Replicas`) that are in the `RUNNING` or `PENDING` status.

Case 3: With this PR + `PodFailed` Pod + Autoscaling
When the autoscaler adds a `PodFailed` Pod to `WorkersToDelete`, the KubeRay operator will delete all Pods in `WorkersToDelete`, regardless of their statuses.

Case 4: With this PR + `PodFailed` Pod + No autoscaling
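Below is a condensed sketch of the corrected scale-down flow; `isPodRunningOrPendingAndNotDeleting` mirrors the helper added in this PR, while the surrounding function is an illustrative simplification rather than the operator's actual code:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Mirrors the helper introduced by this PR.
func isPodRunningOrPendingAndNotDeleting(pod corev1.Pod) bool {
	return (pod.Status.Phase == corev1.PodRunning || pod.Status.Phase == corev1.PodPending) &&
		pod.ObjectMeta.DeletionTimestamp == nil
}

// reconcileWorkers condenses the corrected flow: count only
// running/pending Pods toward the current size, honor WorkersToDelete
// regardless of Pod phase, and adjust diff only for Pods that were
// counted. Returns the final diff (diff > 0: create Pods; diff < 0:
// delete extras).
func reconcileWorkers(ctx context.Context, c client.Client, workerReplicas int32,
	workerPods []corev1.Pod, workersToDelete []string) (int32, error) {
	running := int32(0)
	for _, pod := range workerPods {
		if isPodRunningOrPendingAndNotDeleting(pod) {
			running++
		}
	}
	diff := workerReplicas - running

	toDelete := make(map[string]bool, len(workersToDelete))
	for _, name := range workersToDelete {
		toDelete[name] = true
	}
	for i := range workerPods {
		pod := &workerPods[i]
		if !toDelete[pod.Name] {
			continue // not named in WorkersToDelete
		}
		// The fix: delete the Pod even if it is PodFailed.
		if err := c.Delete(ctx, pod); err != nil {
			return diff, err
		}
		// A failed Pod was never counted in `running`, so deleting it
		// must not change diff.
		if isPodRunningOrPendingAndNotDeleting(*pod) {
			diff++
		}
	}
	return diff, nil
}
```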
Related issue number
Closes #942
Checks
Reproduction