
[Bug][Autoscaler] Operator does not remove workers #1139

Merged

Conversation

@kevin85421 (Member) commented Jun 3, 2023

Why are these changes needed?

The function updateLocalWorkersToDelete updates WorkersToDelete by filtering out Pods that are not included in runningPods. As a result, Pods with the status PodFailed are never deleted.

In the autoscaler scale-down process, the autoscaler adds a failed Pod to WorkersToDelete and waits for all Pods in WorkersToDelete to be deleted before making the next decision. You can find the corresponding code here. However, the failed Pods will not be deleted by the operator because of updateLocalWorkersToDelete (a simplified sketch of this filtering follows the note below).

  • Note (optional): The definition of non_terminated_nodes in Autoscaler (link)
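
For illustration, the following is a minimal sketch (assumed names and signature, not the actual KubeRay source) of why this kind of filtering loses failed Pods: WorkersToDelete is intersected with the set of running or pending Pods, so a worker in the PodFailed phase silently drops out of the deletion list and never receives a Delete request.

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// filterWorkersToDelete is an illustrative stand-in for the filtering performed by
// updateLocalWorkersToDelete: it keeps only the entries of workersToDelete that
// also appear in runningPods.
func filterWorkersToDelete(workersToDelete []string, runningPods []corev1.Pod) []string {
	running := make(map[string]struct{}, len(runningPods))
	for _, pod := range runningPods {
		running[pod.Name] = struct{}{}
	}
	kept := make([]string, 0, len(workersToDelete))
	for _, name := range workersToDelete {
		// A Pod whose Status.Phase is corev1.PodFailed is not in runningPods,
		// so it never passes this check and stays in the cluster indefinitely.
		if _, ok := running[name]; ok {
			kept = append(kept, name)
		}
	}
	return kept
}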

Case 1: Without this PR + PodFailed Pod + Autoscaling

  • If the autoscaler adds a PodFailed Pod to WorkersToDelete, it will remain stuck there until the Pod is manually deleted by the user. This is because the function updateLocalWorkersToDelete ignores PodFailed Pods.

Case 2: Without this PR + PodFailed Pod + No autoscaling

  • The failed Pod will not be deleted. The KubeRay operator will attempt to maintain the cluster with the desired number of worker Pods (Spec.WorkerGroupSpecs[*].Replicas) that are in the RUNNING or PENDING status.

Case 3: With this PR + PodFailed Pod + Autoscaling

  • If the autoscaler adds a PodFailed Pod to WorkersToDelete, the KubeRay operator will delete all Pods in WorkersToDelete, regardless of their statuses.

Case 4: With this PR + PodFailed Pod + No autoscaling

Related issue number

Closes #942

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
kind create cluster --image=kindest/node:v1.23.0

# path: ray-operator/
make docker-image
kind load docker-image controller:latest

# Install a KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --set image.repository=controller,image.tag=latest

# Create an autoscaling RayCluster via a gist.
# https://gist.github.com/kevin85421/97b1241228add6ab04a30c7113479248
# Set `restartPolicy` to `Never`.
kubectl apply -f gist.yaml

# Log in to the worker Pod
kubectl exec -it ${WORKER_POD} -- bash

# Check the `ray start` process PID. In this example, the PID is 14.
ps aux
kill 14

# The worker Pod successfully becomes `PodFailed`.
kubectl get pods ${WORKER_POD} -o jsonpath='{.status.phase}'
# expected output: Failed

# The failed Pod should be deleted.

Reproduction

kind create cluster --image=kindest/node:v1.23.0
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0

# Create an autoscaling RayCluster via a gist.
# https://gist.github.com/kevin85421/97b1241228add6ab04a30c7113479248
# Set `restartPolicy` to `Never`.
kubectl apply -f gist.yaml

# Log in to the worker Pod
kubectl exec -it ${WORKER_POD} -- bash

# Check the `ray start` process PID. In this example, the PID is 14.
ps aux
kill 14

# The worker Pod successfully becomes `PodFailed`.
kubectl get pods ${WORKER_POD} -o jsonpath='{.status.phase}'
# expected output: Failed

# The failed Pod will not be deleted.
(Screenshots omitted: Screen Shot 2023-06-03 at 4 17 42 PM, Screen Shot 2023-06-03 at 4 23 25 PM)

r.updateLocalWorkersToDelete(&worker, runningPods.Items)

@kevin85421 (Member, Author) commented on the line above:

This function will update WorkersToDelete by filtering out Pods that are not included in runningPods.

@kevin85421 changed the title from "WIP" to "[Bug] Operator does not remove workers" on Jun 3, 2023
@kevin85421 kevin85421 marked this pull request as ready for review June 3, 2023 23:32

@kevin85421 (Member, Author) commented:

cc @qizzzh

@kevin85421 changed the title from "[Bug] Operator does not remove workers" to "[Bug][Autoscaler] Operator does not remove workers" on Jun 3, 2023

@kevin85421 (Member, Author) commented:

cc @Yicheng-Lu-llll

@qizzzh commented Jun 4, 2023:

Does it mean that before this change is released we would need to manually delete failed pods in order for the autoscaler scale-down to work? Would scale-up be affected? I'm trying to understand how to work around the issue for now.

diff++
// For example, the failed Pod (Status: corev1.PodFailed) is not counted in the `runningPods` variable.
// Therefore, we should not update `diff` when we delete a failed Pod.
if isPodRunningOrPendingAndNotDeleting(pod) {

A Contributor commented on the snippet above:

I am wondering whether it would be more straightforward to retrieve runningPods and calculate diff after PrioritizeWorkersToDelete has been processed. That way, we wouldn't need to modify diff at this stage.

@kevin85421 (Member, Author) replied:

I don't quite understand the point you are trying to make. We still need to calculate the number of Pods in WorkersToDelete that overlap with runningPods. Would you mind providing more details? We can continue the discussion here or chat offline.

@kevin85421 (Member, Author) replied:

Sync with @Yicheng-Lu-llll offline:

[Yicheng's point]

This loop sends Delete requests to the Kubernetes API Server to delete the Pods in WorkersToDelete. Afterward, we can list the worker Pods again so that we do not need to maintain the diff within the loop.

workerPods := corev1.PodList{}
filterLabels = client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeGroupLabelKey: worker.GroupName}
if err := r.List(context.TODO(), &workerPods, client.InNamespace(instance.Namespace), filterLabels); err != nil {
	return err
}

[Conclusion]

We cannot guarantee that the informer cache will be consistent with the Kubernetes API Server when listing worker Pods. Therefore, it is still necessary to calculate the diff within the loop.
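
To make the conclusion concrete, here is a simplified sketch (assumed function names and signature, not the exact reconcilePods code) of the approach this PR takes: every Pod named in WorkersToDelete is deleted regardless of its phase, and only the Pods that were counted in runningPods contribute back to diff.

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isPodRunningOrPendingAndNotDeleting mirrors the helper referenced in this PR.
func isPodRunningOrPendingAndNotDeleting(pod corev1.Pod) bool {
	return (pod.Status.Phase == corev1.PodRunning || pod.Status.Phase == corev1.PodPending) &&
		pod.ObjectMeta.DeletionTimestamp == nil
}

// deleteWorkersToDelete deletes every Pod named in workersToDelete, regardless of
// phase, and returns how many of the deleted Pods were running or pending, i.e.
// the amount the caller should add back to `diff`.
func deleteWorkersToDelete(ctx context.Context, c client.Client, workersToDelete []string, workerPods []corev1.Pod) (int32, error) {
	var deletedRunning int32
	for _, name := range workersToDelete {
		for i := range workerPods {
			pod := workerPods[i]
			if pod.Name != name {
				continue
			}
			// Delete the Pod no matter which phase it is in, including PodFailed.
			if err := c.Delete(ctx, &pod); err != nil && !apierrors.IsNotFound(err) {
				return deletedRunning, err
			}
			// A failed Pod was never part of runningPods, so it must not change diff.
			if isPodRunningOrPendingAndNotDeleting(pod) {
				deletedRunning++
			}
		}
	}
	return deletedRunning, nil
}

Keeping this bookkeeping inside the loop avoids a second List call against a possibly stale informer cache, which is exactly the concern in the conclusion above.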

diff := workerReplicas - int32(len(runningPods.Items))

if PrioritizeWorkersToDelete {

A Contributor commented on the snippet above:

Just for my education, are there any conditions under which we need to set PrioritizeWorkersToDelete to false?

The reason I am asking is that without PrioritizeWorkersToDelete, if diff >= 0, the KubeRay operator will not respect the spec field set by the autoscaler (in that case, the operator will never delete the Pods in WorkersToDelete).

A Contributor commented:

Perhaps I'm missing something, but for now it seems to me that:

  • If the autoscaler is enabled, it is necessary to set PrioritizeWorkersToDelete to true to ensure that the autoscaler functions properly.
  • If the autoscaler is disabled, the user is the only one who can update the WorkersToDelete field. In this case, the user effectively acts as the autoscaler.

So I cannot think of a situation where we would need to set PrioritizeWorkersToDelete to false, and I somewhat doubt the correctness of the behavior when it is false.

A Contributor commented:

Consider a situation with PrioritizeWorkersToDelete set to false and the autoscaler disabled:

  1. At T0, we have 8 running workers.
  2. At T1, the user decides to remove worker A, so they set replicas = replicas - 1 and add worker A to WorkersToDelete in the YAML file.
  3. At the same time T1, workers B and C fail because of OOM (assume the restart policy is Never).
  4. At T2, the KubeRay operator collects all running workers and compares the count with the expected replicas. It finds that the number of currently running Pods is 8 - 2 = 6, while the expected replicas are 8 - 1 = 7. So the operator adds a new worker and never deletes worker A.

@kevin85421 (Member, Author) replied:

Are there any conditions under which we need to set PrioritizeWorkersToDelete to false?

PrioritizeWorkersToDelete is a feature gate (#208), and its default value is true. You can refer to the lessons we learned from #973: when users can enable or disable a feature themselves, the cost of addressing an issue discovered after the v0.5.0 release is much lower. When the feature is stable enough, we can remove the gate.
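
For context, such a feature gate is usually just a boolean controller option wired to a command-line flag with a safe default; the snippet below is a generic sketch with assumed names, not KubeRay's actual flag wiring.

package main

import (
	"flag"
	"fmt"
)

// prioritizeWorkersToDelete guards whether Pods listed in WorkersToDelete are
// handled before the plain replica-count reconciliation. Defaulting to true
// lets users turn the behavior off quickly if a regression shows up after a
// release, without waiting for a patched image.
var prioritizeWorkersToDelete bool

func main() {
	flag.BoolVar(&prioritizeWorkersToDelete, "prioritize-workers-to-delete", true,
		"Delete Pods listed in WorkersToDelete before reconciling the replica count.")
	flag.Parse()

	if prioritizeWorkersToDelete {
		fmt.Println("WorkersToDelete will be processed before replica reconciliation")
	}
}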

if (aPod.Status.Phase == corev1.PodRunning || aPod.Status.Phase == corev1.PodPending) && aPod.ObjectMeta.DeletionTimestamp == nil {
runningPods.Items = append(runningPods.Items, aPod)
for _, pod := range workerPods.Items {
// TODO (kevin85421): We also need to have a clear story of all the Pod status phases, especially for PodFailed.

A Collaborator commented on the snippet above:

Currently, we delete PodFailed pods if the reason is Eviction.
No particular story there -- someone complained that the evicted pods aren't deleted, so we did that.
Maybe we should just delete all failed pods?

@kevin85421 (Member, Author) replied:

Currently, we delete PodFailed pods if the reason is Eviction.

} else if headPod.Status.Phase == corev1.PodFailed && strings.Contains(headPod.Status.Reason, "Evicted") {

Based on the provided code snippet, it appears that only a head Pod with a status of PodFailed and reason Evicted will be deleted. The function updateLocalWorkersToDelete filters out Pods with a status of PodFailed, so PodFailed workers are completely excluded from the deletion process. Am I missing anything?

Maybe we should just delete all failed pods?

I have considered doing that. Because the change could have a significant impact on stability, either positive or negative, I will create a separate PR with more careful consideration at a later time. There are a few points I still need to figure out first.

@kevin85421 (Member, Author) commented:

/rebase

@architkulkarni (Contributor) left a comment:

This looks good to me. Could you please add a bit to the PR description to describe the fix that's made in this PR?

(Before reading the code, based on the PR description I guessed that the PR would change the behavior to delete all Pods, including failed Pods, but what the PR actually does is correct the value of diff to fix the autoscaling bug.)
Update: discussed offline. Actually, diff was correct before; after removing the filtering function, the PR needs to account for failed Pods in diff to make it correct again.

@kevin85421 (Member, Author) commented:

@architkulkarni I have updated the PR description. Is it clear now? Thanks!

@architkulkarni (Contributor) commented:

@kevin85421 it's clear, thanks for updating!

@kevin85421 (Member, Author) commented:

Created a follow-up issue: #1144

@kevin85421 kevin85421 merged commit bc6be0e into ray-project:master Jun 6, 2023
19 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023