[Bug][Autoscaler] Operator does not remove workers #1139
Conversation
```go
r.updateLocalWorkersToDelete(&worker, runningPods.Items)
```
This function will update `WorkersToDelete` by filtering out Pods that are not included in `runningPods`.

cc @qizzzh
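For readers following along, here is a minimal, self-contained sketch of the filtering behavior described above; the stand-in types and the map-based body are illustrative, not the operator's exact code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Stand-ins for the relevant parts of the RayCluster CRD types.
type ScaleStrategy struct{ WorkersToDelete []string }
type WorkerGroupSpec struct{ ScaleStrategy ScaleStrategy }

// updateLocalWorkersToDelete (sketch): keep only the entries of
// WorkersToDelete that correspond to a Pod in runningPods. A worker in
// the PodFailed phase is absent from runningPods, so it silently drops
// out of the deletion list and is never deleted.
func updateLocalWorkersToDelete(worker *WorkerGroupSpec, runningPods []corev1.Pod) {
	running := make(map[string]struct{}, len(runningPods))
	for _, pod := range runningPods {
		running[pod.Name] = struct{}{}
	}
	filtered := []string{}
	for _, name := range worker.ScaleStrategy.WorkersToDelete {
		if _, ok := running[name]; ok {
			filtered = append(filtered, name)
		}
	}
	worker.ScaleStrategy.WorkersToDelete = filtered
}

func main() {
	worker := &WorkerGroupSpec{ScaleStrategy: ScaleStrategy{WorkersToDelete: []string{"ok-pod", "failed-pod"}}}
	runningPods := []corev1.Pod{{ObjectMeta: metav1.ObjectMeta{Name: "ok-pod"}}}
	updateLocalWorkersToDelete(worker, runningPods)
	fmt.Println(worker.ScaleStrategy.WorkersToDelete) // [ok-pod]: the failed Pod was dropped
}
```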
Does it mean that before this change is released we would need to manually delete failed pods in order for the autoscaler scale-down to work? Would scale-up be affected? I'm trying to understand how to work around the issue for now.
```go
// For example, the failed Pod (Status: corev1.PodFailed) is not counted in the `runningPods` variable.
// Therefore, we should not update `diff` when we delete a failed Pod.
if isPodRunningOrPendingAndNotDeleting(pod) {
	diff++
}
```
I am wondering whether it would be more straightforward to retrieve `runningPods` and calculate `diff` after `PrioritizeWorkersToDelete` has been processed. That way, we wouldn't need to modify `diff` at this stage.
I cannot understand the point you are trying to make; we still need to calculate the number of Pods in `WorkersToDelete` that overlap with `runningPods`. Would you mind providing more details? We can continue the discussion here or chat offline.
Synced with @Yicheng-Lu-llll offline:

[Yicheng's point]

This loop sends `Delete` requests to the Kubernetes API Server to delete the Pods in `WorkersToDelete`. Afterward, we could list the worker Pods again, so that we would not need to maintain `diff` within the loop:

kuberay/ray-operator/controllers/ray/raycluster_controller.go, lines 472 to 476 in c420135:

```go
workerPods := corev1.PodList{}
filterLabels = client.MatchingLabels{common.RayClusterLabelKey: instance.Name, common.RayNodeGroupLabelKey: worker.GroupName}
if err := r.List(context.TODO(), &workerPods, client.InNamespace(instance.Namespace), filterLabels); err != nil {
	return err
}
```

[Conclusion]

We cannot guarantee that the informer cache will be consistent with the Kubernetes API Server when listing worker Pods. Therefore, it is still necessary to calculate `diff` within the loop.
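In code terms, here is a rough sketch of what calculating `diff` within the loop means; the function shape and names are illustrative, not the PR's exact code:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndTrackDiff issues the Delete calls for the Pods in
// WorkersToDelete and keeps diff (target replicas minus currently
// running/pending Pods) consistent as it goes. Re-listing afterwards is
// not safe: the controller-runtime client reads from an informer cache
// that may not yet have observed the deletions we just issued.
func deleteAndTrackDiff(ctx context.Context, c client.Client, diff int32, toDelete []corev1.Pod) (int32, error) {
	for i := range toDelete {
		pod := &toDelete[i]
		if err := c.Delete(ctx, pod); err != nil {
			return diff, err
		}
		// Only Pods that were counted in runningPods affect diff; a
		// PodFailed Pod was never counted, so deleting it changes nothing.
		if (pod.Status.Phase == corev1.PodRunning || pod.Status.Phase == corev1.PodPending) &&
			pod.ObjectMeta.DeletionTimestamp == nil {
			diff++
		}
	}
	return diff, nil
}
```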
```go
diff := workerReplicas - int32(len(runningPods.Items))

if PrioritizeWorkersToDelete {
```
Just for my education, are there any conditions under which we would need to set `PrioritizeWorkersToDelete` to `false`?

The reason I am asking is that without `PrioritizeWorkersToDelete`, if `diff >= 0`, the KubeRay operator will not respect the spec field set by the autoscaler (it will never delete the Pods in `WorkersToDelete`).
Perhaps I'm missing something, but for now, it seems to me:

- If the autoscaler is enabled, it is necessary to set `PrioritizeWorkersToDelete` to `true` to ensure that the autoscaler functions properly.
- If the autoscaler is disabled, then the user is the only one who can update the `WorkersToDelete` field. In this case, the user effectively acts as the autoscaler.

So, I cannot think of a situation where we would need to set `PrioritizeWorkersToDelete` to `false`, and I somewhat doubt the correctness of running with `PrioritizeWorkersToDelete` set to `false`.
Consider a situation with `PrioritizeWorkersToDelete` set to `false` and with the autoscaler disabled:

- At T0, we have 8 running workers.
- At T1, the user decides to remove worker A, so they set `replicas = replicas - 1` and add worker A to `WorkersToDelete` in the YAML file.
- At the same time T1, workers B and C fail because of OOM (let's assume the restart policy is `Never`).
- At T2, the KubeRay operator collects all running workers and compares the count with the expected replicas. It finds that the number of currently running Pods is 8 - 2 = 6, while the expected replicas is 8 - 1 = 7. So the KubeRay operator will add a new worker and never delete worker A. (A toy model of this follows below.)
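Here is the scenario above as a runnable toy model; `reconcile` is a deliberately simplified caricature of the operator's branch logic, and it assumes every Pod in `WorkersToDelete` is still running:

```go
package main

import "fmt"

// reconcile caricatures the operator's scale logic: without the
// PrioritizeWorkersToDelete gate, WorkersToDelete is never consulted
// when diff >= 0.
func reconcile(prioritizeWorkersToDelete bool, replicas, running int, workersToDelete []string) {
	diff := replicas - running
	if prioritizeWorkersToDelete {
		fmt.Println("deleting first:", workersToDelete)
		diff += len(workersToDelete) // assume all of them were running
	}
	if diff > 0 {
		fmt.Printf("creating %d new worker(s)\n", diff)
	}
}

func main() {
	// T2 in the scenario: expected replicas = 8-1 = 7, running = 8-2 = 6
	// (worker A is still running).
	reconcile(false, 7, 6, []string{"worker-A"})
	// => creates 1 new worker; worker A is never deleted.

	reconcile(true, 7, 6, []string{"worker-A"})
	// => deletes worker A first, then creates 2 workers.
}
```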
> Is there any conditions that we need to set PrioritizeWorkersToDelete to false?

`PrioritizeWorkersToDelete` is a feature gate (#208), and its default value is `true`. You can refer to the lessons we learned from #973: if a feature can be enabled or disabled by users, the cost of addressing an issue after the v0.5.0 release will not be that expensive. When the feature is stable enough, we can remove the gate.
```go
if (aPod.Status.Phase == corev1.PodRunning || aPod.Status.Phase == corev1.PodPending) && aPod.ObjectMeta.DeletionTimestamp == nil {
	runningPods.Items = append(runningPods.Items, aPod)
}

for _, pod := range workerPods.Items {
	// TODO (kevin85421): We also need to have a clear story of all the Pod status phases, especially for PodFailed.
```
Currently, we delete PodFailed pods if the reason is Eviction.
No particular story there -- someone complained that the evicted pods aren't deleted, so we did that.
Maybe we should just delete all failed pods?
> Currently, we delete PodFailed pods if the reason is Eviction.

```go
} else if headPod.Status.Phase == corev1.PodFailed && strings.Contains(headPod.Status.Reason, "Evicted") {
```
Based on the provided code snippet, it appears that only a head Pod with status `PodFailed` and reason `Evicted` will be deleted. The function `updateLocalWorkersToDelete` filters out the Pods with status `PodFailed`, so `PodFailed` workers are completely excluded from the deletion process. Am I missing anything?
> Maybe we should just delete all failed pods?

I am considering doing that. Because the change could have a significant impact on stability, either positive or negative, I will create a separate PR with more careful consideration at a later time. Some points that I need to figure out:

- With GCS fault tolerance enabled, KubeRay will label certain Pods as unhealthy if their probes generate unhealthy events, and it will then delete the Pods with the unhealthy label. This behavior may be refactored soon, and it overlaps a bit with removing `PodFailed` Pods.
- Some users prefer to retain certain unhealthy Pods, or Pods in a `PodFailed` state, for troubleshooting purposes (e.g. [Feature] Ensure the number of healthy workers while keep the abnormal worker for troubleshooting #1022).
Force-pushed from d7aa996 to e4dc603
/rebase
This looks good to me. Could you please add a bit to the PR description to describe the fix that's made in this PR?

(Before reading the code, based on the PR description I guessed that the PR would change the behavior to delete all Pods, including failed Pods, but what the PR actually does is correct the value of `diff` to fix the autoscaling bug.)

Update: discussed offline. Actually, `diff` was correct before, but after deleting the filtering function, the PR needs to account for that in `diff` to make it correct again.
@architkulkarni I have updated the PR description. Is it clear now? Thanks!

@kevin85421 it's clear, thanks for updating!

Created a follow-up issue: #1144
Why are these changes needed?

The function `updateLocalWorkersToDelete` updates `WorkersToDelete` by filtering out Pods that are not included in `runningPods`. Hence, Pods with the status `PodFailed` are never deleted.

In the autoscaler scale-down process, the autoscaler adds a failed Pod to `WorkersToDelete` and waits for all Pods in `WorkersToDelete` to be deleted before making the next decision. You can find the corresponding code here. However, the failed Pods will not be deleted by the operator because of `updateLocalWorkersToDelete`.

`non_terminated_nodes` in the Autoscaler (link)

Case 1: Without this PR + `PodFailed` Pod + Autoscaling
When the autoscaler adds a `PodFailed` Pod to `WorkersToDelete`, the Pod will remain stuck there until it is manually deleted by the user. This is because the function `updateLocalWorkersToDelete` ignores `PodFailed` Pods.

Case 2: Without this PR + `PodFailed` Pod + No autoscaling
The KubeRay operator maintains the number of worker Pods (specified by `Spec.WorkerGroupSpecs[*].Replicas`) that are in the `RUNNING` or `PENDING` status.

Case 3: With this PR + `PodFailed` Pod + Autoscaling
When the autoscaler adds a `PodFailed` Pod to `WorkersToDelete`, the KubeRay operator will delete all Pods in `WorkersToDelete`, regardless of their statuses.

Case 4: With this PR + `PodFailed` Pod + No autoscaling
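Below is a condensed sketch of the corrected scale-down flow; `isPodRunningOrPendingAndNotDeleting` mirrors the helper added in this PR, while the surrounding function is an illustrative simplification rather than the operator's actual code:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Mirrors the helper introduced by this PR.
func isPodRunningOrPendingAndNotDeleting(pod corev1.Pod) bool {
	return (pod.Status.Phase == corev1.PodRunning || pod.Status.Phase == corev1.PodPending) &&
		pod.ObjectMeta.DeletionTimestamp == nil
}

// reconcileWorkers condenses the corrected flow: count only
// running/pending Pods toward the current size, honor WorkersToDelete
// regardless of Pod phase, and adjust diff only for Pods that were
// counted. Returns the final diff (diff > 0: create Pods; diff < 0:
// delete extras).
func reconcileWorkers(ctx context.Context, c client.Client, workerReplicas int32,
	workerPods []corev1.Pod, workersToDelete []string) (int32, error) {
	running := int32(0)
	for _, pod := range workerPods {
		if isPodRunningOrPendingAndNotDeleting(pod) {
			running++
		}
	}
	diff := workerReplicas - running

	toDelete := make(map[string]bool, len(workersToDelete))
	for _, name := range workersToDelete {
		toDelete[name] = true
	}
	for i := range workerPods {
		pod := &workerPods[i]
		if !toDelete[pod.Name] {
			continue // not named in WorkersToDelete
		}
		// The fix: delete the Pod even if it is PodFailed.
		if err := c.Delete(ctx, pod); err != nil {
			return diff, err
		}
		// A failed Pod was never counted in `running`, so deleting it
		// must not change diff.
		if isPodRunningOrPendingAndNotDeleting(*pod) {
			diff++
		}
	}
	return diff, nil
}
```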
Related issue number
Closes #942
Checks
Reproduction