[Bug] [raycluster-controller] Kuberay cannot recreate new raycluster header pod when it has been evicted by kubelet as disk pressure #2125
Comments
Well, at least the log message clearly expresses the authors' misunderstanding of Kubernetes eviction behavior (which, to be fair, is extremely confusing).

Pods evicted from a K8s node will be re-scheduled. If not, you need to check whether your pod has some kind of affinity rule or something else that prevents it from being re-scheduled.

That is not true. A pod evicted due to disk pressure is marked Failed and will never be rescheduled. If the pod is part of a ReplicaSet, then a new pod will be created to replace it. This issue is asking for the KubeRay operator to do the same thing as the ReplicaSet controller.
In the case where a Node is down for long enough and Kubernetes evicts the pod (i.e. deletes it), there should be some logic in the KubeRay operator to recreate the head pod. As Dmitri already said, in the case of disk pressure, the Pod object is never deleted, so the recreate logic never kicks in. However, per the stackoverflow discussion shared earlier, pods evicted due to disk pressure usually have a `Failed` phase. @xjhust it seems like if you set the restart policy of the head pod to `Never`, KubeRay will delete and recreate it.
I suspect that if we ran the head pod as a single-replica StatefulSet, we may be able to automatically recover from some of these scenarios. @kevin85421 have we ever considered this before?

This might not be the right move, since some users will expect the head pod to exit on failure. (You wouldn't be able to support restart policy Never.) My suggestion would be to modify the operator code to create a new pod if the restart policy is Always and the pod enters the Failed state.
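The suggested operator change can be sketched as follows. This is a minimal illustration using simplified stand-in types; the real operator works with `k8s.io/api/core/v1` objects, and the names here are hypothetical:

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod, for illustration only.
type Pod struct {
	Name          string
	Phase         string // "Pending", "Running", "Succeeded", "Failed"
	RestartPolicy string // "Always", "OnFailure", "Never"
}

// shouldDeletePod sketches the proposed rule: delete a terminated pod
// regardless of restartPolicy, because a pod evicted under node
// pressure stays Failed forever and its containers never restart.
func shouldDeletePod(pod Pod) (bool, string) {
	if pod.Phase == "Failed" || pod.Phase == "Succeeded" {
		return true, fmt.Sprintf(
			"Pod %s is %s; delete it so the operator can create a replacement.",
			pod.Name, pod.Phase)
	}
	return false, fmt.Sprintf("Pod %s is %s; keep it.", pod.Name, pod.Phase)
}

func main() {
	evicted := Pod{Name: "ray-cluster-head-dqtxb", Phase: "Failed", RestartPolicy: "Always"}
	shouldDelete, reason := shouldDeletePod(evicted)
	fmt.Println(shouldDelete, reason)
}
```

With this rule, the evicted head pod in this issue would be deleted and replaced instead of sitting in `Failed` forever.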
The comments here imply that pods with restartPolicy Always never reach the Failed state, but maybe we never tested this with disk-pressure eviction. Is someone able to verify this before we change this behavior?

I'm pretty sure the comment is incorrect. I agree that someone should confirm -- try inducing a disk-pressure eviction on a pod with restart policy Always, as described in the issue description.

@xjhust can you help confirm this?
@DmitriGekhtman: currently, if the head Pod's restartPolicy is `Never` and the Pod becomes Failed, KubeRay deletes the Pod and creates a new head Pod.
Note the issue description includes a suggestion for reproduction (run the fallocate command). If you have easy access to a Kubernetes cluster, it should be straightforward to reproduce. I've observed this particular behavior (a pod with restart policy Always failing due to node pressure conditions) in production systems -- usually due to a large core dump after a segfault.

I would feel okay with removing the isRestartPolicyAlways check. It seems like in most normal conditions the Pod phase never flips to `Failed` anyway.
Let me try to summarize the current discussion:

Is my understanding correct? cc @DmitriGekhtman @andrewsykim
Yes, that's my understanding too, but I personally have not tested the pod status behavior during disk eviction; it'd be good to verify this behavior. Other than that, it seems like an improvement to handle `Failed` Pods.
Thanks for clarifying this -- I misunderstood the current behavior of the Never restart policy. It sounds good to keep the current behavior in the Never restart policy case.
That sounds like a good idea. Context on why pods are marked failed in node pressure situations: the kubelet needs to protect the node from the faulty pod's resource usage. The Kubernetes docs do not directly address the case of restart policy Always. "If the pods are managed by a workload management object (such as StatefulSet or Deployment) that replaces failed pods, the control plane (kube-controller-manager) creates new pods in place of the evicted pods." You can infer that if a controller does not replace the failed pod, there will be no replacement.
If it's possible to use a 1-pod ReplicaSet with a deterministic name for the Ray head, it would also resolve #715 for the Ray head.

What's the reason for a 1-pod ReplicaSet instead of a StatefulSet?

Likely either would work -- we could discuss the relative merits.
I think a 1-pod ReplicaSet or Deployment for the head and a StatefulSet for workers would be OK. But if we use them to reconcile the head node or worker nodes while ignoring the container restart strategy, I feel there will be a scenario where the container is already restarting, or has even recovered, but we delete it based on its previous failed state, causing a temporary disruption to the Ray head node. Is it possible to make a simple check for the eviction scenario?

```go
isPodEvicted := pod.Status.Reason == "Evicted"
if isRestartPolicyAlways && !isPodEvicted {
	reason := fmt.Sprintf(
		"The status of the %s Pod %s is %s. However, KubeRay will not delete the Pod because its restartPolicy is set to 'Always' "+
			"and it should be able to restart automatically.", nodeType, pod.Name, pod.Status.Phase)
	return false, reason
}
```
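To make the effect of this guard concrete, here is a self-contained sketch (with simplified stand-in types rather than the real `corev1.PodStatus`) exercising the interesting cases of the eviction-aware check:

```go
package main

import "fmt"

// PodStatus is a simplified stand-in for corev1.PodStatus.
type PodStatus struct {
	Phase  string // e.g. "Failed"
	Reason string // the kubelet sets "Evicted" for node-pressure evictions
}

// shouldKeepPod mirrors the guard above: keep a Failed pod with
// restartPolicy Always only if it was NOT evicted by the kubelet,
// since an evicted pod's containers will never restart in place.
func shouldKeepPod(isRestartPolicyAlways bool, status PodStatus) bool {
	isPodEvicted := status.Reason == "Evicted"
	return isRestartPolicyAlways && !isPodEvicted
}

func main() {
	// Failed but not evicted: the kubelet can still restart the containers.
	fmt.Println(shouldKeepPod(true, PodStatus{Phase: "Failed", Reason: ""})) // true
	// Failed and evicted: the keep-the-pod shortcut must not apply.
	fmt.Println(shouldKeepPod(true, PodStatus{Phase: "Failed", Reason: "Evicted"})) // false
	// restartPolicy is not Always: the operator deletes and recreates.
	fmt.Println(shouldKeepPod(false, PodStatus{Phase: "Failed", Reason: ""})) // false
}
```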
A StatefulSet for workers wouldn't work for the autoscaling use-case. When scaling down, we need to specify the specific worker that's being removed. So, we need to manage individual pods for workers.

Yeah, it'd be best for now to make the simplest fix to the code that's causing the issue :)
If a Pod is Succeeded or Failed, the Pod will not restart, based on this doc. In addition, the head Pod of a RayCluster should always be on if the CR's

Pods that have an auto-restart policy set may undergo a transition between the Failed and Running states, but usually return to the Running state quickly

Got it. It makes sense. Thanks!
@kevin85421 Can I work on this issue? After some trial and error I am finally able to reproduce the eviction behavior on my own computer. (I let the head pod be evicted due to low memory on the node.) Here are the steps:

```shell
# Create a k3d cluster with 2 agent nodes, each with 3GB memory and a hard eviction limit of 1GiB
k3d cluster create \
  --agents 2 \
  --k3s-arg "--disable=traefik@server:0" \
  --agents-memory 3g \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:0" \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:1"

kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY_MEMORY:.status.capacity.memory,ALLOCATABLE_MEMORY:.status.allocatable.memory
# Output:
#
# NAME                       CAPACITY_MEMORY   ALLOCATABLE_MEMORY
# k3d-k3s-default-agent-1    3221225Ki         2172649Ki
# k3d-k3s-default-agent-0    3221225Ki         2172649Ki
# k3d-k3s-default-server-0   32590664Ki        32590664Ki

# Taint agent 0 and 1 so that pods will only be scheduled on server-0
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule

# Install the KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --version 1.1.1 --create-namespace

# Taint server-0 and untaint agent 0 and 1 so that pods will only be scheduled on agent 0 or 1
kubectl taint nodes k3d-k3s-default-server-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule-
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule-

# Install a RayCluster
# Note that the head and worker pods will be scheduled on different nodes
# because the head pod requests 2G of memory, the worker pod requests 1G,
# and agent 0 and 1 each have only ~2G of allocatable memory.
helm install raycluster kuberay/ray-cluster --version 1.1.1

# Copy a statically linked "stress-ng" binary into the head pod
kubectl cp ./stress-ng <head-pod>:/home/ray

# Open a shell on the head pod
kubectl exec -it <head-pod> -- bash

# Simulate memory stress
./stress-ng --vm 4 --vm-bytes 2G --vm-keep
```

Result:
I also tried

@MortalHappiness what's the restart policy in your head pod? (it wasn't in the describe output) If it's
@andrewsykim I used the default helm chart without specifying any values, so the restartPolicy is the default, `Always`.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I have deployed kuberay-operator and ray-cluster in my K8s cluster, and they work fine most of the time. When I use the fallocate command to force the K8s node where the RayCluster head pod runs into disk-pressure status, the head pod is evicted by the kubelet. No new head pod is created on another healthy K8s node, even after a long time. And when I delete the large file created by the fallocate command and relieve the node's disk pressure, the head pod remains evicted and still no new head pod is created.
So I have to manually delete the evicted head pod to make the cluster work again; obviously this makes our production environment unstable and the service not highly available.
My expected behavior is that the RayCluster controller works like a Deployment: when the head pod is evicted, it automatically recreates a new head pod.
Reproduction script
Basic version information
Kubernetes: v1.20.15
ray-operator: v1.0.0
raycluster(ray): 2.9.0
Reproduction steps
Run

```shell
fallocate -l 1000G tmpfile
```

to make the imagefs full; the RayCluster head pod will then be evicted by the kubelet due to disk pressure.

Anything else
This happens every time the RayCluster head pod is evicted.
We can find the relevant logs in the ray-operator pod, such as:

```
2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"Found 1 head Pod": "ray-cluster-head-dqtxb", "Pod status": "Failed", "Pod restart policy": "Always", "Ray container terminated status": "nil"}
2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"head Pod": "ray-cluster-head-dqtxb", "shouldDelete": false, "reason": "The status of the head Pod ray-cluster-head-dqtxb is Failed. However, KubeRay will not delete the Pod because its restartPolicy is set to 'Always' and it should be able to restart automatically."}
```

So I read the relevant code in the RayCluster controller and found that it relies on the Pod's restart policy to restart the Pod when its status is Failed. But in this case, a pod that has been evicted by the kubelet will never restart, so the RayCluster stops working after the eviction.
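The behavior described above can be condensed into a small sketch (hypothetical simplified types, not the actual controller code) showing why an evicted head pod is never replaced:

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod, for illustration only.
type Pod struct {
	Phase         string // "Failed" for an evicted pod
	RestartPolicy string // "Always" in the default helm chart
}

// currentShouldDelete mirrors the logic in the operator log above: a
// Failed pod with restartPolicy Always is never deleted, on the
// assumption that its containers will restart in place.
func currentShouldDelete(p Pod) bool {
	if p.Phase != "Failed" && p.Phase != "Succeeded" {
		return false // pod is still running; nothing to do
	}
	// This assumption is wrong for kubelet evictions: an evicted pod
	// is terminally Failed and its containers never restart.
	if p.RestartPolicy == "Always" {
		return false
	}
	return true
}

func main() {
	evictedHead := Pod{Phase: "Failed", RestartPolicy: "Always"}
	// The operator declines to delete the pod, so no replacement is
	// ever created and the RayCluster stays broken.
	fmt.Println(currentShouldDelete(evictedHead)) // false
}
```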
Are you willing to submit a PR?