Description
Observed Behavior:
The following node sees consistent disruption of the client-production pod (output from `kubectl describe node`):
```
Non-terminated Pods:          (13 in total)
  Namespace          Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits   Age
  ---------          ----                                   ------------  ----------  ---------------  -------------   ---
  client-production  client-pod-production-cc75cfdcf-6n989  60750m (94%)  0 (0%)      238216Mi (95%)   237780Mi (94%)  7m24s
  kube-system        daemonset1-pod-k2rgd                   0 (0%)        0 (0%)      0 (0%)           0 (0%)          48m
  kube-system        daemonset2-pod-xvgdb                   200m (0%)     0 (0%)      835Mi (0%)       835Mi (0%)      48m
  kube-system        daemonset3-pod-dwvlf                   165m (0%)     0 (0%)      612Mi (0%)       512Mi (0%)      48m
  kube-system        daemonset4-pod-zqwkq                   200m (0%)     0 (0%)      464Mi (0%)       464Mi (0%)      48m
  kube-system        daemonset5-pod-qgqtd                   25m (0%)      0 (0%)      0 (0%)           0 (0%)          48m
  kube-system        daemonset6-pod-dgtgl                   30m (0%)      0 (0%)      120Mi (0%)       768Mi (0%)      48m
  kube-system        daemonset7-pod-gfdbj                   10m (0%)      0 (0%)      128Mi (0%)       128Mi (0%)      48m
  kube-system        static-pod1-node-name                  200m (0%)     200m (0%)   64Mi (0%)        64Mi (0%)       48m
  kube-system        static-pod2-node-name                  0 (0%)        0 (0%)      0 (0%)           0 (0%)          48m
  kube-system        daemonset8-pod-cwsz2                   100m (0%)     0 (0%)      64Mi (0%)        64Mi (0%)       48m
  kube-system        daemonset9-pod-mgtlm                   300m (0%)     0 (0%)      128Mi (0%)       128Mi (0%)      48m
  kube-system        daemonset0-pod-wp4jg                   0 (0%)        0 (0%)      0 (0%)           0 (0%)          48m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests        Limits
  --------           --------        ------
  cpu                61980m (96%)    200m (0%)
  memory             240631Mi (96%)  240743Mi (96%)
  ephemeral-storage  83072Mi (36%)   76928Mi (34%)
  hugepages-1Gi      0 (0%)          0 (0%)
  hugepages-2Mi      0 (0%)          0 (0%)
```
The pod is consistently evicted from every node (AWS m6i.16xlarge) it lands on in the cluster. It runs in a nodepool restricted to on-demand instances of three types, listed here in ascending price: c6i.16xlarge, m6i.16xlarge, and r6i.16xlarge. The client-production pod cannot fit on a c6i.16xlarge, so it should settle on an m6i.16xlarge, the next cheapest instance type.
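For context, here is a minimal NodePool sketch approximating the pool described above (the metadata name, EC2NodeClass reference, and disruption block are assumptions for illustration, not our exact manifest):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: client-production          # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumption; our real EC2NodeClass differs
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c6i.16xlarge", "m6i.16xlarge", "r6i.16xlarge"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # assumption: consolidation enabled, matching the behavior we observe
```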
A second workload in this cluster uses reserved instances in a separate nodepool. During the day it scales to roughly double its reserved capacity by spilling onto on-demand instances (e.g., 200 total nodes: 100 reserved, 100 on-demand). This workload is disrupted throughout the day as on-demand capacity is consolidated back onto reserved capacity after it scales back down to its minimum replicas.
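To illustrate the shape of that second setup (purely an assumption about how such a reserved-then-on-demand pool can be modeled, not our actual manifests), one common pattern is two weighted NodePools, where the higher-weight pool is capped at the reserved capacity and a lower-weight pool absorbs the overflow:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload2-reserved         # hypothetical name
spec:
  weight: 100                      # preferred pool, sized to the reserved capacity
  limits:
    cpu: "6400"                    # assumption: roughly 100 x 64-vCPU nodes
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumption
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload2-overflow         # hypothetical name
spec:
  weight: 10                       # fallback once the reserved pool hits its limit
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumption
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```

The idea is that, as the workload scales back down, consolidation removes the overflow nodes first, which would produce the kind of daily churn described above.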
There are effectively no other workloads in this cluster that differentiate it from our other clusters.
Expected Behavior:
A node with high allocation efficiency should not be disrupted when no cheaper node can be found to replace it.
Reproduction Steps (Please include YAML):
See above. We will continue debugging on our end and will follow up with YAML once we have a minimal reproduction.
Versions:
- Chart Version: v1.5.0
- Kubernetes Version (`kubectl version`): v1.29.15