
Constant Underutilized eviction on highly allocated nodes #2319

Open
@panicstevenson

Description


Observed Behavior:
The following node sees consistent disruption of the client-production pod:

Non-terminated Pods:  (13 in total)
  Namespace           Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits   Age
  ---------           ----                                   ------------  ----------  ---------------  -------------   ---
  client-production   client-pod-production-cc75cfdcf-6n989  60750m (94%)  0 (0%)      238216Mi (95%)   237780Mi (94%)  7m24s
  kube-system         daemonset1-pod-k2rgd                   0 (0%)        0 (0%)      0 (0%)           0 (0%)          48m
  kube-system         daemonset2-pod-xvgdb                   200m (0%)     0 (0%)      835Mi (0%)       835Mi (0%)      48m
  kube-system         daemonset3-pod-dwvlf                   165m (0%)     0 (0%)      612Mi (0%)       512Mi (0%)      48m
  kube-system         daemonset4-pod-zqwkq                   200m (0%)     0 (0%)      464Mi (0%)       464Mi (0%)      48m
  kube-system         daemonset5-pod-qgqtd                   25m (0%)      0 (0%)      0 (0%)           0 (0%)          48m
  kube-system         daemonset6-pod-dgtgl                   30m (0%)      0 (0%)      120Mi (0%)       768Mi (0%)      48m
  kube-system         daemonset7-pod-gfdbj                   10m (0%)      0 (0%)      128Mi (0%)       128Mi (0%)      48m
  kube-system         static-pod1-node-name                  200m (0%)     200m (0%)   64Mi (0%)        64Mi (0%)       48m
  kube-system         static-pod2-node-name                  0 (0%)        0 (0%)      0 (0%)           0 (0%)          48m
  kube-system         daemonset8-pod-cwsz2                   100m (0%)     0 (0%)      64Mi (0%)        64Mi (0%)       48m
  kube-system         daemonset9-pod-mgtlm                   300m (0%)     0 (0%)      128Mi (0%)       128Mi (0%)      48m
  kube-system         daemonset0-pod-wp4jg                   0 (0%)        0 (0%)      0 (0%)           0 (0%)          48m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests        Limits
  --------           --------        ------
  cpu                61980m (96%)    200m (0%)
  memory             240631Mi (96%)  240743Mi (96%)
  ephemeral-storage  83072Mi (36%)   76928Mi (34%)
  hugepages-1Gi      0 (0%)          0 (0%)
  hugepages-2Mi      0 (0%)          0 (0%)

The pod is consistently evicted across all nodes (AWS m6i.16xlarge) in the cluster. It is in a nodepool with only on-demand AWS instance types: c6i.16xlarge, m6i.16xlarge, and r6i.16xlarge, listed in ascending price order. The client-production pod cannot fit on a c6i.16xlarge, so it should settle on an m6i.16xlarge, the next-cheapest instance type.
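For reference, the nodepool for this workload looks roughly like the sketch below (the name and EC2NodeClass reference are illustrative, not our exact manifests): on-demand capacity only, restricted to those three instance types. A c6i.16xlarge has 128 GiB of memory while the pod requests ~238 GiB, so only the m6i.16xlarge (256 GiB) and r6i.16xlarge (512 GiB) are feasible, which is why we expect it to settle on the m6i.16xlarge.

```yaml
# Sketch of the NodePool for this workload (names are illustrative).
# The requirements are the relevant part: on-demand only, three instance types.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: client-production        # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c6i.16xlarge", "m6i.16xlarge", "r6i.16xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # hypothetical EC2NodeClass
```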

There is another workload in this cluster that uses reserved instance types in a separate nodepool. This workload scales up to double its reserved capacity, with the overflow landing on on-demand capacity (e.g. 200 total nodes: 100 reserved, 100 on-demand). The second workload is evicted throughout the day as on-demand capacity is moved back to reserved after it scales back down to minimum replicas.
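Roughly, that second workload's capacity split looks like the sketch below (names, weights, and limits are illustrative, not our exact manifests): a higher-weight nodepool capped via spec.limits at the reserved footprint, and a lower-weight nodepool that absorbs the overflow.

```yaml
# Illustrative sketch of the second workload's two nodepools (values made up).
# Reserved instances are a billing construct, so both pools launch on-demand
# capacity; the higher-weight pool is simply capped at the reserved footprint.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload2-reserved        # hypothetical name
spec:
  weight: 100                     # preferred while under its limit
  limits:
    cpu: "6400"                   # roughly 100 x 64-vCPU nodes
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # hypothetical EC2NodeClass
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload2-overflow        # hypothetical name
spec:
  weight: 10                      # used once the reserved pool is at its limit
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

When that workload scales back to minimum replicas, the overflow nodes are consolidated away, which is consistent with the daily churn described above.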

There are effectively no other workloads in this cluster that differentiate it from our other clusters.

Expected Behavior:
A node with high allocation efficiency should not be disrupted as Underutilized when no cheaper node can be found for its pods.
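For context on the "Underutilized" wording: this comes from Karpenter consolidation when the nodepool's consolidationPolicy allows it. A sketch of the relevant disruption block is below (values are illustrative, not our exact settings).

```yaml
# Illustrative disruption block (not our exact settings). consolidateAfter and
# budgets can slow the churn down, but they do not change the decision that a
# node with a supposedly cheaper replacement is "Underutilized".
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    budgets:
      - nodes: "10%"
```

Annotating the pod with karpenter.sh/do-not-disrupt: "true" would stop the evictions, but it also blocks all voluntary disruption of that node, so we are treating it as a workaround rather than a fix.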

Reproduction Steps (Please include YAML):
See above; we will continue debugging on our end in the meantime.

Versions:

  • Chart Version: v1.5.0
  • Kubernetes Version (kubectl version): v1.29.15
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Labels: kind/bug, needs-priority, needs-triage
