Nodes Disrupted Outside Scheduled Window #2329

Open
@pijain

Description


Observed Behavior:
We have a Karpenter NodePool with disruption budgets defined, including a scheduled disruption window for the Underutilized reason. However, nodes are being disrupted for that reason well outside the defined window.

Additionally, nodes with a single high-memory pod (e.g., 24Gi out of 32Gi) are still being marked as underutilized rather than fully utilized or non-empty.

Karpenter Logs:

{"level":"INFO","time":"2025-06-22T17:36:58.825Z","logger":"controller","message":"disrupting node(s)","reason":"underutilized","decision":"delete","disrupted-node-count":1,"pod-count":1}
{"level":"INFO","time":"2025-06-22T17:36:59.679Z","logger":"controller","message":"tainted node","taint.Key":"karpenter.sh/disrupted"}
{"level":"INFO","time":"2025-06-22T17:43:35.214Z","logger":"controller","message":"deleted nodeclaim","NodeClaim":{"name":"<masked>"}} 
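
To illustrate, here is a minimal sketch (my own, not Karpenter code; it assumes the budget schedule is evaluated in UTC) checking whether the disruption timestamp from the log above falls inside the `0 4 * * *` + 4h window:

```python
from datetime import datetime, timedelta, timezone

def in_budget_window(t, start_hour=4, duration=timedelta(hours=4)):
    """True if t falls inside the daily window opened at start_hour UTC
    by the cron schedule '0 4 * * *' and kept open for `duration`.
    Assumes the window does not cross midnight (true for 04:00 + 4h)."""
    window_start = t.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    return window_start <= t < window_start + duration

# Timestamp from the first log line: 2025-06-22T17:36:58Z
log_time = datetime(2025, 6, 22, 17, 36, 58, tzinfo=timezone.utc)
print(in_budget_window(log_time))  # → False: 17:36 UTC is outside 04:00-08:00 UTC
```

If the Underutilized budget were being honored, the 17:36 UTC disruption falls well outside the 04:00-08:00 UTC window it defines.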

Expected Behavior:

  • Nodes should only be disrupted for the Underutilized reason during the scheduled window.
  • Nodes with a single high-memory pod (e.g., 75% of node memory requested) should not be marked underutilized unless criteria are clearly defined.
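
As a back-of-the-envelope check on the second point (my own arithmetic, using the nominal figures from the report; actual kubelet-allocatable memory on the node is somewhat lower than the instance's 32Gi):

```python
# One pod requesting 24Gi on a g6.2xlarge with a nominal 32Gi of memory.
GI = 1024 ** 3
requested = 24 * GI
capacity = 32 * GI  # nominal; allocatable is lower after system reservations
print(f"{requested / capacity:.0%}")  # → 75%
```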

Reproduction Steps (Please include YAML):
Here is a simplified version of the NodePool configuration:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-g6
spec:
  disruption:
    budgets:
      - duration: 4h
        nodes: 10%
        reasons: [Drifted]
        schedule: 0 4 1-7 1,4,7,10 1
      - duration: 4h
        nodes: 10%
        reasons: [Underutilized]
        schedule: 0 4 * * *
      - nodes: 70%
        reasons: [Empty]
    consolidateAfter: 30m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '1000'
    memory: 5000Gi
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: al2023
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [g6.2xlarge]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand]
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: 'true'
  weight: 25
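
One side note on the Drifted schedule above: in standard five-field cron (the Vixie-cron rule most parsers follow), restricting both day-of-month and day-of-week makes them match as an OR, so `0 4 1-7 1,4,7,10 1` fires on days 1-7 or on any Monday of those months, not only the first Monday. A small illustrative sketch (my own, assuming that OR rule applies):

```python
from calendar import monthrange
from datetime import date

def matching_days(year, month, dom_range=range(1, 8), cron_dow=1):
    """Days in (year, month) matched by a cron with day-of-month 1-7 and
    day-of-week 1 (Monday), using the standard OR rule when both are set."""
    days = []
    for d in range(1, monthrange(year, month)[1] + 1):
        dom_match = d in dom_range
        # cron day-of-week 1 == Monday; Python's weekday() 0 == Monday
        dow_match = date(year, month, d).weekday() == cron_dow - 1
        if dom_match or dow_match:
            days.append(d)
    return days

print(matching_days(2025, 1))  # days 1-7 plus every later Monday in Jan 2025
```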

Versions:

  • Chart Version: 1.5.0
  • Kubernetes Version (kubectl version): 1.32 (EKS)

Community Note:

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
