
intermittent failed to get cpu resource metric error #1641

Open
@gurpalw

Description

What happened:
When we deploy a new image for a deployment, its associated HPA emits event errors like:

2025-04-09 12:15:29.521	2025-04-09T12:15:29Z DBG bitnami/blacksmith-sandox/kubernetes-event-exporter-0.10.0/src/github.com/opsgenie/kubernetes-event-exporter/pkg/kube/watcher.go:64 > Received event involvedObject=my-app-example-eu-west-1 msg="invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)" namespace=example reason=FailedComputeMetricsReplicas
2025-04-09 12:15:29.041	2025-04-09T12:15:29Z DBG bitnami/blacksmith-sandox/kubernetes-event-exporter-0.10.0/src/github.com/opsgenie/kubernetes-event-exporter/pkg/kube/watcher.go:64 > Received event involvedObject=my-app-example-eu-west-1 msg="failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)" namespace=otc reason=FailedGetResourceMetric

The errors resolve within 5 minutes, but that's long enough for us to receive alerts from Prometheus (KubernetesHpaMetricsUnavailability). A manual telnet test confirmed that the node running metrics-server can reach the kubelet on the node where the deployment's pods run, so connectivity looks fine.
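
For reference, the connectivity check was along these lines (the node IP is an example value; 10250 is the kubelet port that metrics-server scrapes):

# from the node running metrics-server; the IP is an example
telnet 10.17.142.53 10250
# the metrics pipeline can also be probed end to end through the API server:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | head -c 300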

HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: my-app-example-eu-west-1
  name: my-app-example-eu-west-1
  namespace: example
spec:
  maxReplicas: 2
  metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-example-eu-west-1
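
When the errors occur, the same failure shows up in the HPA's conditions; a quick way to observe it during a rollout (names and namespace taken from the manifests here):

kubectl -n example describe hpa my-app-example-eu-west-1
kubectl -n example get hpa my-app-example-eu-west-1 -w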

Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  labels:
    app: my-app-example-eu-west-1
  name: my-app-example-eu-west-1
  namespace: app
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-app-example-eu-west-1
      app.kubernetes.io/name: my-app-example-eu-west-1
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: my-app-example-eu-west-1
        # the selector's matchLabels must also appear here or the API server
        # rejects the Deployment; assumed trimmed during redaction
        app.kubernetes.io/instance: my-app-example-eu-west-1
        app.kubernetes.io/name: my-app-example-eu-west-1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: lifecycle
                    operator: In
                    values:
                      - spot
      automountServiceAccountToken: false
      containers:
        - command:
            - python
            - manage.py
            - app_order_consumer
            - --healthcheck_port=8080
          image: xxxxx.dkr.ecr.eu-west-1.amazonaws.com/my-app-example/my-app-example:67096
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 2
            httpGet:
              path: /healthcheck
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 15
          name: my-app-example-eu-west-1
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: 500m
              memory: 1Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsGroup: 2000
            runAsNonRoot: true
            runAsUser: 2000
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      initContainers:
        - command:
            - bash
            - -c
            - until python manage.py migrate --check ; do echo Waiting for startup checks...; done
          image: x.dkr.ecr.eu-west-1.amazonaws.com/my-app-example/my-app-example:67096
          imagePullPolicy: Always
          name: appapidjango-init
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
            requests:
              cpu: "1"
              memory: 280Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsGroup: 2000
            runAsNonRoot: true
            runAsUser: 2000
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: app-restadmin
      serviceAccountName: app-restadmin
      terminationGracePeriodSeconds: 30
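
One observation: during a rollout, freshly started pods are absent from the metrics API until metrics-server has scraped them at least once, which could explain a transient "did not receive metrics for targeted pods" window. A rough check (assumes jq is installed; label taken from the pod template above):

# pods that currently have metrics; brand-new pods won't appear yet
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/example/pods" | jq -r '.items[].metadata.name'
# compare with the pods the HPA targets
kubectl -n example get pods -l app=my-app-example-eu-west-1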

What you expected to happen:

Anything else we need to know?:

Environment:

Client Version: v1.31.6
Server Version: v1.31.6-eks-bc803b4

  • Metrics Server manifest:
      containers:
        - args:
            - --cert-dir=/tmp
            - --secure-port=10250
            - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
            - --kubelet-use-node-status-port
            - --metric-resolution=15s
          image: registry.k8s.io/metrics-server/metrics-server:v0.7.2
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /livez
              port: https
              scheme: HTTPS
            periodSeconds: 10
          name: metrics-server
          ports:
            - containerPort: 10250
              name: https
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: https
              scheme: HTTPS
            initialDelaySeconds: 20
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
            seccompProfile:
              type: RuntimeDefault
          volumeMounts:
            - mountPath: /tmp
              name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
        - emptyDir: {}
          name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100
  • Kubelet config:
  • Metrics Server logs:

No logs appear at the times when this issue occurs. Other entries are present at different periods, mostly like this:
E0409 13:01:16.209490 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.17.142.53:10250/metrics/resource\": context deadline exceeded" node="ip-10-17-142-53.eu-west-1.compute.internal" timeout="10s"

These appear only when a spot node is being terminated.
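
To double-check the correlation with spot interruptions, recent failure events can be listed and their timestamps compared against node termination notices (a minimal sketch):

kubectl get events -A --field-selector reason=FailedGetResourceMetric --sort-by=.lastTimestamp
kubectl get events -A --field-selector reason=FailedComputeMetricsReplicas --sort-by=.lastTimestamp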

  • Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:    
Labels:       argocd.argoproj.io/instance=metrics-server
              k8s-app=metrics-server
Annotations:  <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2022-04-22T14:59:30Z
  Resource Version:    1898508864
  UID:                 8d8c95c8-5656-4dc5-a380-91f288c8e03b
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            metrics-server
    Namespace:       kube-system
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2025-03-30T20:15:03Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>

/kind bug

Labels: kind/bug, needs-triage