Description
What happened:
When we deploy a new image for a deployment, its associated HPA emits error events like:
2025-04-09 12:15:29.521 2025-04-09T12:15:29Z DBG bitnami/blacksmith-sandox/kubernetes-event-exporter-0.10.0/src/github.com/opsgenie/kubernetes-event-exporter/pkg/kube/watcher.go:64 > Received event involvedObject=my-app-example-eu-west-1 msg="invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)" namespace=example reason=FailedComputeMetricsReplicas
2025-04-09 12:15:29.041 2025-04-09T12:15:29Z DBG bitnami/blacksmith-sandox/kubernetes-event-exporter-0.10.0/src/github.com/opsgenie/kubernetes-event-exporter/pkg/kube/watcher.go:64 > Received event involvedObject=my-app-example-eu-west-1 msg="failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)" namespace=otc reason=FailedGetResourceMetric
The errors resolve within 5 minutes, but that is long enough for us to receive alerts from Prometheus (KubernetesHpaMetricsUnavailability). A manual telnet test from the node running metrics-server to the kubelet of the node running the deployment confirms connectivity is OK.
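For reference, a quick way to see what the Metrics API actually serves for the target pods while the rollout is in progress (namespace and label are taken from the manifests below; adjust as needed):

# Pod metrics as the HPA controller sees them
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/example/pods"

# Same information via kubectl top
kubectl top pod -n example -l app=my-app-example-eu-west-1

Presumably the newly rolled-out pods are missing from this output during that window, which would match the "did not receive metrics for targeted pods" error.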
HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: my-app-example-eu-west-1
  name: my-app-example-eu-west-1
  namespace: example
spec:
  maxReplicas: 2
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-example-eu-west-1
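The same failures should also be visible as conditions/events on the HPA object itself while the rollout is in progress (command for reference, not a capture from our cluster):

kubectl describe hpa my-app-example-eu-west-1 -n example
# Events should show FailedGetResourceMetric / FailedComputeMetricsReplicas during the window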
Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  labels:
    app: my-app-example-eu-west-1
  name: my-app-example-eu-west-1
  namespace: app
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-app-example-eu-west-1
      app.kubernetes.io/name: my-app-example-eu-west-1
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: my-app-example-eu-west-1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: lifecycle
                operator: In
                values:
                - spot
      automountServiceAccountToken: false
      containers:
      - command:
        - python
        - manage.py
        - app_order_consumer
        - --healthcheck_port=8080
        image: xxxxx.dkr.ecr.eu-west-1.amazonaws.com/my-app-example/my-app-example:67096
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /healthcheck
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 15
        name: my-app-example-eu-west-1
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          runAsGroup: 2000
          runAsNonRoot: true
          runAsUser: 2000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - bash
        - -c
        - until python manage.py migrate --check ; do echo Waiting for startup checks...; done
        image: x.dkr.ecr.eu-west-1.amazonaws.com/my-app-example/my-app-example:67096
        imagePullPolicy: Always
        name: appapidjango-init
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 280Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          runAsGroup: 2000
          runAsNonRoot: true
          runAsUser: 2000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: app-restadmin
      serviceAccountName: app-restadmin
      terminationGracePeriodSeconds: 30
What you expected to happen:
Anything else we need to know?:
Environment:
Client Version: v1.31.6
Server Version: v1.31.6-eks-bc803b4
- Metrics Server manifest
spoiler for Metrics Server manifest:
containers:
- args:
  - --cert-dir=/tmp
  - --secure-port=10250
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  image: registry.k8s.io/metrics-server/metrics-server:v0.7.2
  imagePullPolicy: IfNotPresent
  livenessProbe:
    failureThreshold: 3
    httpGet:
      path: /livez
      port: https
      scheme: HTTPS
    periodSeconds: 10
  name: metrics-server
  ports:
  - containerPort: 10250
    name: https
    protocol: TCP
  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /readyz
      port: https
      scheme: HTTPS
    initialDelaySeconds: 20
    periodSeconds: 10
  resources:
    requests:
      cpu: 100m
      memory: 200Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  volumeMounts:
  - mountPath: /tmp
    name: tmp-dir
nodeSelector:
  kubernetes.io/os: linux
priorityClassName: system-cluster-critical
serviceAccountName: metrics-server
volumes:
- emptyDir: {}
  name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100
- Kubelet config:
spoiler for Kubelet config:
- Metrics server logs:
spoiler for Metrics Server logs:
No metrics-server logs at the times when this issue occurs. Other log entries are present at different times, mostly like this:
E0409 13:01:16.209490 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.17.142.53:10250/metrics/resource\": context deadline exceeded" node="ip-10-17-142-53.eu-west-1.compute.internal" timeout="10s"
These entries only appear when a spot node is being terminated.
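For completeness, the kubelet endpoint that metrics-server scrapes can also be hit through the API server proxy (node name taken from the log line above); this goes via the API server rather than from the metrics-server pod, so it is only a rough reachability check:

kubectl get --raw "/api/v1/nodes/ip-10-17-142-53.eu-west-1.compute.internal/proxy/metrics/resource" | head
# Should return the kubelet resource metrics that metrics-server scrapes when the kubelet is reachable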
- Status of Metrics API:
spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       argocd.argoproj.io/instance=metrics-server
              k8s-app=metrics-server
Annotations:  <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2022-04-22T14:59:30Z
  Resource Version:    1898508864
  UID:                 8d8c95c8-5656-4dc5-a380-91f288c8e03b
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2025-03-30T20:15:03Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:  <none>
/kind bug