KubernetesPodNotHealthy expr problem #94

yydance · 2020-03-24T10:31:11Z

- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
    description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

I want to use this rule, but the "expr" doesn't seem right. I get an error like:

Error executing query: invalid parameter 'query': 1:107: parse error: ranges only allowed for vector selectors

If I use min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) instead, the query runs without error.


samber · 2020-03-25T14:49:13Z

Hi @youpai, thanks for reporting this issue.

I tested the initial query, and I got the following error:

Error executing query: parse error at char 107: range specification must be preceded by a metric selector, but follows a *promql.AggregateExpr instead

Based on this message, I think min_over_time does not support subqueries.

Adding ':' does not work either, on my Prometheus instance.

What version of Prometheus are you using?

yydance · 2020-03-26T02:23:17Z

The version of Prometheus is 2.16.0.
Maybe there are some differences between versions, but I think min_over_time does support subqueries; see https://prometheus.io/blog/2019/01/28/subquery-support/
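For reference, here is a minimal sketch of the rule with the subquery form `[1h:]` applied. It is an illustration under assumptions, not necessarily what was later merged: the group name is hypothetical, and the summary uses the `namespace` and `pod` labels on the assumption that `instance` is dropped by the `sum by (...)` aggregation. Subqueries need Prometheus 2.7 or newer.

```yaml
groups:
  - name: kubernetes-pods  # hypothetical group name, not from the original rule
    rules:
      - alert: KubernetesPodNotHealthy
        # "[1h:]" turns the aggregation into a subquery evaluated over the last hour
        # at the default resolution; a plain "[1h]" range is only valid directly
        # after a metric selector, which is what the parse error complains about.
        expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
        for: 5m
        labels:
          severity: error
        annotations:
          # Assumption: namespace/pod replace "instance", which the aggregation removes.
          summary: "Kubernetes Pod not healthy ({{ $labels.namespace }}/{{ $labels.pod }})"
          description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```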

samber · 2020-03-26T15:20:02Z

Ok, my Prometheus server was too old then.

I'll prepare a PR ;)

lgg42 · 2020-10-26T14:52:38Z

Hi! I think this expression still needs some love. We just began using it and we're getting false positives: short-lived pods that go through the phases Pending, Running, Succeeded, Failed, Unknown get marked as unhealthy even if they did their work. Example:

trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     Pending             0          0s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     ContainerCreating   0          1s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                1/1     Running             0          3s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     Completed           0          16s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     Terminating         0          17s

From: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase

Failed | All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.

In our case they are terminated by the system, but everything was correct. I'll be thinking on a better expression but if you have any ideas I'm all eyes!

djhoese · 2020-11-25T20:05:58Z

@lgg42 I'm running into this too. I have GitLab CI runners and unit tests running on my cluster, and they all trigger this alert because they are short-lived. Did you ever come up with a better query?

lgg42 · 2020-11-26T11:44:51Z

@djhoese Nope, I haven't had the time yet. But it's good to see I'm not the only one, sorry you're also suffering from it 🙃
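One untested direction for the short-lived pod case, offered only as a sketch: additionally require that the subquery produced samples for most of the hour, so a pod that existed for a few seconds cannot satisfy the condition on one or two samples. The explicit `1m` resolution and the threshold of 55 are assumptions chosen for illustration, not values from this thread, and would need tuning against the actual scrape and evaluation intervals.

```yaml
- alert: KubernetesPodNotHealthy
  # Left side: the pod was in Pending/Unknown/Failed at every sampled step.
  # Right side (assumption): at least 55 one-minute steps had data, i.e. the
  # pod existed for most of the hour. "and" keeps only series present on both sides.
  expr: |
    min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) > 0
    and
    count_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) >= 55
  for: 5m
  labels:
    severity: error
```

With this shape, a pod that is created and removed within seconds yields only a handful of subquery samples and is filtered out by the count condition, while a pod stuck in Pending or Failed for the full hour still matches.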

samber mentioned this issue Mar 26, 2020

Fix kubernetes pod not health alert #97

Merged

samber closed this as completed Mar 26, 2020
