KubernetesPodNotHealthy expr problem #94

Closed
yydance opened this issue Mar 24, 2020 · 6 comments

Comments


yydance commented Mar 24, 2020

- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
    description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

I want to use this, but the "expr" doesn't seem right. I get an error like:

Error executing query: invalid parameter 'query': 1:107: parse error: ranges only allowed for vector selectors

If I use min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]), the result is OK.

Owner

samber commented Mar 25, 2020

Hi @youpai, thanks for reporting this issue.

I tested the initial query, and I got the following error:

Error executing query: parse error at char 107: range specification must be preceded by a metric selector, but follows a *promql.AggregateExpr instead

Regarding this message, I think min_over_time does not support subquery.

Adding ':' does not work either, on my Prometheus instance.

What version of Prometheus are you using?

Author

yydance commented Mar 26, 2020

The version of Prometheus is 2.16.0.
Maybe there are some differences between versions, but I think min_over_time does support subqueries; see https://prometheus.io/blog/2019/01/28/subquery-support/
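
For readers comparing the two forms: as the linked blog post describes, `[1h]` is a range selector and is only accepted directly after a vector selector, whereas `[1h:]` is a subquery and can follow an arbitrary instant-vector expression such as the `sum by (...)` aggregation. Subqueries arrived in Prometheus 2.7, so an older server would be expected to reject the colon form as well. A minimal sketch of the rule with the subquery form, keeping everything else as originally posted; the explicit `1m` resolution is an illustrative assumption, and leaving it out falls back to the global evaluation interval:

```yaml
- alert: KubernetesPodNotHealthy
  # "[1h:1m]" is a subquery: evaluate the aggregation every 1m over the last 1h,
  # then take the minimum of those results. A plain "[1h]" in this position is a
  # parse error because a range selector cannot follow an aggregation.
  expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) > 0
  for: 5m
  labels:
    severity: error
```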

Owner

samber commented Mar 26, 2020

Ok, my Prometheus server was too old then.

I'll prepare a PR ;)


lgg42 commented Oct 26, 2020

Hi! I think this expression still needs some love. We just began using it and we're getting false positives: short-lived pods that go through the phases Pending, Running, Succeeded, Failed, Unknown get marked as unhealthy even though they did their work. Example:

trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     Pending             0          0s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     ContainerCreating   0          1s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                1/1     Running             0          3s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     Completed           0          16s
trafficdatalakeingestionespdrivingsensor-12de660039ef47959b6b90046c978ff2                0/1     Terminating         0          17s

From: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase

Failed | All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.

In our case they are terminated by the system, but everything was correct. I'll keep thinking about a better expression, but if you have any ideas I'm all eyes!
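
One possible reading of why this happens, offered as a hypothesis rather than a confirmed diagnosis: once a pod is deleted, its `kube_pod_status_phase` series disappears, so the 1h subquery window contains only the few samples taken while the pod existed. If every one of those samples caught the pod in `Pending` or `Failed`, `min_over_time` stays at 1 for up to an hour and the `for: 5m` condition is met long after the pod is gone. A sketch of one alternative that drops the subquery and relies on `for` alone, so the pending alert resets as soon as the series vanishes or the pod leaves those phases; the `15m` duration is an illustrative assumption, and this has not been tested against the workloads described here:

```yaml
- alert: KubernetesPodNotHealthy
  # No subquery: the expression must return the pod at every rule evaluation
  # for the whole "for" duration, so pods that finish and are removed within
  # 15m never fire.
  expr: sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  for: 15m
  labels:
    severity: error
```

The trade-off is that a pod flapping between `Running` and `Failed` resets the timer each time, and a pod that stays in `Failed` without being cleaned up will still fire.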


djhoese commented Nov 25, 2020

@lgg42 I'm running into this too. I have GitLab CI runners and unit tests running on my cluster and they all trigger this because they are short-lived. Did you ever come up with a better query?


lgg42 commented Nov 26, 2020

@djhoese Nope, I haven't had the time yet. But it's good to see I'm not the only one; sorry you're suffering from it too 🙃
