Probe checks don't consider timeout values #296
Comments
Hey! This topic is recurring, and I'm a bit torn about it. For me, it's important that kube-score only recommends changes if that change has a high chance of increasing the reliability and availability of the overall system. The general approach that I've taken with regard to livenessProbes recently is that you should probably not have one for the vast majority of workloads. And if you do need one, and understand the risks associated, you can easily add it in a way that is compliant with kube-score.

I think there are a few problems with re-using the probes, even if the time-to-fail is longer:

1. Before startupProbes were available, it was a fairly common pattern to use a failing readinessProbe during app boot to allow the app to warm up before accepting traffic. If this pattern is used in combination with an identical livenessProbe, even with a longer time-to-fail, the system is at risk of failing the livenessProbe if the warmup takes longer than usual.

2. The probes might "accidentally" depend on an external system, such as a database or another service. When this happens in a readinessProbe the downsides are manageable: the Pods/Endpoints are removed from the Service, but the containers are still there and ready to serve traffic immediately once the probe succeeds again. If the livenessProbe has the same "accidental" behavior, the containers will be restarted, which in turn leads to more downtime than expected, as the containers are not ready to serve traffic as soon as the connections to the downstreams are healthy again. Many applications that are not built explicitly for Kubernetes expose only one type of health check. In those cases I believe it's tempting to re-use the health check endpooint for both types of probes without putting much thought into why, or understanding the risks involved.

3. If the application handles liveness and readiness properly, and the operator knows the risks associated, it's easy to add a very simple liveness endpoint to almost any application, as it should do as little as possible.

4. If the kube-score recommendation doesn't fit your workload, you can add a […]

This is of course a highly opinionated topic, and it's hard to make general recommendations for very specific issues. I try to make sure that kube-score makes good recommendations that are easy to understand for everyone, and that I can personally feel comfortable recommending. I don't want a kube-score recommendation to be what causes an incident somewhere.

/ Gustav
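The pattern described in the points above can be sketched as a probe configuration; this is only an illustration of the separation being argued for, and the `/ready` and `/live` endpoint paths, port, and threshold values are hypothetical, not taken from the issue:

```yaml
# Sketch: separate concerns per probe (paths, port, and thresholds are illustrative)
containers:
  - name: app
    startupProbe:            # covers slow warmup, instead of a deliberately failing readinessProbe
      httpGet:
        path: /ready
        port: 8080
      failureThreshold: 30   # allows up to 30 * 10s of warmup before the kubelet gives up
      periodSeconds: 10
    readinessProbe:          # may check downstreams; failing only removes the Pod from the Service
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
    livenessProbe:           # minimal endpoint: confirms only that the process can still serve HTTP
      httpGet:
        path: /live
        port: 8080
      periodSeconds: 10
```

A failing `readinessProbe` here is cheap to recover from, while a failing `livenessProbe` restarts the container, which is why the liveness endpoint should do as little as possible.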
Thank you for the detailed response. This definitely seems like a topic of debate. You make a lot of great points, and I totally understand your perspective on how the tooling should behave. I'm going to close this ticket, but FYI I did notice one quirk with […]

Thanks again.
Which version of kube-score are you using?
What did you do?
When defining a `livenessProbe` that uses the same command as a `readinessProbe` but with longer timeouts, kube-score generates a CRITICAL error.

For instance:
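The original manifest was not preserved in this thread, but a minimal sketch of the kind of spec being described might look like this (the container name, endpoint path, port, and timeout values are illustrative, not from the original report):

```yaml
# Sketch: identical probe command, with a longer time-to-fail on liveness
containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 3    # fails readiness after ~15s
    livenessProbe:
      httpGet:
        path: /healthz       # same endpoint as the readinessProbe
        port: 8080
      periodSeconds: 30
      failureThreshold: 10   # much longer time-to-fail (~5 minutes) before a restart
```

kube-score flags this because the two probes run the same check, regardless of the differing timeout values.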
What did you expect to see?
Based on the recommendation at https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html that it's OK to reuse the same command with longer timeout values, I'd expect this not to be a critical error.
What did you see instead?