Better controls for Health Indicators #18753

worldtiki · 2019-10-26T18:54:31Z

Hi 👋

This issue was discussed a few years back (#7626 in 2016) but since then things have changed.
With the jump in adoption of systems like Kubernetes/... it seems that the way Health Indicators operate should be revisited. An example of this is the new health indicator groups feature, which supports different probes (liveness and readiness), but IMO this is still not enough.

I would like to suggest another change which is the ability to specify if the status of a specific health indicator should affect the overall health check.

This is already being done by some implementations, one of the most high profile ones being the Hystrix health indicator, where when a circuit breaker is open the health endpoint still returns a 200 "UP".

Having the ability to specify this behaviour would allow us to report the status of dependent systems and why they are failing (withDetails, withException) without actually forcing the overall health status to fail.
It would also make usage less ambiguous and less error-prone. For example, the Hystrix and the Resilience4j health indicators have opposite behaviours when dealing with failures: one results in a 200 UP and the other in a 503 DOWN.

I'm not sure if this could be done with a condition like management.health.foo.some-name-here or if it would have to be manually configured for each of the indicators included in the spring-boot-actuator, but I believe this is the right time to discuss if this change has merit.

ckoutsouridis · 2019-10-27T11:52:47Z

I think I suggested this (or something very similar) during the discussion of #14022 (comment)

I am happy more people are interested in this.

In my opinion, the ability to specify a "threshold" on a specific health check, e.g.

management.health.db.threshold=OUT_OF_SERVICE

is easy to implement. On top of that the actuator endpoints can also support querying by those thresholds e.g.

/actuator/health?threshold=OUT_OF_SERVICE

This way it is very straightforward to configure something like:

management.health.db.threshold=UP

This effectively disables the database health check without losing visibility.

People can then map liveness and readiness probes to /actuator/health?threshold=DOWN and /actuator/health?threshold=OUT_OF_SERVICE.

All of this would work nicely, since health statuses are already ordered.
("threshold" might not be the best term, but I couldn't find anything better for now.)
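
A minimal sketch of how such a threshold might behave, written as plain Java with no Spring dependency. Everything here is hypothetical: `ThresholdSketch`, `contribution` and `aggregate` are illustrative names, not Spring Boot APIs, and the reading of "threshold" as "the most severe status an indicator is allowed to contribute" is an assumption based on the examples above. The enum order follows Spring Boot's default status order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN).

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Spring Boot.
class ThresholdSketch {

    // Declared most severe first, following Spring Boot's default status order.
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    // The status an indicator contributes to the aggregate: its actual status,
    // unless that is more severe than the configured threshold (assumed semantics).
    static Status contribution(Status actual, Status threshold) {
        return actual.compareTo(threshold) < 0 ? threshold : actual;
    }

    // The most severe contribution wins; an empty collection yields UNKNOWN.
    static Status aggregate(Collection<Status> contributions) {
        return contributions.stream()
                .min(Comparator.naturalOrder())
                .orElse(Status.UNKNOWN);
    }

    public static void main(String[] args) {
        Map<String, Status> contributions = new LinkedHashMap<>();
        // db is DOWN but configured with threshold=UP, so it contributes UP.
        contributions.put("db", contribution(Status.DOWN, Status.UP));
        // diskSpace is UP; a threshold of DOWN means no capping at all.
        contributions.put("diskSpace", contribution(Status.UP, Status.DOWN));
        System.out.println(aggregate(contributions.values())); // prints UP
    }
}
```

Under these assumptions a threshold of DOWN leaves an indicator's behaviour unchanged, while a threshold of UP keeps its real status visible without letting it degrade the overall result.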

ckoutsouridis · 2019-10-31T11:29:09Z

As a sample health result, when someone configures management.health.db.threshold=UP

it could look something like this:

{
  "status": "UP",
  "details": {
    "db": {
      "status": "DOWN",
      "contributionStatus": "UP",
      "details": {
        "exception": "some"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 500068036608,
        "free": 340689059840,
        "threshold": 10485760
      }
    }
  }
}

philwebb · 2019-10-31T18:43:52Z

Thanks for raising these suggestions again but we're not keen to add any more complexity to the health indicator endpoint at this time. We feel that having health contributors that don't actually affect the overall status might cause quite a bit of confusion.

A couple of specific points have guided our thinking on this:

We want to keep `/actuator/health` exclusively for the application's view of the health.

We think there are better solutions for monitoring the actual health of infrastructure components such as database servers, and these should be considered outside of a Boot application. Likewise, there are useful metrics-based techniques that can be used to assess how infrastructure components are behaving.

We think most users can get quite far by using health groups

It should be possible to use health groups to solve quite a few use cases that overlap with the health indicator threshold idea. For example, you could create critical and informational groups, then use different status aggregation rules for the informational group so that the overall status isn't affected by its members.
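
As a rough illustration, such a setup might look like the following with the health group properties available from Spring Boot 2.2. The group names come from the paragraph above. The choice of members (`diskSpace` as critical, `db` as informational) and the use of an HTTP status mapping as the "different rule" are assumptions made for this sketch, not something prescribed in this thread.

```properties
# Group names follow the comment above; the choice of members is an assumption.
management.endpoint.health.group.critical.include=diskSpace
management.endpoint.health.group.informational.include=db
management.endpoint.health.group.informational.show-details=always
# Assumption: report informational failures in the body but keep the HTTP status at 200.
management.endpoint.health.group.informational.status.http-mapping.down=200
management.endpoint.health.group.informational.status.http-mapping.out-of-service=200
```

With this in place the groups would be served at /actuator/health/critical and /actuator/health/informational, so a probe could watch the critical group while a dashboard reads the informational one.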

I know this isn't quite as flexible as the threshold idea, and it might result in more than a single call to the health endpoint, but it is relatively easy to understand.

We can certainly reconsider things again in the future, but for now we'd like to see how far people can get with health indicator groups. We'll also keep monitoring this issue to see if other users add comments.

spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Oct 26, 2019

worldtiki mentioned this issue Oct 26, 2019

Spring Boot health status switch to DOWN when circuit breaker is opened resilience4j/resilience4j#607

Closed

snicoll changed the title ~~[Feature Request] Better controls for Health Indicators~~ Better controls for Health Indicators Oct 27, 2019

mbhave added the for: team-attention An issue we'd like other members of the team to review label Oct 29, 2019

philwebb closed this as completed Oct 31, 2019

philwebb added status: declined A suggestion or change that we don't feel we should currently apply and removed for: team-attention An issue we'd like other members of the team to review status: waiting-for-triage An issue we've not yet triaged labels Oct 31, 2019
