Better controls for Health Indicators #18753

Closed
worldtiki opened this issue Oct 26, 2019 · 3 comments
Labels
status: declined A suggestion or change that we don't feel we should currently apply

Comments

@worldtiki

Hi 👋

This issue was discussed a few years back (#7626 in 2016) but since then things have changed.
With the jump in adoption of systems like Kubernetes, it seems that the way Health Indicators operate should be revisited. An example of this is the new health indicator groups feature, which supports different probes (liveness and readiness), but IMO this is still not enough.

I would like to suggest another change: the ability to specify whether the status of a specific health indicator should affect the overall health check.

This is already being done by some implementations, one of the most high profile ones being the Hystrix health indicator, where when a circuit breaker is open the health endpoint still returns a 200 "UP".

Having the ability to specify this behaviour would allow us to report the status of dependent systems and why they are failing (withDetails, withException) without actually forcing the overall health status to fail.
It would also make usage less ambiguous and less error-prone. For example, the Hystrix and Resilience4j health indicators have opposite behaviours when dealing with failures: one results in a 200 UP and the other in a 503 DOWN.

I'm not sure if this could be done with a condition like management.health.foo.some-name-here or if it would have to be manually configured for each of the indicators included in the spring-boot-actuator, but I believe this is the right time to discuss if this change has merit.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Oct 26, 2019
@snicoll snicoll changed the title [Feature Request] Better controls for Health Indicators Better controls for Health Indicators Oct 27, 2019
@ckoutsouridis

ckoutsouridis commented Oct 27, 2019

I think I suggested this (or something very similar) during the discussion of #14022 (comment).

I am happy more people are interested in this.

In my opinion, the ability to specify a "threshold" on a specific health check, e.g.

management.health.db.threshold=OUT_OF_SERVICE

is easy to implement. On top of that, the actuator endpoints could also support querying by those thresholds, e.g.

/actuator/health?threshold=OUT_OF_SERVICE

This way it is very straightforward to configure something like:

management.health.db.threshold=UP

which effectively disables the database health check without losing visibility.

People can then map liveness and readiness probes to (/actuator/health?threshold=DOWN, /actuator/health?threshold=OUT_OF_SERVICE).

All of this will work nicely, since health statuses are already ordered.
(Threshold might not be the best term, but I couldn't find anything better for now.)

@mbhave mbhave added the for: team-attention An issue we'd like other members of the team to review label Oct 29, 2019
@ckoutsouridis

As a sample, the health result when someone configures management.health.db.threshold=UP could look something like this:

{
  "status": "UP",
  "details": {
    "db": {
      "status": "DOWN",
      "contributionStatus": "UP",
      "details": {
        "exception": "some"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 500068036608,
        "free": 340689059840,
        "threshold": 10485760
      }
    }
  }
}
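
To make the intent of the sample above concrete, here is a minimal plain-Java sketch of how such a threshold could cap what an indicator contributes to the overall status. It is not Spring Boot code. The class HealthThresholdSketch, its methods, and the rule "contribute the less severe of the real status and the threshold" are assumptions made for illustration, as is the severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical model of the proposed per-indicator threshold; not a Spring Boot API.
final class HealthThresholdSketch {

    // Assumed severity order, most severe first.
    static final List<String> ORDER =
            Arrays.asList("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    // Position in ORDER: a lower rank means a more severe status.
    static int rank(String status) {
        int index = ORDER.indexOf(status);
        if (index < 0) {
            throw new IllegalArgumentException("Unrecognised status: " + status);
        }
        return index;
    }

    // What the indicator contributes: its real status, capped at the threshold.
    static String contribution(String actual, String threshold) {
        return rank(actual) < rank(threshold) ? threshold : actual;
    }

    // Overall status: the most severe of the contributed statuses.
    static String aggregate(Collection<String> contributions) {
        String worst = null;
        for (String status : contributions) {
            if (worst == null || rank(status) < rank(worst)) {
                worst = status;
            }
        }
        return worst == null ? "UNKNOWN" : worst;
    }

    public static void main(String[] args) {
        // db is really DOWN, but its threshold is UP, as in the sample above.
        String db = contribution("DOWN", "UP");
        // diskSpace is UP; a threshold of DOWN leaves it uncapped.
        String diskSpace = contribution("UP", "DOWN");
        System.out.println("db contributes: " + db);
        System.out.println("overall: " + aggregate(Arrays.asList(db, diskSpace)));
    }
}
```

Under these assumptions the real status of db would still be reported as DOWN next to its contributionStatus, while only the capped value takes part in the aggregation.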

@philwebb
Member

Thanks for raising these suggestions again but we're not keen to add any more complexity to the health indicator endpoint at this time. We feel that having health contributors that don't actually affect the overall status might cause quite a bit of confusion.

A couple of specific points have guided our thinking on this:

We want to keep /actuator/health exclusively for the application's view of the health.

We think there are better solutions to monitoring the actual health of infrastructure components such as database servers, and these should be considered outside of a Boot application. Likewise, there are useful metrics-based techniques that can be used to assess how infrastructure components are behaving.

We think most users can get quite far by using health groups

It should be possible to use health groups to solve quite a few use cases that would overlap with the health indicator threshold idea. For example, you could create critical and informational groups, then use different status aggregation rules for the informational group so that the overall status isn't affected by its members.

I know this isn't quite as flexible as the threshold idea, and it might result in more than a single call to the health endpoint, but it is relatively easy to understand.
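
The two-group idea can be modelled roughly in plain Java. This is a sketch under stated assumptions, not the actuator's implementation: the class HealthGroupsSketch, the groupStatus helper, and the group memberships are hypothetical, and it only models separate group membership, not a custom aggregation rule for the informational group.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of "critical" and "informational" health groups.
final class HealthGroupsSketch {

    // Assumed severity order, most severe first.
    static final List<String> ORDER =
            Arrays.asList("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    // Most severe status among the named members of a group.
    static String groupStatus(Map<String, String> indicators, List<String> members) {
        String worst = null;
        for (String name : members) {
            String status = indicators.get(name);
            if (status == null || !ORDER.contains(status)) {
                continue; // skip indicators that are missing or report an unrecognised status
            }
            if (worst == null || ORDER.indexOf(status) < ORDER.indexOf(worst)) {
                worst = status;
            }
        }
        return worst == null ? "UNKNOWN" : worst;
    }

    public static void main(String[] args) {
        Map<String, String> indicators = new LinkedHashMap<>();
        indicators.put("db", "DOWN");
        indicators.put("diskSpace", "UP");

        // A probe would query only the critical group; the informational group
        // is read separately, which is the extra call mentioned above.
        System.out.println("critical: "
                + groupStatus(indicators, Arrays.asList("diskSpace")));
        System.out.println("informational: "
                + groupStatus(indicators, Arrays.asList("db")));
    }
}
```

Here a failing db shows up only in the informational group, so a probe that reads the critical group is unaffected by it.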

We can certainly reconsider things again in the future, but for now we'd like to see how far people can get with health indicator groups. We'll also keep monitoring this issue to see if other users add comments.

@philwebb philwebb added status: declined A suggestion or change that we don't feel we should currently apply and removed for: team-attention An issue we'd like other members of the team to review status: waiting-for-triage An issue we've not yet triaged labels Oct 31, 2019
5 participants