Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditional Execution of Health Checks #3441

Closed
coreyjv opened this issue Jul 8, 2015 · 13 comments
Closed

Conditional Execution of Health Checks #3441

coreyjv opened this issue Jul 8, 2015 · 13 comments
Labels
status: declined A suggestion or change that we don't feel we should currently apply type: enhancement A general enhancement

Comments

@coreyjv
Copy link

coreyjv commented Jul 8, 2015

I've been using the Spring Boot Actuator health check functionality for a while now and really enjoy the easy of writing custom HealthIndicators. However, there is one need that Spring Boot Actuator does not currently support (see this Stack Overflow thread for reference here) that I'd like to suggest as a new feature but first I'd like to explain our challenge.

For our applications the /health endpoint is queried very frequently (say every 5 seconds) to determine if fail over to another site needs to occur. For frequently executed queries the /heath endpoint should report the health of the application, and not execute any HealthIndicators that check the health of external dependencies such as external web services because those calls are expensive.

Additionally, there are times where health checks against external dependencies are needed during post deployment or troubleshooting scenarios. It's very easy to write custom HealthIndicators and having the ability to consolidate it all under one /health endpoint is advantageous instead of writing additional code to expose this through another endpoint.

If this feature request is accepted I'd like to propose that the implementation is opt-in so by default custom HealthIndicators are executed like they are today so users don't find themselves in a situation where a HealthIndicator is not run when they expect it to be.

Finally, if possible, I'd be willing to make the contributions to implement this feature as well.

Thank you!

@snicoll
Copy link
Member

snicoll commented Jul 8, 2015

Are you aware that there is a endpoints.health.time-to-livethat caches the result of the health check. Would that help?

@coreyjv
Copy link
Author

coreyjv commented Jul 8, 2015

@snicoll --

Yes, I'm aware of the caching capabilities of the health result but I do not think it's the right solution for this need since it unilaterally caches all the results of the health check instead of an identified subset.

Setting a high time-to-live to mitigate expensive calls by HealthIndicators may give a false sense of the actual health of the application where I think a solution that allows a developer to distinguish and execute particular health indicators conditionally may be more clear.

Thanks!

@snicoll
Copy link
Member

snicoll commented Jul 8, 2015

Either way, this sounds like a nice feature. If you want to give that a try, go ahead and we can discuss more based on the code (make sure to sign the CLA).

@wilkinsona
Copy link
Member

As I said on Stack Overflow, I'm not convinced that this is a good idea. The description above hasn't changed that opinion – I still think this feels like premature optimisation.

@coreyjv You've said that some calls can be "expensive". Can you define that more concretely? Checking that a service is available, even if it's remote, should have a far lower cost than actually using that service for any real work.

I also don't really understand the point of having a health check that, based on some condition, might not actually check anything. Beyond a cache with predictable TTL, aren't you in danger of believing that your app is healthy when, in fact, it's not?

@coreyjv
Copy link
Author

coreyjv commented Jul 8, 2015

@wilkinsona,

Thanks for taking the time to respond!

Can you define that more concretely?

Sure thing. One example would be checking connectivity to a directory service such as LDAP or Active Directory. In these cases one would want to validate they could query certain parts of the tree successfully.

Another example would be a remote call on a web service. This may not only be expensive from a network perspective since it's a remote call but it presents another challenge. These downstream dependencies also have to be able to support the load that these health check calls are placing on which could slow down actual services, and be significant if the calling application is actually multiple application instances.

Beyond a cache with predictable TTL, aren't you in danger of believing that your app is healthy when, in fact, it's not?

I think there's a distinction between the application health, and the health of external dependencies. While an external dependency such as a remote web service may report as DOWN it does not necessarily indicate a problem with the application. It may be a transient failure where the network is at fault and fail over of the application should not occur in my opinion.

I hope this provides a bit of clarity and thanks for taking a critical look at this request.

@wilkinsona
Copy link
Member

I remain unconvinced by the expensive argument. A periodic health check, even if it's happening every five seconds from multiple instances, should be very lightweight in comparison to the load that those instances are producing doing real work. If those health checks are sufficient to cause a noticeable degradation in performance then you have bigger problems.

I think there's a distinction between the application health, and the health of external dependencies. While an external dependency such as a remote web service may report as DOWN it does not necessarily indicate a problem with the application. It may be a transient failure where the network is at fault and fail over of the application should not occur in my opinion.

I find this part of the reasoning more compelling. It sounds like you want the health indicators to have an error threshold such that it takes multiple errors in a given sampling window for a service to be considered down, very much like what's supported by Hystrix.

@snicoll
Copy link
Member

snicoll commented Jul 8, 2015

For the record, I thought this issue was about tuning the time-to-live of certain checks, not allowing for custom conditions on a given indicator.

@mvitz
Copy link
Contributor

mvitz commented Jan 11, 2017

FYI: We use a similar approach in our current project. If written a single class which decouples returning the health result from calculating it. All that is needed then is to schedule a trigger for the calculation method. If there is any interest I could upload an example.

@philwebb
Copy link
Member

Supporting this would require a lot of additional code on our end. We're not convinced there's enough of a need. You can implement this directly by writing a custom health aggregator.

@philwebb philwebb added the status: declined A suggestion or change that we don't feel we should currently apply label Mar 22, 2018
@ngaya-ll
Copy link

ngaya-ll commented Aug 3, 2018

@philwebb I'd like to reopen this for consideration to support the following use case.

My organization uses Spring Boot services running in a Kubernetes environment. Kubernetes has a concept of two types of health probes: liveness and readiness probes. Liveness probes are used to detect if the service needs to be restarted, while readiness probes detect if the service is ready to be added to load balancers.

It would be great if there were a way to perform different checks for readiness and liveness. For example, if an application temporarily loses connectivity to its data source it should be taken out of load balancing, but doesn't need to be restarted. So we would like to include the DataSourceHealthIndicator in the readiness probe but not the liveness probe.

Note that #8685 already allows querying a single health indicator, but not multiple indicators. Perhaps Spring Boot could provide a way to configure named groups of indicators that could be accessed as /actuator/health/{group}.

@wilkinsona wilkinsona added the for: team-attention An issue we'd like other members of the team to review label Aug 6, 2018
@philwebb philwebb reopened this Aug 8, 2018
@philwebb philwebb closed this as completed Aug 8, 2018
@snicoll snicoll removed the for: team-attention An issue we'd like other members of the team to review label Aug 8, 2018
@wilkinsona
Copy link
Member

@ngaya-ll That's an interesting use case, thank you. It feels a bit different to what was proposed in this issue so can you please open a new issue so that we can track it separately?

@ngaya-ll
Copy link

ngaya-ll commented Aug 8, 2018

@wilkinsona Sure, filed #14022

@wilkinsona
Copy link
Member

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: declined A suggestion or change that we don't feel we should currently apply type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

6 participants