Allow health groups to be configured at an additional path #25471

bric3 · 2021-03-02T10:12:04Z

Motivation

We run a Kubernetes production environment with Istio and Prometheus metrics:

Istio "installs" a sidecar (also known as istio-proxy) that intercepts inbound and outbound traffic.
Istio is used to monitor the success rate from outside the JVM.
A ServiceMonitor (Prometheus) is configured to scrape /actuator/prometheus.

We noticed that the ServiceMonitor starts trying to fetch Prometheus metrics very early, before the application is ready, which results in noticeable 503 responses during the rollout.

Since grabbing the metrics is a separate concern from serving metrics, we wanted to expose them on a different port in order to exclude that port from Istio.

    annotations:
      # Prevent Istio from listening on management.server.port (8081)
      traffic.sidecar.istio.io/excludeInboundPorts: "8081"

This is possible, as documented here: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-customizing-management-server-port
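As a rough sketch, the application side of this setup might look like the following in application.yaml. The port 8081 matches the annotation above; the exposure list is an assumption for illustration and not taken from our actual configuration.

```yaml
# application.yaml (illustrative sketch)
management:
  server:
    # Serve all actuator endpoints from a separate management port,
    # the same port excluded from the Istio sidecar above.
    port: 8081
  endpoints:
    web:
      exposure:
        # Assumed list; expose whichever endpoints you actually need.
        include: "health,prometheus"
```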

It is always possible to change the management (actuator) port, but the documentation actually warns against trusting the health endpoints with this setup: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-kubernetes-probes

If your Actuator endpoints are deployed on a separate management context, be aware that endpoints are then not using the same web infrastructure (port, connection pools, framework components) as the main application. In this case, a probe check could be successful even if the main application does not work properly (for example, it cannot accept new connections).

Suggestion

Would it be worth distinguishing two classes of actuators?

Those that represent the health of the service, and are exposed on the same port, connection pool, and framework as the service.
The others, which provide other features such as metrics, and can be declared to use a different web infrastructure without the health endpoint.

wilkinsona · 2021-03-06T13:21:53Z

We discussed this today and have a few ideas that we'd like to explore. In the meantime, I don't think you need to worry about the warning in the documentation. If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.

bric3 · 2021-03-11T15:51:09Z

Hi, thank you for taking this into consideration.

If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.

Indeed, the monitoring by Istio certainly mitigates this risk, yet the lurking issue is that Istio allows real requests through while the application's main web infrastructure is not ready or live. A retry policy could help, but this is probably not what we want for non-GET/HEAD requests. Imagine a change in configuration that leads to more requests waiting on IO; this may lead to (Tomcat) connector saturation issues.

wilkinsona · 2021-03-11T16:21:24Z

Irrespective of the management port that's being used, the readiness probe won't report that the application is ready to handle traffic until it really is ready. As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.

bric3 · 2021-03-29T10:20:39Z

Just to weigh in, the following statement may not hold well when the process allows deploying often, such as multiple times a day.

As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.

Typically, we got caught by an incorrect configuration change that was deployed too early (before the correct Docker image was deployed), resulting in many unsatisfied requests waiting on the misconfigured third-party dependency. This led to saturation of the main application's connector. The problem would have been detected earlier if the liveness probe had failed at that time.

EDIT:
Another pod, another story. This time, for some unknown reason, a single pod went wrong. The New Relic agents failed to properly instrument our JAX-RS/Jersey servlet, and as such the servlet couldn't handle traffic, yet the health probe was reporting a 200 OK status because it was exposed on a different servlet.

wilkinsona · 2021-08-12T06:29:15Z

Reopening to remind us to update the release notes.
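As a rough sketch of what the change enables (assuming Spring Boot 2.6.0-M2 or later, the milestone this issue was closed in), a health group can be made available at an additional path on the main server port while the rest of the actuator stays on the management port. The paths /livez and /readyz and the port 8081 are illustrative choices, not requirements.

```yaml
# application.yaml (illustrative sketch)
management:
  server:
    # Metrics and other actuator endpoints stay on the management port.
    port: 8081
  endpoint:
    health:
      probes:
        # Enables the liveness and readiness health groups.
        enabled: true
      group:
        liveness:
          # The "server:" prefix serves this group on the main server port.
          additional-path: "server:/livez"
        readiness:
          additional-path: "server:/readyz"
```

With a configuration along these lines, Kubernetes probes can target the main application port, so they exercise the same connector and web infrastructure as real traffic, while Prometheus continues to scrape the management port.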

spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Mar 2, 2021

mbhave added the for: team-meeting An issue we'd like to discuss as a team to make progress label Mar 2, 2021

wilkinsona self-assigned this Mar 5, 2021

wilkinsona removed the for: team-meeting An issue we'd like to discuss as a team to make progress label Mar 5, 2021

wilkinsona added status: pending-design-work Needs design work before any code can be developed type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged labels Mar 6, 2021

wilkinsona added this to the 2.x milestone Mar 6, 2021

wilkinsona removed their assignment Mar 8, 2021

snicoll changed the title ~~Enhancement: Allow to expose metrics actuator on a different port than health actuator~~ Allow to expose metrics actuator on a different port than health actuator Mar 24, 2021

mbhave self-assigned this Jul 9, 2021

mbhave changed the title ~~Allow to expose metrics actuator on a different port than health actuator~~ Allow health groups to be configured at an additional path Aug 11, 2021

mbhave removed the status: pending-design-work Needs design work before any code can be developed label Aug 12, 2021

mbhave modified the milestones: 2.x, 2.6.0-M2 Aug 12, 2021

mbhave closed this as completed in 49c86e6 Aug 12, 2021

wilkinsona reopened this Aug 12, 2021

mbhave closed this as completed Aug 16, 2021

ThomasVitale mentioned this issue Sep 25, 2021

Application fails to start with excluded health endpoint when JMX is enabled #28131

Closed

maxxedev mentioned this issue Dec 1, 2021

Setting cache time-to-live for the health endpoint has no effect #28882

Closed

wilkinsona mentioned this issue Dec 6, 2021

Extends Additional Path on Main or Management Port for actuator endpoints #28914

Closed

leovx mentioned this issue Jan 22, 2022

Health Web Endpoint Extension Failed to Initialize When Some Conditions Hit #29532

Closed

chicobento mentioned this issue Aug 30, 2023

Configurable Health Indicator to observe HTTP Endpoints #36953

Open
