Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow health groups to be configured at an additional path #25471

Closed
bric3 opened this issue Mar 2, 2021 · 9 comments
Closed

Allow health groups to be configured at an additional path #25471

bric3 opened this issue Mar 2, 2021 · 9 comments
Assignees
Labels
type: enhancement A general enhancement
Milestone

Comments

@bric3
Copy link

bric3 commented Mar 2, 2021

Motivation

In a Kubernetes production with Istio and prometheus metrics.

  • Istio "installs" a sidecar (also known as istio-proxy) that intercepts inbound and outbound traffic
  • Istio is used to monitor the success rate from outside the JVM
  • Servicemonitor (Prometheus) is configured to look at /actuator/prometheus

We noticed that servicemonitor starts trying to fetch prometheus metrics very early, before the application is ready, this results in a noticeable 503 during the rollout.

Since grabbing the metrics is a separate concern than serving metrics, we wanted to expose those on a different port in order to exclude the port from Istio.

    annotations:
      # Make Istio not listening on management.server.port 8081
      traffic.sidecar.istio.io/excludeInboundPorts: "8081"

This is possible as documented here : https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-customizing-management-server-port

There's always the possibility to change the management (actuator) port, but the documentation actually warns about trusting the health endpoints with this setup: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-kubernetes-probes

If your Actuator endpoints are deployed on a separate management context, be aware that endpoints are then not using the same web infrastructure (port, connection pools, framework components) as the main application. In this case, a probe check could be successful even if the main application does not work properly (for example, it cannot accept new connections).

Suggestion

Could it be worth to distinguish two class of actuators ?

  1. The ones that represent the health of the service, and exposed on they same port, connection pool, framework than the service.
  2. The others that are here to provide other features, like metrics, and can be declared to use a different web infrastructure without health endpoint.
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Mar 2, 2021
@mbhave mbhave added the for: team-meeting An issue we'd like to discuss as a team to make progress label Mar 2, 2021
@wilkinsona wilkinsona self-assigned this Mar 5, 2021
@wilkinsona wilkinsona removed the for: team-meeting An issue we'd like to discuss as a team to make progress label Mar 5, 2021
@wilkinsona
Copy link
Member

wilkinsona commented Mar 6, 2021

We discussed this today and have a few ideas that we'd like to explore. In the meantime, I don't think you need to worry about the warning in the documentation. If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.

@wilkinsona wilkinsona added status: pending-design-work Needs design work before any code can be developed type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged labels Mar 6, 2021
@wilkinsona wilkinsona added this to the 2.x milestone Mar 6, 2021
@wilkinsona wilkinsona removed their assignment Mar 8, 2021
@bric3
Copy link
Author

bric3 commented Mar 11, 2021

Hi, thank you for taking this in consideration.

If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.

Indeed, the monitoring by Istio is certainly mitigating this warning, yet the lurking issue is about Istio allowing real request while the application main web infrastructure is not ready or live. Retry policy could help, but this may probably not what we want for non GET/HEAD requests. Imagine a change in configuration, that leads to more requests waiting on IO, this may lead to (Tomcat) connector saturation issues.

@wilkinsona
Copy link
Member

Irrespective of the management port that's being used, the readiness probe won't report that the application is ready to handle traffic until it really is ready. As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.

@snicoll snicoll changed the title Enhancement: Allow to expose metrics actuator on a different port than health actuator Allow to expose metrics actuator on a different port than health actuator Mar 24, 2021
@bric3
Copy link
Author

bric3 commented Mar 29, 2021

Just to weigh in, the following statement may not hold well when the process allows to deploy often, like multiple time a day.

As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.

And typically we got caught with an incorrect configuration change that was deployed too early (before the correct docker image was deployed), resulting in many unsatisfied requests waiting on the third party dependency that was misconfigured, this lead to saturation on the connector of the main application. The problem would have been detected earlier if the liveness probe failed at this time.

EDIT:
Another pod, another story. This time for some unknown reason a single pod decided to go wrong. Newrelic agents failed to instrument properly our JAX-RS/Jersey servlet, and as such the servlet couldn't handle traffic, yet the health probe was reporting a 200 OK status as it was exposed on a different servlet.

@mbhave mbhave self-assigned this Jul 9, 2021
@mbhave mbhave changed the title Allow to expose metrics actuator on a different port than health actuator Allow health groups to be configured at an additional path Aug 11, 2021
@mbhave mbhave removed the status: pending-design-work Needs design work before any code can be developed label Aug 12, 2021
@mbhave mbhave modified the milestones: 2.x, 2.6.0-M2 Aug 12, 2021
@mbhave mbhave closed this as completed in 49c86e6 Aug 12, 2021
@wilkinsona
Copy link
Member

Reopening to remind us to update the release notes.

@meiyese

This comment was marked as resolved.

@philwebb

This comment was marked as resolved.

@meiyese

This comment was marked as resolved.

@philwebb

This comment was marked as resolved.

@meiyese meiyese mentioned this issue Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

6 participants