
Healthcheck: add support at the load-balancers of services level #8057

Merged: 13 commits, Jun 25, 2021

Conversation

@mpl (Collaborator) commented Apr 13, 2021

What does this PR do?

This change adds support for automatic self-healthcheck for the
WeightedRoundRobin type of service, in order to make the whole load-balancing
tree aware of status changes at the "leaf" level. That is, it lets the
load-balancing algorithm adjust globally when a server's status changes (it
goes down or comes back up).

Motivation

So far, healthcheck was supported only at the "leaf" level, i.e. a load-balancer of
servers (at the bottom of the load-balancing tree) was able to do active health
checks on its own servers to adjust its load-balancing algorithm. However, when
e.g. all of its servers went down, this status change was not propagated
upwards, which means requests would still arrive at this load-balancer even
though it was in effect down and should have been ignored by its parent(s).

Therefore this change adds a mechanism so that all status changes can be
propagated upwards to let all parents know (and by extension, the whole tree)
when a service is in effect down.

The corresponding configuration change is the introduction of the HealthCheck
option to the WeightedRoundRobin element. When the HealthCheck option is present
in a WeightedRoundRobin, automatic propagation of a status change of its
children is enabled. None of the fields of HealthCheck are relevant in this
context, so they are ignored if present.
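
To illustrate the mechanism, here is a minimal, self-contained Go sketch of upward status propagation. The names (wrrBalancer, SetChildStatus, the child labels) are hypothetical and heavily simplified compared to the actual implementation; the point is only that a balancer notifies its parent when, and only when, it transitions between "at least one child up" and "all children down".

// Illustrative sketch only: the types and names below are hypothetical and
// simplified, not Traefik's actual API. It shows the idea of propagating child
// status changes upwards so a parent stops routing to a subtree that is down.
package main

import (
	"fmt"
	"sync"
)

// statusUpdater is anything that wants to know when a named child goes up or down.
type statusUpdater interface {
	SetChildStatus(childName string, up bool)
}

// wrrBalancer is a minimal weighted-round-robin node in the load-balancing tree.
type wrrBalancer struct {
	name    string
	mu      sync.Mutex
	healthy map[string]bool // child name -> up/down
	parent  statusUpdater   // nil for the root
}

func newWRRBalancer(name string, parent statusUpdater, children ...string) *wrrBalancer {
	healthy := make(map[string]bool, len(children))
	for _, c := range children {
		healthy[c] = true // children start as up
	}
	return &wrrBalancer{name: name, healthy: healthy, parent: parent}
}

// SetChildStatus records a child's status and, when the balancer as a whole
// transitions between "at least one child up" and "all children down",
// propagates that transition to its own parent.
func (b *wrrBalancer) SetChildStatus(childName string, up bool) {
	b.mu.Lock()
	upBefore := b.anyUpLocked()
	b.healthy[childName] = up
	upAfter := b.anyUpLocked()
	b.mu.Unlock()

	if upBefore == upAfter {
		return // no transition at this level, nothing to propagate
	}
	fmt.Printf("balancer %s is now up=%v\n", b.name, upAfter)
	if b.parent != nil {
		b.parent.SetChildStatus(b.name, upAfter)
	}
}

func (b *wrrBalancer) anyUpLocked() bool {
	for _, up := range b.healthy {
		if up {
			return true
		}
	}
	return false
}

func main() {
	root := newWRRBalancer("root-wrr", nil, "child-a", "child-b")
	childB := newWRRBalancer("child-b", root, "server-1", "server-2")

	// Both servers of child-b fail their health checks: the second failure
	// makes child-b transition to down, which is propagated to the root.
	childB.SetChildStatus("server-1", false)
	childB.SetChildStatus("server-2", false)
}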

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

Fixes #7693

Co-authored-by: Dmitry Sharshakov d3dx12.xx@gmail.com
Co-authored-by: Julien Salleyron julien.salleyron@gmail.com
Co-authored-by: Jean-Baptiste Doumenjou 925513+jbdoumenjou@users.noreply.github.com
Co-authored-by: Romain rtribotte@users.noreply.github.com
Co-authored-by: Tom Moulard tom.moulard@traefik.io

@ddtmachado (Contributor) commented:
Thanks, I was really eager to try this one ;)
So far I have only found one anomaly that I think is worth further investigation and maybe a unit test.

Here is the base configuration used:

http:
  services:
    whoami-public:
      weighted:
        #healthCheck: {}
        services:
        - name: whoami-dc1
          weight: 1
        - name: whoami-dc2
          weight: 1

    whoami-dc1:
      loadBalancer:
        servers:
          - url: "http://127.0.0.1:8090"
          - url: "http://127.0.0.1:8091"
        healthCheck:
          scheme: http
          interval: 2s
          timeout: 1s
          path: /ping

    whoami-dc2:
      loadBalancer:
        servers:
          - url: "http://127.0.0.1:8092"
          - url: "http://127.0.0.1:8093"
        healthCheck:
          scheme: http
          interval: 2s
          timeout: 1s
          path: /ping

  routers:
    dashboard:
      rule: Host(`traefik.docker.localhost`)
      service: api@internal

    whoami-router:
      rule: Host(`whoami.docker.localhost`)
      service: whoami-public

Step / Status:

  1. ✔️ Set up a WRR service without the healthcheck enabled (whoami-public)
  2. ✔️ Let one of its underlying services (whoami-dc2) fail the health check on all of its servers
  3. ✔️ Enable the healthcheck on the root WRR (whoami-public) and let Traefik reload the config
  4. ❌ Check that the failed service (whoami-dc2) no longer receives requests

Even after enabling the healthcheck, the failed service was still considered during round-robin by the root WRR, resulting in Service Unavailable errors when sending requests.

I'm also attaching logs starting from the config reload to confirm the behavior.

Other than that, everything works as expected when I start with the healthcheck already enabled!

P.S.: whoami is not actually whoami here, but a custom backend that lets me control the /ping endpoint.
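
As an aside, a controllable backend of that kind can be very small. The following Go sketch is only an illustration of the idea (not the actual tool used in this test): it serves a /ping endpoint whose health can be toggled at runtime via a /toggle endpoint, with one instance started per server URL from the config above.

// Minimal sketch of a test backend with a controllable /ping endpoint.
// Purely illustrative; one instance would be started per server URL.
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

func main() {
	addr := flag.String("addr", "127.0.0.1:8090", "listen address")
	flag.Parse()

	var healthy atomic.Bool
	healthy.Store(true)

	// Health endpoint probed by Traefik's health check.
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		if healthy.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// Toggle endpoint to make the server start failing (or passing) health checks.
	http.HandleFunc("/toggle", func(w http.ResponseWriter, r *http.Request) {
		healthy.Store(!healthy.Load())
		fmt.Fprintf(w, "healthy=%v\n", healthy.Load())
	})

	// Catch-all handler standing in for the real whoami response.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "hello from %s\n", *addr)
	})

	log.Fatal(http.ListenAndServe(*addr, nil))
}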

@mpl (Collaborator, Author) commented Apr 14, 2021

Thanks @ddtmachado, I will investigate.
Ah, I see what you mean: you enabled the healthcheck on the top service after things had already started to fail below. I indeed hadn't thought of that case.

@mpl force-pushed the hcwrr branch 2 times, most recently from dae94e6 to 0a1126d, on April 28, 2021 at 14:52
@mpl (Collaborator, Author) commented Apr 28, 2021


@ddtmachado, I've updated the PR with a new, simpler design. From what I've tested, it seems it should also fix the problem you reported. But could you please test it out and confirm?

Resolved review thread on pkg/server/service/service.go (outdated)
@ddtmachado (Contributor) commented:
@mpl just tested again and I confirm it does work for that edge case as well, nice work!


if !upBefore {
	// we were already down, and we still are, no need to propagate.
	log.WithoutContext().Debugf("child %s now DOWN, but we were already DOWN, so no need to propagate.", u.String())
Contributor (review comment on the snippet above):

Maybe we could add context for the log messages? IMO it's easier to follow when there are other log messages in between.

Collaborator (Author):

Yes, I also already had the same dilemma with SetStatus (in wrr.go), where the only reason I pass balancerName as an argument is for better logging. Julien and I were considering passing a context instead (with the balancerName inside it), but it didn't seem much better.
But yeah, we probably should. Adding a TODO for now.
Ah, and in the case of RemoveServer (and UpsertServer) maybe we can't even do that, because the signatures have to match the ones from oxy. I have to double-check.

Collaborator (Author):

Yeah, as I suspected, at the moment we're kind of blocked by the fixed signatures of RemoveServer and UpsertServer, because they come from oxy. It might be doable to work around that, but I didn't want to delay the PR further with that, especially since we hope to rework the part depending on oxy soon.
So for now I have half-way addressed your suggestion: the code within the method already uses a (useless for now) context, and we'll pass the actual context around as an argument later.
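
For reference, the context-based logging idea discussed in this thread could look roughly like the following standalone sketch. The helper names (withBalancerName, balancerName, setStatus) are hypothetical, and the standard library logger stands in for Traefik's own logging package: the balancer name travels in the context and is read back when logging a status change.

// Sketch of the context-based logging idea: carry the balancer name in the
// context so log lines can identify which balancer emitted them.
// Helper names are hypothetical, not the PR's actual code.
package main

import (
	"context"
	"log"
)

type ctxKey struct{}

// withBalancerName returns a context carrying the balancer name.
func withBalancerName(ctx context.Context, name string) context.Context {
	return context.WithValue(ctx, ctxKey{}, name)
}

// balancerName extracts the name, falling back to a placeholder.
func balancerName(ctx context.Context) string {
	if name, ok := ctx.Value(ctxKey{}).(string); ok {
		return name
	}
	return "<unknown balancer>"
}

// setStatus is a stand-in for the SetStatus method discussed above: instead of
// receiving the balancer name as a separate argument, it reads it from the context.
func setStatus(ctx context.Context, childName string, up bool) {
	log.Printf("balancer %s: child %s is now up=%v", balancerName(ctx), childName, up)
}

func main() {
	ctx := withBalancerName(context.Background(), "whoami-public")
	setStatus(ctx, "whoami-dc2", false)
}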

@mpl closed this May 3, 2021
@mpl added this to the next milestone May 3, 2021
@ldez reopened this May 3, 2021
@ldez closed this May 3, 2021
@ldez reopened this May 3, 2021
mpl and others added 13 commits June 25, 2021 18:50
(The commit messages restate the PR description above.)
Also added logic to break if healthcheck is enabled for some part of the tree, but not everywhere.

Still a WIP; needs more doc and more testing.
Also finish mirroring support and improve logging.
@traefiker merged commit 838a8e1 into traefik:master on Jun 25, 2021
@rtribotte changed the title from "healthcheck: add support at the load-balancers of services level" to "Healthcheck: add support at the load-balancers of services level" on Jun 28, 2021
@mpl deleted the hcwrr branch on April 13, 2022 at 07:57
Successfully merging this pull request may close these issues:

  • Weighted round robin with healthcheck

7 participants