
Healthcheck: add support at the load-balancers of services level #8057

Merged: 13 commits, Jun 25, 2021

Conversation

@mpl (Collaborator) commented Apr 13, 2021

What does this PR do?

This change adds support for automatic self-healthcheck for the
WeightedRoundRobin type of service, in order to make the whole load-balancing
tree aware of status changes at the "leaf" level. That is, it lets the
load-balancing algorithm adjust globally when a server's status changes (it
goes down or comes back up).

Motivation

So far, healthcheck was supported only at the "leaf" level, i.e. a load-balancer of
servers (at the bottom of the load-balancing tree) was able to do active health
checks on its own servers to adjust its load-balancing algorithm. However, when
e.g. all of its servers went down, this status change was not propagated
upwards, which means requests would still arrive at this load-balancer even
though it was in effect down and should have been ignored by its parent(s).

Therefore this change adds a mechanism so that all status changes can be
propagated upwards to let all parents know (and by extension, the whole tree)
when a service is in effect down.

The corresponding configuration change is the introduction of the HealthCheck
option to the WeightedRoundRobin element. When the HealthCheck option is present
in a WeightedRoundRobin, automatic propagation of a status change of its
children is enabled. None of the fields of HealthCheck are relevant in this
context, so they are ignored if present.
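
To illustrate the mechanism, here is a minimal, self-contained Go sketch of upward status propagation. The names (wrrBalancer, SetChildStatus, the child labels) are hypothetical and heavily simplified compared to the actual implementation; the point is only that a balancer notifies its parent when, and only when, it transitions between "at least one child up" and "all children down".

// Illustrative sketch only: the types and names below are hypothetical and
// simplified, not Traefik's actual API. It shows the idea of propagating child
// status changes upwards so a parent stops routing to a subtree that is down.
package main

import (
	"fmt"
	"sync"
)

// statusUpdater is anything that wants to know when a named child goes up or down.
type statusUpdater interface {
	SetChildStatus(childName string, up bool)
}

// wrrBalancer is a minimal weighted-round-robin node in the load-balancing tree.
type wrrBalancer struct {
	name    string
	mu      sync.Mutex
	healthy map[string]bool // child name -> up/down
	parent  statusUpdater   // nil for the root
}

func newWRRBalancer(name string, parent statusUpdater, children ...string) *wrrBalancer {
	healthy := make(map[string]bool, len(children))
	for _, c := range children {
		healthy[c] = true // children start as up
	}
	return &wrrBalancer{name: name, healthy: healthy, parent: parent}
}

// SetChildStatus records a child's status and, when the balancer as a whole
// transitions between "at least one child up" and "all children down",
// propagates that transition to its own parent.
func (b *wrrBalancer) SetChildStatus(childName string, up bool) {
	b.mu.Lock()
	upBefore := b.anyUpLocked()
	b.healthy[childName] = up
	upAfter := b.anyUpLocked()
	b.mu.Unlock()

	if upBefore == upAfter {
		return // no transition at this level, nothing to propagate
	}
	fmt.Printf("balancer %s is now up=%v\n", b.name, upAfter)
	if b.parent != nil {
		b.parent.SetChildStatus(b.name, upAfter)
	}
}

func (b *wrrBalancer) anyUpLocked() bool {
	for _, up := range b.healthy {
		if up {
			return true
		}
	}
	return false
}

func main() {
	root := newWRRBalancer("root-wrr", nil, "child-a", "child-b")
	childB := newWRRBalancer("child-b", root, "server-1", "server-2")

	// Both servers of child-b fail their health checks: the second failure
	// makes child-b transition to down, which is propagated to the root.
	childB.SetChildStatus("server-1", false)
	childB.SetChildStatus("server-2", false)
}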

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

Fixes #7693

Co-authored-by: Dmitry Sharshakov d3dx12.xx@gmail.com
Co-authored-by: Julien Salleyron julien.salleyron@gmail.com
Co-authored-by: Jean-Baptiste Doumenjou 925513+jbdoumenjou@users.noreply.github.com
Co-authored-by: Romain rtribotte@users.noreply.github.com
Co-authored-by: Tom Moulard tom.moulard@traefik.io

@ddtmachado (Contributor) commented:
Thanks, I was really eager to try this one ;)
So far I have only found one anomaly that I think is worth further investigation and maybe a unit test.

Here is the base configuration used:

http:
  services:
    whoami-public:
      weighted:
        #healthCheck: {}
        services:
        - name: whoami-dc1
          weight: 1
        - name: whoami-dc2
          weight: 1

    whoami-dc1:
      loadBalancer:
        servers:
          - url: "http://127.0.0.1:8090"
          - url: "http://127.0.0.1:8091"
        healthCheck:
          scheme: http
          interval: 2s
          timeout: 1s
          path: /ping

    whoami-dc2:
      loadBalancer:
        servers:
          - url: "http://127.0.0.1:8092"
          - url: "http://127.0.0.1:8093"
        healthCheck:
          scheme: http
          interval: 2s
          timeout: 1s
          path: /ping

  routers:
    dashboard:
      rule: Host(`traefik.docker.localhost`)
      service: api@internal

    whoami-router:
      rule: Host(`whoami.docker.localhost`)
      service: whoami-public

Step / Status:

  1. ✔️ Set up a WRR service without the healthcheck enabled (whoami-public)
  2. ✔️ Let one of its underlying services (whoami-dc2) fail the health check on all of its servers
  3. ✔️ Enable the healthcheck on the root WRR (whoami-public) and let Traefik reload the config
  4. ❌ Check that the failed service (whoami-dc2) no longer receives requests

Even after enabling the healthcheck, the failed service was still considered during round-robin by the root WRR, resulting in Service Unavailable errors when sending requests.

I'm also attaching logs starting from the config reload to confirm the behavior.

Other than that, everything works as expected when I start with the healthcheck already enabled!

P.S.: whoami is not actually whoami here, but a custom backend that lets me control the /ping endpoint.
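
As an aside, a controllable backend of that kind can be very small. The following Go sketch is only an illustration of the idea (not the actual tool used in this test): it serves a /ping endpoint whose health can be toggled at runtime via a /toggle endpoint, with one instance started per server URL from the config above.

// Minimal sketch of a test backend with a controllable /ping endpoint.
// Purely illustrative; one instance would be started per server URL.
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

func main() {
	addr := flag.String("addr", "127.0.0.1:8090", "listen address")
	flag.Parse()

	var healthy atomic.Bool
	healthy.Store(true)

	// Health endpoint probed by Traefik's health check.
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		if healthy.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// Toggle endpoint to make the server start failing (or passing) health checks.
	http.HandleFunc("/toggle", func(w http.ResponseWriter, r *http.Request) {
		healthy.Store(!healthy.Load())
		fmt.Fprintf(w, "healthy=%v\n", healthy.Load())
	})

	// Catch-all handler standing in for the real whoami response.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "hello from %s\n", *addr)
	})

	log.Fatal(http.ListenAndServe(*addr, nil))
}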

@mpl (Collaborator, Author) commented Apr 14, 2021

Thanks @ddtmachado, I will investigate.
Ah, I see what you mean: you enabled the healthcheck on the top service after things had already started to fail below. I indeed hadn't thought of that case.

@mpl force-pushed the hcwrr branch 2 times, most recently from dae94e6 to 0a1126d, on April 28, 2021 at 14:52
@mpl (Collaborator, Author) commented Apr 28, 2021


@ddtmachado, I've updated the PR with a new, simpler design. From what I've tested, it seems it should also fix the problem you reported. But could you please test it out and confirm?

Resolved review thread on pkg/server/service/service.go (outdated)
@ddtmachado (Contributor) commented:
@mpl just tested again and I confirm it does work for that edge case as well, nice work!


if !upBefore {
	// we were already down, and we still are, no need to propagate.
	log.WithoutContext().Debugf("child %s now DOWN, but we were already DOWN, so no need to propagate.", u.String())
Contributor (review comment on the snippet above):

Maybe we could add context for the log messages? IMO it's easier to follow when there are other log messages in between.

Collaborator (Author):

Yes, I also already had the same dilemma with SetStatus (in wrr.go), where the only reason I pass balancerName as an argument is for better logging. Julien and I were considering passing a context instead (with the balancerName inside it), but it didn't seem much better.
But yeah, we probably should. Adding a TODO for now.
Ah, and in the case of RemoveServer (and UpsertServer) maybe we can't even do that, because the signatures have to match the ones from oxy. I have to double-check.

Collaborator (Author):

Yeah, as I suspected, at the moment we're kind of blocked by the fixed signatures of RemoveServer and UpsertServer, because they come from oxy. It might be doable to work around that, but I didn't want to delay the PR further with that, especially since we hope to rework the part depending on oxy soon.
So for now I have half-way addressed your suggestion: the code within the method already uses a (useless for now) context, and we'll pass the actual context around as an argument later.
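
For reference, the context-based logging idea discussed in this thread could look roughly like the following standalone sketch. The helper names (withBalancerName, balancerName, setStatus) are hypothetical, and the standard library logger stands in for Traefik's own logging package: the balancer name travels in the context and is read back when logging a status change.

// Sketch of the context-based logging idea: carry the balancer name in the
// context so log lines can identify which balancer emitted them.
// Helper names are hypothetical, not the PR's actual code.
package main

import (
	"context"
	"log"
)

type ctxKey struct{}

// withBalancerName returns a context carrying the balancer name.
func withBalancerName(ctx context.Context, name string) context.Context {
	return context.WithValue(ctx, ctxKey{}, name)
}

// balancerName extracts the name, falling back to a placeholder.
func balancerName(ctx context.Context) string {
	if name, ok := ctx.Value(ctxKey{}).(string); ok {
		return name
	}
	return "<unknown balancer>"
}

// setStatus is a stand-in for the SetStatus method discussed above: instead of
// receiving the balancer name as a separate argument, it reads it from the context.
func setStatus(ctx context.Context, childName string, up bool) {
	log.Printf("balancer %s: child %s is now up=%v", balancerName(ctx), childName, up)
}

func main() {
	ctx := withBalancerName(context.Background(), "whoami-public")
	setStatus(ctx, "whoami-dc2", false)
}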

@mpl closed this May 3, 2021
@mpl added this to the next milestone May 3, 2021
@ldez reopened this May 3, 2021
@ldez closed this May 3, 2021
@ldez reopened this May 3, 2021
mpl and others added 13 commits June 25, 2021 18:50
(The commit messages restate the PR description above.)
Also added logic to break if healthcheck is enabled for some part of the tree, but not everywhere.

Still a WIP; needs more doc and more testing.
Also finish mirroring support and improve logging.
@traefiker merged commit 838a8e1 into traefik:master on Jun 25, 2021
@rtribotte changed the title from "healthcheck: add support at the load-balancers of services level" to "Healthcheck: add support at the load-balancers of services level" on Jun 28, 2021
@mpl deleted the hcwrr branch on April 13, 2022 at 07:57
Successfully merging this pull request may close these issues:

  • Weighted round robin with healthcheck

7 participants