Traefik healthcheck should not pass by default for newly discovered backends #4544
Comments
Hi @andyspiers, thanks for your bug report. Marking Marathon tasks for which no health check result has been collected yet as unhealthy has been requested a few times in the past. The major problem is that certain scenarios ask for the inverted logic: specifically, when a new leader is elected, all health check results are reset and need to be recomputed first. If the default health state were unhealthy in such cases, all running tasks would immediately start to generate errors until health check states were recovered. In fact, that's the behavior Traefik exhibited some time ago; it was changed to address the associated uptime concerns. A section in the Marathon guide also touches on the subject. I should explain that I stopped working with Marathon/Mesos at least a year ago, so I'm not sure whether the situation has changed on Marathon's end (for instance, whether the cluster shares health check state across all masters). Also, I only have experience with Marathon-based health checks, not Mesos-based ones, which may or may not make a difference. What I do remember is that we frequently rolling-restarted our Mesos clusters for upgrades and other reasons and managed to maintain full service uptime (at least for those applications that implemented proper graceful termination handlers and readiness checks). I can't recall exactly how we did it anymore, but I believe we never touched agents directly and instead replaced nodes as a whole (immutable infrastructure). @gottwald, could you chime in if your memory is better than mine on this one? Either way, I'd be very open to discussing ideas you may have on how we could address your case without regressing on the scenarios described above. Thanks!
Hi Timo, thanks for your response and apologies for the long delay in replying. To clarify, I'm not talking about modifying the behaviour of the Marathon provider. We have read the Traefik docs and understand that on Marathon leader election the healthcheck status becomes unknown. We ran a test to confirm this is still the case, and it is: during a Marathon master re-election, healthCheckResults[] went empty (unknown).
This means they are still unusable for Traefik, as expected. The issue/bug report is about Traefik healthchecks, not Marathon healthchecks: is there a way that, when a new backend server is discovered and populated into the Traefik config AND it has a Traefik healthcheck configured, the initial state of the Traefik healthcheck is presumed unhealthy, so that the new server only receives requests after a successful Traefik healthcheck result? Hope this clarifies the issue. Thanks,
.... Possibly I shouldn't have mentioned Marathon :) - this may be a generalised issue for Traefik healthchecks with any type of provider. Thanks,
@andyspiers Have you tried setting the following label in Marathon: It may not work in the current latest version because of this report:
Oh darn, this one fell completely off my radar, so sorry! @andyspiers sorry for the confusion on my end; I believe you have a good point. Happy to see if adjustments on Traefik's end are reasonable to make if the matter is still important to you.
@a-nldisr thanks for the input - I'll bear that in mind, but that's not the issue in our particular case, as a healthcheck is already configured. @timoreimann - thanks for getting back to me, and no worries, I know how it is :) I've also been very busy with other stuff so I haven't chased this up, but it is still relevant to us, and I would still consider it a bug that a new backend (with a healthcheck defined) is populated and presumed healthy before the healthcheck has actually run for the first time. If you're able to look into adjusting Traefik's behaviour in this area, that would be awesome. Thanks again, Andy
@andyspiers based on our discussion and additional ones I've had with other maintainers, it seems reasonable to allow the default health check state to be unhealthy, as you requested. Popular management platforms (like Kubernetes and Docker) also default to that state. Additionally, there doesn't seem to be any harm in changing the default behavior (as opposed to making it configurable) except for a slight increase in delay. If that's really problematic for anybody, we can contemplate adding parameters to tune health check behavior (e.g., periodicity and initial delay) where we do not have them yet; this would be beneficial beyond the issue at hand. I can drive the necessary change myself when I have some free bandwidth. If anyone is happy to pick up the PR effort, however, I won't resist much. :) Thanks!
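The agreed-upon default can be sketched in Go (Traefik's implementation language). All names below (healthState, backend, inRotation) are illustrative stand-ins, not Traefik's actual internals: a newly discovered server starts in an "unknown" state and only enters rotation after its first passing check.

```go
package main

import "fmt"

// healthState tracks per-server health. A server that has never been
// probed is "unknown" and, under the default discussed above, is kept
// out of rotation until its first successful check.
type healthState int

const (
	unknown healthState = iota // never probed: treated as not eligible
	up
	down
)

// backend is a toy stand-in for a load-balancer backend.
type backend struct {
	states map[string]healthState
}

func newBackend() *backend {
	return &backend{states: make(map[string]healthState)}
}

// discover registers a newly found server without marking it healthy.
func (b *backend) discover(url string) {
	if _, ok := b.states[url]; !ok {
		b.states[url] = unknown
	}
}

// record stores the result of a health-check probe.
func (b *backend) record(url string, healthy bool) {
	if healthy {
		b.states[url] = up
	} else {
		b.states[url] = down
	}
}

// inRotation returns only servers whose most recent probe succeeded.
func (b *backend) inRotation() []string {
	var out []string
	for url, s := range b.states {
		if s == up {
			out = append(out, url)
		}
	}
	return out
}

func main() {
	b := newBackend()
	b.discover("http://10.0.0.1:31518")
	fmt.Println(len(b.inRotation())) // 0: no traffic before the first check
	b.record("http://10.0.0.1:31518", true)
	fmt.Println(len(b.inRotation())) // 1: joins rotation after a passing check
}
```

This matches the Kubernetes/Docker convention mentioned above: absence of a result is treated as "not yet healthy" rather than "healthy".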
Cool - thanks so much @timoreimann for digging into this :)
Is there any way to configure this now? Or do you have hints for me to submit a PR? 😅
AFAIK @tennox, nobody has started work on this.
I am testing a blue/green deployment scenario using the Docker backend, aiming for zero dropped requests, and am hitting this same issue. The new Docker container is immediately considered healthy by Traefik and is sent requests before the container is fully initialized and able to process them. The proposed fix (allowing the default health check state to be unhealthy) would solve this problem for me as well.
I won't have time to tackle this one in the foreseeable future myself -- whoever wants to drive this is more than welcome!
Hi @timoreimann! I'd like to contribute here. I need to fix this for 1.7 because I need support for DynamoDB and ECS; I might try to fix 2.2 later. Could you help me confirm some assumptions and share some ideas, please? :) If I understand correctly, every time there is a dynamic configuration update, the whole configuration is replaced with the information from ConfigMessage, so Traefik does not store information from the previous configuration - it just parses and replaces the old config. To implement "default false" behavior, I'm thinking of adding all the backend URLs to backend.disabledURLs by default at start time; the healthcheck should then move them to the Servers list. The tricky part is storing the healthy URLs so that we keep them in the Servers list on config reload. I don't see the right place to store this if we are replacing the whole config on each config update. Do you have any hint? I'm not sure what the right global place to store this info is.
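One way to keep verified-healthy servers in rotation across a full config replacement is to key the previous health results by URL and consult them while building the new backend's Servers/disabledURLs split. A minimal Go sketch, with hypothetical names (carryOver, prevHealthy) that are not Traefik internals:

```go
package main

import "fmt"

// carryOver splits a freshly parsed server list into enabled and disabled
// sets, reusing health results from the previous configuration so that a
// config reload does not reset servers already verified healthy.
// Illustrative sketch of the Servers vs. disabledURLs idea described above.
func carryOver(newURLs []string, prevHealthy map[string]bool) (servers, disabled []string) {
	for _, u := range newURLs {
		if prevHealthy[u] {
			servers = append(servers, u) // already verified: stay in rotation
		} else {
			disabled = append(disabled, u) // new or unverified: wait for first passing check
		}
	}
	return servers, disabled
}

func main() {
	prev := map[string]bool{"http://10.0.0.1:31001": true}
	servers, disabled := carryOver(
		[]string{"http://10.0.0.1:31001", "http://10.0.0.2:31002"}, prev)
	fmt.Println(servers)  // the known-healthy server survives the reload
	fmt.Println(disabled) // the newly discovered server starts disabled
}
```

The open question from the comment above still applies: the prevHealthy map has to live somewhere that survives the config swap, e.g. inside the long-lived health-check component rather than the per-reload configuration.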
This is also a problem with Azure Service Fabric. Service Fabric publishes container endpoints immediately as the container starts, and we have not found any way to prevent it from doing this. This leads to Traefik routing traffic to the endpoints until the Traefik healthcheck runs, recognizes they are down, and removes them. It would be great if Traefik would treat the endpoints as unhealthy until a health check has passed, or would at least provide an option in health checks to choose this behavior.
👋 Curious if anyone knows any workarounds to avoid brief moments of requests being routed to unhealthy servers before the health check runs?
Is there appetite to change the default behavior here from up to down? Looking through the load balancer code, I think it may be derived from this line: traefik/pkg/healthcheck/healthcheck.go, line 111 (commit 063f8fa).
Alternatively, this could be implemented closer to the Docker provider.
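Flipping that default could look roughly like the following, shown here as a standalone sketch rather than a patch to healthcheck.go (eligible and its parameters are hypothetical names, not Traefik's actual code):

```go
package main

import "fmt"

// eligible decides whether a server should receive traffic. checked
// reports whether at least one probe has completed; lastProbeOK is the
// result of the most recent probe. The proposed change is the !checked
// branch: unchecked servers default to down instead of up.
func eligible(checked, lastProbeOK bool) bool {
	if !checked {
		return false // proposed default: down until the first successful probe
	}
	return lastProbeOK
}

func main() {
	fmt.Println(eligible(false, false)) // false: newly discovered, never probed
	fmt.Println(eligible(true, true))   // true: probe passed
}
```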
By default health checks are up, which is problematic for new backends that may not yet be healthy. This commit updates the health check code to default to down. Fixes traefik#4544.
Any workaround?
Do you want to request a feature or report a bug?
Bug
What did you do?
Background: we deploy changes to our Marathon masters and Mesos agents by replacing the AWS instances with CloudFormation rolling updates. We co-locate Traefik on each of the Mesos agents. Our production environment is about 200 Marathon apps and 800 containers over 25 EC2 instances, serving generally around 1K to 4K req/s.
During a rolling update to production, we drop almost 1% of requests. Our aim is to drop no requests at all (though we might settle for 0.01%), so in a test environment we're working through various points of the lifecycle of Marathon tasks and Traefik configuration to analyse the reasons for dropped requests.
With the help of the great manual - https://docs.traefik.io/user-guide/marathon/ - we have made some progress:
However, when doing a rolling update of the Mesos agents, the task lifecycle is different.
Tasks are evacuated from each instance and recreated by Marathon on other agents without a deployment occurring ... which means that there are no readiness checks to prevent new tasks going live in Traefik before they are ready. As a result, we receive a lot of 502 errors when Traefik tries to send requests to the newly added backends.
To try to deal with this cause of dropped requests and 502 errors, we have enabled Traefik healthchecks on our test service. This gives a significant improvement, but we're still dropping some requests.
It seems that this is because when Traefik is triggered (by a Marathon event) to reconfigure itself and discovers a new backend, it initially puts it in active rotation ... in other words, the initial state of the Traefik healthcheck defaults to "passing".
I think this is arguably a bug: any newly discovered backend with a healthcheck should initially be presumed to be failing its healthcheck until the healthcheck actually passes.
Output of traefik version: (What version of Traefik are you using?)

What is your environment & configuration (arguments, toml, provider, platform, ...)?
Some uninteresting stuff like log headers removed.

If applicable, please paste the log output in DEBUG level (--logLevel=DEBUG switch)

Test scenario: mock-webapp with 3 Marathon tasks on 3 out of 5 different Mesos agents.

So you can see that for almost 700ms, between 12:32:53.319 and 12:32:54.000, the backend xxx.yyy.50.75:31518 is presumed to be healthy and is sent requests, until the first healthcheck request fails and it is marked unhealthy.
During this period, we unavoidably get 502 errors because Traefik is sending requests to a Marathon task which is not yet listening on the port.
At 12:33:09 you can see the task go healthy in Marathon.
At 12:33:14 this Traefik node sees the service as healthy and puts it back in rotation.
Here's a log of all the 502 errors received across all 5 Traefik + mesos-agent nodes:
You can see that the total period of errors is around 1145ms - presumably each individual node received the Marathon event at slightly different times and took a different amount of time to reconfigure and schedule its first healthcheck.
Before and after this period, we receive only 200 responses.
What could be done differently?
I think the ideal behaviour here would be for a new backend that has not yet been healthchecked to be marked unhealthy until the first successful healthcheck result is processed.
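The difference between the current and the proposed default during that ~700ms-per-node window can be illustrated with a tiny model (hypothetical names, not Traefik code): under default-up, requests reach a port with no listener; under default-down, they keep going to servers that have already passed a check.

```go
package main

import "fmt"

// routeDuringWindow models where a request lands in the window between a
// backend being discovered and its first health-check probe completing.
// defaultDown is the behavior proposed in this issue; backendListening
// says whether the new task has started accepting connections yet.
func routeDuringWindow(defaultDown, backendListening bool) string {
	if defaultDown {
		// The unchecked server is excluded, so the request goes to a
		// server that has already passed its healthcheck.
		return "existing healthy backend"
	}
	if !backendListening {
		return "502" // default-up: traffic reaches a port with no listener
	}
	return "new backend"
}

func main() {
	fmt.Println(routeDuringWindow(false, false)) // current default: 502
	fmt.Println(routeDuringWindow(true, false))  // proposed default: no error
}
```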
Thanks and sorry for the very long bug report!