
Haproxy: drain connections on target service upgrades #2777

Closed
alena1108 opened this issue Nov 23, 2015 · 41 comments
Labels: internal, kind/enhancement, kind/feature, version/1.6

@alena1108

alena1108 commented Nov 23, 2015

Today, when we reload the haproxy config, we only ensure that new connections never get dropped (https://github.com/rancher/cattle/blob/0c4066f9fd2652f99d29989c2a29065f0378c20e/resources/content/config-content/configscripts/common/scripts.sh#L176). But we do terminate all existing connections. We have to "drain" all existing connections first before reloading the haproxy config. Here are several ways of implementing it:

@ibuildthecloud ^^

TODO for @leodotcloud: #9561
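
For reference, the usual haproxy "soft reload" pattern (a minimal sketch of the generic command, not the exact script linked above; paths are the conventional defaults) lets the old process finish its existing sessions instead of being killed outright:

    # Start a new haproxy that takes over the listening sockets; -sf passes the
    # old process's PID and asks it to stop listening, finish its current
    # sessions, and then softly exit.
    haproxy -f /etc/haproxy/haproxy.cfg \
            -p /var/run/haproxy.pid \
            -sf $(cat /var/run/haproxy.pid)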

@alena1108 alena1108 added the kind/enhancement and kind/feature labels Nov 23, 2015
@alena1108 alena1108 self-assigned this Nov 23, 2015
@alena1108 alena1108 added this to the Release 1.0 milestone Nov 23, 2015
@Rucknar

Rucknar commented Nov 24, 2015

+1

@taketnoi

+1

@joshuakarjala

Yes, I have been debugging an issue related to this all day! :)

@joshuakarjala

@alena1108 is there any way to receive an event when HAProxy has reloaded and is ready for new connections (that won't be dropped)? Until this issue is resolved, it would at least be nice to know when I can safely make new requests.

@joshuakarjala

Alternative approach from Unbounce - http://inside.unbounce.com/product-dev/haproxy-reloads/

@CBR09

CBR09 commented Feb 24, 2016

+1
Any update on this feature? I think it's very important when using Rancher in a production environment.

@phucvinh52

+1

@deniseschannon deniseschannon removed this from the Release 1.0 milestone Mar 1, 2016
@deniseschannon

With GA coming up very soon (aiming for end of March), we are trying to fit in as much as possible, but with other customer priorities, we can only try our best for this feature.

@rogeralsing

This is pretty essential IMO. Is there any ETA on when it might be implemented?

@lxhunter

lxhunter commented Apr 9, 2016

+1

@olds463

olds463 commented Apr 22, 2016

+1 this is essential

@fewbits

fewbits commented May 13, 2016

+1 This is a blocker for a project we're working on, so we'll probably need to use Nginx instead of Rancher's native LB with HAProxy. But I'm all about using built-in solutions.

@alena1108, one question: while "true zero downtime" is not yet implemented, is it possible to use Nginx as a load balancer with Rancher's native Service Discovery? If the answer is no, I understand that we'll need to use something like Nginx + Consul + Consul Templates. Thanks.

@alena1108
Author

@fewbits we are planning to re-work our load balancer to support pluggable providers. It will have a controller->provider model where the controller reads info from Rancher metadata, generates the LB config and passes it to the provider to apply. It will require moving all haproxy-specific code from Rancher into its own microservice. So if you need to use another provider instead of haproxy, all you will have to do is write a provider implementation. I will update this ticket once the project skeleton is uploaded to GitHub.

@naartjie

After reading both the Yelp and Unbounce posts, nginx as a load balancer (with reload) looks like a good candidate.

It would be great to get it working with HAProxy, since it has been designed for this kind of scenario, but my vote would be to go the route that avoids a 1-second latency hit on those unfortunate requests, if there is an alternative.

@bradjones1

FYI the refactor mentioned above is #2179

@elan

elan commented Jul 18, 2016

+1, would love to see this; not sure we'll be able to use it in production without it, as we can't afford to have lots of HTTP requests fail during an upgrade 😢

@alena1108
Author

A correction to my initial comment:

Today when reload haproxy config, we only ensure that new connections will never get dropped (https://github.com/rancher/cattle/blob/0c4066f9fd2652f99d29989c2a29065f0378c20e/resources/content/config-content/configscripts/common/scripts.sh#L176). But we do terminate all existing connections

This is not quite correct. Per http://www.haproxy.org/download/1.2/doc/haproxy-en.txt:

haproxy -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)

The -sf option drains the existing connections: "Otherwise, it will either ask them to finish (-sf) their work then softly exit, or immediately terminate (-st), breaking existing sessions."

But there is a chance that new connections might get blocked while the haproxy config is being reloaded.

An existing connection can also be dropped when the backend server that the LB picked to forward the request to goes down in the middle of the request. So when a backend service is stopped via Rancher, and that service acts as a target service in one or more LBs, there is a chance that existing connections to it will be dropped. Ideally we should drain the connections to it from all the LBs, and only then execute the stop. We'll have to think about the best way of implementing it.
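
One possible building block for that drain step (a rough sketch only, not what Rancher does today; it assumes HAProxy 1.6+ with an admin-level stats socket exposed at /var/run/haproxy.sock, plus a hypothetical backend "be_web" with server "s1") is the runtime API's drain state:

    # Stop routing new requests to the server; in-flight sessions keep running.
    SOCK=/var/run/haproxy.sock
    echo "set server be_web/s1 state drain" | socat stdio "$SOCK"

    # Poll the current-session count (scur, 5th CSV field of "show stat") for
    # that server until it drops to 0; only then stop the target container.
    while true; do
      scur=$(echo "show stat" | socat stdio "$SOCK" \
             | awk -F, '$1=="be_web" && $2=="s1" {print $5}')
      [ "${scur:-0}" -eq 0 ] && break
      sleep 1
    done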

@janeczku
Contributor

janeczku commented Jul 8, 2017

@miguelpeixe does this describe the issue you are seeing when upgrading load-balanced services with "start before stopping" enabled? #9287

@vkruoso

vkruoso commented Jul 28, 2017

One would argue that this issue is not a feature/enhancement, as the service upgrade process works correctly in old Rancher + Cattle setups. After creating stacks with the most recent versions, I've started seeing this behavior: bad responses during a "zero downtime" service upgrade with "start before stopping" checked.

@miguelpeixe

@janeczku yes, but I've seen your issue reported before; the Rancher team keeps closing it and pointing to the haproxy connection drain issue as related. I think that doesn't make much sense, even though I don't really understand this connection drain problem, so I still might be wrong about this...

@fondofdigital

+1 (I'm wondering because in earlier versions there wasn't such a long downtime with 503s from the LB)

Using

  • rancher/lb-service-haproxy:v0.7.1
  • rancher 1.6

@Mrono

Mrono commented Sep 1, 2017

+1 I was relying on this being a working feature for our no-downtime deployments

@kwaio

kwaio commented Sep 25, 2017

I see this issue is old and often rescheduled/paused/resumed, but it is quite an important one.
Could we have a reliable ETA on a solution?
Thanks for your work anyway!

@stavarengo

Hi @deniseschannon.
I saw that you added the "status/resolved" label.
What is the solution? What do we have to do in order to use it?

@cjellick

cjellick commented Oct 9, 2017

@stavarengo resolved doesn't mean it's released yet. It is in testing now. When it is released, it will be in the release notes for Rancher with sufficient instructions on how to use it.

@stavarengo

Thanks for the clarification @cjellick
Waiting anxiously for this release 😃 😃

@prachidamle
Member

@miguelpeixe @janeczku I think the issue reported here, #9287, should be resolved now that the following fix, #8684, has been released. Can you please check if it is?

@sangeethah
Contributor

sangeethah commented Nov 2, 2017

This feature is available in v1.6.11-rc6.
We are now able to set a drain timeout on services during service creation and upgrade.
This option is available in the UI under the "Command" tab.

When the Drain Timeout parameter is set on a service that is a backend for LB services, and the backend target server that a load balancer picked to forward a request to goes down in the middle of that request being served (due to the service being upgraded), the service will be put into a drain state so that it can finish the request currently in progress before it is stopped.

A basic use case that would have resulted in the user getting HTTP Bad Gateway (502) without the drain feature now returns 200 OK with a correct drainTimeoutMs set (a simple way to observe the responses is sketched after the steps below):

1. Create a service S1 with scale 1 and drainTimeoutMs set to 10000 ms.
2. Have the service targets respond to requests with a delay of, say, 10 seconds.
3. Create an LB service with target S1.
4. Initiate a connection to the LB service.
5. While the connection is still in progress, upgrade service S1.
6. You will notice that the service instance stays in the "Stopping" state until the connection to the LB IP address succeeds, after which the instance goes to the "Stopped" state and the service upgrade proceeds.
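
A simple way to observe this from the client side (a sketch; the LB IP address below is a placeholder for your environment) is to keep polling the LB while the upgrade runs and log the HTTP status codes:

    # Request the LB once per second during the upgrade of S1 and print a
    # timestamp plus the HTTP status code. Replace LB_IP with the real LB IP.
    LB_IP=10.42.0.10
    while true; do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "http://${LB_IP}/")
      echo "$(date +%T) ${code}"
      sleep 1
    done

With drainTimeoutMs set correctly, every request should return 200; without it, 502s show up around the moment the old container stops.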

Full documentation for this feature - rancher/rancher.github.io#920

Some of the bugs that were found and validated during the development of this feature:
#10004
#10005
#10006
#10011
#10012
#10013
#10061
#10065
#10068
#10069
#10087
#10090

@robikovacs

@sangeethah is this feature also working with Kubernetes Ingress? I keep getting 504 Gateway Time-out when updating a deployment image. Any thoughts?

@prachidamle
Member

@robikovacs no, this feature is not supported for Kubernetes Ingress.
