
Haproxy: drain connections on target service upgrades #2777

Closed
alena1108 opened this issue Nov 23, 2015 · 41 comments
Labels: internal, kind/enhancement, kind/feature, version/1.6

@alena1108

alena1108 commented Nov 23, 2015

Today, when we reload the haproxy config, we only ensure that new connections never get dropped (https://github.com/rancher/cattle/blob/0c4066f9fd2652f99d29989c2a29065f0378c20e/resources/content/config-content/configscripts/common/scripts.sh#L176). But we do terminate all existing connections. We have to "drain" all existing connections first before reloading the haproxy config. Here are several ways of implementing it:

@ibuildthecloud ^^

TODO for @leodotcloud: #9561
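
For reference, the usual haproxy "soft reload" pattern (a minimal sketch of the generic command, not the exact script linked above; paths are the conventional defaults) lets the old process finish its existing sessions instead of being killed outright:

    # Start a new haproxy that takes over the listening sockets; -sf passes the
    # old process's PID and asks it to stop listening, finish its current
    # sessions, and then softly exit.
    haproxy -f /etc/haproxy/haproxy.cfg \
            -p /var/run/haproxy.pid \
            -sf $(cat /var/run/haproxy.pid)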

@alena1108 alena1108 added the kind/enhancement and kind/feature labels Nov 23, 2015
@alena1108 alena1108 self-assigned this Nov 23, 2015
@alena1108 alena1108 added this to the Release 1.0 milestone Nov 23, 2015
@Rucknar

Rucknar commented Nov 24, 2015

+1

@taketnoi

+1

@joshuakarjala

Yes, I have been debugging an issue related to this all day! :)

@joshuakarjala

@alena1108 is there any way to receive an event when HAProxy has reloaded and is ready for new connections (that won't be dropped)? Until this issue is resolved, it would at least be nice to know when I can safely make new requests.

@joshuakarjala

Alternative approach from Unbounce - http://inside.unbounce.com/product-dev/haproxy-reloads/

@CBR09

CBR09 commented Feb 24, 2016

+1
Any update on this feature? I think it's very important when using Rancher in a production environment.

@phucvinh52

+1

@deniseschannon deniseschannon removed this from the Release 1.0 milestone Mar 1, 2016
@deniseschannon

With GA coming up very soon (aiming for end of March), we are trying to fit in as much as possible, but with other customer priorities, we can only try our best for this feature.

@rogeralsing

This is pretty essential IMO. Is there any ETA on when it might be implemented?

@lxhunter

lxhunter commented Apr 9, 2016

+1

@olds463

olds463 commented Apr 22, 2016

+1 this is essential

@fewbits

fewbits commented May 13, 2016

+1 This is a blocker for a project we're working on, so we'll probably need to use Nginx instead of Rancher's native LB with HAProxy. But I'm all about using built-in solutions.

@alena1108, one question: while "true zero downtime" is not yet implemented, is it possible to use Nginx as a load balancer with Rancher's native Service Discovery? If the answer is no, I understand that we'll need to use something like Nginx + Consul + Consul Templates. Thanks.

@alena1108
Author

@fewbits we are planning to re-work our load balancer to support pluggable providers. It will have a controller->provider model where the controller reads info from Rancher metadata, generates the LB config and passes it to the provider to apply. It will require moving all haproxy-specific code from Rancher into its own microservice. So if you need to use another provider instead of haproxy, all you will have to do is write a provider implementation. I will update this ticket once the project skeleton is uploaded to GitHub.

@naartjie

After reading both the Yelp and Unbounce posts, nginx as a load balancer (with reload) looks like a good candidate.

It would be great to get it working with HAProxy, since it has been designed for this kind of scenario, but my vote would be to go the route that avoids a 1-second latency hit on those unfortunate requests, if there is an alternative.

@bradjones1

FYI the refactor mentioned above is #2179

@elan

elan commented Jul 18, 2016

+1, would love to see this; not sure we'll be able to use it in production without it, as we can't afford to have lots of HTTP requests fail during an upgrade 😢

@alena1108
Author

A correction to my initial comment:

Today when reload haproxy config, we only ensure that new connections will never get dropped (https://github.com/rancher/cattle/blob/0c4066f9fd2652f99d29989c2a29065f0378c20e/resources/content/config-content/configscripts/common/scripts.sh#L176). But we do terminate all existing connections

This is not quite correct. Per http://www.haproxy.org/download/1.2/doc/haproxy-en.txt:

haproxy -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)

The -sf option drains the existing connections: "Otherwise, it will either ask them to finish (-sf) their work then softly exit, or immediately terminate (-st), breaking existing sessions."

But there is a chance that new connections might get blocked while the haproxy config is being reloaded.

An existing connection can also be dropped when the backend server that the LB picked to forward the request to goes down in the middle of the request. So when a backend service is stopped via Rancher, and that service acts as a target service in one or more LBs, there is a chance that existing connections to it will be dropped. Ideally we should drain the connections to it from all the LBs, and only then execute the stop. We'll have to think about the best way of implementing it.
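
One possible building block for that drain step (a rough sketch only, not what Rancher does today; it assumes HAProxy 1.6+ with an admin-level stats socket exposed at /var/run/haproxy.sock, plus a hypothetical backend "be_web" with server "s1") is the runtime API's drain state:

    # Stop routing new requests to the server; in-flight sessions keep running.
    SOCK=/var/run/haproxy.sock
    echo "set server be_web/s1 state drain" | socat stdio "$SOCK"

    # Poll the current-session count (scur, 5th CSV field of "show stat") for
    # that server until it drops to 0; only then stop the target container.
    while true; do
      scur=$(echo "show stat" | socat stdio "$SOCK" \
             | awk -F, '$1=="be_web" && $2=="s1" {print $5}')
      [ "${scur:-0}" -eq 0 ] && break
      sleep 1
    done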

@janeczku
Contributor

janeczku commented Jul 8, 2017

@miguelpeixe does this describe the issue you are seeing when upgrading load-balanced services with "start before stopping" enabled? #9287

@vkruoso

vkruoso commented Jul 28, 2017

One would argue that this issue is not a feature/enhancement, as the service upgrade process works correctly in old Rancher + Cattle setups. After creating stacks with the most recent versions, I've started seeing this behavior: bad responses during a "zero downtime" service upgrade with "start before stopping" checked.

@miguelpeixe

@janeczku yes, but I've seen your issue reported before; the Rancher team keeps closing it and pointing to the haproxy connection drain issue as related. I think that doesn't make much sense, even though I don't really understand this connection drain problem, so I still might be wrong about this...

@fondofdigital

+1 (I'm wondering because in earlier versions there wasn't such a long downtime with 503s from the LB)

Using

  • rancher/lb-service-haproxy:v0.7.1
  • rancher 1.6

@Mrono

Mrono commented Sep 1, 2017

+1 I was relying on this being a working feature for our no-downtime deployments

@kwaio

kwaio commented Sep 25, 2017

I see this issue is old and often rescheduled/paused/resumed, but it is quite an important one.
Could we have a reliable ETA on a solution?
Thanks for your work anyway!

@stavarengo

Hi @deniseschannon.
I saw that you added the "status/resolved" label.
What is the solution? What do we have to do in order to use it?

@cjellick

cjellick commented Oct 9, 2017

@stavarengo resolved doesn't mean it's released yet. It is in testing now. When it is released, it will be in the release notes for Rancher with sufficient instructions on how to use it.

@stavarengo

Thanks for the clarification @cjellick
Waiting anxiously for this release 😃 😃

@prachidamle
Member

@miguelpeixe @janeczku I think the issue reported here, #9287, should be resolved now that the following fix, #8684, has been released. Can you please check if it is?

@sangeethah
Contributor

sangeethah commented Nov 2, 2017

This feature is available in v1.6.11-rc6.
We are now able to set a drain timeout on services during service creation and upgrade.
This option is available in the UI under the "Command" tab.

When the Drain Timeout parameter is set on a service that is a backend for LB services, and the backend target server that a load balancer picked to forward a request to goes down in the middle of that request being served (due to the service being upgraded), the service will be put into a drain state so that it can finish the request currently in progress before it is stopped.

A basic use case that would have resulted in the user getting HTTP Bad Gateway (502) without the drain feature now returns 200 OK with a correct drainTimeoutMs set (a simple way to observe the responses is sketched after the steps below):

1. Create a service S1 with scale 1 and drainTimeoutMs set to 10000 ms.
2. Have the service targets respond to requests with a delay of, say, 10 seconds.
3. Create an LB service with target S1.
4. Initiate a connection to the LB service.
5. While the connection is still in progress, upgrade service S1.
6. You will notice that the service instance stays in the "Stopping" state until the connection to the LB IP address succeeds, after which the instance goes to the "Stopped" state and the service upgrade proceeds.
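
A simple way to observe this from the client side (a sketch; the LB IP address below is a placeholder for your environment) is to keep polling the LB while the upgrade runs and log the HTTP status codes:

    # Request the LB once per second during the upgrade of S1 and print a
    # timestamp plus the HTTP status code. Replace LB_IP with the real LB IP.
    LB_IP=10.42.0.10
    while true; do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "http://${LB_IP}/")
      echo "$(date +%T) ${code}"
      sleep 1
    done

With drainTimeoutMs set correctly, every request should return 200; without it, 502s show up around the moment the old container stops.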

Full documentation for this feature - rancher/rancher.github.io#920

Some of the bugs that were found and validated during the development of this feature:
#10004
#10005
#10006
#10011
#10012
#10013
#10061
#10065
#10068
#10069
#10087
#10090

@robikovacs

@sangeethah is this feature also working with Kubernetes Ingress? I keep getting 504 Gateway Time-out when updating a deployment image. Any thoughts?

@prachidamle
Member

@robikovacs no, this feature is not supported for Kubernetes Ingress.
