
Gateway Timeout when rolling update a scaled service in Docker swarm mode #1480

Closed
hostingnuggets opened this issue Apr 21, 2017 · 21 comments

@hostingnuggets

What version of Traefik are you using (traefik version)?

v1.2.3

What is your environment & configuration (arguments, toml...)?

docker service create \
    --name traefik \
    --constraint=node.role==manager \
    --publish 80:80 --publish 8080:8080 \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
    --network traefik-net \
    traefik \
    --docker \
    --docker.swarmmode \
    --docker.domain=traefik \
    --docker.watch \
    --web

What did you do?

I am following the swarm mode user guide (https://docs.traefik.io/user-guide/swarm-mode/) directly on my Docker manager node. I have set up the whoami0 service and scaled it up to 2 tasks in order to have redundancy. I then wanted to test a rolling update of the service using docker service update --force --update-delay=10s whoami0 and noticed that during the rolling update Traefik returns a Gateway Timeout twice. As far as I understand rolling updates in swarm mode, there should be no downtime, because only one container/task gets stopped at a time, so there is always one container running while the other one gets restarted.
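For reference, the whoami0 service was created and scaled roughly as in the guide (the emilevauge/whoami image and the traefik.port label below are assumed from that guide, not copied verbatim from my setup):

docker service create \
    --name whoami0 \
    --label traefik.port=80 \
    --network traefik-net \
    emilevauge/whoami

docker service scale whoami0=2

# the rolling update that produced the Gateway Timeouts
docker service update --force --update-delay=10s whoami0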

What did you expect to see?

Zero downtime

What did you see instead?

Gateway Timeout

If applicable, please paste the log output in debug mode (--debug switch)

time="2017-04-21T21:47:22Z" level=warning msg="Error forwarding to http://10.0.0.5:80, err: dial tcp 10.0.0.5:80: i/o timeout" 
@timoreimann
Contributor

With other providers (such as Marathon and Kubernetes), rolling upgrades do not release you from the extra effort needed to achieve lossless handovers. I'm not familiar with Docker Swarm -- does it support something like request draining out of the box to get this extra level of convenience?

@hostingnuggets
Author

@timoreimann I hope you are not asking me this question ;-) because I have no clue what request draining is. How can I help in order to find this out?

@timoreimann
Contributor

@hostingnuggets I was asking generally -- you or anyone else you may know. :-) Maybe @vdemeester?

Request draining means that a proxy stops sending requests to a particular backend once it knows that the backend is going to be stopped/replaced soon. It's a way to achieve a graceful handover without involvement from the backend application.

@pascalandy
Contributor

pascalandy commented Apr 24, 2017

The assumption is wrong.

Docker Swarm does NOT support real zero-downtime deployments with rolling upgrades at the moment. moby/moby#30321

IMHO, it's not related to the Traefik project.

@timoreimann
Contributor

Thanks for chiming in and sharing the moby/docker link, @pascalandy.

So, as I understand it, the situation in Docker Swarm is similar to the other providers I'm familiar with: you'll need to teach your apps to handle shutdowns gracefully if you don't want to lose or time out on requests.

Note there's already a long-standing feature request for draining backends: #41.

@hostingnuggets I'm going to close the issue as it doesn't seem like there's anything in particular this issue can or should track. Feel free to post again if you think otherwise.

@hostingnuggets
Author

@timoreimann thanks for the research. I hope a workaround or solution for this issue can be found one day, whether on the Traefik side or the Docker Swarm side.

@timoreimann
Contributor

@hostingnuggets I forgot to mention that a workaround is already available: you can enable retries so that requests against failing/terminating backends are retried against other backends. It's a probabilistic rather than a deterministic approach, but if you set the number of retries to a reasonably high value and don't mind the occasional increase in latency, you should be okay.
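Concretely, with the flag-based setup from the issue description, this amounts to adding the retry options to the Traefik arguments. A minimal sketch, assuming the --retry / --retry.attempts flags that correspond to the [retry] section of the 1.x static configuration (the attempts value is just an example):

# same Traefik service as in the issue description, plus retries enabled
docker service create \
    --name traefik \
    --constraint=node.role==manager \
    --publish 80:80 --publish 8080:8080 \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
    --network traefik-net \
    traefik \
    --docker \
    --docker.swarmmode \
    --docker.domain=traefik \
    --docker.watch \
    --web \
    --retry \
    --retry.attempts=10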

@hostingnuggets
Author

@timoreimann thanks for the suggestion. I have now set retries to 30, and indeed, while updating my service with a rolling update I did not get any timeouts from Traefik, just, as you mentioned, some added latency/delay. I would have expected this delay to be around 10 seconds, but it is around 35 seconds, which is IMHO quite long because the containers restart really fast. Maybe I need to tune a timeout parameter somewhere?

@mvdstam
Contributor

mvdstam commented Apr 27, 2017

@hostingnuggets This is probably still a problem in Docker. As demonstrated here, major packet loss can still occur when updating services. All Traefik does is route traffic through the VIP that is provided by the Docker engine.

@hostingnuggets
Author

@mvdstam That's exactly the behavior I experience in my tests. The worst part is that you get these timeouts for each service task (= replica) that needs to be restarted. So in my case, with 2 replicas, I get two time windows where the whole service does not answer HTTP requests, even though the other container is running and ready to serve requests. I hope the Docker folks can find a solution or workaround for this.

@ldez modified the milestone: 1.3 on Apr 29, 2017
@Yshayy
Contributor

Yshayy commented May 5, 2017

@mvdstam As far as I know, from version 1.2+ Traefik by default does not use the Swarm service VIP and instead uses the IPs of the individual tasks/containers (that way it can provide Traefik features like stickiness).

Having encountered a similar issue in the past, I noticed that using the Swarm service VIP load balancer (traefik.backend.loadbalancer.swarm=true) produces better results regarding availability during updates.

If I remember correctly, combining it with image/service health checks actually achieved zero-downtime updates, but the load balancing itself didn't work so well (requests were not distributed equally between containers).
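For anyone who wants to try this, the label goes on the application service, not on Traefik. A minimal sketch (service name, image and port are placeholders):

# route through the Swarm service VIP instead of the individual task IPs
docker service create \
    --name whoami0 \
    --network traefik-net \
    --label traefik.port=80 \
    --label traefik.backend.loadbalancer.swarm=true \
    emilevauge/whoami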

In order to achieve zero-downtime availability, we are currently using two services and changing the priority label between them to do rolling updates (blue-green-style deployments), but the retry approach sounds interesting as well.
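A rough sketch of that blue-green label flip (service/image names and the Host rule are placeholders; this assumes Traefik routes to the frontend with the higher traefik.frontend.priority when two services share the same rule):

# "blue" is live and currently wins the frontend rule
docker service create \
    --name app-blue \
    --network traefik-net \
    --label traefik.port=80 \
    --label traefik.frontend.rule=Host:app.example.com \
    --label traefik.frontend.priority=20 \
    myorg/app:v1

# deploy "green" with the new version at a lower priority and verify it ...
docker service create \
    --name app-green \
    --network traefik-net \
    --label traefik.port=80 \
    --label traefik.frontend.rule=Host:app.example.com \
    --label traefik.frontend.priority=10 \
    myorg/app:v2

# ... then flip the priority so "green" takes the traffic, and remove "blue"
docker service update --label-add traefik.frontend.priority=30 app-green
docker service rm app-blue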

@hostingnuggets I think the reason for the long delays lies in the implementation of Traefik's Docker provider: the backend endpoints are based on the Swarm task list (which also contains each task's status/desired status), and Traefik polls this list quite slowly:
https://github.com/containous/traefik/blob/v1.3/provider/docker/docker.go
SwarmDefaultWatchTime = 15 * time.Second
This means it can take up to 15 seconds to get the current list of tasks and their statuses; for example, it might take several seconds to notice that a task is shutting down, that a new task was created, or that it became healthy.

When using traefik.backend.loadbalancer.swarm=true, it can work instantly because the backend endpoint does not change at all (it's the service VIP) and Swarm itself is responsible for managing the backends.

@hostingnuggets
Author

@Yshayy Thanks for your detailed input, very interesting! I can confirm that running a web app container in swarm mode with two replicas and the label traefik.backend.loadbalancer.swarm=true does not generate any noticeable interruptions/timeouts.

The downside, as you say, is that the load balancing among the replicas/tasks does not work at all. For instance, I am running a GET every second against a web app that displays its hostname, and after 30 minutes it is still querying the same container. Any ideas how to fix that behaviour? I do not care about session stickiness...

@Yshayy
Contributor

Yshayy commented May 5, 2017

@hostingnuggets Unfortunately, we didn't find a suitable solution for the load-balancing behavior (we didn't care about stickiness either), but the load-balancing issue was critical for us. We noticed that when load-testing without HTTP keep-alive in the client, the load balancing worked better (we load-tested with JMeter).
I think we also tested the Swarm VIP endpoint directly (without Traefik) and that worked OK, but it was some time ago, so I'm not sure.
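A quick way to repeat that direct-VIP test is to hit the service VIP from inside the overlay network with one connection per request, so the Swarm VIP (IPVS balances per connection) can pick a different task each time. A sketch, assuming the whoami0 service from the guide and that traefik-net was created as an attachable overlay network (otherwise run the loop from an existing container on that network):

# each wget opens a fresh TCP connection to the service VIP, bypassing Traefik
docker run --rm --network traefik-net alpine \
    sh -c 'for i in $(seq 1 10); do wget -qO- http://whoami0/ | grep Hostname; done'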

In the end, we decided not to use the Swarm internal load balancer and instead run two services and switch the priority label between them; although this solution is pretty stable, it has the disadvantage of maintaining an additional service.

Still, after seeing the retry trick, I think retrying failed requests might be the most convenient solution if it's possible to decrease the swarm polling interval to 1-2 seconds.

@hostingnuggets
Author

@Yshayy any idea whether someone from the Docker team reads these issues and could take a look into why the VIP does not load-balance nicely when used by Traefik?

Regarding changing the Swarm polling interval from 15 to 1-2 seconds: I can't find the parameter in the traefik.toml file for the Docker backend. It looks like most of the backends have a RefreshSeconds parameter but not the Docker backend, or it is not mentioned in the documentation (http://docs.traefik.io/toml/#docker-backend).

@timoreimann
Contributor

@hostingnuggets looking at the code, the interval seems to be static. It should be fairly easy though to make it configurable.

@vdemeester is there anything that speaks against exposing the parameter to the configuration space?
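For reference, a configurable interval could look something like this. This is hypothetical on 1.2/1.3, where SwarmDefaultWatchTime is hard-coded; later 1.x releases exposed a swarmModeRefreshSeconds option on the Docker provider, and the flag below assumes the CLI name mirrors that option:

# hypothetical on 1.2/1.3; flag name assumed to mirror the later
# swarmModeRefreshSeconds Docker provider option
docker service update \
    --args "--docker --docker.swarmmode --docker.domain=traefik --docker.watch --web --docker.swarmModeRefreshSeconds=2" \
    traefik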

@Yshayy
Contributor

Yshayy commented May 6, 2017

@hostingnuggets I'm less familiar with the work being done in docker/moby/swarmkit, and I haven't seen any issue regarding this behavior. Frankly, I'm not sure whether it's an issue with Swarm, Docker's overlay network driver, Traefik, or something else entirely.

@timoreimann, I think it could be useful, although I have no idea what the recommended threshold should be or how long a refresh request actually takes; maybe setting it too low could cause problems?
There's also this todo in the code:

// TODO: This need to be change. Linked to Swarm events docker/docker#23827

An event-based approach certainly sounds more efficient and more reliable.
It seems there's some progress on exposing service events in the moby/swarmkit projects:
moby/swarmkit#2034
moby/moby#32421

@hostingnuggets
Author

Thanks to both of you for your input and comments. As an idea, it might be interesting for you to have a look at how the Docker Flow Proxy (http://proxy.dockerflow.com/) manages zero downtime when doing rolling updates of Docker Swarm services. AFAIK they use HAProxy inside their container, so there might be a trick there, or further hints for Traefik on how to also achieve zero downtime (no matter on which side the issue really lies).

@pascalandy
Contributor

pascalandy commented May 12, 2017

EDIT: It does not resolve this issue. It looks fancy, though.

$ docker service update -d=false --image abiosoft/caddy:0.10.2 caddy_webapp-caddy-a
caddy_webapp-caddy-a
overall progress: 1 out of 1 tasks
1/1: running   [==================================================>]
verify: Waiting 1 seconds to verify that tasks are stable...

It still takes about 30 seconds to get our service back online.

+++

There is a new flag, -d=false. We can now do a synchronous service scale with the new synchronous service create/update feature.

BEFORE:
docker service scale http_http=5

NOW:
docker service update -d=false --replicas 10 http_http

Cheers!
Pascal

@pascalandy
Contributor

Follow this issue as well - moby/moby#30321 (comment)

@Yshayy
Contributor

Yshayy commented May 29, 2017

@pascalandy I think lowering the polling interval in Traefik, or making it customizable (instead of a fixed 15 seconds), could make some difference.
In the near future, the Docker events API will probably provide an opportunity for a better solution.

@pascalandy
Contributor

I'll leave this to the pros :-)
